Multiple Regression Analysis Example with Conceptual Framework

Multiple regression analysis is a fairly common tool in statistics, yet many graduate students find it too complicated to understand. It is not that difficult, especially now that computers are everyday household items. With multiple regression analysis, you can quickly examine the relationship between more than two variables in your research.

How is multiple regression analysis done? This article explains this handy statistical test for dealing with many variables, then walks through an example study to show how research using multiple regression analysis is conducted.

Multiple regression is often confused with multivariate regression. Multivariate regression, while also using several variables, deals with more than one dependent variable. Karen Grace-Martin clearly explains the distinction in her post on the difference between the multiple regression model and the multivariate regression model.


Statistical Software Applications Used in Computing Multiple Regression Analysis

Multiple regression analysis is a powerful statistical test used to find the relationship between a given dependent variable and a set of independent variables.

Using multiple regression analysis requires dedicated statistical software such as the popular Statistical Package for the Social Sciences (SPSS), Statistica, Microstat, or open-source applications like SOFA Statistics and JASP, among other sophisticated statistical packages.

Two decades ago, it would have been nearly impossible to do the calculations using a simple calculator, a device that smartphones have since made obsolete.

However, a standard spreadsheet application like Microsoft Excel can help you compute and model the relationship between the dependent variable and a set of predictor or independent variables. But you must first activate the statistical add-in (the Analysis ToolPak) that ships with MS Excel.

Activating the Add-In in MS Excel

To activate the add-in for multiple regression analysis in MS Excel, you may view the two-minute YouTube tutorial below. If you already have it installed on your computer, you may proceed to the next section.

Multiple Regression Analysis Example

I will illustrate the use of multiple regression analysis by citing an actual research activity that my graduate students undertook two years ago.

The study sought to identify the factors predicting a current problem among high school students: the long hours they spend online for a variety of reasons. The purpose is to address many parents’ concern about the difficulty of weaning their children away from the lures of online gaming, social networking, and other engaging virtual activities.

Review of Literature on Internet Use and Its Effect on Children

Upon reviewing the literature, the graduate students discovered that very few studies had been conducted on the subject. Studies on problems associated with internet use are still in their infancy, as the Internet has only recently begun to influence everyone’s life.

Hence, with my guidance, the group of six graduate students comprising school administrators, heads of elementary and high schools, and faculty members proceeded with the study.

Given the need for a computer to analyze data on multiple variables, a principal who was nearing retirement was “forced” to buy a laptop, as she had none. She was nevertheless very open-minded and performed the class activities requiring data analysis with much enthusiasm.

The Research on High School Students’ Use of the Internet

This brief research using multiple regression analysis is a broad study of the reasons or underlying factors that significantly relate to the number of hours high school students devote to the Internet. The analysis is broad because it focuses only on the total number of hours spent online, without distinguishing between activities.

They correlated the time high school students spent online with their profile. The students’ profile comprised more than two independent variables, hence the term “multiple.” The independent variables are age, gender, relationship with the mother, and relationship with the father.

The statement of the problem in this study is:

“Is there a significant relationship between the total number of hours spent online and the students’ age, gender, relationship with their mother, and relationship with their father?”
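In equation form, the model being estimated can be sketched as follows (a reconstruction of the study’s model, not the students’ published notation; the β’s are the coefficients the regression will estimate):

\[
\text{hours online} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{gender} + \beta_3\,\text{relationship with mother} + \beta_4\,\text{relationship with father} + \varepsilon
\]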

The students’ relationship with each parent was gauged on a scale of 1 to 10, with 1 being a poor relationship and 10 being the best relationship with the parent. The figure below shows the paradigm of the study.

[Figure: paradigm of the study, showing the independent variables (age, gender, relationship with mother, relationship with father) and the dependent variable (total hours spent online)]

Notice that in multiple regression studies such as this, there is only one dependent variable involved: the total number of hours high school students spend online.

Although many studies have identified factors that influence the use of the internet, it is standard practice to include the respondents’ profile among the set of predictor or independent variables. Hence, the standard variables age and gender are included in the multiple regression analysis.

Also, among the set of variables that may influence internet use, only the children’s relationship with their parents was tested. The intention was to determine whether the quality time parents spend establishing strong emotional bonds with their children relates to how long the children spend online.


Findings of the Research Using Multiple Regression Analysis

What are the findings of this exploratory study? This quickly executed example of research using multiple regression analysis revealed an interesting finding.

The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely or negatively correlated.

The relationship means that the more hours a mother spends with her child to establish a closer emotional bond, the fewer hours her child spends using the internet.


While this may be a significant finding, the mother-child bond accounts for only a small percentage of the variance in the total hours a child spends online. This observation means that other factors need to be addressed to resolve children’s long online hours and neglect of their studies.

But establishing a close bond between mother and child is a good start. Undertaking more investigations along this research concern will help strengthen the findings of this study.

The above example of research using multiple regression analysis shows that the statistical tool is useful in predicting the behavior of a dependent variable, in this case, the number of hours students spend online.

The identification of significant predictors can help determine the correct intervention to resolve the problem. Using multiple regression approaches prevents unnecessary costs for remedies that do not address an issue or a question.

Thus, this example of a research using multiple regression analysis streamlines solutions and focuses on those influential factors that must be given attention.

Once you become an expert in using multiple regression in analyzing data, you can try your hands on multivariate regression where you will deal with more than one dependent variable.

© 11 November 2012 Patrick Regoniel. Updated: 14 November 2020


Multiple linear regression


Fig. 11 Multiple linear regression

Model: \(Y_i = \beta_0 + \beta_1 x_{i,1} + \dots + \beta_p x_{i,p} + \varepsilon_i\)

Errors: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

Fit: the estimates \(\hat\beta_0,\dots,\hat\beta_p\) are chosen to minimize the residual sum of squares (RSS):

Matrix notation: with \(\beta=(\beta_0,\dots,\beta_p)\) and \(X\) our usual data matrix with an extra column of ones on the left to account for the intercept, we can write the model as \(Y = X\beta + \varepsilon\).

Multiple linear regression answers several questions

Is at least one of the variables \(X_i\) useful for predicting the outcome \(Y\)?

Which subset of the predictors is most important?

How good is a linear model for these data?

Given a set of predictor values, what is a likely value for \(Y\), and how accurate is this prediction?

The estimates \(\hat\beta\)

Our goal again is to minimize the RSS:
\[
\begin{aligned}
\text{RSS}(\beta) &= \sum_{i=1}^n (y_i -\hat y_i(\beta))^2 \\
&= \sum_{i=1}^n (y_i - \beta_0- \beta_1 x_{i,1}-\dots-\beta_p x_{i,p})^2 \\
&= \|Y-X\beta\|^2_2
\end{aligned}
\]

One can show that this is minimized by the vector \(\hat\beta\):
\[
\hat\beta = (X^TX)^{-1}X^Ty.
\]

We usually write \(RSS=RSS(\hat{\beta})\) for the minimized RSS.
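As a quick illustration, here is a minimal R sketch (on simulated data) that forms \(\hat\beta\) directly from the normal equations; in practice, R's lm() computes the same fit more stably via a QR decomposition.

```r
# Minimal sketch: hand-rolled least squares on simulated data.
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))          # data matrix with intercept column
beta_true <- c(2, 1, -0.5)
y <- X %*% beta_true + rnorm(n)

beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # (X^T X)^{-1} X^T y
beta_hat                                   # close to beta_true
```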

Which variables are important?

Consider the hypothesis: \(H_0:\) the last \(q\) predictors have no relation with \(Y\).

Based on our model: \(H_0:\beta_{p-q+1}=\beta_{p-q+2}=\dots=\beta_p=0.\)

Let \(\text{RSS}_0\) be the minimized residual sum of squares for the model which excludes these variables.

The \(F\)-statistic is defined by:
\[
F = \frac{(\text{RSS}_0-\text{RSS})/q}{\text{RSS}/(n-p-1)}.
\]

Under the null hypothesis (of our model), this has an \(F\) -distribution.

Example: If \(q=p\), we test whether any of the variables is important. In that case,
\[
\text{RSS}_0 = \sum_{i=1}^n(y_i-\overline y)^2.
\]
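In R, this comparison is exactly what anova() performs when given a reduced and a full model; a minimal sketch, with df and the predictor names as placeholders:

```r
# F-test of H0: the last q = 2 predictors have no relation with y.
full    <- lm(y ~ x1 + x2 + x3, data = df)  # placeholder names
reduced <- lm(y ~ x1, data = df)            # model excluding the last q predictors
anova(reduced, full)                        # reports F and its p-value
```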

The \(t\)-statistic associated with the \(i\)th predictor is the square root of the \(F\)-statistic for the null hypothesis which sets only \(\beta_i=0\).

A low \(p\)-value indicates that the predictor is important.

Warning: If there are many predictors, even under the null hypothesis, some of the \(t\)-tests will have low p-values even when the model has no explanatory power.

How many variables are important?

When we select a subset of the predictors, we have \(2^p\) choices.

A way to simplify the choice is to define a range of models with an increasing number of variables, then select the best.

Forward selection: Starting from a null model, include variables one at a time, minimizing the RSS at each step.

Backward selection: Starting from the full model, eliminate variables one at a time, choosing the one with the largest p-value at each step.

Mixed selection: Starting from some model, include variables one at a time, minimizing the RSS at each step. If the p-value for some variable goes beyond a threshold, eliminate that variable.
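R's built-in step() function implements this kind of greedy search; note that it ranks models by AIC rather than raw RSS or p-values, so it is a close cousin of the procedures above rather than an exact match. A minimal sketch, with df as a placeholder data frame:

```r
null_model <- lm(y ~ 1, data = df)   # intercept only
full_model <- lm(y ~ ., data = df)   # all predictors

fwd <- step(null_model, scope = formula(full_model), direction = "forward")
bwd <- step(full_model, direction = "backward")
```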

Choosing one model in the range produced is a form of tuning. This tuning can invalidate some of our methods, like hypothesis tests and confidence intervals…

How good are the predictions?

The function predict in R outputs predictions and confidence intervals from a linear model:
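A minimal sketch of both interval types, assuming a fitted model fit and a data frame new_df of predictor values:

```r
# Confidence interval: uncertainty on the mean response at new_df.
predict(fit, newdata = new_df, interval = "confidence")

# Prediction interval: also accounts for the irreducible error, so it is wider.
predict(fit, newdata = new_df, interval = "prediction")
```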

Prediction intervals reflect uncertainty on \(\hat\beta\) and the irreducible error \(\varepsilon\) as well.

These functions rely on our linear regression model
\[
Y = X\beta + \varepsilon.
\]

Dealing with categorical or qualitative predictors

For each qualitative predictor, e.g. Region:

Choose a baseline category, e.g. East

For every other category, define a new predictor:

\(X_\text{South}\) is 1 if the person is from the South region and 0 otherwise

\(X_\text{West}\) is 1 if the person is from the West region and 0 otherwise.

The model will be:
\[
Y = \beta_0 + \beta_1 X_1 +\dots +\beta_7 X_7 + \beta_\text{South} X_\text{South} + \beta_\text{West} X_\text{West} +\varepsilon.
\]

The parameter \(\beta_\text{South}\) is the relative effect on Balance (our \(Y\)) for being from the South compared to the baseline category (East).

The model fit and predictions are independent of the choice of the baseline category.

However, hypothesis tests derived from these variables are affected by the choice.

Solution: To check whether region is important, use an \(F\) -test for the hypothesis \(\beta_\text{South}=\beta_\text{West}=0\) by dropping Region from the model. This does not depend on the coding.

Note that there are other ways to encode qualitative predictors that produce the same fit \(\hat f\), but the coefficients have different interpretations.
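In R, this encoding is handled automatically for factor columns; a minimal sketch, assuming a data frame df with columns Balance and Region:

```r
df$Region <- factor(df$Region)                  # categories: East, South, West
df$Region <- relevel(df$Region, ref = "East")   # pick East as the baseline
fit <- lm(Balance ~ Region, data = df)          # dummies RegionSouth, RegionWest

# F-test for dropping Region entirely; unaffected by the choice of baseline.
anova(lm(Balance ~ 1, data = df), fit)
```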

So far, we have:

Defined Multiple Linear Regression

Discussed how to test the importance of variables.

Described one approach to choose a subset of variables.

Explained how to code qualitative variables.

Now, how do we evaluate model fit? Is the linear model any good? What can go wrong?

How good is the fit?

To assess the fit, we focus on the residuals
\[
e = Y - \hat{Y}.
\]

The RSS always decreases as we add more variables.

The residual standard error (RSE) corrects this:
\[
\text{RSE} = \sqrt{\frac{1}{n-p-1}\text{RSS}}.
\]

Fig. 12 Residuals

Visualizing the residuals can reveal phenomena that are not accounted for by the model, e.g., synergies or interactions:

Potential issues in linear regression

Interactions between predictors

Non-linear relationships

Correlation of error terms

Non-constant variance of error (heteroskedasticity)

High leverage points

Collinearity

Interactions between predictors

Linear regression has an additive assumption:
\[
\mathtt{sales} = \beta_0 + \beta_1\times\mathtt{tv}+ \beta_2\times\mathtt{radio}+\varepsilon
\]

i.e., an increase of 100 USD in TV ads causes a fixed increase of \(100\beta_1\) USD in sales on average, regardless of how much you spend on radio ads.

We saw that in Fig 3.5 above. If we visualize the fit and the observed points, we see they are not evenly scattered around the plane. This could be caused by an interaction.

One way to deal with this is to include multiplicative variables in the model:

The interaction variable tv \(\cdot\) radio is high when both tv and radio are high.

R makes it easy to include interaction variables in the model:
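For instance, assuming a data frame advertising with columns sales, tv and radio:

```r
# In a formula, tv * radio expands to tv + radio + tv:radio (the interaction term).
fit <- lm(sales ~ tv * radio, data = advertising)
summary(fit)
```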

Non-linearities

Fig. 13 A nonlinear fit might be better here.

Example: Auto dataset.

A scatterplot between a predictor and the response may reveal a non-linear relationship.

Solution: include polynomial terms in the model.

Could use other functions besides polynomials…
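A minimal sketch, assuming the Auto data (as in the ISLR package) with columns mpg and horsepower:

```r
# Quadratic fit via poly(); I(horsepower^2) would work as well.
fit_quad <- lm(mpg ~ poly(horsepower, 2), data = Auto)

# Other transformations enter the formula the same way:
fit_log <- lm(mpg ~ log(horsepower), data = Auto)
```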

Fig. 14 Residuals for Auto data

In 2 or 3 dimensions, this is easy to visualize. What do we do when we have too many predictors?

Correlation of error terms

We assumed that the errors for each sample are independent: \(\varepsilon_i \sim N(0,\sigma^2)\quad \text{i.i.d.}\)

What if this breaks down?

The main effect is that this invalidates any assertions about Standard Errors, confidence intervals, and hypothesis tests…

Example: Suppose that by accident, we duplicate the data (we use each sample twice). Then, the standard errors would be artificially smaller by a factor of \(\sqrt{2}\).

When could this happen in real life:

Time series: Each sample corresponds to a different point in time. The errors for samples that are close in time are correlated.

Spatial data: Each sample corresponds to a different location in space.

Grouped data: Imagine a study on predicting height from weight at birth. If some of the subjects in the study are in the same family, their shared environment could make them deviate from \(f(x)\) in similar ways.

Correlated errors

Simulations of time series with increasing correlations between \(\varepsilon_i\)

Non-constant variance of error (heteroskedasticity)

The variance of the error depends on some characteristics of the input features.

To diagnose this, we can plot residuals vs. fitted values:

If the trend in variance is relatively simple, we can transform the response using a logarithm, for example.
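A minimal sketch of the diagnostic plot and a log transform, assuming a fitted model fit on a response y with placeholder predictors:

```r
# Residuals vs. fitted values: a funnel shape suggests heteroskedasticity.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# If the variance grows with the mean, refitting on log(y) often stabilizes it.
fit_log <- lm(log(y) ~ x1 + x2, data = df)
```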

Outliers

Outliers from a model are points with very high errors.

While they may not affect the fit, they might affect our assessment of model quality.

Possible solutions:

If we believe an outlier is due to an error in data collection, we can remove it.

An outlier might be evidence of a missing predictor, or the need to specify a more complex model.

High leverage points

Some samples with extreme inputs have an outsized effect on \(\hat \beta\) .

This can be measured with the leverage statistic or self-influence: the leverage of the \(i\)th sample is \(h_{ii}\), the \(i\)th diagonal entry of the hat matrix \(H = X(X^TX)^{-1}X^T\).

Studentized residuals

The residual \(e_i = y_i - \hat y_i\) is an estimate for the noise \(\varepsilon_i\).

The standard error of \(e_i\) is \(\sigma \sqrt{1-h_{ii}}\).

A studentized residual is \(e_i\) divided by its standard error (with an appropriate estimate of \(\sigma\)).

When the model is correct, it follows a Student \(t\)-distribution with \(n-p-2\) degrees of freedom.
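Both diagnostics are built into R; a minimal sketch, assuming a fitted model fit:

```r
h  <- hatvalues(fit)   # leverage statistics h_ii
rs <- rstudent(fit)    # studentized residuals

# Common heuristics: flag leverage well above its average, or |residual| > 3.
which(h > 3 * mean(h))
which(abs(rs) > 3)
```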

Collinearity

Two predictors are collinear if one explains the other well:

Problem: The coefficients become unidentifiable.

Consider the extreme case of using two identical predictors limit:
\[
\begin{aligned}
\mathtt{balance} &= \beta_0 + \beta_1\times\mathtt{limit} + \beta_2\times\mathtt{limit} + \varepsilon \\
&= \beta_0 + (\beta_1+100)\times\mathtt{limit} + (\beta_2-100)\times\mathtt{limit} + \varepsilon
\end{aligned}
\]

For every \((\beta_0,\beta_1,\beta_2)\), the fit is just as good as at \((\beta_0,\beta_1+100,\beta_2-100)\).

If 2 variables are collinear, we can easily diagnose this using their correlation.

A group of \(q\) variables is multicollinear if these variables “contain less information” than \(q\) independent variables.

Pairwise correlations may not reveal multicollinear variables.

The Variance Inflation Factor (VIF) measures how predictable a given variable is from the other variables, a proxy for how necessary it is:
\[
\text{VIF}(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}.
\]

Above, \(R^2_{X_j|X_{-j}}\) is the \(R^2\) statistic for Multiple Linear regression of the predictor \(X_j\) onto the remaining predictors.
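A minimal sketch of computing one VIF by hand (the vif() function in the car package automates this for every predictor in a fitted model):

```r
# VIF for x1: regress it on the other predictors and use that R^2.
r2_x1  <- summary(lm(x1 ~ x2 + x3, data = df))$r.squared  # placeholder names
vif_x1 <- 1 / (1 - r2_x1)

# Equivalent shortcut for all predictors at once:
# car::vif(lm(y ~ x1 + x2 + x3, data = df))
```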

Introduction to Research Methods

15 Multiple Regression

In the last chapter we met our new friend (frenemy?) regression, and did a few brief examples. And at this point, regression is actually more of a roommate. If you stay in the apartment (research methods) it’s gonna be there. The good thing is regression brings a bunch of cool stuff for the apartment that we need, like a microwave.

15.1 Concepts

Let’s begin this chapter with a bit of a mystery, and then use regression to figure out what’s going on.

What would you predict, just based on what you know and your experiences, the relationship between the number of computers at a school and their math test scores is? Do you think schools with more computers do worse or better?

Computers might be useful for teaching math, and are typically more available in wealthier schools. Thus, I would expect the number of computers at a school to be associated with higher scores on math tests. We can use the data on California schools to test that idea.
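A minimal sketch of that bivariate regression, assuming the chapter's California data is the CASchools data set from the AER package, with math as the test-score column and computer as the number of computers:

```r
library(AER)       # provides the CASchools data
data("CASchools")

fit1 <- lm(math ~ computer, data = CASchools)
summary(fit1)      # sign and significance of the computer coefficient
```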

Oh. Interesting. The relationship is insignificant, and perhaps most surprisingly, negative. Schools with more computers did worse on the test in the sample. For each additional computer at a school, scores on the math test decreased by .001 points, and that result is not significant.

So computers don’t make much of a difference. Are computers distracting the test takers? Diminishing their skills in math? My old math teachers were always worried about us using calculators too much. Maybe, but maybe it’s not the computers’ fault.

Let’s ask a different question then.

What do you think the relationship is between the number of computers at a school and the number of students? Larger schools might not have the same number of computers per student, but if you had to bet money, would you guess that the school with 10,000 students or the one with 1,000 students has more computers?

If you’re guessing that schools with more students have more computers, you’d be correct. The correlation coefficient for the number of students and computers is .93 (very strong), and we can see that below in the graph.

[Figure: scatterplot of the number of students against the number of computers]

More students means more computers. In the regression we ran, though, all the model knows is that schools with more computers do worse on math; it can’t tell why. If larger schools have more computers AND do worse on tests, a bivariate regression can’t separate those effects on its own. We did bivariate regression in the last chapter, where we just look at two variables, one independent and one dependent (bivariate means two (bi) variables (variate)).

Multiple regression can help us try though. Multiple regression doesn’t mean running multiple regressions, it refers to including multiple variables in the same regression. Most of the tools we’ve learned so far only allow for two variables to be used, but with regression we can use many (many) more.

Let’s see what happens when we look at the relationship between the number of computers and math scores, controlling for the number of students at the school.
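Continuing the sketch, we simply add students to the formula:

```r
# Multiple regression: computers and enrollment in the same model.
fit2 <- lm(math ~ computer + students, data = CASchools)
summary(fit2)
```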

This second regression shows something different. In the earlier regression, the number of computers was negative and not significant. Now? Now it’s positive and significant. So what happened?

We controlled for the number of students that are at the school, at the same time that we’re testing the relationship between computers and math scores. Don’t worry if that’s not clear yet, we’re going to spend some time on it. When I say “holding the number of students constant” it means comparing schools with different numbers of computers but that have the same number of students. If we compare two schools with the same number of students, we can then better identify the impact of computers.

We can interpret the variables in the same way as earlier when just testing one variable to some degree. We can see that a larger number of computers is associated with higher test scores, and that larger schools generally do worse on the math test.

Specifically, a one unit increase in computers is associated with an increase of math scores of .002 points, and that change is highly significant.

But our interpretation needs to add something more. With multiple regression what we’re doing is looking at the effect of each variable, while holding the other variable constant.

Specifically, a one unit increase in computers is associated with an increase of math scores of .002 points when holding the number of students constant, and that change is highly significant.

When we look at the effect of computers in this regression, we’re setting aside the impact of student enrollment and just looking at computers. And when we look at the coefficient for students, we’re setting aside the impact of computers and isolating the effect of larger school enrollments on test scores.

We looked at scatter plots and added a line to the graph to better understand the direction of relationships in the previous chapter. We can do that again, but it’s slightly different.

Here is the relationship of computers to math scores, and the relationship of computers to math scores holding students constant. That means we’re actually producing separate lines for both variables, but we’re doing that after accounting for the impact of computers on school enrollment, and school enrollment on computers.

[Figure: fitted lines for computers vs. math scores, with and without holding students constant]

We can also graph it in 3 dimensions, where we place the outcome on the z axis coming out of the paper/screen towards you.

[Figure: 3-D rendering of the regression, with math scores on the z axis]

But I’ll be honest, that doesn’t really clarify it for me. Multiple regression is still about drawing lines, but it’s more of a theoretical line. It’s really hard to actually effectively draw lines as we move beyond two variables or two dimensions. Hopefully that logic of drawing a line and the equation of a line still makes sense for you, because it’s the same formula we use in interpreting multiple regressions.

What we’re figuring out with multiple regression is what part of math scores is determined uniquely by the student enrollment at a school and what part of math scores is determined uniquely by the number of computers. Once R figures that out it gives us the slope of two lines, one for computers and one for students. The line for computers slopes upwards, because the more computers a school has the better its students do, when we hold constant the number of students at the school. When we hold constant the number of computers, larger schools do worse on the math test.

I don’t expect that to fully make sense yet. Understanding what it means to “hold something constant” is pretty complex and theoretical, but it’s also important to fully utilizing the powers of regression. What this example illustrates though is the dangers inherent in using regression results, and the difficulty of using them to prove causality.

Let’s go back to the bivariate regression we did, just including the number of computers at a school and math test scores. Did that prove that computers don’t impact scores? No, even though that would be the correct interpretation of the results. But let’s go back to what we need for causality…

  • Co-variation
  • Temporal Precedence
  • Elimination of Extraneous Variables or Hypotheses

We failed to eliminate extraneous variables. We tested the impact of computers, but we didn’t do anything to test any other hypotheses of what impacts math scores. We didn’t test whether other factors that impact scores (number of teachers, wealth of parents, size of the school) had a mediating relationship on the number of computers. Until we test every other explanation for the relationship, we haven’t really proven anything about computers and test scores. That’s why we need to take caution in doing regression. Yes, you can now do regression, and you can hopefully interpret the results correctly. But correctly interpreting a regression and running a regression that proves something are a little more complicated. We’ll keep working towards that though.

15.1.1 Predicting Wages

To this point the book has attempted to avoid touching on anything that is too controversial. Statistics is math, so it’s a fairly apolitical field, but it can be used to speak to political or controversial matters. We’re going to wade into one in this chapter, to try and show the way that statistics can let us get at some of the thorny issues our world deals with. In addition, this example should help to clarify what it means to “hold something constant”.

We’ll work with the same income data we used in the last chapter from the Panel Study of Income Dynamics from 1982. Just to remind you, these are the variables we have available.

  • experience - Years of full-time work experience.
  • weeks - Weeks worked.
  • occupation - factor. Is the individual a white-collar (“white”) or blue-collar (“blue”) worker?
  • industry - factor. Does the individual work in a manufacturing industry?
  • south - factor. Does the individual reside in the South?
  • smsa - factor. Does the individual reside in a SMSA (standard metropolitan statistical area)?
  • married - factor. Is the individual married?
  • gender - factor indicating gender.
  • union - factor. Is the individual’s wage set by a union contract?
  • education - Years of education.
  • ethnicity - factor indicating ethnicity. Is the individual African American (“afam”) or not (“other”)?
  • wage - Wage.

Let’s say we wanted to understand wage discrimination on the basis of race or ethnicity. Do African Americans earn less than others in the workplace? Let’s see what this data tells us.

And a note before we begin. The variable ethnicity has two categories, “afam”, which indicates African American, and “other”, which means anything but African American. Obviously, “other” captures a lot of people today, but in the 1982 data it can generally be understood to mean white people. I’ll generally just refer to it as other races in the text though.

The average wage for African Americans in the data is 808.5, and for others the average wage is 1174. That means that African Americans earn (in this really specific data set) 69% of what others earn, or 365.5 less.

Let’s say we take that fact to someone that doesn’t believe that African Americans are discriminated against. We’ll call them your “contrarian friend”; you can fill in other ideas of what you’d think about that person. What will their response be? Probably that it isn’t evidence of discrimination, because of course African Americans earn less, they’re less likely to work in white collar jobs. And people that work in white collar jobs earn more, so that’s the reason African Americans earn less. It’s not discrimination, it’s just that they work different jobs.

And on the surface, they’d be right. African Americans are more likely to work in blue collar jobs (65% to 50%), and blue collar jobs earn less (956 for blue collar jobs to 1350 for white collar jobs).

So what we’d want to do then is compare African Americans to others that both work blue collar jobs, and African Americans to others working white collar jobs. If there is a difference in wages between two people working the same job, that’s better evidence that the pay gap is a result not of their occupational choices but their race.

We can visualize that with a two by two chart.
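A reconstruction of that chart, using the average wages quoted in the discussion below:

                 Other     African American
White collar     $1,373    $918
Blue collar      $977      $749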

Let’s work across that chart to see what it tells us. A 2 by 2 chart like that is called a cross tab because it lets us tab ulate figures a cross different characteristics of our data. They can be a methodologically simple way (we’re just showing means/averages there) to tell a story if the data is clear.

So what do we learn? Looking at the top row, white collar workers that are labeled other for ethnicity earn on average $1373. And white collar workers that are African American earn $918. Which means that for white collar workers, African Americans earn $455 less. For blue collar workers, other races earn $977, while African Americans earn $749. That’s a gap of $228. So the size of the gap is different depending on what a person’s job is, but African Americans earn less regardless of their job. So it isn’t just that African Americans are less likely to work white collar jobs that drives their lower wages. Even those in white collar jobs earn less. In fact, African Americans in white collar jobs earn less on average than other races working blue collar jobs!

This is what it means to hold something constant. In that table above we’re holding occupation constant, and comparing people based on their race to people of another race that work the same job. So differences in those jobs aren’t influencing our results now, we’ve set that effect aside for the moment.

And we can do that automatically with regression, like we did when we looked at the effect of computers on math scores, while holding the impact of school enrollment constant.
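A minimal sketch of that regression, assuming the data is the PSID1982 data set from the AER package:

```r
library(AER)
data("PSID1982")

# Wage gap by ethnicity, holding occupation constant.
fit_occ <- lm(wage ~ ethnicity + occupation, data = PSID1982)
summary(fit_occ)
```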

Based on those regression results, African Americans earn $309 less than other races when holding occupation constant, and that effect is highly significant. And blue collar workers earn $380 less than white collar workers when holding race constant, and that effect is significant too.

So have we proven discrimination in wages? Probably not yet for the contrarian friend. Without pause they’ll likely say that education is also important for wages, and African Americans are less likely to go to college. And in the data they’d be correct. On average African Americans completed 11.65 years of education, and other races completed 12.94.

So let’s add that to our regression too.

Now with the ethnicity variable we’re comparing people of different ethnicities that have the same occupation and education. And what do we find? Even holding both of those constant, we would expect an African American worker to earn $262 less, and that is highly significant.

What your contrarian friend is doing is proposing alternative variables and hypotheses that explain the gap in earnings for African Americans. And while those other things do make a difference, they don’t fully explain why African Americans earn less than others. We have shrunk the gap somewhat. Originally the gap was 365.5, which fell to 309 when we held occupation constant, and now 262 with the inclusion of education. So those alternative explanations do explain a portion of why African Americans earned less; it was partly because they had lower-status jobs and less education (setting aside the fact that their lower-status jobs and less education may be the result of discrimination).

So what else do we want to include to try and explain that difference in wages? We can insert all of the variables in the data set to see if there is still a gap in wages between African Americans and others.
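Continuing the sketch with the PSID1982 data, the full model is just every remaining column on the right-hand side:

```r
# wage on all other variables in the data set.
fit_all <- lm(wage ~ ., data = PSID1982)
summary(fit_all)
```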

Controlling for occupation, education, experience, weeks worked, the industry, the region of employment, whether they are married, their gender, and their union status, does ethnicity make a difference in earnings? Yes, if you found two workers that had the same values for all of those variables except that they were of different races, the African American would still likely earn less.

In our regression African Americans earn $167 less when holding occupation, education, experience, weeks worked, the industry, region, marriage, gender, and their union status constant, and that effect is still statistically significant.

The contrarian friend may still have another alternative hypothesis to attempt to explain away that result, but unfortunately that’s all the data will let us test.

What we’re attempting to do is minimize what is called missing variable bias. If there is a plausible story that explains our result, whether one is predicting math test scores or wages or whatever else, if we fail to account for that explanation our model may be misleading. It was misleading to say that computers don’t increase math test scores when we didn’t control for the effect of larger school sizes.

What missing variables do we not have that may explain the difference in earnings between African Americans and others? We don’t know who is a manager at work or anything about job performance, and both of those should help explain why people earn more. So we haven’t removed our missing variable bias, the evidence we can provide is limited by that. But based on the evidence we can generate, we find evidence of racial discrimination in wages.

And I should again emphasize, even if something else did explain the gap in earnings between African Americans and others it wouldn’t prove there wasn’t discrimination in society. If differences in occupation did explain the racial gap in wages, that wouldn’t prove the discrimination didn’t push African Americans towards lower paying jobs.

But the work we’ve done above is similar to what a law firm would do if bringing a lawsuit against a large employer for wage discrimination. It’s hard to prove discrimination in individual cases. The employer will always just argue that John is a bad employee, and that’s why they earn less than their coworkers. Wage discrimination suits are typically brought as class action suits, where a large group of employees sues based on evidence that even when accounting for differences in specific job, and job performance, and experience, and other things there is still a gap in wages.

I should add a note about interpretation here. It’s the researcher that has to identify what the different coefficients mean in the real world. We can talk about discrimination because of differences in earnings for African Americans and others, but we wouldn’t say that blue collar workers are discriminated against because they earn less than white collar workers. It’s unlikely that someone would say that people with more experience earning more is the result of discrimination. These are interpretations that we layer on to the analysis based on our expectations and understanding of the research question.

15.1.2 Predicting Affairs

Regression can be used to make predictions and learn more about the world in all sorts of contexts. Let’s work through another example, with a little more focus on the interpretation.

We’ll use a data set called Affairs, which unsurprisingly has data about affairs. Or more specifically, about people, and whether or not they have had an affair.

The data set includes the following variables.

  • affairsany - coded as 0 for those who haven’t had an affair and 1 for those who have had any number of affairs. This will be the dependent variable.
  • gender - either male or female
  • age - respondent’s age
  • yearsmarried - number of years of current marriage
  • children - are there children from the marriage
  • religiousness - scaled from 1-5, with 1 being anti-religion and 5 being very religious
  • education - years of education
  • occupation - 1-7 based on a specific system of rating for occupations
  • rating - 1-5 based on how happy the respondent reported their marriage being.

So we can throw all of those variables into a regression and see which ones have the largest impact on the likelihood someone had an affair. But before that we should pause to make predictions. We shouldn’t include a variable just for laughs - we should have a reason for including it. We should be able to make a prediction for whether it should increase or decrease the dependent variable.

So what effect do you think these independent variables will have on the chances of someone having had an affair?

  • gender - I would guess that men’s (on average) higher libidos and lower levels of concern about childbearing will lead to more affairs.
  • age - Young people are typically a little less ready for long term commitments, and a bit more irrational and willing to take chances, so age should decrease affairs. Although being older does give you more time to have had an affair.
  • yearsmarried - Longer marriages should be less likely to contain an affair. If someone was going to have an affair, I would expect it to happen earlier, and such things often end marriages.
  • children - Children, and avoiding hurting them, are hopefully a good reason for people to avoid having affairs.
  • religiousness - most religions teach that affairs are wrong, so I would guess people that are more religious are less likely to have affairs
  • education and occupation - I actually can’t make a prediction for what effect education or occupation have on affairs, and since I don’t think they’ll impact the dependent variable I wouldn’t include them in the analysis if I was doing this for myself. But I’ll keep them here as an example to talk about later.
  • rating - happier marriages will likely produce fewer affairs, in large part because it’s often unhappiness that makes couples stray.

Those arguments may be wrong or right. And they certainly won’t be right in every case in the data - there will be counterexamples. What I’ve tried to do is lay out predictions, or hypotheses, for what I expect the model to show us. Let’s test them all and see what predicts whether someone had an affair.
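A minimal sketch of that regression, assuming the Affairs data from the AER package and a linear probability model for the 0/1 outcome (affairsany is derived from the affair count, as described above):

```r
library(AER)
data("Affairs")
Affairs$affairsany <- as.numeric(Affairs$affairs > 0)  # 1 if any affairs

fit <- lm(affairsany ~ gender + age + yearsmarried + children +
            religiousness + education + occupation + rating,
          data = Affairs)
summary(fit)
```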

What do you see as the strongest predictors of whether someone had an affair? Let’s start by identifying what was highly statistically significant. Religiousness and rating both had p-values below .001, so we can be very confident that in the population people who are more religious and who report having happier marriages are both less likely to have affairs. Let’s interpret that more formally.

For each one unit increase in religiousness an individual’s chances of having an affair decrease by .05 holding their gender, age, years married, children, education, occupation and rating constant, and that change is significant.

That’s a long list of things we’re holding constant! When you get past 2 or 3 control variables, or when you’re describing different variables from the same model you can use “holding all else constant” in place of the list.

For each one unit increase in the happiness rating of a marriage an individual’s chances of having an affair decrease by .09, holding all else constant, and that change is significant.

What else that we included in the model is useful for predicting whether someone had an affair?

Age and years married both reach statistical significance. As individuals get older, their chances of having an affair decrease, as I predicted.

However, as their marriages get longer, the chances of having had an affair increase, not decrease as I thought. Interesting! Does that mean I should go back and change my prediction? No. What it likely means is that some of my assumptions were wrong, so I should update them and discuss why I was wrong (in the conclusion if this was a paper). If we only used regression to find things that we already know, we wouldn’t learn anything new. It’s still good that I made a prediction though, because that highlights that the result is a little weird (to my eyes) or may be more surprising to the readers. Imagine if you found that a new jobs program actually lowered participants’ incomes; that would be a really important outcome of your research and just as valuable as if you’d found that incomes increase.

A surprising finding could also be evidence that there’s something wrong in the data. Did we enter years of marriage correctly, or did we possibly reverse it, where longer marriages are actually coded as lower numbers? That’d be odd in this case, but it’s always worth thinking that possibility through. If I got data that showed college graduates earned less than those without a high school degree, I’d be very skeptical of the data, because that would go against everything we know. It might just be an odd, fluky one-time finding, or it could be evidence something is wrong in the data.

Okay, what about everything else? All the other variables are insignificant. Should we remove them from the analysis, since they don’t have a significant effect on our dependent variable? It depends. Insignificant variables can be worth including in most cases in order to show that they don’t have an effect on the outcome. It’s worth knowing that gender and children don’t have an effect on affairs in the population. We had a reason to think they would, and it turns out they don’t really have much of an influence on whether someone has sex outside their marriage. That’s good to know.

I didn’t have a prediction for education or occupation though, and the fact they are insignificant means they aren’t really worth including. I’m not testing any interesting ideas about what affects affairs with those variables, they’re just being included because they’re in the data. That’s not a good reason for them to be there, we want to be testing something with each variable we include.

15.2 Practice

In truth, we haven’t done a lot of new work on code in this chapter. We’ve focused more on the big idea of what it means to go from bivariate regression to multivariate regression. So we won’t do a lot of practice, because the basic structure we learned in the last chapter drives most of what we’ll do.

We’ll read in some new data on Massachusetts schools and test scores there. It’s similar to the California Schools data, but from Massachusetts for variety.

We’ll focus on 4 of those variables, and try to figure out what predicts how schools do on tests in 8th grade (score8).

  • score8 - test scores for 8th graders
  • exptot - total spending for the school
  • english - percentage of students that don’t speak english as their native language
  • income - income of parents

Let’s start by practicing writing a regression to look at the impact of spending (exptot) on test scores.
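A minimal sketch, assuming the Massachusetts data is the MASchools data set from the AER package:

```r
library(AER)
data("MASchools")

summary(lm(score8 ~ exptot, data = MASchools))
```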

That should look very similar to the last chapter. And we can interpret it the same way.

For each one unit increase in spending, we observe a .004 increase in test scores for 8th graders, and that change is significant.

Let’s add one more variable to the regression, and now include english along with exptot. To include an additional variable we just place a + sign between the two variables, as shown below.
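Continuing the sketch:

```r
summary(lm(score8 ~ exptot + english, data = MASchools))
```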

Each one unit increase in spending is associated with a .007 increase in test scores for 8th graders, holding the percentage of english speakers constant, and that change is significant.

Each one unit increase in the percentage of students that don’t speak english as natives is associated with a 4.1 decrease in test scores for 8th graders, holding the spending constant, and that change is significant.

And one more, let’s add one more variable: income.
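And the final sketch, with all three predictors:

```r
summary(lm(score8 ~ exptot + english + income, data = MASchools))
```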

Interesting: spending actually lost its significance in that final regression and changed direction.

Each one unit increase in spending is associated with a .002 decrease in test scores for 8th graders when holding the percentage of english speakers and parental income constant, but that change is insignificant.

Each one unit increase in the percentage of students that don’t speak english as natives is associated with a 2.2 decrease in test scores for 8th graders when holding spending and parental income constant, and that change is significant.

Each one unit increase in parental income is associated with a 2.8 increase in test scores for 8th graders when holding spending and the percentage of english speakers constant, and that change is significant.

The following video demonstrates the coding steps done above.

Multiple Regression Analysis using SPSS Statistics

Introduction.

Multiple regression is an extension of simple linear regression. It is used when we want to predict the value of a variable based on the value of two or more other variables. The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable). The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).

For example, you could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.

Multiple regression also allows you to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained. For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.

This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as well as interpret and report the results from this test. However, before we introduce you to this procedure, you need to understand the different assumptions that your data must meet in order for multiple regression to give you a valid result. We discuss these assumptions next.


Assumptions.

When you choose to analyse your data using multiple regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using multiple regression. You need to do this because it is only appropriate to use multiple regression if your data "passes" eight assumptions that are required for multiple regression to give you a valid result. In practice, checking for these eight assumptions just adds a little bit more time to your analysis, requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these eight assumptions, do not be surprised if, when analysing your own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out multiple regression when everything goes well! However, don’t worry. Even when your data fails certain assumptions, there is often a solution to overcome this. First, let's take a look at these eight assumptions:

  • Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and so forth. You can learn more about interval and ratio variables in our article: Types of Variable. If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").
  • Assumption #2: You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). For examples of continuous and ordinal variables, see the bullet above. Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist, therapist), and so forth. Again, you can learn more about variables in our article: Types of Variable. If one of your independent variables is dichotomous and considered a moderating variable, you might need to run a Dichotomous moderator analysis.
  • Assumption #3: You should have independence of observations (i.e., independence of residuals), which you can easily check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic, as well as showing you the SPSS Statistics procedure required, in our enhanced multiple regression guide.
  • Assumption #4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively. Whilst there are a number of ways to check for these linear relationships, we suggest creating scatterplots and partial regression plots using SPSS Statistics, and then visually inspecting these scatterplots and partial regression plots to check for linearity. If the relationships displayed in your scatterplots and partial regression plots are not linear, you will have to either run a non-linear regression analysis or "transform" your data, which you can do using SPSS Statistics. In our enhanced multiple regression guide, we show you how to: (a) create scatterplots and partial regression plots to check for linearity when carrying out multiple regression using SPSS Statistics; (b) interpret different scatterplot and partial regression plot results; and (c) transform your data using SPSS Statistics if you do not have linear relationships between your variables.
  • Assumption #5: Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. We explain more about what this means and how to assess the homoscedasticity of your data in our enhanced multiple regression guide. When you analyse your own data, you will need to plot the studentized residuals against the unstandardized predicted values. In our enhanced multiple regression guide, we explain: (a) how to test for homoscedasticity using SPSS Statistics; (b) some of the things you will need to consider when interpreting your data; and (c) possible ways to continue with your analysis if your data fails to meet this assumption.
  • Assumption #6: Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other. This leads to problems with understanding which independent variable contributes to the variance explained in the dependent variable, as well as technical issues in calculating a multiple regression model. Therefore, in our enhanced multiple regression guide, we show you: (a) how to use SPSS Statistics to detect for multicollinearity through an inspection of correlation coefficients and Tolerance/VIF values; and (b) how to interpret these correlation coefficients and Tolerance/VIF values so that you can determine whether your data meets or violates this assumption.
  • Assumption #7: There should be no significant outliers, high leverage points or highly influential points. Outliers, leverage and influential points are different terms used to represent observations in your data set that are in some way unusual when you wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable based on the independent variables. This can change the output that SPSS Statistics produces and reduce the predictive accuracy of your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your data, you can detect possible outliers, high leverage points and highly influential points. In our enhanced multiple regression guide, we: (a) show you how to detect outliers using "casewise diagnostics" and "studentized deleted residuals", which you can do using SPSS Statistics, and discuss some of the options you have in order to deal with outliers; (b) check for leverage points using SPSS Statistics and discuss what you should do if you have any; and (c) check for influential points in SPSS Statistics using a measure of influence known as Cook's Distance, before presenting some practical approaches in SPSS Statistics to deal with any influential points you might have.
  • Assumption #8: Finally, you need to check that the residuals (errors) are approximately normally distributed (we explain these terms in our enhanced multiple regression guide). Two common methods to check this assumption include using: (a) a histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal Q-Q Plot of the studentized residuals. Again, in our enhanced multiple regression guide, we: (a) show you how to check this assumption using SPSS Statistics, whether you use a histogram (with superimposed normal curve) and Normal P-P Plot, or Normal Q-Q Plot; (b) explain how to interpret these diagrams; and (c) provide a possible solution if your data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just remember that if you do not run the statistical tests on these assumptions correctly, the results you get when running multiple regression might not be valid. This is why we dedicate a number of sections of our enhanced multiple regression guide to help you get this right. You can find out about our enhanced content as a whole on our Features: Overview page, or more specifically, learn how we help with testing assumptions on our Features: Assumptions page.
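
If you are working outside SPSS Statistics, the same checks can be approximated in Python. The sketch below is illustrative only: the data are simulated and the variable names are invented, but it shows the standard recipes for assumptions #5, #6 and #7 (studentized residuals, Tolerance/VIF values, and Cook's distance).

```python
# Illustrative assumption checks in Python; data and names are invented.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 60, 100),
    "weight": rng.normal(75, 10, 100),
    "heart_rate": rng.normal(140, 12, 100),
})
df["vo2max"] = 60 - 0.2 * df["age"] - 0.1 * df["heart_rate"] + rng.normal(0, 3, 100)

X = sm.add_constant(df[["age", "weight", "heart_rate"]])
fit = sm.OLS(df["vo2max"], X).fit()

# Assumption #6 (multicollinearity): VIF per predictor (rule of thumb: VIF < 10).
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))

# Assumption #5 (homoscedasticity): plot these residuals against the fitted
# values; there should be no funnel shape.
influence = fit.get_influence()
studentized = influence.resid_studentized_internal
print("largest |studentized residual|:", np.abs(studentized).max())

# Assumption #7 (influential points): Cook's distance flags influential cases.
cooks_d = influence.cooks_distance[0]
print("max Cook's D:", cooks_d.max())
```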

In the Procedure section below, we illustrate the SPSS Statistics procedure for performing a multiple regression, assuming that no assumptions have been violated. First, we introduce the example that is used in this guide.

A health researcher wants to be able to predict "VO₂max", an indicator of fitness and health. Normally, performing this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum (i.e., until they can no longer continue exercising due to physical exhaustion). This can put off individuals who are not very active/fit and those who might be at higher risk of ill health (e.g., older unfit subjects). For these reasons, it has been desirable to find a way of predicting an individual's VO₂max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO₂max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20-minute, much easier, lower-workload cycling test. The researcher's goal is to be able to predict VO₂max based on these four attributes: age, weight, heart rate and gender.

Setup in SPSS Statistics

In SPSS Statistics, we created six variables: (1) VO₂max, which is the maximal aerobic capacity; (2) age, which is the participant's age; (3) weight, which is the participant's weight (technically, it is their 'mass'); (4) heart_rate, which is the participant's heart rate; (5) gender, which is the participant's gender; and (6) caseno, which is the case number. The caseno variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking for assumptions. In our enhanced multiple regression guide, we show you how to correctly enter data in SPSS Statistics to run a multiple regression when you are also checking for assumptions. You can learn about our enhanced data setup content on our Features: Data Setup page. Alternatively, see our generic, "quick start" guide: Entering Data in SPSS Statistics.
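
For readers replicating this setup outside SPSS Statistics, a minimal pandas equivalent of the same six-variable layout might look like the following. The values are invented for illustration, and the gender coding (1 = male, 0 = female) is an assumption, since the guide's coding is not shown in this excerpt.

```python
# A sketch of the six-variable data layout in pandas; values are illustrative.
import pandas as pd

data = pd.DataFrame({
    "caseno":     [1, 2, 3],
    "vo2max":     [42.5, 38.1, 51.0],  # maximal aerobic capacity (ml/min/kg)
    "age":        [31, 45, 26],
    "weight":     [70.2, 82.5, 64.0],  # body mass in kg
    "heart_rate": [135, 148, 128],     # mean HR over last 5 min of cycling test
    "gender":     [1, 0, 1],           # assumed coding: 1 = male, 0 = female
})

# 'caseno' plays the same role as in SPSS: it makes it easy to drop flagged
# cases after the assumption checks, e.g. drop case 2 if it were an outlier:
cleaned = data[~data["caseno"].isin([2])]
```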

Test Procedure in SPSS Statistics

The seven steps below show you how to analyse your data using multiple regression in SPSS Statistics when none of the eight assumptions in the previous section, Assumptions , have been violated. At the end of these seven steps, we show you how to interpret the results from your multiple regression. If you are looking for help to make sure your data meets assumptions #3, #4, #5, #6, #7 and #8, which are required when using multiple regression and can be tested using SPSS Statistics, you can learn more in our enhanced guide (see our Features: Overview page to learn more).

Note: The procedure that follows is identical for SPSS Statistics versions 18 to 28, as well as the subscription version of SPSS Statistics, with version 28 and the subscription version being the latest. However, in version 27 and the subscription version, SPSS Statistics introduced a new look to the interface called "SPSS Light", replacing the previous look for versions 26 and earlier, which was called "SPSS Standard". Therefore, if you have SPSS Statistics version 27 or 28 (or the subscription version), the images that follow will be light grey rather than blue. However, the procedure is identical.

[Image: Menu for a multiple regression analysis in SPSS Statistics]

Published with written permission from SPSS Statistics, IBM Corporation.

Note: Don't worry that you're selecting Analyze > Regression > Linear... on the main menu or that the dialogue boxes in the steps that follow have the title, Linear Regression. You have not made a mistake. You are in the correct place to carry out the multiple regression procedure. This is just the title that SPSS Statistics gives, even when running a multiple regression procedure.

[Image: 'Linear Regression' dialogue box for a multiple regression analysis in SPSS Statistics, with all variables on the left]

Interpreting and Reporting the Output of Multiple Regression Analysis

SPSS Statistics will generate quite a few tables of output for a multiple regression analysis. In this section, we show you only the three main tables required to understand your results from the multiple regression procedure, assuming that no assumptions have been violated. A complete explanation of the output you have to interpret when checking your data for the eight assumptions required to carry out multiple regression is provided in our enhanced guide. This includes relevant scatterplots and partial regression plots, histogram (with superimposed normal curve), Normal P-P Plot and Normal Q-Q Plot, correlation coefficients and Tolerance/VIF values, casewise diagnostics and studentized deleted residuals.

However, in this "quick start" guide, we focus only on the three main tables you need to understand your multiple regression results, assuming that your data has already met the eight assumptions required for multiple regression to give you a valid result:

Determining how well the model fits

The first table of interest is the Model Summary table. This table provides the R, R², adjusted R², and the standard error of the estimate, which can be used to determine how well a regression model fits the data:

[Image: 'Model Summary' table for a multiple regression analysis in SPSS Statistics, with 'R', 'R Square' and 'Adjusted R Square' highlighted]

The " R " column represents the value of R , the multiple correlation coefficient . R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO 2 max . A value of 0.760, in this example, indicates a good level of prediction. The " R Square " column represents the R 2 value (also called the coefficient of determination), which is the proportion of variance in the dependent variable that can be explained by the independent variables (technically, it is the proportion of variation accounted for by the regression model above and beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the variability of our dependent variable, VO 2 max . However, you also need to be able to interpret " Adjusted R Square " ( adj. R 2 ) to accurately report your data. We explain the reasons for this, as well as the output, in our enhanced multiple regression guide.

Statistical significance

The F-ratio in the ANOVA table (see below) tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F(4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit for the data).

[Image: 'ANOVA' table for a multiple regression analysis in SPSS Statistics, with 'df', 'F' and 'Sig.' highlighted]

Estimated model coefficients

The general form of the equation to predict VO₂max from age, weight, heart_rate, and gender is:

predicted VO₂max = 87.83 − (0.165 × age) − (0.385 × weight) − (0.118 × heart_rate) + (13.208 × gender)

This is obtained from the Coefficients table, as shown below:

[Image: 'Coefficients' table for a multiple regression analysis in SPSS Statistics, with 'Unstandardized Coefficients B' highlighted]

Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B₁, for age is equal to −0.165 (see Coefficients table). This means that for each one-year increase in age, there is a decrease in VO₂max of 0.165 ml/min/kg.
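
To make the equation concrete, here is the same prediction rule written as a small Python function. The coefficients come straight from the equation above; the gender coding (assumed here: 1 = male, 0 = female) is an assumption, since the guide's coding is not shown in this excerpt.

```python
# The regression equation above, expressed as a function.
def predict_vo2max(age: float, weight: float, heart_rate: float, gender: int) -> float:
    """Predicted VO2max in ml/min/kg from the fitted model's coefficients."""
    return 87.83 - 0.165 * age - 0.385 * weight - 0.118 * heart_rate + 13.208 * gender

# Example: a 30-year-old, 75 kg participant with a heart rate of 130 and
# gender coded 1 is predicted to have a VO2max of about 51.9 ml/min/kg.
print(predict_vo2max(age=30, weight=75, heart_rate=130, gender=1))  # ~51.87
```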

Statistical significance of the independent variables

You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different to 0 (zero). The t-value and corresponding p-value are located in the "t" and "Sig." columns, respectively, as highlighted below:

[Image: 'Coefficients' table for a multiple regression analysis in SPSS Statistics, with 't' and 'Sig.' highlighted]

You can see from the " Sig. " column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B 0 , is tested for statistical significance, this is rarely an important or interesting finding.

Putting it all together

You could write up the results as follows:

A multiple regression was run to predict VO₂max from gender, age, weight and heart rate. These variables statistically significantly predicted VO₂max, F(4, 95) = 32.393, p < .0005, R² = .577. All four variables added statistically significantly to the prediction, p < .05.

If you are unsure how to interpret regression equations or how to use them to make predictions, we discuss this in our enhanced multiple regression guide. We also show you how to write up the results from your assumptions tests and multiple regression output if you need to report this in a dissertation/thesis, assignment or research report. We do this using the Harvard and APA styles. You can learn more about our enhanced content on our Features: Overview page.


Regression Analysis – Methods, Types and Examples


Regression Analysis

Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').

Regression Analysis Methodology

Here is a general methodology for performing regression analysis:

  • Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables) that you believe are related to the dependent variable.
  • Collect data: Gather the data for the dependent variable and independent variables. Ensure that the data is relevant, accurate, and representative of the population or phenomenon you are studying.
  • Explore the data: Perform exploratory data analysis to understand the characteristics of the data, identify any missing values or outliers, and assess the relationships between variables through scatter plots, histograms, or summary statistics.
  • Choose the regression model: Select an appropriate regression model based on the nature of the variables and the research question. Common regression models include linear regression, multiple regression, logistic regression, polynomial regression, and time series regression, among others.
  • Assess assumptions: Check the assumptions of the regression model. Some common assumptions include linearity (the relationship between variables is linear), independence of errors, homoscedasticity (constant variance of errors), and normality of errors. Violation of these assumptions may require additional steps or alternative models.
  • Estimate the model: Use a suitable method to estimate the parameters of the regression model. The most common method is ordinary least squares (OLS), which minimizes the sum of squared differences between the observed and predicted values of the dependent variable.
  • Interpret the results: Analyze the estimated coefficients, p-values, confidence intervals, and goodness-of-fit measures (e.g., R-squared) to interpret the results. Determine the significance and direction of the relationships between the independent variables and the dependent variable.
  • Evaluate model performance: Assess the overall performance of the regression model using appropriate measures, such as R-squared, adjusted R-squared, and root mean squared error (RMSE). These measures indicate how well the model fits the data and how much of the variation in the dependent variable is explained by the independent variables.
  • Test assumptions and diagnose problems: Check the residuals (the differences between observed and predicted values) for any patterns or deviations from assumptions. Conduct diagnostic tests, such as examining residual plots, testing for multicollinearity among independent variables, and assessing heteroscedasticity or autocorrelation, if applicable.
  • Make predictions and draw conclusions: Once you have a satisfactory model, use it to make predictions on new or unseen data. Draw conclusions based on the results of the analysis, considering the limitations and potential implications of the findings.
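
As a rough illustration of steps 4 through 10 above, the sketch below simulates a small data set, estimates a multiple regression by ordinary least squares with statsmodels, and reports coefficients, p-values, R², and RMSE. Everything in it (the data, the variable names) is invented purely for demonstration.

```python
# A minimal end-to-end regression pass on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 + 1.5 * df["x1"] - 0.8 * df["x2"] + rng.normal(scale=1.0, size=n)

# Estimate the model by OLS (step: "Estimate the model").
model = smf.ols("y ~ x1 + x2", data=df).fit()

# Interpret and evaluate: coefficients, p-values, R-squared, RMSE.
print(model.params)                      # estimated beta0, beta1, beta2
print(model.pvalues)                     # significance of each coefficient
print(model.rsquared, model.rsquared_adj)
print("RMSE:", np.sqrt(np.mean(model.resid ** 2)))

# Diagnose problems: in practice, plot model.fittedvalues against
# model.resid; the residuals should scatter patternlessly around zero.
```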

Types of Regression Analysis

Types of Regression Analysis are as follows:

Linear Regression

Linear regression is the most basic and widely used form of regression analysis. It models the linear relationship between a dependent variable and one or more independent variables. The goal is to find the best-fitting line that minimizes the sum of squared differences between observed and predicted values.

Multiple Regression

Multiple regression extends linear regression by incorporating two or more independent variables to predict the dependent variable. It allows for examining the simultaneous effects of multiple predictors on the outcome variable.

Polynomial Regression

Polynomial regression models non-linear relationships between variables by adding polynomial terms (e.g., squared or cubic terms) to the regression equation. It can capture curved or nonlinear patterns in the data.

Logistic Regression

Logistic regression is used when the dependent variable is binary or categorical. It models the probability of the occurrence of a certain event or outcome based on the independent variables. Logistic regression estimates the coefficients using the logistic function, which transforms the linear combination of predictors into a probability.
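
A minimal scikit-learn sketch of this idea follows; the predictors and coefficients are invented, but the output shows the estimated coefficients and the probabilities the logistic function produces.

```python
# Logistic regression on simulated binary data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))                      # two predictors
p = 1 / (1 + np.exp(-(0.5 + 1.2 * X[:, 0] - 0.7 * X[:, 1])))
y = rng.binomial(1, p)                             # binary outcome

clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)                   # estimated coefficients
print(clf.predict_proba(X[:3]))                    # probabilities in [0, 1]
```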

Ridge Regression and Lasso Regression

Ridge regression and Lasso regression are techniques used for addressing multicollinearity (high correlation between independent variables) and variable selection. Both methods introduce a penalty term to the regression equation to shrink or eliminate less important variables. Ridge regression uses L2 regularization, while Lasso regression uses L1 regularization.
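
The contrast between the two penalties is easy to see in code. In this invented example only the first predictor truly matters; Ridge shrinks every coefficient a little, while Lasso drives the unimportant ones to exactly zero.

```python
# Ridge (L2) vs Lasso (L1) on the same simulated data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(size=200)  # only x0 really matters

print(Ridge(alpha=1.0).fit(X, y).coef_)  # all five coefficients shrunk, none zero
print(Lasso(alpha=0.1).fit(X, y).coef_)  # weak coefficients driven to exactly 0
```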

Time Series Regression

Time series regression analyzes the relationship between a dependent variable and independent variables when the data is collected over time. It accounts for autocorrelation and trends in the data and is used in forecasting and studying temporal relationships.

Nonlinear Regression

Nonlinear regression models are used when the relationship between the dependent variable and independent variables is not linear. These models can take various functional forms and require estimation techniques different from those used in linear regression.

Poisson Regression

Poisson regression is employed when the dependent variable represents count data. It models the relationship between the independent variables and the expected count, assuming a Poisson distribution for the dependent variable.
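
A short statsmodels sketch of Poisson regression on simulated count data follows; note the log link, so exponentiated coefficients read as rate ratios. All numbers are illustrative.

```python
# Poisson regression for count outcomes via a GLM.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.normal(size=500)
mu = np.exp(0.3 + 0.6 * x)                # log link: log(E[y]) = b0 + b1*x
y = rng.poisson(mu)

X = sm.add_constant(x)
res = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(res.params)          # estimates of b0 and b1 on the log scale
print(np.exp(res.params))  # rate ratios: multiplicative effect on the count
```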

Generalized Linear Models (GLM)

GLMs are a flexible class of regression models that extend the linear regression framework to handle different types of dependent variables, including binary, count, and continuous variables. GLMs incorporate various probability distributions and link functions.

Regression Analysis Formulas

Regression analysis involves estimating the parameters of a regression model to describe the relationship between the dependent variable (Y) and one or more independent variables (X). Here are the basic formulas for linear regression, multiple regression, and logistic regression:

Linear Regression:

Simple Linear Regression Model: Y = β0 + β1X + ε

Multiple Linear Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

In both formulas:

  • Y represents the dependent variable (response variable).
  • X represents the independent variable(s) (predictor variable(s)).
  • β0, β1, β2, …, βn are the regression coefficients or parameters that need to be estimated.
  • ε represents the error term or residual (the difference between the observed and predicted values).

Multiple Regression:

Multiple regression extends the concept of simple linear regression by including multiple independent variables.

Multiple Regression Model: Y = β0 + β1X1 + β2X2 + … + βnXn + ε

The formulas are similar to those in linear regression, with the addition of more independent variables.

Logistic Regression:

Logistic regression is used when the dependent variable is binary or categorical. The logistic regression model applies a logistic or sigmoid function to the linear combination of the independent variables.

Logistic Regression Model: p = 1 / (1 + e^-(β0 + β1X1 + β2X2 + … + βnXn))

In the formula:

  • p represents the probability of the event occurring (e.g., the probability of success or belonging to a certain category).
  • X1, X2, …, Xn represent the independent variables.
  • e is the base of the natural logarithm.

The logistic function ensures that the predicted probabilities lie between 0 and 1, allowing for binary classification.

Regression Analysis Examples

Regression Analysis Examples are as follows:

  • Stock Market Prediction: Regression analysis can be used to predict stock prices based on various factors such as historical prices, trading volume, news sentiment, and economic indicators. Traders and investors can use this analysis to make informed decisions about buying or selling stocks.
  • Demand Forecasting: In retail and e-commerce, regression analysis can help forecast demand for products. By analyzing historical sales data along with real-time data such as website traffic, promotional activities, and market trends, businesses can adjust their inventory levels and production schedules to meet customer demand more effectively.
  • Energy Load Forecasting: Utility companies often use real-time regression analysis to forecast electricity demand. By analyzing historical energy consumption data, weather conditions, and other relevant factors, they can predict future energy loads. This information helps them optimize power generation and distribution, ensuring a stable and efficient energy supply.
  • Online Advertising Performance: Regression analysis can be used to assess the performance of online advertising campaigns. By analyzing real-time data on ad impressions, click-through rates, conversion rates, and other metrics, advertisers can adjust their targeting, messaging, and ad placement strategies to maximize their return on investment.
  • Predictive Maintenance: Regression analysis can be applied to predict equipment failures or maintenance needs. By continuously monitoring sensor data from machines or vehicles, regression models can identify patterns or anomalies that indicate potential failures. This enables proactive maintenance, reducing downtime and optimizing maintenance schedules.
  • Financial Risk Assessment: Real-time regression analysis can help financial institutions assess the risk associated with lending or investment decisions. By analyzing real-time data on factors such as borrower financials, market conditions, and macroeconomic indicators, regression models can estimate the likelihood of default or assess the risk-return tradeoff for investment portfolios.

Importance of Regression Analysis

Importance of Regression Analysis is as follows:

  • Relationship Identification: Regression analysis helps in identifying and quantifying the relationship between a dependent variable and one or more independent variables. It allows us to determine how changes in independent variables impact the dependent variable. This information is crucial for decision-making, planning, and forecasting.
  • Prediction and Forecasting: Regression analysis enables us to make predictions and forecasts based on the relationships identified. By estimating the values of the dependent variable using known values of independent variables, regression models can provide valuable insights into future outcomes. This is particularly useful in business, economics, finance, and other fields where forecasting is vital for planning and strategy development.
  • Causality Assessment: While correlation does not imply causation, regression analysis provides a framework for assessing causality by considering the direction and strength of the relationship between variables. It allows researchers to control for other factors and assess the impact of a specific independent variable on the dependent variable. This helps in determining the causal effect and identifying significant factors that influence outcomes.
  • Model Building and Variable Selection: Regression analysis aids in model building by determining the most appropriate functional form of the relationship between variables. It helps researchers select relevant independent variables and eliminate irrelevant ones, reducing complexity and improving model accuracy. This process is crucial for creating robust and interpretable models.
  • Hypothesis Testing: Regression analysis provides a statistical framework for hypothesis testing. Researchers can test the significance of individual coefficients, assess the overall model fit, and determine if the relationship between variables is statistically significant. This allows for rigorous analysis and validation of research hypotheses.
  • Policy Evaluation and Decision-Making: Regression analysis plays a vital role in policy evaluation and decision-making processes. By analyzing historical data, researchers can evaluate the effectiveness of policy interventions and identify the key factors contributing to certain outcomes. This information helps policymakers make informed decisions, allocate resources effectively, and optimize policy implementation.
  • Risk Assessment and Control: Regression analysis can be used for risk assessment and control purposes. By analyzing historical data, organizations can identify risk factors and develop models that predict the likelihood of certain outcomes, such as defaults, accidents, or failures. This enables proactive risk management, allowing organizations to take preventive measures and mitigate potential risks.

When to Use Regression Analysis

  • Prediction : Regression analysis is often employed to predict the value of the dependent variable based on the values of independent variables. For example, you might use regression to predict sales based on advertising expenditure, or to predict a student’s academic performance based on variables like study time, attendance, and previous grades.
  • Relationship analysis: Regression can help determine the strength and direction of the relationship between variables. It can be used to examine whether there is a linear association between variables, identify which independent variables have a significant impact on the dependent variable, and quantify the magnitude of those effects.
  • Causal inference: Regression analysis can be used to explore cause-and-effect relationships by controlling for other variables. For example, in a medical study, you might use regression to determine the impact of a specific treatment while accounting for other factors like age, gender, and lifestyle.
  • Forecasting : Regression models can be utilized to forecast future trends or outcomes. By fitting a regression model to historical data, you can make predictions about future values of the dependent variable based on changes in the independent variables.
  • Model evaluation: Regression analysis can be used to evaluate the performance of a model or test the significance of variables. You can assess how well the model fits the data, determine if additional variables improve the model’s predictive power, or test the statistical significance of coefficients.
  • Data exploration : Regression analysis can help uncover patterns and insights in the data. By examining the relationships between variables, you can gain a deeper understanding of the data set and identify potential patterns, outliers, or influential observations.

Applications of Regression Analysis

Here are some common applications of regression analysis:

  • Economic Forecasting: Regression analysis is frequently employed in economics to forecast variables such as GDP growth, inflation rates, or stock market performance. By analyzing historical data and identifying the underlying relationships, economists can make predictions about future economic conditions.
  • Financial Analysis: Regression analysis plays a crucial role in financial analysis, such as predicting stock prices or evaluating the impact of financial factors on company performance. It helps analysts understand how variables like interest rates, company earnings, or market indices influence financial outcomes.
  • Marketing Research: Regression analysis helps marketers understand consumer behavior and make data-driven decisions. It can be used to predict sales based on advertising expenditures, pricing strategies, or demographic variables. Regression models provide insights into which marketing efforts are most effective and help optimize marketing campaigns.
  • Health Sciences: Regression analysis is extensively used in medical research and public health studies. It helps examine the relationship between risk factors and health outcomes, such as the impact of smoking on lung cancer or the relationship between diet and heart disease. Regression analysis also helps in predicting health outcomes based on various factors like age, genetic markers, or lifestyle choices.
  • Social Sciences: Regression analysis is widely used in social sciences like sociology, psychology, and education research. Researchers can investigate the impact of variables like income, education level, or social factors on various outcomes such as crime rates, academic performance, or job satisfaction.
  • Operations Research: Regression analysis is applied in operations research to optimize processes and improve efficiency. For example, it can be used to predict demand based on historical sales data, determine the factors influencing production output, or optimize supply chain logistics.
  • Environmental Studies: Regression analysis helps in understanding and predicting environmental phenomena. It can be used to analyze the impact of factors like temperature, pollution levels, or land use patterns on phenomena such as species diversity, water quality, or climate change.
  • Sports Analytics: Regression analysis is increasingly used in sports analytics to gain insights into player performance, team strategies, and game outcomes. It helps analyze the relationship between various factors like player statistics, coaching strategies, or environmental conditions and their impact on game outcomes.



Neag School of Education

Educational Research Basics by Del Siegle

More than you want to know about regression….

CORRELATION and REGRESSION are very similar with one main difference. In correlation the variables have equal status. In regression the focus is on predicting one variable from another.

  • Independent Variable = PREDICTOR Variable = X
  • Dependent Variable = CRITERION Variable = Y (Y hat) (Y is regressed on X) (Y is a function of X)
  • SIMPLE REGRESSION involves one predictor variable and one criterion variable.
  • MULTIPLE REGRESSION involves more than one predictor variable and one criterion variable.

Two Common Types of Multiple Regression

  • STEPWISE MULTIPLE REGRESSION-
  • HIERARCHICAL MULTIPLE REGRESSION–

The research question for regression is: To what extent and in what manner do the predictors explain variation in the criterion?

  • to what extent– H0: R2=0
  • in what manner– H0: beta=0

EXPLAINED (REGRESSION) is the difference between the mean of Y and the predicted Y. ERROR (RESIDUAL) is the difference between the predicted Y (Y HAT or PRIME) and the observed Y.

STANDARD ERROR OF ESTIMATE– square root of the average squared residuals (distance of scores from the regression line) — the standard deviation of the obtained minus predicted scores

MULTIPLE R SQUARE– The variation in the criterion variable that can be predicted (accounted for) by the set of predictor variables.

ADJUSTED R SQUARE– Because the equation that is created with one sample will be used with a similar, although not identical population, there is some SHRINKAGE in the amount of variation that can be explained with the new population.

b weights (REGRESSION COEFFICIENT) can’t be used to compare relative importance of the predictors because the b weights are based on the measurement scale of each predictor. Can be used to compare different samples from the same population. Represents how much of an increase in the criterion variable results from one unit increase in the predictor variable. Regression coefficients and the Constant are used to write the REGRESSION EQUATION.

Beta weights (BETA COEFFICIENT — a.k.a. PARTIAL REGRESSION COEFFICIENTS) are used to judge the relative importance of predictor variables, but they should not be used to compare from one sample to another because they are influenced by changes in the standard deviation. In a simple regression, the beta equals the correlation. Beta weights are used to write the STANDARDIZED REGRESSION EQUATION.

CHANGE IN R SQUARE– reveals semi-partial correlations

CONSTANT– Y intercept

INDEPENDENCE– the X variables are not the same or related

HOMOSCEDASTICITY– the variation of the observed Y scores above and below the regression line is similar up and down the regression line.

MULTICOLLINEARITY– the predictor variables are highly correlated with each other. This results in unstable beta weights which cannot be trusted. Multicollinearity is assessed with TOLERANCE. A low TOLERANCE indicates lots of multicollinearity. TOLERANCES above .70 are good.

N:P — Ratio of observations to predictor variables. A 40:1 ratio is recommended for Stepwise and 20:1 for Hierarchical

Y(hat) = a + bX (where Y(hat) is the predicted score, a is the Y-axis intercept of the regression line, and b is the slope of the regression line)
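
The same equation can be computed from first principles. This NumPy fragment, with made-up scores, recovers b as cov(X, Y)/var(X) and a from the means, matching what np.polyfit would return.

```python
# Simple regression slope and intercept from first principles.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([3.1, 5.2, 6.8, 9.1])

b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)  # slope: cov(X, Y) / var(X)
a = y.mean() - b * x.mean()                          # line passes through the means
print(a, b)                                          # same as np.polyfit(x, y, 1)
```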

Academic Success Center

Statistics Resources


Research Questions and Hypotheses

These are just a few examples of what the research questions and hypotheses may look like when a regression analysis is appropriate. 

  • H0: Bodyweight does not have an influence on cholesterol levels.
  • Ha: Bodyweight has a significant influence on cholesterol levels.
  • H0: IQ does not predict GPA.
  • Ha: IQ is a significant predictor of GPA.
  • H0: Oxygen, water, and sunlight are not related to plant growth.
  • Ha: At least one of the predictor variables is a significant predictor of plant growth.
  • H0: There is no relationship between IQ or gender, and GPA.
  • Ha: IQ and/or gender significantly predict(s) GPA.

Logistic Regression

  • H0: Income is not a predictor of gender.
  • Ha: There is a predictive relationship between gender and income.
  • H0: There is no relationship between customer satisfaction, brand perception, price perception, and purchase decision.
  • Ha: At least one of the predictor variables has a predictive relationship with purchase decision.

Multiple Logistic Regression

  • H0: There is no influence on game choice by standardized test scores.
  • Ha: There is a significant influence of at least one of the predictor variables on game choice.



Section 5.4: Hierarchical Regression Explanation, Assumptions, Interpretation, and Write Up

Learning Objectives

At the end of this section you should be able to answer the following questions:

  • Explain how hierarchical regression differs from multiple regression.
  • Discuss where you would use "control variables" in a hierarchical regression analysis.

Hierarchical Regression Explanation and Assumptions

Hierarchical regression is a type of regression model in which the predictors are entered in blocks. Each block represents one step (or model). The order in which predictors are entered (which predictor goes into which block) is decided by the researcher, but should always be based on theory.

The first block entered into a hierarchical regression can include “control variables,” which are variables that we want to hold constant. In a sense, researchers want to account for the variability of the control variables by removing it before analysing the relationship between the predictors and the outcome.

The example research question is "what is the effect of perceived stress on physical illness, after controlling for age and gender?". To answer this research question, we will need two blocks: one with age and gender, and a second that adds perceived stress.

It is important to note that the assumptions for hierarchical regression are the same as those covered for simple or basic multiple regression. You may wish to go back to the section on multiple regression assumptions if you can’t remember the assumptions or want to check them out before progressing through the chapter.
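
For readers who want to see the mechanics outside Jamovi (used below), hierarchical regression is just a sequence of nested models. The Python sketch that follows uses invented data and variable names to show the two blocks, the R² change, and the F-change test; the sample size merely mirrors the degrees of freedom in the worked example.

```python
# Hierarchical regression as two nested OLS models; data are simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
n = 367
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "gender": rng.integers(0, 2, n),   # 0 = male, 1 = female
    "stress": rng.normal(size=n),
})
df["illness"] = 0.1 * df["gender"] - 0.01 * df["age"] + 0.5 * df["stress"] + rng.normal(size=n)

# Block 1: control variables only. Block 2: adds the predictor of interest.
m1 = smf.ols("illness ~ age + gender", data=df).fit()
m2 = smf.ols("illness ~ age + gender + stress", data=df).fit()

print(m2.rsquared - m1.rsquared)   # R-squared change attributable to stress
print(anova_lm(m1, m2))            # F-change test for the improvement in fit
```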

Hierarchical Regression Interpretation

PowerPoint: Hierarchical Regression

For this example, please click on the link for Chapter Five – Hierarchical Regression below. You will find 4 slides that we will be referring to for the rest of this section.

  • Chapter Five – Hierarchical Regression

For this test, the statistical program used was Jamovi, which is freely available to use. The first two slides show the steps to produce the results. The third slide shows the output with highlighting. You might want to think about what you have already learned, to see if you can work out the important elements of this output.


Slide 2 shows the overall model statistics. The first model, with only age and gender, can be seen circled in red. This model is obviously significant. The second model (circled in green) includes age, gender, and perceived stress. As you can see, the F statistic is larger for the second model. However, does this mean it is significantly larger?

To answer this question, we will need to look at the model change statistics on Slide 3. The R value for model 1 can be seen circled in red as .202; this model explains approximately 4% of the variance in physical illness. The R value for model 2 is circled in green; this model explains a more sizeable share of the variance, about 25%.

[Image: Tables with model fit and model comparison statistics]

The significance of the change in the model can be seen in blue on Slide 3. The information you are looking at is the R squared change, the F statistic change, and the statistical significance of this change.

[Image: Table with data on physical illness]

On Slide 4, you can examine the role of each individual independent variable on the dependent variable. For model one, as circled in red, age and gender are both significantly associated with physical illness. In this case, age is negatively associated (i.e. the younger you are, the more likely you are to be healthy), and gender is positively associated (in this case, being female is more likely to result in more physical illness). For model 2, gender is still positively associated and now perceived stress is also positively associated. However, age is no longer significantly associated with physical illness following the introduction of perceived stress. Possibly this is because older persons experience less life stress than younger persons.

Hierarchical Regression Write Up

An example write up of a hierarchal regression analysis is seen below:

In order to test the predictions, a hierarchical multiple regression was conducted, with two blocks of variables. The first block included age and gender (0 = male, 1 = female) as the predictors, with physical illness as the dependent variable. In block two, level of perceived stress was added as a predictor, with physical illness remaining the dependent variable.

Overall, the results showed that the first model was significant, F(2,364) = 7.75, p = .001, R² = .04. Both age and gender were significantly associated with physical illness (b = -0.14, t = -2.78, p = .006, and b = .14, t = 2.70, p = .007, respectively). The second model (F(3,363) = 39.61, p < .001, R² = .25), which added perceived stress (b = 0.47, t = 9.96, p < .001), showed significant improvement over the first model, ΔF(1,363) = 99.13, p < .001, ΔR² = .21. Overall, age and gender alone explained approximately 4% of the variance, while the final model, including perceived stress, accounted for 24.7% of the variance, with models one and two representing a small and a large effect size, respectively.

Statistics for Research Students Copyright © 2022 by University of Southern Queensland is licensed under a Creative Commons Attribution 4.0 International License , except where otherwise noted.


  • Open access
  • Published: 14 May 2024

Exploring predictors and prevalence of postpartum depression among mothers: Multinational study

  • Samar A. Amer   ORCID: orcid.org/0000-0002-9475-6372 1 ,
  • Nahla A. Zaitoun   ORCID: orcid.org/0000-0002-5274-6061 2 ,
  • Heba A. Abdelsalam 3 ,
  • Abdallah Abbas   ORCID: orcid.org/0000-0001-5101-5972 4 ,
  • Mohamed Sh Ramadan 5 ,
  • Hassan M. Ayal 6 ,
  • Samaher Edhah Ahmed Ba-Gais 7 ,
  • Nawal Mahboob Basha 8 ,
  • Abdulrahman Allahham 9 ,
  • Emmanuael Boateng Agyenim 10 &
  • Walid Amin Al-Shroby 11  

BMC Public Health volume 24, Article number: 1308 (2024)


Postpartum depression (PPD) affects around 10% of women, or 1 in 7 women, after giving birth. Undiagnosed PPD was observed among 50% of mothers. PPD has an unfavorable relationship with women’s functioning, marital and personal relationships, the quality of the mother-infant connection, and the social, behavioral, and cognitive development of children. We aim to determine the frequency of PPD and explore associated determinants or predictors (demographic, obstetric, infant-related, and psychosocial factors) and coping strategies from June to August 2023 in six countries.

An analytical cross-sectional study included a total of 674 mothers who visited primary health care centers (PHCs) in Egypt, Yemen, Iraq, India, Ghana, and Syria. They were asked to complete self-administered assessments using the Edinburgh Postnatal Depression Scale (EPDS). The data underwent logistic regression analysis using SPSS-IBM 27 to list potential factors that could predict PPD.

The overall frequency of PPD in the total sample was 92 (13.6%). It ranged from 2.3% in Syria to 26% in Ghana. Only 42 (6.2%) were diagnosed. Multiple logistic regression analysis revealed significant predictors of PPD. These included having an unhealthy baby (adjusted odds ratio (aOR) 11.685, 95% CI: 1.405–97.139, p = 0.023), having a precious baby (aOR 7.717, 95% CI: 1.822–32.689, p = 0.006), and not receiving support (aOR 9.784, 95% CI: 5.373–17.816, p = 0.001). Being married and feeling comfortable discussing mental health with family relatives were significant protective factors (aOR = 0.141, 95% CI: 0.04–0.494, p = 0.002, and aOR = 0.369, 95% CI: 0.146–0.933, p = 0.035, respectively).

The frequency of PPD among the mothers varied significantly across different countries. PPD has many protective and potential factors. We recommend further research and screenings of PPD for all mothers to promote the well-being of the mothers and create a favorable environment for the newborn and all family members.


Introduction

Postpartum depression (PPD) is among the most prevalent mental health issues [ 1 ]. The onset of depressive episodes after childbirth occurs at a pivotal point in a woman’s life and can last for an extended period of 3 to 6 months; however, this varies based on several factors [ 2 ]. PPD can develop at any time within the first year after childbirth and last for years [ 2 ]. It refers to depressive symptoms that a mother experiences during the postpartum period, which are vastly different from “baby blues,” which many mothers experience within three to five days after the birth of their child [ 3 ].

Depressive episodes are twice as likely to occur during pregnancy compared to other times in a woman’s life, and they frequently go undetected and untreated [ 4 ]. According to estimates, almost 50% of mothers with PPD go undiagnosed [ 4 ]. The Diagnostic and Statistical Manual of Mental Disorders (DSM-5) criteria for PPD include mood instability, loss of interest, feelings of guilt, sleep disturbances and disorders, and changes in appetite [ 5 ], as well as decreased libido, crying spells, anxiety, irritability, feelings of isolation, mental lability, thoughts of hurting oneself and/or the infant, and even suicidal ideation [ 6 ].

Approximately 1 in 10 women will experience PPD after giving birth, with some studies reporting 1 in 7 women [ 7 ]. Globally, the prevalence of PPD is estimated to be 17.22% (95% CI: 16.00–18.05) [ 4 ], with a prevalence of up to 15% in the previous year in eighty different countries or regions [ 1 ]. This estimate is lower than the 19% prevalence rate of PPD found in studies from low- and middle-income countries and higher than the 13% prevalence rate (95% CI: 12.3–13.4%) stated in a different meta-analysis of data from high-income countries [ 8 ].

The occurrence of postpartum depression is influenced by various factors, including social aspects like marital status, education level, lack of social support, violence, and financial difficulties, as well as other factors such as maternal age (particularly among younger women), obstetric stressors, parity, and unplanned pregnancy [ 4 ]. When a mother experiences depression, she may face challenges in forming a satisfying bond with her child, which can negatively affect both her partner and the emotional and cognitive development of infants and adolescents [ 4 ]. As a result, adverse effects may be observed in children during their toddlerhood, preschool years, and beyond [ 9 ].

Around one in seven women can develop PPD [ 7 ]. While women experiencing baby blues tend to recover quickly, PPD tends to last longer and severely affects women’s ability to return to normal function. PPD affects the mother and her relationship with the infant [ 7 ]. The prevalence of postpartum depression varies depending on the assessment method, timing of assessment, and cultural disparities among countries [ 7 ]. To address these aspects, we conducted a cross-sectional study focusing on mothers who gave birth within the previous 18 months. Objectives: to determine the frequency of PPD and explore associated determinants or predictors, including demographic, obstetric, infant-related, and psychosocial factors, and coping strategies from June to August 2023 in six countries.

Study design and participants

This study used an analytical cross-sectional design and involved 674 mothers during the childbearing period (CBP) from six countries, selected based on the authors’ working settings: Egypt, Syria, Yemen, Ghana, India, and Iraq. It was conducted from June to August 2023. It included all mothers who gave birth within the previous 18 months, were citizens of one of the targeted countries, and were older than 18 years and younger than 40 years. Women who visited for a routine postpartum follow-up visit and immunization of their newborns were surveyed.

Exclusion criteria were: multiple pregnancies; illiteracy; anyone deemed unfit to participate according to healthcare authorities; mothers who could not access or use the Internet; mothers who could not read or speak Arabic or English, or who could not use the online platform or smart devices; mothers whose babies were diagnosed with serious health problems, were stillborn, or experienced intrauterine fetal death; and participants with complicated medical, mental, or psychological disorders that interfered with completing the questionnaire. There were no incentives offered to encourage participation.

Sample size and techniques

The sample size was estimated using the equation n = Z²P(1−P)/d². This calculation was based on a 2020 systematic review and meta-analysis reporting a worldwide PPD prevalence of 17% and a worldwide PPD incidence of 12%, with a 5% precision, 80% study power, a 95% confidence level, and an 80% response rate [ 11 ]. The total calculated sample size was 675. The sample was diverse in terms of nationality, with the largest groups being Iraqi (24.9%) and Yemeni (24.3%), followed by Indian (19.1%) and Egyptian (16.3%), for reasons discussed in the limitations section.

The sampling process for recruiting mothers utilized a multistage approach. Two governorates were randomly selected from each country. Moreover, we selected one rural and one urban area from each governorate. Through random selection, participants were chosen for the study. Popular and officially recognized online platforms, including websites and social media platforms such as Facebook, Twitter, WhatsApp groups, and registered emails across various health centers, were utilized for reaching out to participants. Furthermore, a community-based sample was obtained from different public locations, including well-baby clinics, PHCs, and family planning units.

Mothers completed the questionnaire using either tablets or cellphones provided by the data collectors or by scanning the QR code. All questions were mandatory to prevent incomplete forms. Once they provided their informed consent, they received the questionnaire, which they completed and submitted. To enhance the response rate, reminder messages and follow-up communications were employed until the desired sample size was achieved or until the end of August. Data collection ended before 1 September, the start of the meteorological autumn season, because autumn depressive symptoms could confound or otherwise affect the results.

Data collection tool

Questionnaire development and structure.

The questionnaire was developed and adapted based on data obtained from previous studies [ 7 , 8 , 9 , 10 , 11 , 12 ]. Initially, it was created in English and subsequently translated into Arabic. To ensure accuracy, a bilingual panel consisting of two healthcare experts and an externally qualified medical translator translated the English version into Arabic. Additionally, two English-speaking translators performed a back translation, and the original panel was consulted if any concerns arose.

Questionnaire validation

To collect the data, an online, self-administered questionnaire was utilized, designed in Arabic with a well-structured format. We conducted an assessment of the questionnaire’s reliability and validity to ensure a consistent interpretation of the questions. The questionnaire underwent validation by psychiatrists, obstetricians, and gynecologists. Furthermore, in a pilot study involving 20 women of childbearing age, the questionnaire’s clarity and comprehensibility were evaluated. It is important to note that the findings from the pilot study were not included in our main study.

The participants were asked to rate the questionnaire’s organization, clarity, and length, as well as provide a general opinion. Following that, certain questions were revised in light of their input. To check for reliability and reproducibility, the questionnaire was tested again on the same people one week later. We calculated a Cronbach’s alpha of 0.76 for the questionnaire.

The structure of the questionnaire

After participants gave their permission to take part in the study, the questionnaire presented the following sections:

Study information and electronic solicitation of informed consent.

Demographic and health-related factors: age, gender, place of residence, educational level, occupation, marital status, weight, height, and the fees of access to healthcare services.

Obstetric history: number of pregnancies, gravida, history of abortions, number of live children, history of dead children, inter-pregnancy space (years), current pregnancy status, type of the last delivery, weight gain during pregnancy (kg), baby age (months), premature labor, healthy baby, baby admitted to the NICU, feeding difficulties, pregnancy problems, postnatal problems, natal problems, and the nature of baby feeding.

Assessment of postpartum depression (PPD) levels using the Edinburgh 10-question scale: This scale is a simple and effective screening tool for identifying individuals at risk of perinatal depression. The EPDS (Edinburgh Postnatal Depression Scale) is a valuable instrument that helps identify the likelihood of a mother experiencing depressive symptoms of varying severity. A score exceeding 13 indicates an increased probability of a depressive illness. However, clinical discretion should not be disregarded when interpreting the EPDS score. This scale captures the mother’s feelings over the past week, and in cases of uncertainty, it may be beneficial to repeat the assessment after two weeks. It is important to note that this scale is not capable of identifying mothers with anxiety disorders, phobias, or personality disorders.

For Questions 1, 2, and 4 (without asterisks): Scores range from 0 to 3, with the top box assigned a score of 0 and the bottom box assigned a score of 3. For Questions 3 and 5–10 (with asterisks): Scores are reversed, with the top box assigned a score of 3 and the bottom box assigned a score of 0. The maximum score achievable is 30, and a probability of depression is considered when the score is 10 or higher. It is important to always consider item 10, which pertains to suicidal ideation [ 12 ].
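
As a plain-code restatement of this scoring rule (not an official EPDS implementation), the function below scores items 1, 2, and 4 directly and reverse-scores items 3 and 5–10, then applies the ≥10 screening threshold mentioned above; note that a stricter cutoff of >13 is also used in practice.

```python
# A sketch of the EPDS scoring rule described above. 'responses' holds the
# selected answer position (0 = top box, 3 = bottom box) for items 1-10.
def epds_score(responses: list[int]) -> int:
    """Total EPDS score (0-30). Items 3 and 5-10 are reverse-scored."""
    reverse_items = {3, 5, 6, 7, 8, 9, 10}
    total = 0
    for item, pos in enumerate(responses, start=1):
        total += (3 - pos) if item in reverse_items else pos
    return total

answers = [1, 0, 2, 1, 3, 2, 2, 1, 2, 0]   # illustrative answer positions
score = epds_score(answers)
print(score, "possible depression" if score >= 10 else "below cutoff")
# Item 10 (suicidal ideation) should always be reviewed regardless of total.
```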

Psychological and social characteristics: received support or treatment for PPD, awareness of symptoms and risk factors, experienced cultural stigma or judgment about PPD in the community, suffering from any disease or mental or psychiatric disorder, ever been diagnosed with PPD, problems with the husband, and financial problems.

Coping strategies and causes for not receiving treatment and reactions to PPD, in descending order: social norms, cultural or traditional beliefs, personal barriers, geographical or regional disparities in mental health resources, language or communication barriers, and financial constraints.

Statistical analysis

The collected data was computerized and statistically analyzed using the SPSS program (Statistical Package for Social Science), version 27. The data was tested for normal distribution using the Shapiro-Wilk test. Qualitative data was represented as frequencies and relative percentages. Quantitative data was expressed as mean ± SD (standard deviation) if it was normally distributed; otherwise, median and interquartile range (IQR) were used. The Mann-Whitney test (MW) was used to calculate the difference between quantitative variables in two groups for non-parametric variables. Correlation analysis (using Spearman’s method) was used to assess the relationship between two nonparametric quantitative variables. All results were considered statistically significant when the significant probability was < 0.05. The chi-square test (χ²) and Fisher’s exact test were used to calculate the difference between qualitative variables.

The frequency of PPD among mothers (Fig.  1 )

[Figure 1: The frequency of PPD among the studied mothers]

The frequency of PPD in the total sample, using the Edinburgh 10-question scale, was 92 (13.6%) (Table S1). It varied significantly (p = 0.001) across countries, being highest among mothers from Ghana (13/50, 26.0%) and India (28/129, 21.7%), followed by Egypt (21/110, 19.1%), Yemen (14/164, 8.5%), Iraq (13/168, 7.7%), and Syria (1/43, 2.3%), in descending order. Nationality was significantly associated with PPD (p = 0.001).

Demographic and health-related characteristics and their association with PPD (Table  1 )

The study included 674 participants. The median age was 27 years, with 407 (60.3%) of participants falling in the >25 to 40-year-old age group. The majority of participants were married (650, 96.4%), had a sufficient monthly income (449, 66.6%), had at least a preparatory or high school level of education (498, 73.9%), and lived in urban areas. Regarding health-related factors, 270 (40.1%) smoked, 365 (54.2%) had received the COVID-19 vaccine, and 297 (44.1%) had contracted COVID-19. Moreover, 557 (82.6%) had no comorbidities, 623 (92.4%) had no psychiatric illness or family history of one, and 494 (73.3%) paid for their own healthcare services.

PPD was significantly (p < 0.05) more frequent among single or widowed women (9, 56.3%), mothers who had both medical and mental or psychological problems (2, 66.7%), ex-cigarette smokers (5, 35.7%; p = 0.033), mothers who consumed alcohol (p = 0.022), and mothers who paid for their own healthcare services (59, 11.9%).

Obstetric, current pregnancy, and infant-related characteristics and their association with PPD (Table  2 )

The majority of the studied mothers were on no hormonal treatment or contraceptive pills, 411 (60.9%); had a current pregnancy that was unplanned and wanted, 311 (46.1%); gained ≥10 kg, 463 (68.6%); delivered vaginally, 412 (61.1%); and had a healthy baby, 613 (90.9%). Only 325 (48.2%) were breastfeeding.

A significant (p < 0.05) association was observed with PPD, which was higher among mothers on contraceptive methods; those who had 1–2 live births (76.1%); those with an interpregnancy interval of less than 2 years, 86 (93.5%); those with a history of child death; and those who had postnatal problems (27.2%).

The psychosocial characteristics and their association with PPD (Table  3 )

Regarding the psychological and social characteristics of the mothers, the majority of mothers were unaware of the symptoms of PPD (75%), and only 236 (35.3%) experienced cultural stigma or judgment about PPD in the community. About 41 (6.1%) were diagnosed with PPD during the previous pregnancy, and only 42 (6.2%) were diagnosed and on medications.

Mothers with PPD were significantly (p < 0.001) more likely to have a history of PPD or a current diagnosis of it, as well as financial and marital problems; they were also more likely to have experienced cultural stigma or judgment about PPD and to have received more support.

Coping strategies and causes for not receiving the treatment and reaction to PPD (Table  3 ; Fig.  2 )

Figure 2. Causes for not receiving the treatment and reactions to PPD.

Many of the studied mothers did not feel comfortable discussing mental health: 292 (43.3%) with a physician, 307 (45.5%) with their husband, 326 (48.4%) with family, and 472 (70.0%) with the community. However, mothers with PPD felt significantly more comfortable discussing mental health, in descending order: 46 (50.0%) with a physician, 41 (44.6%) with a husband, and 39 (42.3%) with family (Table 3).

There were different causes for not receiving the treatment and reactions to PPD, in descending order: 65.7% social norms, 60.5% cultural or traditional beliefs, 56.5% personal barriers, 48.5% geographical or regional disparities in mental health resources, 47.4% language or communication barriers, and 39.7% financial constraints.

Prediction of PPD (significant demographic, obstetric, current pregnancy and infant-related, and psychosocial factors) and coping strategies, derived from multiple logistic regression analysis (Table 4)

Significant demographic predictors of PPD

Marital Status (Married or Single): The adjusted odds ratio (aOR) among PPD mothers who were married in comparison to their single counterparts was 0.141 (95% CI: 0.04–0.494; p -value = 0.002).

Nationality: For PPD mothers of Yemeni nationality compared to those of Egyptian nationality, the aOR was 0.318 (95% CI: 0.123–0.821, p = 0.018). Similarly, for Syrian nationality in comparison to Egyptian nationality, the aOR was 0.111 (95% CI: 0.0139–0.887, p = 0.038), and for Iraqi nationality compared to Egyptian nationality, the aOR was 0.241 (95% CI: 0.0920–0.633, p = 0.004).

Significant obstetric, current pregnancy, and infant-related characteristics predictors of PPD

Current Pregnancy Status (Precious Baby—Planned): The aOR for the occurrence of PPD among women with a “precious baby” relative to those with a “planned” pregnancy was 7.717 (95% CI: 1.822–32.689, p  = 0.006).

Healthy Baby (No-Yes): The aOR for the occurrence of PPD among women with unhealthy babies in comparison to those with healthy ones is 11.685 (95% CI: 1.405–97.139, p  = 0.023).

Postnatal Problems (No–Yes): The aOR among PPD mothers reporting postnatal problems relative to those not reporting such problems was 0.234 (95% CI: 0.0785–0.696, p  = 0.009).

Significant psychological and social predictors of PPD

Receiving support or treatment for PPD (No-Yes): The aOR among PPD mothers who were not receiving support or treatment relative to those receiving support or treatment was 9.784 (95% CI: 5.373–17.816, p  = 0.001).

Awareness of symptoms and risk factors (No-Yes): The aOR among PPD mothers who lack awareness of symptoms and risk factors relative to those with awareness was 2.902 (95% CI: 1.633–5.154, p  = 0.001).

Experienced cultural stigma or judgement about PPD in the community (No-Yes): The aOR among PPD mothers who had experienced cultural stigma or judgment in the community relative to those who have not was 4.406 (95% CI: 2.394–8.110, p  < 0.001).

Suffering from any disease or mental or psychiatric disorder: For “Now I am suffering—not at all,” the aOR among PPD mothers was 12.871 (95% CI: 3.063–54.073, p  = 0.001). Similarly, for “Had a past history but was treated—not at all,” the adjusted odds ratio was 16.6 (95% CI: 2.528–108.965, p  = 0.003), and for “Had a family history—not at all,” the adjusted odds ratio was 3.551 (95% CI: 1.012–12.453, p  = 0.048).

Significant coping predictors of PPD: comfort discussing mental health with family (Maybe–Yes)

The aOR among PPD mothers who answered "maybe" to being comfortable discussing mental health with family, relative to those who answered "yes", was 0.369 (95% CI: 0.146–0.933, p = 0.035).
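For readers who want to reproduce this kind of table outside SPSS, the sketch below shows how adjusted odds ratios and their 95% CIs fall out of a fitted multiple logistic regression, namely by exponentiating the coefficients and the confidence-interval bounds. The data are simulated and the column names are invented; this is not the study's dataset or code.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Simulated stand-in for the study data; all columns are illustrative.
rng = np.random.default_rng(1)
n = 674
df = pd.DataFrame({
    "ppd":            rng.integers(0, 2, n),   # outcome: PPD yes/no
    "married":        rng.integers(0, 2, n),
    "unhealthy_baby": rng.integers(0, 2, n),
    "no_support":     rng.integers(0, 2, n),
    "stigma":         rng.integers(0, 2, n),
})

X = sm.add_constant(df[["married", "unhealthy_baby", "no_support", "stigma"]])
fit = sm.Logit(df["ppd"], X).fit(disp=0)

# aOR = exp(coefficient); the CI bounds are exponentiated the same way.
ci = fit.conf_int()
table4 = pd.DataFrame({
    "aOR":     np.exp(fit.params),
    "CI_low":  np.exp(ci[0]),
    "CI_high": np.exp(ci[1]),
    "p":       fit.pvalues,
})
print(table4.round(3))
```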

Discussion

PPD is a debilitating mental disorder with many potential risk and protective factors that should be considered to promote the mental and psychological well-being of mothers and to create a favorable environment for the newborn and all family members. This multinational cross-sectional survey was conducted in six different countries to determine the frequency of PPD using the EPDS and to explore its predictors. It found that PPD was a prevalent problem whose frequency varied across nations.

The frequency of PPD across the studied countries

Using the widely used EPDS to determine current PPD, we found that the overall frequency of PPD in the total sample was 92 (13.6%), which varied significantly (p = 0.001) across countries, being highest among Ghanaian mothers, 13 (26.0%) out of 50, and Indian mothers, 28 (21.7%) out of 129, followed by Egyptian, 21 (19.1%) out of 110; Yemeni, 14 (8.5%) out of 164; Iraqi, 13 (7.7%) out of 169; and Syrian, 1 (2.3%) out of 43, in descending order. This prevalence is similar to that reported by Hairol et al. (2021) in Malaysia (14.3%) [13], Yusuff et al. in Malaysia (14.3%) [14], and Nakku et al. (2006) in New Delhi (12.75%) [15].

The frequency of PPD varies greatly with the timing and setting of assessment and with the psychosocial circumstances of the postpartum period. For example, our estimate was higher than those reported in Italy (2012), 4.7% [16]; Turkey (2017), 9.1% [17]; Sudan, 9.2% [18]; Eritrea (2020), 7.4% [19]; Kuala Lumpur (2001), 3.9% [20]; and Malaysia (2002), 9.8% [21]; and within the 13–19% range reported in European countries (2021) [22].

Our frequency was, however, lower than several other reports; PPD is a predominant problem in Asia. In Pakistan, prevalence in the three-month period after childbirth ranged from 28.8% in 2003 to 36% in 2006 and 94% in 2007, while at 12 months after childbirth it was 62% in 2021 [23, 24]. In Afghanistan (2022), it was 45% after first labour [25]; in Canada (2015), 40% [26]; in India, a 2022 systematic review reported 22% among primiparas [27]; in Malaysia (2006), 22.8% [28]; in India (2019), 21.5% [29]; in the Tigray zone of Ethiopia (2017), 19% [30]; in Iran, between 20.3% and 35% [31, 32]; and in China, 499 (27.37%) out of 1823 [33]. A possible explanation might be differences in study setting and design. Other differences should also be considered, such as populations with different socioeconomic characteristics and variation in the timing of postpartum follow-up. It is vital to consider the role of culture, the impact of patients' beliefs, and the cultural support for receiving help for PPD.

Demographic and health-related associations, or predictors of PPD (Tables  1 and 4 )

Regarding age, our study found no significant difference between PPD and non-PPD mothers, in agreement with other reports [12, 34, 35]. By contrast, other studies [36, 37, 38] found an inverse association between women's age and PPD, with a significantly increased risk of PPD (higher EPDS scores) at a younger age: teenage mothers, often primiparous, encounter difficulty during the postpartum period because of their inability to cope with financial and emotional difficulties and with the challenge of motherhood. Cultural factors and social perspectives on young mothers in different countries could explain this difference [38, 39]. Abdollahi et al. [36] reported that older maternal age was a protective factor against PPD (OR = 0.88, 95% CI: 0.84–0.92).

Regarding marital status, after controlling for other variables, married mothers exhibited a significantly diminished likelihood of experiencing PPD in comparison to single women (0.141; 95% CI: 0.04–0.494; p  = 0.002). Also, Gebregziabher et al. [ 19 ] reported that there were statistically significant differences in proportions between mothers’ PPD and marital status.

Regarding the mother's education, Ahmed et al. [34], in agreement with our study, showed no statistically significant difference between PPD and mother's education, while Agarwala et al. [29] showed that a higher level of mother's education increases the risk of PPD. Gebregziabher et al. [19] showed that housewives were 0.24 times as likely to develop PPD as employed mothers (aOR = 0.24, 95% CI: 0.06–0.97; p = 0.046), and that mothers who perceived their socioeconomic status (SES) as low were 13 times more likely to develop PPD than mothers with good SES (aOR = 13.33, 95% CI: 2.66–66.78; p = 0.002).

Regarding SES or monthly income, other studies [18, 40] found a statistically significant association between PPD and different domains of SES: 34% of depressed women were found to live under low-SES conditions, compared with only 15.4% of those living in high-SES conditions who experienced PPD. In disagreement with our study, Hairol et al. [12] demonstrated that the incidence of PPD was significantly higher (p = 0.01) among participants from the low-income group (27.27%), who were 2.58 times more likely to have PPD symptoms (OR: 2.58, 95% CI: 1.23–5.19) than those from the middle- and high-income groups (8.33%); low household income (OR = 3.57, 95% CI: 1.49–8.5) also increased the odds of PPD [41].

Adeyemo et al. (2020) and Al Nasr et al. (2020) revealed no significant difference between the occurrence of PPD and socio-demographic characteristics; this divergence may be due to different sample sizes and ethnicities [42, 43]. In agreement with our findings, Abdollahi et al. [36] demonstrated that, after multiple logistic regression analyses, the odds of PPD increased with a lower state of general health (OR = 1.08, 95% CI: 1.06–1.11), gestational diabetes (OR = 2.93, 95% CI: 1.46–5.88), and low household income (OR = 3.57, 95% CI: 1.49–8.5).

Regarding access to health care, in agreement with studies conducted at Gondar University Hospital, Ethiopia [18], in North Carolina and Colorado [21], in Khartoum, Sudan [44], and by Asaye et al. [45], the current study found that participants who did not have free access to the healthcare system were at higher risk of developing PPD. The results may be influenced by the care given during antenatal care (ANC) visits: PPD was four times higher among mothers who did not receive ANC, where counseling and anticipatory guidance build maternal self-esteem and resiliency, along with knowledge about normal and problematic complications to discuss at care visits and about mothers' right to mental and physical wellness, including access to care. Increased access to care (including postpartum visits) increases the diagnosis of PPD and provides guidance, reassurance, and appropriate referrals. Healthcare professionals have the ability to both educate and empower mothers as they care for their babies, their families, and themselves [46].

Regarding nationality, for PPD mothers of Yemeni nationality compared to those of Egyptian nationality, the aOR was 0.318 (95% CI: 0.123–0.821, p = 0.018). Similarly, for Syrian nationality in comparison to Egyptian nationality, the aOR was 0.111 (95% CI: 0.0139–0.887, p = 0.038), and for Iraqi nationality compared to Egyptian nationality, the aOR was 0.241 (95% CI: 0.0920–0.633, p = 0.004). These findings indicate that, accounting for other covariates, individuals from these nationalities were less predisposed to experiencing PPD than their Egyptian counterparts. This can be explained by the fact that, in Egypt, a younger age of marriage (especially in rural areas), poor mental health services, illiteracy, early school dropout, unemployment, and the stigma of psychiatric illness are cultural factors that hinder the diagnosis and treatment of PPD [40].

Obstetric, current pregnancy, and infant-related characteristics and their association or predictors of PPD (Tables  2 and 4 )

In the present study, the number of dead children was significantly associated with PPD. This finding is supported by studies conducted with Gujarati postpartum women [41] and in rural southern Ethiopia [43]. This might be because mothers who have lost children face distinct psychosocial problems and may fear complications developing during their pregnancy. Agarwala et al. [29] found that a history of previous abortions and having more than two children increased the risk of developing PPD owing to a greater psychological burden. The inconsistencies in the findings of these studies indicate that the occurrence of postpartum depression is not solely determined by the number of childbirths.

Among obstetric and current-pregnancy factors, there was no significant difference regarding the baby's age, number of miscarriages, type of last delivery, premature labour, healthy baby, baby admission to the neonatal intensive care unit (NICU), or feeding difficulties. This agrees with Al Nasr et al. [42] but is inconsistent with Asaye et al. [45], whose multivariable logistic regression analysis showed that abortion history, birth weight, and gestational age were significantly associated with postpartum depression (p < 0.05).

However, only a non-significant trend was noted between the mode of delivery and the presence of PPD (p = 0.107), and a high tendency towards depression was seen in mothers who had delivered more than three times (44%). This disagrees with Adeyemo et al. [41], who reported that having more than five children (p = 0.027), cesarean section delivery (p = 0.002), and a poor maternal state of health since delivery (p < 0.001) were associated with an increased risk of PPD [47]. An increased risk with cesarean delivery was also observed (OR = 1.958, p = 0.049) in the study by Al Nasr et al. [42].

We found that breastfeeding mothers had a lower, non-significant frequency of PPD compared with non-breastfeeding mothers (36.6% vs. 45%). In agreement, Ahmed et al. [34] showed that about 67.3% of women who breastfed reported no PPD, while only 32.7% had possible PPD. This is inconsistent with Adeyemo et al. [41], who reported that non-exclusive breastfeeding was associated with PPD (p = 0.003), and with Shao et al. [40], who reported that mothers who exclusively formula-fed had a higher prevalence of PPD.

Regarding postnatal problems, our results revealed that postnatal problems display a significant association with PPD. In line with our results, Agarwala et al. [ 29 ] and Gebregziabher et al. [ 19 ] showed that mothers who experienced complications during childbirth, those who became ill after delivery, and those whose babies were unhealthy had a statistically significant higher proportion of PPD.

Hormone-related contraception methods were found to have a statistically significant association with PPD, consistent with the literature [46]; this can be explained by the role of hormones and neurotransmitters as biological factors in the onset of PPD. Estrogen acts as a regulator of transcription for brain neurotransmitters and modulates the action of serotonin receptors. It stimulates neurogenesis, the process of generating new neurons in the brain, and promotes the synthesis of neurotransmitters. In the hypothalamus, estrogen modulates neurotransmitters and governs sleep and temperature regulation. Variations in the level of this hormone, or its absence, are linked to depression [19].

Participants whose last pregnancy was unplanned were 3.39 times more likely to have postpartum depression (aOR = 3.39, 95% CI: 1.24–9.28; p = 0.017), and mothers who experienced illness after delivery were more likely to develop PPD than their counterparts (aOR = 7.42, 95% CI: 1.44–34.2; p = 0.016) [40]. In agreement, Asaye et al. [45] and Abdollahi et al. [36] found that unplanned pregnancy was associated with the development of PPD (aOR = 2.02, 95% CI: 1.24–3.31 and OR = 2.5, 95% CI: 1.69–3.7, respectively) compared with planned pregnancy.

The psychosocial characteristics and their association with PPD

Mothers with a family history of mental illness were significantly more likely to have PPD. This finding accords with studies conducted in Istanbul, Turkey [47], and Bahrain [48]. Other studies also showed that women with PPD were most likely to have had psychological symptoms during pregnancy [43, 44, 45, 46, 47, 48, 49]. A meta-analysis of 24,000 mothers concluded that depression and anxiety during pregnancy and a previous history of psychiatric illness or depression are strong risk factors for developing PPD [50, 51, 52]. Asaye et al. [45] found that mothers whose relatives had a history of mental illness were more likely (aOR = 1.20, 95% CI: 1.09–3.05) to be depressed than those whose relatives did not.

This can be attributed to the links between genetic predisposition and mood disorders, considering that both nature and nurture are important in addressing PPD. PPD may be seen as a "normal" condition by those who are acquainted with relatives with mood disorders, especially during the childbearing period. A family history of mental illness can easily be elicited in the ANC first-visit history and requires special attention during the postnatal period. There are various other risk factors for PPD, including stressful life events, low social support, the infant's gender preference, and low income [53].

Concerning familial support and possible PPD, a statistically significant association was found: mothers who did not have social support (a partner or the father of the baby) had higher odds of experiencing PPD (aOR = 5.8, 95% CI: 1.33–25.29; p = 0.019). Furthermore, Al Nasr et al. [42] revealed a significant association between PPD and an unsupportive spouse (p = 0.023), while it was noted that 66.5% of women who received good familial support after giving birth had no depression, compared with 33.5% who suffered from possible PPD [40]. Also, Adeyemo et al. [41] showed that several psychosocial factors were significantly associated with PPD: having an unsupportive partner (p < 0.001), experiencing intimate partner violence (p < 0.001), and not getting help in taking care of the baby (p < 0.001). Al Nasr et al. (2020) likewise identified an unsupportive spouse as a predictor of PPD (OR = 4.53, p = 0.049) [48].

Regarding the perceived stigma, in agreement with our study, Bina (2020) found that shame, stigma, the fear of being labeled mentally ill, and language and communication barriers were significant factors in women’s decisions to seek treatment or accept help [ 53 ]. Other mothers were hesitant about mental health services [ 54 ]. It is noteworthy that some PPD mothers refused to seek treatment due to perceived insufficient time and the inconvenience of attending appointments [ 55 ].

PPD was significantly higher among mothers with financial problems or problems with their husbands. This agrees with Ahmed et al. [34], who showed a statistically significant association, with a higher percentage of PPD among mothers who had a history of stressful conditions (59.3%) compared with those with no such history (40.7%). Furthermore, Al Nasr et al. (2020) revealed that stressful life events contributed significantly (p = 0.003) to the development of PPD in their sample population (OR = 2.677, p = 0.005) [42].

Coping strategies: causes of fear and of not seeking treatment

Feeling at ease discussing mental health topics with one's husband, family, community, and physician, and experiencing cultural stigma or judgment regarding PPD within the community, were significantly associated with the presence of PPD. In the current study, there were different reasons for not receiving treatment, including cultural or traditional beliefs, language or communication barriers, social norms, and geographical or regional disparities in mental health resources. Haque and Malebranche [56] portrayed culture and the various conceptualizations of the maternal role as barriers to women seeking help and treatment.

In the present study, marital status, nationality, current pregnancy status, healthy baby, postnatal problems, receiving support or treatment for PPD, having awareness of symptoms and risk factors of PPD, suffering from any disease or mental or psychiatric disorder, comfort discussing mental health with family, and experiencing cultural stigma or judgment about PPD in the community were the significant predictors of PPD. In agreement with Ahmed et al. [ 34 ], the final logistic regression model contained seven predictors for PPD symptoms: SES, history of depression, history of PPD, history of stressful conditions, familial support, unwanted pregnancy, and male preference.

PPD has been recognized as a public health problem and may cause negative consequences for infants. It is estimated that 20 to 40% of women living in low-income countries experience depression during pregnancy or the postpartum period. The prevalence of PPD shows a wide variation, affecting 8–50% of postnatal mothers across countries [ 19 ].

Strengths and limitations

Strengths of our study include its multinational scope, which involved participants from six different countries, enhancing the generalizability of the findings. The study also boasted a large sample size of 674 participants, increasing the statistical power and reliability of the results. Standardized measures, such as the Edinburgh Postnatal Depression Scale (EPDS), were used for assessing postpartum depression, ensuring consistency and comparability across diverse settings. Additionally, the study explored a comprehensive range of predictors and associated factors of postpartum depression, including demographic, obstetric, health-related, and psychosocial characteristics. Rigorous analysis techniques, including multiple logistic regression analyses, were employed to identify significant predictors of postpartum depression, controlling for potential confounders and providing robust statistical evidence.

However, the study has several limitations that should be considered. Firstly, its cross-sectional design limits causal inference, as it does not allow for the determination of temporal relationships between variables. Secondly, the reliance on self-reported data, including information on postpartum depression symptoms and associated factors, may be subject to recall bias and social desirability bias. Thirdly, the use of convenience sampling methods may introduce selection bias and limit the generalizability of the findings to a broader population. Lastly, cultural differences in the perception and reporting of postpartum depression symptoms among participants from different countries could influence the results.

Moreover, the variation in sample size and response rates among countries can be attributed to several factors. (1) The sample size was determined by considering several parameters, such as allocating proportionately to the mothers who gave birth and fulfilled the selection criteria during the data collection period served by each health center. (2) The political turmoil in Syria affects how often and how well people can use the Internet; because the data were gathered using an online survey link, this led to a relatively low number of responses from those areas. (3) A language barrier exists in Ghana: we used the Arabic- and English-validated versions of the EPDS, but Ghana is a multilingual country with approximately eighty languages spoken. Although English is the official language, the languages primarily spoken in the southern region are Akan, specifically the Akuapem Twi, Asante Twi, and Fante dialects, while the Mole-Dagbani ethnic languages, Dagaare and Dagbanli, predominate in the northern region. Moreover, there are around seventy ethnic groups, each with its own unique language [57]. (4) Data collection ended as the meteorological autumn season began on 1 September, since autumn may be associated with depressive symptoms that could confound our results. Furthermore, the sampling methods were not universal across all Arabic countries, potentially constraining the generalizability of our findings.

Recommendations

The antenatal programme should incorporate health education about the symptoms of PPD.

Mass media awareness campaigns have a vital role in raising public awareness about PPD-related issues.

The ANC first-visit history should elicit a family history of mental illness, enabling early detection of at-risk mothers.

Effective support (from husband, friends, and family) is an essential component of PPD management.

Maternal (antenatal, natal, and postnatal) services should be provided free of charge and be of high quality.

It should be stressed that although numerous studies have been carried out on PPD, further investigation needs to be conducted on the global prevalence and incidence of depressive symptoms in pregnant women and related risk factors, especially in other populations.

Conclusion

Around 14% of the studied mothers had PPD, the frequency varied across countries, and about half of the mothers were unaware of their condition. Our study identified significant associations and predictors of postpartum depression (PPD) among mothers. Marital status was significantly associated with PPD, with married mothers having lower odds of experiencing PPD than single mothers. Nationality also emerged as a significant predictor, with Yemeni, Syrian, and Iraqi mothers showing lower odds of PPD than Egyptian mothers. Significant obstetric, current-pregnancy, and infant-related predictors included pregnancy status, the health status of the baby, and the presence of postnatal problems. Among psychological and social predictors, receiving support or treatment for PPD, awareness of symptoms and risk factors, experiencing cultural stigma or judgment about PPD, and suffering from any disease or mental disorder were significantly associated with PPD. Additionally, mothers who answered "maybe" to being comfortable discussing mental health with family had lower odds of experiencing PPD.

These findings underscore the importance of considering various demographic, obstetric, psychosocial, and coping factors in the identification and management of PPD among mothers. Targeted interventions addressing these predictors could potentially mitigate the risk of PPD and improve maternal mental health outcomes.

Data availability

The data are available upon request from the corresponding author ([email protected]).

Abbreviations

aOR: Adjusted odds ratio

PPD: Postpartum depression

PHC: Primary health care centers

SES: Socioeconomic status

SPSS: Statistical Package for the Social Sciences

EPDS: The Edinburgh Postnatal Depression Scale

NICU: Neonatal intensive care unit

Sultan P, Ando K, Elkhateb R, George RB, Lim G, Carvalho B, et al. Assessment of patient-reported outcome measures for maternal postpartum depression using the Consensus-Based Standards for the Selection of Health Measurement Instruments guideline: a systematic review. JAMA Network Open. 2022;5(6).

Crotty F, Sheehan J. Prevalence and detection of postnatal depression in an Irish community sample. Ir J Psychol Med. 2004;21:117–21.


Goodman SH, Brand SR. Parental psychopathology and its relation to child psychopathology. In: Hersen M, Gross AM, editors. Handbook of clinical psychology vol 2: children and adolescents. Hoboken, NJ: Wiley; 2008. pp. 937–65.


Wang Z, Liu J, Shuai H, Cai Z, Fu X, Liu Y, Xiao, et al. Mapping global prevalence of depression among postpartum women. Transl Psychiatry. 2021;11(1):543. https://doi.org/10.1038/s41398-021-01663-6. Erratum in: Transl Psychiatry. 2021;11(1):640. PMID: 34671011; PMCID: PMC8528847. Last accessed Jan 2024.

American Psychiatric Association. Diagnostic and statistical manual of mental disorders. 5th ed. Arlington, VA: American Psychiatric Association; 2013. Last accessed October 2023.

Robertson E, Grace S, Wallington T, Stewart DE. Antenatal risk factors for postpartum depression: a synthesis of recent literature. Gen Hosp Psychiatry. 2004;26:289–95.

Gaynes BN, Gavin N, Meltzer-Brody S, Lohr KN, Swinson T, Gartlehner G, Brody S, Miller WC. Perinatal depression: prevalence, screening accuracy, and screening outcomes: Summary. AHRQ evidence report summaries; 2005. pp. 71–9.

O'Hara MW, Swain AM. Rates and risk of postpartum depression: a meta-analysis. Int Rev Psychiatry. 1996;8:37–54.

Goodman SH, Brand SR. (2008). Parental psychopathology and its relation to child psychopathology. In: Hersen M, Gross AM, editors. Handbook of clinical psychology Vol 2: Children and adolescents. Hoboken, NJ: John Wiley & Sons; 2008. pp. 937–65.

Cox JL, Holden JM, Sagovsky R. Detection of postnatal depression: development of the 10-item Edinburgh postnatal depression scale. Br J Psychiatry. 1987;150:782–6.


Martín-Gómez C, Moreno-Peral P, Bellón J, Conejo-Cerón S, Campos-Paino H, Gómez-Gómez I, Rigabert A, Benítez I, Motrico E. Effectiveness of psychological, psychoeducational and psychosocial interventions to prevent postpartum depression in adolescent and adult mothers: study protocol for a systematic review and meta-analysis of randomised controlled trials. BMJ Open. 2020;10(5):e034424. Accessed Mar 16 2024.


Sehairi Z. (2020). Validation Of The Arabic Version Of The Edinburgh Postnatal Depression Scale And Prevalence Of Postnatal Depression On An Algerian Sample. https://api.semanticscholar.org/CorpusID:216391386 . Accessed August 2023.

Hairol MI, Ahmad SA, Sharanjeet-Kaur S et al. (2021). Incidence and predictors of postpartum depression among postpartum mothers in Kuala Lumpur, Malaysia: a cross-sectional study. PLoS ONE, 16(11), e0259782.

Yusuff AS, Tang L, Binns CW, Lee AH. Prevalence and risk factors for postnatal depression in Sabah, Malaysia: a cohort study. Women Birth. 2015;28(1):25–9. PMID: 25466643.


Nakku JE, Nakasi G, Mirembe F. Postpartum major depression at six weeks in primary health care: prevalence and associated factors. Afr Health Sci. 2006;6(4):207–14. https://doi.org/10.5555/afhs.2006.6.4.207. PMID: 17604509; PMCID: PMC1832062.

Clavenna A, Seletti E, Cartabia M, Didoni A, Fortinguerra F, Sciascia T, et al. Postnatal depression screening in a paediatric primary care setting in Italy. BMC Psychiatry. 2017;17(1):42. pmid:28122520

Serhan N, Ege E, Ayrancı U, Kosgeroglu N. Prevalence of postpartum depression in mothers and fathers and its correlates. J Clin Nurs. 2013;22(1–2):279–84. PMID: 23216556.

Deribachew H, Berhe D, Zaid T, et al. Assessment of prevalence and associated factors of postpartum depression among postpartum mothers in eastern zone of Tigray. Eur J Pharm Med Res. 2016;3(10):54–60.

Gebregziabher NK, Netsereab TB, Fessaha YG, et al. Prevalence and associated factors of postpartum depression among postpartum mothers in central region, Eritrea: a health facility based survey. BMC Public Health. 2020;20:1–10.

Grace J, Lee K, Ballard C, et al. The relationship between post-natal depression, somatization and behaviour in Malaysian women. Transcult Psychiatry. 2001;38(1):27–34.

Mahmud WMRW, Shariff S, Yaacob MJ. Postpartum depression: a survey of the incidence and associated risk factors among Malay women in Beris Kubor Besar, Bachok, Kelantan. Malays J Med Sci. 2002;9(1):41.

Anna S. Postpartum depression and birth experience in Russia. Psychol Russia: State of the Art. 2021;14(1):28–38.

Yadav T, Shams R, Khan AF, Azam H, Anwar M, et al. Postpartum depression: prevalence and associated risk factors among women in Sindh, Pakistan. Cureus. 2020;12(12):e12216. https://doi.org/10.7759/cureus.12216. PMID: 33489623; PMCID: PMC7815271.

Abdullah M, Ijaz S, Asad S. (2024). Postpartum depression-an exploratory mixed method study for developing an indigenous tool. BMC Pregnancy Childbirth 24, 49 (2024). https://doi.org/10.1186/s12884-023-06192-2 .

Upadhyay RP, Chowdhury R, Salehi A, Sarkar K, Singh SK, Sinha B, et al. Postpartum depression in India: a systematic review and meta-analysis. Bull World Health Organ. 2017;95(10):706. https://doi.org/10.2471/BLT.17.192237.

Khalifa DS, Glavin K, Bjertness E, et al. Determinants of postnatal depression in Sudanese women at 3 months postpartum: a cross-sectional study. BMJ Open. 2016;6(3):e009443.

Sharifzade K, Padhi BK, Manna S, et al. Prevalence and associated factors of postpartum depression among Afghan women: a phase-wise cross-sectional study in Rezaie maternal hospital in Herat province. Razi Int Med J. 2022;2(2):59. https://doi.org/10.56101/rimj.v2i2.59.

Azidah A, Shaiful B, Rusli N, et al. Postnatal depression and socio-cultural practices among postnatal mothers in Kota Bahru, Kelantan, Malaysia. Med J Malay. 2006;61(1):76–83.


Agarwala A, Rao PA, Narayanan P. Prevalence and predictors of postpartum depression among mothers in the rural areas of Udupi Taluk, Karnataka, India: a cross-sectional study. Clin Epidemiol Global Health. 2019;7(3):342–5.

Arikan I, Korkut Y, Demir BK et al. (2017). The prevalence of postpartum depression and associated factors: a hospital-based descriptive study.

Azimi-Lolaty HMD, Hosaini SH, Khalilian A, et al. Prevalence and predictors of postpartum depression among pregnant women referred to mother-child health care clinics (MCH). Res J Biol Sci. 2007;2:285–90.

Najafi KFA, Nazifi F, Sabrkonandeh S. Prevalence of postpartum depression in Alzahra Hospital in Rasht in 2004. Guilan Univ Med Sci J. 2006;15:97–105. (In Persian.).

Deng AW, Xiong RB, Jiang TT, Luo YP, Chen WZ. Prevalence and risk factors of postpartum depression in a population-based sample of women in Tangxia Community, Guangzhou. Asian Pac J Trop Med. 2014;7(3):244–9. PMID: 24507649.

Ahmed GK, Elbeh K, Shams RM, et al. Prevalence and predictors of postpartum depression in Upper Egypt: a multicenter primary health care study. J Affect Disord. 2021;290:211–8.

Cantilino A, Zambaldi CF, Albuquerque T, et al. Postpartum depression in Recife–Brazil: prevalence and association with bio-socio-demographic factors. J Bras Psiquiatr. 2010;59:1–9.

Abdollahi F, Zarghami M, Azhar MZ, et al. Predictors and incidence of post-partum depression: a longitudinal cohort study. J Obstet Gynecol Res. 2014;40(12):2191–200.

McCoy SJB, Beal JM, et al. Risk factors for postpartum depression: a retrospective investigation at 4-weeks postnatal and a review of the literature. JAOA. 2006;106:193–8.


Sierra J. Risk factors related to postpartum depression in low-income Latina mothers. Ann Arbor: ProQuest Information and Learning Company; 2008.

Çankaya S. The effect of psychosocial risk factors on postpartum depression in antenatal period: a prospective study. Arch Psychiatr Nurs. 2020;34(3):176–83.

Shao HH, Lee SC, Huang JP, et al. Prevalence of postpartum depression and associated predictors among Taiwanese women in a mother-child friendly hospital. Asia Pac J Public Health. 2021;33(4):411–7.

Adeyemo EO, Oluwole EO, Kanma-Okafor OJ, et al. Prevalence and predictors of postpartum depression among postnatal women in Lagos. Nigeria Afr Health Sci. 2020;20(4):1943–54.

Al Nasr RS, Altharwi K, Derbah MS et al. (2020). Prevalence and predictors of postpartum depression in Riyadh, Saudi Arabia: a cross sectional study. PLoS ONE, 15(2), e0228666.

Desai ND, Mehta RY, Ganjiwale J. Study of prevalence and risk factors of postpartum depression. Natl J Med Res. 2012;2(02):194–8.

Azale T, Fekadu A, Medhin G, et al. Coping strategies of women with postpartum depression symptoms in rural Ethiopia: a cross-sectional community study. BMC Psychiatry. 2018;18(1):1–13.

Asaye MM, Muche HA, Zelalem ED. (2020). Prevalence and predictors of postpartum depression: Northwest Ethiopia. Psychiatry journal, 2020.

Ayele TA, Azale T, Alemu K et al. (2016). Prevalence and associated factors of antenatal depression among women attending antenatal care service at Gondar University Hospital, Northwest Ethiopia. PLoS ONE, 11(5), e0155125.

Saraswat N, Wal P, Pal RS et al. (2021). A detailed Biological Approach on Hormonal Imbalance Causing Depression in critical periods (Postpartum, Postmenopausal and Perimenopausal Depression) in adult women. Open Biology J, 9.

Guida J, Sundaram S, Leiferman J. Antenatal physical activity: investigating the effects on postpartum depression. Health. 2012;4:1276–86.

Robertson E, Grace S, Wallington T, et al. Antenatal risk factors for postpartum depression: a synthesis of recent literature. Gen Hosp Psychiatry. 2004;26:289–95.

Watanabe M, Wada K, Sakata Y, et al. Maternity blues as predictor of postpartum depression: a prospective cohort study among Japanese women. J Psychosom Obstet Gynecol. 2008;29:211–7.

Kırpınar İ, Gözüm S, Pasinlioğlu T. Prospective study of postpartum depression in eastern Turkey: prevalence, socio-demographic and obstetric correlates, prenatal anxiety and early awareness. J Clin Nurs. 2009;19:422–31.

Zhao XH, Zhang ZH. Risk factors for postpartum depression: an evidence-based systematic review of systematic reviews and meta-analyses. Asian J Psychiatry. 2020;53:102353.

Bina R. Predictors of postpartum depression service use: a theory-informed, integrative systematic review. Women Birth. 2020;33(1):e24–32.

Jannati N, Farokhzadian J, Ahmadian L. The experience of healthcare professionals providing mental health services to mothers with postpartum depression: a qualitative study. Sultan Qaboos Univ Med J. 2021;21(4):554.

Dennis CL, Chung-Lee L. Postpartum depression help‐seeking barriers and maternal treatment preferences: a qualitative systematic review. Birth. 2006;33(4):323–31.

Haque S, Malebranche M. (2020). Impact of culture on refugee women’s conceptualization and experience of postpartum depression in high-income countries of resettlement: a scoping review. PLoS ONE, 15(9), e0238109.

https://www.statista.com/statistics/1285335/population-in-ghana-by-languages-spoken/


Acknowledgements

We would like to express our deep thanks to Rovan Hossam Abdulnabi Ali for her role in completing this study and her unlimited support. Special thanks to Dr. Mohamed Liaquat Raza for his role in reviewing the questionnaire. Moreover, we would like to thank all the mothers who participated in this study.

This project received no funding.

Author information

Authors and Affiliations

Department of Public Health and Community Medicine, Faculty of Medicine, Zagazig University, Zagazig, Egypt

Samar A. Amer

Department of Family Medicine, Faculty of Medicine, Zagazig University, Zagazig, Egypt

Nahla A. Zaitoun

Department of Psychiatry, Faculty of Medicine, Zagazig University, Zagazig, Egypt

Heba A. Abdelsalam

Faculty of Medicine, Al-Azhar University, Damietta, Egypt

Abdallah Abbas

Department of Obstetrics and Gynecology, Faculty of Medicine, Zagazig University, Zagazig, Egypt

Mohamed Sh Ramadan

Hammurabi Medical College, University of Babylon, Al-Diwaniyah, Iraq

Hassan M. Ayal

Hardamout University College of Medicine, Almukalla, Yemen

Samaher Edhah Ahmed Ba-Gais

Department of General Medicine, Shadan Institute of Medical Science, Hyderabad, India

Nawal Mahboob Basha

College of Medicine, Sulaiman Alrajhi University, Albukayriah, Al-Qassim, Saudi Arabia

Abdulrahman Allahham

Department of Virology, Noguchi Memorial Institute for Medical Research, University of Ghana Legon, Accra, Ghana

Emmanuael Boateng Agyenim

Department of Public Health and Community Medicine, Faculty of Medicine, Beni-Suef University, Beni-Suef, Egypt

Walid Amin Al-Shroby


Contributions

Conceptualization: Samar A. Amer (SA); Methodology: SA, Nahla A. Zaitoun (NZ); Validation: Mohamed Ramadan Ali Shaaban (MR), Hassan Majid Abdulameer Aya (HM), Samaher Edhah Ahmed Ba-Gais (SG), Nawal Mahboob Basha (NB), Abdulrahman Allahham (AbAl), Emmanuael Boateng Agyenim (EB); Formal analysis: Abdallah Abbas (AA); Data curation: MR, HM, SG, NB, AbAl, NZ, and EB; Writing, original draft preparation: SA, Heba Ahmed Abdelsalam (HAA), and NZ; Writing, review and editing: MR, AA, Walid Amin Elshrowby (WE); Visualization: SA, AA; Supervision: SA; Project administration: AA. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Samar A. Amer.

Ethics declarations

Ethical approval and consent to participate

All participants were provided with electronic informed consent after receiving clear explanations regarding the study’s objectives, data confidentiality, voluntary participation, and the right to withdraw. The questionnaire did not contain any sensitive questions, and data collection was performed anonymously. We affirm that all relevant ethical guidelines have been adhered to, and any necessary approvals from the ethics committee have been obtained. Approval was received from the ethical committee of the family medicine department, the faculty of medicine at Zagazig University, and from the patients included in the study. IRP#ZU-IRP#11079-8/10-2023.

Practicing ethical decision-making is crucial when providing clinical treatment. Such decisions are frequently made challenging by a lack of knowledge and by the mother's capacity to handle the associated complexities and uncertainties, which affect her current level of functioning and her ability to care for her child. At the end of the survey, we raised concerns regarding red flags, such as suicidal thoughts, and called for a revisit for psychiatric evaluation and a discussion of the risks, benefits, and alternatives of medication.

Consent for publication

All authors have read and agreed to the published version of the manuscript.

Previous publication

We declare that this research paper has not been published elsewhere in any other academic journal or platform.

Generative AI in scientific writing

We declare that we have not used AI in writing any part of this manuscript.

Conflict of interest

No conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ . The Creative Commons Public Domain Dedication waiver ( http://creativecommons.org/publicdomain/zero/1.0/ ) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Amer, S.A., Zaitoun, N.A., Abdelsalam, H.A. et al. Exploring predictors and prevalence of postpartum depression among mothers: Multinational study. BMC Public Health 24, 1308 (2024). https://doi.org/10.1186/s12889-024-18502-0


Received : 06 February 2024

Accepted : 02 April 2024

Published : 14 May 2024

DOI : https://doi.org/10.1186/s12889-024-18502-0


Keywords

  • The Edinburgh postnatal depression scale (EPDS)
  • Determinants
  • Psychosocial



medRxiv

Exploring the Relationship Between Early Life Exposures and the Comorbidity of Obesity and Hypertension: Findings from the 1970 British Cohort Study (BCS70)


Background Epidemiological research commonly investigates single exposure-outcome relationships, while children's experiences across a variety of early lifecourse domains intersect. To design realistic interventions, epidemiological research should incorporate information from multiple risk exposure domains and assess their effects on health outcomes. In this paper we identify exposures across five pre-hypothesised childhood domains and explore their associations with the odds of combined obesity and hypertension in adulthood.

Methods We used data from 17,196 participants in the 1970 British Cohort Study. The outcome was comorbid obesity (BMI ≥ 30) and hypertension (blood pressure > 140/90 mmHg or self-reported doctor's diagnosis) at age 46. Early life domains included: 'prenatal, antenatal, neonatal and birth', 'developmental attributes and behaviour', 'child education and academic ability', 'socioeconomic factors' and 'parental and family environment'. Stepwise backward elimination selected variables for inclusion within each domain. Predicted risk scores of combined obesity and hypertension were calculated for each cohort member within each domain. Logistic regression investigated the association between domain-specific risk scores and the odds of obesity-hypertension, controlling for demographic factors and the other domains.
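As a rough, hedged illustration of this two-stage design (simulated data; the domain names, variable names, and the p-value-based elimination rule below are invented stand-ins, and the real analysis used five domains), one might compute per-domain predicted risk scores and then regress the outcome on them:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(y, X, threshold=0.05):
    """Drop the least significant predictor until all remaining p-values pass."""
    X = X.copy()
    while X.shape[1] > 1:
        fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
        pvals = fit.pvalues.drop("const")
        if pvals.max() <= threshold:
            break
        X = X.drop(columns=pvals.idxmax())
    return X.columns.tolist()

def domain_risk_score(y, X):
    """Predicted outcome probability from a domain's retained variables."""
    kept = backward_eliminate(y, X)
    fit = sm.Logit(y, sm.add_constant(X[kept])).fit(disp=0)
    return fit.predict(sm.add_constant(X[kept]))

# Toy data: two of the five pre-hypothesised domains, illustrative variables.
rng = np.random.default_rng(2)
n = 1000
y = pd.Series(rng.integers(0, 2, n), name="obesity_hypertension")
socio = pd.DataFrame(rng.normal(size=(n, 3)), columns=["income", "housing", "class"])
family = pd.DataFrame(rng.normal(size=(n, 2)), columns=["parenting", "conflict"])

scores = pd.DataFrame({
    "socioeconomic_score": domain_risk_score(y, socio),
    "family_score": domain_risk_score(y, family),
})

# Final model: odds of the comorbidity as a function of domain risk scores.
final = sm.Logit(y, sm.add_constant(scores)).fit(disp=0)
print(np.exp(final.params).round(2))  # ORs per unit increase in each score
```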

Results Adjusting for demographic confounders, all domains were associated with the odds of obesity-hypertension. Including all domains in the same model, higher predicted risk values across the five domains remained associated with increased odds of obesity-hypertension comorbidity, with the strongest associations for the parental and family environment domain (OR 1.11, 95% CI 1.05-1.18) and the socioeconomic factors domain (OR 1.11, 95% CI 1.05-1.17).

Conclusions Targeted prevention interventions aimed at population groups with shared early-life characteristics could reduce the prevalence of obesity-hypertension, a comorbidity whose components are known risk factors for further morbidity, including cardiovascular disease.

Competing Interest Statement

R.O. is a member of the National Institute for Health and Care Excellence (NICE) Technology Appraisal Committee, member of the NICE Decision Support Unit (DSU), and associate member of the NICE Technical Support Unit (TSU). She has served as a paid consultant to the pharmaceutical industry and international reimbursement agencies, providing unrelated methodological advice. She reports teaching fees from the Association of British Pharmaceutical Industry (ABPI). R.H. is a member of the Scientific Board of the Smith Institute for Industrial Mathematics and System Engineering.

Funding Statement

This work is part of the multidisciplinary ecosystem to study lifecourse determinants and prevention of early-onset burdensome multimorbidity (MELD-B) project which is supported by the National Institute for Health Research (NIHR203988). The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.

Author Declarations


The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

Ethics approval for this work has been obtained from the University of Southampton Faculty of Medicine Ethics committee (ERGO II Reference 66810).


Data Availability Statement

The BCS70 datasets generated and analysed in the current study are available from the UK Data Archive repository (available here: http://www.cls.ioe.ac.uk/page.aspx?&sitesectionid=795 ).


Lesson 1: Simple Linear Regression

Overview

Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables. This lesson introduces the concept and basic procedures of simple linear regression. Upon completion of this lesson, you should be able to:

  • Distinguish between a deterministic relationship and a statistical relationship.
  • Understand the concept of the least squares criterion.
  • Interpret the intercept \(b_{0}\) and slope \(b_{1}\) of an estimated regression equation.
  • Know how to obtain the estimates \(b_{0}\) and \(b_{1}\) from Minitab's fitted line plot and regression analysis output.
  • Recognize the distinction between a population regression line and the estimated regression line.
  • Summarize the four conditions that comprise the simple linear regression model.
  • Know what the unknown population variance \(\sigma^{2}\) quantifies in the regression setting.
  • Know how to obtain the estimated MSE of the unknown population variance \(\sigma^{2}\) from Minitab's fitted line plot and regression analysis output.
  • Know that the coefficient of determination (\(R^2\)) and the correlation coefficient (r) are measures of linear association. That is, they can be 0 even if there is a perfect nonlinear association.
  • Know how to interpret the \(R^2\) value.
  • Understand the cautions necessary in using the \(R^2\) value as a way of assessing the strength of the linear association.
  • Know how to calculate the correlation coefficient r from the \(R^2\) value.
  • Know what various correlation coefficient values mean. There is no meaningful interpretation for the correlation coefficient as there is for the \(R^2\) value.
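The lesson works these quantities out in Minitab; as a language-agnostic illustration, the short Python sketch below (toy numbers, not one of the lesson's datasets) computes the least squares estimates \(b_{0}\) and \(b_{1}\), the MSE estimate of \(\sigma^{2}\), \(R^2\), and the correlation coefficient r recovered from \(R^2\):

```python
import numpy as np

# Toy data: a predictor x and a response y (values are made up).
x = np.array([10, 20, 30, 40, 50, 60], dtype=float)
y = np.array([145, 250, 340, 465, 530, 650], dtype=float)

# Least squares estimates: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Fitted values, the MSE estimate of sigma^2, and R-squared
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)          # error sum of squares
ssto = np.sum((y - y.mean()) ** 2)      # total sum of squares
mse = sse / (len(x) - 2)                # n - 2 degrees of freedom
r2 = 1 - sse / ssto
r = np.sign(b1) * np.sqrt(r2)           # r carries the sign of the slope

print(f"b0={b0:.2f}, b1={b1:.2f}, MSE={mse:.2f}, R^2={r2:.3f}, r={r:.3f}")
```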

Lesson 1 Code Files

STAT501_Lesson01.zip

  • bldgstories.txt
  • carstopping.txt
  • drugdea.txt
  • fev_dat.txt
  • heightgpa.txt
  • husbandwife.txt
  • oldfaithful.txt
  • poverty.txt
  • practical.txt
  • signdist.txt
  • skincancer.txt
  • student_height_weight.txt

COMMENTS

  1. Research Using Multiple Regression Analysis: 1 Example with Conceptual

    This quickly done example of a research using multiple regression analysis revealed an interesting finding. The number of hours spent online relates significantly to the number of hours spent by a parent, specifically the mother, with her child. These two factors are inversely or negatively correlated. The relationship means that the greater ...

  2. Multiple Linear Regression

    Multiple linear regression formula. The formula for a multiple linear regression is \(\hat{y} = b_{0} + b_{1}x_{1} + \dots + b_{n}x_{n}\), where \(\hat{y}\) is the predicted value of the dependent variable, \(b_{0}\) is the y-intercept (the value of y when all other parameters are set to 0), and \(b_{1}\) is the regression coefficient of the first independent variable \(x_{1}\), i.e., the effect that increasing the value of that independent variable has on the predicted value (see the corresponding sketch after this list).

  3. Section 5.3: Multiple Regression Explanation, Assumptions

    Multiple Regression Write-Up. Here is an example of how to write up the results of a standard multiple regression analysis: in order to test the research question, a multiple regression was conducted, with age, gender (0 = male, 1 = female), and perceived life stress as the predictors, and levels of physical illness as the dependent variable (see the corresponding sketch after this list).

  4. 3.1

    At some level, answering these two research questions is straightforward. Both just involve using the estimated regression equation: that is, \(\hat{y}_{h} = b_{0} + b_{1}x_{h}\) is the best answer to each research question. It is the best guess of the mean response at \(x_{h}\), and it is the best guess of a new response at \(x_{h}\): our best estimate of the mean ... (a sketch contrasting the two interval estimates appears after this list).

  5. Beginner's Guide to Multiple Linear Regression

    Welcome to our Beginner's Guide to Multiple Linear Regression, your gateway to understanding a key concept in machine learning. Multiple linear regression helps predict outcomes by analyzing the relationship between one dependent variable and multiple independent variables, making it an essential skill for data scientists.

  6. PDF Multiple Regression

    WHY Scientific research. For simple regression we found the Least Squares solution, ... Here's a typical example of a multiple regression table: dependent variable is Pct BF; R-squared = 71.3%; R-squared (adjusted) ... question; we follow the three rules. Here's the scatterplot: [scatterplot figure not reproduced in this excerpt]. (The adjusted R-squared formula is given after this list.)

  7. Lesson 5: Multiple Linear Regression

    Multiple linear regression, in contrast to simple linear regression, involves multiple predictors, and so testing each variable can quickly become complicated. For example, suppose we apply two separate tests for two predictors, say \(x_{1}\) and \(x_{2}\), and both tests have high p-values. One test suggests \(x_{1}\) is not needed in a model with ... (a simulated illustration of this situation appears after this list).

  8. Multiple linear regression

    When could this happen in real life? Time series: each sample corresponds to a different point in time, and the errors for samples that are close in time are correlated. Spatial data: each sample corresponds to a different location in space. Grouped data: imagine a study on predicting height from weight at birth; if some of the subjects in the study are in the same family, their shared environment ... (a sketch of detecting the time-series case appears after this list).

  9. PDF OBJECTIVES

    The research question for those using multiple regression concerns how the multiple independent variables, either by themselves or together, influence changes in the dependent variable. You use the same basic concepts as with simple linear regression, except that ... The multiple regression example used in this chapter is as basic as possible ...

  10. Questions the Multiple Linear Regression Answers

    Multiple Linear Regression Analysis helps answer three key types of questions: (1) identifying causes, (2) predicting effects, and (3) forecasting trends. Identifying Causes: It determines the cause-and-effect relationships between one continuous dependent variable and two or more independent variables. Unlike correlation analysis, which doesn ...

  11. 15 Multiple Regression

    With multiple regression what we're doing is looking at the effect of each variable while holding the other variable constant. Specifically, a one-unit increase in computers is associated with an increase in math scores of .002 points when holding the number of students constant, and that change is highly significant.

  12. PDF Practice Questions: Multiple Regression

    Practice Questions: Multiple Regression. An auto manufacturer was interested in pricing strategies for a new vehicle it plans to introduce in the coming year. The analysis that follows considers how other manufacturers price their vehicles. The analysis begins with the correlation of price with certain features of the vehicle, particularly ...

  13. Multiple Linear Regression. A complete study

    Here, Y is the output variable, and the X terms are the corresponding input variables. Notice that this equation is just an extension of simple linear regression, and each predictor has a corresponding slope coefficient (\(\beta\)). The first \(\beta\) term (\(\beta_{0}\)) is the intercept constant and is the value of Y in the absence of all predictors (i.e., when all X terms are 0). It may or may not hold any ...

  14. Multiple Regression Analysis using SPSS Statistics

    The "R" column represents the value of R, the multiple correlation coefficient.R can be considered to be one measure of the quality of the prediction of the dependent variable; in this case, VO 2 max.A value of 0.760, in this example, indicates a good level of prediction. The "R Square" column represents the R 2 value (also called the coefficient of determination), which is the proportion of ...

  15. Regression Tutorial with Analysis Examples

    This tutorial covers many facets of regression analysis including selecting the correct type of regression analysis, specifying the best model, interpreting the results, assessing the fit of the model, generating predictions, and checking the assumptions. I close the post with examples of different types of regression analyses.

  16. Lesson 5: Multiple Linear Regression

    The only real difference is that whereas in simple linear regression we think of the distribution of errors at a fixed value of the single predictor, with multiple linear regression we have to think of the distribution of errors at a fixed set of values for all the predictors. All of the model-checking procedures we learned earlier are useful ... (a residual-diagnostics sketch appears after this list).

  17. Lesson 5: Multiple Linear Regression (MLR) Model & Evaluation

    Translate research questions involving slope parameters into the appropriate hypotheses for testing. Know how to calculate a confidence interval for a single slope parameter in the multiple regression setting. Understand the general idea behind the general linear F-test. Understand the decomposition of a regression sum of squares into a sum of ... (a sketch of the interval and the F-test appears after this list).

  18. Regression Analysis

    Here is a general methodology for performing regression analysis: Define the research question: Clearly state the research question or hypothesis you want to investigate. Identify the dependent variable (also called the response variable or outcome variable) and the independent variables (also called predictor variables or explanatory variables ...

  19. Regression

    Hierarchical multiple regression: the researcher selects the order in which the predictor variables enter the equation. The research question for regression is: to what extent and in what manner do the predictors explain variation in the criterion? To what extent: \(H_{0}\colon R^{2} = 0\); in what manner: \(H_{0}\colon \beta = 0\). EXPLAINED (REGRESSION) is the difference ...

  20. LibGuides: Statistics Resources: Regression Analysis

    These are just a few examples of what the research questions and hypotheses may look like when a regression analysis is appropriate. Simple Linear Regression. RQ: Does body weight influence cholesterol levels? H0: Bodyweight does not have an influence on cholesterol levels. Ha: Bodyweight has a significant influence on cholesterol levels.

  21. Writing hypothesis for linear multiple regression models

    I struggle with writing hypotheses because I get very confused by reference groups in the context of regression models. For my example I'm using the mtcars dataset. The predictors are wt (weight), cyl (number of cylinders), and gear (number of gears), and the outcome variable is mpg (miles per gallon). Say all your friends think you should ... (a sketch fitting this exact model appears after this list).

  22. Section 5.4: Hierarchical Regression Explanation, Assumptions

    The example research question is "what is the effect of perceived stress on physical illness, after controlling for age and gender?" ... In order to test the predictions, a hierarchical multiple regression was conducted, with two blocks of variables. The first block included age and gender (0 = male, 1 = female) as the predictors, with ... (a sketch of the two-block comparison appears after this list).

  23. Exploring predictors and prevalence of postpartum depression among

    The data underwent logistic regression analysis using IBM SPSS 27 to list potential factors that could predict PPD. The overall frequency of PPD in the total sample was 92 (13.6%). It ranged from 2.3% in Syria to 26% in Ghana. Only 42 (6.2%) were diagnosed. Multiple logistic regression analysis revealed that there were significant predictors of PPD.

  24. Exploring the Relationship Between Early Life Exposures and the

    Abstract. Background: Epidemiological research commonly investigates single exposure-outcome relationships, while children's experiences across a variety of early life-course domains are intersecting. To design realistic interventions, epidemiological research should incorporate information from multiple risk exposure domains to assess their effect on health outcomes. In this paper we identify ...

  25. Lesson 1: Simple Linear Regression

    Objectives. Upon completion of this lesson, you should be able to: distinguish between a deterministic relationship and a statistical relationship; understand the concept of the least squares criterion; interpret the intercept \(b_{0}\) and slope \(b_{1}\) of an estimated regression equation; know how to obtain the estimates \(b_{0}\) and \(b_{1}\) from Minitab's ...
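To make several of the items above concrete, the sketches below illustrate them in Python. They are minimal illustrations under stated assumptions, not code from the cited sources. First, for item 2, the multiple linear regression formula mapped onto a least squares fit with invented data:

```python
import numpy as np

# Invented data with two predictors x1 and x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the first fitted coefficient is the intercept
# b0; the remaining entries are the slope coefficients b1 and b2.
X_design = np.column_stack([np.ones(len(y)), X])
params, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print("b0, b1, b2 =", params)
```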
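For item 3, a hedged sketch of that write-up's model specification using the statsmodels formula API; the data frame and its columns (age, gender coded 0/1, stress, illness) are hypothetical stand-ins for the study's data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data shaped like the write-up's variables.
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "gender": rng.integers(0, 2, n),   # 0 = male, 1 = female
    "stress": rng.normal(50, 10, n),
})
df["illness"] = 2 + 0.05 * df["age"] + 0.3 * df["stress"] + rng.normal(0, 5, n)

# Standard multiple regression: all predictors entered at once.
model = smf.ols("illness ~ age + gender + stress", data=df).fit()
print(model.summary())  # coefficients, t-tests, and R^2 for the write-up
```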
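For item 4, a sketch of the distinction it draws: the same point estimate \(\hat{y}_{h}\) answers both questions, but the confidence interval for the mean response is narrower than the prediction interval for a new response. Data are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": np.linspace(0, 10, 40)})
df["y"] = 3 + 1.5 * df["x"] + rng.normal(0, 2, 40)

fit = smf.ols("y ~ x", data=df).fit()
pred = fit.get_prediction(pd.DataFrame({"x": [5.0]}))  # x_h = 5
frame = pred.summary_frame(alpha=0.05)
print(frame[["mean",                            # y_hat_h, the shared estimate
             "mean_ci_lower", "mean_ci_upper",  # CI for the mean response
             "obs_ci_lower", "obs_ci_upper"]])  # PI for a new response
```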
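For item 6, the table excerpt reports both R-squared and adjusted R-squared; for reference, the standard adjustment penalizes added predictors:

\[
R^{2}_{adj} = 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1},
\]

where n is the number of observations and p the number of predictors.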
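For item 7, a simulation of the trap it describes: two nearly collinear predictors can each show a large individual p-value even though the model clearly needs at least one of them. All data are simulated:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # x2 nearly duplicates x1
y = 2 + 3 * x1 + rng.normal(size=n)

fit = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
print(fit.pvalues[1:])  # individual t-tests: both p-values can be large
print(fit.f_pvalue)     # overall F-test: highly significant nonetheless
```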
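For item 8, a sketch of detecting the time-series case: the Durbin-Watson statistic computed on the residuals sits near 2 under independent errors and well below 2 under positive autocorrelation. The AR(1) errors here are simulated:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(4)
n = 200
x = np.arange(n, dtype=float)

# AR(1) errors: each error carries over 0.8 of the previous one.
e = np.zeros(n)
for t in range(1, n):
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1 + 0.5 * x + e

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(durbin_watson(fit.resid))  # expect a value well below 2
```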
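For item 16, a sketch of the model-checking idea: with multiple predictors, residuals are plotted against the fitted values (which summarize the whole predictor set) rather than against a single x. Data are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(80, 2)), columns=["x1", "x2"])
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(size=80)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

# A healthy plot shows no pattern and roughly constant spread around zero.
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```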
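For item 17, a sketch of the two procedures named: a confidence interval for a single slope, and a general linear F-test comparing a reduced model against the full model. Data are simulated:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(6)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["x1", "x2", "x3"])
df["y"] = 1 + 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=100)

full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()
print(full.conf_int().loc["x1"])  # 95% CI for the x1 slope alone

# General linear F-test: does the full model beat the reduced one?
reduced = smf.ols("y ~ x1", data=df).fit()
print(anova_lm(reduced, full))
```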
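For item 21, the mtcars model fit in Python rather than R; get_rdataset fetches the dataset from the Rdatasets repository, so this sketch needs internet access:

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
fit = smf.ols("mpg ~ wt + cyl + gear", data=mtcars).fit()

# Each slope is the change in mpg per unit change in that predictor,
# holding the other two predictors constant.
print(fit.params)
print(fit.conf_int())
```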
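Finally, for item 22, a sketch of the two-block hierarchy it describes: block 1 enters the controls (age, gender), block 2 adds perceived stress, and the F-test on the R-squared change asks whether stress explains illness beyond the controls. The data frame is the same kind of hypothetical stand-in used for item 3:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(7)
n = 120
df = pd.DataFrame({
    "age": rng.integers(18, 70, n),
    "gender": rng.integers(0, 2, n),  # 0 = male, 1 = female
    "stress": rng.normal(50, 10, n),
})
df["illness"] = 1 + 0.02 * df["age"] + 0.4 * df["stress"] + rng.normal(0, 5, n)

block1 = smf.ols("illness ~ age + gender", data=df).fit()           # controls
block2 = smf.ols("illness ~ age + gender + stress", data=df).fit()  # + stress
print(f"R^2 change: {block2.rsquared - block1.rsquared:.3f}")
print(anova_lm(block1, block2))  # significance of the R^2 change
```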