tips:assembling_data_smd

Suppose the goal of a meta-analysis is to aggregate the results from studies contrasting two groups (e.g., treatment versus control) and each study measured an outcome of interest using some quantitative scale. A commonly used effect size measure used to quantify the size of the group difference is then the standardized mean difference (also commonly known as Cohen's d).

As an example, consider the data reported in Normand (1999) on the length of the hospital stay of stroke patients under specialized care and under conventional/routine non-specialist care:

library(metafor) dat.normand1999

study source n1i m1i sd1i n2i m2i sd2i 1 1 Edinburgh 155 55 47 156 75 64 2 2 Orpington-Mild 31 27 7 32 29 4 3 3 Orpington-Moderate 75 64 17 71 119 29 4 4 Orpington-Severe 18 66 20 18 137 48 5 5 Montreal-Home 8 14 8 13 18 11 6 6 Montreal-Transfer 57 19 7 52 18 4 7 7 Newcastle 34 52 45 33 41 34 8 8 Umea 110 21 16 183 31 27 9 9 Uppsala 60 30 27 52 23 20

Variables `n1i`

and `n2i`

indicate the number of patients under specialized and under routine care, respectively, variables `m1i`

and `m2i`

indicate the respective mean length of stay (in days) in the two groups during the acute hospital admission, and `sd1i`

and `sd2i`

are the corresponding standard deviations of the length of stay.

With this information, we can compute the standardized mean difference (and corresponding sampling variance) for each study with:

dat1 <- escalc(measure="SMD", m1i=m1i, sd1i=sd1i, n1i=n1i, m2i=m2i, sd2i=sd2i, n2i=n2i, data=dat.normand1999) dat1

study source n1i m1i sd1i n2i m2i sd2i yi vi 1 1 Edinburgh 155 55 47 156 75 64 -0.3552 0.0131 2 2 Orpington-Mild 31 27 7 32 29 4 -0.3479 0.0645 3 3 Orpington-Moderate 75 64 17 71 119 29 -2.3176 0.0458 4 4 Orpington-Severe 18 66 20 18 137 48 -1.8880 0.1606 5 5 Montreal-Home 8 14 8 13 18 11 -0.3840 0.2054 6 6 Montreal-Transfer 57 19 7 52 18 4 0.1721 0.0369 7 7 Newcastle 34 52 45 33 41 34 0.2721 0.0603 8 8 Umea 110 21 16 183 31 27 -0.4246 0.0149 9 9 Uppsala 60 30 27 52 23 20 0.2896 0.0363

Finally, a random-effects model can be fitted to these data with:

res1 <- rma(yi, vi, data=dat1) res1

Random-Effects Model (k = 9; tau^2 estimator: REML) tau^2 (estimated amount of total heterogeneity): 0.7908 (SE = 0.4281) tau (square root of estimated tau^2 value): 0.8893 I^2 (total heterogeneity / total variability): 95.49% H^2 (total variability / sampling variability): 22.20 Test for Heterogeneity: Q(df = 8) = 123.7293, p-val < .0001 Model Results: estimate se zval pval ci.lb ci.ub -0.5371 0.3087 -1.7401 0.0818 -1.1421 0.0679 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Therefore, stroke patients receiving specialized care are estimated to spent on average about half of a standard deviation fewer days in the hospital compared to patients receiving routine care ($\hat{\mu} = -0.54$ with 95% CI: $-1.14$ to $0.07$), but there is a considerable amount of heterogeneity in the findings (as indicated by the large estimate of $\tau^2$, the large $I^2$ value, and the significant $Q$-test).

*Note:* Since the exact same variable (number of days of the hospital stay) was measured in each study, it would also be possible (and probably better) to conduct the meta-analysis with the raw (unstandardized) mean difference as the effect size measure (or possibly with the (log transformed) ratio of means, since the variable was measured on a ratio scale). The data are used here just for illustration purposes.

Now suppose that the means and standard deviations of the length of stay variable are not reported in all studies, but that the following dataset could be assembled based on information reported in the studies:

dat2 <- dat.normand1999 dat2$dval <- with(dat2, (m1i - m2i) / sqrt(((n1i-1)*sd1i^2 + (n2i-1)*sd2i^2)/(n1i + n2i - 2))) dat2$tval <- with(dat2, dval / sqrt(1/n1i + 1/n2i)) dat2$pval <- 2 * with(dat2, pt(abs(tval), df = n1i + n2i - 2, lower.tail=FALSE)) dat2$sign <- with(dat2, ifelse(m1i > m2i, 1, -1)) dat2[,c("dval","tval","pval")] <- round(dat2[,c("dval","tval","pval")], 2) dat2[c(1,7),c(4,5,7,8,10,11,12)] <- NA dat2[c(5,8),c(4,5,7,8, 9,11,12)] <- NA dat2[c(2,9),c(4,5,7,8, 9,10)] <- NA dat2[c(3,4,6),c(9:12)] <- NA dat2

study n1i m1i sd1i n2i m2i sd2i dval tval pval sign V12 1 1 155 55 NA NA 75 NA NA -3.14 NA NA NA 2 2 31 27 NA NA 29 NA NA NA NA -1 NA 3 3 75 64 17 71 119 29 -2.33 NA NA NA NA 4 4 18 66 20 18 137 48 -1.93 NA NA NA NA 5 5 8 14 NA NA 18 NA NA NA 0.38 NA NA 6 6 57 19 7 52 18 4 0.17 NA NA NA NA 7 7 34 52 NA NA 41 NA NA 1.13 NA NA NA 8 8 110 21 NA NA 31 NA NA NA 0.00 NA NA 9 9 60 30 NA NA 23 NA NA NA NA 1 NA

In particular, in studies 1 and 7, authors reported only the standardized mean difference value (Cohen's d value). In studies 5 and 8, the authors only reported the t-statistic from an independent samples t-test comparing the two groups, and in studies 2 and 9, the authors only reported the (two-sided) p-value corresponding to the t-test. For all studies, the group sizes are known.^{1)}

Given only this information, it is possible to reconstruct the full dataset for the meta-analysis (after each step, I would recommend examining the contents of the `dat2`

object to better appreciate what the code is doing).

- For the studies reporting means and standard deviations (studies 3, 4, and 6), we can again use the
`escalc()`

function to obtain the (bias-corrected) standardized mean differences and corresponding sampling variances with:dat2 <- escalc(measure="SMD", m1i=m1i, sd1i=sd1i, n1i=n1i, m2i=m2i, sd2i=sd2i, n2i=n2i, data=dat2)

- The p-values from the t-tests can be converted into the corresponding t-values with:
dat2$tval <- replmiss(dat2$tval, with(dat2, sign * qt(pval/2, df=n1i+n2i-2, lower.tail=FALSE)))

Note that the two-sided p-values need to be turned into one-sided p-values first (i.e., by dividing by 2) before using the

`qt()`

function. Also, the p-values do not contain information on the sign of the t-test values (i.e., whether the first or the second group had the lower mean), so that information needs to be available as well (as encoded in the`sign`

variable). The`replmiss()`

function is a little helper function in the metafor package that can be used to replace only the missing values in a variable. - The t-test values can be converted into the corresponding Cohen's d values with:
dat2$dval <- replmiss(dat2$dval, with(dat2, tval * sqrt(1/n1i + 1/n2i)))

This conversion is based on the well-known relationship between Cohen's d values and the test statistic of the independent samples t-test. Again, only missing values are replaced.

- Any missing values for the
`vi`

variable (the sampling variances) can now be replaced with:dat2$vi <- replmiss(dat2$vi, with(dat2, 1/n1i+ 1/n2i + yi^2/(2*(n1i+n2i))))

The code above uses the usual equation for computing (or rather: estimating) the large-sample sampling variance of standardized mean differences.

We can now examine the contents of the dataset:

dat2[-2]

study n1i m1i sd1i n2i m2i sd2i dval tval pval sign yi vi 1 1 155 NA NA 156 NA NA -0.3600 NA NA NA -0.3591 0.0131 2 2 31 NA NA 32 NA NA -0.3499 -1.3886 0.17 -1 -0.3456 0.0645 3 3 75 64 17 71 119 29 NA NA NA NA -2.3176 0.0458 4 4 18 66 20 18 137 48 NA NA NA NA -1.8880 0.1606 5 5 8 NA NA 13 NA NA -0.3999 -0.8900 NA NA -0.3839 0.2054 6 6 57 19 7 52 18 4 NA NA NA NA 0.1721 0.0369 7 7 34 NA NA 33 NA NA 0.2800 NA NA NA 0.2768 0.0603 8 8 110 NA NA 183 NA NA -0.4259 -3.5300 NA NA -0.4248 0.0149 9 9 60 NA NA 52 NA NA 0.2890 1.5255 0.13 1 0.2871 0.0363

Any differences in the `yi`

and `vi`

variables compared to `dat1`

are purely a result of the fact that rounded values were reported for variables `dval`

, `tval`

, and `pval`

. However, as we can see, the differences are in this example negligible.

We can then fit a random-effects model to these data with:

res2 <- rma(yi, vi, data=dat2) res2

Random-Effects Model (k = 9; tau^2 estimator: REML) tau^2 (estimated amount of total heterogeneity): 0.7912 (SE = 0.4283) tau (square root of estimated tau^2 value): 0.8895 I^2 (total heterogeneity / total variability): 95.50% H^2 (total variability / sampling variability): 22.20 Test for Heterogeneity: Q(df = 8) = 123.7121, p-val < .0001 Model Results: estimate se zval pval ci.lb ci.ub -0.5371 0.3087 -1.7398 0.0819 -1.1422 0.0680 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The results are essentially the same as obtained earlier.

Some final notes:

- The conversion of a p-value to the corresponding t-value as shown above assumes that the exact p-value is reported. If authors only report that the p-value fell below a certain threshold (e.g., $p < .01$ or if authors only state that the test was significant – which typically implies $p < .05$), then a common approach is to use the value of the cutoff reported (e.g., if $p < .01$ is reported, then assume $p = .01$), which is conservative (since the actual p-value was below that assumed value by some unknown amount). The conversion will therefore tend to be less accurate.

- The conversion of the p-value shown above is only applicable if authors actually conducted an independent samples t-test (i.e., it is not appropriate if that p-value was based on a Mann–Whitney U test or some other non-parametric test).

- When authors report only the Cohen's d value and/or the test statistic from an independent samples t-test, care must be taken to ensure the correct sign of the value (authors may only report the absolute value or may have interchanged the groups).

- A useful reference for these types of transformations is Lipsey and Wilson (2001).

Some of the steps above could be further simplified by tricking the `escalc()`

function into computing the (bias-corrected) standardized mean differences and corresponding sampling variances from reported Cohen's d values directly. Let's put together the dataset one more time:

dat3 <- dat.normand1999 dat3$dval <- with(dat3, (m1i - m2i) / sqrt(((n1i-1)*sd1i^2 + (n2i-1)*sd2i^2)/(n1i + n2i - 2))) dat3$tval <- with(dat3, dval / sqrt(1/n1i + 1/n2i)) dat3$pval <- 2 * with(dat3, pt(abs(tval), df = n1i + n2i - 2, lower.tail=FALSE)) dat3$sign <- with(dat3, ifelse(m1i > m2i, 1, -1)) dat3[,c("dval","tval","pval")] <- round(dat3[,c("dval","tval","pval")], 2) dat3[c(1,7),c(4,5,7,8,10,11,12)] <- NA dat3[c(5,8),c(4,5,7,8, 9,11,12)] <- NA dat3[c(2,9),c(4,5,7,8, 9,10)] <- NA dat3[c(3,4,6),c(9:12)] <- NA

First, as above, we transform the p-values to corresponding t-values with:

dat3$tval <- replmiss(dat3$tval, with(dat3, sign * qt(pval/2, df=n1i+n2i-2, lower.tail=FALSE)))

And we again transform the t-values to d-values with:

dat3$dval <- replmiss(dat3$dval, with(dat3, tval * sqrt(1/n1i + 1/n2i)))

Now for the studies for which we have only d-values, the trick is that we set variable `m1i`

to these d-values, set `m2i`

to 0, and `sd1i`

and `sd2i`

to 1:

dat3$m1i <- replmiss(dat3$m1i, dat3$dval) dat3$m2i <- replmiss(dat3$m2i, 0) dat3$sd1i <- replmiss(dat3$sd1i, 1) dat3$sd2i <- replmiss(dat3$sd2i, 1)

The dataset now looks like this:

study source n1i m1i sd1i n2i m2i sd2i dval tval pval sign 1 1 Edinburgh 155 -0.3600 1 156 0 1 -0.3600 NA NA NA 2 2 Orpington-Mild 31 -0.3499 1 32 0 1 -0.3499 -1.3886 0.17 -1 3 3 Orpington-Moderate 75 64.0000 17 71 119 29 NA NA NA NA 4 4 Orpington-Severe 18 66.0000 20 18 137 48 NA NA NA NA 5 5 Montreal-Home 8 -0.3999 1 13 0 1 -0.3999 -0.8900 NA NA 6 6 Montreal-Transfer 57 19.0000 7 52 18 4 NA NA NA NA 7 7 Newcastle 34 0.2800 1 33 0 1 0.2800 NA NA NA 8 8 Umea 110 -0.4259 1 183 0 1 -0.4259 -3.5300 NA NA 9 9 Uppsala 60 0.2890 1 52 0 1 0.2890 1.5255 0.13 1

Now we can use `escalc()`

again:

dat3 <- escalc(measure="SMD", m1i=m1i, sd1i=sd1i, n1i=n1i, m2i=m2i, sd2i=sd2i, n2i=n2i, data=dat3) dat3[-2]

study n1i m1i sd1i n2i m2i sd2i dval tval pval sign yi vi 1 1 155 -0.3600 1 156 0 1 -0.3600 NA NA NA -0.3591 0.0131 2 2 31 -0.3499 1 32 0 1 -0.3499 -1.3886 0.17 -1 -0.3456 0.0645 3 3 75 64.0000 17 71 119 29 NA NA NA NA -2.3176 0.0458 4 4 18 66.0000 20 18 137 48 NA NA NA NA -1.8880 0.1606 5 5 8 -0.3999 1 13 0 1 -0.3999 -0.8900 NA NA -0.3839 0.2054 6 6 57 19.0000 7 52 18 4 NA NA NA NA 0.1721 0.0369 7 7 34 0.2800 1 33 0 1 0.2800 NA NA NA 0.2768 0.0603 8 8 110 -0.4259 1 183 0 1 -0.4259 -3.5300 NA NA -0.4248 0.0149 9 9 60 0.2890 1 52 0 1 0.2890 1.5255 0.13 1 0.2871 0.0363

As we can see, the `yi`

and `vi`

values are exactly identical to the ones from `dat2`

above, which are nearly identical to those from `dat1`

.

Lipsey, M. W., & Wilson, D. B. (2001). *Practical meta-analysis.* Thousand Oaks, CA: Sage.

Normand, S. T. (1999). Meta-analysis: Formulating, evaluating, combining, and reporting. *Statistics in Medicine, 18*(3), 321–359.

In some cases, only the total sample size of a study may be known (i.e.,

`ni = n1i + n2i`

). If it is not at all possible to obtain information about the group sizes (e.g., by contacting the author(s) of the study), one possibility would be to assume that both groups are of the same size (i.e., `n1i = n2i = ni/2`

). In the present dataset, this would not be far from the truth in most cases.tips/assembling_data_smd.txt · Last modified: 2021/11/01 09:38 by Wolfgang Viechtbauer

Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 4.0 International