Discussion:
Differences between R and SPSS on probit analysis
b***@gmail.com
2017-02-23 20:12:41 UTC
Hi!!!
I'm working on the effects of alternative larvicides on Aedes aegypti. Right now, I am fitting a binary mortality response with a single explanatory variable (dose) at 4 concentrations of one larvicide (+ control). Our university is fond of SPSS, and I have learned to conduct the basic probit model with it, including a natural-logarithm transformation of my dose data.
Not long ago, I started working with R, and through a combination of the 'glm' and 'dose.p' functions I get the same slope and intercept, as well as the same LD50 calculations. Nevertheless, the standard errors and z-scores calculated through the probit model in SPSS come out completely different in R. Additionally, the 95% confidence intervals for the LD50 differ between the two programs. I really don't have a clue how I can get the same slopes, intercepts and LD50s, but totally different SEs, z values, and 95% CIs. Can anybody help me get the same results in R?
I'll pass you the script and hypothetical data:
library(MASS)   # needed for dose.p
dose <- c(6000, 4500, 3000, 1500, 0)
total <- c(100, 100, 100, 100, 100)
affected <- c(91, 82, 69, 49, 0)
finney71 <- data.frame(dose, total, affected)
fm1 <- glm(affected/total ~ log(dose),
           family = binomial(link = probit), data = finney71[finney71$dose != 0, ])
xp1 <- dose.p(fm1, p = c(0.5, 0.9))
xp.ci <- xp1 + attr(xp1, "SE") %*% matrix(qnorm(1 - 0.05/2) * c(-1, 1), nrow = 1)
EAUS.Aa <- exp(cbind(xp1, attr(xp1, "SE"), xp.ci[, 1], xp.ci[, 2]))
dimnames(EAUS.Aa)[[2]] <- c("LD", "SE", "LCL", "UCL")
So, these are the regression results I get with R:
summary(fm1)
Deviance Residuals:
1 2 3 4
0.06655 -0.02814 -0.06268 0.03474
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -6.8940    10.7802  -0.640    0.522
log(dose)     0.9333     1.3441   0.694    0.487
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.513878 on 3 degrees of freedom
Residual deviance: 0.010356 on 2 degrees of freedom
AIC: 6.5458
Number of Fisher Scoring iterations: 5
And the LD50/LD90 and CIs, back-transformed to the dose scale:
print(EAUS.Aa)
LD SE LCL UCL
p = 0.5: 1614.444 3.207876 164.3822 15855.91
p = 0.9: 6373.473 3.764879 474.1600 85669.72
These are the values I get in SPSS (just substituted into the layout of the R output):
Coefficients:
Estimate Std. Error z value
(Intercept) -6.8940 1.082 -6.373
(dose) 2.149 0.311 6.918
And the LD50/LD90 and CIs:
LD LCL UCL
p = 0.5: 1614.444 1198.932 1953.120
p = 0.9: 6373.473 5145.767 9013.354
So please, if somebody can help me with this, I'd be grateful. If those functions won't do it, I'll use whichever ones you recommend.
Thank you very much!
PS: I've already googled this, but there's no satisfactory answer.
David Duffy
2017-02-24 23:40:55 UTC
***@gmail.com wrote:
(+ control). Our university is fond of SPSS, and I have learned to
conduct the basic probit model with it
Post by b***@gmail.com
dose <- c(6000, 4500, 3000, 1500, 0)
total <- c(100, 100, 100, 100, 100)
affected <- c(91, 82, 69, 49, 0)
R MASS::dose.p
Post by b***@gmail.com
LD SE LCL UCL
p = 0.5: 1614.444 3.207876 164.3822 15855.91
SPSS
Post by b***@gmail.com
LD LCL UCL
p = 0.5: 1614.444 1198.932 1953.120
I'm too lazy to work out where your SPSS programming went wrong, but
with such a small dataset, don't you think the CIs should be wide rather
than narrow?

predict(fm1, newdata=data.frame(dose=1614.444), type="response", se.fit=TRUE)
$fit 0.5 $se.fit 0.4339901

or, using some Monte Carlo based CI estimate,
library("HelpersMG")
LD50(finney71[1:4,], equation="probit")
The LD50 is 7.387 SE 1.231
The lower limit of transitional range of doses is 5.625 SE 0.929
The higher limit of transitional range of doses is 9.149 SE 1.278

Incidentally, when we are log transforming a range of values including 0,
it is common to try and include this data point (fudge up to 0.5 or 1).
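For example, a minimal sketch of that fudge, assuming the control is bumped up to a
dose of 1 so that log(1) = 0 (dose.f is just an illustrative name; finney71 is the
data frame from the original post):

finney71$dose.f <- finney71$dose             # copy the dose column
finney71$dose.f[finney71$dose == 0] <- 1     # fudge the zero dose up to 1 (or 0.5)
finney71$ldose  <- log(finney71$dose.f)      # log() is now defined for every row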
Rich Ulrich
2017-02-25 02:11:47 UTC
On Fri, 24 Feb 2017 23:40:55 +0000 (UTC), David Duffy wrote:
Incidentally, when we are log transforming a range of values including 0,
it is common to try and include this data point (fudge up to 0.5 or 1).
Are you saying that the 0 is being ignored in both analyses, since
log(0) is undefined?

I have an ignorant question. Or observation, leading to a question.

I've never run R, and I've never run probit. I googled to
try to find what "finney71()" referred to, and that led to an
ambiguous result: an example using R (which I do not use)
seemed to show, in a plot, that the "0" was scored as 0 while
the logarithm was used for the other values - and there was
a warning that "outlying" values could create non-robust SEs
and models - and the model would depend on the scaling of the
dose variable, which, you might note, /should/ be regarded as
a very dubious feature.

Well, for the data above, that would lead to a strong effect:
all the log(dose) values are in a narrow range, compared to 0;
and the 0-dose result is 0 affected, compared to 49%-91%
for the rest. So, by chance, it might fit a fairly linear model.
But, still....

So, question: what does "finney71()" mean - beyond
the fact that it refers to /something/ in Finney's fine 1971 book
on probit analysis?
--
Rich Ulrich
David Duffy
2017-02-25 06:34:03 UTC
Post by Rich Ulrich
Are you saying that the 0 is being ignored in both analyses, since
log(0) is undefined?
So, question: What does the "Finney71( )" mean? - beyond
I presume it's an example.

Nothing like Usenet to parade one's mistakes (meaning mine)! I should
downvote myself. The correct R code should have been:
finney71$ldose <- log(dose)
finney71$ldose[dose == 0] <- 0

fm1 <- glm(cbind(affected, total-affected) ~ ldose,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           family = binomial(link = probit), data = finney71)
ld50 <- dose.p(fm1)
exp(ld50 + c(0, -1.96, 1.96) * c(attr(ld50, "SE")))
[1] 1614.444 1284.708 2028.811

Closer to the SPSS result. Again showing my ignorance: including/excluding
the zero dose made no difference once ldose was set to zero for it
(recall 0/100 affected, so it would matter more with a nonzero base rate).

Cheers, David Duffy.
David Smith
2017-02-28 16:30:39 UTC
I notice that the results from R are for log-dose. Results from SPSS are for dose. The results have to be different.

Results from R
Estimate Std. Error z value
(Intercept) -6.8940 10.7802 -0.640
log(dose) 0.9333 1.3441 0.694

Results from SPSS
Estimate Std. Error z value
(Intercept) -6.8940 1.082 -6.373
(dose) 2.149 0.311 6.918


Results from R give the same coefficients that I get from Stata (see below) but the standard errors are bigger in R by a factor of 10 and the z values are smaller proportionally.

I believe the reason is that R is not taking into account the sample sizes behind each rate. If R is treating each sample size as 1 instead of 100, the standard errors will be bigger by a factor of sqrt(100) = 10, which is about what you have compared with Stata.

For the intercept your results show a similar difference between R and SPSS. This suggests that SPSS is accounting for the frequency counts. I was not able to reproduce your results for SPSS.

If you use the proper frequency weights then all your confidence intervals will be shorter including those for the LD-50 estimate.

Does R permit the specification of a frequency weight? If not you will need a data set with 500 observations and lots of duplicates.

Results from Stata for log-dose using logistic regression (400 observations)

Coef. Std. Err. z
constant -6.893981 1.08167 -6.37
lndose .9332907 .1349154 6.92
David Duffy
2017-02-28 22:59:02 UTC
Post by David Smith
I notice that the results from R are for log-dose. Results from SPSS are for dose. The results have to be different.
Results from Stata for log-dose using logistic regression (400 observations)
Coef. Std. Err. z
constant -6.893981 1.08167 -6.37
lndose .9332907 .1349154 6.92
As I pointed out above, the R syntax for the model was incorrect. Once
corrected, one obtains for the probit link:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.8940 1.0780 -6.395 1.60e-10 ***
log(dose) 0.9333 0.1344 6.944 3.82e-12 ***
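This also answers the frequency-weight question above: glm() accepts the group sizes
either as a two-column cbind(successes, failures) response or as a proportion response
together with the weights argument. A minimal sketch, reusing finney71 and the ldose
variable from the correction above (the object names are just illustrative); both
calls give the same fit:

fm.counts <- glm(cbind(affected, total - affected) ~ ldose,
                 family = binomial(link = probit), data = finney71)
fm.props  <- glm(affected/total ~ ldose, weights = total,
                 family = binomial(link = probit), data = finney71)
all.equal(coef(summary(fm.counts)), coef(summary(fm.props)))   # TRUE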

I am guessing you meant probit rather than logistic regression.
The logit link gives
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.4797 1.8312 -6.269 3.64e-10 ***
log(dose) 1.5539 0.2301 6.753 1.45e-11 ***

If the doses are not logged, the coefficients are wildly different from this,
so I agree I don't know exactly what model was fitted in SPSS, and given that the
OP never responded...

Cheers, David Duffy.
e***@gmail.com
2017-06-22 17:12:15 UTC
Post by b***@gmail.com
Null deviance: 0.513878 on 3 degrees of freedom
Residual deviance: 0.010356 on 2 degrees of freedom
Hopefully you've solved this problem already, but if not, here is the solution.

You've used family=binomial in your model, which fixes the dispersion parameter at 1. Based on your ratio of residual deviance to residual deviance df, you are under-dispersed. The assumption that the dispersion ratio is 1 is what is throwing your results off. Instead, use family=quasibinomial, which allows the dispersion parameter to be estimated and in turn affects the variance estimates, SEs, etc. Give it a try by running one model with family=binomial and one with family=quasibinomial, and notice the SEs for the slopes and intercepts. This should change your goodness of fit and 95% CIs to align almost exactly with the SPSS results, as well as any other pertinent result SPSS spits out. Any minor differences in the numbers are likely due to rounding under the hood of each piece of software and are of no theoretical concern.
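A minimal sketch of that side-by-side comparison, reusing the OP's original
proportion-response model and data (the object names are just illustrative):

fm.bin   <- glm(affected/total ~ log(dose), family = binomial(link = probit),
                data = finney71[finney71$dose != 0, ])
fm.quasi <- glm(affected/total ~ log(dose), family = quasibinomial(link = probit),
                data = finney71[finney71$dose != 0, ])
summary(fm.bin)$coefficients     # SEs computed with the dispersion fixed at 1
summary(fm.quasi)$coefficients   # SEs use the dispersion estimated from the data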

Hope that helps you and anybody else having this issue!

Sincerely,

Edwin R. Burgess IV, PhD
