Discussion:
Modelling standardized dependent variables
Bruce Bradbury
2014-06-17 14:03:16 UTC
I'm estimating a regression y* = a + bx + e where y* is a z-score (i.e. y* = (y - mean(y))/stddev(y)). The means and standard deviations are estimated from the same sample as the regression.

How do I calculate appropriate standard errors for b, which take account of the estimation of mean(y) and stddev(y)? I would have thought this was a relatively common procedure, but haven't been able to find any literature on it.
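
In case it helps, here is roughly the setup in code (a minimal Python/numpy sketch on made-up data; the variable names are just illustrative). The standard error printed at the end is the usual OLS one, which treats the z-scored y as fixed data and so ignores the estimation of mean(y) and stddev(y):

import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(0, 3, size=n)

# z-score y using the same sample's mean and standard deviation
ystar = (y - y.mean()) / y.std(ddof=0)

# ordinary least squares of ystar on x
X = np.column_stack([np.ones(n), x])
coef = np.linalg.lstsq(X, ystar, rcond=None)[0]
a_hat, b_hat = coef
resid = ystar - X @ coef

# the usual standard error for b, ignoring that mean(y) and stddev(y)
# were estimated from this same sample
se_naive = np.sqrt(resid @ resid / (n - 2) / ((x - x.mean()) ** 2).sum())
print(b_hat, se_naive)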
Herman Rubin
2014-06-17 18:47:25 UTC
Post by Bruce Bradbury
I'm estimating a regression y* = a + bx + e where y* is a z-score
(i.e. y* = (y - mean(y))/stddev(y)). The means and standard deviations are
estimated from the same sample as the regression.
Post by Bruce Bradbury
How do I calculate appropriate standard errors for b, which take account
of the estimation of mean(y) and stddev(y)? I would have thought this
was a relatively common procedure, but haven't been able to find any
literature on it.

For good reason; it is very messy. Also, why should z-scores even
be used? In comparing samples from the same or different populations,
comparing the raw values makes sense, but comparing z-scores does not.
In fact, do z-scores ever make sense?

Anyhow, to reduce the problem to a probability problem, let the
regression of y on x be given by

y = A + Bx + E

B, and also b, are unchanged if the mean of x is 0, and we will
assume this. Then the classical estimates of A and B will have
a covariance matrix which is \sigma^2/n times a diagonal matrix,
and the variance of Bhat is \sigma^2/n divided by the average
of (x-xbar)^2. Now let s^2 be the average of (E-Ebar)^2. Then
stdevhat(y)^2 = Bhat^2*sd(x)^2 + s^2; I am assuming that all
divisions are by n. However, to first order,
b - bhat = (B - Bhat)/stdev(y) + b*(stdevhat(y)/stdev(y) - 1);
that second term enlarges the error. The above shows that the error
consists of two asymptotically orthogonal terms: one is the error in
estimating B, normalized by stdev(y), and the other is the true value
of b times the relative error in estimating the standard deviation of y.
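
A quick Monte Carlo check of that decomposition (a Python/numpy sketch under the above assumptions: mean(x) = 0, normal errors, all divisions by n; the numbers are made up):

import numpy as np

rng = np.random.default_rng(1)
n, B, sig_e, sd_x = 200, 1.0, 2.0, 1.5
sd_y = np.sqrt(B**2 * sd_x**2 + sig_e**2)    # population stdev(y)
b = B / sd_y                                 # true standardized slope

lhs, rhs = [], []
for _ in range(5000):
    x = rng.normal(0, sd_x, n)
    y = B * x + rng.normal(0, sig_e, n)
    X = np.column_stack([np.ones(n), x])
    Bhat = np.linalg.lstsq(X, y, rcond=None)[0][1]
    sd_hat = y.std(ddof=0)                   # stdevhat(y)
    bhat = Bhat / sd_hat                     # same as regressing the z-scored y on x
    lhs.append(b - bhat)
    rhs.append((B - Bhat) / sd_y + b * (sd_hat / sd_y - 1))

# correlation near 1 means the two-term expansion captures the error in bhat
print(np.corrcoef(lhs, rhs)[0, 1])
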
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
***@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558
Bruce Bradbury
2014-06-19 15:54:15 UTC
Herman,
Thanks for the response. Addressing the first part of your comment first. We are using z-scores for two main reasons.

First, our research question is how much of the variation in y is explained by x. So expressing the effect of a binary x in terms of standard deviations in y seems most natural.

Second, we are comparing data from two countries where our y variables are composite scores (of cognitive ability) which are not measured in precisely the same way (and indeed are usually reported in z-score form). Data from other studies, which use different ability measures that are defined in the same way in both countries, do suggest that the standard deviation of ability is the same in the two countries - but I don't think we need that assumption if we stick with the research question in the previous paragraph.

As I write this, it strikes me that a test of whether b is zero should be the same as a test of whether R2 is zero (for which there is an F test), and the latter is the same for standardised or unstandardised y. So is my question the same as asking whether we can put a confidence interval on R2?
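
A quick numerical check of that equivalence (a Python sketch using statsmodels on simulated data; with a single regressor the overall F statistic is just the square of the t statistic for b, and both are unchanged when y is z-scored):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x = rng.binomial(1, 0.5, n)                     # binary x, as in our application
y = 10 + 0.8 * x + rng.normal(0, 2, n)
ystar = (y - y.mean()) / y.std(ddof=0)

X = sm.add_constant(x)
for dep in (y, ystar):
    fit = sm.OLS(dep, X).fit()
    # t^2 for the slope equals the overall F, and the p-values match;
    # the values are identical for the raw and the z-scored y
    print(fit.tvalues[1] ** 2, fit.fvalue, fit.f_pvalue)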

Yet another way of doing this might be to estimate a standard regression on unstandardised y, then divide b by the standard deviation of y. A Taylor series expansion then suggests the standard error of the result will be a function of the standard error of b, the standard error of s(y), and the covariance between the two estimates. Does your statement about orthogonality imply that the covariance will be zero?
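
To make that concrete, here is a sketch of the delta-method calculation (Python/numpy, simulated data; I estimate the two standard errors and the covariance with a simple paired nonparametric bootstrap, which is just one way to get those pieces):

import numpy as np

def slope_and_sd(x, y):
    """OLS slope of y on x (with intercept) and the sample sd of y."""
    X = np.column_stack([np.ones(len(x)), x])
    Bhat = np.linalg.lstsq(X, y, rcond=None)[0][1]
    return Bhat, y.std(ddof=0)

rng = np.random.default_rng(3)
n = 400
x = rng.binomial(1, 0.5, n)
y = 10 + 0.8 * x + rng.normal(0, 2, n)
Bhat, s = slope_and_sd(x, y)

# paired bootstrap of (Bhat, s(y)) to estimate their variances and covariance
boot = np.array([slope_and_sd(x[i], y[i])
                 for i in (rng.integers(0, n, n) for _ in range(2000))])
var_B, var_s = boot.var(axis=0, ddof=1)
cov_Bs = np.cov(boot.T, ddof=1)[0, 1]

# first-order (Taylor / delta-method) variance of the ratio Bhat / s(y)
var_ratio = var_B / s**2 + Bhat**2 * var_s / s**4 - 2 * Bhat * cov_Bs / s**3
print(Bhat / s, np.sqrt(var_ratio))
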
Herman Rubin
2014-06-19 19:12:08 UTC
Post by Bruce Bradbury
Herman,
Thanks for the response. Addressing the first part of your comment
first. We are using z-scores for two main reasons.
Post by Bruce Bradbury
First, our research question is how much of the variation in y is
explained by x. So expressing the effect of a binary x in terms of
standard deviations in y seems most natural.
Post by Bruce Bradbury
Second, we are comparing data from two countries where our y variables
are composite scores (of cognitive ability) which are not measured
in precisely the same way (and indeed are usually reported in z-score
form). Data from other studies, which use different ability measures that
are defined in the same way in both countries, do suggest that the standard
deviation of ability is the same in the two countries - but I don't
think we need that assumption if we stick with the research question in
the previous paragraph.
Post by Bruce Bradbury
As I write this, it strikes me that a test of whether b is zero should
be the same as a test of whether R2 is zero (for which there is an F test),
and the latter is the same for standardised or unstandardised y. So is
my question the same as asking whether we can put a confidence interval
on R2?

No, it is not the same. The difference is that the variance of y
depends on b, and unless one makes the assumption that x is normal,
the joint distribution is essentially impossible to calculate. You
do get more information about b by treating the x's as constants.
Post by Bruce Bradbury
Yet another way of doing this might be to estimate a standard regression
on unstandardised y, then divide b by the standard deviation of y. A Taylor
series expansion then suggests the standard error of the result will
be a function of the standard error of b, the standard error of s(y),
and the covariance between the two estimates. Does your statement about
orthogonality imply that the covariance will be zero?

I was generalizing incorrectly from a simpler problem when I stated
asymptotic orthogonality. Look at the estimates of b and s(y):
bhat = b + cov(x,e)/var(x), shat(y)^2 = bhat^2*var(x) + var(e); these are
sample variances and covariances. The two terms in shat(y)^2 are
asymptotically independent, but, especially if b is large, bhat and
shat(y) are positively dependent. This reduces the error in b/s(y).

So consider bhat/shat(y). Take var(x) = 1 and write qhat = shat(y)/bhat,
so that qhat^2 = 1 + var(e)/bhat^2, with q the corresponding population
value. Then the first-order terms in qhat - q are

[(var(e) - E(var(e)))/b^2 - 2*var(e)*cov(x,e)/b^3] / (2q).

How accurate this is for reasonable sized samples is difficult to decide,
but this is the first term in the expansion of the error.
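
A small simulation to look at both points above, the positive dependence of bhat and shat(y) and the behaviour of that first-order term (a Python/numpy sketch; the x's are held fixed and rescaled so that var(x) = 1, as assumed above, and the numbers are made up):

import numpy as np

rng = np.random.default_rng(4)
n, b, sig_e = 200, 2.0, 1.0
q = np.sqrt(1 + sig_e**2 / b**2)             # population s(y)/b when var(x) = 1

# fixed x's, rescaled to sample mean 0 and sample variance 1
x = rng.normal(size=n)
x = (x - x.mean()) / x.std(ddof=0)

bhats, shats, err, first = [], [], [], []
for _ in range(5000):
    e = rng.normal(0, sig_e, n)
    y = b * x + e
    cxe = np.cov(x, e, ddof=0)[0, 1]
    bhat = b + cxe                           # exact OLS slope here, since var(x) = 1
    shat = y.std(ddof=0)
    bhats.append(bhat)
    shats.append(shat)
    err.append(shat / bhat - q)              # qhat - q
    first.append(((e.var(ddof=0) - sig_e**2) / b**2
                  - 2 * sig_e**2 * cxe / b**3) / (2 * q))

print(np.corrcoef(bhats, shats)[0, 1])       # positive dependence of bhat and shat(y)
print(np.corrcoef(err, first)[0, 1])         # near 1 if the first-order term dominates
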
--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
***@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558