Lund's Test & Data Transformations

Rich Ulrich

2015-12-12 03:33:01 UTC

On Fri, 11 Dec 2015 13:07:06 -0800 (PST), "Ilovestats!!"

Post by Ilovestats!!
Hi,
My data does not follow a normal distribution. I wanted to run a Lund's Test first to remove outliers from my data then transform my data. Is it a good idea to run both procedures together on my data? Or would I just run one procedure?

Let's see here. This rather turns logic upside down:

The outliers of a reasonable distribution are the scores that
have the most information about what transformation will
work. Using a /test/ to remove outliers before doing a
transformation is like shooting yourself in the foot. Crippling.
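
A small sketch of that point, under loudly stated assumptions: the
scores below are simulated from a log-normal distribution (made-up
data, not the poster's), and "outlier" is judged only by a crude
largest-|z| rule rather than by Lund's test. The helper name
max_abs_z is hypothetical.

```python
import math
import random
import statistics

def max_abs_z(values):
    """Largest absolute z-score in the sample (a crude 'how extreme' gauge)."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return max(abs(v - mean) / sd for v in values)

# Assumption: simulated log-normal scores stand in for skewed real data.
rng = random.Random(1)
scores = [rng.lognormvariate(0.0, 1.0) for _ in range(500)]

raw_z = max_abs_z(scores)                         # extremes on the raw scale
log_z = max_abs_z([math.log(s) for s in scores])  # same scores after a log

print(f"largest |z| raw: {raw_z:.2f}, after log: {log_z:.2f}")
```

The scores that look wildest on the raw scale are the ones that show
a log is needed; once it is taken, they are no longer remarkable.
Deleting them first would throw that evidence away.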

Here are a few guidelines.
1. What transformation is appropriate? Consider /what/ is
being measured, for what purpose.

Counts suggest a Poisson model (or a square root), chemical
concentrations suggest logarithms, distances sometimes suggest
reciprocals, and so on.
But the purpose, which might be a reflection of something like
a "latent factor", may be determinative instead. "Dollars"
are usually untransformed by economists, but as a measure
or latent score for a construct or factor for "wealth" over a
wide range, some transforming is surely needed.

Three purposes of transformation are (a) to achieve
linearity with an outcome; (b) to achieve equal error variance
across the range of the variable; (c) to achieve a normal-
looking distribution. Surprisingly often, all three of
these occur at the same time with natural data... but
that has led to some ignorant reliance on (c), "looks",
when it is the least important of the three. Linearity
matters most for simple model-building; and homogeneous
residual variance matters most for the robustness of the
distribution of the test statistic.
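
Purpose (b) can be checked directly instead of judged by looks. The
following is only a sketch on simulated data with multiplicative
error (an assumption chosen for illustration); fit_line and
spread_ratio are hypothetical helper names. It compares the spread
of residuals in the upper half of x with the spread in the lower
half, before and after taking logs of both variables.

```python
import math
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def spread_ratio(x, y):
    """SD of residuals for the upper half of x over SD for the lower half."""
    a, b = fit_line(x, y)
    pairs = sorted(zip(x, y))  # order cases by x
    resid = [yi - (a + b * xi) for xi, yi in pairs]
    half = len(resid) // 2
    return statistics.pstdev(resid[half:]) / statistics.pstdev(resid[:half])

# Assumption: y is proportional to x with multiplicative (log-normal) error.
rng = random.Random(0)
x = [rng.uniform(1, 100) for _ in range(2000)]
y = [3.0 * xi * math.exp(rng.gauss(0, 0.3)) for xi in x]

raw_ratio = spread_ratio(x, y)
log_ratio = spread_ratio([math.log(v) for v in x], [math.log(v) for v in y])

print(f"residual SD ratio, raw scale: {raw_ratio:.2f}")
print(f"residual SD ratio, log scale: {log_ratio:.2f}")
```

A ratio near 1 means the error variance is about equal across the
range. On data like these the raw scale gives a ratio well above 1,
and the log scale brings it back close to 1.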

2. Are some scores simply unreasonable? This is not a
question for Lund's test. 2a. There are outliers that /need/
to be removed because they are bad data -- data cleaning,
not analysis. 2b. There are outliers that /need/ to be
removed because, while they are not invalid in the sense of
'bad data', they are invalid in the sense of not belonging to
the homogeneous set that the analyses should deal with.
Data of this sort may be set aside from the analyses, and
probably be explained by a note in some eventual report.

3. Throwing away data is usually a bad idea. I have, on a
few occasions, drawn in the outside few percent of scores
in order to avoid the bad effect that the extremes would have
on the test statistics; this is done when I also expect that the
extreme scores reflect more scoring error than actually-
extreme phenomena.
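
One common way to "draw in" the extremes is Winsorizing: the most
extreme scores at each end are replaced by the nearest score that is
kept, so no case is dropped. Whether this matches exactly what was
done in those analyses is an assumption; the function name, the 10%
fraction, and the scores below are made up for illustration.

```python
def winsorize(scores, fraction=0.05):
    """Pull the lowest and highest `fraction` of scores in to the nearest kept value."""
    if not 0 <= fraction < 0.5:
        raise ValueError("fraction must be in [0, 0.5)")
    if not scores:
        return []
    ordered = sorted(scores)
    k = int(len(ordered) * fraction)         # scores pulled in at each tail
    low, high = ordered[k], ordered[-k - 1]  # innermost values left untouched
    return [min(max(s, low), high) for s in scores]

# Hypothetical scores with one very high and one very low value.
scores = [12, 15, 14, 13, 16, 15, 14, 13, 95, 2]
drawn_in = winsorize(scores, fraction=0.10)
print(drawn_in)
```

The sample size and the ordering of cases are preserved; only the
influence of the tails on means and variances is reduced.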

--
Rich Ulrich