Discussion:
two-sample KS test: data becomes significantly different after normalization
m***@gmail.com
2015-01-12 04:26:56 UTC
Hi all,

I'm currently working on a data set with two sets of samples. The CSV file of the data can be found here: http://pastebin.com/200v10py

I would like to use the KS test to see whether these two sets of samples come from different distributions.

I ran the following R script:

# read data from the file
data = read.csv('data.csv')
ks.test(data[[1]], data[[2]])
Two-sample Kolmogorov-Smirnov test

data: data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided
The KS test shows that these two samples are very similar. (In fact, they should come from the same distribution.)
ks.test(scale(data[[1]]), scale(data[[2]]))
Two-sample Kolmogorov-Smirnov test

data: scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided
The p-value becomes almost zero after normalization, indicating that these two samples are significantly different (i.e., from different distributions).

My question is: how could normalization make two similar samples become different from each other? I can see that if two samples are different, normalization could make them similar. However, if two sets of data are similar, then intuitively, applying the same operation to both should leave them similar, or at least not make them very different from each other.

I did some further analysis of the data. I also tried normalizing the data into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but the same thing happened. At first, I thought outliers might have caused this problem (I can see how an outlier could cause it when normalizing into the [0,1] range), so I deleted all data whose absolute value was larger than 4 standard deviations. But it still didn't help.
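For reference, the extra checks were along these lines (a rough sketch; the helper names are just for illustration, and the trimming step in particular may not match my exact code):

# min-max normalization into [0,1], then KS again
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
ks.test(minmax(data[[1]]), minmax(data[[2]]))

# drop points more than 4 standard deviations out, then KS on the z-scores
trim <- function(x) x[abs(scale(x)) <= 4]
ks.test(scale(trim(data[[1]])), scale(trim(data[[2]])))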

Plus, I even plotted the eCDFs, and they *really* look the same to me, even after normalization. Is there anything wrong with my usage of the R function?
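The overlay was produced roughly like this (a sketch, not the exact plotting code I used):

# overlay the two empirical CDFs of the scaled samples in base R
plot(ecdf(scale(data[[1]])), main = "eCDFs after scaling")
plot(ecdf(scale(data[[2]])), add = TRUE, col = "red")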

Since the data contains ties, I also tried ks.boot ( http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same result.
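The call was essentially the following (assuming the Matching package's ks.boot interface):

library(Matching)   # ks.boot is from Sekhon's Matching package (linked above)
ks.boot(data[[1]], data[[2]], nboots = 1000)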

Could anyone help explain why this happened? Also, any suggestions about hypothesis testing on normalized data? (The data I have right now is simulated. In the real world, I cannot get the raw data, only the normalized version.)

Regards,
-Monnand
Rich Ulrich
2015-01-12 21:20:10 UTC
Post by m***@gmail.com
Hi all,
I'm currently working on a data set with two sets of samples. The CSV file of the data can be found here: http://pastebin.com/200v10py
I would like to use the KS test to see whether these two sets of samples come from different distributions.
# read data from the file
data = read.csv('data.csv')
ks.test(data[[1]], data[[2]])
Two-sample Kolmogorov-Smirnov test
data: data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided
The KS test shows that these two samples are very similar. (In fact, they should come from the same distribution.)
a) I have never imagined that anyone would want to
make z-scores out of data and then do a K-S.

b) It is absolutely stupid to use z-scores for your data
for two groups when they are characterized (each) by
a few huge outliers. If your incoming data is going to be
of that nature ... you might give your sponsors a strong
forewarning that they have already thrown away much of
the information that is apt to be interesting. And that,
therefore, they should be prepared for the very skimpy sort
of analysis that can follow.
Post by m***@gmail.com
ks.test(scale(data[[1]]), scale(data[[2]]))
Two-sample Kolmogorov-Smirnov test
data: scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided
The p-value becomes almost zero after normalization, indicating that these two samples are significantly different (i.e., from different distributions).
My question is: how could normalization make two similar samples become different from each other?
Do you know what is tested by the K-S?
- the samples are sorted together so that one may plot
the cumulative distributions together. The "D" represents
the furthest vertical distance between them. D = 0.025 is
a small distance, on the scale of 0 to 1. D = 0.327 is a very
large distance.
- The K-S test without the Lilliefors correction is weak for
testing the extremes. That is, if the "D = 0.025" happens to
say that the largest 25 scores all belong to sample 2, K-S
will not find that to be significant. If you are going to be
concerned about the outliers - which is what this example deserves,
since the testing is dominated by the big numbers - then a test on the
outliers (some sort of test) will be more direct, and *might* show
a difference between the original scores.
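If you want to see exactly what D measures, you can compute it by hand (a quick sketch; ks.test does the equivalent internally):

# D by hand: the largest vertical gap between the two empirical CDFs,
# evaluated over the pooled sample values
x <- data[[1]]; y <- data[[2]]
grid <- sort(c(x, y))
D <- max(abs(ecdf(x)(grid) - ecdf(y)(grid)))
D   # should match the D reported by ks.test(x, y)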
Post by m***@gmail.com
I can see that if two samples are different, normalization could make them similar. However, if two sets of data are similar, then intuitively, applying the same operation to both should leave them similar, or at least not make them very different from each other.
See above. Obviously, you want to pay more attention to
the nature of the test.
Post by m***@gmail.com
I did some further analysis of the data. I also tried normalizing the data into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but the same thing happened. At first, I thought outliers might have caused this problem (I can see how an outlier could cause it when normalizing into the [0,1] range), so I deleted all data whose absolute value was larger than 4 standard deviations. But it still didn't help.
Hmm. Well, if you deleted the outliers *after* normalizing,
you change very little of what KS tests. Dropping off a few
scores from the top of each will not stop the mid-region of one
sample from being dominated by negative z-scores, owing to
the few very high scores that have to be balanced to maintain
an overall mean of 0.
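To put it concretely (my sketch, with illustrative variable names):

# Trimming *after* scaling keeps the z-scores that were computed with the
# outliers still in, so the bulk of each sample stays where it was.
z1 <- scale(data[[1]]); z2 <- scale(data[[2]])
ks.test(z1[abs(z1) <= 4], z2[abs(z2) <= 4])

# Trimming first and then re-scaling gives z-scores that are no longer
# dominated by the few huge values.
t1 <- data[[1]][abs(scale(data[[1]])) <= 4]
t2 <- data[[2]][abs(scale(data[[2]])) <= 4]
ks.test(scale(t1), scale(t2))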

Your data: Look at it as "millions", and drop off 6 places.

Then: about 95% of the scores truncate to one digit, usually to
the value of 2 or 3. And 5% of the scores range from 55 to 276.
I would never z-score data that look like this (except when looking
for outliers). By the way, the BIG numbers occur with absolute
regularity, as every 20th score (in both samples). Was this an accident?
Did you do this on purpose? Should they not be considered separately?
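If you want to convince yourself of the mechanism, simulate something with that shape (a sketch with made-up numbers that only roughly match your data):

# two samples drawn from the *same* process: mostly 2-4 million,
# with every 20th value huge (55-276 million)
set.seed(1)
n <- 1000
make_sample <- function() {
  x <- runif(n, 2e6, 4e6)
  x[seq(20, n, by = 20)] <- runif(n / 20, 55e6, 276e6)
  x
}
a <- make_sample(); b <- make_sample()

ks.test(a, b)                # raw data: D stays small
ks.test(scale(a), scale(b))  # z-scored: each sample's mean and SD are set by its
                             # own handful of huge values, so the bulk shifts and
                             # D tends to become large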
Post by m***@gmail.com
Plus, I even plotted the eCDFs, and they *really* look the same to me, even after normalization. Is there anything wrong with my usage of the R function?
Since the data contains ties, I also tried ks.boot ( http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same result.
Could anyone help explain why this happened? Also, any suggestions about hypothesis testing on normalized data? (The data I have right now is simulated. In the real world, I cannot get the raw data, only the normalized version.)
--
Rich Ulrich