m***@gmail.com
2015-01-12 04:26:56 UTC
Hi all,
I'm currently working on a data set with two sets of samples. The CSV file of the data can be found here: http://pastebin.com/200v10py
I would like to use the KS test to see whether these two sets of samples come from different distributions.
I ran the following R script:
# read data from the file
data = read.csv('data.csv')
ks.test(data[[1]], data[[2]])

        Two-sample Kolmogorov-Smirnov test

data: data[[1]] and data[[2]]
D = 0.025, p-value = 0.9132
alternative hypothesis: two-sided

The KS test shows that these two samples are very similar. (In fact, they should come from the same distribution.)

ks.test(scale(data[[1]]), scale(data[[2]]))

        Two-sample Kolmogorov-Smirnov test

data: scale(data[[1]]) and scale(data[[2]])
D = 0.3273, p-value < 2.2e-16
alternative hypothesis: two-sided

The p-value becomes almost zero after normalization, indicating that the two samples are significantly different (i.e., come from different distributions).
My question is: how can normalization make two similar samples become different from each other? I can see that if two samples are different, normalization could make them look similar. However, if two sets of data are similar, then intuitively, applying the same operation to both should keep them similar, or at least not make them very different from each other.
I did some further analysis of the data. I also tried normalizing the data into the [0,1] range (using the formula (x-min(x))/(max(x)-min(x))), but the same thing happened. At first, I thought outliers might have caused this problem (I can see how an outlier could cause this if I normalize the data into the [0,1] range), so I deleted all data whose absolute value is larger than 4 standard deviations. But it still didn't help.
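In case the description above is unclear, here is a rough sketch of what I tried (not my exact script; in particular, the 4-SD trimming rule is written below as distance from the mean, which is only one way to read it):

# rough sketch of the extra normalizations described above (not the exact script)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))        # rescale to [0,1]
trim4sd <- function(x) x[abs(x - mean(x)) <= 4 * sd(x)]       # drop points beyond 4 SDs of the mean

ks.test(minmax(data[[1]]), minmax(data[[2]]))                   # [0,1]-normalized samples
ks.test(scale(trim4sd(data[[1]])), scale(trim4sd(data[[2]])))   # z-scored after removing outliers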
Plus, I even plotted the eCDFs, and they *really* look the same to me, even after normalization. Is anything wrong with my usage of the R function?
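For what it's worth, the eCDF plot was made roughly like this (base R; a sketch rather than the exact commands I used):

# overlay the eCDFs of the two scaled samples
plot(ecdf(scale(data[[1]])), main = "eCDFs of scaled samples")
lines(ecdf(scale(data[[2]])), col = "red")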
Since the data contains ties, I also tried ks.boot ( http://sekhon.berkeley.edu/matching/ks.boot.html ), but I got the same result.
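For reference, the ks.boot call looked roughly like this (ks.boot is from the Matching package; nboots = 1000 is just an illustrative value, not necessarily what I used):

# bootstrap version of the KS test that tolerates ties (Matching package)
library(Matching)
ks.boot(scale(data[[1]]), scale(data[[2]]), nboots = 1000)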
Could anyone help me explain why this happened? Also, do you have any suggestions about hypothesis testing on normalized data? (The data I have right now is simulated. In the real world, I cannot get the raw data, only the normalized data.)
Regards,
-Monnand