Arranging groups of numbers into a normal distribution without histogram

Discussion:

(too old to reply)

m***@gmail.com

2016-02-29 21:07:04 UTC

Hi All,

I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.

I'm working Excel...

Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the middle sales tier would have the highest count of locations.

I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.

Is there anyway to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.

Thank you!

Bruce Weaver

2016-02-29 22:50:55 UTC

Permalink

Post by m***@gmail.com
Hi All,
I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.
I'm working Excel...
Currently I have a list of text representing store locations in column A and sales figures in column C. In column B I'm trying to generate a sales tier or bin. I want to organize the bins into a normal data distribution shaped like a bell curve. So the middle sales tier would have the highest count of locations.
I've come up with a few statistical methods which generate the (1) size of each bin, and (2) the number of bins, but I can't seem to get it quite into a normal distribution.
Is there anyway to write a formula to generate the number of bins and bin size to organize the data into a normal distribution? Any help would be appreciated.
Thank you!

Why do you want to do that? I ask, because as George Box famously observed, "in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, [the statistician] can often derive results which match, to a useful approximation, those found in the real world."

Source: Section 2.5 of http://mkweb.bcgsc.ca/pointsofsignificance/img/Boxonmaths.pdf.

HTH.

Rich Ulrich

2016-03-01 08:27:32 UTC

Permalink

My! Arbitrary widths? That sounds like someone is trying
for an entry in some new edition of "How to Lie with Statistics".

Don't do it.

I Googled, and I have just glanced at this: Wikipedia has a nice
article under "Histograms". It has a good discussion of algorithms
for "number of bins".

However, I don't see that it mentions what to do with data that
cover a huge range and deserve to be transformed .... Maybe
I should add that, or suggest it on the discussion page ....

If you have a wide range, it could be natural to use transformation.

Keep it a simple one, and one that is fairly natural for whatever is
measured. I would have to say that the first choice is always "logs".

Second, consider inverting the measure, such as Miles-per-gallon
becoming Gallons-per-100 miles. - I once saw a very nice data
presentation concerning the records at various distances for
track meets, which achieved fine unification by converting all
"times" for the records into "speed". They plotted 100 years of
world-records, all distances, for males and females.

For random counts of events, it could be that the square-root
will be the "natural" transformation. What would inhibit me from
using it is that it is seldom used by others. However, it is another
one to consider.

--
Rich Ulrich

Herman Rubin

2016-03-01 21:07:00 UTC

Permalink

Post by Rich Ulrich

Post by m***@gmail.com
Hi All,
I'm trying to figure out a formula or method to organize a list of numbers into a normal distribution shaped like a bell curve.
I'm working Excel...

One of my colleagues had on his door a sign, the exact wording of
which I do not remember, saying that friends should prevent friends
from using Excel.

Post by Rich Ulrich

Post by m***@gmail.com
Currently I have a list of text representing store locations in column

A and sales figures in column C. In column B I'm trying to generate
a sales tier or bin. I want to organize the bins into a normal data
distribution shaped like a bell curve. So the middle sales tier would
have the highest count of locations.

Post by Rich Ulrich

Post by m***@gmail.com
I've come up with a few statistical methods which generate the (1)

size of each bin, and (2) the number of bins, but I can't seem to get
it quite into a normal distribution.

Post by Rich Ulrich

Post by m***@gmail.com
Is there anyway to write a formula to generate the number of bins and

bin size to organize the data into a normal distribution? Any help would
be appreciated.

Post by Rich Ulrich

Post by m***@gmail.com
Thank you!

My! Arbitrary widths? That sounds like someone is trying
for an entry in some new edition of "How to Lie with Statistics".
Don't do it.
I Googled, and I have just glanced at this: Wikipedia has a nice
article under "Histograms". It has a good discussion of algorithms
for "number of bins".
However, I don't see that it mentions what to do with data that
cover a huge range and deserve to be transformed .... Maybe
I should add that, or suggest it on the discussion page ....
If you have a wide range, it could be natural to use transformation.
Keep it a simple one, and one that is fairly natural for whatever is
measured. I would have to say that the first choice is always "logs".
Second, consider inverting the measure, such as Miles-per-gallon
becoming Gallons-per-100 miles. - I once saw a very nice data
presentation concerning the records at various distances for
track meets, which achieved fine unification by converting all
"times" for the records into "speed". They plotted 100 years of
world-records, all distances, for males and females.
For random counts of events, it could be that the square-root
will be the "natural" transformation. What would inhibit me from
using it is that it is seldom used by others. However, it is another
one to consider.

One should never use a transformation without having a good reason
not depending on the observed distribution to do it. Transforming to
a normal distribution makes ALL subsequent analyses suspect.

I suggest you discuss your problem with someone who understands
statistical theory, and who will try to get from YOU the underlying
assumptions. ALL statistical procedures have assumptions, and the
theorist is in a position to help you find the necessary approximations.

--
This address is for information only. I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
***@stat.purdue.edu Phone: (765)494-6054 FAX: (765)494-0558

Rich Ulrich

2016-03-02 01:28:17 UTC

Permalink

On Tue, 1 Mar 2016 21:07:00 -0000 (UTC), Herman Rubin

Post by Herman Rubin
One should never use a transformation without having a good reason
not depending on the observed distribution to do it. Transforming to
a normal distribution makes ALL subsequent analyses suspect.

I would ask that the second sentence be started with some
qualifier like "Arbitrarily..."

Trying to analyze a distribution with tests that have normality
assumptions, when you can be sure that the metric in use
has great heterogeneity of variance, is a way to assure
that your statistical tests are wrong.

--
Rich Ulrich

m***@gmail.com

2016-03-02 17:53:10 UTC

Permalink

Thanks for all the replies guys. I have read up in depth on the Wikipedia page regarding histograms. I have come up with a few formulas that get my distribution as close to normal as possible. One was the Sturges' method.

In it the formulas I use are:

# of bins = log(# of data points)+1
bin size = (max of data points - min of data points)/# of bins

This always gets me close to a normal distribution. I was just wondering if anyone had anything better. Thanks!

Herman Rubin

2016-03-02 19:57:51 UTC

Permalink

Post by Rich Ulrich
On Tue, 1 Mar 2016 21:07:00 -0000 (UTC), Herman Rubin

I would ask that the second sentence be started with some
qualifier like "Arbitrarily..."
Trying to analyze a distribution with tests that have normality
assumptions, when you can be sure that the metric in use
has great heterogeneity of variance, is a way to assure
that your statistical tests are wrong.

There are a few, such as tests for independence, which might still
go over, but how much better are they than nonparametric tests?

Testing for a correlation coefficient of 0 is fairly robust, but
testing for any other value, or testing whether two are equal, is
highly dependent on higher moments.

Once Gauss proved the Gauss-Markov Theorem, he ceased worrying
about whether the errors of observation were normal. And the
word "normal" was introdued by Quetelet, who stated that it was
the distribution of properties of a "normal" person, and the
biologists and social scientists have been misusing it ever since.

m***@gmail.com

2016-03-02 22:04:02 UTC

Permalink

Thanks Herman, but I think we are getting slightly off topic here :).

I have a list of locations and their related sales volumes.

Assuming the data is a normal distribution then all I'm trying to do is determine a formula that:
(1) gives me the optimal number of bins and
(2) bin ranges which create a normal distribution

so that the locations fall into the bin ranges in a normal distribution pattern. i.e. the highest count of locations will be in the middle tier.

Thanks for the interesting commentary though!

m***@gmail.com

2016-03-02 22:19:54 UTC

Permalink

What are your thoughts on if I use percentiles to set my ranges? So I group locations and their sales into groups based on percentiles from the bell curve. Example:

Tier A Upper Bound: Top 100% or Max of Data Set
Tier B Upper Bound: Top 97.5%
Tier C Upper Bound: Top 84%
Tier D Upper Bound: Top 50%
Tier E Upper Bound: Bottom 16%
Tier F Upper Bound: Bottom 2.5%

That way it forces it into a normal distribution.

Rich Ulrich

2016-03-03 02:49:15 UTC

Permalink

Post by m***@gmail.com
Tier A Upper Bound: Top 100% or Max of Data Set
Tier B Upper Bound: Top 97.5%
Tier C Upper Bound: Top 84%
Tier D Upper Bound: Top 50%
Tier E Upper Bound: Bottom 16%
Tier F Upper Bound: Bottom 2.5%
That way it forces it into a normal distribution.

Yeah, but nobody does that. Notice, defining the points that
way makes the histogram totally redundant -- all you need
to show is the cutoffs for the ranges.

If you want to show "typical" points along distribution,
you are tryingn to start with "normal" cutoffs, which I think
most readers will not relate to. (Nobody does that.)

I suggest using the percentiles 0-25-50-75-100.
Or, similarly, showing what scores mark the deciles.

--
Rich Ulrich

Gordon Sande

2016-03-03 00:32:49 UTC

Permalink

Post by m***@gmail.com
Thanks Herman, but I think we are getting slightly off topic here :).
I have a list of locations and their related sales volumes.

Business data is more likely to be exponential than normal. Firm sizes
are an accretion of multiplicative effects rather than additive effects
in the usual economic theory.

Post by m***@gmail.com
Assuming the data is a normal distribution then all I'm trying to do is
(1) gives me the optimal number of bins and
(2) bin ranges which create a normal distribution
so that the locations fall into the bin ranges in a normal distribution
pattern. i.e. the highest count of locations will be in the middle
tier.
Thanks for the interesting commentary though!