| By Anna Lapuk Alap on Tuesday, March 19, 2002 - 01:16 am: |
Hello!
Please, could you explain to me in simple words what the term
"weighted" means when applied to statistical terms? For
example, "weighted frequency"? How does it relate to plain
frequency? (I'd appreciate it if you'd give me a formula.)
Thanks so much.
| By Dan Goodman on Tuesday, March 19, 2002 - 02:48 am: |
Probably the best way to explain the term
"weighted" is with an example.
Suppose you wanted to find the mean (average) age of people in
Britain (don't ask me why). Also, suppose that 51% of the
population are female and 49% male (I think this is about right).
If you know that the mean age of women is 45yrs and the mean age of
men is 40yrs, then you can find the mean age of the whole population by
taking the weighted mean of these two, with the population shares as
the weights. In other words, the mean age will be:
(51% × 45yrs + 49% × 40yrs) / 100%
In general, a weighted mean is given by the formula
$m = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$
Here the w's are the "weights" given to the x's. In the example
above, we had $x_1$ being the mean age of women and
$x_2$ being the mean age of men, $w_1$ being the
percentage of women and $w_2$ being the percentage of
men.
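If it helps to see the weighted mean worked through, here is a minimal Python sketch of the formula applied to the ages example. The function name weighted_mean is made up for illustration, and the numbers are just the rough figures quoted in this post.

```python
def weighted_mean(values, weights):
    """Return (w_1*x_1 + ... + w_n*x_n) / (w_1 + ... + w_n)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight


mean_ages = [45, 40]  # x_1 (women), x_2 (men), in years
shares = [51, 49]     # w_1, w_2, in percent (rough figures from the post)

print(weighted_mean(mean_ages, shares))  # weighted mean: 42.55
print(sum(mean_ages) / len(mean_ages))   # unweighted mean: 42.5
```

The unweighted mean treats the two groups as equally large. The weighted mean comes out slightly higher here because the older group makes up slightly more of the population.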
| By Anna Lapuk on Tuesday, March 19, 2002 - 06:15 pm: |
Thank you very much! This is very illustrative! So, these
"weights" are used to take into account an additional
characteristic which divides the dataset into subsets, right?
Perhaps you could also clarify the meaning and applicability of
variance and standard deviation for me? I do understand that the
standard deviation is a measure of how spread out the
data set is and is also the square root of the variance. But what the
variance itself reflects is a bit vague to me. Its definition is
"the measure of spread of a distribution". What does this mean in
simple words? Could you give me an example to get a feel for the difference
between these two?
| By Dan Goodman on Wednesday, March 20, 2002 - 12:09 am: |
Yes, that's what weighting usually seems
to be used for.
Variance and standard deviation are a bit more complicated. You're
right that the standard deviation is a measure of how spread out a
data set is, but that's very vague. There is a formula for variance which you
probably know:
Variance = $s^2 = \frac{(x_1 - m)^2 + (x_2 - m)^2 + \dots + (x_n - m)^2}{n}$
(I hope I got that right)
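To make the variance formula concrete, here is a small Python sketch that computes the mean, the variance (dividing by n, as in the formula above) and the standard deviation. The data set is made up, and the function names are only illustrative.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """Average squared distance from the mean (dividing by n)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def standard_deviation(xs):
    """Square root of the variance, in the same units as the data."""
    return math.sqrt(variance(xs))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values, for illustration only

print(mean(data))                # 5.0
print(variance(data))            # 4.0
print(standard_deviation(data))  # 2.0
print(max(data) - min(data))     # range (biggest minus smallest): 7
```

Note that the range only looks at the two extreme values, while the variance uses every element of the data set.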
The question is - why are variance and standard deviation the best
measure of spread of a data set? After all, the difference between
the biggest and smallest elements of a data set is also a measure
of the spread.
The problem is, unless you know about the normal (or Gaussian)
distribution I don't think I can explain why variance and standard
deviation are more important than these other measures of spread.
Basically, it turns out that knowing the mean and variance of a
large data set tells you all you need to know for most
purposes.
A good example is "confidence intervals". Suppose you have a data
set of ages of people, you've collected the ages of 1000 randomly
selected people and worked out the mean m and the standard
deviation s of this data set. What you really want to know is the
mean age of everyone in the country, but there are 60,000,000
people in the UK so you don't want to go and find out everyone's
age. What you can say is that with 95% confidence the mean age of
people in the UK is between m-a and m+a for some number a. The
point is that the number a depends only on the variance of the data
set and on how many people you sampled (it would be a bit complicated
to explain how to calculate a, but it can be done).
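For the curious, here is a rough Python sketch of one common way to get the number a. It assumes the usual normal approximation for a large random sample, where a is about 1.96 × s / √n and n is the sample size. The mean and standard deviation below are hypothetical, not real UK figures, and half_width_95 is just an illustrative name.

```python
import math

# Approximate 97.5th percentile of the standard normal distribution;
# this is the multiplier behind a two-sided 95% interval.
Z_95 = 1.96


def half_width_95(s, n):
    """Half-width a of an approximate 95% confidence interval for the mean.

    Assumes a large random sample so the normal approximation applies.
    s is the sample standard deviation, n is the sample size.
    """
    return Z_95 * s / math.sqrt(n)


m, s, n = 41.0, 20.0, 1000  # hypothetical sample mean, std dev, and size
a = half_width_95(s, n)     # about 1.24 for these numbers
print(f"95% interval: {m - a:.2f} to {m + a:.2f}")
```

Under this approximation a gets smaller as the sample grows: quadrupling n halves a.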
I'm sorry this explanation is a bit useless, I can't really think
of an easy way of explaining it. Someone else on this site might
post something better.
| By Anna Lapuk on Wednesday, March 20, 2002 - 01:22 am: |
OK, this seems to make sense to me somehow. So the distribution
function is related to the standard deviation in this manner:
f(x) ~ 1/s × exp(1/s^2). So the larger s is, the smaller f(0) is (if x=0,
f(x) ~ 1/s), and hence the wider the shape of the distribution graph and
the bigger the fraction of x with a larger deviation from the mean.
In other words, the bigger the standard deviation, the more "spread
out" the dataset. This is my understanding of standard deviation and
its relationship with the Gaussian distribution of any dataset. But
when I come to the variance and its meaning, I don't feel a
difference between it and the standard deviation. Or is there no
difference in the sense of characterizing the dataset? Does it matter
which one you use, the variance or the standard deviation? Is there
any nuance in their meaning or use?
| By Dan Goodman on Wednesday, March 20, 2002 - 02:06 am: |
Er, yes. Actually, the Gaussian
probability distribution function (which I think is what you're
talking about above) is
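$f(x) = \frac{1}{\sqrt{2\pi}\, s} \exp\left(-\frac{(x-m)^2}{2s^2}\right)$
The variance is just the square of the standard deviation, so if v
is the variance then the above equation is just:
$f(x) = \frac{1}{\sqrt{2\pi v}} \exp\left(-\frac{(x-m)^2}{2v}\right)$
So, taking the mean m to be 0, $f(0) = \frac{1}{\sqrt{2\pi v}}$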
In other words, if v is bigger then f(0) is smaller and the
distribution is more spread out, just the same as for the standard
deviation. For technical reasons, it is easier to prove things
about variance than it is to prove things about the standard
deviation (because of the square root sign), but standard deviation
is better for thinking about the data, since it is in the same units
as the data - that's why we use both...
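As a quick numerical check of the point about the height of the curve, here is a minimal Python sketch of the Gaussian density written in terms of the variance v. It assumes a mean of 0 for simplicity, and the function name gaussian_pdf is only illustrative.

```python
import math


def gaussian_pdf(x, m, v):
    """Gaussian density with mean m and variance v (v must be positive)."""
    return math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)


# The height of the curve at the mean shrinks as the variance grows,
# which is the same thing as the curve getting wider.
for v in [1.0, 4.0, 9.0]:
    s = math.sqrt(v)  # the standard deviation is the square root of v
    peak = gaussian_pdf(0.0, 0.0, v)
    print(f"variance={v}, std dev={s}, peak height={peak:.4f}")
```

Whether you describe the curve by v or by s, the shape is the same. Only the number you quote changes.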