Outliers - definition?


By Ben Jelley (P4197) on Monday, April 2, 2001 - 07:26 pm :

I'm carrying out a piece of A-level Statistics coursework. I have heard of outliers and know roughly what one is, but could someone tell me what the actual definition is? My teacher thinks it might be a data point that lies more than 3 standard deviations away from the mean, but he's not sure. Please help, it's due in on Friday!


By Oliver Samson (P3202) on Wednesday, April 4, 2001 - 11:33 pm :

I'm only doing one module in Statistics - S1 (new syllabus), but I believe that there is more than one definition of an outlier. I think 3 Standard Deviations away from the mean is a bit much. In exams I think they're supposed to state any formula they want you to use, but obviously this won't help your coursework. I had a mock today, and the formula they gave involved

(3/SD) x (something)

Unfortunately my memory's not too hot, but this formula generated a number that should be subtracted from Q1 and added to Q3 to give the boundaries for outliers. The "something" might have been to do with the quartiles, but I really don't remember. Sorry I couldn't be more help.


By Brad Rodgers (P1930) on Wednesday, April 4, 2001 - 11:56 pm :

I have no idea what this means as I've never done statistics before, but through web searches I've found that the 'book definition' of an outlier is: "more than 1.5 times the IQR smaller than Q1 or larger than Q3". (I can give you the webpage if you want.)
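The quoted rule can be tried out with a short sketch. This is only an illustration: the function name iqr_outliers, the parameter k and the sample data are made up here, and the quartiles come from Python's statistics.quantiles with its default method, which is just one of several quartile conventions (your syllabus may use another).

```python
from statistics import quantiles


def iqr_outliers(data, k=1.5):
    """Return the values more than k * IQR below Q1 or above Q3."""
    # quantiles(..., n=4) returns the three cut points [Q1, median, Q3].
    # Assumption: the default "exclusive" method is acceptable here.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


# Made-up sample: Q1 = 4, Q3 = 10, so the boundaries are -5 and 19.
sample = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 40]
print(iqr_outliers(sample))  # prints [40]
```

With a different quartile convention the boundaries shift slightly, so borderline points may or may not be flagged.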

Not sure if that's what you need though,

Brad


By Kerwin Hui (Kwkh2) on Thursday, April 5, 2001 - 03:53 pm :
Well, there are also other definitions of outliers. For example, in linear regression you pose the model

y_i = a + b x_i + e_i

where e_i follows a normal distribution N(0, s^2). An outlier in this case would be a point whose omission produces a refitted model with a significantly smaller s^2.
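As a rough sketch of that idea, the code below refits the line with each point left out in turn and reports how the residual variance changes. The helper names fit_line and leave_one_out_s2 and the data are invented for illustration, the variance is estimated with n - 2 degrees of freedom (an assumption), and no threshold for "significantly" is built in, since that is a judgement the post leaves open.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b, s2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    # Residual variance estimate, using n - 2 degrees of freedom.
    s2 = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    return a, b, s2


def leave_one_out_s2(xs, ys):
    """For each point, the s2 of the model refitted without that point."""
    result = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        result.append(fit_line(rest_x, rest_y)[2])
    return result


# Made-up data lying close to y = 2x, except the point at x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 20.0, 12.1, 13.9, 16.2]

full_s2 = fit_line(xs, ys)[2]
for x, s2 in zip(xs, leave_one_out_s2(xs, ys)):
    # A ratio far below 1 means dropping this point shrinks s2 a lot.
    print(x, round(s2 / full_s2, 3))
```

Here only the point at x = 5 gives a ratio far below 1, which is the behaviour the definition describes.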

The requirement of being more than 3 s.d. away from the mean should be OK if you have only one random variable.

Kerwin


By Oliver Samson (P3202) on Thursday, April 5, 2001 - 11:48 pm :

OK then. I heard somewhere that about 70% of the values in a set of data lie less than one S.D. away from the mean, so 3 S.D.s sounded ridiculous, but I hate stats, so it doesn't really matter.


By David Loeffler (P865) on Sunday, April 15, 2001 - 09:48 pm :

Well, that 70% figure is roughly right (it is about 68%) if the data are normally distributed. An outlier is normally the result of something going wrong with your experiment etc., so it will not follow the same normal distribution as the others.
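The figure for one standard deviation, and the corresponding figure for three, can be checked with Python's statistics.NormalDist. This is just a quick sketch; the function name fraction_within is made up for illustration.

```python
from statistics import NormalDist


def fraction_within(k):
    """Probability that a normal value lies within k SDs of its mean."""
    z = NormalDist()  # standard normal: mean 0, SD 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# prints:
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

So for normally distributed data only about 0.27% of values lie more than 3 SDs from the mean.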

That is why 3 SDs may be used; it is highly unlikely that anything could end up so far from the mean 'naturally' (only about 0.3% of normally distributed values do), so something has most likely gone wrong somewhere, and that individual data point should be ignored when fitting distributions to the data.
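A minimal sketch of the 3 SD rule itself, assuming the mean and the sample standard deviation are computed from all the data, including any suspect point. The function name sd_outliers and the readings are made up for illustration.

```python
from statistics import mean, stdev


def sd_outliers(data, k=3):
    """Return the values lying more than k sample SDs from the mean."""
    m = mean(data)
    s = stdev(data)  # sample SD (divides by n - 1)
    return [x for x in data if abs(x - m) > k * s]


# Made-up readings: nineteen values near 10 and one at 50.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 50]
print(sd_outliers(readings))  # prints [50]
```

One thing to watch: the suspect point inflates the mean and SD it is measured against. With ten or fewer values, no point can be more than 3 sample SDs from the mean, so this rule only makes sense for larger data sets.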

David