Re: how to identify outliers

From	Chris Spotts
Subject	Re: how to identify outliers
Date	October 28, 2009, 12:34:15
Msg-id	00d601ca57ca$ef5229b0$cdf67d10$@com
In reply to	Re: how to identify outliers (Sam Mason <sam@samason.me.uk>)
List	pgsql-general

>
> I'd agree, stddev is probably best and the following should do
> something
> reasonable for what the OP was asking:
>
>   SELECT d.*
>   FROM data d, (
>     SELECT avg(distance), stddev(distance) FROM data) x
>   WHERE abs(d.distance - x.avg) < x.stddev * 2;
>
[Spotts, Christopher]
Statistically speaking, if your dataset has a fairly normal distribution, the
following works "well" and is a *fairly* standard outlier definition.

First get a median function (there are a million of them on the net; just
google for one).
You'll need one pass to get the median.
Divide your data set in half based on that median.
Get the median of the first half (this is Q1).
Get the median of the second half (this is Q3).
Then your range for your good data should be from
(Q1 - (Q3-Q1)*1.5) TO (Q3 + (Q3-Q1)*1.5).
Anything outside that range is an outlier.  Adjust the 1.5 up or down to be
more or less aggressive.
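
The steps above can be sketched in a few lines of Python. This is only an
illustration, not a PostgreSQL or PL/R implementation: the function names
(iqr_bounds, find_outliers) and the sample distances are made up, and it
assumes that for an odd number of rows the median value itself is left out of
both halves (conventions differ on that point).

```python
from statistics import median

def iqr_bounds(values, k=1.5):
    """Return (low, high) limits for "good" data using median-of-halves quartiles."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    # Split around the median; with an odd count the middle value is
    # excluded from both halves (an assumed convention).
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    q1 = median(lower)   # median of the first half
    q3 = median(upper)   # median of the second half
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, k=1.5):
    """Return the values that fall outside the range from iqr_bounds."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if v < low or v > high]

# Hypothetical distances; 95 is the obvious odd one out.
distances = [10, 12, 11, 13, 12, 11, 14, 95]
print(find_outliers(distances))  # prints [95]
```

Raising k (say to 3.0) flags only more extreme values; lowering it flags more.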

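Using the "avg" formula for outliers is bad news: the mean and standard
deviation are themselves pulled around by the very outliers you are looking for.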
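
A small sketch of how a mean-based rule can miss outliers, using made-up
numbers: two extreme values inflate the sample standard deviation enough that
neither lies more than two deviations from the mean. The function name
stddev_outliers is hypothetical. It flags the rows that the quoted query's
WHERE clause would exclude, and it assumes the sample standard deviation,
which is what PostgreSQL's stddev() computes.

```python
from statistics import mean, stdev

def stddev_outliers(values, k=2.0):
    """Return values at least k sample standard deviations away from the mean."""
    avg = mean(values)
    sd = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - avg) >= k * sd]

# Hypothetical distances with two obvious outliers (95 and 100).
distances = [10, 11, 12, 13, 12, 11, 10, 13, 95, 100]

# The two large values drag the mean to 28.7 and the stddev to about 36.3,
# so the cutoff (about 72.6 from the mean) is wider than their own distance.
print(stddev_outliers(distances))  # prints []
```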

I HIGHLY suggest installing PL/R for this, it makes it trivial.

Chris
