Обсуждение: finding medians

Поиск
Список
Период
Сортировка

finding medians

От
Josh Burdick
Дата:
At the end of this message is some code I used to find medians.   It's kind of a hack, but approximately works, and
isintended as a 
 
somewhat awkward stopgap for people who need to use medians.  It 
illustrates the limitations of the current aggregate function setup, 
which works so nicely for avg() and stddev().   I don't have any good solutions.  I tried using a float4[] to store 
each element as it's added, but I couldn't get array updates working in 
PL/PgSQL, so that didn't help.   Perhaps aggregate functions could be passed an array?  Or a cursor, 
pointing at the first line?  I'm not sure.
   Anyways, perhaps it'll be helpful.   Josh

-- 
Josh Burdick
jburdick@gradient.cis.upenn.edu
http://www.cis.upenn.edu/~jburdick



/* Implementing median-finding in "pure Postgres."  Does this by
copying data to a temporary table.
  A weakness of this code is that it uses sorting, instead of
Hoare's linear-time median algorithm.  Presumably sorting is
implemented so efficiently that it'll be faster than anything
written in PL/PgSQL.  (Although Hoare's algorithm implemented
in C would be faster than either.)
  BUG: this isn't properly set up to deal with multiple users.
For example, if A computes a median, then B could read the data
from the median_tmp table.  Possibly you could fiddle with
transaction isolation levels, or add a user field to median_tmp,
or something else complicated, to prevent this, but for now I'm
not worrying about this.
  Written by Josh Burdick (jburdick@gradient.cis.upenn.edu).
Anyone can use this under the same license as Postgres.
  20020524, jtb: started. */

drop aggregate median(float4);
drop table median_tmp;
drop sequence median_id;
drop index median_tmp_median_id;
drop function median_sfunc_float4(bigint, float4);
drop function median_finalfunc_float4(bigint);

create sequence median_id;
create table median_tmp ( median_id int, x float4
);
create index median_tmp_median_id on median_tmp(median_id);

create function median_sfunc_float4
(bigint, float4) returns bigint as '

insert into median_tmp
values (case when $1 = 0 then nextval(''median_id'') else $1 end, $2);

select currval(''median_id'');

' language 'SQL';

create function median_finalfunc_float4
(bigint) returns float4 as '
declare

i bigint;
n bigint;
c refcursor;
m float4;
m1 float4;

begin

n := (select count(*) from median_tmp where median_id = $1);

open c for select x from median_tmp where median_id = $1 order by x;

for i in 1..((n+1)/2) loop fetch c into m;
end loop;

/* if n is even, fetch the next value, and average the two */
if (n % int8(2) = int8(0)) then fetch c into m1; m := (m + m1) / 2;
end if;

delete from median_tmp where median_id = $1;

return m;

end
' language 'plpgsql';

create aggregate median ( basetype = float4, stype = bigint, initcond = 0, sfunc = median_sfunc_float4, finalfunc =
median_finalfunc_float4
);








Re: finding medians

От
Josh Berkus
Дата:
Josh,

>     At the end of this message is some code I used to find medians.
>     It's kind of a hack, but approximately works, and is intended as a
> somewhat awkward stopgap for people who need to use medians.  It
> illustrates the limitations of the current aggregate function setup,
> which works so nicely for avg() and stddev().

Actually, finding the median is one of the classic SQL problems.  You can't do
it without 2 passes through the data set, disallowing the use of traditional
aggregate functions.  Joe Celko has half a chapter devoted to various methods
of finding the median.

Can I talk you into submitting your code to Techdocs?  I'd love to have it
somewhere where it won't get buried in the mailing list archives.

--
-Josh Berkus



Re: finding medians

От
"Dann Corbit"
Дата:
ACK!  Sorting to find a median is criminal.

"Introduction to Algorithms" by Thomas H. Cormen, Charles E. Leiserson,
Ronald L. Rivest
ISBN: 0262031418
explains the better algorithm very well.

Here is a freely available C++ template (written by me) for a bunch of
statistics (everything *but* the selection problem):
ftp://cap.connx.com/pub/tournament_software/STATS.HPP

It uses this template for improved summation accuracy:
ftp://cap.connx.com/pub/tournament_software/Kahan.Hpp

Here is an outline for selection.  I wrote it in C++, but a rewrite to C
is trivial:

// Quickselect: find Kth smallest of first N items in array A
// recursive routine finds Kth smallest in A[Low..High]
// Etype: must have copy constructor, oeprator=, and operator<
// Nonrecursive driver is omitted.

template < class Etype >
void
QuickSelect (Etype A[], int Low, int High, int k)
{   if (Low + Cutoff > High)       InsertionSort (&A[Low], High - Low + 1);   else   {
// Sort Low, Middle, High       int Middle = (Low + High) / 2;
       if (A[Middle] < A[Low])           Swap (A[Low], A[Middle]);       if (A[High] < A[Low])           Swap (A[Low],
A[High]);      if (A[High] < A[Middle])           Swap (A[Middle], A[High]); 

// Place pivot at Position High-1       Etype Pivot = A[Middle];       Swap (A[Middle], A[High - 1]);

// Begin partitioning       int i, j;       for (i = Low, j = High - 1;;)       {           while (A[++i] < Pivot);
     while (Pivot < A[--j]);           if (i < j)               Swap (A[i], A[j]);           else               break;
    } 

// Restore pivot       Swap (A[i], A[High - 1]);

// Recurse: only this part changes       if (k < i)           QuickSelect (A, Low, i - 1, k);       else if (k > i)
     QuickSelect (A, i + 1, High, k);   } 
}


template < class Etype >
void
QuickSelect (Etype A[], int N, int k)
{   QuickSelect (A, 0, N - 1, k - 1);
}

If you want to use this stuff to improve statistics with vacuum, be my
guest.


Re: finding medians

От
"Dann Corbit"
Дата:
Here is a program written in C that demonstrates 2 median/selection
computation techniques:
ACM Algorithm 727 (implementation by Sherif Hashem)
QuickSelect (implemented by me).

Since it is written in C, it would be useful to PostgreSQL project
without any fanfare.
ftp://cap.connx.com/pub/chess-engines/new-approach/727.c

The ACM agorithm 727 is an approximation.
The QuickSelect result is exact.



Re: finding medians

От
Tom Lane
Дата:
Josh Burdick <jburdick@gradient.cis.upenn.edu> writes:
> illustrates the limitations of the current aggregate function setup, 
> which works so nicely for avg() and stddev().
>     I don't have any good solutions.  I tried using a float4[] to store 
> each element as it's added, but I couldn't get array updates working in 
> PL/PgSQL, so that didn't help.
>     Perhaps aggregate functions could be passed an array?  Or a cursor, 
> pointing at the first line?  I'm not sure.

I don't think that would help.  The real problem here is the amount of
internal storage needed.  AFAIK there are no exact algorithms for finding
the median that require less than O(N) workspace for N input items.
Your "hack" with a temporary table is not a bad approach if you want to
work for large N.

There are algorithms out there for finding approximate medians using
limited workspace; it might be reasonable to transform one of these into
a Postgres aggregate function.
        regards, tom lane


Re: finding medians

От
Hannu Krosing
Дата:
On Fri, 2002-05-31 at 01:16, Josh Burdick wrote:
> BUG: this isn't properly set up to deal with multiple users.
> For example, if A computes a median, then B could read the data
> from the median_tmp table.  Possibly you could fiddle with
> transaction isolation levels, or add a user field to median_tmp,
> or something else complicated, to prevent this, but for now I'm
> not worrying about this.

You could just use temp tables and indexes - they are local to
connection .

create TEMP sequence median_id;
create TEMP table median_tmp ( median_id int, x float4
);

---------------
Hannu




Re: finding medians

От
"Dann Corbit"
Дата:
> -----Original Message-----
> From: Hannu Krosing [mailto:hannu@tm.ee]
> Sent: Thursday, May 30, 2002 1:17 PM
> To: Josh Burdick
> Cc: pgsql-hackers@postgresql.org; josh@agliodbs.com
> Subject: Re: [HACKERS] finding medians
>
>
> On Fri, 2002-05-31 at 01:16, Josh Burdick wrote:
> > BUG: this isn't properly set up to deal with multiple users.
> > For example, if A computes a median, then B could read the data
> > from the median_tmp table.  Possibly you could fiddle with
> > transaction isolation levels, or add a user field to median_tmp,
> > or something else complicated, to prevent this, but for now I'm
> > not worrying about this.
>
> You could just use temp tables and indexes - they are local to
> connection .
>
> create TEMP sequence median_id;
> create TEMP table median_tmp (
>   median_id int,
>   x float4
> );

Another pure SQL solution would be to perform an order by on the column
of interest.

Use a cursor to seek to the middle element.  If the data set is odd,
then the median is the center element.  If the data set is even, the
median is the average of the two center elements.

A SQL function would be pretty easy.  Of course, it is not the most
efficient way to do it.  A nice thing about having the data ordered is
that you can then extract the statistic for any kth partition.  Hence,
you could generate a function quantile() and call quantile (0.5) to get
the median.  If you have a query:
select quantile(.1, col), quantile(.2, col), quantile(.3,col), ...
quantile(.9, col) from sometable
you only have to do the sort once and then operate on the sorted data.
For queries like that, sorting is probably just fine, since the
selection algorithm is only approximately linear for each single
instance.