Обсуждение: PG & random() strangeness

Поиск
Список
Период
Сортировка

PG & random() strangeness

От
"Sergey E. Koposov"
Дата:
Hello,

I'm getting strange results with PostgreSQL random() function. It would be
great if someone could either show where I am wrong or PG is wrong. Here
is what I do (PG 8.4.3, x86_64 platform);

I basically try to create the table with the column filled with random
numbers (either integer or doubles). And I'm getting too many collisions.

Here is the SQL code:
-----------------------

begin;
select setseed(0);
create temporary table tmpx1 as select ((random())) as id from
     generate_series(1,100000);
select id ,count(*)  from tmpx1 group by (id) having count(*)>1;

drop table tmpx1;

select setseed(0);
create temporary table tmpx2 as select ((random())*9223372036854775806)::bigint as id from
     generate_series(1,100000);
select id ,count(*)  from tmpx2 group by (id) having count(*)>1;
commit;
----
Here is the output:

------------
BEGIN
  setseed
---------

(1 row)

SELECT
          id         | count
--------------------+-------
   0.519465064629912 |     2
  0.0100625408813357 |     2
   0.394671014975756 |     2
(3 rows)

DROP TABLE
  setseed
---------

(1 row)

SELECT
          id          | count
---------------------+-------
  4791219551230492672 |     2
    92810558184620032 |     2
  3640197603284484096 |     2
(3 rows)

COMMIT
-------------

So among 10^5 random numbers there are already 3 collisions. Which doesn't
seem right for the function which generate randoms of double precision

It is also interesting that in the table tmpx1 (if I actually output the
values using select id from tmpx1) I do not see equal numbers at all but I see
for example two values which are close to each other:
    0.511193082667887
    0.511194097343832

First it seems strange that they have been merged into one group by "groupby".
Although I understand all standard issues related to the storage of
floats and their comparison, I don't get why two numbers which
differ by much less than precision were actually merged into one (given
the double datatype). Although I'm surprized by that result, I can accept
it because "groupby" doesn't make too much sense with floats anyway. But
then I dont understand why am I getting the same problems when try to
scale my  random number to bigint (the part of the example with tmpx2
table).

In order to understand what's going on. I also tried to produce a simple
program which replicates what PG does when generating random numbers
(from utils/adt/float.c). I also use the same seed=0 to get the same
numbers.

----------------
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#define MAX_RANDOM_VALUE  (0x7FFFFFFF)

main()
{
   int i;
   int iseed = (int) (0 * MAX_RANDOM_VALUE);
   srandom((unsigned int) iseed);
   for(i=0; i<=100000; i++)
   {
     double result = (double) random() / ((double) MAX_RANDOM_VALUE + 1);
     fprintf(stdout,"%.20f\n",result);
   }
}

--------------

And this program ./a.out | sort -n | uniq -c Doesn't produce any
duplicates at all. There are two close numbers
     0.511193082667887
     0.511194097343832
exactly the same as what PG sees.

So it doesn't seem like related to some random generator problems. So it
rather seems to me that the floating precision is lost somewhere in PG
(although I may be wrong).

Does anyone have an explanation for what I see ? Did I miss something
obvious ?

Thanks in advance,
     Sergey

*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, University of Cambridge, UK
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru

Re: PG & random() strangeness

От
Tom Lane
Дата:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> So among 10^5 random numbers there are already 3 collisions. Which doesn't
> seem right for the function which generate randoms of double precision

The underlying random() function only generates 31-bit integers, so
collisions aren't as improbable as they might seem.  See
http://en.wikipedia.org/wiki/Birthday_paradox

> It is also interesting that in the table tmpx1 (if I actually output the
> values using select id from tmpx1) I do not see equal numbers at all but I see
> for example two values which are close to each other:
>     0.511193082667887
>     0.511194097343832

> First it seems strange that they have been merged into one group by "groupby".

They aren't; or at least you've not provided any evidence that they were.

            regards, tom lane

Re: PG & random() strangeness

От
"justin@magwerks.com"
Дата:
---- Message from "Sergey E. Koposov" <math@sai.msu.ru> at 05-04-2010 06:36:23 PM ------
Hello,

I'm getting strange results with PostgreSQL random() function. It would be
great if someone could either show where I am wrong or PG is wrong. Here
is what I do (PG 8.4.3, x86_64 platform);

I basically try to create the table with the column filled with random
numbers (either integer or doubles). And I'm getting too many collisions.

Here is the SQL code:
-----------------------

begin;
select setseed(0);
create temporary table tmpx1 as select ((random())) as id from
  generate_series(1,100000);
select id ,count(*)  from tmpx1 group by (id) having count(*)>1;


***snip**
Random() does not guarantee the results will not repeat.  its just like rolling dice  every time the dice rolls its an independent result so it will repeat at some point, so  every time Random() is called its an independent event that has no idea about past results.   You are calling random 100,000 times. 


To create a random list of numbers that don't repeat you have to filter the repeated values out one at a time, with another function.

Tip trying to use sql to filter out duplicated values will sort the result set first before filtering duplicates resulting in a not so random list of numbers.




All legitimate Magwerks Corporation quotations are sent in a .PDF file attachment with a unique ID number generated by our proprietary quotation system. Quotations received via any other form of communication will not be honored.

CONFIDENTIALITY NOTICE: This e-mail, including attachments, may contain legally privileged, confidential or other information proprietary to Magwerks Corporation and is intended solely for the use of the individual to whom it addresses. If the reader of this e-mail is not the intended recipient or authorized agent, the reader is hereby notified that any unauthorized viewing, dissemination, distribution or copying of this e-mail is strictly prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and destroy all occurrences of this e-mail immediately.
Thank you.

Re: PG & random() strangeness

От
"Sergey E. Koposov"
Дата:
On Tue, 4 May 2010, Tom Lane wrote:

> "Sergey E. Koposov" <math@sai.msu.ru> writes:
>> So among 10^5 random numbers there are already 3 collisions. Which doesn't
>> seem right for the function which generate randoms of double precision
>
> The underlying random() function only generates 31-bit integers, so

Okay, the fact that they are 31bit, instead of 64 bit seems to be part of
the issue, I agree. I forgot about that, and that should obviously highly
increase the number of collisions. (it maybe worthwile to add the
31bittness of random to the docs, because the fact that it returns double
may be confusing (as it was to me)).

>> It is also interesting that in the table tmpx1 (if I actually output the
>> values using select id from tmpx1) I do not see equal numbers at all but I see
>> for example two values which are close to each other:
>>     0.511193082667887
>>     0.511194097343832
>
>> First it seems strange that they have been merged into one group by "groupby".
>
> They aren't; or at least you've not provided any evidence that they were.

I think I did, since:

## select id ,count(*) from tmpx group by (id) having count(*)>1;
          id         | count
--------------------+-------
   0.519465064629912 |     2
  0.0100625408813357 |     2
   0.394671014975756 |     2
(3 rows)

and
## select id from tmpx where id > 0.51119 and id < 0.51120;
         id
-------------------
  0.511193082667887
  0.511194097343832
(2 rows)


Regards,
     S.

*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, University of Cambridge, UK
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru

Re: PG & random() strangeness

От
Tom Lane
Дата:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> On Tue, 4 May 2010, Tom Lane wrote:
>> They aren't; or at least you've not provided any evidence that they were.

> I think I did, since:

> ## select id ,count(*) from tmpx group by (id) having count(*)>1;
>           id         | count
> --------------------+-------
>    0.519465064629912 |     2
>   0.0100625408813357 |     2
>    0.394671014975756 |     2
> (3 rows)

> and
> ## select id from tmpx where id > 0.51119 and id < 0.51120;
>          id
> -------------------
>   0.511193082667887
>   0.511194097343832
> (2 rows)

And?  Those aren't in any of the groups the first query identified.

            regards, tom lane

Re: PG & random() strangeness

От
"Sergey E. Koposov"
Дата:
On Tue, 4 May 2010, Tom Lane wrote:
>> ## select id ,count(*) from tmpx group by (id) having count(*)>1;
>>           id         | count
>> --------------------+-------
>>    0.519465064629912 |     2
>>   0.0100625408813357 |     2
>>    0.394671014975756 |     2
>> (3 rows)
>
>> and
>> ## select id from tmpx where id > 0.51119 and id < 0.51120;
>>          id
>> -------------------
>>   0.511193082667887
>>   0.511194097343832
>> (2 rows)
>
> And?  Those aren't in any of the groups the first query identified.

Oops, I have confused 0.519xxx  and 0.51119yyy numbers.

Sorry for the noise.

Regards,
     S.


*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, University of Cambridge, UK
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru

Re: PG & random() strangeness

От
Marc Schablewski
Дата:
Hi,

On 04.05.2010 18:24, Sergey E. Koposov wrote:
> ## select id ,count(*) from tmpx group by (id) having count(*)>1;
>          id         | count
> --------------------+-------
>   0.519465064629912 |     2
>  0.0100625408813357 |     2
>   0.394671014975756 |     2
> (3 rows)
>
> and
> ## select id from tmpx where id > 0.51119 and id < 0.51120;
>         id
> -------------------
>  0.511193082667887
>  0.511194097343832
> (2 rows)
The range in your second query does not include any of the duplicated
values from your first query.

Kind regards,
Marc