Thread: Netflix Prize data


Netflix Prize data

From
"Mark Woodward"
Date:
I signed up for the Netflix Prize (www.netflixprize.com), downloaded
their data, and imported it into PostgreSQL. Here is how I created the
table:

       Table "public.ratings"
 Column |  Type   | Modifiers
--------+---------+-----------
 item   | integer |
 client | integer |
 rating | integer |
 rdate  | text    |
Indexes:
    "ratings_client" btree (client)
    "ratings_item" btree (item)

markw@snoopy:~/netflix$ time psql netflix -c "select count(*) from ratings"
   count
-----------
 100480507
(1 row)


real    2m6.270s
user    0m0.004s
sys     0m0.005s


The one thing I notice is that it is REAL slow. I know it is, in fact, 100
million records, but I don't think PostgreSQL is usually slow like this.
I'm going to check with some other machines to see if there is a problem
with my test machine or if something is weird about PostgreSQL and large
numbers of rows.

I tried to cluster the data along a particular index but had to cancel it
after 3 hours.

I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06" So,
the raw data is 23 bytes, the date string will probably be rounded up to
12 bytes, that's 24 bytes per row of data. What is the overhead per
variable? per row?

Is there any advantage to using "varchar(10)" over "text" ?




Re: Netflix Prize data

From
"Luke Lonergan"
Date:
Mark,

On 10/4/06 1:43 PM, "Mark Woodward" <pgsql@mohawksoft.com> wrote:

> markw@snoopy:~/netflix$ time psql netflix -c "select count(*) from ratings"
>    count
> -----------
>  100480507
> (1 row)
> 
> 
> real    2m6.270s
> user    0m0.004s
> sys     0m0.005s

I think you are getting about 40MB/s on your sequential scan of about 5GB of
heap data in this case.  I calculate the size of the data as:

3 Integers (12 bytes), one text date field (10 bytes ?) and tuple overhead
(24 bytes) = 46 bytes per row

100 million rows x 46 bytes / row = 4.6 Gbytes

- Luke
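Luke's arithmetic can be sanity-checked with a few lines of Python. This is an editor's sketch: the 46-byte row figure (12 bytes of integers, roughly 10 bytes of text date, 24 bytes of tuple overhead) is his estimate, and the real on-disk size also includes page headers and alignment padding not counted here.

```python
# Verify Luke's back-of-the-envelope table size and scan rate.
rows = 100_480_507
bytes_per_row = 12 + 10 + 24          # integers + text date + tuple overhead (Luke's figures)
total_bytes = rows * bytes_per_row
print(f"{total_bytes / 1e9:.2f} GB")  # ~4.62 GB, matching the 4.6 GB estimate

elapsed = 2 * 60 + 6.270              # the 2m6.270s wall time from the count(*)
print(f"{total_bytes / elapsed / 1e6:.0f} MB/s")  # ~37 MB/s, in the ballpark of "about 40MB/s"
```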




Re: Netflix Prize data

From
Tom Lane
Date:
"Mark Woodward" <pgsql@mohawksoft.com> writes:
> The one thing I notice is that it is REAL slow.

How fast is your disk?  Counting on my fingers, I estimate you are
scanning the table at about 47MB/sec, which might or might not be
disk-limited...

> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"

So why aren't you storing it as type "date"?
        regards, tom lane
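Tom's finger-count works out if each row is also charged full per-tuple overhead. This editor's sketch borrows the 28-byte header and 4-byte line pointer figures quoted later in this thread; the 14-byte text date (4-byte length word plus 10 characters) and the 4-byte alignment rounding are assumptions.

```python
# Reconstruct the ~47 MB/s estimate from assumed per-row sizes.
rows = 100_480_507
data = 12 + 4 + 10                    # three ints + length word + "2005-09-06"
data_aligned = (data + 3) // 4 * 4    # round up to 4-byte alignment -> 28
row = 28 + 4 + data_aligned           # tuple header + line pointer + data = 60
elapsed = 2 * 60 + 6.270
print(f"{rows * row / elapsed / 1e6:.0f} MB/s")  # ~48 MB/s, close to Tom's 47
```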


Re: Netflix Prize data

From
Gregory Stark
Date:
"Mark Woodward" <pgsql@mohawksoft.com> writes:

> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06" So,
> the raw data is 23 bytes, the date string will probably be rounded up to
> 12 bytes, that's 24 bytes per row of data. What is the overhead per
> variable? per row?
>
> Is there any advantage to using "varchar(10)" over "text" ?

I'll second the "use a date" comment.

But to answer the questions, text and varchar are handled identically in
almost every respect. The overhead per variable width field (like text or
varchar) is 4 bytes. The overhead per row depends on a few factors, but figure
28 bytes.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Netflix Prize data

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> I signed up for the Netflix Prize. (www.netflixprize.com)
> and downloaded their data and have imported it into PostgreSQL.
> Here is how I created the table:

I signed up as well, but have the table as follows:

CREATE TABLE rating (
  movie  SMALLINT NOT NULL,
  person INTEGER  NOT NULL,
  rating SMALLINT NOT NULL,
  viewed DATE     NOT NULL
);

I also recommend not loading the entire file until you get further
along in the algorithm solution. :)

Not that I have time to really play with this....

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200610041827
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iD8DBQFFJDZ0vJuQZxSWSsgRAr6OAKCiOuspNm8QCsujaEN0Kgie6RsTjgCdGPda
9zVzpkrhTEhySEVBwMBTOdU=
=zF7u
-----END PGP SIGNATURE-----




Re: Netflix Prize data

From
"Mark Woodward"
Date:
>> I signed up for the Netflix Prize. (www.netflixprize.com)
>> and downloaded their data and have imported it into PostgreSQL.
>> Here is how I created the table:
>
> I signed up as well, but have the table as follows:
>
> CREATE TABLE rating (
>   movie  SMALLINT NOT NULL,
>   person INTEGER  NOT NULL,
>   rating SMALLINT NOT NULL,
>   viewed DATE     NOT NULL
> );
>
> I also recommend not loading the entire file until you get further
> along in the algorithm solution. :)
>
> Not that I have time to really play with this....

As luck would have it, I wrote a recommendations system based on music
ratings a few years ago.

After reading the NYT article, it seems as though one or more of the guys
behind "Net Perceptions" is either helping them or did their system, I'm
not sure. I wrote my system because Net Perceptions was too slow and did a
lousy job.

I think the notion of "communities" in general is an interesting study in
statistics, but everything I've seen in the form of bad recommendations
shows that while [N] people may share certain tastes, that doesn't
necessarily mean that what one likes the others will. This is especially
flawed with movie rentals because it is seldom a 1:1 ratio of movies to
people: there are often multiple people in a household, and movies are
almost always watched by multiple people.

Anyway, good luck! (Not better than me, of course :-)


Re: Netflix Prize data

From
"Mark Woodward"
Date:
> "Mark Woodward" <pgsql@mohawksoft.com> writes:
>> The one thing I notice is that it is REAL slow.
>
> How fast is your disk?  Counting on my fingers, I estimate you are
> scanning the table at about 47MB/sec, which might or might not be
> disk-limited...
>
>> I'm using 8.1.4. The "rdate" field looks something like: "2005-09-06"
>
> So why aren't you storing it as type "date"?
>

You are assuming I gave it any thought at all. :-)

I converted it to a date type (create table ratings2 as ....)

markw@snoopy:~/netflix/download$ time psql -c "select count(*) from ratings" netflix
   count
-----------
 100480507
(1 row)


real    1m29.852s
user    0m0.002s
sys     0m0.005s

That's about the right increase based on the reduction in data size.

OK, I guess I am crying wolf; 47MB/sec isn't all that bad for this system.
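The "about the right increase" claim can be roughed out. This is an editor's sketch: the 60- and 48-byte row sizes assume the 28-byte tuple header and 4-byte line pointer figures cited elsewhere in this thread, with 28 bytes of aligned data for the text-date layout and 16 bytes once the date is a 4-byte date column.

```python
# Compare the observed speedup with the assumed row-size reduction.
old_row = 28 + 4 + 28                 # header + line pointer + data with text date
new_row = 28 + 4 + 16                 # header + line pointer + data with date type
size_ratio = new_row / old_row        # 48/60 = 0.80
time_ratio = (1 * 60 + 29.852) / (2 * 60 + 6.270)
print(f"size ratio {size_ratio:.2f}, time ratio {time_ratio:.2f}")  # 0.80 vs 0.71
```

The run got slightly faster than the size reduction alone predicts, which is consistent with "about the right increase" as a rough statement.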


Re: Netflix Prize data

From
Gregory Stark
Date:
"Greg Sabino Mullane" <greg@turnstep.com> writes:

> CREATE TABLE rating (
>   movie  SMALLINT NOT NULL,
>   person INTEGER  NOT NULL,
>   rating SMALLINT NOT NULL,
>   viewed DATE     NOT NULL
> );

You would probably be better off putting the two smallints first followed by
the integer and date. Otherwise both the integer and the date field will have
an extra two bytes of padding wasting 4 bytes of space.

If you reorder the fields that way you'll be down to 28 bytes of tuple header
overhead and 12 bytes of data. There's actually another 4 bytes in the form of
the line pointer so a total of 44 bytes per record. Ie, almost 73% of the disk
i/o you're seeing is actually per-record overhead.
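The padding argument can be sketched with a toy layout calculator. This is an editor's illustration, not PostgreSQL's actual code: it assumes each field starts at an offset that is a multiple of its alignment (2 for smallint, 4 for integer and date), which mirrors the padding rule Gregory describes.

```python
# Toy model of row data size under alignment padding.
def row_data_size(fields):
    offset = 0
    for size, align in fields:
        offset = (offset + align - 1) // align * align  # pad up to the field's alignment
        offset += size
    return offset

smallint, integer, date = (2, 2), (4, 4), (4, 4)
# original order: movie, person, rating, viewed
print(row_data_size([smallint, integer, smallint, date]))  # 16
# smallints first: movie, rating, person, viewed
print(row_data_size([smallint, smallint, integer, date]))  # 12
```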

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Netflix Prize data

From
"Mark Woodward"
Date:
>
> "Greg Sabino Mullane" <greg@turnstep.com> writes:
>
>> CREATE TABLE rating (
>>   movie  SMALLINT NOT NULL,
>>   person INTEGER  NOT NULL,
>>   rating SMALLINT NOT NULL,
>>   viewed DATE     NOT NULL
>> );
>
> You would probably be better off putting the two smallints first followed
> by
> the integer and date. Otherwise both the integer and the date field will
> have
> an extra two bytes of padding wasting 4 bytes of space.
>
> If you reorder the fields that way you'll be down to 28 bytes of tuple
> header
> overhead and 12 bytes of data. There's actually another 4 bytes in the
> form of
> the line pointer so a total of 44 bytes per record. Ie, almost 73% of the
> disk
> i/o you're seeing is actually per-record overhead.
>

That's good advice; however, it is said that Netflix has more than 64K
movies, so while the test data may fit in a smallint, I doubt the
overall system would work.

The rating, however, is a single character, 1~9. Would making it a char(1) buy anything?

I wonder....

If I started screwing around with movie ID and rating and moved them into
one int (one byte for rating, three bytes for movie ID), that could reduce
the data size by at least half a gig.


Re: Netflix Prize data

From
Tom Lane
Date:
"Mark Woodward" <pgsql@mohawksoft.com> writes:
> The rating, however, is one char 1~9. Would making it a char(1) buy anything?

No, that would actually hurt because of the length word for the char
field.  Even if you used the "char" type, which really is only one byte,
you wouldn't win anything because of alignment issues.  Personally I'd
just go for three ints and a date, rather than trying to be cute with
the rating.
        regards, tom lane


Re: Netflix Prize data

From
"Mark Woodward"
Date:
> "Mark Woodward" <pgsql@mohawksoft.com> writes:
>> The rating, however, is one char 1~9. Would making it a char(1) buy
>> anything?
>
> No, that would actually hurt because of the length word for the char
> field.  Even if you used the "char" type, which really is only one byte,
> you wouldn't win anything because of alignment issues.  Personally I'd
> just go for three ints and a date, rather than trying to be cute with
> the rating.


Actually, the date is just days, right? I don't actually need it much.
So, create a smallint for the date and do this: smalldate =
rdate - date('1970-01-01'). And use a smallint for the rating.

 Column |   Type   | Modifiers
--------+----------+-----------
 movie  | integer  |
 client | integer  |
 day    | smallint |
 rating | smallint |
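A quick editor's check that days-since-epoch actually fits in a smallint for this dataset: a signed smallint tops out at 32767, which covers dates well past the range of the Netflix ratings.

```python
# Days since 1970-01-01 as a smallint: range check for the Netflix dates.
from datetime import date, timedelta

epoch = date(1970, 1, 1)
day = (date(2005, 9, 6) - epoch).days  # the sample rdate from earlier in the thread
print(day)                             # 13032
assert -32768 <= day <= 32767          # fits in a signed smallint
print(epoch + timedelta(days=32767))   # 2059-09-18, the last representable day
```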




Re: Netflix Prize data

From
Heikki Linnakangas
Date:
Mark Woodward wrote:
> 
> I tried to cluster the data along a particular index but had to cancel it
> after 3 hours.

If the data is in random order, it's faster to do

SELECT * INTO foo_sorted FROM foo ORDER BY bar

then CREATE INDEX, than to run CLUSTER.

That's because CLUSTER does a full index scan of the table, which is 
slower than a seqscan + sort if the table is not already clustered.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com