Critical performance problems on large databases

From: Gunther Schadow
Subject: Critical performance problems on large databases
Date:
Msg-id: 3CB4BF29.1000700@aurora.regenstrief.org
Responses: Re: Critical performance problems on large databases  (Tom Lane <tgl@sss.pgh.pa.us>)
           Re: Critical performance problems on large databases  (Stephan Szabo <sszabo@megazone23.bigpanda.com>)
           Re: Critical performance problems on large databases  (Bill Gribble <grib@linuxdevel.com>)
           Re: Critical performance problems on large databases  (Shaun Thomas <sthomas@townnews.com>)
List: pgsql-general
Hi,

it has been repeatedly noted that pgsql is rather slow on
selects on large tables. For instance, a colleague of mine
asked about this recently, and in the Google archives I
found the following pertinent question and response.

http://archives.postgresql.org/pgsql-sql/2000-03/msg00031.php

Basically if you have a large table with, say, a million rows
and you do

SELECT * FROM Bigtable;

it takes a long time to come up with the first page of
results, using up a lot of computer resources etc., and after
the first page is returned the backend essentially goes
idle for the rest of the result retrieval.
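For contrast, here is a minimal sketch of the incremental delivery one
would hope for, using Python's built-in sqlite3 module as a stand-in
client (an assumption on my part; any DB-API driver with fetchmany()
behaves the same way): fetching the first page does not require
materializing the remaining rows on the client side.

```python
import sqlite3

# Build a toy "Bigtable" in memory (stand-in for the real database).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigtable (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO bigtable VALUES (?, ?)",
    ((i, "row %d" % i) for i in range(10000)),
)

cur = conn.execute("SELECT * FROM bigtable")
# Only the first "page" is pulled; the remaining rows are produced
# on demand rather than buffered up front.
first_page = cur.fetchmany(100)
print(len(first_page))  # 100 rows, without touching the other 9900
```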

We also noted that a

SELECT COUNT(*) FROM Bigtable;

can take quite a long time and again use a lot of resources,
whereas

SELECT COUNT(smallcolumn) FROM Bigtable;

may be faster and consume fewer resources.

Off the bat, this indicates to me that there is something
sub-optimal about how PostgreSQL handles simple queries. In
a database that is supposed to perform well for online user
transactions, one would want query processing to be streamed
as much as possible; i.e., since a SELECT * FROM Bigtable;
requires no work other than retrieving the tuples from
physical storage, the response should be immediate and resource
usage low. There should not be large buffer allocations.
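To make the distinction concrete, here is a minimal sketch (illustrative
Python, not actual postgres internals) of buffered versus streamed
execution: the buffered variant pays for the entire result set before
the first row is available, while the streamed variant yields rows one
at a time with constant memory.

```python
def produce_rows(n):
    """Stand-in for reading tuples out of physical storage."""
    for i in range(n):
        yield (i, "payload")

def buffered_select(n):
    # Materializes the entire result set before returning anything.
    return list(produce_rows(n))

def streamed_select(n):
    # Rows are delivered to the caller as they are read.
    return produce_rows(n)

# The buffered plan builds all 100000 tuples before the first is usable;
# the streamed plan hands back the first tuple immediately.
first_buffered = buffered_select(100000)[0]
first_streamed = next(iter(streamed_select(100000)))
```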

Conversely it looks as if PostgreSQL will always read a sizeable
piece (if not all?) of the result set into some buffer area before
returning a single row. This would explain the slow startup on
the SELECT * FROM Bigtable; query as well as the fact that
COUNT(smallcolumn) behaves much faster than COUNT(*).

Again, count should be streamed as well, so as to use no
significant memory resources other than the counter. Apparently
a COUNT(*) in postgres is executed as

SELECT * FROM Bigtable INTO $somebuffer
COUNT(tuples in $somebuffer)
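A minimal sketch of the two plans (illustrative Python, not what
postgres actually does): counting over a stream retains nothing but the
counter, while the conjectured plan materializes every tuple first.

```python
def scan(n):
    """Stand-in for a sequential scan over n tuples."""
    for i in range(n):
        yield (i,)

# Conjectured plan: buffer everything, then count the buffer.
buffered_count = len(list(scan(100000)))

# Streamed plan: fold the rows into a counter as they arrive;
# memory use is constant regardless of table size.
streamed_count = sum(1 for _ in scan(100000))
```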

Admittedly I am conjecturing here, but the evidence is strong,
especially because I can make a 3-level subquery with GROUP BY
and all kinds of stuff run almost faster than a plain
SELECT * FROM Bigtable;

Any enlightenment? Am I wrong? Will this be fixed soon?
Is it hard to change pgsql to do better streaming of its
operations? As a corollary, I would presume that pgsql
won't benefit much from multi-CPU machines because it cannot
parallelize its activities. It may be less hard to make all
of pgsql thread-safe and threaded than to rewrite the
execution engine to stream between the different tasks
and do less buffering, right?

Or maybe all is fine with pgsql and we just haven't figured
out how to set up the configuration right?

thanks
-Gunther


PS: we are seriously looking into using pgsql as the core
of a BIG medical record system, but we already know that
if we can't get quick online responses (< 2 s) on
large result sets (10000 records), at least for the first
page (~ 100 records), we are in trouble.

--
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org


