Re: Compression and on-disk sorting

Поиск
Список
Период
Сортировка
От Greg Stark
Тема Re: Compression and on-disk sorting
Дата
Msg-id 87r72sot7t.fsf@stark.xeocode.com
обсуждение исходный текст
Ответ на Re: Compression and on-disk sorting  ("Jim C. Nasby" <jnasby@pervasive.com>)
Ответы Re: Compression and on-disk sorting  (Bruce Momjian <pgman@candle.pha.pa.us>)
Список pgsql-hackers
"Jim C. Nasby" <jnasby@pervasive.com> writes:

> Which means we need all the interface bits to be able to tell PostgreSQL
> where every single temp storage area is. Presumably much of the
> tablespace mechanism could be used for this, but it's still a bunch of
> work. And you can't just say "I have 8 spindles", you have to tell
> PostgreSQL exactly where to put each temporary area (unless you just
> have it put one on every tablespace you have defined).

Yes, if you have more than one temporary area you definitely need a way to
tell Postgres where to put them since obviously they won't be in the postgres
base directory. But I think that's all Postgres really needs to know.

One could imagine a more complex version where Postgres has meta information
about the bandwidth and seek penalty for each sort area separately. Presumably
also for each table space. But that's a whole lot more complexity than
Postgres's current cost model.


> > that it should strive to maximize sequential reads within one temp area and
> > expect switching between temp areas (which represent multiple spindles) to be
> > better than multiplexing multiple tapes within a single temp area (which
> > represents a single spindle).
> 
> Which adds yet more complexity to all the code that uses the temp area.
> And as others have brought up, you still have to allow for the case when
> splitting all of this out into multiple files means you end up using
> substantially more disk space. That further drives up the complexity.

You also have to consider that it won't always be a benefit to spread the sort
over multiple sort areas. If there's only one sort going on and you can reach
a 1:1 ratio between tapes and spindles then I think it would be a huge
benefit. Effectively boosting the sort speed by random_page_cost.

But if you don't have as many spindles as your algorithm needs tapes
then it's unclear which to multiplex down and whether you gain any benefit
once you're multiplexing over simply using a single sort area.

And worse, if there are multiple sorts going on in the system then you're not
going to get sequential access even if you have multiple sort areas available.
If you have N sort areas and N sorts are going on then you're probably better
off multiplexing each one down to a single sort area and letting them each
proceed without interfering with each other rather than having each one hog
all the sort areas and forcing the OS to do the multiplexing blindly.

> My point is that unless someone shows that there's a non-trivial
> performance gain here, it's not going to happen.

I think two extreme cases are well worth pursuing: 

1) Use n sort areas for n tapes making everything purely sequential access.  That would be useful for large DSS systems
wherelarge sorts are running  and i/o bandwidth is high for sequential access. That gives effectively a
random_page_costspeed boost.
 

2) Use the current algorithm unchanged but have each sort use a different sort  area in some sort of round-robin
fashion.That helps the OLTP type  environment (ignoring for the moment that OLTP environments really ought  not be
doingdisk sorts) where people complain about unreliable execution  times more than slow execution times. If you can
provideenough sort areas  then it would remove one big reason other queries concurrent impact the  execution time of
queries.

-- 
greg



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Jaime Casanova"
Дата:
Сообщение: Re: PL/pgSQL 'i = i + 1' Syntax
Следующее
От: Greg Stark
Дата:
Сообщение: Re: pg_dump and backslash escapes