Discussion: Re: getting the most of out multi-core systems for repeated complex SELECT statements
Time for my pet meme to wiggle out of its hole (next to Phil's, and a day later). For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level. It has to. And it has to because the Big Boys already do so, to some extent, and they've realized that the BCNF schema on such machines is supremely efficient. PG/MySql/OSEngineOfChoice will get left behind simply because the efficiency offered will be worth the price.

I know this is far from trivial, and my C skills are such that I can offer no help. These machines have been the obvious "current" machine in waiting for at least 5 years, and those applications which benefit from parallelism (servers of all kinds, in particular) will filter out the winners and losers based on exploiting this parallelism.

Much as it pains me to say it, but the MicroSoft approach to software: write to the next generation processor and force users to upgrade, will be the winning strategy for database engines. There's just way too much to gain.

-- Robert

---- Original message ----
>Date: Thu, 03 Feb 2011 09:44:03 -0600
>From: pgsql-performance-owner@postgresql.org (on behalf of Andy Colson <andy@squeakycode.net>)
>Subject: Re: [PERFORM] getting the most of out multi-core systems for repeated complex SELECT statements
>To: Mark Stosberg <mark@summersault.com>
>Cc: pgsql-performance@postgresql.org
>
>On 2/3/2011 9:08 AM, Mark Stosberg wrote:
>>
>> Each night we run over 100,000 "saved searches" against PostgreSQL
>> 9.0.x. These are all complex SELECTs using "cube" functions to perform a
>> geo-spatial search to help people find adoptable pets at shelters.
>>
>> All of our machines in development and production have at least 2 cores
>> in them, and I'm wondering about the best way to maximally engage all
>> the processors.
>>
>> Now we simply run the searches in serial. I realize PostgreSQL may be
>> taking some advantage of the multiple cores in this arrangement, but I'm
>> seeking advice about the possibility and methods for running the
>> searches in parallel.
>>
>> One naive approach I considered was to use parallel cron scripts. One
>> would run the "odd" searches and the other would run the "even"
>> searches. This would be easy to implement, but perhaps there is a better
>> way. To those who have covered this area already, what's the best way
>> to put multiple cores to use when running repeated SELECTs with PostgreSQL?
>>
>> Thanks!
>>
>> Mark
>
>1) I'm assuming this is all server-side processing.
>2) One database connection will use one core. To use multiple cores you
>need multiple database connections.
>3) If your jobs are IO bound, then running multiple jobs may hurt
>performance.
>
>Your naive approach is the best. Just spawn off two jobs (or three, or
>whatever). I think it's also the only method. (If there is another
>method, I don't know what it would be.)
>
>-Andy
>
>--
>Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
>To make changes to your subscription:
>http://www.postgresql.org/mailpref/pgsql-performance
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Aljoša Mohorović
Date:
On Thu, Feb 3, 2011 at 4:57 PM, <gnuoytr@rcn.com> wrote:
> Time for my pet meme to wiggle out of its hole (next to Phil's, and a day later). For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level. It has to. And it has to because the Big Boys already do so, to some extent, and they've realized that the BCNF schema on such machines is supremely efficient. PG/MySql/OSEngineOfChoice will get left behind simply because the efficiency offered will be worth the price.

This kind of view on what the postgres community has to do can only be true if postgres has no intention to support "cloud environments" or any kind of hardware virtualization. While I'm sure targeting specific hardware features can greatly improve postgres performance, it should be an option, not a requirement. Forcing users to have specific hardware is basically telling users that they can forget about using postgres in Amazon/Rackspace cloud environments (or any similar environment).

I'm sure that a large part of the postgres community doesn't care about "cloud environments" (although this is only my personal impression), but if the plan is to disable postgres usage in such environments, you are basically losing a large part of the developers/companies targeting global internet consumers with their online products. Cloud environments are currently the best platform for internet-oriented developers/companies to start a new project or even to migrate from custom hardware/a dedicated data center.

> Much as it pains me to say it, but the MicroSoft approach to software: write to the next generation processor and force users to upgrade, will be the winning strategy for database engines. There's just way too much to gain.

It can arguably be said that because of this approach Microsoft is losing ground in most of their businesses/strategies.

Aljosa Mohorovic
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Scott Marlowe
Date:
On Thu, Feb 3, 2011 at 8:57 AM, <gnuoytr@rcn.com> wrote:
> Time for my pet meme to wiggle out of its hole (next to Phil's, and a day later). For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level. It has to. And

I'm pretty sure multi-core query processing is on the TODO list. Not sure anyone's working on it tho. Writing a big check might help.
Scott Marlowe wrote:
Work on the exciting parts people are interested in is blocked behind completely mundane tasks like coordinating how the multiple sessions are going to end up with a consistent view of the database. See "Export snapshots to other sessions" at http://wiki.postgresql.org/wiki/ClusterFeatures for details on that one.
Parallel query works well for accelerating CPU-bound operations that are executing in RAM. The reality here is that while the feature sounds important, these situations don't actually show up that often. There are exactly zero clients I deal with regularly who would be helped out by this. The ones running web applications whose workloads do fit into memory are more concerned about supporting large numbers of users, not optimizing things for a single one. And the ones who have so much data that single users running large reports would seemingly benefit from this are usually disk-bound instead.
The same sort of situation exists with SSDs. Take out the potential users whose data can fit in RAM instead, take out those who can't possibly get an SSD big enough to hold all their stuff anyway, and what's left in the middle is not very many people. In a database context I still haven't found anything better to do with a SSD than to put mid-sized indexes on them, ones a bit too large for RAM but not so big that only regular hard drives can hold them.
I would rather strongly disagree with the suggestion that embracing either of these fancy but not really as functional as they appear at first approaches is critical to PostgreSQL's future. They're specialized techniques useful to only a limited number of people.
--
Greg Smith   2ndQuadrant US   greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support   www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 02/03/2011 04:56 PM, Greg Smith wrote:
> Scott Marlowe wrote:
>> On Thu, Feb 3, 2011 at 8:57 AM, <gnuoytr@rcn.com> wrote:
>>> Time for my pet meme to wiggle out of its hole (next to Phil's, and a day later). For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level. It has to. And
>>
>> I'm pretty sure multi-core query processing is in the TODO list. Not
>> sure anyone's working on it tho. Writing a big check might help.
>
> Work on the exciting parts people are interested in is blocked behind completely mundane tasks like coordinating how the multiple sessions are going to end up with a consistent view of the database. See "Export snapshots to other sessions" at http://wiki.postgresql.org/wiki/ClusterFeatures for details on that one.
>
> Parallel query works well for accelerating CPU-bound operations that are executing in RAM. The reality here is that while the feature sounds important, these situations don't actually show up that often. There are exactly zero clients I deal with regularly who would be helped out by this. The ones running web applications whose workloads do fit into memory are more concerned about supporting large numbers of users, not optimizing things for a single one. And the ones who have so much data that single users running large reports would seemingly benefit from this are usually disk-bound instead.
>
> The same sort of situation exists with SSDs. Take out the potential users whose data can fit in RAM instead, take out those who can't possibly get an SSD big enough to hold all their stuff anyway, and what's left in the middle is not very many people. In a database context I still haven't found anything better to do with a SSD than to put mid-sized indexes on them, ones a bit too large for RAM but not so big that only regular hard drives can hold them.
>
> I would rather strongly disagree with the suggestion that embracing either of these fancy but not really as functional as they appear at first approaches is critical to PostgreSQL's future. They're specialized techniques useful to only a limited number of people.
>
> --
> Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
> PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us
> "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books

4 cores is cheap and popular now, 6 in a bit, 8 next year, 16/24 cores in 5 years. You can do 16 cores now, but it's a bit expensive. I figure hundreds of cores will be expensive in 5 years, but possible, and available.

CPUs won't get faster, but HDs and SSDs will. To have one database connection, which runs one query, run fast, it's going to need multi-core support.

That's not to say we need "parallel queries", or that we need multiple backends to work on one query. We need one backend, working on one query, using mostly the same architecture, to just use more than one core. You'll notice I used _mostly_ and _just_, and have no knowledge of PG internals, so I fully expect to be wrong.

My point is, there must be levels of threading, yes? If a backend has data to sort, has it collected, nothing locked, what would it hurt to use multi-core sorting?

-- OR --

Threading (and multicore), to me, always means queues. What if new types of backends were created that did "simple" things, that normal backends could distribute work to, then go off and do other things, and come back to collect the results?

I thought I read a paper someplace that said shared-cache (L1/L2/etc) multicore CPUs would start getting really slow at 16/32 cores, and that message passing was the way forward past that. If PG started aiming for 128-core support right now, it should use some kind of message passing with queues, yes?

-Andy
Andy Colson wrote: > Cpu's wont get faster, but HD's and SSD's will. To have one database > connection, which runs one query, run fast, it's going to need > multi-core support. My point was that situations where people need to run one query on one database connection that aren't in fact limited by disk I/O are far less common than people think. My troublesome database servers aren't ones with a single CPU at its max but wishing there were more workers, they're the ones that have >25% waiting for I/O. And even that crowd is still a subset, distinct from people who don't care about the speed of any one core, they need lots of connections to go at once. > That's not to say we need "parallel query's". Or we need multiple > backends to work on one query. We need one backend, working on one > query, using mostly the same architecture, to just use more than one > core. That's exactly what we mean when we say "parallel query" in the context of a single server. > My point is, there must be levels of threading, yes? If a backend has > data to sort, has it collected, nothing locked, what would it hurt to > use multi-core sorting? Optimizer nodes don't run that way. The executor "pulls" rows out of the top of the node tree, which then pulls from its children, etc. If you just blindly ran off and executed every individual node to completion in parallel, that's not always going to be faster--could be a lot slower, if the original query never even needed to execute portions of the tree. When you start dealing with all of the types of nodes that are out there it gets very messy in a hurry. Decomposing the nodes of the query tree into steps that can be executed in parallel usefully is the hard problem hiding behind the simple idea of "use all the cores!" > I thought I read a paper someplace that said shared cache (L1/L2/etc) > multicore cpu's would start getting really slow at 16/32 cores, and > that message passing was the way forward past that. 
If PG started > aiming for 128 core support right now, it should use some kinda > message passing with queues thing, yes? There already is a TupleStore type that is going to serve as the message being sent between the client backends. Unfortunately we won't get anywhere near 128 cores without addressing the known scalability issues that are in the code right now, ones you can easily run into even with 8 cores. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Scott Marlowe
Date:
On Thu, Feb 3, 2011 at 9:00 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Andy Colson wrote:
>> Cpu's wont get faster, but HD's and SSD's will. To have one database
>> connection, which runs one query, run fast, it's going to need multi-core
>> support.
>
> My point was that situations where people need to run one query on one
> database connection that aren't in fact limited by disk I/O are far less
> common than people think. My troublesome database servers aren't ones with
> a single CPU at its max but wishing there were more workers, they're the
> ones that have >25% waiting for I/O. And even that crowd is still a subset,
> distinct from people who don't care about the speed of any one core, they
> need lots of connections to go at once.

The most common case where I can use > 1 core is loading data, and pg_restore supports parallel restore threads, so that takes care of that pretty well.
On 02/03/2011 10:00 PM, Greg Smith wrote:
> Andy Colson wrote:
>> Cpu's wont get faster, but HD's and SSD's will. To have one database connection, which runs one query, run fast, it's going to need multi-core support.
>
> My point was that situations where people need to run one query on one database connection that aren't in fact limited by disk I/O are far less common than people think. My troublesome database servers aren't ones with a single CPU at its max but wishing there were more workers, they're the ones that have >25% waiting for I/O. And even that crowd is still a subset, distinct from people who don't care about the speed of any one core, they need lots of connections to go at once.

Yes, I agree... for today. If you gaze into 5 years... double the core count (but not the speed), double the IO rate. What do you see?

>> My point is, there must be levels of threading, yes? If a backend has data to sort, has it collected, nothing locked, what would it hurt to use multi-core sorting?
>
> Optimizer nodes don't run that way. The executor "pulls" rows out of the top of the node tree, which then pulls from its children, etc. If you just blindly ran off and executed every individual node to completion in parallel, that's not always going to be faster--could be a lot slower, if the original query never even needed to execute portions of the tree.
>
> When you start dealing with all of the types of nodes that are out there it gets very messy in a hurry. Decomposing the nodes of the query tree into steps that can be executed in parallel usefully is the hard problem hiding behind the simple idea of "use all the cores!"

What if... the nodes were run in separate threads, and interconnected via queues? A node would not have to run to completion either. A queue could be set up to have a max number of items. When a node adds 5 out of 5 items it would go to sleep. Its parent node, removing one of the items, could wake it up.

-Andy
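The bounded-queue idea above is easy to illustrate outside PostgreSQL entirely. This is a toy sketch in Python, not PG internals: the node names and the cap of 5 items are taken from the proposal, and a child "scan" node blocks once the queue is full, resuming only as the parent drains items.

```python
import queue
import threading

def scan_node(out_q, rows):
    # Child node: emit rows. put() blocks once the queue holds 5 items,
    # so this producer goes to sleep until the parent removes one.
    for row in rows:
        out_q.put(row)
    out_q.put(None)  # sentinel: no more rows

def sort_node(in_q):
    # Parent node: drain the child's queue (waking it as items are
    # removed), then sort whatever was collected.
    rows = []
    while (row := in_q.get()) is not None:
        rows.append(row)
    return sorted(rows)

q = queue.Queue(maxsize=5)  # the "5 out of 5 items" cap from above
child = threading.Thread(target=scan_node, args=(q, [3, 1, 4, 1, 5, 9, 2, 6]))
child.start()
result = sort_node(q)
child.join()
print(result)  # [1, 1, 2, 3, 4, 5, 6, 9]
```

Real executor nodes are far more varied than this, as Greg notes above, but the sleep/wake discipline between producer and consumer is exactly what a bounded queue gives for free.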
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Scott Marlowe
Date:
On Thu, Feb 3, 2011 at 9:19 PM, Andy Colson <andy@squeakycode.net> wrote:
> On 02/03/2011 10:00 PM, Greg Smith wrote:
>> Andy Colson wrote:
>>> Cpu's wont get faster, but HD's and SSD's will. To have one database
>>> connection, which runs one query, run fast, it's going to need multi-core
>>> support.
>>
>> My point was that situations where people need to run one query on one
>> database connection that aren't in fact limited by disk I/O are far less
>> common than people think. My troublesome database servers aren't ones with a
>> single CPU at its max but wishing there were more workers, they're the ones
>> that have >25% waiting for I/O. And even that crowd is still a subset,
>> distinct from people who don't care about the speed of any one core, they
>> need lots of connections to go at once.
>
> Yes, I agree... for today. If you gaze into 5 years... double the core
> count (but not the speed), double the IO rate. What do you see?

I run a cluster of pg servers under Slony replication, and we have 112 cores between three servers, soon to go to 144 cores. We have no need for individual queries to span the cores, honestly. Our real limit is the ability to get all those cores working at the same time on individual queries efficiently without thundering herd issues. Yeah, it's only one datapoint, but for us, with a lot of cores, we need each one to run one query as fast as it can.
Andy Colson wrote: > Yes, I agree... for today. If you gaze into 5 years... double the > core count (but not the speed), double the IO rate. What do you see? Four more versions of PostgreSQL addressing problems people are having right now. When we reach the point where parallel query is the only way around the actual bottlenecks in the software people are running into, someone will finish parallel query. I am not a fan of speculative development in advance of real demand for it. There are multiple much more serious bottlenecks impacting scalability in PostgreSQL that need to be addressed before this one is #1 on the development priority list to me. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Mark Stosberg
Date:
On 02/03/2011 10:57 AM, gnuoytr@rcn.com wrote:
> For PG to prosper in the future, it has to embrace the multi-core/processor/SSD machine at the query level

As the person who brought up the original concern, I'll add that "multi-core at the query level" really isn't important for us. Most of our PostgreSQL usage is through a web application which fairly automatically takes advantage of multiple cores, because there are several parallel connections.

A smaller but important piece of what we do is to run a cron script that needs to run hundreds of thousands of variations of the same complex SELECT as fast as it can. What honestly would have helped most is not technical at all-- it would have been some documentation on how to take advantage of multiple cores for this case.

It looks like it's going to be trivial-- divide up the data with a modulo, and run multiple parallel cron scripts that each process a slice of the data. A benchmark showed that this approach sped up our processing 3x when splitting the application 4 ways across 4 processors. (I think we failed to achieve a 4x improvement because the server was already busy handling some other tasks.)

Part of our case is likely fairly common *today*: many servers are multi-core now, but people don't necessarily understand how to take advantage of that if it doesn't happen automatically.

Mark
Re: Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Scott Marlowe
Date:
On Fri, Feb 4, 2011 at 2:18 PM, Mark Stosberg <mark@summersault.com> wrote: > It looks like it's going to be trivial-- Divide up the data with a > modulo, and run multiple parallel cron scripts that each processes a > slice of the data. A benchmark showed that this approach sped up our > processing 3x when splitting the application 4 ways across 4 processors. > (I think we failed to achieve a 4x improvement because the server was > already busy handling some other tasks). I once had about 2 months of machine work ahead of me for one server. Luckily it was easy to break up into chunks and run it on all the workstations at night in the office, and we were done in < 1 week. pgsql was the data store for it, and it was just like what you're talking about, break it into chunks, spread it around.
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Chris Browne
Date:
gnuoytr@rcn.com writes:
> Time for my pet meme to wiggle out of its hole (next to Phil's, and a
> day later). For PG to prosper in the future, it has to embrace the
> multi-core/processor/SSD machine at the query level. It has to. And
> it has to because the Big Boys already do so, to some extent, and
> they've realized that the BCNF schema on such machines is supremely
> efficient. PG/MySql/OSEngineOfChoice will get left behind simply
> because the efficiency offered will be worth the price.
>
> I know this is far from trivial, and my C skills are such that I can
> offer no help. These machines have been the obvious "current" machine
> in waiting for at least 5 years, and those applications which benefit
> from parallelism (servers of all kinds, in particular) will filter out
> the winners and losers based on exploiting this parallelism.
>
> Much as it pains me to say it, but the MicroSoft approach to software:
> write to the next generation processor and force users to upgrade,
> will be the winning strategy for database engines. There's just way
> too much to gain.

I'm not sure how true that is, really. (e.g. - "too much to gain.")

I know that Jan Wieck and I have been bouncing thoughts on valid use of threading off each other for *years*, now, and it tends to be interesting but difficult to the point of impracticality.

But how things play out is quite fundamentally different for different usage models. It's useful to cross items off the list, so we're left with the tough ones that are actually a problem.

1. For instance, OLTP applications, that generate a lot of concurrent connections, already do perfectly well in scaling on multi-core systems. Each connection is a separate process, and that already harnesses multi-core systems perfectly well. Things have improved a lot over the last 10 years, and there may yet be further improvements to be found, but it seems pretty reasonable to me to say that the OLTP scenario can be treated as "solved" in this context.

The scenario where I can squint and see value in trying to multithread is the contrast to that, of OLAP. The case where we only use a single core, today, is where there's only a single connection, and a single query, running. But that can reasonably be further constrained; not every single-connection query could be improved by trying to spread work across cores. We need to add some further assumptions:

2. The query needs to NOT be I/O-bound. If it's I/O bound, then your system is waiting for the data to come off disk, rather than to do processing of that data.

That condition can be somewhat further strengthened... It further needs to be a query where multi-processing would not increase the I/O burden.

Between those two assumptions, that cuts the scope of usefulness to a very considerable degree. And if we *are* multiprocessing, we introduce several new problems, each of which is quite troublesome:

- How do we decompose the query so that the pieces are processed in ways that improve processing time? In effect, how to generate a parallel query plan? It would be more than stupid to consider this to be "obvious." We've got 15-ish years worth of query optimization efforts that have gone into Postgres, and many of those changes were not "obvious" until after they got thought through carefully. This multiplies the complexity, and opportunity for error.

- Coordinating processing becomes quite a bit more complex. Multiple threads/processes are accessing parts of the same data concurrently, so a "parallelized query" that harnesses 8 CPUs might generate 8x as many locks and analogous coordination points.

- Platform specificity. Threading is a problem in that each OS platform has its own implementation, and even when they claim to conform to common standards, they still have somewhat different interpretations. This tends to go in one of the following directions:

  a) You have to pick one platform to do threading on. Oops. There's now PostgreSQL-Linux, which is the only platform where our multiprocessing thing works. It could be worse than that; it might work on a particular version of a particular OS...

  b) You follow some apparently portable threading standard, and find that things are hugely buggy because the platforms follow the standard a bit differently. And perhaps this means that, analogous to a), you've got a set of platforms where this "works" (for some value of "works"), and others where it can't. That's almost as evil as a).

  c) You follow some apparently portable threading standard, and need to wrap things in a pretty thick safety blanket to make sure it is compatible with all the bugs in interpretation and implementation. Complexity++, and performance probably suffers.

None of these are particularly palatable, which is why threading proposals get a lot of pushback.

At the end of the day, if this is only providing value for a subset of use cases, involving peculiar-ish conditions, well, it's quite likely wiser for most would-be implementors to spend their time on improvements likely to help a larger set of users that might, in fact, include those that imagine that this parallelization would be helpful.

--
select 'cbbrowne' || '@' || 'acm.org';
http://www3.sympatico.ca/cbbrowne/x.html
FLORIDA: Where your vote counts and counts and counts.
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
david@lang.hm
Date:
On Fri, 4 Feb 2011, Chris Browne wrote:
> 2. The query needs to NOT be I/O-bound. If it's I/O bound, then your
> system is waiting for the data to come off disk, rather than to do
> processing of that data.

Yes and no on this one. It is very possible to have a situation where the process generating the I/O is waiting for the data to come off disk, but there are still idle resources in the disk subsystem. It may be that the best way to address this is to have the process generating the I/O send off more requests, but that is sometimes significantly more complicated than splitting the work between two processes and letting them each generate I/O requests.

With rotating disks, ideally you want to have at least two requests outstanding: one that the disk is working on now, and one for it to start on as soon as it finishes the one that it's on (so that the disk doesn't sit idle while the process decides what the next read should be). In practice you tend to want to have even more outstanding from the application so that they can be optimized (combined, reordered, etc.) by the lower layers.

If you end up with a largish RAID array (say 16 disks), this can translate into a lot of outstanding requests that you want to have active to fully utilize the array, but having the same number of requests outstanding with a single disk would be counterproductive, as the disk would not be able to see all the outstanding requests and therefore would not be able to optimize them as effectively.

David Lang
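The point about keeping several requests outstanding can be sketched in application code: instead of one loop issuing reads strictly one after another, a small worker pool keeps several reads in flight so the lower layers have something to combine and reorder. This is a hedged, self-contained Python illustration against a scratch file, not a claim about how PostgreSQL issues its I/O.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4096

# Scratch file standing in for table data on disk.
tmp = tempfile.NamedTemporaryFile(delete=False)
tmp.write(os.urandom(CHUNK * 16))
tmp.close()

def read_chunk(n):
    # Each task issues its own seek+read, so several requests can be
    # in flight at once instead of strictly one after another.
    with open(tmp.name, "rb") as f:
        f.seek(n * CHUNK)
        return f.read(CHUNK)

# Keep up to 4 requests outstanding; the kernel and controller can then
# combine and reorder them, which is the optimization described above.
with ThreadPoolExecutor(max_workers=4) as pool:
    chunks = list(pool.map(read_chunk, range(16)))

os.remove(tmp.name)
total = sum(len(c) for c in chunks)
print(total)  # 65536: all 16 chunks of 4 KiB were read
```

The worker count plays the role of the queue depth: large enough to keep a 16-disk array busy, it would be wasted on a single disk, as noted above.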
Re: getting the most of out multi-core systems for repeated complex SELECT statements
From:
Vitalii Tymchyshyn
Date:
Hi, all

My small thoughts about parallelizing a single query. AFAIK, in the cases where it is needed, there is usually one single operation that takes a lot of CPU, e.g. hashing or sorting. And these are usually tasks that have well-known algorithms to parallelize.

The main problem, as for me, is thread safety. First of all, the operations that are going to be parallelized must be thread safe. Then the functions and procedures they call must be thread safe too. So, a marker for a procedure must be introduced, and all standard ones should be checked/fixed for parallel processing with the marker set.

Then, one should not forget optimizer checks for when to introduce parallelism. How should it be accounted for in the query plan? Should it influence optimizer decisions (should it count CPU time or wall time when optimizing the query plan)? Or can it simply be used by an operation when it can see it will benefit from it?

Best regards, Vitalii Tymchyshyn