Обсуждение: sub queries and caching.

Поиск
Список
Период
Сортировка

sub queries and caching.

От
mlw
Дата:
Take these queries:

select * from foo as F, (select * from bar where name = 'bla') as B where
F.name = B.name
union all
select * from foo as F, (select * from bar where name = 'bla') as B where
F.type = B.type

OR 

create temp table B as select * from bar where name = 'bla';
select * from foo as F, B where F.name = B.name
union all
select * from foo as F, B where F.type = B.type;
drop table B;

My question is, which would be more efficient, or is it a wash?
(A note, I will be calling this from an external programming laguage, PHP, so
the first query would be one Postgres pq_exec call, while the second query
would be three separate calls.)


-- 
5-4-3-2-1 Thunderbirds are GO!
------------------------
http://www.mohawksoft.com


Re: sub queries and caching.

От
mlw
Дата:
Andrew McMillan wrote:
> 
> mlw wrote:
> >
> > Take these queries:
> >
> > select * from foo as F, (select * from bar where name = 'bla') as B where
> > F.name = B.name
> > union all
> > select * from foo as F, (select * from bar where name = 'bla') as B where
> > F.type = B.type
> >
> > OR
> >
> > create temp table B as select * from bar where name = 'bla';
> > select * from foo as F, B where F.name = B.name
> > union all
> > select * from foo as F, B where F.type = B.type;
> > drop table B;
> >
> > My question is, which would be more efficient, or is it a wash?
> > (A note, I will be calling this from an external programming laguage, PHP, so
> > the first query would be one Postgres pq_exec call, while the second query
> > would be three separate calls.)
> 
> The second could also be done as a single PHP call, given that you should be able to
> "create temp table ...; select ..." in a single pg_Exec call.
> 
> You don't need a 'DROP TABLE B' because it's a temp table and will be dropped
> anyway, won't it, unless you're using pg_pconnect.

For a high volume website, where processing is done and latency are important
considerations.

Suppose, you have a few apache/php systems load balanced on top of a single
database system. (This is a very standard configuration.) The apache/php
machine cycles are cheaper than the database machine cycles because they can
usually be scaled easily. A database system is very difficult to scale. While
adding an apache/php box to this configuration is usually trivial, setting up a
database system across two or more machines is a hugely more complex problem.

Then there is latency, the longer a web pages takes to process, it holds its
resources longer, this means you will probably have more web server processes,
and if each process holds a database connection, you will probably have more
open database connections. So, latency costs you the ram on the local web
server and the resources on the back-end application service machines.

So the real trick to getting good salability is to reduce latency AND move as
much processing to the boxes which can be scaled.

Persistent connections to a database are vital in this scenario. Creation of a
new connection to a database impacts backend processing time and page latency.
So one has to drop the temporary table.

So, which is more expensive? Issuing the subquery multiple times within the
larger query, or creating a temporary table, performing the simpler query, and
then dropping the temp table.

My guess would be that creating the temp table and dropping it again do use
backend processing cycles. I wonder if PostgreSQL would be smart enough to
perform that query only once? 


-- 
5-4-3-2-1 Thunderbirds are GO!
------------------------
http://www.mohawksoft.com


Re: sub queries and caching.

От
mlw
Дата:
Kevin wrote:
> 
> While I'm sure it's just because of the simplicity of this example, it
> seems that the query could be reorganized to avoid this double query:
> 
> select * from foo F, bar B where B.name = 'bla' and (B.name = F.name or
> B.type = F.type);

That was the original format of the query, and while not obvious, forced a full
table scan on foo and bar. The query presented is the result of many iterations
of "explain" and execution timings. Even with sequential scans disabled,
Postgres still does them. 

> 
> (granted that this gives a slightly different results, rows matching
> both conditions don't appear twice, which I would imagine to be an
> unwanted side effect of the original query, anyway).
> 
> I would guess it's easier for the query writer to figure this out than
> the db itself, however, since the two queries look very different (and I
> suppose the database wouldn't come up with this query since the result
> /is/ different).  Here's a new question: Would it be useful for the
> database to try and simplify queries before executing them?  Or would
> this just take more time than it's worth for most cases?
> 
> According to EXPLAIN, it /plans/ on doing the query twice, but I don't
> know enough about the internal workings to know if it caches results (so
> I can't answer the original question, sorry).

That's the problem I see as well. You would think that if Postgres sees the
same subquery, it should only do it once. Oh well, neither does it seem Oracle.

-- 
5-4-3-2-1 Thunderbirds are GO!
------------------------
http://www.mohawksoft.com