Обсуждение: Is there a good discussion of optimizations?

Поиск
Список
Период
Сортировка

Is there a good discussion of optimizations?

От
Guyren Howe
Дата:
I’d like to put together a good video and writeup about what the… philosophy behind relational databases is.

Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why — the philosophy — of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner — make this temp table, then update it some, then make this other temp table, etc… I see this particularly among analysts who for some reason often prefer SQL Server. I think this is down to afaict SQL Server having an abominable query optimizer. Comments on SQL Server’s query optimizer or this culture of temp tables are welcome.

Anyway, I’d like to put together something that explains this. I would go into Codd’s original insight and how simple yet powerful it is. I’d like to discuss how if you stick to SQL, rather than forcing it into this imperative straight-jacket, the database can work out all the details so you don’t have to do the bad things. It would be useful if there is some reasonably accessible discussion to go through the sort of query optimizations that Postgres can do for you, to convince folks that they can stick to describing what they want, not how to get it.

I’d be interested in any comments anyone has about useful things to discuss here, and in particular I’m interested in an accessible discussion of the sorts of query optimizations Postgres can do for us.

Re: Is there a good discussion of optimizations?

От
Bruce Momjian
Дата:
On Wed, Dec 23, 2020 at 05:55:57PM -0800, Guyren Howe wrote:
> I’d be interested in any comments anyone has about useful things to discuss
> here, and in particular I’m interested in an accessible discussion of the sorts
> of query optimizations Postgres can do for us.

I have a few talks on this:

    https://momjian.us/main/presentations/performance.html#optimizer
    https://momjian.us/main/presentations/internals.html#internal_pics

and this webpage:

    http://www.interdb.jp/pg/pgsql03.html

Robert Haas has a number of them:

    https://sites.google.com/site/robertmhaas/presentations

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EnterpriseDB                             https://enterprisedb.com

  The usefulness of a cup is in its emptiness, Bruce Lee




Re: Is there a good discussion of optimizations?

От
Michael Lewis
Дата:
On Wed, Dec 23, 2020 at 6:56 PM Guyren Howe <guyren@gmail.com> wrote:
I’d like to put together a good video and writeup about what the… philosophy behind relational databases is.

Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why — the philosophy — of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner — make this temp table, then update it some, then make this other temp table, etc… I see this particularly among analysts who for some reason often prefer SQL Server. I think this is down to afaict SQL Server having an abominable query optimizer.

I find temp tables quite helpful to get needed and consistent performance when doing large data warehouse type queries on source data especially when it isn't fully & properly normalized. Many row estimates being low because of correlation with specified client_id and sometimes having 15-25 tables involved in a report, has meant that temp tables (that are analyzed to ensure statistics are present) have seemed the best tool for the job. Perhaps that's all a hack though.

I look forward to when extended statistics may help with join planning and building out a comprehensive warehouse that facilitates use of simpler queries, but for now the "imperative straight-jacket" seems to help more often than it hurts.

RE: Is there a good discussion of optimizations?

От
Kevin Brannen
Дата:

From: Guyren Howe <guyren@gmail.com>

 

>Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why - the philosophy - of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner - make this temp table, then update it some, then make this other temp table, etc...

 

>Anyway, I’d like to put together something that explains this. I would go into Codd’s original insight and how simple yet powerful it is. I’d like to discuss how if you stick to SQL, rather than forcing it into this imperative straight-jacket, the database can work out all the details so you don’t have to do the bad things.

 

Be sure you point out where SQL either fails or comes up short to give a

balanced view. To be fair, sometimes the reason people perceive SQL as failing

them (and why they go with the imperative process) is because they don't know

everything SQL does, or even perhaps their tool doesn't implement the whole

standard.

 

An example of this is that we have a report we're trying to write that I'd

like to think can be done in SQL, but I can't think of a way to do it. Yet,

if I do the base query and pull the data back into my application, I can do

the last bit with 3 lines of Perl very easily. The problem here revolves

around comparing a row of data to the previous row to know whether the data

changed "significantly enough" to keep the new row.

 

Another example is doing running totals. A couple of years ago I would have

said SQL can't do that. Now I know about the OVER clause, something that I

would categorize as somewhat obscure, I can do it as needed.

 

As Michael Lewis pointed out, large dataset can also cause you to choose not

to use SQL in 1 big statemet for everything (i.e. advocating the use to temp

tables). In some ways, using a CTE is a type of temp table, or at least I

view it as such. That allows a person to solve a problem in bite-sized chunks.

I will agree that optimization can do it better at times, but the code also has

to be maintained as well – a balancing act.

 

To continue the above, the idea of using a sub-select in a FROM clause is

generally not thought of by new learners. So experience is also a factor.

 

I think your idea is a good one, but I do hope you present that SQL can't

solve everything ... else why do we have plpgsql. :) You’re correct though,

SQL isn’t used as much as it should be in many places.

 

Kevin

This e-mail transmission, and any documents, files or previous e-mail messages attached to it, may contain confidential information. If you are not the intended recipient, or a person responsible for delivering it to the intended recipient, you are hereby notified that any disclosure, distribution, review, copy or use of any of the information contained in or attached to this message is STRICTLY PROHIBITED. If you have received this transmission in error, please immediately notify us by reply e-mail, and destroy the original transmission and its attachments without reading them or saving them to disk. Thank you.

RE: Is there a good discussion of optimizations?

От
Guyren Howe
Дата:
On Jan 7, 2021, 13:07 -0800, Kevin Brannen <KBrannen@efji.com>, wrote:
From: Guyren Howe <guyren@gmail.com>
 
>Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why - the philosophy - of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner - make this temp table, then update it some, then make this other temp table, etc...

Actually, I’m mostly going to talk about the relational model, rather than SQL. Our industry seems to always settle for third-best, and SQL is the worst of all the examples of this. The world desperately needs a good relational database based on a better query language — datalog, for example.

I put up with SQL so I can use the relational model, and I think that understanding SQL has to start with that.

Anyhow.
An example of this is that we have a report we're trying to write that I'd
like to think can be done in SQL, but I can't think of a way to do it. Yet,
if I do the base query and pull the data back into my application, I can do
the last bit with 3 lines of Perl very easily. The problem here revolves
around comparing a row of data to the previous row to know whether the data
changed "significantly enough" to keep the new row.
 
Another example is doing running totals. A couple of years ago I would have
said SQL can't do that. Now I know about the OVER clause, something that I
would categorize as somewhat obscure, I can do it as needed.

Actually, Window functions might be “advanced”, but are certainly not obscure. Your example sounds like it’s trivially solved with LAG().
As Michael Lewis pointed out, large dataset can also cause you to choose not
to use SQL in 1 big statemet for everything (i.e. advocating the use to temp
tables). In some ways, using a CTE is a type of temp table, or at least I
view it as such. That allows a person to solve a problem in bite-sized chunks.
I will agree that optimization can do it better at times, but the code also has
to be maintained as well – a balancing act.

This appears to be good advice with SQL Server, which I’m coming to learn has a fairly poor query optimizer. But I would have thought Postgres’s optimizer would usually use a temporary table where appropriate.

Curious to hear if that’s wrong.
I think your idea is a good one, but I do hope you present that SQL can't
solve everything ... else why do we have plpgsql. :) You’re correct though,
SQL isn’t used as much as it should be in many places.

An important consideration will be when relational is inappropriate. My biggest goal, though, is to get folks to understand how much relations *can* do — far too many devs in my experience don’t use the power of SQL because they don’t understand it.

Thanks for taking the time to give me this feedback.

Re: Is there a good discussion of optimizations?

От
Florents Tselai
Дата:
Apologies for the shameless self-promotion :) 

Around a year ago I collected my thoughts on this topic. You can read the post here Modern Data Practice and the SQL Tradition .
It looks like it resonated with a lot of folks in the community. HN Discussion https://news.ycombinator.com/item?id=21482114

I would specifically underline the fact that the newer generation of programmers & data pros (my former self included) don't really appreciate things like triggers and server-side programming. Triggers and DB-side functions are considered something like Assembly code. 

Not many neophytes have been shown with use cases why for example writing 2-3 lines of PL/SQL can save you a huge overhead of back and forth and environment set up to write the same thing in say Pandas. 

I would focus on Triggers, Indices on expressions, and time-related functions. Probably on generated columns too. They may be considered a new feature, but the reasoning of building successively columns on top of a few base ones is quite appealing nowadays, especially for ML purposes.

I also wouldn't hesitate to discuss advanced topics. They are usually considered obscure because people don't present useful examples, rather toy and arbitrary ones. 

In a recent O'Reilly training, I was skeptical of talking about triggers for optimization but it looks like it was probably the most useful part of my training as students could actually "steal and tweak" my code.


Florents Tselai

Data Scientist & Engineer

Jack of All Trades and Master of Some

http://tselai.com


If I'm late for a meeting: I'll be there in 5'. If not, reread this.

Sent from mobile excuse any...Nah... typos are due to sheer laziness or neglect.



On Thu, Jan 7, 2021 at 11:21 PM Guyren Howe <guyren@gmail.com> wrote:
On Jan 7, 2021, 13:07 -0800, Kevin Brannen <KBrannen@efji.com>, wrote:
From: Guyren Howe <guyren@gmail.com>
 
>Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why - the philosophy - of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner - make this temp table, then update it some, then make this other temp table, etc...

Actually, I’m mostly going to talk about the relational model, rather than SQL. Our industry seems to always settle for third-best, and SQL is the worst of all the examples of this. The world desperately needs a good relational database based on a better query language — datalog, for example.

I put up with SQL so I can use the relational model, and I think that understanding SQL has to start with that.

Anyhow.
An example of this is that we have a report we're trying to write that I'd
like to think can be done in SQL, but I can't think of a way to do it. Yet,
if I do the base query and pull the data back into my application, I can do
the last bit with 3 lines of Perl very easily. The problem here revolves
around comparing a row of data to the previous row to know whether the data
changed "significantly enough" to keep the new row.
 
Another example is doing running totals. A couple of years ago I would have
said SQL can't do that. Now I know about the OVER clause, something that I
would categorize as somewhat obscure, I can do it as needed.

Actually, Window functions might be “advanced”, but are certainly not obscure. Your example sounds like it’s trivially solved with LAG().
As Michael Lewis pointed out, large dataset can also cause you to choose not
to use SQL in 1 big statemet for everything (i.e. advocating the use to temp
tables). In some ways, using a CTE is a type of temp table, or at least I
view it as such. That allows a person to solve a problem in bite-sized chunks.
I will agree that optimization can do it better at times, but the code also has
to be maintained as well – a balancing act.

This appears to be good advice with SQL Server, which I’m coming to learn has a fairly poor query optimizer. But I would have thought Postgres’s optimizer would usually use a temporary table where appropriate.

Curious to hear if that’s wrong.
I think your idea is a good one, but I do hope you present that SQL can't
solve everything ... else why do we have plpgsql. :) You’re correct though,
SQL isn’t used as much as it should be in many places.

An important consideration will be when relational is inappropriate. My biggest goal, though, is to get folks to understand how much relations *can* do — far too many devs in my experience don’t use the power of SQL because they don’t understand it.

Thanks for taking the time to give me this feedback.

Re: Is there a good discussion of optimizations?

От
Guyren Howe
Дата:
On Jan 7, 2021, 13:42 -0800, Florents Tselai <florents.tselai@gmail.com>, wrote:
Apologies for the shameless self-promotion :) 

Around a year ago I collected my thoughts on this topic. You can read the post here Modern Data Practice and the SQL Tradition .
It looks like it resonated with a lot of folks in the community. HN Discussion https://news.ycombinator.com/item?id=21482114

I would specifically underline the fact that the newer generation of programmers & data pros (my former self included) don't really appreciate things like triggers and server-side programming. Triggers and DB-side functions are considered something like Assembly code. 

Not many neophytes have been shown with use cases why for example writing 2-3 lines of PL/SQL can save you a huge overhead of back and forth and environment set up to write the same thing in say Pandas. 

I would focus on Triggers, Indices on expressions, and time-related functions. Probably on generated columns too. They may be considered a new feature, but the reasoning of building successively columns on top of a few base ones is quite appealing nowadays, especially for ML purposes.

I also wouldn't hesitate to discuss advanced topics. They are usually considered obscure because people don't present useful examples, rather toy and arbitrary ones. 

In a recent O'Reilly training, I was skeptical of talking about triggers for optimization but it looks like it was probably the most useful part of my training as students could actually "steal and tweak" my code.

Thanks for this. May I steal some of your examples if they prove useful? I’ll credit you of course.

I’m planning on somewhat emphasizing that a relational database is a logic engine. Viewed through this lens, a query or view is a “backward” implication and a trigger is a “forward” one. This leads to considering triggers (and the moral equivalent in external code) as requiring “truth maintenance”, and is a great way to think about when the database is the appropriate place for some bit of logic.
On Thu, Jan 7, 2021 at 11:21 PM Guyren Howe <guyren@gmail.com> wrote:
On Jan 7, 2021, 13:07 -0800, Kevin Brannen <KBrannen@efji.com>, wrote:
From: Guyren Howe <guyren@gmail.com>
 
>Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why - the philosophy - of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner - make this temp table, then update it some, then make this other temp table, etc...

Actually, I’m mostly going to talk about the relational model, rather than SQL. Our industry seems to always settle for third-best, and SQL is the worst of all the examples of this. The world desperately needs a good relational database based on a better query language — datalog, for example.

I put up with SQL so I can use the relational model, and I think that understanding SQL has to start with that.

Anyhow.
An example of this is that we have a report we're trying to write that I'd
like to think can be done in SQL, but I can't think of a way to do it. Yet,
if I do the base query and pull the data back into my application, I can do
the last bit with 3 lines of Perl very easily. The problem here revolves
around comparing a row of data to the previous row to know whether the data
changed "significantly enough" to keep the new row.
 
Another example is doing running totals. A couple of years ago I would have
said SQL can't do that. Now I know about the OVER clause, something that I
would categorize as somewhat obscure, I can do it as needed.

Actually, Window functions might be “advanced”, but are certainly not obscure. Your example sounds like it’s trivially solved with LAG().
As Michael Lewis pointed out, large dataset can also cause you to choose not
to use SQL in 1 big statemet for everything (i.e. advocating the use to temp
tables). In some ways, using a CTE is a type of temp table, or at least I
view it as such. That allows a person to solve a problem in bite-sized chunks.
I will agree that optimization can do it better at times, but the code also has
to be maintained as well – a balancing act.

This appears to be good advice with SQL Server, which I’m coming to learn has a fairly poor query optimizer. But I would have thought Postgres’s optimizer would usually use a temporary table where appropriate.

Curious to hear if that’s wrong.
I think your idea is a good one, but I do hope you present that SQL can't
solve everything ... else why do we have plpgsql. :) You’re correct though,
SQL isn’t used as much as it should be in many places.

An important consideration will be when relational is inappropriate. My biggest goal, though, is to get folks to understand how much relations *can* do — far too many devs in my experience don’t use the power of SQL because they don’t understand it.

Thanks for taking the time to give me this feedback.

Re: Is there a good discussion of optimizations?

От
Rob Sargent
Дата:


On 1/7/21 2:48 PM, Guyren Howe wrote:
On Jan 7, 2021, 13:42 -0800, Florents Tselai <florents.tselai@gmail.com>, wrote:
Apologies for the shameless self-promotion :) 

Around a year ago I collected my thoughts on this topic. You can read the post here Modern Data Practice and the SQL Tradition .
It looks like it resonated with a lot of folks in the community. HN Discussion https://news.ycombinator.com/item?id=21482114

I would specifically underline the fact that the newer generation of programmers & data pros (my former self included) don't really appreciate things like triggers and server-side programming. Triggers and DB-side functions are considered something like Assembly code. 

Not many neophytes have been shown with use cases why for example writing 2-3 lines of PL/SQL can save you a huge overhead of back and forth and environment set up to write the same thing in say Pandas. 

I would focus on Triggers, Indices on expressions, and time-related functions. Probably on generated columns too. They may be considered a new feature, but the reasoning of building successively columns on top of a few base ones is quite appealing nowadays, especially for ML purposes.

I also wouldn't hesitate to discuss advanced topics. They are usually considered obscure because people don't present useful examples, rather toy and arbitrary ones. 

In a recent O'Reilly training, I was skeptical of talking about triggers for optimization but it looks like it was probably the most useful part of my training as students could actually "steal and tweak" my code.

Thanks for this. May I steal some of your examples if they prove useful? I’ll credit you of course.

I’m planning on somewhat emphasizing that a relational database is a logic engine. Viewed through this lens, a query or view is a “backward” implication and a trigger is a “forward” one. This leads to considering triggers (and the moral equivalent in external code) as requiring “truth maintenance”, and is a great way to think about when the database is the appropriate place for some bit of logic.
On Thu, Jan 7, 2021 at 11:21 PM Guyren Howe <guyren@gmail.com> wrote:
On Jan 7, 2021, 13:07 -0800, Kevin Brannen <KBrannen@efji.com>, wrote:
From: Guyren Howe <guyren@gmail.com>
 
>Most folks, in my experience, who use relational databases don’t really understand the basic theory or even more important the why - the philosophy - of what a relational database is and how to get the most out of them. I see a lot of folks trying to use SQL in an imperative manner - make this temp table, then update it some, then make this other temp table, etc...

Actually, I’m mostly going to talk about the relational model, rather than SQL. Our industry seems to always settle for third-best, and SQL is the worst of all the examples of this. The world desperately needs a good relational database based on a better query language — datalog, for example.

I put up with SQL so I can use the relational model, and I think that understanding SQL has to start with that.

Anyhow.
An example of this is that we have a report we're trying to write that I'd
like to think can be done in SQL, but I can't think of a way to do it. Yet,
if I do the base query and pull the data back into my application, I can do
the last bit with 3 lines of Perl very easily. The problem here revolves
around comparing a row of data to the previous row to know whether the data
changed "significantly enough" to keep the new row.
 
Another example is doing running totals. A couple of years ago I would have
said SQL can't do that. Now I know about the OVER clause, something that I
would categorize as somewhat obscure, I can do it as needed.

Actually, Window functions might be “advanced”, but are certainly not obscure. Your example sounds like it’s trivially solved with LAG().
As Michael Lewis pointed out, large dataset can also cause you to choose not
to use SQL in 1 big statemet for everything (i.e. advocating the use to temp
tables). In some ways, using a CTE is a type of temp table, or at least I
view it as such. That allows a person to solve a problem in bite-sized chunks.
I will agree that optimization can do it better at times, but the code also has
to be maintained as well – a balancing act.

This appears to be good advice with SQL Server, which I’m coming to learn has a fairly poor query optimizer. But I would have thought Postgres’s optimizer would usually use a temporary table where appropriate.

Please include, for all those front-end coders who might want to hit the database, the expense/overhead involved.   I've seen "foreach id, read database, process record" all too often.

RE: Is there a good discussion of optimizations?

От
Kevin Brannen
Дата:

From Guyren Howe

>Actually, I’m mostly going to talk about the relational model, rather than SQL. Our industry seems to always settle for third-best, and SQL is the worst of all the examples of this. The world desperately needs a good relational database based on a better query language — datalog, for example.

 

>I put up with SQL so I can use the relational model, and I think that understanding SQL has to start with that.

 

I can understand that that. :) Yet SQL is the interface/lens we most use to interact with an RDBMS.

 

 

>>An example of this is that we have a report we're trying to write that I'd

like to think can be done in SQL, but I can't think of a way to do it. Yet,

if I do the base query and pull the data back into my application, I can do

the last bit with 3 lines of Perl very easily. The problem here revolves

around comparing a row of data to the previous row to know whether the data

changed "significantly enough" to keep the new row.

>>Another example is doing running totals. A couple of years ago I would have

said SQL can't do that. Now I know about the OVER clause, something that I

would categorize as somewhat obscure, I can do it as needed.

 

>Actually, Window functions might be "advanced", but are certainly not obscure. Your example sounds like it’s trivially solved with LAG().

 

Fair enough. Perhaps it would be better to say that many don't go to the depths to learn

window functions until forced to, and a lot of problems can be solved without them. I can

say that would be true for me and those I work with.

 

Thanks for the tip on LAG()! It's not a function I've used before (which makes it a great

example for the previous para).

 

 

>>As Michael Lewis pointed out, large dataset can also cause you to choose not

to use SQL in 1 big statemet for everything (i.e. advocating the use to temp

tables). In some ways, using a CTE is a type of temp table, or at least I

view it as such. That allows a person to solve a problem in bite-sized chunks.

I will agree that optimization can do it better at times, but the code also has

to be maintained as well – a balancing act.

 

>This appears to be good advice with SQL Server, which I’m coming to learn has a fairly poor query optimizer. But I would have thought Postgres’s optimizer would usually use a temporary table where appropriate.

 

I'm sure Pg's optimizer does its work better than I would manually, but that's not

always the point. For some I've talked to, thinking relationally and in sets is hard.

I'll even admit that while I think I do it pretty well, there are times I have to stop

and really consider what I'm doing. If you're used to thinking about how to solve a

problem with a computer language, most of them are imperative and that mindset can be

hard to get out of. So your "philosophy" approach is a well-aimed arrow in many ways, IMO.

 

Also, don't forget about "maintenance". Solving something in bite-sized chunks allows

for easier changes in the future, as well as understanding by newcomers -- generally speaking.

 

>Curious to hear if that’s wrong.

 

Depends on what version. In the earlier versions of Pg, CTE's were a fence and there was

no crossing over. That fence now has gates in some places in the latest versions from

what I read.

 

 

>>I think your idea is a good one, but I do hope you present that SQL can't

solve everything ... else why do we have plpgsql. :) You’re correct though,

SQL isn’t used as much as it should be in many places.

 

>An important consideration will be when relational is inappropriate. My biggest goal, though, is to get folks to understand how much relations *can* do — far too many devs in my experience don’t use the power of SQL because they don’t understand it.

 

>Thanks for taking the time to give me this feedback.

 

I appreciate the efforts of teachers in all forms. I hope your project goes well and the feedback has some use.

 

Kevin

This e-mail transmission, and any documents, files or previous e-mail messages attached to it, may contain confidential information. If you are not the intended recipient, or a person responsible for delivering it to the intended recipient, you are hereby notified that any disclosure, distribution, review, copy or use of any of the information contained in or attached to this message is STRICTLY PROHIBITED. If you have received this transmission in error, please immediately notify us by reply e-mail, and destroy the original transmission and its attachments without reading them or saving them to disk. Thank you.