Thread: Query Crashes PG 10.3 using partitions, works on 9.6


Query Crashes PG 10.3 using partitions, works on 9.6

From: Cory Tucker
Date:
Hey guys, I am in the middle of testing out a database migration from 9.6 to 10.3.  We have a quasi-multi-tenant application and so are utilizing native partitions on some relations to help improve some performance.

I was issuing a query on both databases to clean up some duplicates in preparation for applying new indexes.  On the 9.6 database, with all the data in one table, the query runs fine in about 6 min.  On 10.3, with a work_mem setting of 1GB, the query runs for about 7 minutes and then gets terminated with an out-of-memory error.  If I bump work_mem up fairly high (12GB out of 52GB available), the server actually crashes.

On both databases, the total dataset size is exactly the same, ~29M records.  The table looks like this:

                      Table "candidate_person"
    Column    |           Type           | Collation | Nullable | Default
--------------+--------------------------+-----------+----------+---------
 created_at   | timestamp with time zone |           | not null | now()
 modified_at  | timestamp with time zone |           | not null | now()
 account_id   | bigint                   |           | not null |
 candidate_id | character varying(40)    | C         | not null |
 first_name   | text                     |           |          |
 middle_name  | text                     |           |          |
 last_name    | text                     |           |          |
 spouse       | boolean                  |           | not null | false
 suffix       | text                     |           |          |
 salutation   | text                     |           |          |
Partition key: LIST (account_id)

The only difference is that on 9.6 there obviously aren't any partitions, while on 10.3 there are ~250 partition tables.
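For reference, the 10.3 setup is roughly equivalent to the DDL below; the partition names and account id values are made up, but the columns match the \d output above:

```sql
CREATE TABLE candidate_person (
    created_at   timestamptz  NOT NULL DEFAULT now(),
    modified_at  timestamptz  NOT NULL DEFAULT now(),
    account_id   bigint       NOT NULL,
    candidate_id varchar(40)  COLLATE "C" NOT NULL,
    first_name   text,
    middle_name  text,
    last_name    text,
    spouse       boolean      NOT NULL DEFAULT false,
    suffix       text,
    salutation   text
) PARTITION BY LIST (account_id);

-- one partition per account; ~250 of these in total
CREATE TABLE candidate_person_1 PARTITION OF candidate_person
    FOR VALUES IN (1);
```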

I have attached the server log that shows the first out of memory and then the server crash and recovery.
Attachments

RE: Query Crashes PG 10.3 using partitions, works on 9.6

From: "Kumar, Virendra"
Date:

It would be nice if you could attach the explain plan; of course, explain analyze is not going to work if the server is crashing.
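A plain EXPLAIN only plans the statement without executing it, so it should be safe to run even though execution crashes. Something like the sketch below; the WHERE clause is just a placeholder, since the actual DELETE was not posted:

```sql
-- Plans the DELETE without running it; no rows are touched.
EXPLAIN (VERBOSE, COSTS)
DELETE FROM candidate_person
WHERE account_id = 42;  -- placeholder predicate
```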


Regards,

Virendra


From: Cory Tucker [mailto:cory.tucker@gmail.com]
Sent: Wednesday, March 28, 2018 5:49 PM
To: pgsql-general@postgresql.org
Subject: Query Crashes PG 10.3 using partitions, works on 9.6




This message is intended only for the use of the addressee and may contain
information that is PRIVILEGED AND CONFIDENTIAL.

If you are not the intended recipient, you are hereby notified that any
dissemination of this communication is strictly prohibited. If you have
received this communication in error, please erase all copies of the message
and its attachments and notify the sender immediately. Thank you.

RE: Query Crashes PG 10.3 using partitions, works on 9.6

From: "Kumar, Virendra"
Date:

Hi Cory,


You are running the query (a DELETE statement) as if the table were not partitioned, which is causing the server to crash.

Please run that query for each partition separately, in a loop with a dynamic query, and you should see the improvement. It should be pretty quick.
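A rough sketch of that loop, assuming duplicates are rows sharing a candidate_id within a partition; adjust the inner DELETE to match your real duplicate definition:

```sql
-- Iterate over the partitions of candidate_person and de-duplicate each
-- one separately, so no single statement has to touch all ~250 partitions.
DO $$
DECLARE
    part regclass;
BEGIN
    FOR part IN
        SELECT inhrelid::regclass
        FROM pg_inherits
        WHERE inhparent = 'candidate_person'::regclass
    LOOP
        -- ctid comparison is safe here because it stays within one partition
        EXECUTE format(
            'DELETE FROM %s a USING %s b
              WHERE a.candidate_id = b.candidate_id
                AND a.ctid < b.ctid', part, part);
    END LOOP;
END $$;
```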


Regards,

Virendra.


Re: Query Crashes PG 10.3 using partitions, works on 9.6

From: Tom Lane
Date:
Cory Tucker <cory.tucker@gmail.com> writes:
> I was issuing a query on both databases to cleanup some duplicates in
> preparation of applying new indexes.  On the 9.6 database with all the data
> in one table, the query runs fine in about 6 min.  On 10.3, with a work_mem
> setting of 1GB the query runs for about 7 minutes and then gets terminated
> with an out of memory error.

Hm, this seems a bit excessive:

  MessageContext: 1333788672 total in 169 blocks; 2227176 free (9 chunks); 1331561496 used

and this is really grim:

        65678 more child contexts containing 47607478048 total in 2577 blocks; 12249392 free (446 chunks); 47595228656 used

and this is just silly:

2018-03-28 19:20:33.264 UTC [10580] cory@match ERROR:  out of memory
2018-03-28 19:20:33.264 UTC [10580] cory@match DETAIL:  Failed on request of size 1610612736.

Can you extract a self-contained test case that uses unreasonable amounts
of memory?  It seems from this trace that the wheels are coming off in
at least two places, but identifying exactly where is impossible without
more info.

If you can't make a publishable test case, capturing a stack trace from
the point of the OOM error (set the breakpoint at errfinish) would
probably be enough info to figure out what is trying to grab 1.6GB in one
bite.  But it won't help us find out why so many empty ExprContexts are
getting created.
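For example, assuming you can attach a debugger to the backend running the query (get its pid from pg_stat_activity):

```
gdb -p "$BACKEND_PID" \
    -ex 'break errfinish' \
    -ex 'continue'
# when the breakpoint fires, at the (gdb) prompt:
#   bt
```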

            regards, tom lane


Re: Query Crashes PG 10.3 using partitions, works on 9.6

From: Cory Tucker
Date:
> Can you extract a self-contained test case that uses unreasonable amounts
> of memory?  It seems from this trace that the wheels are coming off in
> at least two places, but identifying exactly where is impossible without
> more info.

I will try to make a test case.  The data in this table is PII, so I can't just export it, but I might be able to replicate it using fake data.

Will try in the next few days and hopefully post back.

--Cory

Re: Query Crashes PG 10.3 using partitions, works on 9.6

From: Cory Tucker
Date:


> You are running the query (a DELETE statement) as if the table were not partitioned, which is causing the server to crash.
>
> Please run that query for each partition separately, in a loop with a dynamic query, and you should see the improvement. It should be pretty quick.


I understand that I can do that, and that it would be quicker. But the point of the email was that this thing, which works on 9.6 and is valid SQL, causes the server to crash on a newer version using a valid, supported configuration.

Re: Query Crashes PG 10.3 using partitions, works on 9.6

From: Tom Lane
Date:
Cory Tucker <cory.tucker@gmail.com> writes:
>> Can you extract a self-contained test case that uses unreasonable amounts
>> of memory?  It seems from this trace that the wheels are coming off in
>> at least two places, but identifying exactly where is impossible without
>> more info.

> I will try to make a test case.  The data in this table is PII so I can't
> just export it, but I might be able to replicate using fake data.

My guess is that the specific data is not critical at all, just the
table properties.  You might need to take care that the same EXPLAIN
plan is selected, which might require fake data that has roughly
similar properties as to number of distinct values and so on.
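For instance, something along these lines might get close; the row count and modulus values below are guesses to be tuned until EXPLAIN chooses the same plan as with the real data:

```sql
-- ~29M rows over 250 accounts, with deliberate duplicate candidate_ids
INSERT INTO candidate_person (account_id, candidate_id, first_name, last_name)
SELECT (g % 250) + 1,                 -- list-partition key: 250 accounts
       md5((g % 10000000)::text),     -- ~10M distinct ids, so some duplicates
       'first_' || (g % 5000),
       'last_'  || (g % 20000)
FROM generate_series(1, 29000000) AS g;
ANALYZE candidate_person;
```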

            regards, tom lane