Re: Remove_temp_files_after_crash and significant recovery/startup time
From: Jeremy Schneider
Subject: Re: Remove_temp_files_after_crash and significant recovery/startup time
Date:
Msg-id: 7514f739-41f1-63d6-a0ed-673448ace1b5@amazon.com
In reply to: Re: Remove_temp_files_after_crash and significant recovery/startup time (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List: pgsql-hackers
On 9/10/21 14:57, Tomas Vondra wrote:
> On 9/10/21 10:58 PM, McCoy, Shawn wrote:
>> I noticed that the new parameter remove_temp_files_after_crash is
>> currently set to a default value of "true" in the version 14 release.
>> It seems this was discussed in this thread [1], and it doesn't look to
>> me like there's been a lot of stress testing of this feature.
>
> Not sure what we could learn from a stress test? IMHO it's fairly
> natural that if there are many temporary files and/or if deleting a
> file is expensive on a given filesystem, the cleanup may take time.

The thing that comes to mind for me is just getting a sense of what the
curve looks like for number of files versus startup time. If I can find
some time, I'll poke around and share numbers.

A while ago, I worked with a PostgreSQL user who had a major outage
crisis on their primary production database. They were having some
minor issues, and they decided to do a "quick" restart to see if it
would clear things out. The restart ended up taking something like a
day or two, and the business was down the whole time. Working with them
to figure out what was happening, we found out that their very-DDL-heavy
workload had combined with a stuck checkpointer process: no checkpoints
had been completed for over a week. The only choices were waiting for
WAL to replay or taking data loss; we couldn't even get out of pain
with a restore from backup, since a restore still required replaying
all the same WAL.

There are certain core features in a database that you really need to
be as reliable and robust as possible. IMO, for critical production
databases, quick-as-possible restarts are one of those.

>> In our fleet there have been cases where we have seen hundreds of
>> thousands of temp files generated. I found a case where we helped a
>> customer that had a little over 2.2 million temp files. Single-
>> threaded cleanup of these takes a significant amount of time and
>> delays recovery.
>> In RDS, we mitigated this by moving the pgsql_tmp directory aside,
>> starting the engine, and then separately removing the old temp files.
>>
>> After noticing the current plans to default this GUC to "on" in v14,
>> just thought I'd raise the question of whether this should get a
>> little more discussion or testing with higher numbers of temp files?
>
> I doubt we can learn anything new from such testing.
>
> Of course, we can discuss the default for the GUC. I see it as a trade
> off between risk of running out of disk space and increased recovery
> time, and perhaps the decision to prioritize lower risk of running out
> of disk space was not the right one ...

I'm doing a little asking around with colleagues. I'm having trouble
finding cases where people went end-to-end and figured out exactly what
in the workload was causing the high number of temp files. However,
there seem to be a fair number of incidents with temp file counts in
the hundreds of thousands.

One thing that seems possible is that in some of these cases, the temp
files were accumulating across many engine crashes; those cases would
not be an issue once you started cleaning up on every restart. However,
I suspect there are still some cases where high connection counts and
some erratic workload characteristic or bugs are causing accumulation
without multiple crashes. If I learn more, I'll relay it along.

Frankly, if the GUC defaults to off, then we're a lot less likely to
find out if there /are/ issues. Kinda like LLVM and parallel query...
at some point you just have to turn it on, even if you're not 100% sure
where all the sharp edges are yet.

PostgreSQL meme: "I test in someone else's production"

All of that said, FWIW, if a restart is taking too long, then a user
can always turn the GUC off and cancel/retry the startup. So this is
not the same as a stuck checkpointer, because there's simple recourse.

For my part, I appreciate this discussion.
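As an aside, the "move pgsql_tmp aside" mitigation Shawn describes can be sketched roughly as below. This is only an illustration, not the actual RDS tooling: to stay runnable without a live cluster, it uses a scratch directory standing in for $PGDATA/base/pgsql_tmp (real clusters also keep per-tablespace pgsql_tmp directories), and the pg_ctl step is shown as a comment.

```shell
# Sketch of the mitigation: rename the temp directory out of the way so
# startup never has to walk it, then delete the stale files afterwards.
DEMO=$(mktemp -d)
mkdir "$DEMO/pgsql_tmp"
# Simulate a few leftover temp files (a crashed cluster may have millions).
for i in 1 2 3; do : > "$DEMO/pgsql_tmp/pgsql_tmp.$i"; done

# 1. A single rename is O(1) regardless of how many files are inside.
mv "$DEMO/pgsql_tmp" "$DEMO/pgsql_tmp.old"
mkdir "$DEMO/pgsql_tmp"    # give the server a fresh, empty temp dir

# 2. Start the server now; recovery skips the per-file unlink cost.
#    (real cluster: pg_ctl -D "$PGDATA" start)

# 3. Remove the stale files at leisure, off the critical recovery path.
rm -rf "$DEMO/pgsql_tmp.old"
```

With remove_temp_files_after_crash enabled, step 3 is effectively what the startup process does itself, one unlink at a time, before the database is available again.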
I missed it if these points were debated when the feature was first
committed, and I can see arguments both ways. It's not without
precedent to have a new feature turned off by default for its first
major release version. But we're talking about a corner case, and it's
not like users are without recourse.

-Jeremy

--
Jeremy Schneider
Database Engineer
Amazon Web Services