Do we need to rethink how to parallelize regression tests to speedup CLOBBER_CACHE_ALWAYS?

From:
David Rowley
Date:
Right now Tom is doing a bit of work to try to improve the
performance of regression test runs with CLOBBER_CACHE_ALWAYS.  I'm
on board with making this go faster too.

I did a CLOBBER_CACHE_ALWAYS run today and it took my machine almost 7
hours to complete.  I occasionally checked top -c and was a bit
disappointed that for the majority of the time just a single backend
was busy.  The reason is that most groups have some test that takes
much longer to run than the others, and I often just caught the run
after it had finished all the faster tests in a group and was stuck on
the slow one.

I did a bit of analysis into the runtimes and found that:

1. Without parallelism, the total run-time of all tests was 12.29 hours.
2. The actual run took 6.45 hours. (I took the max time from each
group and summed those.)

That means the average number of backends utilized was about 1.9.
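Spelled out, that 1.9 is just the serial total divided by the sum of
per-group maximums.  A minimal sketch of the calculation in Python,
with made-up group timings standing in for the attached results:

    # Hypothetical per-test timings (hours), grouped as in parallel_schedule.
    groups = [
        [0.50, 0.20, 0.10],   # a group of quick tests
        [3.00, 0.40, 0.30],   # a group dominated by one slow test
    ]

    serial_total = sum(t for g in groups for t in g)   # everything run one by one
    wall_clock = sum(max(g) for g in groups)           # each group waits for its slowest test

    # With the real timings this is 12.29 / 6.45, i.e. about 1.9 backends busy.
    print(serial_total / wall_clock)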

I wondered if there might be a better way to handle how parallel tests
work in pg_regress.  We have many parallel groups that have reached 20
tests, and we often just create another parallel group because of the
rule that a group must not exceed 20 tests.  In many cases, an
otherwise idle backend could get busy running another test instead of
sitting around.

Right now we start 1 backend for each test in a parallel group then
wait for the final backend to complete before running the next group.

Is there a particular reason for it to work that way?

Why can we not just have a much larger parallel group that lumps
together all the tests which have no special need to avoid running
concurrently (either at all, or alongside one particular other test),
and run all of those with up to N workers?  Once a worker completes a
test, give it another to process until there are none left.  We could
still limit the total concurrency with --max-connections=20.  I don't
think we'd need to make any code changes to make this idea work.
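As a sketch of that pool model (not pg_regress code; run_test() here
is a hypothetical stand-in for running one test script through psql):

    from concurrent.futures import ThreadPoolExecutor

    def run_test(name):
        # Stand-in for launching a psql session for sql/<name>.sql and
        # diffing the output against expected/<name>.out.
        pass

    # All tests with no special concurrency requirements go in one big pool.
    pool_tests = ["boolean", "char", "int4", "join", "privileges"]  # etc.

    with ThreadPoolExecutor(max_workers=8) as pool:  # N workers
        # As soon as a worker finishes one test it is handed the next,
        # rather than sitting idle until the whole group is done.
        list(pool.map(run_test, pool_tests))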

I did the maths on that, and if it worked that way, and assuming none
of the parallel tests mind being run at the same time as any other
parallel test, then the theoretical run-time comes down to 3.75 hours
with 8 workers, or 4.11 hours with 4 workers.  The primary reason it
does not become much faster is that the "privileges" test takes 3
hours.  If I calculate assuming 128 workers, the time only drops to
3.46 hours.  At that point there are enough workers to start the slow
privileges test on a worker that hasn't done anything else yet, so the
3.46 hours is just the time for the privileges test plus the time to
do the serial tests, one by one.

For the above, I didn't do anything to change the order of the tests
so that the long-running ones start first, but if I do that, I can get
the time down to 3.46 hours with just 4 workers.  That's 1.86x faster
than my run.
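A simple greedy model (each test goes to whichever worker frees up
first) produces estimates of this kind.  A rough Python sketch, using
placeholder timings rather than the attached ones:

    import heapq

    def estimated_wall_clock(durations, workers, longest_first=False):
        # Greedy list scheduling: each test is handed to whichever worker
        # becomes free first.  Durations are in hours.
        tests = sorted(durations, reverse=True) if longest_first else list(durations)
        busy_until = [0.0] * workers
        heapq.heapify(busy_until)
        for t in tests:
            heapq.heappush(busy_until, heapq.heappop(busy_until) + t)
        return max(busy_until)

    # Placeholder timings: one 3-hour "privileges" plus many shorter tests.
    timings = [3.0] + [0.05] * 180

    print(estimated_wall_clock(timings, workers=4))
    print(estimated_wall_clock(timings, workers=4, longest_first=True))

The serial-only tests just add a fixed amount on top of whatever the
pool takes.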

I've attached a text file with the method I used to calculate each of
the numbers above and I've also attached the results with timings from
my CLOBBER_CACHE_ALWAYS run for anyone who'd like to check my maths.

If I split the "privileges" test into 2 even parts, then 8 workers
would run the tests in 1.95 hours, which is 3.2x faster than my run.

David

Attachments
David Rowley <dgrowleyml@gmail.com> writes:
> Right now we start 1 backend for each test in a parallel group then
> wait for the final backend to complete before running the next group.

> Is there a particular reason for it to work that way?

There are a whole lot of cases where test Y depends on an earlier test X.
Some of those dependencies are annotated in parallel_schedule, but I fear
most are not.

If we had a full list of such dependencies then we could imagine building
a job scheduler that would dispatch any script that has no remaining
dependencies.

The cases where "script X can't run concurrently with script Y" are
also problematic.  It's not as easy to discover those through testing,
since it might happen to work depending on timing.
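If we did have those lists, the dispatcher itself could be fairly
simple.  A rough sketch (the dependency and exclusion data below are
purely hypothetical):

    # Purely hypothetical dependency and exclusion data for test scripts.
    deps = {"constraints": {"create_table"}, "create_index": {"create_table"}}
    cant_run_with = {frozenset({"vacuum", "stats"})}

    def can_dispatch(test, done, running):
        # Dispatch a script only once all of its dependencies have finished
        # and nothing it must not run beside is currently running.
        if not deps.get(test, set()) <= done:
            return False
        return not any(frozenset({test, other}) in cant_run_with
                       for other in running)

    # The scheduler loop would repeatedly pick any not-yet-run test for
    # which can_dispatch() is true, start it on a free worker, and move
    # it to `done` when it completes.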

            regards, tom lane



On Thu, 13 May 2021 at 01:50, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There are a whole lot of cases where test Y depends on an earlier test X.
> Some of those dependencies are annotated in parallel_schedule, but I fear
> most are not.
>
> If we had a full list of such dependencies then we could imagine building
> a job scheduler that would dispatch any script that has no remaining
> dependencies.

I wonder if it could be done by starting a new parallel group and then
just moving existing tests into it, first verifying that:

1.  The test does not display results from any pg_catalog table, or if
it does, the filter is restrictive enough that there's no possibility
that the results will change due to other sessions changing the
catalogues.
2.  If the test creates any new objects, those objects have names that
are unlikely to conflict with other tests, e.g. no table names like t1.
3.  The test does not INSERT/DELETE/UPDATE/VACUUM/ALTER/ANALYZE any
tables that exist for more than one test.
4.  The test does not globally modify the system state, e.g. with
ALTER SYSTEM.

We could document in parallel_schedule that tests in this particular
group must meet the above requirements, plus any others I've not
thought about, and update that list as we discover other things I've
neglected to consider.
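For illustration only, a crude scan of a test script for the most
obvious violations of the above might look something like this (the
patterns are just placeholders, and rule 3 in particular can't
realistically be checked by pattern matching):

    import re

    # Illustrative-only patterns for the rules above; far from exhaustive.
    RED_FLAGS = [
        (re.compile(r"\bpg_(class|attribute|proc|type)\b", re.I),
         "displays results from the catalogues (rule 1)"),
        (re.compile(r"\bcreate\s+table\s+t\d+\b", re.I),
         "generic object name likely to clash (rule 2)"),
        (re.compile(r"\balter\s+system\b", re.I),
         "modifies global system state (rule 4)"),
    ]

    def check_script(path):
        sql = open(path).read()
        return [why for pattern, why in RED_FLAGS if pattern.search(sql)]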

I hope that now that we no longer have serial_schedule, and there's
just one source of truth for the tests, the comments in
parallel_schedule are more likely to be read and kept up to date.

I imagine there are many tests that could also just be run entirely
inside a single begin; ... commit;.  That would mean any catalogue
changes they made would not be visible to any other test that happens
to query the catalogues.

David