On 2017-12-20 06:27, Michael Paquier wrote:
> On Wed, Dec 20, 2017 at 7:46 AM, Erik Rijkers <er@xs4all.nl> wrote:
TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(safeXid,
snap->xmin))", File: "snapbuild.c", Line: 580)
>> Sorry, that was probably too terse, I should explain that a little.
>>
>> After initing 50 instances, I set up and run a pgbench session in the
>> master session; the pgbench lines are:
>>
>> init: pgbench --port=6515 --quiet --initialize --scale=1 postgres
>> run: pgbench -M prepared -c 16 -j 8 -T 1 -P 1 -n postgres -- scale 1
>>
>> the other instances then catch up. The whole takes 5 minutes or so
>>
>> I vary scale, duration, and number of instances. I haven't had it fail
>> in this way yet but I mostly tried with lower number of instances (up
>> to 25 or so).
>
> Hm. Are you saying that it takes at least 50 cascading instances to
> see the problem you are seeing? And that you are not seeing any
> problems with a lower number of cascading instances? Are you enabling
> hot_standby_feedback?

That sounds more definitive than I meant it, but yes, only now that I
tried a higher number of instances did I see this. But it also often
succeeds at up to 100 instances (100 is the highest I have tried).

These 50 instances were a logical replication chain, and
hot_standby_feedback is off.

Overnight I ran the test that failed yesterday 80 times: all 80 runs
succeeded. I am not sure what makes it fail rather than succeed.

(Logical replication does the initial syncing of the instances one by
one (sequentially), so it isn't as busy as expected; it just takes a long
time.)

I wrote a simple perl program to test logical replication (attached,
FWIW), running:

  ./cascade.pl --instances=50 --scale=1 --clients=16 --threads=8 \
      --duration=1 --repeats=3 --waiting=10

This cascade.pl program uses knowledge of my setup so probably won't run
elsewhere as is but it shows how the failing test was done.

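
For readers without the attachment, here is a minimal sketch of how such a
logical replication chain could be wired up. It is not the attached
cascade.pl. It assumes that instance i listens on port 6515 + i (only 6515
appears above; the consecutive numbering is an assumption), that the pgbench
tables already exist on every instance, and the names pubN/subN and the
function cascade_sql are hypothetical. It only generates the SQL; it does
not connect to anything.

```python
def cascade_sql(instances, base_port=6515, dbname="postgres"):
    """Return {port: [sql, ...]} for a chain of logical replication instances.

    Instance 0 is the master; every later instance subscribes to the one
    before it, and every instance except the last publishes to the next.
    Ports base_port, base_port + 1, ... are an assumption for illustration.
    """
    if instances < 2:
        raise ValueError("a chain needs at least two instances")

    ports = [base_port + i for i in range(instances)]
    plan = {port: [] for port in ports}

    for i, port in enumerate(ports):
        # Every instance except the last one publishes its tables downstream.
        if i < instances - 1:
            plan[port].append(f"CREATE PUBLICATION pub{i} FOR ALL TABLES;")
        # Every instance except the first one subscribes to its upstream.
        if i > 0:
            upstream = ports[i - 1]
            plan[port].append(
                f"CREATE SUBSCRIPTION sub{i} "
                f"CONNECTION 'port={upstream} dbname={dbname}' "
                f"PUBLICATION pub{i - 1};"
            )
    return plan


if __name__ == "__main__":
    # Print the statements in chain order, so each publication exists
    # before the downstream instance subscribes to it.
    for port, statements in cascade_sql(3).items():
        for sql in statements:
            print(f"port {port}: {sql}")
```

Running the statements in port order matters because CREATE SUBSCRIPTION
connects to the upstream and expects the named publication to be there
already.
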
Erik