RE: Parallel Inserts in CREATE TABLE AS

From: houzj.fnst@fujitsu.com
Subject: RE: Parallel Inserts in CREATE TABLE AS
Date:
Msg-id: OS0PR01MB571699DD426B43CD2F9A106C94239@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to: Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses: Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Re: Parallel Inserts in CREATE TABLE AS  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List: pgsql-hackers
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Sent: Wednesday, May 26, 2021 7:22 PM
> Thanks for trying that out.
> 
> Please see the code around the use_fsm flag in RelationGetBufferForTuple for
> more understanding of the points below.
> 
> What happens if FSM is skipped i.e. myState->ti_options =
> TABLE_INSERT_SKIP_FSM;?
> 1) The flag use_fsm will be false in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it is full.
> When the block is full, the worker doesn't look in FSM GetPageWithFreeSpace
> as use_fsm is false. It directly goes for relation extension and tries to acquire
> relation extension lock with LockRelationForExtension. Note that the bulk
> extension of blocks with RelationAddExtraBlocks is not reached as use_fsm is
> false.
> 3) After acquiring the relation extension lock, it adds a new block with
> ReadBufferBI(relation, P_NEW, ...); see the comment "In addition to whatever
> extension we performed above, we always add at least one block to satisfy our
> own request." The tuple is inserted into this new block.
> 
> Basically, the workers can't look for the empty pages from the pages added by
> other workers, they keep doing the above steps in silos.
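
Just to confirm my understanding of that skip-FSM path, here is a condensed
sketch of the use_fsm = false branch of RelationGetBufferForTuple as you
describe it (my own simplification, not the literal code):

    /* condensed sketch of the use_fsm = false branch, not the literal code */
    if (!use_fsm)
    {
        /*
         * No GetPageWithFreeSpace() lookup and no RelationAddExtraBlocks():
         * pages added by other workers stay invisible to this worker.
         */
        LockRelationForExtension(relation, ExclusiveLock);

        /* "we always add at least one block to satisfy our own request" */
        buffer = ReadBufferBI(relation, P_NEW, ...);

        UnlockRelationForExtension(relation, ExclusiveLock);
    }

So each worker pays one relation extension lock acquisition per newly filled
page, which matches the contention you describe.
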
> 
> What happens if FSM is not skipped i.e. myState->ti_options = 0;?
> 1) The flag use_fsm will be true in heap_insert->RelationGetBufferForTuple.
> 2) Each worker initially gets a block and keeps inserting into it until it is full.
> When the block is full, the worker looks for the page with free space in FSM
> GetPageWithFreeSpace as use_fsm is true.
> 3) If it can't find any page with the required amount of free space, it goes
> for bulk relation extension (RelationAddExtraBlocks) after acquiring the
> relation extension lock with ConditionalLockRelationForExtension. The worker
> then adds extraBlocks = Min(512, lockWaiters * 20); new blocks in
> RelationAddExtraBlocks and immediately updates the bottom level of the FSM for
> each block (see the comment around RecordPageWithFreeSpace for why only the
> bottom level, not the entire FSM tree). After all the blocks are added, it
> updates the entire FSM tree with FreeSpaceMapVacuumRange.
> 4) After the bulk extension, the worker adds one more block (see the comment
> "In addition to whatever extension we performed above, we always add at least
> one block to satisfy our own request.") and inserts the tuple into this new
> block.
> 
> Basically, the workers can benefit from the bulk extension of the relation,
> and they can always look for empty pages among the pages added by other
> workers. There is a high chance that blocks will be available after a bulk
> extension. Having said that, if the added extra blocks are consumed by the
> workers very fast, i.e. if the tuples are big and very few tuples fit on a
> page, then even the bulk extension can't help much and there will be more
> contention on the relation extension lock. Well, one might think of adding
> more blocks at a time, say Min(1024, lockWaiters * 128/256/512), instead of
> the current extraBlocks = Min(512, lockWaiters * 20);. This will work (i.e. we
> don't see any regression with the parallel inserts in CTAS patches), but it
> can't be a practical solution, because the relation will end up with more
> total pages, many of them with a lot of free space. Furthermore, future
> sequential scans on that relation might take a lot of time.
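
The sizing trade-off in that last point is easy to put numbers on. Below is a
small standalone C program of mine (a toy calculation, not PostgreSQL code; it
assumes the standard 8 kB block, and the second formula is just the
hypothetical Min(1024, lockWaiters * 512) variant mentioned above):

    /* Toy comparison of bulk-extension sizing formulas; not PostgreSQL code. */
    #include <stdio.h>

    #define BLCKSZ 8192                         /* assume the standard 8 kB block */

    static int min_int(int a, int b) { return a < b ? a : b; }

    int main(void)
    {
        int waiters[] = {1, 2, 4, 8, 16, 32};

        printf("%8s %12s %8s %12s %8s\n",
               "waiters", "cur blocks", "cur MB", "big blocks", "big MB");
        for (int i = 0; i < 6; i++)
        {
            int w = waiters[i];
            int cur = min_int(512, w * 20);     /* current formula */
            int big = min_int(1024, w * 512);   /* hypothetical aggressive variant */

            printf("%8d %12d %8.2f %12d %8.2f\n", w,
                   cur, (double) cur * BLCKSZ / (1024 * 1024),
                   big, (double) big * BLCKSZ / (1024 * 1024));
        }
        return 0;
    }

With only 2-4 lock waiters the current formula adds just 40-80 blocks (roughly
0.3-0.6 MB) per bulk extension, while the aggressive variant would add the full
8 MB each time; that is exactly where the concern about more total pages with
more free space comes from.
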
> 
> If myState->ti_options = TABLE_INSERT_SKIP_FSM; is set in only one place
> (within if (myState->is_parallel)), then it is effective only for the leader,
> i.e. the leader will not look at the FSM, but all the workers will, because
> within if (myState->is_parallel_worker) in intorel_startup,
> myState->ti_options = 0; is set for the workers.
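
For my own clarity, this is how I read the ti_options handling you describe in
intorel_startup (a condensed restatement, not the actual patch hunk):

    /* condensed restatement of the behaviour described above, not the patch code */
    if (myState->is_parallel)
        myState->ti_options = TABLE_INSERT_SKIP_FSM;    /* leader: FSM skipped */
    ...
    if (myState->is_parallel_worker)
        myState->ti_options = 0;                        /* workers: FSM used */

So the leader would keep extending one block at a time while the workers go
through the FSM, if I am reading it correctly.
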
> 
> I ran tests with the configuration shown at [1] for case 4 (2 bigint columns
> of 8 bytes each, 16 name columns of 64 bytes each, tuple size 1064 bytes, 10
> million tuples) with leader participation, where I'm seeing the regression:
> 
> 1) when myState->ti_options = TABLE_INSERT_SKIP_FSM; for both leader and
> workers, then my results are as follows:
> 0 workers - 116934.137, 2 workers - 209802.060, 4 workers - 248580.275
> 2) when myState->ti_options = 0; for both leader and workers, then my results
> are as follows:
> 0 workers - 1116184.718, 2 workers - 139798.055, 4 workers - 143022.409
> I hope the above explanation and the test results clarify that skipping the
> FSM doesn't solve the problem. Let me know if anything is not clear or if I'm
> missing something.

Thanks for the explanation.
I followed your test steps above and the configuration below, but my test results are a little different from yours.
I am not sure of the exact reason; maybe it is because of the hardware.

Test: INSERT 10000000 rows (2 bigint columns of 8 bytes each and 16 name columns of 64 bytes each):
SERIAL:                            22023.631 ms
PARALLEL 2 WORKER [NOT SKIP FSM]:  21824.934 ms  [SKIP FSM]: 19381.474 ms
PARALLEL 4 WORKER [NOT SKIP FSM]:  20481.117 ms  [SKIP FSM]: 18381.305 ms

I am afraid that using the FSM does not seem to give a stable performance gain (at least on my machine).
I will take a deeper look into this to figure out the difference. A naive idea is that the benefit the bulk
extension brings is not much greater than the cost of going through the FSM.
Do you have any thoughts on it?
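
To put some rough numbers on that naive idea, here is a toy estimate of mine
(standalone C, not PostgreSQL code; it assumes the standard 8 kB block, a
24-byte page header and 4-byte line pointers, and uses the 1064-byte tuple size
from your case 4):

    /* Toy estimate of page and bulk-extension counts; not PostgreSQL code. */
    #include <stdio.h>

    #define BLCKSZ            8192      /* assumed block size */
    #define PAGE_HEADER_SZ      24      /* assumed page header size */
    #define LINE_POINTER_SZ      4      /* assumed line pointer size */

    int main(void)
    {
        long nrows = 10000000L;         /* 10 million rows, as in the test */
        int  tuple_len = 1064;          /* tuple size from case 4 */

        int  tuples_per_page = (BLCKSZ - PAGE_HEADER_SZ) / (tuple_len + LINE_POINTER_SZ);
        long pages = (nrows + tuples_per_page - 1) / tuples_per_page;

        /* extraBlocks = Min(512, lockWaiters * 20), e.g. with 4 lock waiters */
        int  blocks_per_bulk_ext = 4 * 20;
        long bulk_extensions = pages / blocks_per_bulk_ext;

        printf("tuples per page   : %d\n", tuples_per_page);
        printf("pages for %ld rows: %ld (~%.1f GB)\n",
               nrows, pages, (double) pages * BLCKSZ / (1024.0 * 1024 * 1024));
        printf("bulk extensions   : ~%ld (at %d blocks each)\n",
               bulk_extensions, blocks_per_bulk_ext);
        return 0;
    }

Only about 7 such tuples fit on a page, so the load needs on the order of 1.4
million pages and, with 4 lock waiters, tens of thousands of bulk extensions,
each taking the extension lock and updating the FSM, while every page switch
also pays a GetPageWithFreeSpace lookup. That might be why the FSM gain looks
thin here, but I have not verified it yet.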

My test machine:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
Stepping:            7
CPU MHz:             2901.005
CPU max MHz:         3200.0000
CPU min MHz:         1000.0000
BogoMIPS:            4400.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            14080K

Best regards,
houzj

> [1] postgresql.conf parameters used:
> shared_buffers = 40GB
> max_worker_processes = 32
> max_parallel_maintenance_workers = 24
> max_parallel_workers = 32
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_wal_size = 24GB
> min_wal_size = 15GB
> autovacuum = off
> port = 5440
> 
> System Configuration:
> RAM:     528GB
> Disk Type:   SSD
> Disk Size:   1.5TB
> lscpu
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                128
> On-line CPU(s) list:   0-127
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
> Stepping:              2
> CPU MHz:               1064.000
> CPU max MHz:           2129.0000
> CPU min MHz:           1064.0000
> BogoMIPS:              4266.62
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K

