Re: Avoiding repeated snapshot computation
From: Andres Freund
Subject: Re: Avoiding repeated snapshot computation
Msg-id: 201111292155.52460.andres@anarazel.de
In reply to: Re: Avoiding repeated snapshot computation (Pavan Deolasee <pavan.deolasee@gmail.com>)
List: pgsql-hackers
On Tuesday, November 29, 2011 05:51:40 AM Pavan Deolasee wrote:
> On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
> > Andres Freund wrote:
> >> I would like to see somebody running a benchmark on a machine with
> >> higher concurrency...
> >
> > Yeah, me too. I don't find it at all hard to believe that we will
> > see significant performance benefit by changing the PGPROC structure
> > so that all parts of it can be accessed through one cache line rather
> > than two. The fact that we saw improvement when changing it down to
> > two, even though there is extra indirection in some places, seems
> > like pretty solid evidence on the face of it.
>
> I think there is a fundamental difference between the PGPROC patch that
> we did and the rearranging of SnapshotData members that Andres has
> proposed. The PGPROC structure is a shared memory area, very heavily
> accessed by GetSnapshotData (and some others). There was a linear
> increase in the access as the number of clients goes up. The SnapshotData
> structure is mostly in backend-local memory (unless we do something
> like what I suggested in the OP) and hence fitting it in a single
> cache line may or may not have much impact. We don't even guarantee
> that it starts at a cacheline, which means in most cases it will still
> be spread over multiple cache lines.

Well, I could measure a ~1.3% benefit on a modestly concurrent machine (see
the earlier reformatted mail from Gurjeet), and the benefits were definitely
smaller without concurrency. But I agree - the benefits won't be even
remotely as big as the PGPROC splitup patch.

You're also right about the alignment not being predictable enough. Perhaps
pg should start introducing a force_cacheline_align macro which can easily
be implemented on most compilers? Especially for SnapshotData and consorts,
which are aligned statically, that should be good enough.
Something along the lines of:

#if defined(__GNUC__)
#define force_align_to(alignment) __attribute__((aligned(alignment)))
#elif defined(_MSC_VER)
#define force_align_to(alignment) __declspec(align(alignment))
#else
#define force_align_to(alignment)
#endif

#define CACHELINE_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELINE_ALIGNMENT)

looks complete enough for now.

Back to PGPROC: looking at the current PGPROC layout on x86-64 with the
Linux ABI (generated by pahole):

struct PGPROC {
    SHM_QUEUE          links;                    /*     0    16 */
    PGSemaphoreData    sem;                      /*    16     8 */
    int                waitStatus;               /*    24     4 */
    Latch              procLatch;                /*    28    12 */
    LocalTransactionId lxid;                     /*    40     4 */
    int                pid;                      /*    44     4 */
    int                pgprocno;                 /*    48     4 */
    BackendId          backendId;                /*    52     4 */
    Oid                databaseId;               /*    56     4 */
    Oid                roleId;                   /*    60     4 */
    /* --- cacheline 1 boundary (64 bytes) --- */
    bool               recoveryConflictPending;  /*    64     1 */
    bool               lwWaiting;                /*    65     1 */
    bool               lwExclusive;              /*    66     1 */

    /* XXX 5 bytes hole, try to pack */

    struct PGPROC *    lwWaitLink;               /*    72     8 */
    LOCK *             waitLock;                 /*    80     8 */
    PROCLOCK *         waitProcLock;             /*    88     8 */
    LOCKMODE           waitLockMode;             /*    96     4 */
    LOCKMASK           heldLocks;                /*   100     4 */
    XLogRecPtr         waitLSN;                  /*   104     8 */
    int                syncRepState;             /*   112     4 */

    /* XXX 4 bytes hole, try to pack */

    SHM_QUEUE          syncRepLinks;             /*   120    16 */
    /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
    SHM_QUEUE          myProcLocks[16];          /*   136   256 */
    /* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
    struct XidCache    subxids;                  /*   392   256 */
    /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
    LWLockId           backendLock;              /*   648     4 */

    /* XXX 4 bytes hole, try to pack */

    uint64             fpLockBits;               /*   656     8 */
    Oid                fpRelId[16];              /*   664    64 */
    /* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */
    bool               fpVXIDLock;               /*   728     1 */

    /* XXX 3 bytes hole, try to pack */

    LocalTransactionId fpLocalTransactionId;     /*   732     4 */

    /* size: 736, cachelines: 12, members: 28 */
    /* sum members: 720, holes: 4, sum holes: 16 */
    /* last cacheline: 32 bytes */
};

It seems possible that reordering it to avoid sharing unrelated stuff on
the same cacheline might be beneficial. E.g. moving the fastpath lock stuff
away from the more common lwlock infrastructure might be a good idea.
Currently those lines will bounce continually even though the fastpath
stuff will mostly be accessed locally. subxids and syncRepLinks look like
good candidates too.

Anyway, sorry for the topic drift.

Andres