Re: Avoiding repeated snapshot computation

Поиск
Список
Период
Сортировка
От Andres Freund
Тема Re: Avoiding repeated snapshot computation
Дата
Msg-id 201111292155.52460.andres@anarazel.de
обсуждение исходный текст
Ответ на Re: Avoiding repeated snapshot computation  (Pavan Deolasee <pavan.deolasee@gmail.com>)
Список pgsql-hackers
On Tuesday, November 29, 2011 05:51:40 AM Pavan Deolasee wrote:
> On Tue, Nov 29, 2011 at 8:30 AM, Kevin Grittner
> 
> <Kevin.Grittner@wicourts.gov> wrote:
> > Andres Freund  wrote:
> >> I would like to see somebody running a benchmark on a machine with
> >> higher concurrency...
> > 
> > Yeah, me too.  I don't find it at all hard to believe that we will
> > see significant performance benefit by changing the PGPROC structure
> > so that all parts of it can be accessed through one cache line rather
> > than two.  The fact that we saw improvement when changing it down to
> > two, even though there is extra indirection in some places, seems
> > like pretty solid evidence on the face of it.
> 
> I think there is fundamental difference between the PGPROC patch that
> we did and the rearranging SnapshotData members that Andres has
> proposed. The PGPROC structure is a shared memory area, very heavily
> accessed by GetSnapshotData (and some others). There was a linear
> increase in the access as number of clients go up. The SnapshotData
> structure is mostly from a backend local memory (unless we do
> something what I suggested in the OP) and hence fitting that in a
> single cache line may or may not have much impact. We don't even
> guarantee that it starts at a cacheline which means in most cases it
> will still be spread on multiple cache lines.
Well, I could measure a ~1.3% benefit on a modestly concurrent machine (see 
the earlier reformatted mail from Gurjeet) and the benefits were definitely 
smaller without concurrency. But I aggree - the benefits wont be even remotely 
as big as the PGPROC splitup patch.
Youre also right about the alignment not being predictable enough.

Perhaps pg should start using/introducing a force_cacheline_align macro which 
can easily implemented on most compilers? Especially for SnapshotData and 
consorts which are aligned statically that should be good enough.

Something along thelines of

#ifdef _GNUC__
#define force_align_to(alignment) __attribute__((align(alignment)))
#elif __MSVC__
#define force_align_to(alignment) __declspec(align(alignment))
#else
#define force_align_to(alignment)
#endif

#define CACHELIGN_ALIGNMENT 64
#define force_cacheline_align force_align_to(CACHELING_ALIGNMENT)

Looks complete enough for now.


Back to PGPROC:

Looking at the current PGPROC layout on x86-64 with linux abi (generated by 
pahole):

struct PGPROC {       SHM_QUEUE                  links;                /*     0    16 */       PGSemaphoreData
 sem;                  /*    16     8 */       int                        waitStatus;           /*    24     4 */
Latch                     procLatch;            /*    28    12 */       LocalTransactionId         lxid;
/*    40     4 */       int                        pid;                  /*    44     4 */       int
   pgprocno;             /*    48     4 */       BackendId                  backendId;            /*    52     4 */
 Oid                        databaseId;           /*    56     4 */       Oid                        roleId;
  /*    60     4 */       /* --- cacheline 1 boundary (64 bytes) --- */       bool
recoveryConflictPending;/*    64     1 */       bool                       lwWaiting;            /*    65     1 */
bool                       lwExclusive;          /*    66     1 */
 
       /* XXX 5 bytes hole, try to pack */
       struct PGPROC *            lwWaitLink;           /*    72     8 */       LOCK *                     waitLock;
        /*    80     8 */       PROCLOCK *                 waitProcLock;         /*    88     8 */       LOCKMODE
           waitLockMode;         /*    96     4 */       LOCKMASK                   heldLocks;            /*   100
4*/       XLogRecPtr                 waitLSN;              /*   104     8 */       int
syncRepState;        /*   112     4 */
 
       /* XXX 4 bytes hole, try to pack */
       SHM_QUEUE                  syncRepLinks;         /*   120    16 */       /* --- cacheline 2 boundary (128 bytes)
was8 bytes ago --- */       SHM_QUEUE                  myProcLocks[16];      /*   136   256 */       /* --- cacheline 6
boundary(384 bytes) was 8 bytes ago --- */       struct XidCache            subxids;              /*   392   256 */
 /* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */       LWLockId                   backendLock;
/*  648     4 */
 
       /* XXX 4 bytes hole, try to pack */
       uint64                     fpLockBits;           /*   656     8 */       Oid                        fpRelId[16];
        /*   664    64 */       /* --- cacheline 11 boundary (704 bytes) was 24 bytes ago --- */       bool
         fpVXIDLock;           /*   728     1 */
 
       /* XXX 3 bytes hole, try to pack */
       LocalTransactionId         fpLocalTransactionId; /*   732     4 */
       /* size: 736, cachelines: 12, members: 28 */       /* sum members: 720, holes: 4, sum holes: 16 */       /* last
cacheline:32 bytes */
 
}

It seems possible that reordering it to avoid sharing unrelated stuff on the 
same cacheline might be beneficial. E.g. moving fastpath lock stuf away from 
the more common lwlock infrastructure might be a good idea. Currently those 
lines will bounce continually even though the fastpath stuff will mostly be 
accessed locally.
Subxids and syncRepLinks look like good candidates too.


Anyway, sorry for the topic drift.

Andres


Andres


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Bruce Momjian
Дата:
Сообщение: Re: Inlining comparators as a performance optimisation
Следующее
От: Brar Piening
Дата:
Сообщение: Re: Review of VS 2010 support patches