Discussion: Btree internal node data?
While looking into a btree internal page using pg_filedump against an
int4 index generated by pgbench, I noticed that only item 2 has length 8,
which indicates that the index tuple has only a tuple header and no
index data. In my understanding this indicates that the item is used
to represent a down link to a page. The question is, why is it item 2,
not item 1? I thought an index tuple indicating a down link is always
item 1. Is this a sign that something has gone wrong?

Block    3 ********************************************************
<Header> -----
 Block Offset: 0x00006000         Offsets: Lower    1164 (0x048c)
 Block: Size 8192  Version    4            Upper    3624 (0x0e28)
 LSN:  logid      2 recoff 0x1550a608      Special  8176 (0x1ff0)
 Items:  285                      Free Space: 2460
 Checksum: 0x0000  Prune XID: 0x00000000  Flags: 0x0000 ()
 Length (including item array): 1164

<Data> ------
 Item   1 -- Length:   16  Offset: 3624 (0x0e28)  Flags: NORMAL
 Item   2 -- Length:    8  Offset: 8168 (0x1fe8)  Flags: NORMAL
 Item   3 -- Length:   16  Offset: 8152 (0x1fd8)  Flags: NORMAL
 Item   4 -- Length:   16  Offset: 8136 (0x1fc8)  Flags: NORMAL
 Item   5 -- Length:   16  Offset: 8120 (0x1fb8)  Flags: NORMAL
[snip]
 Item 281 -- Length:   16  Offset: 3704 (0x0e78)  Flags: NORMAL
 Item 282 -- Length:   16  Offset: 3688 (0x0e68)  Flags: NORMAL
 Item 283 -- Length:   16  Offset: 3672 (0x0e58)  Flags: NORMAL
 Item 284 -- Length:   16  Offset: 3656 (0x0e48)  Flags: NORMAL
 Item 285 -- Length:   16  Offset: 3640 (0x0e38)  Flags: NORMAL

<Special Section> -----
 BTree Index Section:
  Flags: 0x0000 ()
  Blocks: Previous (0)  Next (289)  Level (1)  CycleId (0)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Wed, Aug 27, 2014 at 7:08 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> While looking into a btree internal page using pg_filedump against an
> int4 index generated by pgbench, I noticed that only item 2 has length 8,
> which indicates that the index tuple has only a tuple header and no
> index data. In my understanding this indicates that the item is used
> to represent a down link to a page. The question is, why is it item 2,
> not item 1? I thought an index tuple indicating a down link is always
> item 1. Is this a sign that something has gone wrong?

No. On a non-rightmost page, the "high key" item is physically first
(which is a bit odd, because it serves as a high-bound invariant on
the items that the page stores, but it's convenient to do it that way
for other reasons). On an internal page (that is also non-rightmost),
the second item (which is the first "real" item - i.e. the item which
P_FIRSTDATAKEY() returns) is just placeholder garbage. The reason for
that is noted above _bt_compare():

 * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
 * "minus infinity": this routine will always claim it is less than the
 * scankey.  The actual key value stored (if any, which there probably isn't)
 * does not matter.  This convention allows us to implement the Lehman and
 * Yao convention that the first down-link pointer is before the first key.
 * See backend/access/nbtree/README for details.

--
Peter Geoghegan
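
To make the explanation above concrete, here is a minimal Python sketch of the two facts involved: why a keyless index tuple is 8 bytes while an int4 one is 16, and why the first data item on an internal page always compares as "minus infinity". This is an illustrative model, not PostgreSQL code. The function names, the dictionary page layout and the sample key values are hypothetical, and the 8-byte alignment is an assumption that holds on typical 64-bit builds.

```python
# Illustrative model only; the real logic lives in PostgreSQL's C source.
INDEX_TUPLE_HEADER = 8   # bytes: 6-byte item pointer + 2-byte t_info
MAXALIGN = 8             # assumption: typical 64-bit build
P_NONE = 0               # "no sibling" block number
P_HIKEY = 1              # offset of the high key on a non-rightmost page
P_FIRSTKEY = 2           # offset of the first data item on such a page


def index_tuple_size(key_bytes):
    """Header plus key data, rounded up to the alignment boundary."""
    raw = INDEX_TUPLE_HEADER + key_bytes
    return (raw + MAXALIGN - 1) // MAXALIGN * MAXALIGN


def first_data_key(next_block):
    """Model of P_FIRSTDATAKEY(): a rightmost page has no high key."""
    return P_HIKEY if next_block == P_NONE else P_FIRSTKEY


def compare(scankey, offnum, keys, is_leaf, next_block):
    """Return >0, 0 or <0 as scankey is greater than, equal to or less
    than the item at offnum. On a non-leaf page the first data item is
    treated as minus infinity, so its stored key is never examined."""
    if not is_leaf and offnum == first_data_key(next_block):
        return 1
    key = keys[offnum]
    return (scankey > key) - (scankey < key)


# Hypothetical internal page shaped like block 3 in the dump:
# item 1 is the high key, item 2 carries no key at all.
page_keys = {1: 1000, 2: None, 3: 100, 4: 200}

print(index_tuple_size(4))    # int4 key, as in items 1 and 3..285
print(index_tuple_size(0))    # keyless tuple, as in item 2
print(first_data_key(289))    # Next (289), so the page is not rightmost
print(compare(-5, 2, page_keys, is_leaf=False, next_block=289))
```

Under these assumptions the sketch prints 16, 8, 2 and 1. The last value shows that even a very small scan key compares as greater than item 2, which is why that item needs no key bytes and shows up with length 8.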