diff -c -r -N doc.orig/src/sgml/config.sgml doc/src/sgml/config.sgml *** doc.orig/src/sgml/config.sgml Wed Sep 13 13:57:21 2006 --- doc/src/sgml/config.sgml Wed Sep 13 13:57:44 2006 *************** *** 2172,2178 **** ! --- 2172,2191 ---- ! ! ! gin_fuzzy_search_limit (integer) ! ! gin_fuzzy_search_limit configuration parameter ! ! ! ! Soft upper limit of the size of the returned set by GIN index. For more ! information see . ! ! ! ! diff -c -r -N doc.orig/src/sgml/filelist.sgml doc/src/sgml/filelist.sgml *** doc.orig/src/sgml/filelist.sgml Wed Sep 13 13:57:19 2006 --- doc/src/sgml/filelist.sgml Wed Sep 13 13:57:44 2006 *************** *** 78,83 **** --- 78,84 ---- + diff -c -r -N doc.orig/src/sgml/geqo.sgml doc/src/sgml/geqo.sgml *** doc.orig/src/sgml/geqo.sgml Wed Sep 13 13:57:19 2006 --- doc/src/sgml/geqo.sgml Wed Sep 13 13:57:44 2006 *************** *** 49,56 **** methods (e.g., nested loop, hash join, merge join in PostgreSQL) to process individual joins and a diversity of indexes (e.g., ! B-tree, hash, GiST in PostgreSQL) as access ! paths for relations. --- 49,56 ---- methods (e.g., nested loop, hash join, merge join in PostgreSQL) to process individual joins and a diversity of indexes (e.g., ! B-tree, hash, GiST and GIN in PostgreSQL) as ! access paths for relations. diff -c -r -N doc.orig/src/sgml/gin.sgml doc/src/sgml/gin.sgml *** doc.orig/src/sgml/gin.sgml Thu Jan 1 03:00:00 1970 --- doc/src/sgml/gin.sgml Wed Sep 13 13:57:44 2006 *************** *** 0 **** --- 1,231 ---- + + + + GIN Indexes + + + index + GIN + + + + Introduction + + + GIN stands for Generalized Inverted Index. It is + an index structure storing a set of (key, posting list) pairs, where + 'posting list' is a set of rows in which the key occurs. The + row may contains a lot of keys. + + + + It is generalized in the sense that a GIN index + does not need to be aware of the operation that it accelerates. + Instead, it uses custom strategies defined for particular data types. + + + + One advantage of GIN is that it allows the development + of custom data types with the appropriate access methods, by + an expert in the domain of the data type, rather than a database expert. + This is much the same advantage as using GiST. + + + + The GIN + implementation in PostgreSQL is primarily + maintained by Teodor Sigaev and Oleg Bartunov, and there is more + information on their + website. + + + + + + Extensibility + + + The GIN interface has a high level of abstraction, + requiring the access method implementer to only implement the semantics of + the data type being accessed. The GIN layer itself + takes care of concurrency, logging and searching the tree structure. + + + + All it takes to get a GIN access method working + is to implement four user-defined methods, which define the behavior of + keys in the tree. In short, GIN combines extensibility + along with generality, code reuse, and a clean interface. + + + + + + Implementation + + + Internally, GIN consists of a B-tree index constructed + over keys, where each key is an element of the indexed value + (element of array, for example) and where each tuple in a leaf page is + either a pointer to a B-tree over heap pointers (PT, posting tree), or a + list of heap pointers (PL, posting list) if the tuple is small enough. + + + + There are four methods that an index operator class for + GIN must provide (prototypes are in pseudocode): + + + + + int compare( Datum a, Datum b ) + + + Compares keys (not indexed values!) and returns an integer less than + zero, zero, or greater than zero, indicating whether the first key is + less than, equal to, or greater than the second. + + + + + + Datum* extractValue(Datum inputValue, uint32 *nkeys) + + + Returns an array of keys of value to be indexed, nkeys should + contain the number of returned keys. + + + + + + Datum* extractQuery(Datum query, uint32 nkeys, + StrategyNumber n) + + + Returns an array of keys of the query to be executed. n contains + strategy number of operation (see ). + Depending on n, query may be different type. + + + + + + bool consistent( bool check[], StrategyNumber n, Datum query) + + + Returns TRUE if indexed value satisfies query qualifier with strategy n + (or may satisfy in case of RECHECK mark in operator class). + Each element of the check array is TRUE if indexed value has a + corresponding key in the query: if (check[i] == TRUE ) the i-th key of + the query is present in the indexed value. + + + + + + + + + + GIN tips and trics + + + + Create vs insert + + + In most cases, insertion into GIN index is slow enough + due to a lot keys should be inserted per one value. So, for bulk upload + data in table it will be useful to drop index and create it + after finishing upload. + + + + + + gin_fuzzy_search_limit + + + The primary goal of development GIN indices was + support for highly scalable, full-text search in + PostgreSQL and there are often situations when + a full-text search returns a very large set of results. Since reading + tuples from the disk and sorting them could take a lot of time, this is + unacceptable for production. (Note that the index search itself is very + fast.) + + + Such queries usually contain very frequent words, so the results are not + very helpful. To facilitate execution of such queries + GIN has a configurable soft upper limit of the size + of the returned set, determined by the + gin_fuzzy_search_limit GUC variable. It is set to 0 by + default (no limit). + + + If a non-zero search limit is set, then the returned set is a subset of + the whole result set, chosen at random. + + + "Soft" means that the actual number of returned results could slightly + differ from the specified limit, depending on the query and the quality + of the system's random number generator. + + + + + + + + + Limitations + + + GIN doesn't support full scan of index due to it's + extremely inefficiency: because of a lot of keys per value, + each heap pointer will returned several times. + + + + When extractQuery returns zero number of keys, GIN will + emit a error: for different opclass and strategy semantic meaning of void + query may be different (for example, any array contains void array, + but they aren't overlapped with void one), and GIN can't + suggest reasonable answer. + + + + GIN searches keys only by equality matching. This may + be improved in future. + + + + Examples + + + The PostgreSQL source distribution includes + GIN classes for one-dimensional arrays of all internal + types. The following + contrib modules also contain GIN + operator classes: + + + + + intarray + + Enhanced support for int4[] + + + + + tsearch2 + + Support for inverted text indexing. This is much faster for very + large, mostly-static sets of documents. + + + + + diff -c -r -N doc.orig/src/sgml/indices.sgml doc/src/sgml/indices.sgml *** doc.orig/src/sgml/indices.sgml Wed Sep 13 13:57:19 2006 --- doc/src/sgml/indices.sgml Wed Sep 13 13:57:44 2006 *************** *** 115,121 **** PostgreSQL provides several index types: ! B-tree, Hash, and GiST. Each index type uses a different algorithm that is best suited to different types of queries. By default, the CREATE INDEX command will create a B-tree index, which fits the most common situations. --- 115,121 ---- PostgreSQL provides several index types: ! B-tree, Hash, GIN and GiST. Each index type uses a different algorithm that is best suited to different types of queries. By default, the CREATE INDEX command will create a B-tree index, which fits the most common situations. *************** *** 236,241 **** --- 236,272 ---- Many other GiST operator classes are available in the contrib collection or as separate projects. For more information see . + + + + index + GIN + + + GIN + index + + GIN is a inverted index and it's usable for values which have more + than one key, arrays for example. Like to GiST, GIN may support + many different user-defined indexing strategies and the particular + operators with which a GIN index can be used vary depending on the + indexing strategy. + As an example, the standard distribution of + PostgreSQL includes GIN operator classes + for one-dimentional arrays, which support indexed + queries using these operators: + + + <@ + @> + = + && + + + (See for the meaning of + these operators.) + Another GIN operator classes are available in the contrib + tsearch2 and intarray modules. For more information see . diff -c -r -N doc.orig/src/sgml/mvcc.sgml doc/src/sgml/mvcc.sgml *** doc.orig/src/sgml/mvcc.sgml Wed Sep 13 13:57:20 2006 --- doc/src/sgml/mvcc.sgml Wed Sep 13 13:57:44 2006 *************** *** 987,992 **** --- 987,1007 ---- + + + + GIN indexes + + + + Short-term share/exclusive page-level locks are used for + read/write access. Locks are released immediately after each + index row is fetched or inserted. But note, that GIN index + usually requires produce several inserts per one row, so, + GIN makes more work per one value's insertion. + + + *************** *** 995,1001 **** applications; since they also have more features than hash indexes, they are the recommended index type for concurrent applications that need to index scalar data. When dealing with ! non-scalar data, B-trees are not useful, and GiST indexes should be used instead. --- 1010,1016 ---- applications; since they also have more features than hash indexes, they are the recommended index type for concurrent applications that need to index scalar data. When dealing with ! non-scalar data, B-trees are not useful, and GiST or GIN indexes should be used instead. diff -c -r -N doc.orig/src/sgml/ref/create_opclass.sgml doc/src/sgml/ref/create_opclass.sgml *** doc.orig/src/sgml/ref/create_opclass.sgml Wed Sep 13 13:57:17 2006 --- doc/src/sgml/ref/create_opclass.sgml Wed Sep 13 13:57:44 2006 *************** *** 192,198 **** The data type actually stored in the index. Normally this is the same as the column data type, but some index methods ! (only GiST at this writing) allow it to be different. The STORAGE clause must be omitted unless the index method allows a different type to be used. --- 192,198 ---- The data type actually stored in the index. Normally this is the same as the column data type, but some index methods ! (GIN and GiST for now) allow it to be different. The STORAGE clause must be omitted unless the index method allows a different type to be used. diff -c -r -N doc.orig/src/sgml/xindex.sgml doc/src/sgml/xindex.sgml *** doc.orig/src/sgml/xindex.sgml Wed Sep 13 13:57:21 2006 --- doc/src/sgml/xindex.sgml Wed Sep 13 14:01:06 2006 *************** *** 243,248 **** --- 243,286 ---- + GIN indexes are similar to GiST in flexibility: it hasn't a fixed set + of strategies. Instead, the consistency support routine + interprets the strategy numbers accordingly with operator class + definition. As an example, strategies of operator class over arrays + is shown in . + + + + GiST Two-Dimensional <quote>R-tree</> Strategies + + + + Operation + Strategy Number + + + + + overlap + 1 + + + contains + 2 + + + is contained by + 3 + + + equal + 4 + + + +
+ + Note that all strategy operators return Boolean values. In practice, all operators defined as index method strategies must return type boolean, since they must appear at the top *************** *** 349,380 **** ! consistent 1 ! union 2 ! compress 3 ! decompress 4 ! penalty 5 ! picksplit 6 ! equal 7 --- 387,465 ---- ! consistent - determine whether key satifies the ! query qualifier 1 ! union - compute union of of a set of given keys 2 ! compress - computes a compressed representation of a key or value ! to be indexed 3 ! decompress - computes a decompressed representation of a ! compressed key 4 ! penalty - compute penalty for inserting new key into subtree ! with given subtree's key 5 ! picksplit - determine which entries of a page are to be moved ! to the new page and compute the union keys for resulting pages 6 ! equal - compare two keys and returns true if they are equal ! 7 + + + + + + + GIN indexes require four support functions, + shown in . + + + + GIN Support Functions + + + + Function + Support Number + + + + + + compare - Compare two keys and return an integer less than zero, zero, or + greater than zero, indicating whether the first key is less than, equal to, + or greater than the second. + + 1 + + + extractValue - extract keys from value to be indexed + 2 + + + extractQuery - extract keys from query + 3 + + + consistent - determine whether value matches by the + query + 4 +