Discussion: Re: [PATCHES] char/varchar locale support
(moved to hackers list)

> I am working on extending locale support for char/varchar types.
>
> Q1. I touched ...src/include/utils/builtins.h to insert the following
> macros:
> -----
> #ifdef USE_LOCALE
> #define pgstrcmp(s1,s2,l)  strcoll(s1,s2)
> #else
> #define pgstrcmp(s1,s2,l)  strncmp(s1,s2,l)
> #endif
> -----
> Is it the right place? I think so; am I wrong?

Probably the right place. Probably the wrong code; see below...

> Q2. Bartunov told me I should read varlena.c. I read it and found
> that before every strcoll() both strings are copied into newly
> allocated memory (to make them null-terminated). Oleg said I need the
> same for varchar.
> Do I really need to allocate space for varchar? What about char? Is it
> 0-terminated already?

No, neither bpchar nor varchar is guaranteed to be null-terminated. Yes,
you will need to allocate (palloc()) local memory for this.

Your pgstrcmp() macros are not equivalent, since strncmp() stops the
comparison at the specified limit (l), whereas strcoll() requires a
null-terminated string.

If you look in varlena.c you will find several places with

    #ifdef USE_LOCALE
    ...
    #else
    ...
    #endif

Those blocks will need to be replicated in varchar.c for both the bpchar
and varchar support routines.

The first example I looked at in varlena.c seems to have a bug. In the
code snippet below (from text_lt), both input strings are copied using
the same length, even though the input lengths can differ. Looks wrong
to me:

    memcpy(a1p, VARDATA(arg1), len);
    *(a1p + len) = '\0';
    memcpy(a2p, VARDATA(arg2), len);
    *(a2p + len) = '\0';

Instead of "len" in each expression it should probably use

    len1 = VARSIZE(arg1) - VARHDRSZ
    len2 = VARSIZE(arg2) - VARHDRSZ

Another possibility for implementation is to write a string comparison
routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
0, or 1 for less than, equal, and greater than. All of the comparison
routines can call that one (which would have the #ifdef USE_LOCALE),
rather than having USE_LOCALE spread through each comparison routine.

                          - Tom
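[Editor's note: the suggestion above might be sketched as below. This is a
hypothetical varlena_cmp(), not actual Postgres source: the struct and the
VARSIZE/VARDATA/VARHDRSZ macros are simplified stand-ins for the real ones,
malloc() stands in for palloc() so the sketch compiles standalone, and each
input is terminated using its own length, avoiding the text_lt() bug quoted
earlier.]

```c
#include <stdlib.h>
#include <string.h>
#include <locale.h>

/* simplified stand-in for Postgres' varlena type */
typedef struct {
    int  vl_len;        /* total length, header included */
    char vl_dat[1];     /* data, NOT null-terminated */
} varlena;

#define VARHDRSZ   ((int) sizeof(int))
#define VARSIZE(v) ((v)->vl_len)
#define VARDATA(v) ((v)->vl_dat)

/* test helper: build a varlena from a C string */
varlena *
make_varlena(const char *s)
{
    int      len = (int) strlen(s);
    varlena *v = malloc(VARHDRSZ + len);

    v->vl_len = VARHDRSZ + len;
    memcpy(v->vl_dat, s, len);
    return v;
}

/* Returns -1, 0 or 1.  Under USE_LOCALE both inputs are copied so that
 * strcoll() sees null-terminated strings; each copy uses its OWN length. */
int
varlena_cmp(varlena *arg1, varlena *arg2)
{
    int len1 = VARSIZE(arg1) - VARHDRSZ;
    int len2 = VARSIZE(arg2) - VARHDRSZ;
    int result;
#ifdef USE_LOCALE
    char *a1p = malloc(len1 + 1);   /* palloc() in the real backend */
    char *a2p = malloc(len2 + 1);

    memcpy(a1p, VARDATA(arg1), len1);
    a1p[len1] = '\0';
    memcpy(a2p, VARDATA(arg2), len2);
    a2p[len2] = '\0';

    result = strcoll(a1p, a2p);
    free(a1p);
    free(a2p);
#else
    int minlen = (len1 < len2) ? len1 : len2;

    result = memcmp(VARDATA(arg1), VARDATA(arg2), minlen);
    if (result == 0)
        result = len1 - len2;       /* shorter string sorts first */
#endif
    return (result < 0) ? -1 : (result > 0) ? 1 : 0;
}
```

With this helper, each of text_lt, text_le, bpchar and varchar comparison
routines reduces to one call plus a test on the sign of the result.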
Hi!

On Fri, 15 May 1998, Thomas G. Lockhart wrote:
> Another possibility for implementation is to write a string comparison
> routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
> 0, or 1 for less than, equals, and greater than. All of the comparison
> routines can call that one (which would have the #if USE_LOCALE), rather
> than having USE_LOCALE spread through each comparison routine.

Yes, I thought about this recently. It seems the best solution, perhaps.
Thank you. I'll continue my work.

Oleg.
----
Oleg Broytmann  http://members.tripod.com/~phd2/  phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
Oleg Broytmann wrote:
>
> Hi!
>
> On Fri, 15 May 1998, Thomas G. Lockhart wrote:
> > Another possibility for implementation is to write a string comparison
> > routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
> > 0, or 1 for less than, equals, and greater than. All of the comparison
> > routines can call that one (which would have the #if USE_LOCALE), rather
> > than having USE_LOCALE spread through each comparison routine.
>
> Yes, I thought about this recently. It seems the best solution, perhaps.
> Thank you. I'll continue my work.
>
> Oleg.
> ----
> Oleg Broytmann  http://members.tripod.com/~phd2/  phd2@earthling.net
> Programmers don't die, they just GOSUB without RETURN.

Shouldn't this be done only for NATIONAL CHAR?

/* m */
Hi!

On Mon, 18 May 1998, Mattias Kregert wrote:
> > > Another possibility for implementation is to write a string comparison
> > > routine (e.g. varlena_cmp()) which takes two arguments and returns -1,
> > > 0, or 1 for less than, equals, and greater than. All of the comparison
> > > routines can call that one (which would have the #if USE_LOCALE), rather
> > > than having USE_LOCALE spread through each comparison routine.
>
> Shouldn't this be done only for NATIONAL CHAR?

It is what USE_LOCALE is intended for, isn't it?

Oleg.
----
Oleg Broytmann  http://members.tripod.com/~phd2/  phd2@earthling.net
Programmers don't die, they just GOSUB without RETURN.
> > Shouldn't this be done only for NATIONAL CHAR?
> It is what USE_LOCALE is intended for, isn't it?

SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
local character sets. The usual CHAR/VARCHAR would use the default
SQL_TEXT character set. I suppose we could extend it to include NATIONAL
TEXT also...

Additionally, SQL92 allows one to specify an explicit character set and
an explicit collating sequence. The standard is not explicit on how one
actually makes these known to the database, but Postgres should be well
suited to accomplishing this.

Anyway, I'm not certain how common and widespread NATIONAL CHAR usage
is. Would users with installations holding non-English data find using
NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would most
non-English installations find this better and more solid? At the moment
we have support for Russian and Japanese character sets, and these would
need the maintainers to agree to changes.

btw, if we do implement NATIONAL CHARACTER I would like to do so by
having it fit in with the full SQL92 character sets and collating
sequences capabilities. Then one could specify what NATIONAL CHAR means
for an installation, or perhaps at run time, without having to
recompile...

                          - Tom
Thomas G. Lockhart wrote:
> btw, if we do implement NATIONAL CHARACTER I would like to do so by
> having it fit in with the full SQL92 character sets and collating
> sequences capabilities. Then one could specify what NATIONAL CHAR means
> for an installation or perhaps at run time without having to
> recompile...

I fully agree. There should be a CREATE COLLATION syntax or similar,
with the ability to add a collation keyword in every place that needs a
character comparison: btree indexes, ORDER BY clauses, or simple
comparison operators. This means we should probably start by creating
three-parameter comparison functions, with the third parameter selecting
the collation.

Additionally, it's worth noting that strcoll() is highly expensive. I've
had reports from people who used PostgreSQL with national characters and
saw performance drop by up to a factor of 20 (on Linux). So we need
cheap comparison functions that preserve their translation tables across
sessions.

Anyhow, if anybody wants to try the inefficient strcoll(), a long time
ago I sent a patch to sort chars/varchars using it. But I don't
recommend it.

Mike

--
WWW: http://www.lodz.pdi.net/~mimo  tel: Int. Acc. Code + 48 42 148340
add: Michal Mosiewicz * Bugaj 66 m.54 * 95-200 Pabianice * POLAND
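[Editor's note: one standard way to get the "cheap comparison with cached
translation" behaviour Mike asks for is the strxfrm() trick: transform each
string once into a sort key on which plain strcmp() gives the same ordering
as strcoll(), then compare the keys as many times as needed (as a btree
would). A minimal sketch, assuming C-library locale support only; the
function name is illustrative, not actual Postgres code.]

```c
#include <stdlib.h>
#include <string.h>

/* Build a locale-aware sort key for s.  Comparing two such keys with
 * strcmp() orders them the same way strcoll() would order the originals,
 * but the expensive collation-table work happens only once per string. */
char *
make_sort_key(const char *s)
{
    size_t n = strxfrm(NULL, s, 0);   /* ask for the required key size */
    char  *key = malloc(n + 1);

    strxfrm(key, s, n + 1);           /* write the transformed key */
    return key;                       /* caller frees; compare with strcmp() */
}
```

A btree index could store (or cache) these keys so each page traversal
costs only a memcmp-class comparison instead of a full strcoll() call.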
>> > Shouldn't this be done only for NATIONAL CHAR?
>> It is what USE_LOCALE is intended for, isn't it?

LOCALE is not very useful for multi-byte speakers.

>SQL92 defines NATIONAL CHAR/VARCHAR as the data type to support implicit
>local character sets. The usual CHAR/VARCHAR would use the default
>SQL_TEXT character set. I suppose we could extend it to include NATIONAL
>TEXT also...
>
>Additionally, SQL92 allows one to specify an explicit character set and
>an explicit collating sequence. The standard is not explicit on how one
>actually makes these known to the database, but Postgres should be well
>suited to accomplishing this.
>
>Anyway, I'm not certain how common and wide-spread the NATIONAL CHAR
>usage is. Would users with installations having non-English data find
>using NCHAR/NATIONAL CHAR/NATIONAL CHARACTER an inconvenience? Or would
>most non-English installations find this better and more solid??

The capability to specify implicit character sets for CHAR (which is
what MB does) looks sufficient for multi-byte speakers, except for the
collation sequences.

One question about SQL92's NCHAR is how one can specify several
character sets at one time. As you may know, Japanese, Chinese and
Korean use multiple character sets. For example, EUC_JP, a widely used
Japanese encoding system on Unix, includes four character sets: ASCII,
JISX0201, JISX0208 and JISX0212.

>At the moment we have support for Russian and Japanese character sets,
>and these would need the maintainers to agree to changes.

Additionally we have support for Chinese and Korean. Moreover, if the
mule internal code or Unicode were preferred for the internal encoding
system, one could use almost any language in the world :-)

>btw, if we do implement NATIONAL CHARACTER I would like to do so by
>having it fit in with the full SQL92 character sets and collating
>sequences capabilities. Then one could specify what NATIONAL CHAR means
>for an installation or perhaps at run time without having to
>recompile...
Collating sequences look very useful. Also, it would be nice if we could
specify default character sets when creating a database, table or field.

--
Tatsuo Ishii  t-ishii@sra.co.jp