Re: Re: LIKE gripes
From | Tatsuo Ishii |
---|---|
Subject | Re: Re: LIKE gripes |
Date | |
Msg-id | 20000811171347P.t-ishii@sra.co.jp |
In reply to | Re: Re: LIKE gripes (Thomas Lockhart <lockhart@alumni.caltech.edu>) |
List | pgsql-hackers |
> To get the length I'm now just running through the output string looking
> for a zero value. This should be more efficient than reading the
> original string twice; it might be nice if the conversion routines
> (which now return nothing) returned the actual number of pg_wchars in
> the output.

Sounds reasonable. I'm going to enhance them as you suggested.

> The original like() code allocates a pg_wchar array dimensioned by the
> number of bytes in the input string (which happens to be the absolute
> upper limit for the size of the 32-bit-encoded string). Worst case, this
> results in a 4-1 expansion of memory, and always requires a
> palloc()/pfree() for each call to the comparison routines.

Right. There is another approach that would avoid using that extra memory,
but I am not sure it is worth implementing right now.

> I think I have a solution for the current code; could someone test its
> behavior with MB enabled? It is now committed to the source tree; I know
> it compiles, but afaik am not equipped to test it :(

It passed the MB test, but fails the string test. Yes, I know it fails
because ILIKE for MB is not implemented (yet). I'm looking forward to
implementing the missing part. Is that ok with you, Thomas?

> I am not planning on converting everything to UniCode for disk storage.

Glad to hear that.

> What I would *like* to do is the following:
>
> 1) support each encoding "natively", using Postgres' type system to
> distinguish between them. This would allow strings with the same
> encodings to be used without conversion, and would both minimize storage
> requirements *and* run-time conversion costs.
>
> 2) support conversions between encodings, again using Postgres' type
> system to suggest the appropriate conversion routines. This would allow
> strings with different but compatible encodings to be mixed, but
> requires internal conversions *only* if someone is mixing encodings
> inside their database.
>
> 3) one of the supported encodings might be Unicode, and if one chooses,
> that could be used for on-disk storage. Same with the other existing
> encodings.
>
> 4) this different approach to encoding support can coexist with the
> existing MB support, since (1) - (3) is done without mention of existing
> MB internal features. So you can choose which scheme to use, and can
> test the new scheme without breaking the existing one.
>
> imho this comes closer to one of the important goals of maximizing
> performance for internal operations (since there is less internal string
> copying/conversion required), even at the expense of extra conversion
> cost when doing input/output (a good trade since *usually* there are
> lots of internal operations to a few i/o operations).
>
> Comments?

Please note that the existing MB implementation does not need such an extra
conversion cost except in some MB-aware functions (text_length etc.), regex,
like, and the input/output stage. Also, MB stores native encodings "as is"
on disk.

Anyway, it looks like MB would eventually be merged into, or deprecated by,
your new implementation of multiple-encoding support.

BTW, Thomas, do you have a plan to support collation functions?
--
Tatsuo Ishii
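
[Editor's note: to make the allocation and length-counting point above concrete, here is a minimal standalone C sketch contrasting "convert, then scan for the terminating zero" with "conversion reports the count". The function names and the trivial one-byte-per-character conversion are stand-ins for illustration only, not the actual PostgreSQL multibyte routines.]

#include <stdio.h>
#include <stdlib.h>

typedef unsigned int pg_wchar;

/* Stand-in for a conversion routine that returns nothing: the caller
 * must later walk the output looking for the terminating zero. */
static void
mb2wchar_void(const unsigned char *from, pg_wchar *to, int len)
{
    int i;

    for (i = 0; i < len && from[i] != '\0'; i++)
        to[i] = from[i];        /* real code would decode multibyte sequences */
    to[i] = 0;
}

/* Suggested variant: same conversion, but report how many pg_wchars
 * were produced so no second pass over the output is needed. */
static int
mb2wchar_with_count(const unsigned char *from, pg_wchar *to, int len)
{
    int i;

    for (i = 0; i < len && from[i] != '\0'; i++)
        to[i] = from[i];
    to[i] = 0;
    return i;
}

int
main(void)
{
    const unsigned char *s = (const unsigned char *) "abcdef";
    int nbytes = 6;

    /* Worst-case allocation: one 32-bit pg_wchar per input byte
     * (the 4-to-1 memory expansion mentioned above), plus terminator. */
    pg_wchar *buf = malloc((nbytes + 1) * sizeof(pg_wchar));

    /* Current approach: convert, then scan the output for the zero. */
    mb2wchar_void(s, buf, nbytes);
    int wlen = 0;
    while (buf[wlen] != 0)
        wlen++;
    printf("length by scanning: %d\n", wlen);

    /* Proposed approach: the conversion itself reports the length. */
    printf("length from conversion: %d\n", mb2wchar_with_count(s, buf, nbytes));

    free(buf);
    return 0;
}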