Re: Re: LIKE gripes

From: Tatsuo Ishii
Subject: Re: Re: LIKE gripes
Date:
Msg-id: 20000811171347P.t-ishii@sra.co.jp
In reply to: Re: Re: LIKE gripes  (Thomas Lockhart <lockhart@alumni.caltech.edu>)
List: pgsql-hackers
> To get the length I'm now just running through the output string looking
> for a zero value. This should be more efficient than reading the
> original string twice; it might be nice if the conversion routines
> (which now return nothing) returned the actual number of pg_wchars in
> the output.

Sounds reasonable. I'm going to enhance them as you suggested.
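
Something along these lines, perhaps (just a rough sketch to show the
idea -- the function name and the decoding step are placeholders, not
the real MB conversion code):

/*
 * Sketch: the converter returns how many pg_wchars it produced, so the
 * caller no longer has to rescan the output for the terminating zero.
 */
static int
mb2wchar_with_len_sketch(const unsigned char *from, pg_wchar *to, int len)
{
	int			cnt = 0;

	while (len > 0 && *from)
	{
		int			l = pg_mblen(from);	/* bytes in this character */

		/* ... decode one character of the current encoding into *to ... */
		*to++ = *from;			/* placeholder for the real decoding */
		from += l;
		len -= l;
		cnt++;
	}
	*to = 0;					/* keep the zero terminator */
	return cnt;					/* pg_wchars written, excluding the zero */
}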

> The original like() code allocates a pg_wchar array dimensioned by the
> number of bytes in the input string (which happens to be the absolute
> upper limit for the size of the 32-bit-encoded string). Worst case, this
> results in a 4-1 expansion of memory, and always requires a
> palloc()/pfree() for each call to the comparison routines.

Right.

There is another approach that would avoid using that extra memory
space, but I am not sure it is worth implementing right now.
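
Just to make sure we are talking about the same thing, the pattern you
describe looks roughly like this (a sketch only, not the actual like.c
source; wchar_like() stands in for the real pg_wchar matcher):

static bool
textlike_sketch(unsigned char *s_bytes, int slen,
				unsigned char *p_bytes, int plen)
{
	/*
	 * The buffers are dimensioned by the byte length of the inputs --
	 * the hard upper bound on the number of characters -- so worst case
	 * this is a 4-to-1 memory expansion, plus a palloc/pfree per call.
	 */
	pg_wchar   *s = (pg_wchar *) palloc((slen + 1) * sizeof(pg_wchar));
	pg_wchar   *p = (pg_wchar *) palloc((plen + 1) * sizeof(pg_wchar));
	bool		result;

	pg_mb2wchar_with_len(s_bytes, s, slen);		/* expand to 32-bit chars */
	pg_mb2wchar_with_len(p_bytes, p, plen);

	result = wchar_like(s, p);	/* hypothetical pg_wchar pattern matcher */

	pfree(s);
	pfree(p);
	return result;
}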

> I think I have a solution for the current code; could someone test its
> behavior with MB enabled? It is now committed to the source tree; I know
> it compiles, but afaik am not equipped to test it :(

It passed the MB test, but fails the string test. Yes, I know it fails
because ILIKE for MB is not implemented (yet). I'm looking forward to
implementing the missing part. Is that ok with you, Thomas?

> I am not planning on converting everything to UniCode for disk storage.

Glad to hear that.

> What I would *like* to do is the following:
> 
> 1) support each encoding "natively", using Postgres' type system to
> distinguish between them. This would allow strings with the same
> encodings to be used without conversion, and would both minimize storage
> requirements *and* run-time conversion costs.
> 
> 2) support conversions between encodings, again using Postgres' type
> system to suggest the appropriate conversion routines. This would allow
> strings with different but compatible encodings to be mixed, but
> requires internal conversions *only* if someone is mixing encodings
> inside their database.
> 
> 3) one of the supported encodings might be Unicode, and if one chooses,
> that could be used for on-disk storage. Same with the other existing
> encodings.
> 
> 4) this difference approach to encoding support can coexist with the
> existing MB support since (1) - (3) is done without mention of existing
> MB internal features. So you can choose which scheme to use, and can
> test the new scheme without breaking the existing one.
> 
> imho this comes closer to one of the important goals of maximizing
> performance for internal operations (since there is less internal string
> copying/conversion required), even at the expense of extra conversion
> cost when doing input/output (a good trade since *usually* there are
> lots of internal operations to a few i/o operations).
> 
> Comments?

Please note that the existing MB implementation does not incur such
extra conversion cost except in some MB-aware functions (text_length
etc.), regex, LIKE, and the input/output stage. Also, MB stores native
encodings 'as is' on disk.
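
(The extra cost in those MB-aware functions is essentially a
per-character walk over the native encoding. A sketch, not the actual
text_length code:)

#include "mb/pg_wchar.h"		/* pg_mblen() */

/* Count characters (not bytes) in a multibyte string of 'len' bytes. */
static int
mb_charlen_sketch(const unsigned char *s, int len)
{
	int			chars = 0;

	while (len > 0 && *s)
	{
		int			l = pg_mblen(s);	/* bytes in this character */

		s += l;
		len -= l;
		chars++;
	}
	return chars;
}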

Anyway, it looks like MB would eventually be merged into, or deprecated
by, your new implementation of multiple-encoding support.

BTW, Thomas, do you have a plan to support collation functions?
--
Tatsuo Ishii


