Discussion: type design guidance needed


type design guidance needed

From:
Brook Milligan
Date:
I am working on designing some new datatypes and could use some
guidance.

Along with each data item, I must keep additional information about
the scale of measurement.  Further, the relevant scales of measurement
fall into a few major families of related scales, so at least a
different type will be required for each of these major families.
Additionally, I wish to be able to convert data measured according to
one scale into other scales (both within the same family and between
different families), and these interconversions require relatively
large sets of parameters.

It seems that there are several alternative approaches, and I am
seeking some guidance from the wizards here who have some
understanding of the backend internals, performance tradeoffs, and
such issues.

Possible solutions:

1.  Store the data and all the scale parameters within the type.
   Advantages:  All information is contained within each value.  Can be
   implemented with no backend changes.  No access to ancillary tables
   is required, so processing might be fast.

   Disadvantages:  Duplicate information about the scales is recorded
   in each field of the type, i.e., wasted space.  I/O is either
   cumbersome (if all parameters are required) or the type-handling
   code has built-in tables for supplying missing parameters, in which
   case the available types and families cannot be extended by users
   without recompiling the code.
 

2.  Store only the data and a reference to a compiled-in data table   holding the scale parameters.
   Advantages:  No duplicate information stored in the fields.  Access
   to scale data is compiled into the backend, so processing might be
   fast.

   Disadvantages:  Tables of scale data are fixed at compile time, so
   users cannot add additional scales or families of scales.  Requires
   backend changes to implement, but these changes are relatively minor
   since all the scale parameters are compiled into the code handling
   the type.
 

3.  Store only the data and a reference to a new system table (or   tables) holding the scale parameters.
   Advantages:  No duplicate information stored in the fields.  Access
   to scale data is _not_ compiled into the backend, so users could add
   scales or families of scales by modifying the system tables.

   Disadvantages:  Requires access to system tables to perform
   conversions, so processing might be slow.  Requires more complex
   backend changes to implement, including the ability to retrieve
   information from system tables.
 

Clearly, option 3 is optimal (more flexible, no data duplication)
unless the access to system tables by the backend presents too much
overhead.  (Other suggestions are welcome, especially if I have
misjudged the relative merits of these ideas or missed one
altogether.)  The advice I need is the following:

- How much overhead is introduced by requiring the backend to query
  system tables during tuple processing?  Is this unacceptable from the
  outset, or is it reasonable to consider this option further?  Note
  that these new tables will not be large (probably fewer than 100
  tuples), if that matters.
 

- How does one access system tables from the backend code?  I seem to
  recall that issuing straight queries via SPI is not necessarily the
  right way to go about this, but I'm not sure where to look for
  alternatives.
 

Thanks for your help.

Cheers,
Brook



Re: type design guidance needed

From:
"Evgeni E. Selkov"
Date:
Brook,

I have been contemplating such a data type for years.  I believe I have
assembled the most important parts, but I did not have time to
complete the whole thing.

The idea is that the units of measurement can be treated as arithmetic
expressions.  One can assign each of the few existing base units a
fixed position in a bit vector, parse the expression, then evaluate it
to obtain three things: a scale factor, a numerator, and a denominator,
the latter two being bit vectors.

So, if you assign the base units as
 'm'    => 1, 'kg'   => 2, 's'    => 4, 'K'    => 8, 'mol'  => 16, 'A'    => 32, 'cd'   => 64,

the unit umol/min/mg will be represented as

(0.01667, 00010000, 00000110).

Such structure is compact enough to be stashed into an atomic type.
In fact, one needs more than just a plain bit vector to represent
exponents:

umol/min/ml => (0.01667, '00010000', '00000103') (because ml is a volume, m^3)

Here I use a whole character per digit for clarity, but one does not
need more than two or three bits -- you normally don't have kg^4 or
m^7 in your units.
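In C, the representation described above might look like the following sketch (all type and function names here are invented for illustration, not taken from any existing extension).  Multiplying two quantities just multiplies the scale factors and adds the exponents slot by slot, and a compatibility check compares the reduced exponents:

```c
/* One fixed slot per SI base unit, as in the bit-vector scheme above. */
enum { U_M, U_KG, U_S, U_K, U_MOL, U_A, U_CD, N_BASE };

typedef struct {
    double scale;               /* multiplier relative to SI base units */
    unsigned char num[N_BASE];  /* numerator exponents, one per base unit */
    unsigned char den[N_BASE];  /* denominator exponents */
} unit;

/* umol/min/mg: 1e-6 mol / (60 s * 1e-6 kg) -> scale 1/60 ~ 0.01667,
 * mol in the numerator, s and kg in the denominator */
static const unit umol_per_min_per_mg =
    { 1.0 / 60.0, {0, 0, 0, 0, 1, 0, 0}, {0, 1, 1, 0, 0, 0, 0} };

/* multiplying two quantities: scales multiply, exponents add slot-wise */
static unit unit_mul(unit a, unit b)
{
    unit r;
    r.scale = a.scale * b.scale;
    for (int i = 0; i < N_BASE; i++) {
        r.num[i] = (unsigned char) (a.num[i] + b.num[i]);
        r.den[i] = (unsigned char) (a.den[i] + b.den[i]);
    }
    return r;
}

/* two quantities are commensurable iff their reduced exponents match */
static int unit_compatible(unit a, unit b)
{
    for (int i = 0; i < N_BASE; i++)
        if ((int) a.num[i] - (int) a.den[i] !=
            (int) b.num[i] - (int) b.den[i])
            return 0;
    return 1;
}
```

For on-disk storage the two exponent arrays could be packed two or three bits per slot, as noted above, keeping the whole thing small enough for an atomic type.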

I considered other alternatives, but none seemed as good as an atomic
type. I can bet you will see performance problems and indexing
nightmare with non-atomic solutions well before you hit the space
constraints with the atomic type. You are even likely to see the space
problems with the non-atomic storage: pointers can easily cost more
than compacted units.

There are numerous benefits to the atomic type. The units can be
re-assembled on the output, the operators can be written to work on
non-normalized units and discard the incompatible ones, and the
chances that you screw up the unit integrity are none.

So, if that makes sense, I will be willing to funnel more energy into
this project, and I would appreciate any co-operation.

In the meanwhile, you might want to check out what I have done so far.

1.  A perl parser for the units of measurement that computes units as
    algebraic expressions.  I have done it in perl for the ease of
    prototyping, but it is flex- and bison-generated and can be ported
    to C and included into the data type.

    Get it from http://wit.mcs.anl.gov/~selkovjr/Unit.tgz

    This is a regular perl extension; do a

        perl Makefile.PL; make; make install

    type of thing, but first you need to build and install my version
    of bison, http://wit.mcs.anl.gov/~selkovjr/camel-1.24.tar.gz

    There is a demo script that you can run as follows:

        perl browse.pl units

2.  The postgres extension, seg, to which I was planning to add the
    units of measurement.  It has its own use already, and it
    exemplifies the use of the yacc parser in an extension.

    Please see the README in
    http://wit.mcs.anl.gov/~selkovjr/pg_extensions/
    as well as a brief description in
    http://wit.mcs.anl.gov/EMP/seg-type.html
    and a running demo in
    http://wit.mcs.anl.gov/EMP/indexing.html (search for seg)

Food for thought.

--Gene


Re: type design guidance needed

From:
Tom Lane
Date:
Brook Milligan <brook@biology.nmsu.edu> writes:
> Along with each data item, I must keep additional information about
> the scale of measurement.  Further, the relevant scales of measurement
> fall into a few major families of related scales, so at least a
> different type will be required for each of these major families.
> Additionally, I wish to be able to convert data measured according to
> one scale into other scales (both within the same family and between
> different families), and these interconversions require relatively
> large sets of parameters.

It'd be useful to know more about your measurement scales.  Evgeni
remarks that for his applications, units can be broken down into
simple linear combinations of fundamental units --- but if you're
doing something like converting between different device-dependent
color spaces, I can well believe that that model wouldn't work...

> - How much of an overhead is introduced by requiring the backend to
>   query system tables during tuple processing?  Is this unacceptable
>   from the outset or is it reasonable to consider this option further?

Assuming that the scale tables are not too large and not frequently
changed, the ideal access mechanism seems to be the "system cache"
mechanism (cf src/backend/utils/cache/syscache.c,
src/backend/utils/cache/lsyscache.c).  The cache support allows each
backend to keep copies in memory of recently-used rows of a cached
table.  Updating a cached table requires rather expensive cross-
backend signaling, but as long as that doesn't happen often compared
to accesses, you win.  The only real restriction is that you have to
look up cached rows by a unique key that corresponds to an index, but
that seems not to be a problem for your application.
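As a rough illustration of that access pattern (this is NOT the real syscache API; all names below are invented, and the table read is stubbed out): rows are looked up by a unique key, a per-backend copy is kept, and every copy is dropped when an invalidation message arrives.

```c
#include <stdio.h>

/* Toy sketch of the per-backend cache Tom describes; hypothetical names,
 * not the code in src/backend/utils/cache/syscache.c. */
#define CACHE_SLOTS 64

typedef struct {
    int  key;          /* unique lookup key, as the syscache rules require */
    char payload[32];  /* the cached row */
    int  valid;
} cache_entry;

static cache_entry cache[CACHE_SLOTS];

/* stand-in for the expensive path: a real backend would read the table */
static void fetch_row_from_table(int key, char *out)
{
    snprintf(out, 32, "row-%d", key);
}

static const char *cache_lookup(int key)
{
    cache_entry *e = &cache[(unsigned) key % CACHE_SLOTS];
    if (!e->valid || e->key != key) {   /* miss: go to the table */
        fetch_row_from_table(key, e->payload);
        e->key = key;
        e->valid = 1;
    }
    return e->payload;                  /* hit: no table access at all */
}

/* called when another backend signals that the cached table changed */
static void cache_invalidate_all(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        cache[i].valid = 0;
}
```

The win Tom describes is exactly this shape: lookups stay in local memory, and only the rare update pays the cross-backend signaling cost.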

Adding a new system cache is a tad more invasive than the usual sort of
user-defined-type addition, but it's certainly not out of the question.
Bruce Momjian has done it several times and documented the process,
IIRC.
        regards, tom lane


Re: type design guidance needed

From:
Brook Milligan
Date:
> It'd be useful to know more about your measurement scales.  Evgeni
> remarks that for his applications, units can be broken down into
> simple linear combinations of fundamental units --- but if you're
> doing something like converting between different device-dependent
> color spaces, I can well believe that that model wouldn't work...

Those ideas about linear combinations are great, but I think too
simplistic for what I have in mind.  I'll give it more thought,
though, as I further define the structure of all the interconversions.

>> - How much of an overhead is introduced by requiring the backend to
>>   query system tables during tuple processing?  Is this unacceptable
>>   from the outset or is it reasonable to consider this option further?
 
> Assuming that the scale tables are not too large and not frequently
> changed, the ideal access mechanism seems to be the "system cache"
> mechanism (cf src/backend/utils/cache/syscache.c,
> src/backend/utils/cache/lsyscache.c).  The cache support allows each
> backend to keep copies in memory of recently-used rows of a cached
> table.  Updating a cached table requires rather expensive cross-
> backend signaling, but as long as that doesn't happen often compared
> to accesses, you win.  The only real restriction is that you have to
> look up cached rows by a unique key that corresponds to an index, but
> that seems not to be a problem for your application.
 

I have in mind cases in which the system tables will almost never be
updated.  That is, the table installed initially will serve the vast
majority of purposes, but I'd still like the flexibility of updating
it when needed.  Caches may very well be perfectly appropriate, here;
thanks for the pointer.
> Adding a new system cache is a tad more invasive than the usual sort of
> user-defined-type addition, but it's certainly not out of the question.
> Bruce Momjian has done it several times and documented the process,
> IIRC.
 

Bruce, is that the case?  Do you really have it documented?  If so,
where?

Cheers,
Brook


Re: type design guidance needed

From:
Bernard Frankpitt
Date:
Hi Brook,

It seems to me that the answer depends on how much effort you want to
put into it.

Brook Milligan wrote:
> 
> I am working on designing some new datatypes and could use some
> guidance.

> It seems that there are several alternative approaches, and I am
> seeking some guidance from the wizards here who have some
> understanding of the backend internals, performance tradeoffs, and
> such issues.
> 
> Possible solutions:
> 
> 1.  Store the data and all the scale parameters within the type.
> 

Probably the easiest solution, but it might leave a furry taste in your
mouth.

> 
> 2.  Store only the data and a reference to a compiled-in data table
>     holding the scale parameters.
> 

If you can fix all the parameters at compile time this is a good
solution.  Don't forget that the code for the type is going to be
dynamically linked into the backend, so "compile time" can happen on the
fly.  You can write an external script to update your function's source
code with the new data, compile the updated source and relink the new
executable. This is of course a hack. If you really want to do this on
the fly you would need to be sure that simultaneously executing
backends, which might be linked to different versions of the library,
always do consistent things with the datatypes. Also, you might want to
check out what exactly happens when a backend links new symbols over old
ones.

I have actually done this in a situation where I wanted to load a bunch
of values in to a database, do some analysis, change parameters in
backend functions, and repeat. I was only using a single backend at a
time, and I closed the backend between relinks.  It is easier than it
sounds; the only parts of the backend that you need to understand are
the function manager and the dynamic loader.

The major advantage of this approach is that you are not hacking the
backend, and your code might actually continue to work across release
versions.

> 
> 3.  Store only the data and a reference to a new system table (or
>     tables) holding the scale parameters.
> 
> 

This solution is probably the neatest, and in the long term the most
robust. However it might also involve the most effort.  Depending on how
you are using the datatypes, you will probably want to cache the
information that you are storing in "system tables" locally in each
backend.  Basically this means allocating an array to a statically
allocated pointer in your code, and populating the array with the data
from the system table the first time that you use it.  You also need to
write a trigger that will invalidate the caches in all backends in the
event that the system table is updated.  There is already a lot of
caching code in the backend, and a system for cache invalidation.  I
expect that you would end up modifying or copying that.
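A minimal sketch of that scheme, with invented names and a stubbed-out table read standing in for the real catalog access: a statically allocated pointer is populated from the table on first use, and the invalidation trigger resets it so the next lookup re-reads the table.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-scale conversion parameters. */
typedef struct { int scale_id; double factor; } scale_param;

/* statically allocated pointer, populated on first use */
static scale_param *scale_cache = NULL;
static int scale_count = 0;

/* stand-in for the expensive path; a real version would scan the
 * scale table with the backend's access methods */
static int read_scale_table(scale_param **out)
{
    static const scale_param rows[] = { {1, 1.0}, {2, 0.01667}, {3, 1e-6} };
    *out = malloc(sizeof rows);
    memcpy(*out, rows, sizeof rows);
    return (int) (sizeof rows / sizeof rows[0]);
}

static double scale_factor(int scale_id)
{
    if (scale_cache == NULL)            /* first use: populate the cache */
        scale_count = read_scale_table(&scale_cache);
    for (int i = 0; i < scale_count; i++)
        if (scale_cache[i].scale_id == scale_id)
            return scale_cache[i].factor;
    return 0.0;                         /* unknown scale */
}

/* what the update trigger would call: drop the cache so the next
 * lookup re-reads the table */
static void scale_cache_invalidate(void)
{
    free(scale_cache);
    scale_cache = NULL;
}
```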

Another point about writing internal backend code is that you end up
writing to changing interfaces.  You can expect your custom
modifications to break with every release.  The new function manager,
which is a much needed and neatly executed improvement, broke all of my
code. This would be a major consideration if you had to support the code
across several releases.  Of course if the code is going to be generally
useful, and the person who pays you is amenable, you can always submit
the modifications as patches.

> 
> Cheers,
> Brook


Re: type design guidance needed

From:
Tom Lane
Date:
>> Bruce, is that the case?  Do you really have it documented?  If so,
>> where?

> src/backend/utils/cache/syscache.c

BTW, it occurs to me that the real reason adding a syscache is invasive
is that the syscache routines accept parameters that are integer indexes
into syscache.c's cacheinfo[] array.  So there's no way to add a
syscache without changing this file.  But suppose that the routines
instead accepted pointers to cachedesc structs.  Then an add-on module
could define its own syscache without ever touching syscache.c.  This
wouldn't even take any widespread code change, just change what the
macros AGGNAME &etc expand to...
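The shape of that change can be sketched in a few lines (struct and function names invented here): with a descriptor pointer, nothing in the core file has to enumerate every cache at compile time, and a dynamically loaded module can supply its own.

```c
/* Hypothetical descriptor-based interface: the lookup routine takes a
 * pointer to the caller's descriptor instead of an integer index into
 * a compiled-in cacheinfo[] array. */
typedef struct cachedesc {
    const char *relname;   /* catalog table being cached */
    int         nkeys;     /* number of key columns in the lookup index */
} cachedesc;

/* the core routine no longer needs a compiled-in list of caches */
static const char *cache_relation(const cachedesc *desc)
{
    return desc->relname;
}

/* an add-on module declares its cache entirely on its own side,
 * without touching syscache.c */
static const cachedesc scale_cache_desc = { "pg_measurement_scale", 1 };
```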
        regards, tom lane