Discussion: type design guidance needed
I am working on designing some new datatypes and could use some guidance. Along with each data item, I must keep additional information about the scale of measurement. Further, the relevant scales of measurement fall into a few major families of related scales, so at least a different type will be required for each of these major families. Additionally, I wish to be able to convert data measured according to one scale into other scales (both within the same family and between different families), and these interconversions require relatively large sets of parameters.

It seems that there are several alternative approaches, and I am seeking some guidance from the wizards here who have some understanding of the backend internals, performance tradeoffs, and such issues.

Possible solutions:

1. Store the data and all the scale parameters within the type.

   Advantages: All information is contained within each type. Can be implemented with no backend changes. No access to ancillary tables is required, so processing might be fast.

   Disadvantages: Duplicate information on the scales is recorded in each field of the types, i.e., a waste of space. I/O is either cumbersome (if all parameters are required) or the type-handling code has built-in tables for supplying missing parameters, in which case the available types and families cannot be extended by users without recompiling the code.

2. Store only the data and a reference to a compiled-in data table holding the scale parameters.

   Advantages: No duplicate information stored in the fields. Access to scale data is compiled into the backend, so processing might be fast.

   Disadvantages: Tables of scale data are fixed at compile time, so users cannot add additional scales or families of scales. Requires backend changes to implement, but these changes are relatively minor since all the scale parameters are compiled into the code handling the type.

3. Store only the data and a reference to a new system table (or tables) holding the scale parameters.

   Advantages: No duplicate information stored in the fields. Access to scale data is _not_ compiled into the backend, so users could add scales or families of scales by modifying the system tables.

   Disadvantages: Requires access to system tables to perform conversions, so processing might be slow. Requires more complex backend changes to implement, including the ability to retrieve information from system tables.

Clearly, option 3 is optimal (more flexible, no data duplication) unless access to system tables by the backend presents too much overhead. (Other suggestions are welcome, especially if I have misjudged the relative merits of these ideas or missed one altogether.)

The advice I need is the following:

- How much overhead is introduced by requiring the backend to query system tables during tuple processing? Is this unacceptable from the outset, or is it reasonable to consider this option further? Note that these new tables will not be large (probably fewer than 100 tuples), if that matters.

- How does one access system tables from the backend code? I seem to recall that issuing straight queries via SPI is not necessarily the right way to go about this, but I'm not sure where to look for alternatives.

Thanks for your help.

Cheers,
Brook
Brook,

I have been contemplating such a data type for years. I believe I have assembled the most important parts, but I did not have time to complete the whole thing.

The idea is that the units of measurement can be treated as arithmetic expressions. One can assign each of the few existing base units a fixed position in a bit vector, parse the expression, then evaluate it to obtain three things: a scale factor, a numerator, and a quotient, the latter two being bit vectors. So, if you assign the base units as 'm' => 1, 'kg' => 2, 's' => 4, 'K' => 8, 'mol' => 16, 'A' => 32, 'cd' => 64, the unit umol/min/mg will be represented as (0.01667, 00010000, 00000110). Such a structure is compact enough to be stashed into an atomic type.

In fact, one needs more than just a plain bit vector to represent exponents:

    umol/min/ml => (0.01667, '00010000', '00000103')

(because ml is a m^3). Here I use a whole character per bit for clarity, but one does not need more than two or three bits -- you normally don't have kg^4 or m^7 in your units.

I considered other alternatives, but none seemed as good as an atomic type. I can bet you will see performance problems and an indexing nightmare with non-atomic solutions well before you hit the space constraints with the atomic type. You are even likely to see space problems with non-atomic storage: pointers can easily cost more than compacted units.

There are numerous benefits to the atomic type. The units can be re-assembled on output, the operators can be written to work on non-normalized units and discard the incompatible ones, and the chances that you screw up the unit integrity are none.

So, if that makes sense, I will be willing to funnel more energy into this project, and I would appreciate any co-operation. In the meanwhile, you might want to check out what I have done so far:

1. A perl parser for the units of measurement that computes units as algebraic expressions. I have done it in perl for the ease of prototyping, but it is flex- and bison-generated and can be ported to C and included in the data type. Get it from

   http://wit.mcs.anl.gov/~selkovjr/Unit.tgz

   This is a regular perl extension; do a "perl Makefile.PL; make; make install" type of thing, but first you need to build and install my version of bison,

   http://wit.mcs.anl.gov/~selkovjr/camel-1.24.tar.gz

   There is a demo script that you can run as follows:

   perl browse.pl units

2. The postgres extension, seg, to which I was planning to add the units of measurement. It has its own use already, and it exemplifies the use of the yacc parser in an extension. Please see the README in

   http://wit.mcs.anl.gov/~selkovjr/pg_extensions/

   as well as a brief description in

   http://wit.mcs.anl.gov/EMP/seg-type.html

   and a running demo in

   http://wit.mcs.anl.gov/EMP/indexing.html (search for seg)

Food for thought.

--Gene
Brook Milligan <brook@biology.nmsu.edu> writes:
> Along with each data item, I must keep additional information about
> the scale of measurement. Further, the relevant scales of measurement
> fall into a few major families of related scales, so at least a
> different type will be required for each of these major families.
> Additionally, I wish to be able to convert data measured according to
> one scale into other scales (both within the same family and between
> different families), and these interconversions require relatively
> large sets of parameters.

It'd be useful to know more about your measurement scales. Evgeni remarks that for his applications, units can be broken down into simple linear combinations of fundamental units --- but if you're doing something like converting between different device-dependent color spaces, I can well believe that that model wouldn't work...

> - How much of an overhead is introduced by requiring the backend to
>   query system tables during tuple processing? Is this unacceptable
>   from the outset or is it reasonable to consider this option further?

Assuming that the scale tables are not too large and not frequently changed, the ideal access mechanism seems to be the "system cache" mechanism (cf src/backend/utils/cache/syscache.c, src/backend/utils/cache/lsyscache.c). The cache support allows each backend to keep copies in memory of recently-used rows of a cached table. Updating a cached table requires rather expensive cross-backend signaling, but as long as that doesn't happen often compared to accesses, you win. The only real restriction is that you have to look up cached rows by a unique key that corresponds to an index, but that seems not to be a problem for your application.

Adding a new system cache is a tad more invasive than the usual sort of user-defined-type addition, but it's certainly not out of the question. Bruce Momjian has done it several times and documented the process, IIRC.

			regards, tom lane
> It'd be useful to know more about your measurement scales. Evgeni
> remarks that for his applications, units can be broken down into
> simple linear combinations of fundamental units --- but if you're
> doing something like converting between different device-dependent
> color spaces, I can well believe that that model wouldn't work...

Those ideas about linear combinations are great, but I think too simplistic for what I have in mind. I'll give it more thought, though, as I further define the structure of all the interconversions.

> > - How much of an overhead is introduced by requiring the backend to
> >   query system tables during tuple processing? Is this unacceptable
> >   from the outset or is it reasonable to consider this option further?
>
> Assuming that the scale tables are not too large and not frequently
> changed, the ideal access mechanism seems to be the "system cache"
> mechanism (cf src/backend/utils/cache/syscache.c,
> src/backend/utils/cache/lsyscache.c). The cache support allows each
> backend to keep copies in memory of recently-used rows of a cached
> table. Updating a cached table requires rather expensive cross-backend
> signaling, but as long as that doesn't happen often compared to
> accesses, you win. The only real restriction is that you have to look
> up cached rows by a unique key that corresponds to an index, but that
> seems not to be a problem for your application.

I have in mind cases in which the system tables will almost never be updated. That is, the table installed initially will serve the vast majority of purposes, but I'd still like the flexibility of updating it when needed. Caches may very well be perfectly appropriate here; thanks for the pointer.

> Adding a new system cache is a tad more invasive than the usual sort
> of user-defined-type addition, but it's certainly not out of the
> question. Bruce Momjian has done it several times and documented the
> process, IIRC.

Bruce, is that the case? Do you really have it documented? If so, where?

Cheers,
Brook
Hi Brook,

It seems to me that the answer depends on how much effort you want to put into it.

Brook Milligan wrote:
> I am working on designing some new datatypes and could use some
> guidance.
>
> It seems that there are several alternative approaches, and I am
> seeking some guidance from the wizards here who have some
> understanding of the backend internals, performance tradeoffs, and
> such issues.
>
> Possible solutions:
>
> 1. Store the data and all the scale parameters within the type.

Probably the easiest solution, but it might leave a furry taste in your mouth.

> 2. Store only the data and a reference to a compiled-in data table
>    holding the scale parameters.

If you can fix all the parameters at compile time this is a good solution. Don't forget that the code for the type is going to be dynamically linked into the backend, so "compile time" can happen on the fly. You can write an external script to update your function's source code with the new data, compile the updated source, and relink the new executable.

This is of course a hack. If you really want to do this on the fly you would need to be sure that simultaneously executing backends, which might be linked to different versions of the library, always do consistent things with the datatypes. Also, you might want to check out what exactly happens when a backend links new symbols over old ones.

I have actually done this in a situation where I wanted to load a bunch of values into a database, do some analysis, change parameters in backend functions, and repeat. I was only using a single backend at a time, and I closed the backend between relinks. It is easier than it sounds; the only part of the backend that you need to understand is the function manager and dynamic loader. The major advantage of this approach is that you are not hacking the backend, and your code might actually continue to work across release versions.

> 3. Store only the data and a reference to a new system table (or
>    tables) holding the scale parameters.

This solution is probably the neatest, and in the long term the most robust. However, it might also involve the most effort. Depending on how you are using the datatypes, you will probably want to cache the information that you are storing in "system tables" locally in each backend. Basically this means allocating an array to a statically allocated pointer in your code, and populating the array with the data from the system table the first time that you use it. You also need to write a trigger that will invalidate the caches in all backends in the event that the system table is updated. There is already a lot of caching code in the backend, and a system for cache invalidation; I expect that you would end up modifying or copying that.

Another point about writing internal backend code is that you end up writing to changing interfaces. You can expect your custom modifications to break with every release. The new function manager, which is a much needed and neatly executed improvement, broke all of my code. This would be a major consideration if you had to support the code across several releases. Of course, if the code is going to be generally useful, and the person who pays you is amenable, you can always submit the modifications as patches.

> Cheers,
> Brook
>> Bruce, is that the case? Do you really have it documented? If so,
>> where?

> src/backend/utils/cache/syscache.c

BTW, it occurs to me that the real reason adding a syscache is invasive is that the syscache routines accept parameters that are integer indexes into syscache.c's cacheinfo[] array. So there's no way to add a syscache without changing this file.

But suppose that the routines instead accepted pointers to cachedesc structs. Then an add-on module could define its own syscache without ever touching syscache.c. This wouldn't even take any widespread code change, just change what the macros AGGNAME &etc expand to...

			regards, tom lane