Обсуждение: tree ordering with varbit

Поиск
Список
Период
Сортировка

tree ordering with varbit

От
Scott Lamb
Дата:
Got a couple of questions. Short version:

- Are there conversion functions from integer and timestamp to bit varying?

- Would sorting a timestamp by its "bit varying" value be equivalent to
sorting by the timestamp itself?

The long version, which includes why I'm asking, is below:

I saw something in the OpenACS code about using the bit varying type to
order trees. I think it worked something like this. Given a structure
like this:

create table mb.message (
         message_id      serial primary key,
         messageroot_id  integer not null references mb.messageroot,
         parent_id       integer references mb.message (message_id),
         ...
);

all of the messages with the same messageroot make a forest. If I wanted
to sort them hierarchically based when they were posted, I'd want a sort
key that has their post time prefixed by that of all their ancestors, so
the greatest ancestor comes first. Or better yet, their IDs, since
that's unique and means children of two parents that happened to be
posted at the same time wouldn't be lumped together, and IDs should
increase as posting times increase.

So I need a type that can expand. An array or a varying-size type.
Arrays might work for the above, but if I want to sort by a couple of
different types, then I'm screwed. varbit already sorts in the right way
for integer, at least. So I need conversion functions and hopefully to
know that it sorts right for timestamps also.

Thanks,
Scott


Re: tree ordering with varbit

От
Joe Conway
Дата:
Scott Lamb wrote:
> create table mb.message (
>         message_id      serial primary key,
>         messageroot_id  integer not null references mb.messageroot,
>         parent_id       integer references mb.message (message_id),
>         ...
> );
>
> all of the messages with the same messageroot make a forest. If I wanted
> to sort them hierarchically based when they were posted, I'd want a sort
> key that has their post time prefixed by that of all their ancestors, so
> the greatest ancestor comes first. Or better yet, their IDs, since
> that's unique and means children of two parents that happened to be
> posted at the same time wouldn't be lumped together, and IDs should
> increase as posting times increase.

Do you mean something like this?

regression=# CREATE TABLE connectby_int(keyid int, parent_keyid int);
CREATE TABLE
regression=# \copy connectby_int from 'data/connectby_int.data'
\.
regression=# select * from connectby_int;
  keyid | parent_keyid
-------+--------------
      1 |
      2 |            1
      3 |            1
      4 |            2
      5 |            2
      6 |            4
      7 |            3
      8 |            6
      9 |            5
(9 rows)

regression=# SELECT * FROM connectby('connectby_int', 'keyid', 'parent_keyid',
'2', 0, '~') AS t(keyid int, parent_keyid int, level int, branch text);
  keyid | parent_keyid | level | branch
-------+--------------+-------+---------
      2 |              |     0 | 2
      4 |            2 |     1 | 2~4
      6 |            4 |     2 | 2~4~6
      8 |            6 |     3 | 2~4~6~8
      5 |            2 |     1 | 2~5
      9 |            5 |     2 | 2~5~9
(6 rows)

If so, the connectby() function is in contrib/tablefunc in the
soon-to-be-released version 7.3

There are some imperfections in the way this currently works from the
standpoint of using "branch" to sort, but in many cases it will do pretty much
what you want.

In the next version (i.e. for 7.4) I'll probably add a way to pad each segment
in the branch to a user specified length with a user specified character. The
example above would then look something like:

regression=# SELECT * FROM connectby('connectby_int', 'keyid', 'parent_keyid',
'2', 0, '~', 3, '0') AS t(keyid int, parent_keyid int, level int, branch text);
  keyid | parent_keyid | level | branch
-------+--------------+-------+---------
      2 |              |     0 | 002
      4 |            2 |     1 | 002~004
      6 |            4 |     2 | 002~004~006
      8 |            6 |     3 | 002~004~006~008
      5 |            2 |     1 | 002~005
      9 |            5 |     2 | 002~005~009

That way 010 would sort after 009, instead of before it (10 vs 9).

I didn't really directly answer your questions, but I hope this helps anyway.

Joe


Re: tree ordering with varbit

От
Scott Lamb
Дата:
Joe Conway wrote:
> regression=# SELECT * FROM connectby('connectby_int', 'keyid',
> 'parent_keyid', '2', 0, '~') AS t(keyid int, parent_keyid int, level
> int, branch text);

Yeah, I saw that. But I don't think I can use it for a few reasons:

- I could't see a way to do multi-item sorts, like by score descending
and then post time ascending, for example.

- I couldn't see a way to filter for "messageroot_id = ?" first, so it
would probably be unnecesarily slow as the number of messages in this
discussion / the number of messages total ratio becomes small.

- Also performance - I can't cache the results as a row in the table,
which I could see being important at some point. Kind of a pain to keep
in sync, like all denormalized things, but I could see this being an
expensive operation.

Thanks,
Scott


Re: tree ordering with varbit

От
Joe Conway
Дата:
Scott Lamb wrote:
> Joe Conway wrote:
>
>> regression=# SELECT * FROM connectby('connectby_int', 'keyid',
>> 'parent_keyid', '2', 0, '~') AS t(keyid int, parent_keyid int, level
>> int, branch text);
>
>
> Yeah, I saw that. But I don't think I can use it for a few reasons:
>
> - I could't see a way to do multi-item sorts, like by score descending
> and then post time ascending, for example.

If I understand what you're saying correctly, I think you can just join back
to the real table from the table function by the primary key. Then you can
sort on anything you want.

>
> - I couldn't see a way to filter for "messageroot_id = ?" first, so it
> would probably be unnecesarily slow as the number of messages in this
> discussion / the number of messages total ratio becomes small.

Do you mean "where messageroot_id = any_value"? Yeah, I couldn't think of a
good, flexible way to do this. The problem is that you can either run the
function for every value in the table (too slow and doesn't make sense
anyway), or you can somehow try to determine which records are root nodes. I
suppose we could allow a NULL be passed in for "start_with" to indicate a
wildcard, and then iterate internally over all records that have "parent_id"
IS NULL. Still won't work for every case, but perhaps better than not having
the option at all? Any other suggestions?

> - Also performance - I can't cache the results as a row in the table,
> which I could see being important at some point. Kind of a pain to keep
> in sync, like all denormalized things, but I could see this being an
> expensive operation.

Why not? Just SELECT the results of the connectby() function (optionally
joined with other tables) into a table (temp or otherwise).


I'm interested in making this function useful for as many cases as possible. I
appreciate the feedback.

Thanks,

Joe


Re: tree ordering with varbit

От
Scott Lamb
Дата:
Thomas T. Thai wrote:
> try ltree in contrib.

I looked at it. It is very specialized toward trees that are labelled
like google categories. That's a different problem than this. I'm
definitely not doing complex queries on the tree like they are.

Thanks,
Scott


Re: tree ordering with varbit

От
Scott Lamb
Дата:
Joe Conway wrote:
> Scott Lamb wrote:
>> - I could't see a way to do multi-item sorts, like by score descending
>> and then post time ascending, for example.
>
> If I understand what you're saying correctly, I think you can just join
> back to the real table from the table function by the primary key. Then
> you can sort on anything you want.

I'd like to do this but hierarchically also. I.e., sorting based on its
deepest ancestors score then post time, second deepest's score then post
time, ..., its score then post time. So I guess in your function's
language, I'd need to be able to put a couple of columns into the branch
text.

>> - I couldn't see a way to filter for "messageroot_id = ?" first, so it
>> would probably be unnecesarily slow as the number of messages in this
>> discussion / the number of messages total ratio becomes small.
>
> Do you mean "where messageroot_id = any_value"? Yeah, I couldn't think
> of a good, flexible way to do this. The problem is that you can either
> run the function for every value in the table (too slow and doesn't make
> sense anyway), or you can somehow try to determine which records are
> root nodes. I suppose we could allow a NULL be passed in for
> "start_with" to indicate a wildcard, and then iterate internally over
> all records that have "parent_id" IS NULL. Still won't work for every
> case, but perhaps better than not having the option at all? Any other
> suggestions?

Hmm, not at the moment anyway.

>> - Also performance - I can't cache the results as a row in the table,
>> which I could see being important at some point. Kind of a pain to
>> keep in sync, like all denormalized things, but I could see this being
>> an expensive operation.
>
>
> Why not? Just SELECT the results of the connectby() function (optionally
> joined with other tables) into a table (temp or otherwise).

Good point. I could do that.

Thanks,
Scott


Re: tree ordering with varbit

От
Joe Conway
Дата:
Scott Lamb wrote:
> Joe Conway wrote:
>> Scott Lamb wrote:
>>> - I could't see a way to do multi-item sorts, like by score
>>> descending and then post time ascending, for example.
>>
>> If I understand what you're saying correctly, I think you can just
>> join back to the real table from the table function by the primary
>> key. Then you can sort on anything you want.
>
> I'd like to do this but hierarchically also. I.e., sorting based on its
> deepest ancestors score then post time, second deepest's score then post
> time, ..., its score then post time. So I guess in your function's
> language, I'd need to be able to put a couple of columns into the branch
> text.

Ahh, now I understand. This just came up on another list (I think it was the
sql list) -- someone wanted to build a pathname, but is using integer ids to
form the hierarchy. That made me think about an alternative version of the
connectby() function that would allow you to specify a column name to be used
for building something similar to the "branch" field. The "branch" would still
be needed (at least internally) to prevent infinite recursion, so this would
represent additional overhead, and therefore be optional.

It's too late to get functionality changes included in 7.3, but I can make
interim versions available via url. I'll post a message and a link when I get
a chance to address these issues.

Thanks,

Joe