Обсуждение: interval origami

Поиск
Список
Период
Сортировка

interval origami

От
Adam Jensen
Дата:
Hi,

I am working on a hobby project that might eventually make its way into
the public domain. The goal is to create a Media Annotation, Analysis
and Playback Synthesis System (I call the project MAAPSS). Conceptually,
it is somewhat similar to vcode[1,2] but more "hackable" with broader
general applications.

[1]: http://social.cs.uiuc.edu/projects/vcode.html
[2]: https://youtu.be/Sv_OZ174wpg

In the configuration that I am currently exploring, the database will
contain many entries with intervals that could overlap in every possible
way. The entries could be labeled in many ways other than "interesting"
and "fail". This little example is a simplified attempt to isolate what
I see as the core problem, which is probably mostly based in my current
lack of understanding of SQL.

CREATE TABLE Example (start REAL, stop REAL, tag TEXT);
INSERT INTO Example VALUES
(10, 30, 'interesting'),
(12, 32, 'interesting'),
(15, 20, 'fail'),
(16, 39, 'interesting'),
(17, 21, 'fail'),
(50, 80, 'interesting'),
(60, 65, 'fail'),
(85, 90, 'fail');

The 'start' and 'stop' values stored in the database are numbers of type
 REAL, representing seconds. There is no need for calculations with, or
transformations between, elaborate time representation formats.

Here is a cognitive approach to a solution:

0. Starting with a dataset like this:
10.0|30.0|interesting
12.0|32.0|interesting
15.0|20.0|fail
16.0|39.0|interesting
17.0|21.0|fail
50.0|80.0|interesting
60.0|65.0|fail
85.0|90.0|fail

1. Find all of the "interesting" segments.
10.0|30.0|interesting
12.0|32.0|interesting
16.0|39.0|interesting
50.0|80.0|interesting

2. Find any "interesting" segments with overlaps and merge them.
10.0|39.0|interesting
50.0|80.0|interesting

3. Find all "fail" segments with a period that overlaps the period of any
(step 2) "interesting" segment.
15.0|20.0|fail
17.0|21.0|fail
60.0|65.0|fail

4. Find any overlaps in those (step 3) "fail" segments and merge.
15.0|21.0|fail
60.0|65.0|fail

5. Derive a new list of "interesting" segments from (step 2) that omits
all "fail" segments from (step 4).
10.0|15.0|interesting
21.0|39.0|interesting
50.0|60.0|interesting
65.0|80.0|interesting

A relational database solution is needed. I am not yet very familiar
with relational database design and SQL beyond the very basics. So any
pointers to similar SQL problems with solutions and explanations would
be tremendously useful at this point. Any hints at all, really, will be
much appreciated.

Beyond a basic solution, I suspect something like an R-Tree index might
be useful. I gather this from the SQLite page on R-Trees:

https://sqlite.org/rtree.html

--

SELECT version();
                                                 version

---------------------------------------------------------------------------------------------------------
 PostgreSQL 11.1 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-28), 64-bit
(1 row)



Re: interval origami

От
Alvaro Herrera
Дата:
On 2018-Nov-30, Adam Jensen wrote:


> A relational database solution is needed. I am not yet very familiar
> with relational database design and SQL beyond the very basics. So any
> pointers to similar SQL problems with solutions and explanations would
> be tremendously useful at this point. Any hints at all, really, will be
> much appreciated.

This sounds like something you can easily do with range types in
Postgres.  Probably not terribly portable to other DBMSs though.
  https://www.postgresql.org/docs/11/rangetypes.html

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: interval origami

От
Adam Jensen
Дата:
On 11/30/18 12:57 PM, Alvaro Herrera wrote:
> This sounds like something you can easily do with range types in
> Postgres.  Probably not terribly portable to other DBMSs though.
>   https://www.postgresql.org/docs/11/rangetypes.html


Nice hint. Thanks!

PostgreSQL as the application platform is fine. Portability to other
DBMS's isn't a requirement.

The 'numrange' type with the 'overlaps' and 'intersection' operators
seem to cover the fundamental computations in a very natural way.

hanzer=# SELECT numrange(10, 30) && numrange(12, 32);
 ?column?
----------
 t
(1 row)

hanzer=# SELECT numrange(10, 30) * numrange(12, 32);
 ?column?
----------
 [12,30)
(1 row)

How might they be used in an SQL query to solve the problem? I'm an SQL
noob. Seeing a few examples would be very informative. My impression is
that it might require some serious kung-fu. :)


Re: interval origami

От
Adam Jensen
Дата:
On 11/30/18 3:19 PM, Adam Jensen wrote:
> The 'numrange' type with the 'overlaps' and 'intersection' operators
> seem to cover the fundamental computations in a very natural way.

Actually, those operators might not be entirely sufficient. Given two
ranges like this:

10.0|39.0|interesting
15.0|21.0|fail

Something like the negative or inverse of the intersection is needed:

10.0|15.0|interesting
21.0|39.0|interesting

Hmm...


Re: interval origami

От
Adam Jensen
Дата:
On 11/30/18 4:02 PM, Adam Jensen wrote:
> On 11/30/18 3:19 PM, Adam Jensen wrote:
>> The 'numrange' type with the 'overlaps' and 'intersection' operators
>> seem to cover the fundamental computations in a very natural way.
> 
> Actually, those operators might not be entirely sufficient. Given two
> ranges like this:
> 
> 10.0|39.0|interesting
> 15.0|21.0|fail
> 
> Something like the negative or inverse of the intersection is needed:
> 
> 10.0|15.0|interesting
> 21.0|39.0|interesting

I've mapped out nine time segment overlap scenarios:

1. good(10, 40) | bad(05, 15) -> good(15, 40)
2. good(10, 40) | bad(10, 15) -> good(15, 40)
3. good(10, 40) | bad(20, 30) -> good(10, 20), good(30, 40)
4. good(10, 40) | bad(20, 40) -> good(10, 20)
5. good(10, 40) | bad(20, 45) -> good(10, 20)
6. good(10, 40) | bad(05, 40) -> good()
7. good(10, 40) | bad(05, 45) -> good()
8. good(10, 40) | bad(10, 40) -> good()
9. good(10, 40) | bad(10, 45) -> good()

Letting gs/gf and bs/bf represent "good start-time"/"good finish-time"
and so on, pseudo-code to remove the bad segments looks like this:

find overlap: good(gs, gf) | bad(bs, bf)

CASE
    WHEN ((bs <= gs) AND (bf <  gf)) THEN  # 1 & 2
        -> (bf, gf)
    WHEN ((bs >  gs) AND (bf <  gf)) THEN  # 3
        -> (gs, bs), (bf, gf)
    WHEN ((bs >  gs) AND (bf >= gf)) THEN  # 4 &
        -> (gs, bs)
    WHEN ((bs <= gs) AND (bf >= gf)) THEN  # 6 & 7 & 8 & 9
        -> ()
END CASE;

And my first attempt at writing a PostgreSQL function looks like this:

CREATE FUNCTION find_overlap(gs REAL, gf REAL, bs REAL, bf REAL)
    RETURNS TABLE (start REAL, stop REAL) AS $$
    BEGIN
        CASE
            WHEN ((bs <= gs) AND (bf <  gf)) THEN
                RETURN NEXT (bf, gf);
                RETURN;
            WHEN ((bs >  gs) AND (bf <  gf)) THEN
                RETURN NEXT (gs, bs);
                RETURN NEXT (bf, gf);
                RETURN;
            WHEN ((bs >  gs) AND (bf >= gf)) THEN
                RETURN NEXT (gs, bs);
                RETURN;
            WHEN ((bs <= gs) AND (bf >= gf)) THEN
                RETURN;
        END CASE;
    END; $$
LANGUAGE plpgsql;

It results in:

ERROR:  RETURN NEXT cannot have a parameter in function with OUT parameters
LINE 6: RETURN NEXT (bf, gf);

Page 83 of the book "PostgreSQL Server Programming" mentions this
situation but doesn't actually describe or explain anything; nor does it
present a working example...

Any ideas?




Re: interval origami

От
Adam Jensen
Дата:
On 11/30/18 7:04 PM, Adam Jensen wrote:
> On 11/30/18 4:02 PM, Adam Jensen wrote:
>> On 11/30/18 3:19 PM, Adam Jensen wrote:
>>> The 'numrange' type with the 'overlaps' and 'intersection' operators
>>> seem to cover the fundamental computations in a very natural way.
>>
>> Actually, those operators might not be entirely sufficient. Given two
>> ranges like this:
>>
>> 10.0|39.0|interesting
>> 15.0|21.0|fail
>>
>> Something like the negative or inverse of the intersection is needed:
>>
>> 10.0|15.0|interesting
>> 21.0|39.0|interesting
> 
> I've mapped out nine time segment overlap scenarios:
> 
> 1. good(10, 40) | bad(05, 15) -> good(15, 40)
> 2. good(10, 40) | bad(10, 15) -> good(15, 40)
> 3. good(10, 40) | bad(20, 30) -> good(10, 20), good(30, 40)
> 4. good(10, 40) | bad(20, 40) -> good(10, 20)
> 5. good(10, 40) | bad(20, 45) -> good(10, 20)
> 6. good(10, 40) | bad(05, 40) -> good()
> 7. good(10, 40) | bad(05, 45) -> good()
> 8. good(10, 40) | bad(10, 40) -> good()
> 9. good(10, 40) | bad(10, 45) -> good()
> 
> Letting gs/gf and bs/bf represent "good start-time"/"good finish-time"
> and so on, pseudo-code to remove the bad segments looks like this:
> 
> find overlap: good(gs, gf) | bad(bs, bf)
> 
> CASE
>     WHEN ((bs <= gs) AND (bf <  gf)) THEN  # 1 & 2
>         -> (bf, gf)
>     WHEN ((bs >  gs) AND (bf <  gf)) THEN  # 3
>         -> (gs, bs), (bf, gf)
>     WHEN ((bs >  gs) AND (bf >= gf)) THEN  # 4 &
>         -> (gs, bs)
>     WHEN ((bs <= gs) AND (bf >= gf)) THEN  # 6 & 7 & 8 & 9
>         -> ()
> END CASE;
> 
> And my first attempt at writing a PostgreSQL function looks like this:
> 
> CREATE FUNCTION find_overlap(gs REAL, gf REAL, bs REAL, bf REAL)
>     RETURNS TABLE (start REAL, stop REAL) AS $$
>     BEGIN
>         CASE
>             WHEN ((bs <= gs) AND (bf <  gf)) THEN
>                 RETURN NEXT (bf, gf);
>                 RETURN;
>             WHEN ((bs >  gs) AND (bf <  gf)) THEN
>                 RETURN NEXT (gs, bs);
>                 RETURN NEXT (bf, gf);
>                 RETURN;
>             WHEN ((bs >  gs) AND (bf >= gf)) THEN
>                 RETURN NEXT (gs, bs);
>                 RETURN;
>             WHEN ((bs <= gs) AND (bf >= gf)) THEN
>                 RETURN;
>         END CASE;
>     END; $$
> LANGUAGE plpgsql;
> 
> It results in:
> 
> ERROR:  RETURN NEXT cannot have a parameter in function with OUT parameters
> LINE 6: RETURN NEXT (bf, gf);
> 
> Page 83 of the book "PostgreSQL Server Programming" mentions this
> situation but doesn't actually describe or explain anything; nor does it
> present a working example...
> 
> Any ideas?
> 

This seems to work. Trial and error got me there.

CREATE FUNCTION find_overlap(gs REAL, gf REAL, bs REAL, bf REAL)
    RETURNS TABLE (start REAL, stop REAL) AS $$
    BEGIN
        CASE
            WHEN ((bs <= gs) AND (bf <  gf)) THEN
                RETURN QUERY VALUES (bf, gf);
            WHEN ((bs >  gs) AND (bf <  gf)) THEN
                RETURN QUERY VALUES (gs, bs), (bf, gf);
            WHEN ((bs >  gs) AND (bf >= gf)) THEN
                RETURN QUERY VALUES (gs, bs);
            WHEN ((bs <= gs) AND (bf >= gf)) THEN
                RETURN;
        END CASE;
    END; $$
LANGUAGE plpgsql;

1. good(10, 40) | bad(05, 15) -> good(15, 40)
2. good(10, 40) | bad(10, 15) -> good(15, 40)
3. good(10, 40) | bad(20, 30) -> good(10, 20), good(30, 40)
4. good(10, 40) | bad(20, 40) -> good(10, 20)
5. good(10, 40) | bad(20, 45) -> good(10, 20)
6. good(10, 40) | bad(05, 40) -> good()
7. good(10, 40) | bad(05, 45) -> good()
8. good(10, 40) | bad(10, 40) -> good()
9. good(10, 40) | bad(10, 45) -> good()

SELECT find_overlap(10,40, 05,15);
SELECT find_overlap(10,40, 10,15);
SELECT find_overlap(10,40, 20,30);
SELECT find_overlap(10,40, 20,40);
SELECT find_overlap(10,40, 20,45);
SELECT find_overlap(10,40, 05,40);
SELECT find_overlap(10,40, 05,45);
SELECT find_overlap(10,40, 10,40);
SELECT find_overlap(10,40, 10,45);



Re: interval origami

От
Adam Jensen
Дата:
Given the original:

CREATE TABLE Example (start REAL, stop REAL, tag TEXT);

I suppose it might make more sense to do it like this:

CREATE FUNCTION remove_bad(gs REAL, gf REAL, bs REAL, bf REAL)
   RETURNS SETOF Example AS $$
   BEGIN
      CASE
         WHEN ((bs <= gs) AND (bf <  gf)) THEN
            RETURN QUERY VALUES (bf, gf, 'interesting');
         WHEN ((bs >  gs) AND (bf <  gf)) THEN
            RETURN QUERY VALUES (gs, bs, 'interesting'),
               (bf, gf, 'interesting');
         WHEN ((bs >  gs) AND (bf >= gf)) THEN
            RETURN QUERY VALUES (gs, bs, 'interesting');
         WHEN ((bs <= gs) AND (bf >= gf)) THEN
            RETURN;
      END CASE;
   END; $$
LANGUAGE plpgsql;

SELECT remove_bad(10,40, 05,15);
SELECT remove_bad(10,40, 10,15);
SELECT remove_bad(10,40, 20,30);
SELECT remove_bad(10,40, 20,40);
SELECT remove_bad(10,40, 20,45);
SELECT remove_bad(10,40, 05,40);
SELECT remove_bad(10,40, 05,45);
SELECT remove_bad(10,40, 10,40);
SELECT remove_bad(10,40, 10,45);



Re: interval origami

От
Joe Conway
Дата:
On 11/30/18 4:02 PM, Adam Jensen wrote:
> On 11/30/18 3:19 PM, Adam Jensen wrote:
>> The 'numrange' type with the 'overlaps' and 'intersection' operators
>> seem to cover the fundamental computations in a very natural way.
> 
> Actually, those operators might not be entirely sufficient. Given two
> ranges like this:
> 
> 10.0|39.0|interesting
> 15.0|21.0|fail
> 
> Something like the negative or inverse of the intersection is needed:
> 
> 10.0|15.0|interesting
> 21.0|39.0|interesting

Perhaps overkill, but if you represent your timeline as actual line
segments, perhaps PostGIS would be useful. E.g.:

https://postgis.net/docs/manual-2.5/ST_Difference.html

HTH,

Joe

-- 
Crunchy Data - http://crunchydata.com
PostgreSQL Support for Secure Enterprises
Consulting, Training, & Open Source Development


Re: interval origami

От
Adam Jensen
Дата:
On 12/1/18 8:24 AM, Joe Conway wrote:
> Perhaps overkill, but if you represent your timeline as actual line
> segments, perhaps PostGIS would be useful. E.g.:
> 
> https://postgis.net/docs/manual-2.5/ST_Difference.html

That's an interesting notion. Thanks, Joe!

I think I learned enough about plpgsql programming last night to write
the three basic functions that each operate on two time intervals:

1. determine if two intervals overlap
2. merge two overlapping intervals into one interval
3. given two overlapping intervals, produce the difference interval or
interval set

My thinking here is that since these functions seem to be
computationally simple, it might be more convenient for exploration,
development, distribution and maintenance if there is some uniformity in
their style and control over their implementation and behavior.

Currently, I am thinking about how these functions might be used to
solve the problem. Since I am not familiar with the capabilities of SQL,
my tendency is to think in terms of a function that iterates over the
data set multiple times and eventually converges to produce the solution
set. I am concerned that this might be a very goofy way to solve the
problem in a relational database.


Re: interval origami

От
Steve Midgley
Дата:


On Sat, Dec 1, 2018, 12:38 PM Adam Jensen <hanzer@riseup.net wrote:
On 12/1/18 8:24 AM, Joe Conway wrote:
> Perhaps overkill, but if you represent your timeline as actual line
> segments, perhaps PostGIS would be useful. E.g.:
>
> https://postgis.net/docs/manual-2.5/ST_Difference.html

That's an interesting notion. Thanks, Joe!

I think I learned enough about plpgsql programming last night to write
the three basic functions that each operate on two time intervals:

1. determine if two intervals overlap
2. merge two overlapping intervals into one interval
3. given two overlapping intervals, produce the difference interval or
interval set

My thinking here is that since these functions seem to be
computationally simple, it might be more convenient for exploration,
development, distribution and maintenance if there is some uniformity in
their style and control over their implementation and behavior.

Currently, I am thinking about how these functions might be used to
solve the problem. Since I am not familiar with the capabilities of SQL,
my tendency is to think in terms of a function that iterates over the
data set multiple times and eventually converges to produce the solution
set. I am concerned that this might be a very goofy way to solve the
problem in a relational database.

If you use postgis then you get all the fundamental primitives and (critically) indexing that is virtually guaranteed to function well. I used postgis awhile back to solve a data problem that was based in abstract dimensions and it worked incredibly well. I'm not an expert in the kind of extension programming you're attempting but I'd be worried that it's easy to make a mistake. 

Postgis is going to give you maybe more guardrails to keep your work within safe design boundaries.. 

Just two cents from the gallery, 
Steve