Обсуждение: LEFT JOIN and missing values

Поиск
Список
Период
Сортировка

LEFT JOIN and missing values

От
mila
Дата:
Hi,

I have a problem, like this:

I have 2 tables, T1 (id, val1) -- a vector
and T2 (id1, id2, val2) -- this is actually a sqare matrix
in the form of a very long table.

I need to do some operations with them, such as take a
dot-product of 2 rows of the matrix (encoded as T2), weighted with T1..         .

The query is the following:

select c.id1, d.id1, sum(c.val2*d.val2*T1.val1)      from T1, T2 c, T2 d      where c.ID2 = T1.id AND d.id2 = T1.id AND
c.id1< d.id1
 
group by c.id1, d.id1;

This gives me all possible pairs of (weighted) dot-products, as I want them.

Now, some 90% of values in the T2 are the same, say, 0.5, so I could
thouw them away. But then the query above is not valid anymore, since
if there is mo match on the condition   c.ID2 = T1.id AND d.id2 =
T1.id, nothing is added (it might be that c.id2 is there and d.id2 is
not!).

I quiess I need a left join + substitute for missing value. Can anyone
help me with the query? I cannot figure it out, too bad ad left
joins...

thanks,
Mila



Re: LEFT JOIN and missing values

От
Tomasz Myrta
Дата:
mila wrote:
> Hi,
> 
> I have a problem, like this:
> 
> I have 2 tables, T1 (id, val1) -- a vector
> and T2 (id1, id2, val2) -- this is actually a sqare matrix
> in the form of a very long table.
> 
> I need to do some operations with them, such as take a
> dot-product of 2 rows of the matrix (encoded as T2), weighted with T1..         .
> 
> The query is the following:
> 
> select c.id1, d.id1, sum(c.val2*d.val2*T1.val1)
>        from T1, T2 c, T2 d
>        where c.ID2 = T1.id AND d.id2 = T1.id AND c.id1 < d.id1
> group by c.id1, d.id1;

> 
> This gives me all possible pairs of (weighted) dot-products, as I want them.
> 
> Now, some 90% of values in the T2 are the same, say, 0.5, so I could
> thouw them away. But then the query above is not valid anymore, since
> if there is mo match on the condition   c.ID2 = T1.id AND d.id2 =
> T1.id, nothing is added (it might be that c.id2 is there and d.id2 is
> not!).
> 
> I quiess I need a left join + substitute for missing value. Can anyone
> help me with the query? I cannot figure it out, too bad ad left
> joins...
> 
> thanks,
> Mila

First let's rewrite your query to explicit joins (I like them very much, 
because they show the problem more clearly)

select c.id1, d.id1, sum(c.val2*d.val2*T1.val1)
from T1 join T2 c on (c.ID2 = T1.id) join T2 d on (d.id2 = T1.id and c.id1 < d.id1)
group by c.id1, d.id1;

If I understand well you want to delete useless values:

delete from t2 where val2=0.5

your query should look then:

select c.id1, d.id1, sum( case when c.id2 is null then 0.5 else c.val2 end * case when d.id2 is null then 0.5 else
d.val2end *T1.val1)
 
from T1 left join T2 c on (c.ID2 = T1.id) left join T2 d on (d.id2 = T1.id and (c.id1 < d.id1 or c.id1 is null))
group by c.id1, d.id1;

I'm not sure about this query... There is a problem - is it possible to 
have missing values in your matrix? This solution replaces all missing 
values into 0.5. What should happen to "T2 alias d" if there is missing 
value in "T2 alias c"? What happens then to clause "c.id1<d.id1" if "c" 
is not found?

Regards,
Tomasz Myrta



Re: LEFT JOIN and missing values

От
mila
Дата:
Tomasz,

> select c.id1, d.id1, sum(
>   case when c.id2 is null then 0.5 else c.val2 end *
>   case when d.id2 is null then 0.5 else d.val2 end *T1.val1)
> from
>   T1
>   left join T2 c on (c.ID2 = T1.id)
>   left join T2 d on (d.id2 = T1.id and (c.id1 < d.id1 or c.id1 is null))
> group by c.id1, d.id1;


> I'm not sure about this query... There is a problem - is it possible to
> have missing values in your matrix? This solution replaces all missing 
> values into 0.5.
yes, that's what I want to do, remove the tuples with default value
and substitute it later if I need it.


>  What should happen to "T2 alias d" if there is missing
> value in "T2 alias c"?

if a value, c.val2, is missing, it should be replaced with 0.5, since the only
deleted tuples are those that had contained 0.5 in "val2" field. The
(other) value, d.val2, that exisits in the matrix, should be used "as is". If
they are both missing, then t1.val1 is multiplied twice by 0.5


> What happens then to clause "c.id1<d.id1" if "c" 
> is not found?

I removed this condition "c.id1 < d.id1", since it is indeed
unclear, what to compare when a row is "missing" (I can filter it
later).

Then, I tried the query that you proposed,
on a test table T1 - 10 rows,
T2 = 10x10 = 100 rows.

With the full T2 the query is equivalent to the one that I gave, but
when I remove the tuples with default values (call it table T3), then
the resulting table is only 90 rows instead of 100, besides, the
values are different from the "right" ones from "full" T2.

The missing combinations seem to be random.

How can that be if T1 contains all values that might be in T2.id2,
T2.id1?
and what do do next?

Mila



Re: LEFT JOIN and missing values

От
Tomasz Myrta
Дата:
I looked at your example once again and I noticed that wrong result 
comes from "group by" clause. If we delete useless rows with value 
"0.5", we also remove information about possible values of t2.id1.

Let's say your table t2 is full of 0.5 values. It means, we can empty 
it. If we do this, how Postgres would know which t2.id1 values to 
return? What size of matrix should it return?

I think if you really need to remove your 0.5 values, you need another 
table with all possible values of id1. This table will look like:

create table t3
(  id1 integer;
);

your query should look then:

select cc.id1, dd.id1, sum( case when c.id2 is null then 0.5 else c.val2 end * case when d.id2 is null then 0.5 else
d.val2end *T1.val1)
 
from T1 cross join T3 cc left join T2 c on (c.ID2 = T1.id and c.id1=cc.id1) cross join T3 dd left join T2 d on (d.id2 =
T1.idand d.id1=dd.id1)
 
group by cc.id1, dd.id1;

Anyway I'm not sure if it is what you really want :-(

Regards,
Tomasz Myrta



Re: LEFT JOIN and missing values

От
mila boldareva
Дата:
Tomasz,
> I think if you really need to remove your 0.5 values, you need another
> table with all possible values of id1. This table will look like:
> create table t3
> (
>    id1 integer;
> );

T1 contains values of id's,
The T2 matrix is a matrix of some sort of "distances" between id's
from T1, so original T2 is a T1 x T1.
(that's what you are proposing right?)

So there is no id1, id2 in T2 that don't exists in T1,
I checked my test tables and this is indeed the case.
In other words T1 is dense, I make a left join on it and I get wrong results!
And even wrong number of tuples :-(

cheers,
Mila