Обсуждение: Different results from identical matviews

Поиск

Список

Период

Сортировка

Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 02:39:48

Hi folks,

We have a materialized view from which a customer reported some confusing/invalid results, leading us to inspect the query and not finding anything wrong. Running the query defining the matview manually, or creating a new (identical) materialized view returns the correct result. Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the comparison, and all runs are in the same schema.

It's a pretty big query, but let's describe the two matviews to see that they are identical. The first is the original returning invalid results, the one with _2 name postfix is the re-created one.

mm_prod=> \d+ segments_with_contacts
Materialized view "aakpnews.segments_with_contacts"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------------+-----------+----------+---------+----------+--------------+-------------
lid | integer | | | | plain | |
sid | integer | | | | plain | |
email | public.citext | | | | extended | |
Indexes:
"segments_with_contacts_sid_lid_email_idx" UNIQUE, btree (sid, lid, email)
View definition:
WITH tagged_contacts AS (
SELECT cl.lid,
cl.email,
cl.skip_preexisting_campaigns AS skip_subscribed,
ct.skip_preexisting_campaigns AS skip_tags,
cl.ladded,
ct.tagname,
ct.created
FROM contacts_lists cl
LEFT JOIN contacts_tags ct USING (email)
WHERE cl.lstatus::bpchar = 'a'::bpchar
), tagged_segments AS (
SELECT s.lid,
s.cid,
s.sid,
sp.type,
sp.mid,
sp.matchdelay,
sp.tagname,
sp.event,
count(*) OVER (PARTITION BY s.sid) AS requirements,
campaigns.activated_at
FROM segments s
LEFT JOIN campaigns USING (cid)
JOIN segments_predicates sp USING (sid)
WHERE s.archived_at IS NULL AND (s.cid IS NULL OR campaigns.activated_at IS NOT NULL)
), segments_contacts AS (
SELECT s.lid,
s.sid,
s.requirements,
CASE
WHEN s.type = 'subscribed'::public.predicate THEN ( SELECT array_agg(DISTINCT tagged_contacts.email::public.citext) AS array_agg
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid AND tagged_contacts.ladded >= s.activated_at AND (s.matchdelay IS NULL OR (tagged_contacts.ladded + s.matchdelay) < now()) AND NOT COALESCE(tagged_contacts.skip_subscribed, false))
WHEN s.type = 'has_tag'::public.predicate THEN ( SELECT array_agg(DISTINCT tagged_contacts.email::public.citext) AS array_agg
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid AND tagged_contacts.tagname OPERATOR(public.=) s.tagname AND (s.matchdelay IS NULL OR (tagged_contacts.created + s.matchdelay) < now()) AND (s.cid IS NULL OR tagged_contacts.created >= s.activated_at AND NOT COALESCE(tagged_contacts.skip_tags, false)))
WHEN s.type = 'not_has_tag'::public.predicate THEN ( SELECT array_agg(aggregated_tags.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email,
array_agg(tagged_contacts.tagname) AS tags
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid AND (s.cid IS NULL OR tagged_contacts.ladded >= s.activated_at AND NOT COALESCE(tagged_contacts.skip_subscribed, false))
GROUP BY tagged_contacts.email) aggregated_tags
WHERE NOT aggregated_tags.tags @> ARRAY[s.tagname])
WHEN s.type = 'received'::public.predicate THEN ( SELECT array_agg(DISTINCT mails_contacts_sent.email::public.citext) AS array_agg
FROM mails_contacts_sent
WHERE mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now()))
WHEN s.type = 'not_received'::public.predicate THEN ( SELECT array_agg(x.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid
EXCEPT
SELECT DISTINCT mails_contacts_sent.email
FROM mails_contacts_sent
WHERE mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now())) x)
WHEN s.type = 'opened'::public.predicate THEN ( SELECT array_agg(DISTINCT mails_contacts_opens.email::public.citext) AS array_agg
FROM mails_contacts_opens
WHERE mails_contacts_opens.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_opens.opentime + s.matchdelay) < now()))
WHEN s.type = 'not_opened'::public.predicate THEN ( SELECT array_agg(x.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email
FROM tagged_contacts
JOIN mails_contacts_sent USING (email)
WHERE tagged_contacts.lid = s.lid AND mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now())
EXCEPT
SELECT DISTINCT mails_contacts_opens.email
FROM mails_contacts_opens
WHERE mails_contacts_opens.mid = s.mid) x)
WHEN s.type = 'clicked'::public.predicate THEN ( SELECT array_agg(DISTINCT mails_contacts_clicks.email::public.citext) AS array_agg
FROM mails_contacts_clicks
WHERE mails_contacts_clicks.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_clicks.clicktime + s.matchdelay) < now()))
WHEN s.type = 'not_clicked'::public.predicate THEN ( SELECT array_agg(x.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email
FROM tagged_contacts
JOIN mails_contacts_sent USING (email)
WHERE tagged_contacts.lid = s.lid AND mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now())
EXCEPT
SELECT DISTINCT mails_contacts_clicks.email
FROM mails_contacts_clicks
WHERE mails_contacts_clicks.mid = s.mid) x)
WHEN s.type = 'event_triggered_first'::public.predicate THEN ( SELECT array_agg(DISTINCT x.email::public.citext) AS array_agg
FROM ( SELECT e.email
FROM events e
JOIN tagged_contacts USING (email)
WHERE tagged_contacts.lid = s.lid AND e.event OPERATOR(public.=) s.event
GROUP BY e.email
HAVING (s.matchdelay IS NULL OR (min(e.triggered_at) + s.matchdelay) < now()) AND (s.cid IS NULL OR min(e.triggered_at) >= s.activated_at)) x)
WHEN s.type = 'event_triggered_last'::public.predicate THEN ( SELECT array_agg(DISTINCT x.email::public.citext) AS array_agg
FROM ( SELECT e.email
FROM events e
JOIN tagged_contacts USING (email)
WHERE tagged_contacts.lid = s.lid AND e.event OPERATOR(public.=) s.event
GROUP BY e.email
HAVING (s.matchdelay IS NULL OR (max(e.triggered_at) + s.matchdelay) < now()) AND (s.cid IS NULL OR max(e.triggered_at) >= s.activated_at)) x)
ELSE NULL::public.citext[]
END AS emails
FROM tagged_segments s
), unnested AS (
SELECT segments_contacts.lid,
segments_contacts.sid,
segments_contacts.requirements,
unnest(segments_contacts.emails) AS email
FROM segments_contacts
)
SELECT unnested.lid,
unnested.sid,
unnested.email
FROM unnested
GROUP BY unnested.lid, unnested.sid, unnested.email, unnested.requirements
HAVING count(unnested.email) = unnested.requirements;
Access method: heap

mm_prod=> \d+ segments_with_contacts_2
Materialized view "aakpnews.segments_with_contacts_2"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------------+-----------+----------+---------+----------+--------------+-------------
lid | integer | | | | plain | |
sid | integer | | | | plain | |
email | public.citext | | | | extended | |
View definition:
WITH tagged_contacts AS (
SELECT cl.lid,
cl.email,
cl.skip_preexisting_campaigns AS skip_subscribed,
ct.skip_preexisting_campaigns AS skip_tags,
cl.ladded,
ct.tagname,
ct.created
FROM contacts_lists cl
LEFT JOIN contacts_tags ct USING (email)
WHERE cl.lstatus::bpchar = 'a'::bpchar
), tagged_segments AS (
SELECT s.lid,
s.cid,
s.sid,
sp.type,
sp.mid,
sp.matchdelay,
sp.tagname,
sp.event,
count(*) OVER (PARTITION BY s.sid) AS requirements,
campaigns.activated_at
FROM segments s
LEFT JOIN campaigns USING (cid)
JOIN segments_predicates sp USING (sid)
WHERE s.archived_at IS NULL AND (s.cid IS NULL OR campaigns.activated_at IS NOT NULL)
), segments_contacts AS (
SELECT s.lid,
s.sid,
s.requirements,
CASE
WHEN s.type = 'subscribed'::public.predicate THEN ( SELECT array_agg(DISTINCT tagged_contacts.email::public.citext) AS array_agg
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid AND tagged_contacts.ladded >= s.activated_at AND (s.matchdelay IS NULL OR (tagged_contacts.ladded + s.matchdelay) < now()) AND NOT COALESCE(tagged_contacts.skip_subscribed, false))
WHEN s.type = 'has_tag'::public.predicate THEN ( SELECT array_agg(DISTINCT tagged_contacts.email::public.citext) AS array_agg
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid AND tagged_contacts.tagname OPERATOR(public.=) s.tagname AND (s.matchdelay IS NULL OR (tagged_contacts.created + s.matchdelay) < now()) AND (s.cid IS NULL OR tagged_contacts.created >= s.activated_at AND NOT COALESCE(tagged_contacts.skip_tags, false)))
WHEN s.type = 'not_has_tag'::public.predicate THEN ( SELECT array_agg(aggregated_tags.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email,
array_agg(tagged_contacts.tagname) AS tags
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid AND (s.cid IS NULL OR tagged_contacts.ladded >= s.activated_at AND NOT COALESCE(tagged_contacts.skip_subscribed, false))
GROUP BY tagged_contacts.email) aggregated_tags
WHERE NOT aggregated_tags.tags @> ARRAY[s.tagname])
WHEN s.type = 'received'::public.predicate THEN ( SELECT array_agg(DISTINCT mails_contacts_sent.email::public.citext) AS array_agg
FROM mails_contacts_sent
WHERE mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now()))
WHEN s.type = 'not_received'::public.predicate THEN ( SELECT array_agg(x.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email
FROM tagged_contacts
WHERE tagged_contacts.lid = s.lid
EXCEPT
SELECT DISTINCT mails_contacts_sent.email
FROM mails_contacts_sent
WHERE mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now())) x)
WHEN s.type = 'opened'::public.predicate THEN ( SELECT array_agg(DISTINCT mails_contacts_opens.email::public.citext) AS array_agg
FROM mails_contacts_opens
WHERE mails_contacts_opens.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_opens.opentime + s.matchdelay) < now()))
WHEN s.type = 'not_opened'::public.predicate THEN ( SELECT array_agg(x.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email
FROM tagged_contacts
JOIN mails_contacts_sent USING (email)
WHERE tagged_contacts.lid = s.lid AND mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now())
EXCEPT
SELECT DISTINCT mails_contacts_opens.email
FROM mails_contacts_opens
WHERE mails_contacts_opens.mid = s.mid) x)
WHEN s.type = 'clicked'::public.predicate THEN ( SELECT array_agg(DISTINCT mails_contacts_clicks.email::public.citext) AS array_agg
FROM mails_contacts_clicks
WHERE mails_contacts_clicks.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_clicks.clicktime + s.matchdelay) < now()))
WHEN s.type = 'not_clicked'::public.predicate THEN ( SELECT array_agg(x.email::public.citext) AS array_agg
FROM ( SELECT tagged_contacts.email
FROM tagged_contacts
JOIN mails_contacts_sent USING (email)
WHERE tagged_contacts.lid = s.lid AND mails_contacts_sent.mid = s.mid AND (s.matchdelay IS NULL OR (mails_contacts_sent.senttime + s.matchdelay) < now())
EXCEPT
SELECT DISTINCT mails_contacts_clicks.email
FROM mails_contacts_clicks
WHERE mails_contacts_clicks.mid = s.mid) x)
WHEN s.type = 'event_triggered_first'::public.predicate THEN ( SELECT array_agg(DISTINCT x.email::public.citext) AS array_agg
FROM ( SELECT e.email
FROM events e
JOIN tagged_contacts USING (email)
WHERE tagged_contacts.lid = s.lid AND e.event OPERATOR(public.=) s.event
GROUP BY e.email
HAVING (s.matchdelay IS NULL OR (min(e.triggered_at) + s.matchdelay) < now()) AND (s.cid IS NULL OR min(e.triggered_at) >= s.activated_at)) x)
WHEN s.type = 'event_triggered_last'::public.predicate THEN ( SELECT array_agg(DISTINCT x.email::public.citext) AS array_agg
FROM ( SELECT e.email
FROM events e
JOIN tagged_contacts USING (email)
WHERE tagged_contacts.lid = s.lid AND e.event OPERATOR(public.=) s.event
GROUP BY e.email
HAVING (s.matchdelay IS NULL OR (max(e.triggered_at) + s.matchdelay) < now()) AND (s.cid IS NULL OR max(e.triggered_at) >= s.activated_at)) x)
ELSE NULL::public.citext[]
END AS emails
FROM tagged_segments s
), unnested AS (
SELECT segments_contacts.lid,
segments_contacts.sid,
segments_contacts.requirements,
unnest(segments_contacts.emails) AS email
FROM segments_contacts
)
SELECT unnested.lid,
unnested.sid,
unnested.email
FROM unnested
GROUP BY unnested.lid, unnested.sid, unnested.email, unnested.requirements
HAVING count(unnested.email) = unnested.requirements;
Access method: heap

Here you can see the results are different:

mm_prod=> begin;
BEGIN
mm_prod=> refresh materialized view segments_with_contacts;
REFRESH MATERIALIZED VIEW
mm_prod=> select count(*) from segments_with_contacts where sid = 42259;
count
-------
91
(1 row)

mm_prod=> refresh materialized view segments_with_contacts_2;
REFRESH MATERIALIZED VIEW
mm_prod=> select count(*) from segments_with_contacts_2 where sid = 42259;
count
-------
98
(1 row)

How can we figure out what's wrong here? Since it's a materialized view using EXPLAIN doesn't give me much to go on:

mm_prod=> explain select count(*) from segments_with_contacts where sid = 42259;
QUERY PLAN
------------------------------------------------------------------------------
Aggregate (cost=83.52..83.53 rows=1 width=8)
-> Seq Scan on segments_with_contacts (cost=0.00..83.29 rows=91 width=0)
Filter: (sid = 42259)
(3 rows)

mm_prod=> explain select count(*) from segments_with_contacts_2 where sid = 42259;
QUERY PLAN
---------------------------------------------------------------------------------
Aggregate (cost=87.46..87.47 rows=1 width=8)
-> Seq Scan on segments_with_contacts_2 (cost=0.00..87.20 rows=105 width=0)
Filter: (sid = 42259)
(3 rows)

mm_prod=> select count(*) from segments_with_contacts_2 where sid = 42259;
count
-------
98
(1 row)

Any insights into this, and how to debug this further, is much appreciated.

Best,

-- a.

Re: Different results from identical matviews

От

Tom Lane

Дата:

02 июля 2020 г., 03:02:17

Anders Steinlein <anders@e5r.no> writes:
> We have a materialized view from which a customer reported some
> confusing/invalid results, leading us to inspect the query and not finding
> anything wrong. Running the query defining the matview manually, or
> creating a new (identical) materialized view returns the correct result.
> Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
> comparison, and all runs are in the same schema.

I suspect the query underlying the matviews is less deterministic than
you think it is.  I did not study that query in any detail, but just
from a quick eyeball: the array_agg() calls with no attempt to enforce a
particular aggregation order are concerning, and so is grouping by
a citext column (where you'll get some case-folding of a common value,
but who knows which).

            regards, tom lane

Re: Different results from identical matviews

От

Magnus Hagander

Дата:

02 июля 2020 г., 13:12:31

On Thu, Jul 2, 2020 at 2:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anders Steinlein <anders@e5r.no> writes:
> We have a materialized view from which a customer reported some
> confusing/invalid results, leading us to inspect the query and not finding
> anything wrong. Running the query defining the matview manually, or
> creating a new (identical) materialized view returns the correct result.
> Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
> comparison, and all runs are in the same schema.

I suspect the query underlying the matviews is less deterministic than
you think it is. I did not study that query in any detail, but just
from a quick eyeball: the array_agg() calls with no attempt to enforce a
particular aggregation order are concerning, and so is grouping by
a citext column (where you'll get some case-folding of a common value,
but who knows which).

Also not having looked at the query in detail -- but are there concurrent changes in the database? Because since you're creating your transaction in READ COMMITTED, other transactions finishing in between your two REFRESH commands can alter the data. To make sure that's not what's happening, you may want to try doing the same thing with a BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE instead, and see if the problem still occurs.

Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 14:26:22

On Thu, Jul 2, 2020 at 2:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anders Steinlein <anders@e5r.no> writes:
> We have a materialized view from which a customer reported some
> confusing/invalid results, leading us to inspect the query and not finding
> anything wrong. Running the query defining the matview manually, or
> creating a new (identical) materialized view returns the correct result.
> Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
> comparison, and all runs are in the same schema.

I suspect the query underlying the matviews is less deterministic than
you think it is. I did not study that query in any detail, but just
from a quick eyeball: the array_agg() calls with no attempt to enforce a
particular aggregation order are concerning, and so is grouping by
a citext column (where you'll get some case-folding of a common value,
but who knows which).

Thanks for the tip, but I'm having a hard time thinking that's the case, seeing as I'm unable to trigger the wrong result no matter how hard I try with a new definition/manual query. I've introduced random ordering to the first CTE-clause (where the initial citext values comes from, and casing thus could differ in some order) which doesn't change the result.

When the citext type is used throughout the query, shouldn't the grouping result be deterministic? The citext values are first "rolled up" with array_agg() and later unnested and finally grouped. Shouldn't the end result be the same, regardless of what particular case-folded version of the value it chooses to group on?

I've simplified the query for this particular customer case that, again, always returns the correct result no matter how often I try:

mm_prod=> SELECT sid, count(*) FROM (

WITH tagged_contacts AS (
SELECT lid, email, cl.skip_preexisting_campaigns AS skip_subscribed,
ct.skip_preexisting_campaigns AS skip_tags, ladded, tagname, created
FROM contacts_lists cl
LEFT JOIN contacts_tags ct USING (email)
WHERE lstatus = 'a'
ORDER BY random()
),
tagged_segments AS (
SELECT s.lid, cid, sid, sp.type, sp.mid, matchdelay, tagname, event,
count(*) OVER (PARTITION BY sid) AS requirements,
activated_at
FROM segments s
LEFT JOIN campaigns USING (cid)
INNER JOIN segments_predicates sp USING (sid)
WHERE
s.archived_at IS NULL
AND (cid IS NULL OR activated_at IS NOT NULL)
),
segments_contacts AS (
SELECT lid, sid, requirements,
CASE
WHEN type = 'has_tag' THEN (
SELECT array_agg(DISTINCT email::citext)
FROM tagged_contacts
WHERE
lid = s.lid
AND tagname = s.tagname
AND (matchdelay IS NULL OR created + matchdelay < now())
AND (
cid IS NULL
OR (
created >= activated_at
AND NOT COALESCE(skip_tags, false)
)
)
)
END AS emails
FROM tagged_segments s
),
unnested AS (
SELECT lid, sid, requirements, unnest(emails) AS email
FROM segments_contacts
)
SELECT lid, sid, email
FROM unnested
GROUP BY lid, sid, email, requirements
HAVING count(email) = requirements
) x
WHERE sid = 42259
GROUP BY sid;

sid | count
-------+-------
42259 | 98
(1 row)

This is stale data for this customer, so no data changes are occurring to change the results. I can REFRESH MATERIALIZED VIEW as many times I was on the original segments_with_contacts matview, and I never see different results. If it were not deterministic, shouldn't I expect to see different results one in at least 100 times tried?

Thanks again for any insight to try and figure this out. Again, I could just re-create the matview we use in production and it would likely work (since I'm unable to get wrong results with a newly created case), but I would rather try to find out the root cause here first.

Best,

-- a.

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 14:31:51

On Thu, Jul 2, 2020 at 12:12 PM Magnus Hagander <magnus@hagander.net> wrote:

On Thu, Jul 2, 2020 at 2:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anders Steinlein <anders@e5r.no> writes:
> We have a materialized view from which a customer reported some
> confusing/invalid results, leading us to inspect the query and not finding
> anything wrong. Running the query defining the matview manually, or
> creating a new (identical) materialized view returns the correct result.
> Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
> comparison, and all runs are in the same schema.

I suspect the query underlying the matviews is less deterministic than
you think it is. I did not study that query in any detail, but just
from a quick eyeball: the array_agg() calls with no attempt to enforce a
particular aggregation order are concerning, and so is grouping by
a citext column (where you'll get some case-folding of a common value,
but who knows which).

Also not having looked at the query in detail -- but are there concurrent changes in the database?

Yes, the database is in production so changes do occur, however the data involved for this particular customer and the tables involved in this query is not changing while we've looked into the case these days. It's SELECTED quite a bit though.

Because since you're creating your transaction in READ COMMITTED, other transactions finishing in between your two REFRESH commands can alter the data. To make sure that's not what's happening, you may want to try doing the same thing with a BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE instead, and see if the problem still occurs.

Thanks for the suggestion, but the issue remains. The results are the same (that is, invalid for the one matview and correct for new matviews or manual queries) whether I run the queries in the same transaction or separate transactions. After many many many attempts. I'm quite baffled, really...

Best,

-- a.

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 14:36:47

On Thu, Jul 2, 2020 at 1:26 PM Anders Steinlein <anders@e5r.no> wrote:

On Thu, Jul 2, 2020 at 2:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Anders Steinlein <anders@e5r.no> writes:
> We have a materialized view from which a customer reported some
> confusing/invalid results, leading us to inspect the query and not finding
> anything wrong. Running the query defining the matview manually, or
> creating a new (identical) materialized view returns the correct result.
> Obviously, we've done REFRESH MATERIALIZED VIEW just before doing the
> comparison, and all runs are in the same schema.

I suspect the query underlying the matviews is less deterministic than
you think it is. I did not study that query in any detail, but just
from a quick eyeball: the array_agg() calls with no attempt to enforce a
particular aggregation order are concerning, and so is grouping by
a citext column (where you'll get some case-folding of a common value,
but who knows which).

Thanks for the tip, but I'm having a hard time thinking that's the case, seeing as I'm unable to trigger the wrong result no matter how hard I try with a new definition/manual query. I've introduced random ordering to the first CTE-clause (where the initial citext values comes from, and casing thus could differ in some order) which doesn't change the result.

I just wanted to add that we're on Postgres 12.3. This matview has been with us since 9.4 days, and we have not experienced any such issues before (could be customers who haven't noticed or reported it to us, of course...).

version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

Best,

-- a.

Re: Different results from identical matviews

От

Michael Lewis

Дата:

02 июля 2020 г., 15:55:18

Is now() computed at the time the view is defined and not at refresh? If this were a function, I would be more suspicious of that but a plain view, surely not. I hope.

Re: Different results from identical matviews

От

Jeremy Smith

Дата:

02 июля 2020 г., 16:01:02

It looks like you are using now() fairly often in that query. That would, of course, give different results in different transactions, but it could also give different results if a) the things you are comparing now() to are timestamp without time zone and b) the session time zone of the user doing the refresh is different from the session time zone of the user running the query. I'd also be suspicious about any other timestamp comparisons that aren't comparing timestamp to timestamp or timestamptz to timestamptz.

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 16:06:58

On Thu, Jul 2, 2020 at 3:01 PM Jeremy Smith <jeremy@musicsmith.net> wrote:

It looks like you are using now() fairly often in that query. That would, of course, give different results in different transactions, but it could also give different results if a) the things you are comparing now() to are timestamp without time zone and b) the session time zone of the user doing the refresh is different from the session time zone of the user running the query. I'd also be suspicious about any other timestamp comparisons that aren't comparing timestamp to timestamp or timestamptz to timestamptz.

Thanks, but as mentioned in another part of the thread this data is stale (unmodified during these days we're looking into this). And the result is different between the erroneous matview and identical manual query regardless, whether I run them in the same transaction or not. And the results were the same yesterday, same days before that when the issue came up, and same today. So I can't see how times are the issue, unfortunately. :-/

Best,

-- a.

Re: Different results from identical matviews

От

Paul Förster

Дата:

02 июля 2020 г., 16:08:12

Hi Michael,

> On 02. Jul, 2020, at 14:55, Michael Lewis <mlewis@entrata.com> wrote:
>
> Is now() computed at the time the view is defined and not at refresh? If this were a function, I would be more
suspiciousof that but a plain view, surely not. I hope. 

now() returns the time at the beginning of the transaction:

see: https://www.postgresql.org/docs/current/functions-datetime.html

If you want the time whenever you call the function, use clock_timestamp() instead. There's also a good article at:

https://www.cybertec-postgresql.com/en/postgresql-now-vs-nowtimestamp-vs-clock_timestamp/

Hope this helps.

Cheers,
Paul

Re: Different results from identical matviews

От

Tom Lane

Дата:

02 июля 2020 г., 16:44:49

Anders Steinlein <anders@e5r.no> writes:
> On Thu, Jul 2, 2020 at 2:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suspect the query underlying the matviews is less deterministic than
>> you think it is.

> Thanks for the tip, but I'm having a hard time thinking that's the case,
> seeing as I'm unable to trigger the wrong result no matter how hard I try
> with a new definition/manual query.

Well, another line of thought is that there actually is some difference
between the stored query for the original matview and the ones you enter
afresh.  You said they were the same, but I surely didn't attempt to
verify that.  Comparing pg_get_viewdef() output for equality would be
a good first step.  Even that perhaps isn't conclusive, so you could
also try comparing the pg_rewrite.ev_action fields for the views'
ON SELECT rules.  (That might be a bit frustrating because of likely
inconsistencies in node "location" fields; but any other difference
is cause for suspicion.)

            regards, tom lane

Re: Different results from identical matviews

От

"David G. Johnston"

Дата:

02 июля 2020 г., 16:54:59

On Thursday, July 2, 2020, Anders Steinlein <anders@e5r.no> wrote:

Thanks for the tip, but I'm having a hard time thinking that's the case, seeing as I'm unable to trigger the wrong result no matter how hard I try with a new definition/manual query. I've introduced random ordering to the first CTE-clause (where the initial citext values comes from, and casing thus could differ in some order) which doesn't change the result.

I just wanted to add that we're on Postgres 12.3. This matview has been with us since 9.4 days, and we have not experienced any such issues before (could be customers who haven't noticed or reported it to us, of course...).
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

I concur that the determinism doesn’t seem like a problem - but not much else does either. As a shot in the dark does pg_depend show any differences between the dependencies for the two views?

How did this migrate from 9.4 to 12?

It would be helpful if “Explain analyze refresh materialized view” were a thing (is it?)

If you can backup and restore the existing database (basebackup is more likely, but pg_dump would be more useful) and still observe the problem then maybe I see hope for digging down into the cause. Otherwise I’d limit my decision to testing for the symptom with the solution being to rebuild any problem views.

David J.

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 17:57:18

On Thu, Jul 2, 2020 at 3:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anders Steinlein <anders@e5r.no> writes:
> On Thu, Jul 2, 2020 at 2:02 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I suspect the query underlying the matviews is less deterministic than
>> you think it is.

> Thanks for the tip, but I'm having a hard time thinking that's the case,
> seeing as I'm unable to trigger the wrong result no matter how hard I try
> with a new definition/manual query.

Well, another line of thought is that there actually is some difference
between the stored query for the original matview and the ones you enter
afresh. You said they were the same, but I surely didn't attempt to
verify that. Comparing pg_get_viewdef() output for equality would be
a good first step.

I used a manual `diff` earlier, but this sure was easier. But yes, the stored queries are identical:

mm_prod=> select pg_get_viewdef('aakpnews.segments_with_contacts') = pg_get_viewdef('aakpnews.segments_with_contacts_2');
?column?
----------
t
(1 row)

Even that perhaps isn't conclusive, so you could
also try comparing the pg_rewrite.ev_action fields for the views'
ON SELECT rules. (That might be a bit frustrating because of likely
inconsistencies in node "location" fields; but any other difference
is cause for suspicion.)

You're right, ev_action is indeed different:

mm_prod=> select x1.ev_type = x2.ev_type as ev_type_equal, x1.ev_enabled = x2.ev_enabled as enabled_equal, x1.is_instead = x2.is_instead as is_instead_equal, x1.ev_qual = x2.ev_qual as ev_qual_equal, x1.ev_action = x2.ev_action as ev_action_equal

from
(select pr.* from pg_namespace pn inner join pg_class pc on pc.relnamespace = pn.oid inner join pg_rewrite pr on pr.ev_class = pc.oid where pn.nspname = 'aakpnews' and pc.relname = 'segments_with_contacts') x1,
(select pr.* from pg_namespace pn inner join pg_class pc on pc.relnamespace = pn.oid inner join pg_rewrite pr on pr.ev_class = pc.oid where pn.nspname = 'aakpnews' and pc.relname = 'segments_with_contacts_2') x2;
ev_type_equal | enabled_equal | is_instead_equal | ev_qual_equal | ev_action_equal
---------------+---------------+------------------+---------------+-----------------
t | t | t | t | f
(1 row)

Is there somehow I can format them to make it easier to compare? My basic attempts didn't help me much. I put them up in all their glories in pastebins, since they are rather large. Please let me know if there is somehow I can make this easier to look into.

ev_action for segments_with_contacts - the origial matview: https://pastebin.com/MBJ45prC

ev_action for segments_with_contacts_2 - the similar newly created matview: https://pastebin.com/sL4WjzBj

Best,

-- a.

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

02 июля 2020 г., 18:06:24

On Thu, Jul 2, 2020 at 3:55 PM David G. Johnston <david.g.johnston@gmail.com> wrote:

On Thursday, July 2, 2020, Anders Steinlein <anders@e5r.no> wrote:

I just wanted to add that we're on Postgres 12.3. This matview has been with us since 9.4 days, and we have not experienced any such issues before (could be customers who haven't noticed or reported it to us, of course...).
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

I concur that the determinism doesn’t seem like a problem - but not much else does either. As a shot in the dark does pg_depend show any differences between the dependencies for the two views?

Could be worth checking, yes. Could you give me any guidance as to how to compare this? Never looked at pg_depend before -- which of the columns should have the oid for the matview I want to look up dependencies for?

How did this migrate from 9.4 to 12?

pg_dump and pg_restore. It's been a few months, so unfortunately I can't recall which pg_dump version was used. Another thing to possibly note is that the citext extension has subsequently been updated as well; I'm unsure if the matview has been recreated after that (if that could have any effect).

It would be helpful if “Explain analyze refresh materialized view” were a thing (is it?)

Yeah, I was looking for that too initially when investigating. Unfortunately, "Utility statements have no plan structure" is the response given.

If you can backup and restore the existing database (basebackup is more likely, but pg_dump would be more useful) and still observe the problem then maybe I see hope for digging down into the cause. Otherwise I’d limit my decision to testing for the symptom with the solution being to rebuild any problem views.

I //think// I have the dump laying around, but if simply rebuilding the view fixes the problem I'm inclined to just do that, although the issue is a bit concerning. If anyone here suspects an actual bug with a possible avenue for further investigation, in which case I would be happy to help.

Best,

-- a

Re: Different results from identical matviews

От

"David G. Johnston"

Дата:

02 июля 2020 г., 18:25:13

On Thu, Jul 2, 2020 at 8:06 AM Anders Steinlein <anders@e5r.no> wrote:

On Thu, Jul 2, 2020 at 3:55 PM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Thursday, July 2, 2020, Anders Steinlein <anders@e5r.no> wrote:

I just wanted to add that we're on Postgres 12.3. This matview has been with us since 9.4 days, and we have not experienced any such issues before (could be customers who haven't noticed or reported it to us, of course...).
version
---------------------------------------------------------------------------------------------------------------------------------
PostgreSQL 12.3 (Ubuntu 12.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit

I concur that the determinism doesn’t seem like a problem - but not much else does either. As a shot in the dark does pg_depend show any differences between the dependencies for the two views?

Could be worth checking, yes. Could you give me any guidance as to how to compare this? Never looked at pg_depend before -- which of the columns should have the oid for the matview I want to look up dependencies for?

It would be an educational/trial-and-error experience for me as well. That you found a difference in pg_rewrite.ev_action probably provides a more fruitful avenue of attack though I'm inexperienced there as well. I do believe that inspecting pg_depend will also highlight whatever difference you are seeing in the ev_action. What tickles my curiosity is why that difference (whatever it is, I haven't looked) isn't manifesting in the \d+ output for the materialized view.

David J.

Re: Different results from identical matviews

От

Tom Lane

Дата:

02 июля 2020 г., 18:43:53

Anders Steinlein <anders@e5r.no> writes:
>> Even that perhaps isn't conclusive, so you could
>> also try comparing the pg_rewrite.ev_action fields for the views'
>> ON SELECT rules.  (That might be a bit frustrating because of likely
>> inconsistencies in node "location" fields; but any other difference
>> is cause for suspicion.)

> You're right, ev_action is indeed different:
> ...
> Is there somehow I can format them to make it easier to compare? My basic
> attempts didn't help me much. I put them up in all their glories in
> pastebins, since they are rather large. Please let me know if there is
> somehow I can make this easier to look into.

Yeah, expression trees are pretty unreadable :-(.  I downloaded these,
changed all the "location" fields to -1 to make them more comparable,
and behold there are still a bunch of diffs.  Here's one:

original view:

{JOINEXPR :jointype 0 :isNatural false :larg
{RANGETBLREF :rtindex 1}
 :rarg
{RANGETBLREF :rtindex 2}
 :usingClause ("email") :quals
{OPEXPR :opno 98 :opfuncid 67 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 100 :args (
{RELABELTYPE :arg
{VAR :varno 1 :varattno 2 :vartype 106893 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 2
:location-1} 
 :resulttype 25 :resulttypmod -1 :resultcollid 100 :relabelformat 2 :location -1}
{RELABELTYPE :arg
{VAR :varno 2 :varattno 2 :vartype 106893 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 2 :varoattno 2
:location-1} 
 :resulttype 25 :resulttypmod -1 :resultcollid 100 :relabelformat 2 :location -1}
) :location -1}
 :alias <> :rtindex 3}

new view:

{JOINEXPR :jointype 0 :isNatural false :larg
{RANGETBLREF :rtindex 1}
 :rarg
{RANGETBLREF :rtindex 2}
 :usingClause ("email") :quals
{OPEXPR :opno 106108 :opfuncid 106101 :opresulttype 16 :opretset false :opcollid 0 :inputcollid 100 :args (
{RELABELTYPE :arg
{VAR :varno 1 :varattno 2 :vartype 106893 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 1 :varoattno 2
:location-1} 
 :resulttype 106084 :resulttypmod -1 :resultcollid 100 :relabelformat 2 :location -1}
{RELABELTYPE :arg
{VAR :varno 2 :varattno 2 :vartype 106893 :vartypmod -1 :varcollid 100 :varlevelsup 0 :varnoold 2 :varoattno 2
:location-1} 
 :resulttype 106084 :resulttypmod -1 :resultcollid 100 :relabelformat 2 :location -1}
) :location -1}
 :alias <> :rtindex 3}

This is the internal form of a "JOIN ... USING (email)" construct.
I didn't try to trace this back to exactly where it was in the source
queries.  The important thing here is that we have a couple of Vars
of type 106893, which I gather must be citext or a domain over it.
In the first tree, those are coerced via a no-op RelabelType operation
into plain text (type OID 25) and then compared with the built-in texteq
operator.  In the second tree, they are coerced to some other non-built-in
type (maybe plain citext?) and then compared with operator 106108.

I am betting that 106084 is citext, 106108 is citext's equality operator,
and the net implication of all this is that the original matview is doing
the JOIN using case-sensitive equality whereas the new one is using
case-insensitive equality.

A plausible explanation for how things got that way is that citext's
equality operator wasn't in your search_path when you created the original
matview, but it is in view when you make the new one, allowing that
equality operator to capture the interpretation of USING.  Unfortunately,
since the reverse-listing of this join is just going to say "USING
(email)", there's no way to detect from human-readable output that the
interpretation of the USING clauses is different.  (We've contemplated
introducing not-SQL-standard syntax to allow flagging such cases, but
haven't pulled the trigger on that.)

I count five places in the query with similar operator substitutions.
There are some other diffs in the trees that are a bit odd, but might be
explained if the new view was made by dump/reload rather than from the
identical SQL text the original view was made from; they all look like
they are references to JOIN output columns rather than the underlying
table columns or vice versa.  That's probably harmless, but the different
join operators certainly are not.

            regards, tom lane

Re: Different results from identical matviews

От

"David G. Johnston"

Дата:

02 июля 2020 г., 19:02:16

On Thu, Jul 2, 2020 at 8:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

A plausible explanation for how things got that way is that citext's
equality operator wasn't in your search_path when you created the original
matview, but it is in view when you make the new one, allowing that
equality operator to capture the interpretation of USING. Unfortunately,
since the reverse-listing of this join is just going to say "USING
(email)", there's no way to detect from human-readable output that the
interpretation of the USING clauses is different. (We've contemplated
introducing not-SQL-standard syntax to allow flagging such cases, but
haven't pulled the trigger on that.)

The citext extension seems to have been installed into the public schema as well which could introduce the CVE-2018-1058 fix as a potential moving part.

It seems a bit odd though since the textual query does specify "DISTINCT mails_contacts_opens.email::public.citext" so it does seem to be search_path induced as the view couldn't exist if the extension was simply missing not extension specific equality operator were present to match in front of the default equality operator. But then those casts also make me question whether the source tables are defined using text instead of citext in which case the joins using text equality would be expected and their using citext equality in the new queries potentially suspect.

David J.

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

03 июля 2020 г., 00:32:48

On Thu, Jul 2, 2020 at 5:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anders Steinlein <anders@e5r.no> writes:
>> Even that perhaps isn't conclusive, so you could
>> also try comparing the pg_rewrite.ev_action fields for the views'
>> ON SELECT rules. (That might be a bit frustrating because of likely
>> inconsistencies in node "location" fields; but any other difference
>> is cause for suspicion.)

> You're right, ev_action is indeed different:
> ...
> Is there somehow I can format them to make it easier to compare? My basic
> attempts didn't help me much. I put them up in all their glories in
> pastebins, since they are rather large. Please let me know if there is
> somehow I can make this easier to look into.

Yeah, expression trees are pretty unreadable :-(. I downloaded these,
changed all the "location" fields to -1 to make them more comparable,
and behold there are still a bunch of diffs. Here's one:

[...]

This is the internal form of a "JOIN ... USING (email)" construct.
I didn't try to trace this back to exactly where it was in the source
queries. The important thing here is that we have a couple of Vars
of type 106893, which I gather must be citext or a domain over it.

Yes, we have a domain called `email` over citext. The reason for the multiple casts to citext, IIRC, was that i.e. array_agg() didn't accept the email domain directly.

In the first tree, those are coerced via a no-op RelabelType operation
into plain text (type OID 25) and then compared with the built-in texteq
operator. In the second tree, they are coerced to some other non-built-in
type (maybe plain citext?) and then compared with operator 106108.

I am betting that 106084 is citext, 106108 is citext's equality operator,
and the net implication of all this is that the original matview is doing
the JOIN using case-sensitive equality whereas the new one is using
case-insensitive equality.

This makes sense. We've narrowed the different results down to this exact case; that is, for the rows that are missing in the old matview, they have "email" entries in different case between some of the tables. So case-sensitive equality checks on these will naturally be lost somewhere. How this came to be is another question...

A plausible explanation for how things got that way is that citext's
equality operator wasn't in your search_path when you created the original
matview, but it is in view when you make the new one, allowing that
equality operator to capture the interpretation of USING.

Possibly, since this view has existed for many years. However in general, our multi-tenant migration system does SET search_path TO <tenant>, public for all DDL we do, so unless we had an issue whey back when it should've been present.

Unfortunately,
since the reverse-listing of this join is just going to say "USING
(email)", there's no way to detect from human-readable output that the
interpretation of the USING clauses is different. (We've contemplated
introducing not-SQL-standard syntax to allow flagging such cases, but
haven't pulled the trigger on that.)

If I'm reading this correctly, would this be a "reason" to be more explicit when doing joins involving non-standard data types? I.e. would it be "safer" to do ON x1.email::citext == x2.email::citext instead of USING (if there is any difference at all...)?

I count five places in the query with similar operator substitutions.
There are some other diffs in the trees that are a bit odd, but might be
explained if the new view was made by dump/reload rather than from the
identical SQL text the original view was made from; they all look like
they are references to JOIN output columns rather than the underlying
table columns or vice versa. That's probably harmless, but the different
join operators certainly are not.

I looked over the procedure we used for the upgrade, and it was this: Postgres 9.4 server backup using WAL-E, restored into 9.4 on a new box, where we did some data cleanup and upgraded extensions including citext. Then used pg_dump from 12, directory format, and restored into 12. Since that time no DDL or extension changes have been made on the tables/view involved here. Could the citext upgrade have had any effect here, messing with dependencies somehow?

Anyway, I think I'll just recreate these views for all tenants to be on the safe side. Still curious about how this came to be, but since it's about 10 years of history in this database I guess it will be hard to figure anything out for sure.

Thanks a lot for looking into this, let me know if there's any reason to dig further.

Best,

-- a.

Re: Different results from identical matviews

От

Tom Lane

Дата:

03 июля 2020 г., 00:44:53

Anders Steinlein <anders@e5r.no> writes:
> On Thu, Jul 2, 2020 at 5:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Unfortunately,
>> since the reverse-listing of this join is just going to say "USING
>> (email)", there's no way to detect from human-readable output that the
>> interpretation of the USING clauses is different.  (We've contemplated
>> introducing not-SQL-standard syntax to allow flagging such cases, but
>> haven't pulled the trigger on that.)

> If I'm reading this correctly, would this be a "reason" to be more explicit
> when doing joins involving non-standard data types? I.e. would it be
> "safer" to do ON x1.email::citext == x2.email::citext instead of USING (if
> there is any difference at all...)?

Yes, it would be.  Of course then you don't get the "merging" of the two
join output columns into one, so you might have to qualify references a
bit more.

You might find this thread interesting:

https://www.postgresql.org/message-id/flat/ffefc172-a487-aa87-a0e7-472bf29735c8%40gmail.com

            regards, tom lane

Re: Different results from identical matviews

От

Anders Steinlein

Дата:

03 июля 2020 г., 01:05:27

On Thu, Jul 2, 2020 at 11:44 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Anders Steinlein <anders@e5r.no> writes:
> I'm reading this correctly, would this be a "reason" to be more explicit
> when doing joins involving non-standard data types? I.e. would it be
> "safer" to do ON x1.email::citext == x2.email::citext instead of USING (if
> there is any difference at all...)?

Yes, it would be. Of course then you don't get the "merging" of the two
join output columns into one, so you might have to qualify references a
bit more.

You might find this thread interesting:

https://www.postgresql.org/message-id/flat/ffefc172-a487-aa87-a0e7-472bf29735c8%40gmail.com

Indeed interesting, thanks!

Am I right in thinking that we should actually go over (i.e. re-create) all functions and views defined before this dump/restore where we're using JOIN ... USING (citext_column)? We most definitely have many more such cases, since this is the common (perhaps naive) way we've written joins (unless there are obvious reasons to be explicit). :-/

Best,

-- a.

Re: Different results from identical matviews

От

Tom Lane

Дата:

03 июля 2020 г., 01:25:38

Anders Steinlein <anders@e5r.no> writes:
> Am I right in thinking that we should actually go over (i.e. re-create) all
> functions and views defined before this dump/restore where we're using JOIN
> ... USING (citext_column)? We most definitely have many more such cases,
> since this is the common (perhaps naive) way we've written joins (unless
> there are obvious reasons to be explicit). :-/

If it's not clear to you how this matview came to be different from
the rest, then it certainly seems likely that other ones might have
the same disease.

Note that functions don't really have this sort of issue, since they're
just stored as text.  Only a view or matview would preserve creation-time
decisions about interpretation.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Different results from identical matviews