Обсуждение: Puzzline CROSS JOIN when doing RECURSIVE CTE

Поиск
Список
Период
Сортировка

Puzzline CROSS JOIN when doing RECURSIVE CTE

От
Pól Ua Laoínecháin
Дата:
Hi all,

I've been working on a recursive query (I've already written a few, so
 I'm not a complete newbie..

All  of the code below is available on the fiddle here:

https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0cc20c9081867131260e6e3550bd08ab


I have a table called line

SELECT
  idx, length, string ~ 'html', string
FROM
  line;


Result:

idxlength?column?string1 257 f with t(x) as (values( XMLPARSE(DOCUMENT
('<root><NotificationServiceDetails NotificationNo="0"
AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2"
><NotificationServiceDetail
Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
2 22 t <html lang="en">
3 12 f <head>
4 33 f <meta charset="utf-8"/>
5 20 f more stuff
6 20 f more stuff
716f </table>
...
... snipped for brevity
...

16 rows

OK, grand, now I wish to perform a RECURSIVE CTE on it. So, I start by
trying something (I thought was) very simple. Obviously, I plan to do
more, but I wanted to get the "mechanics" correct to start with. So,
my query is:

WITH RECURSIVE cte1 (n, ln) AS
(
  SELECT 1 AS n, string
  FROM line
  UNION ALL
  SELECT n + 1, ln
  FROM cte1
  WHERE n < (SELECT COUNT(*) FROM line)
)
SELECT * FROM cte1;

i.e. have a counter variable and a string from the line table


But, then to my horror, the result of this query is

1with t(x) as (values( XMLPARSE(DOCUMENT
('<root><NotificationServiceDetails NotificationNo="0"
AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2"
><NotificationServiceDetail
Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
1 <html lang="en">
1 <head>
1 <meta charset="utf-8"/>
1 more stuff
1 more stuff
1 </table>
1 </body>
1 </html>
...
... snipped for brevity
...

256 rows!        <<=== note 256!


So, my simple recursive CTE is

a) not incrementing n and
b) obviously doing some sort of CROSS JOIN (16^2 = 256).


I would be grateful if somebody could explain what's going on here. As
shown in the fiddle, I can do the basic 1 - 10 RCTE and I've done way
more complex ones before, so I'm a bit baffled as to what's going on
here.

Any explanations, references to URLs or other advice gratefully received.

TIA and rgs,

Here's the fiddle again if you wish to take a look

https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0cc20c9081867131260e6e3550bd08ab

Pól...



Re: Puzzline CROSS JOIN when doing RECURSIVE CTE

От
Alban Hertroys
Дата:
> On 18 Apr 2022, at 11:56, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote:

(…)

> All  of the code below is available on the fiddle here:
>
> https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0cc20c9081867131260e6e3550bd08ab

(…)

> OK, grand, now I wish to perform a RECURSIVE CTE on it. So, I start by
> trying something (I thought was) very simple. Obviously, I plan to do
> more, but I wanted to get the "mechanics" correct to start with. So,
> my query is:
>
> WITH RECURSIVE cte1 (n, ln) AS
> (
>  SELECT 1 AS n, string
>  FROM line

Here is your first problem, this will yield a result for each row in your line table, numbering it ‘1’. You seem to
haveexpected just a single result here, but that is something that you need to take care of in your query. 
This part is called the base case, base step or initial step.

>  UNION ALL
>  SELECT n + 1, ln
>  FROM cte1
>  WHERE n < (SELECT COUNT(*) FROM line)

And then for each of those rows, it will add all those rows (from the same CTE!) again.
This part is called the recursive step.

You did add a termination condition here, which indeed manages to terminate, but it does so too late.

It seems that you do understand some of the concepts of recursive CTE’s, but you appear to be missing some crucial
knowledge.

For example, it is actually possible to query multiple trees with a single recursive CTE. It is not limited to a single
tree.How many trees the CTE will navigate depends on how you selected the rows in the base case. 

> )
> SELECT * FROM cte1;
>
> i.e. have a counter variable and a string from the line table

My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a
tree),unless perhaps you intended to actually traverse the HTML document hierarchy? 

>
> But, then to my horror, the result of this query is
>
> 1with t(x) as (values( XMLPARSE(DOCUMENT
> ('<root><NotificationServiceDetails NotificationNo="0"
> AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2"
>> <NotificationServiceDetail
> Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[
> 1 <html lang="en">
> 1 <head>
> 1 <meta charset="utf-8"/>
> 1 more stuff
> 1 more stuff
> 1 </table>
> 1 </body>
> 1 </html>
> ...
> ... snipped for brevity
> ...
>
> 256 rows!        <<=== note 256!
>
>
> So, my simple recursive CTE is
>
> a) not incrementing n and

It should be, but only after the first 16 rows from your base case.

> b) obviously doing some sort of CROSS JOIN (16^2 = 256).

Not quite, it’s selecting the same 16 rows 16 times, incrementing the numbering each time. Effectively it is indeed a
Cartesianproduct. More on that below. 

> I would be grateful if somebody could explain what's going on here. As
> shown in the fiddle, I can do the basic 1 - 10 RCTE and I've done way
> more complex ones before, so I'm a bit baffled as to what's going on
> here.

For a recursive CTE to work, you need two things, and that seems to be where you have misunderstood some things:

1. You need a base case that selects the _initial_ rows to start the recursion from.
2. You need a recursive step, where you usually join the results of your CTE to your table to figure out what the next
resultshould be. 

(It is very similar to how this works in procedural languages, where you first call a function that initiates the
recursionand then that calls a second function (with more parameters usually) that calls itself again and again, with
updatedinput parameters, until some termination condition is reached and the function returns a value.) 

For the base case, you will want to make sure that you select the correct row(s) from your line table for the initial
set.My guess is that it should be the row containing the string '<root>'. 

For the recursive step, you will want to select the next child row from your line table that is (directly) related to
theone(s) you selected in the base case, or is (directly) related to the previous iteration of the recursive case. 
It is unclear to me what that should be in your case, I don’t understand the purpose of your CTE.

Your termination condition, WHERE n < (SELECT COUNT(*) FROM line), isn’t quite doing what you expect either I think.
Let’s say that c = (SELECT COUNT(*) FROM line), so c = 16 in your example.
- The first iteration of the CTE, the base case, will select 16 lines with n=1.
- The second iteration is the first time in the recursive step, and will select all lines from the base case where n <
c,or 1 < 16, which is all of them. 
- The next iteration will select all rows where 2 < 16, then 3 < 16, … until finally in the 16th iteration it stops
withthe test 16 < 16. 

So, you managed to make the recursion terminate at least, but probably not how you intended it to work.

Something along the lines of:
WITH RECURSIVE cte (n, ln) AS (
    SELECT 1, string
    FROM line
    WHERE string LIKE '%<root>%'
    UNION ALL
    SELECT n+1, string
    FROM cte
    JOIN line ON iHaveNoClueHere
)
SELECT * FROM cte;

> Any explanations, references to URLs or other advice gratefully received.

There are plenty of explanations about recursive CTE’s in SQL on the Internet. I would probably have a good look at how
it’sdescribed in the PG documentation. There are a couple of good articles on the topic by some PG bloggers too. 

You’ll find plenty of articles related to how to do this in MSSQL or later Oracle versions. Those are totally fine for
understandingthe principles, and even the syntax is almost entirely the same. 

For a PG specific explanation, this one looks pretty thorough:
https://towardsdatascience.com/recursive-sql-queries-with-postgresql-87e2a453f1b

Regards,

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.




Re: Puzzline CROSS JOIN when doing RECURSIVE CTE

От
Pól Ua Laoínecháin
Дата:
Hi Alban, and many thanks for your input.

> My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a
tree),unless perhaps you intended to actually traverse the HTML document hierarchy? 

This is basically an exercise on my part.

The question that I'm trying to answer  is here:

https://stackoverflow.com/questions/70574881/how-can-get-html-inner-tag-in-posgresql

I've already answered it in 3 different ways - but I was trying to do
it with RCTEs in order to improve my comprehension of them.

So, basically, I want to pick out a subsection of text from a "passage".

So then, I wanted to establish a true/false state for the lines that I
want and don't want, going through line by line. I know that the RCTE
is  a very contrived way of doing this, but it's for learning really.

I wonder if you could be so kind as to give me  a "skeleton" RCTE for
this - I've been staring at this for hours - and it's not that I'm
lazy or haven't studied RCTEs - I wrote this RCTE

https://stackoverflow.com/a/71674990/470530

recently, so it's not as if I'm completely ignorant of RCTEs - I'm
just stuck in a rut. Any help would be appreciated.


TIA and rgs,


Pól...


> Alban Hertroys



Re: Puzzline CROSS JOIN when doing RECURSIVE CTE

От
Alban Hertroys
Дата:
> On 18 Apr 2022, at 14:51, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote:
>
> Hi Alban, and many thanks for your input.
>
>> My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a
tree),unless perhaps you intended to actually traverse the HTML document hierarchy? 
>
> This is basically an exercise on my part.
>
> The question that I'm trying to answer  is here:
>
> https://stackoverflow.com/questions/70574881/how-can-get-html-inner-tag-in-posgresql
>
> I've already answered it in 3 different ways - but I was trying to do
> it with RCTEs in order to improve my comprehension of them.
>
> So, basically, I want to pick out a subsection of text from a "passage".
>
> So then, I wanted to establish a true/false state for the lines that I
> want and don't want, going through line by line. I know that the RCTE
> is  a very contrived way of doing this, but it's for learning really.

Considering that you’re already looking at the elements of a parsed DOM tree, the exercise boils down to traversing
thattree. Due to how xmlparse() is implemented, you probably already get them in the right order even when not using an
explicitorder by. That is, if you’re looking for a DFT (depth first traversal) as opposed to a BFT (breadth first). 

One of the difficulties here is that there are some CDATA sections involved with more XML in them. My guess is that
that’sthe data that you’re actually after, but that’s just a matter of entering the document with the correct path I
suppose?


> I wonder if you could be so kind as to give me  a "skeleton" RCTE for
> this - I've been staring at this for hours - and it's not that I'm
> lazy or haven't studied RCTEs - I wrote this RCTE
>
> https://stackoverflow.com/a/71674990/470530
>
> recently, so it's not as if I'm completely ignorant of RCTEs - I'm
> just stuck in a rut. Any help would be appreciated.

You would first need to determine the root node(s). Those are the ones w/o parents, or you may have some other way of
determiningthose. 

Next is finding all nodes that have an earlier node as their parent.
You could go an extra step here with preserving the order of the siblings in the document, by numbering nodes
(directly)under the same parent. 
I usually build an ltree structure with that information, while traversing the tree - that gets you an ltree with
entries(1, 1.1, 1.1.1, 1.1.2, 1.2.1, etc) that you then can use for the final order by, for example. 

In case you didn’t know, ltree is a module you can install. I find it still very useful in tree traversals. The one
drawbackI see is that for these scenario’s you’d ideally want an ltree based on integers, such that 10 sorts after 9
insteadof between 1 and 2. Padding enough zeroes before the ltree text items is a bit of an extra hassle that I’d
preferto do without. 

I haven’t actually looked at what DOM navigation functions exist for PG, so this is more or less pseudo code. Worse, my
localcopy of PG was compiled w/o XML support, so I don’t know what kind of result the query from that SO article
produces.But then again, I don’t really know what you’re after anyway, so... 

This is basically how I would go about it.

with recursive
-- First we need to get the DOM-tree parsed (this is not actually recursive)
domtree as (
    select node
      from xmlparse(document(‘<root>...</root>'))
),
-- Next we can traverse it
cte (node, hierarchy, n) as (
    select node, 1::text::ltree, 1
      from domtree
     where parent(node) is null

    union all

    select node, cte.hierarchy || (cte.n+1)::text::ltree, n+1
      from domtree t
      join cte on parent(t.node) = cte.node
)
select *
  from cte
 order by hierarchy;

Function parent() is made-up. It would return the parent node of a node, so that there is some way to connect the
differentparts in the hierarchy. I guess xpath() could fulfil that purpose, but I have no way of testing that
hypothesis.

I hope that’s a good enough starting point for you?


Alban Hertroys
--
There is always an exception to always.