Обсуждение: Puzzline CROSS JOIN when doing RECURSIVE CTE
Hi all, I've been working on a recursive query (I've already written a few, so I'm not a complete newbie.. All of the code below is available on the fiddle here: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0cc20c9081867131260e6e3550bd08ab I have a table called line SELECT idx, length, string ~ 'html', string FROM line; Result: idxlength?column?string1 257 f with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[ 2 22 t <html lang="en"> 3 12 f <head> 4 33 f <meta charset="utf-8"/> 5 20 f more stuff 6 20 f more stuff 716f </table> ... ... snipped for brevity ... 16 rows OK, grand, now I wish to perform a RECURSIVE CTE on it. So, I start by trying something (I thought was) very simple. Obviously, I plan to do more, but I wanted to get the "mechanics" correct to start with. So, my query is: WITH RECURSIVE cte1 (n, ln) AS ( SELECT 1 AS n, string FROM line UNION ALL SELECT n + 1, ln FROM cte1 WHERE n < (SELECT COUNT(*) FROM line) ) SELECT * FROM cte1; i.e. have a counter variable and a string from the line table But, then to my horror, the result of this query is 1with t(x) as (values( XMLPARSE(DOCUMENT ('<root><NotificationServiceDetails NotificationNo="0" AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" ><NotificationServiceDetail Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[ 1 <html lang="en"> 1 <head> 1 <meta charset="utf-8"/> 1 more stuff 1 more stuff 1 </table> 1 </body> 1 </html> ... ... snipped for brevity ... 256 rows! <<=== note 256! So, my simple recursive CTE is a) not incrementing n and b) obviously doing some sort of CROSS JOIN (16^2 = 256). I would be grateful if somebody could explain what's going on here. As shown in the fiddle, I can do the basic 1 - 10 RCTE and I've done way more complex ones before, so I'm a bit baffled as to what's going on here. Any explanations, references to URLs or other advice gratefully received. TIA and rgs, Here's the fiddle again if you wish to take a look https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0cc20c9081867131260e6e3550bd08ab Pól...
> On 18 Apr 2022, at 11:56, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote: (…) > All of the code below is available on the fiddle here: > > https://dbfiddle.uk/?rdbms=postgres_13&fiddle=0cc20c9081867131260e6e3550bd08ab (…) > OK, grand, now I wish to perform a RECURSIVE CTE on it. So, I start by > trying something (I thought was) very simple. Obviously, I plan to do > more, but I wanted to get the "mechanics" correct to start with. So, > my query is: > > WITH RECURSIVE cte1 (n, ln) AS > ( > SELECT 1 AS n, string > FROM line Here is your first problem, this will yield a result for each row in your line table, numbering it ‘1’. You seem to haveexpected just a single result here, but that is something that you need to take care of in your query. This part is called the base case, base step or initial step. > UNION ALL > SELECT n + 1, ln > FROM cte1 > WHERE n < (SELECT COUNT(*) FROM line) And then for each of those rows, it will add all those rows (from the same CTE!) again. This part is called the recursive step. You did add a termination condition here, which indeed manages to terminate, but it does so too late. It seems that you do understand some of the concepts of recursive CTE’s, but you appear to be missing some crucial knowledge. For example, it is actually possible to query multiple trees with a single recursive CTE. It is not limited to a single tree.How many trees the CTE will navigate depends on how you selected the rows in the base case. > ) > SELECT * FROM cte1; > > i.e. have a counter variable and a string from the line table My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a tree),unless perhaps you intended to actually traverse the HTML document hierarchy? > > But, then to my horror, the result of this query is > > 1with t(x) as (values( XMLPARSE(DOCUMENT > ('<root><NotificationServiceDetails NotificationNo="0" > AlarmCode="mail" AlarmStartTime="10:00:00" AlarmTime="0" Id ="2" >> <NotificationServiceDetail > Id="2"><Title><![CDATA[aaaaaaaaaaaaa]]></Title><ContentJson><![CDATA[ > 1 <html lang="en"> > 1 <head> > 1 <meta charset="utf-8"/> > 1 more stuff > 1 more stuff > 1 </table> > 1 </body> > 1 </html> > ... > ... snipped for brevity > ... > > 256 rows! <<=== note 256! > > > So, my simple recursive CTE is > > a) not incrementing n and It should be, but only after the first 16 rows from your base case. > b) obviously doing some sort of CROSS JOIN (16^2 = 256). Not quite, it’s selecting the same 16 rows 16 times, incrementing the numbering each time. Effectively it is indeed a Cartesianproduct. More on that below. > I would be grateful if somebody could explain what's going on here. As > shown in the fiddle, I can do the basic 1 - 10 RCTE and I've done way > more complex ones before, so I'm a bit baffled as to what's going on > here. For a recursive CTE to work, you need two things, and that seems to be where you have misunderstood some things: 1. You need a base case that selects the _initial_ rows to start the recursion from. 2. You need a recursive step, where you usually join the results of your CTE to your table to figure out what the next resultshould be. (It is very similar to how this works in procedural languages, where you first call a function that initiates the recursionand then that calls a second function (with more parameters usually) that calls itself again and again, with updatedinput parameters, until some termination condition is reached and the function returns a value.) For the base case, you will want to make sure that you select the correct row(s) from your line table for the initial set.My guess is that it should be the row containing the string '<root>'. For the recursive step, you will want to select the next child row from your line table that is (directly) related to theone(s) you selected in the base case, or is (directly) related to the previous iteration of the recursive case. It is unclear to me what that should be in your case, I don’t understand the purpose of your CTE. Your termination condition, WHERE n < (SELECT COUNT(*) FROM line), isn’t quite doing what you expect either I think. Let’s say that c = (SELECT COUNT(*) FROM line), so c = 16 in your example. - The first iteration of the CTE, the base case, will select 16 lines with n=1. - The second iteration is the first time in the recursive step, and will select all lines from the base case where n < c,or 1 < 16, which is all of them. - The next iteration will select all rows where 2 < 16, then 3 < 16, … until finally in the 16th iteration it stops withthe test 16 < 16. So, you managed to make the recursion terminate at least, but probably not how you intended it to work. Something along the lines of: WITH RECURSIVE cte (n, ln) AS ( SELECT 1, string FROM line WHERE string LIKE '%<root>%' UNION ALL SELECT n+1, string FROM cte JOIN line ON iHaveNoClueHere ) SELECT * FROM cte; > Any explanations, references to URLs or other advice gratefully received. There are plenty of explanations about recursive CTE’s in SQL on the Internet. I would probably have a good look at how it’sdescribed in the PG documentation. There are a couple of good articles on the topic by some PG bloggers too. You’ll find plenty of articles related to how to do this in MSSQL or later Oracle versions. Those are totally fine for understandingthe principles, and even the syntax is almost entirely the same. For a PG specific explanation, this one looks pretty thorough: https://towardsdatascience.com/recursive-sql-queries-with-postgresql-87e2a453f1b Regards, Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
Hi Alban, and many thanks for your input. > My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a tree),unless perhaps you intended to actually traverse the HTML document hierarchy? This is basically an exercise on my part. The question that I'm trying to answer is here: https://stackoverflow.com/questions/70574881/how-can-get-html-inner-tag-in-posgresql I've already answered it in 3 different ways - but I was trying to do it with RCTEs in order to improve my comprehension of them. So, basically, I want to pick out a subsection of text from a "passage". So then, I wanted to establish a true/false state for the lines that I want and don't want, going through line by line. I know that the RCTE is a very contrived way of doing this, but it's for learning really. I wonder if you could be so kind as to give me a "skeleton" RCTE for this - I've been staring at this for hours - and it's not that I'm lazy or haven't studied RCTEs - I wrote this RCTE https://stackoverflow.com/a/71674990/470530 recently, so it's not as if I'm completely ignorant of RCTEs - I'm just stuck in a rut. Any help would be appreciated. TIA and rgs, Pól... > Alban Hertroys
> On 18 Apr 2022, at 14:51, Pól Ua Laoínecháin <linehanp@tcd.ie> wrote: > > Hi Alban, and many thanks for your input. > >> My first question is why you’re using a recursive CTE here? This doesn’t appear to be hierarchical data (such as a tree),unless perhaps you intended to actually traverse the HTML document hierarchy? > > This is basically an exercise on my part. > > The question that I'm trying to answer is here: > > https://stackoverflow.com/questions/70574881/how-can-get-html-inner-tag-in-posgresql > > I've already answered it in 3 different ways - but I was trying to do > it with RCTEs in order to improve my comprehension of them. > > So, basically, I want to pick out a subsection of text from a "passage". > > So then, I wanted to establish a true/false state for the lines that I > want and don't want, going through line by line. I know that the RCTE > is a very contrived way of doing this, but it's for learning really. Considering that you’re already looking at the elements of a parsed DOM tree, the exercise boils down to traversing thattree. Due to how xmlparse() is implemented, you probably already get them in the right order even when not using an explicitorder by. That is, if you’re looking for a DFT (depth first traversal) as opposed to a BFT (breadth first). One of the difficulties here is that there are some CDATA sections involved with more XML in them. My guess is that that’sthe data that you’re actually after, but that’s just a matter of entering the document with the correct path I suppose? > I wonder if you could be so kind as to give me a "skeleton" RCTE for > this - I've been staring at this for hours - and it's not that I'm > lazy or haven't studied RCTEs - I wrote this RCTE > > https://stackoverflow.com/a/71674990/470530 > > recently, so it's not as if I'm completely ignorant of RCTEs - I'm > just stuck in a rut. Any help would be appreciated. You would first need to determine the root node(s). Those are the ones w/o parents, or you may have some other way of determiningthose. Next is finding all nodes that have an earlier node as their parent. You could go an extra step here with preserving the order of the siblings in the document, by numbering nodes (directly)under the same parent. I usually build an ltree structure with that information, while traversing the tree - that gets you an ltree with entries(1, 1.1, 1.1.1, 1.1.2, 1.2.1, etc) that you then can use for the final order by, for example. In case you didn’t know, ltree is a module you can install. I find it still very useful in tree traversals. The one drawbackI see is that for these scenario’s you’d ideally want an ltree based on integers, such that 10 sorts after 9 insteadof between 1 and 2. Padding enough zeroes before the ltree text items is a bit of an extra hassle that I’d preferto do without. I haven’t actually looked at what DOM navigation functions exist for PG, so this is more or less pseudo code. Worse, my localcopy of PG was compiled w/o XML support, so I don’t know what kind of result the query from that SO article produces.But then again, I don’t really know what you’re after anyway, so... This is basically how I would go about it. with recursive -- First we need to get the DOM-tree parsed (this is not actually recursive) domtree as ( select node from xmlparse(document(‘<root>...</root>')) ), -- Next we can traverse it cte (node, hierarchy, n) as ( select node, 1::text::ltree, 1 from domtree where parent(node) is null union all select node, cte.hierarchy || (cte.n+1)::text::ltree, n+1 from domtree t join cte on parent(t.node) = cte.node ) select * from cte order by hierarchy; Function parent() is made-up. It would return the parent node of a node, so that there is some way to connect the differentparts in the hierarchy. I guess xpath() could fulfil that purpose, but I have no way of testing that hypothesis. I hope that’s a good enough starting point for you? Alban Hertroys -- There is always an exception to always.