Обсуждение: tsvector string representation and parsing
Hi, I am constructing a tsvector representation of some text manually because I need to provide explicit positional parameters. In some cases, the conversion fails and I get a "syntax error in tsvector". This is a minimal example that goes wrong (but shouldn't IMHO: > SELECT format('%L:1', '\:')::tsvector My expectation here is that %L would transform '\:' into a correctly formatted character string, including single quotes around the string. The documentation reads: > L quotes the argument value as an SQL literal > SELECT $$'\:':1$$ returns «'\:':1», but > SELECT $$'\:':1$$::tsvector returns «':':1», so something is happening here inside the single-quoted lexemes to the backslash. My understanding is that everything inside single quotes is taken as a lexeme. From the documentation: > To represent lexemes containing whitespace or punctuation, surround them with quotes So having everything surrounded by single quotes (followed by a positional argument after the colon), «$$'\:':1$$::tsvector» should actually return a vector with a single lexeme «\:» at position 1. Am I missing something? Regards Johannes
=?UTF-8?Q?Johannes_Gra=c3=abn?= <johannes@selfnet.de> writes: > This is a minimal example that goes wrong (but shouldn't IMHO: >> SELECT format('%L:1', '\:')::tsvector format(%L) is designed to produce a SQL literal, which does not have the same requirements as a tsvector element ... yeah, they're close, but not close enough. In this particular example, what you get is =# SELECT format('%L:1', '\:'); format ---------- E'\\:':1 (1 row) because format() adds an E prefix for the avoidance of doubt about what to do with the backslashes. tsvector doesn't like that. I don't think we have any prefab function that does what you're looking for here, and TBH I'm not sure I see the point of it. Pretty much any tsvector you'd be dealing with in practice is going to have come from one of the to_tsvector family of functions, and those tend to drop punctuation. > My understanding is that everything inside > single quotes is taken as a lexeme. From the documentation: >> To represent lexemes containing whitespace or punctuation, surround them with quotes See also the next bit about having to double quotes and backslashes within those quotes. So what you'd actually need is =# select $$'\\:':1$$::tsvector; tsvector ---------- '\\:':1 (1 row) If you write just one backslash, it has the effect of quoting the next character, which in this case doesn't need quoting. regards, tom lane
Thanks, Tom. On 23/02/2022 23.30, Tom Lane wrote: > =?UTF-8?Q?Johannes_Gra=c3=abn?= <johannes@selfnet.de> writes: >> This is a minimal example that goes wrong (but shouldn't IMHO: > >>> SELECT format('%L:1', '\:')::tsvector > > format(%L) is designed to produce a SQL literal, which does not > have the same requirements as a tsvector element ... yeah, they're > close, but not close enough. In this particular example, > what you get is > > =# SELECT format('%L:1', '\:'); > format > ---------- > E'\\:':1 > (1 row) > > because format() adds an E prefix for the avoidance of doubt about > what to do with the backslashes. tsvector doesn't like that. I see. > I don't think we have any prefab function that does what you're > looking for here, and TBH I'm not sure I see the point of it. > Pretty much any tsvector you'd be dealing with in practice is > going to have come from one of the to_tsvector family of > functions, and those tend to drop punctuation. Applied to normal texts in a standard language, I believe that's true. I'd like to use FTS in a setting where I need to control the positional attributes as I'm specifying more than one lexeme at the same position (which works great btw). That's why I can't use to_tsvector() but need to cast it from a string to tsvector. >> My understanding is that everything inside >> single quotes is taken as a lexeme. From the documentation: > >>> To represent lexemes containing whitespace or punctuation, surround them with quotes > > See also the next bit about having to double quotes and backslashes > within those quotes. So what you'd actually need is > > =# select $$'\\:':1$$::tsvector; > tsvector > ---------- > '\\:':1 > (1 row) > > If you write just one backslash, it has the effect of quoting > the next character, which in this case doesn't need quoting. Before using format(), I tried just generating those strings by doubling any single quote or backslash and enclosing the whole string in single quotes, but that didn't seem a safe way, though it works in principle: > SELECT format($$'%s':%s$$, replace(replace(s, $$'$$, $$''$$), '\', '\\'), i)::tsvector > FROM (SELECT $$\:$$ s, 1 i) x; Would that be the way to go if to_tsvector is not an option? Regards Johannes
=?UTF-8?Q?Johannes_Gra=c3=abn?= <johannes@selfnet.de> writes: > Before using format(), I tried just generating those strings by doubling > any single quote or backslash and enclosing the whole string in single > quotes, but that didn't seem a safe way, though it works in principle: That's the documented requirement, so I don't see why it wouldn't be safe. regards, tom lane