Обсуждение: Text search language field
I'm working on project with some large text areas that will need to be searched, and I'm trying to set up an indexed text search field to make things a bit smoother. Only one of the top requirements is that this has to be multi-lingual, so I should be using the correct language's text search. The table stores the language the text is in, so in theory this should be easy... But it seems to be a bit more subtle than it first appears. Any ideas on the best way to set this up? Tables: Resource: "text_element" text, "text_search" tsvector, "language" char varying(3) references "languages" languages (This is 'static' list of ISO 639-3 codes, to make sure everything uses the same codes): "code" character varying(3) primary key, "description" character varying(100) Trigger (This is my first draft): create trigger "textsearch_trig" before update or insert on "resource" for each row execute procedure tsvector_update_trigger_column("text_search", "language", "text_element"); The trigger currently throws an error 'column "language" is not of regconfig type' whenever I try to use it, and I'm looking for the best way to solve that. The 'simple' solution of 'create text search configuration eng ( copy = pg_catalog.english );' didn't work. Before I start into a long exploration into why and what should be done about it, I thought I'd seek wisdom. ;) Daniel T. Staal --------------------------------------------------------------- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---------------------------------------------------------------
--As of May 13, 2012 2:15:36 PM +1200, Gavin Flower is alleged to have said: > I can't comment on how to implement the functionality you want, but I > have a few comments you may (or may not!) find useful. This of course was not the *complete* specs of the tables. I didn't feel the need to post pages of SQL for a simple question. ;) I just put in the relevant fields. > • Field names need not be in double quotes. I know. But I prefer to quote everything always, both to help them stand out in the text, and to make sure I don't have any case issues. > • if a character field is Always 3 characters, then say so It currently is always 3 characters, but I don't want to be to dogmatic about it, in case that needs to change in the future. (ISO codes have done that in the past, and I may decide a different/additional set of codes is needed at some point in the future.) Besides, there is no performance benefit in Postgres. (The opposite, actually...) > • add NOT NULL where appropriate (you may well decide more fields > aught to be NOT NULL) Generally I have, though not always I'll admit. 'Primary key' implies it, and that was the only one that actually has that restriction in the selection of fields I showed. > • keeping primary and foreign key fields separate from user visible > data. I disagree strongly here. ;) Primary keys should be whatever is suitable for the data; creating an artificial primary key (when not needed) is just storing more data and doing more work. It also gives a feeling of 'safety' which doesn't actually exist, as you can then create data that is invalid, but that fits because your table design allows it. In particular, the language table keeping the the 'code' unique and distinct is the *entire point* of the table, so there is no good reason to use anything else as the primary key. (I do have a generated ID in the resource table, though it's a much more complex generation than a simple serial. Again, I didn't feel the need to show it.) This does mean thinking through your future use-cases a bit more at the initial design stage, but if I wasn't willing to do that I wouldn't be asking this question in the first place. (Actually, the languages table is nearly exclusively used inside the database, so you could claim it *is* separate from user visible data. I could almost use an enum there instead, but having a linked description available might be useful on occasion. I also thought that knowing it exists might be useful in solving my problem: One possible issue is that text is not the correct data type, and having a mapping table available might be useful.) > • identifying primary and foreign key fields clearly Agreed, although I only showed one of each. ;) > • suggest table names be singular (my convention, not universally > adopted!) I tend to use plural or singular depending on how they will be used: 'resource' will tend to be used one at a time, while 'languages' will tend to be used as a reference list, and therefore as an aggregate. A more defined naming scheme might be useful I'll admit. Daniel T. Staal --------------------------------------------------------------- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---------------------------------------------------------------
I can't comment on how to implement the functionality you want, but I have a few comments you may (or may not!) find useful.
- Field names need not be in double quotes.
- if a character field is Always 3 characters, then say so
- add NOT NULL where appropriate (you may well decide more fields aught to be NOT NULL)
- keeping primary and foreign key fields separate from user visible data.
- identifying primary and foreign key fields clearly
- suggest table names be singular (my convention, not universally adaopted!)
CREATE TABLE language
(
id SERIAL PRIMARY KEY,
code char(3) NOT NULL,
description text
);
CREATE TABLE resource
(
id SERIAL PRIMARY KEY,
text_element text NOT NULL,
text_search tsvector,
language_id int REFERENCES language(id) NOT NULL
);
On 13/05/12 12:08, Daniel Staal wrote:
I'm working on project with some large text areas that will need to be searched, and I'm trying to set up an indexed text search field to make things a bit smoother. Only one of the top requirements is that this has to be multi-lingual, so I should be using the correct language's text search. The table stores the language the text is in, so in theory this should be easy... But it seems to be a bit more subtle than it first appears. Any ideas on the best way to set this up?
Tables:
Resource:
"text_element" text,
"text_search" tsvector,
"language" char varying(3) references "languages"
languages (This is 'static' list of ISO 639-3 codes, to make sure everything uses the same codes):
"code" character varying(3) primary key,
"description" character varying(100)
Trigger (This is my first draft):
create trigger "textsearch_trig" before update or insert
on "resource" for each row execute procedure
tsvector_update_trigger_column("text_search", "language", "text_element");
The trigger currently throws an error 'column "language" is not of regconfig type' whenever I try to use it, and I'm looking for the best way to solve that. The 'simple' solution of 'create text search configuration eng ( copy = pg_catalog.english );' didn't work. Before I start into a long exploration into why and what should be done about it, I thought I'd seek wisdom. ;)
Daniel T. Staal
---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
--As of May 13, 2012 2:15:36 PM +1200, Gavin Flower is alleged to have said:I am amazed at the number of times I see people specifying 'NOT NULL' and PRIMARY KEY' for the same field! Mind you, these same people could probably justifiable laugh at the daft things I do! :-)I can't comment on how to implement the functionality you want, but I
have a few comments you may (or may not!) find useful.
This of course was not the *complete* specs of the tables. I didn't feel the need to post pages of SQL for a simple question. ;) I just put in the relevant fields.• Field names need not be in double quotes.
I know. But I prefer to quote everything always, both to help them stand out in the text, and to make sure I don't have any case issues.• if a character field is Always 3 characters, then say so
It currently is always 3 characters, but I don't want to be to dogmatic about it, in case that needs to change in the future. (ISO codes have done that in the past, and I may decide a different/additional set of codes is needed at some point in the future.) Besides, there is no performance benefit in Postgres. (The opposite, actually...)• add NOT NULL where appropriate (you may well decide more fields
aught to be NOT NULL)
Generally I have, though not always I'll admit. 'Primary key' implies it, and that was the only one that actually has that restriction in the selection of fields I showed.
• keeping primary and foreign key fields separate from user visible
data.
I disagree strongly here. ;) Primary keys should be whatever is suitable for the data; creating an artificial primary key (when not needed) is just storing more data and doing more work. It also gives a feeling of 'safety' which doesn't actually exist, as you can then create data that is invalid, but that fits because your table design allows it. In particular, the language table keeping the the 'code' unique and distinct is the *entire point* of the table, so there is no good reason to use anything else as the primary key. (I do have a generated ID in the resource table, though it's a much more complex generation than a simple serial. Again, I didn't feel the need to show it.)
Actually, If I had thought about it a bit more, and something I would (or should!) do if I was designing the table for real, would be to use an UNIQUE qualifier for the code field.
Unless there are performance and/or data storage, or some such constraints - I prefer to linking tables with user visible things One production database I worked on had 5 tables in a chain of parent/child relationships, and each child primary key was a concatenation of a character field with the fields comprising the primary key of its parent – could be over 45 bytes in characters in length!
The current database I am designing is very complicated, but likely never to have more than a few thousand records in any table, and is likely to have many more reads (with some quite complicated queries) than writes. So I focus on trying to work in a very standardizing way, without having to worry over much about performance. Knowing my luck, my next project will be the exact opposite!
I would like to use ENUMs in postgres, but there appear to be problems f they need to be updated.
This does mean thinking through your future use-cases a bit more at the initial design stage, but if I wasn't willing to do that I wouldn't be asking this question in the first place.
(Actually, the languages table is nearly exclusively used inside the database, so you could claim it *is* separate from user visible data. I could almost use an enum there instead, but having a linked description available might be useful on occasion. I also thought that knowing it exists might be useful in solving my problem: One possible issue is that text is not the correct data type, and having a mapping table available might be useful.)
Horses for courses.• identifying primary and foreign key fields clearly
Agreed, although I only showed one of each. ;)• suggest table names be singular (my convention, not universally
adopted!)
I tend to use plural or singular depending on how they will be used: 'resource' will tend to be used one at a time, while 'languages' will tend to be used as a reference list, and therefore as an aggregate. A more defined naming scheme might be useful I'll admit.
I remember many years ago, that there was a big argument about systematic as distinct from meaningful names. I was programming in COBOL (names could be up to 30 characters long), I thought the argument was silly, as it depends... In fact in one COBOL program I adopted both approaches, as short systematic names are better for use in a set complicated numerical expressions and only in a self contained stretch of code, and meaningful names for variables used throughout a program.
For one stored procedure (TransactSQL) of over 3000 lines, accessing 15 tables using 7 temporary tables – I was very carefully in the systematic use of suffixes and consistent use of names.
Daniel T. Staal
---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
Never too sure what other people know, being helpful can run the risk of seemingly be patronising! I remember in one job I was given a task whee for part of it I was effectivly at the level of a trainess, and for other parts i had greater experience - a little unsettling!
Cheers,
P.S. about top posting before, I got a bit distracted by a work related call.
On 13/05/12 14:56, Daniel Staal wrote:hmm... I meant >>> linking tables with _NON_ user visible things <<<--As of May 13, 2012 2:15:36 PM +1200, Gavin Flower is alleged to have said:I am amazed at the number of times I see people specifying 'NOT NULL' and PRIMARY KEY' for the same field! Mind you, these same people could probably justifiable laugh at the daft things I do! :-)I can't comment on how to implement the functionality you want, but I
have a few comments you may (or may not!) find useful.
This of course was not the *complete* specs of the tables. I didn't feel the need to post pages of SQL for a simple question. ;) I just put in the relevant fields.• Field names need not be in double quotes.
I know. But I prefer to quote everything always, both to help them stand out in the text, and to make sure I don't have any case issues.• if a character field is Always 3 characters, then say so
It currently is always 3 characters, but I don't want to be to dogmatic about it, in case that needs to change in the future. (ISO codes have done that in the past, and I may decide a different/additional set of codes is needed at some point in the future.) Besides, there is no performance benefit in Postgres. (The opposite, actually...)• add NOT NULL where appropriate (you may well decide more fields
aught to be NOT NULL)
Generally I have, though not always I'll admit. 'Primary key' implies it, and that was the only one that actually has that restriction in the selection of fields I showed.• keeping primary and foreign key fields separate from user visible
data.
I disagree strongly here. ;) Primary keys should be whatever is suitable for the data; creating an artificial primary key (when not needed) is just storing more data and doing more work. It also gives a feeling of 'safety' which doesn't actually exist, as you can then create data that is invalid, but that fits because your table design allows it. In particular, the language table keeping the the 'code' unique and distinct is the *entire point* of the table, so there is no good reason to use anything else as the primary key. (I do have a generated ID in the resource table, though it's a much more complex generation than a simple serial. Again, I didn't feel the need to show it.)Actually, If I had thought about it a bit more, and something I would (or should!) do if I was designing the table for real, would be to use an UNIQUE qualifier for the code field.
Unless there are performance and/or data storage, or some such constraints - I prefer to linking tables with user visible things One production database I worked on had 5 tables in a chain of parent/child relationships, and each child primary key was a concatenation of a character field with the fields comprising the primary key of its parent – could be over 45 bytes in characters in length!
(probably obvious, but just in case...)
)
The current database I am designing is very complicated, but likely never to have more than a few thousand records in any table, and is likely to have many more reads (with some quite complicated queries) than writes. So I focus on trying to work in a very standardizing way, without having to worry over much about performance. Knowing my luck, my next project will be the exact opposite!
I would like to use ENUMs in postgres, but there appear to be problems f they need to be updated.
This does mean thinking through your future use-cases a bit more at the initial design stage, but if I wasn't willing to do that I wouldn't be asking this question in the first place.
(Actually, the languages table is nearly exclusively used inside the database, so you could claim it *is* separate from user visible data. I could almost use an enum there instead, but having a linked description available might be useful on occasion. I also thought that knowing it exists might be useful in solving my problem: One possible issue is that text is not the correct data type, and having a mapping table available might be useful.)
Horses for courses.• identifying primary and foreign key fields clearly
Agreed, although I only showed one of each. ;)• suggest table names be singular (my convention, not universally
adopted!)
I tend to use plural or singular depending on how they will be used: 'resource' will tend to be used one at a time, while 'languages' will tend to be used as a reference list, and therefore as an aggregate. A more defined naming scheme might be useful I'll admit.
I remember many years ago, that there was a big argument about systematic as distinct from meaningful names. I was programming in COBOL (names could be up to 30 characters long), I thought the argument was silly, as it depends... In fact in one COBOL program I adopted both approaches, as short systematic names are better for use in a set complicated numerical expressions and only in a self contained stretch of code, and meaningful names for variables used throughout a program.
For one stored procedure (TransactSQL) of over 3000 lines, accessing 15 tables using 7 temporary tables – I was very carefully in the systematic use of suffixes and consistent use of names.
Daniel T. Staal
---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
Never too sure what other people know, being helpful can run the risk of seemingly be patronising! I remember in one job I was given a task whee for part of it I was effectivly at the level of a trainess, and for other parts i had greater experience - a little unsettling!
Cheers,
P.S. about top posting before, I got a bit distracted by a work related call.
--As of May 13, 2012 3:37:22 PM +1200, Gavin Flower is alleged to have said: >> Generally I have, though not always I'll admit. 'Primary key' implies >> it, and that was the only one that actually has that restriction in the >> selection of fields I showed. > > I am amazed at the number of times I see people specifying 'NOT NULL' and > PRIMARY KEY' for the same field! Mind you, these same people could > probably justifiable laugh at the daft things I do! :-) I don't mind the belt-and-suspenders approach on this, if you want to use it. Sure, it's redundant, but it's also clearer and a decent reminder to the human reading the SQL. >>> • keeping primary and foreign key fields separate from user >>> visible data. >> >> I disagree strongly here. ;) Primary keys should be whatever is >> suitable for the data; creating an artificial primary key (when not >> needed) is just storing more data and doing more work. It also gives a >> feeling of 'safety' which doesn't actually exist, as you can then create >> data that is invalid, but that fits because your table design allows >> it. In particular, the language table keeping the the 'code' unique and >> distinct is the *entire point* of the table, so there is no good reason >> to use anything else as the primary key. (I do have a generated ID in >> the resource table, though it's a much more complex generation than a >> simple serial. Again, I didn't feel the need to show it.) > > Actually, If I had thought about it a bit more, and something I would (or > should!) do if I was designing the table for real, would be to use an > UNIQUE qualifier for the code field. > > Unless there are performance and/or data storage, or some such > constraints - I prefer to linking tables with (non)user visible things One > production database I worked on had 5 tables in a chain of parent/child > relationships, and each child primary key was a concatenation of a > character field with the fields comprising the primary key of its parent > – could be over 45 bytes in characters in length! > > The current database I am designing is very complicated, but likely never > to have more than a few thousand records in any table, and is likely to > have many more reads (with some quite complicated queries) than writes. > So I focus on trying to work in a very standardizing way, without having > to worry over much about performance. Knowing my luck, my next project > will be the exact opposite! This project has the potential to be very performance-critical, so I'm trying to operate on that assumption. ;) And again, I don't see the point of creating extra fields and data just to enforce some artificial separation between 'user-visible' and 'database' fields. There's no performance benefit, there's a maintenance *penalty* (in that your data is more complicated), and a programming penalty. (Again: your data is more complicated.) In this case, for example, doing it with a separate 'id' field would either require a hash lookup in the application this database is being created to support, or a separate lookup on nearly every write to the resources table (AKA: The second-most common operation I'm expecting), just to get the language code id. The first has obvious maintenance problems, and they both have a performance penalty. And none of this has any benefit to anyone, that I can see. So why? > I remember many years ago, that there was a big argument about systematic > as distinct from meaningful names. I was programming in COBOL (names > could be up to 30 characters long), I thought the argument was silly, as > it depends... In fact in one COBOL program I adopted both approaches, as > short systematic names are better for use in a set complicated numerical > expressions and only in a self contained stretch of code, and meaningful > names for variables used throughout a program. The only time it's not silly is when it's being decided upon as a 'company-wide standard'. Then it's tragic. ;) (Good guidelines are invaluable. Enforced standards, especially given by those who aren't doing any of the actual work, are not.) > Never too sure what other people know, being helpful can run the risk of > seemingly be patronising! I remember in one job I was given a task whee > for part of it I was effectivly at the level of a trainess, and for other > parts i had greater experience - a little unsettling! That's been most of my jobs. ;) > P.S. about top posting before, I got a bit distracted by a work related > call. No problem. Now, if I could only get you to respect the 'Reply-To:' header... (I'm also not a fan of HTML email, but I can live with that.) Daniel T. Staal --------------------------------------------------------------- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---------------------------------------------------------------
--As of May 12, 2012 8:08:42 PM -0400, PGSQL-Novice is alleged to have said: > Tables: > > Resource: > "text_element" text, > "text_search" tsvector, > "language" char varying(3) references "languages" > > languages (This is 'static' list of ISO 639-3 codes, to make sure > everything uses the same codes): > "code" character varying(3) primary key, > "description" character varying(100) > The trigger currently throws an error 'column "language" is not of > regconfig type' whenever I try to use it, and I'm looking for the best > way to solve that. The 'simple' solution of 'create text search > configuration eng ( copy = pg_catalog.english );' didn't work. Before I > start into a long exploration into why and what should be done about it, > I thought I'd seek wisdom. ;) --As for the rest, it is mine. Ok, after a bit of exploration, there are two problems with my current approach: The 'config_column_name' passed to 'tsvector_update_trigger_column' cannot be a text column. The config in the config column must be fully qualified. (So I can't use 'english', I'd have to use 'pg_catalog.english'.) Combined, these mean I it's probably not worth trying to use the column to tell both my app and Postgresql what language is being used. So, anyone have any ideas on how to write a trigger to pull the field from the language table based on the code? Daniel T. Staal --------------------------------------------------------------- This email copyright the author. Unless otherwise noted, you are expressly allowed to retransmit, quote, or otherwise use the contents for non-commercial purposes. This copyright will expire 5 years after the author's death, or in 30 years, whichever is longer, unless such a period is in excess of local copyright law. ---------------------------------------------------------------