Re: [Fwd: PostgreSQL to index MEDLINE]
От | Oleg Bartunov |
---|---|
Тема | Re: [Fwd: PostgreSQL to index MEDLINE] |
Дата | |
Msg-id | Pine.GSO.4.55.0305271035460.9108@ra.sai.msu.su обсуждение исходный текст |
Ответ на | [Fwd: PostgreSQL to index MEDLINE] (Justin Clift <justin@postgresql.org>) |
Список | pgsql-general |
Olaf, what's the number of unique words (you want to be searched) in your collection ? You may try contrib/tsearch V2 from http://www.sai.msu.su/~megera/postgres/gist/ May be I miss something but it looks like rather ordinary search system. Oleg On Tue, 27 May 2003, Justin Clift wrote: > Hi everyone, > > Does anyone feel like assisting Olaf here? > > Regards and best wishes, > > Justin Clift > > > -------- Original Message -------- > Subject: PostgreSQL to index MEDLINE > Date: Wed, 21 May 2003 11:47:39 +0200 > From: Olaf Bininda-Emonds <Olaf.Bininda@tz.agrar.tu-muenchen.de> > To: justin@postgresql.org > > Hi, > > I have a question concerning the suitability of PostgreSQL to yield a > word index of MEDLINE, an XML-formatted database of over 12 000 000 > articles from the biomedical literature. A computer scientist friend > suggested that PostgreSQL might be the answer that we require. > > Briefly, our project is as follows: > > We would like to establish word co-occurences within each MEDLINE > entry, with one word representing a gene and the other word > representing one or more user-input keywords of functional traits. As > the MEDLINE database is over 30 GB in size (and split into 396 separate > files), we need an efficient, indexed way to search it. > > Our specific needs are as follows: > > 1) An index of all the words in the fields "title", "abstract" (not > always present), and "medical subject headings" such that I can search > the index to reveal which article (represented by the MedlineID field) > contains a desired word. This will have two purposes: > > 1a) To build a subindex detailing which human genes are being referred > to in an article. This is slightly complicated, in part because of the > ever-changing nomenclature of genes. Officially, a gene is known by an > alphanumeric symbol (of up to eight characters; e.g., BRCA1) and a name > of one or more words ("breast cancer 1, early onset"). Searching for > the symbol itself is insufficient, either because the symbol has not > been used (e.g., older references) or can have other, common meanings > (e.g., the symbol for "pyruvate carboxylase" is PC). Therefore, > determining whether a gene is being mentioned in a given article > requires a combination of the gene symbol being present and/or a > sufficient number of the words in its name ("sufficient" being > determined by the length of the gene symbol). I have a perl script that > can determine this; however, it currently searches through all 396 > files on a pattern matching basis. Using an index would obviously be > that more efficient, allowing me to query for a lists of articles that > contain for a given gene, the gene symbol and each of the words in the > gene name. The intersection of these lists would then be the articles > held to contain a reference to that gene. > > 1b) To enable dynamic searches of the index for user-input keywords of > functional traits. Searching for phrases ("lipid metabolism") would be > ideal, but not necessary. > > In both cases, the matches would be for exact words (i.e., no stem > searches). > > 2) The program should run in a UNIX environment (as implemented in Mac > OS X) as I will be writing my own (hideously-inefficient) perl script > to access it for queries. Generally, the protocol will be as follows: > > a) input gene-article subindex into memory > b) obtain list of user-input keywords (and phrases, which will be split > into individual words if phrases cannot be searched for) > c) query index for MEDLINE articles containing each keyword in turn; > articles will be scored according to the number of keywords they > contain (i.e., an article with two keywords is scored twice as heavily > as an article with only one keyword) > d) for each article in the list in (c), determine whether each of ca. > 16000 genes is also mentioned in that article (via subindex); scores > for genes will be positively or negatively incremented according to > whether or not they are present > e) for each gene, determine its association score (= number of articles > in which it co-occurs with at least one keyword - number of articles > containing a keyword but not that gene). > > Therefore, the index should be written to disk as it will be accessed > repeatedly for each keyword. > > --- > > I would appreciate your advice as to whether PostgreSQL is suitable for > this task and, if so, how best to implement it and set up the word > index. I have attached a sample of the MEDLINE database for you. > > With thanks in advance for your help! > > Best, > > Olaf > > ------------------------------------------------------------------------ > ----------------- > Olaf Bininda-Emonds > Lehrstuhl fЭr Tierzucht > Technical University of Munich > Alte Akademie 12 > 85354 Freising-Weihenstephan > Germany > > Phone: +49 (0)8161 713741 > Fax: +49 (0)8161 713107 > e-mail: Olaf.Bininda@tierzucht.tum.de > WWW: http://www.tierzucht.tum.de/Bininda-Emonds/ > > > Regards, Oleg _____________________________________________________________ Oleg Bartunov, sci.researcher, hostmaster of AstroNet, Sternberg Astronomical Institute, Moscow University (Russia) Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(095)939-16-83, +007(095)939-23-83
В списке pgsql-general по дате отправления: