Re: Replacing Apache Solr with Postgre Full Text Search?

From Artjom Simon
Subject Re: Replacing Apache Solr with Postgre Full Text Search?
Msg-id 9d736364-0156-0a4d-d00f-bd5e8520ebb0@gmail.com
In reply to Re: Replacing Apache Solr with Postgre Full Text Search?  (J2eeInside J2eeInside <j2eeinside@gmail.com>)
List pgsql-general
On 26.03.20 17:05, J2eeInside J2eeInside wrote:
 >> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, is
 >> there any constraints in Full Text search regarding those file types?
 >
 > - Can you recommend those tools you mention above/any useful resource on how to do that?


For PDFs, I know of at least two tools that can extract text. Try 
Ghostscript:

     gs -sDEVICE=txtwrite -o output.txt input.pdf


or a tool called 'pdftotext':

     pdftotext [options] [PDF-file [text-file]]

Both give slightly different results, mainly in terms of indentation and 
layout of the generated plain text, and how they deal with tabular layouts.

Note that PDF is a container format that can embed virtually anything: 
text, images, Flash videos, ...
You'll get good results if the PDF contains actual text. If you're 
dealing with embedded images such as scanned documents, you'll probably 
need an OCR pass with a tool like 'tesseract' to extract the recognized 
text.
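
To make the OCR fallback concrete, here is a rough sketch, not a tested pipeline. It assumes 'pdftotext' and 'pdftoppm' (both from poppler-utils) and 'tesseract' are installed. The function names 'has_text' and 'extract_pdf' and the threshold of 20 non-whitespace characters are my own placeholders, so tune them for your documents.

```shell
# Succeeds if the file holds at least N non-whitespace characters.
# N defaults to 20, an arbitrary threshold chosen for illustration.
has_text() {
    count=$(( $(tr -d '[:space:]' < "$1" | wc -c) ))
    [ "$count" -ge "${2:-20}" ]
}

# Extract text from a PDF; fall back to OCR when no usable text comes out.
# Usage: extract_pdf input.pdf output.txt
extract_pdf() {
    pdf=$1
    out=$2
    pdftotext -layout "$pdf" "$out"
    if has_text "$out"; then
        return 0
    fi
    # Probably a scanned document: render pages to PNG, then OCR each page.
    tmp=$(mktemp -d) || return 1
    pdftoppm -r 300 -png "$pdf" "$tmp/page"
    : > "$out"
    for img in "$tmp"/page*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" stdout >> "$out"
    done
    rm -rf "$tmp"
}
```

Something like 'extract_pdf scan.pdf scan.txt' would then only pay the (slow) OCR cost for files where plain extraction returns next to nothing.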

You'll need similar tools to extract the text from DOC and HTML files 
since you're only interested in their plain text representation, not the 
metadata and markup.
Finding converters from HTML/DOC to plain text shouldn't be too hard. 
You could also try to find a commercial document conversion vendor, or 
try to convert HTML and DOC both to PDF so you'll only have to deal with 
PDF-to-text extraction in the end.
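
As one possible starting point, the sketch below picks a converter by file extension. The tool choices are my assumptions, not the only options: 'pandoc' for HTML and .docx, 'antiword' for legacy .doc, and 'pdftotext' for PDF. 'converter_for' and 'to_text' are hypothetical helper names.

```shell
# Print a label for the converter to use, based on the file extension.
# The mapping is an assumption; swap in whatever tools you have installed.
converter_for() {
    case $(printf '%s' "$1" | tr '[:upper:]' '[:lower:]') in
        *.pdf)        echo pdftotext ;;
        *.html|*.htm) echo pandoc-html ;;
        *.docx)       echo pandoc-docx ;;
        *.doc)        echo antiword ;;
        *)            echo unknown ;;
    esac
}

# Write the plain-text representation of a document to stdout.
to_text() {
    case $(converter_for "$1") in
        pdftotext)   pdftotext -layout "$1" - ;;
        pandoc-html) pandoc -f html -t plain "$1" ;;
        pandoc-docx) pandoc -f docx -t plain "$1" ;;
        antiword)    antiword "$1" ;;
        *)           echo "no converter for $1" >&2; return 1 ;;
    esac
}
```

With that in place, 'to_text report.docx > report.txt' gives you one plain-text file per document, whatever the input format.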

Good luck!

Artjom


-- 
Artjom Simon


