Обсуждение: Replacing Apache Solr with Postgre Full Text Search?

Поиск

Список

Период

Сортировка

Replacing Apache Solr with Postgre Full Text Search?

От

J2eeInside J2eeInside

Дата:

25 марта 2020 г., 15:36:38

Hi all,

I hope someone can help/suggest:
I'm currently maintaining a project that uses Apache Solr /Lucene. To be honest, I wold like to replace Solr with Postgre Full Text Search. However, there is a huge amount of documents involved - arround 200GB. Wondering, can Postgre handle this efficiently?
Does anyone have specific experience, and what should the infrastructure look like?

P.S. Not to be confused, the Sol works just fine, i just wanted to eliminate one component from the whole system (if Full text search can replace Solr at all)

Re: Replacing Apache Solr with Postgre Full Text Search?

От

Mike Rylander

Дата:

25 марта 2020 г., 17:36:44

On Wed, Mar 25, 2020 at 8:37 AM J2eeInside J2eeInside
<j2eeinside@gmail.com> wrote:
>
> Hi all,
>
> I hope someone  can help/suggest:
> I'm currently maintaining a project that uses Apache Solr /Lucene. To be honest, I wold like to replace Solr with
PostgreFull Text Search. However, there is a huge amount of documents involved - arround 200GB. Wondering, can Postgre
handlethis efficiently? 
> Does anyone have specific experience, and what should the infrastructure look like?
>
> P.S. Not to be confused, the Sol works just fine, i just wanted to eliminate one component from the whole system (if
Fulltext search can replace Solr at all) 

I'm one of the core developers (and the primary developer of the
search subsystem) for the Evergreen ILS [1] (integrated library system
-- think book library, not software library).  We've been using PGs
full-text indexing infrastructure since day one, and I can say it is
definitely capable of handling pretty much anything you can throw at
it.

Our indexing requirements are very complex and need to be very
configurable, and need to include a lot more than just "search and
rank a text column," so we've had to build a ton of infrastructure
around record (document) ingest, searching/filtering, linking, and
display.  If your indexing and search requirements are stable,
specific, and well-understood it should be straight forward,
especially if you don't have to take into account non-document
attributes like physical location, availability, and arbitrary
real-time visibility rules like Evergreen does.

As for scale, it's more about document count than total size.  There
are Evergreen libraries with several million records to search, and
with proper hardware and tuning everything works well.  Our main
performance issue has to do with all of the stuff outside the records
(documents) themselves that have to be taken into account during
search.  The core full-text search part of our queries is extremely
performant, and has only gotten better over the years.

[1] http://evergreen-ils.org

HTH,
--
Mike Rylander
 | Executive Director
 | Equinox Open Library Initiative
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker@equinoxinitiative.org
 | web:  http://equinoxinitiative.org

Re: Replacing Apache Solr with Postgre Full Text Search?

От

J2eeInside J2eeInside

Дата:

26 марта 2020 г., 11:03:08

Hi Mike, and thanks for valuable answer!

In short, you think a PG Full Text Search can do the same as Apache Solr?

P.S. I need to index .pdf, .html and MS Word .doc/.docx files, is there any constraints in Ful Text search regarding those file types?

On Wed, Mar 25, 2020 at 3:36 PM Mike Rylander <mrylander@gmail.com> wrote:

On Wed, Mar 25, 2020 at 8:37 AM J2eeInside J2eeInside
<j2eeinside@gmail.com> wrote:
>
> Hi all,
>
> I hope someone can help/suggest:
> I'm currently maintaining a project that uses Apache Solr /Lucene. To be honest, I wold like to replace Solr with Postgre Full Text Search. However, there is a huge amount of documents involved - arround 200GB. Wondering, can Postgre handle this efficiently?
> Does anyone have specific experience, and what should the infrastructure look like?
>
> P.S. Not to be confused, the Sol works just fine, i just wanted to eliminate one component from the whole system (if Full text search can replace Solr at all)

I'm one of the core developers (and the primary developer of the
search subsystem) for the Evergreen ILS [1] (integrated library system
-- think book library, not software library). We've been using PGs
full-text indexing infrastructure since day one, and I can say it is
definitely capable of handling pretty much anything you can throw at
it.

Our indexing requirements are very complex and need to be very
configurable, and need to include a lot more than just "search and
rank a text column," so we've had to build a ton of infrastructure
around record (document) ingest, searching/filtering, linking, and
display. If your indexing and search requirements are stable,
specific, and well-understood it should be straight forward,
especially if you don't have to take into account non-document
attributes like physical location, availability, and arbitrary
real-time visibility rules like Evergreen does.

As for scale, it's more about document count than total size. There
are Evergreen libraries with several million records to search, and
with proper hardware and tuning everything works well. Our main
performance issue has to do with all of the stuff outside the records
(documents) themselves that have to be taken into account during
search. The core full-text search part of our queries is extremely
performant, and has only gotten better over the years.

[1] http://evergreen-ils.org

HTH,
--
Mike Rylander
| Executive Director
| Equinox Open Library Initiative
| phone: 1-877-OPEN-ILS (673-6457)
| email: miker@equinoxinitiative.org
| web: http://equinoxinitiative.org

Re: Replacing Apache Solr with Postgre Full Text Search?

От

Mike Rylander

Дата:

26 марта 2020 г., 18:18:16

On Thu, Mar 26, 2020 at 4:03 AM J2eeInside J2eeInside
<j2eeinside@gmail.com> wrote:
>
> Hi Mike, and thanks for valuable answer!
> In short, you think a PG Full Text Search can do the same as Apache Solr?
>

Can it?  I mean, it does today.  Whether it would for you depends on
your needs and how much effort you can afford to put into the stuff
that is /not/ the full text engine itself, like document normalizers
and search UIs.

There are trade-offs to be made when choosing any tool.  Solr is
great, and so is Lucene (Solr's heart), and so is Elastic Search.  For
that matter, Zebra is awesome for full text indexing, too.  Those all
make indexing a pile of documents easy.  But, none of those are great
as an authoritative data store, so for instance there will necessarily
be drift between your data and the Solr index requiring a full
refresh.  It's also hard to integrate non-document filtering
requirements like I have in my use case.  Both of those are important
to my use case, so PG's full text is my preference.

Solr also didn't exist (publicly) in 2004 when we started building Evergreen. :)

> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, is there any constraints in Ful Text search regarding
thosefile types? 
>

It can't handle those without some help -- it supports exactly text --
but you can extract the text using other tools.

--
Mike Rylander
 | Executive Director
 | Equinox Open Library Initiative
 | phone:  1-877-OPEN-ILS (673-6457)
 | email:  miker@equinoxinitiative.org
 | web:  http://equinoxinitiative.org

>
> On Wed, Mar 25, 2020 at 3:36 PM Mike Rylander <mrylander@gmail.com> wrote:
>>
>> On Wed, Mar 25, 2020 at 8:37 AM J2eeInside J2eeInside
>> <j2eeinside@gmail.com> wrote:
>> >
>> > Hi all,
>> >
>> > I hope someone  can help/suggest:
>> > I'm currently maintaining a project that uses Apache Solr /Lucene. To be honest, I wold like to replace Solr with
PostgreFull Text Search. However, there is a huge amount of documents involved - arround 200GB. Wondering, can Postgre
handlethis efficiently? 
>> > Does anyone have specific experience, and what should the infrastructure look like?
>> >
>> > P.S. Not to be confused, the Sol works just fine, i just wanted to eliminate one component from the whole system
(ifFull text search can replace Solr at all) 
>>
>> I'm one of the core developers (and the primary developer of the
>> search subsystem) for the Evergreen ILS [1] (integrated library system
>> -- think book library, not software library).  We've been using PGs
>> full-text indexing infrastructure since day one, and I can say it is
>> definitely capable of handling pretty much anything you can throw at
>> it.
>>
>> Our indexing requirements are very complex and need to be very
>> configurable, and need to include a lot more than just "search and
>> rank a text column," so we've had to build a ton of infrastructure
>> around record (document) ingest, searching/filtering, linking, and
>> display.  If your indexing and search requirements are stable,
>> specific, and well-understood it should be straight forward,
>> especially if you don't have to take into account non-document
>> attributes like physical location, availability, and arbitrary
>> real-time visibility rules like Evergreen does.
>>
>> As for scale, it's more about document count than total size.  There
>> are Evergreen libraries with several million records to search, and
>> with proper hardware and tuning everything works well.  Our main
>> performance issue has to do with all of the stuff outside the records
>> (documents) themselves that have to be taken into account during
>> search.  The core full-text search part of our queries is extremely
>> performant, and has only gotten better over the years.
>>
>> [1] http://evergreen-ils.org
>>
>> HTH,
>> --
>> Mike Rylander
>>  | Executive Director
>>  | Equinox Open Library Initiative
>>  | phone:  1-877-OPEN-ILS (673-6457)
>>  | email:  miker@equinoxinitiative.org
>>  | web:  http://equinoxinitiative.org

Sv: Replacing Apache Solr with Postgre Full Text Search?

От

Andreas Joseph Krogh

Дата:

26 марта 2020 г., 18:32:58

På onsdag 25. mars 2020 kl. 13:36:38, skrev J2eeInside J2eeInside <j2eeinside@gmail.com>:

Hi all,

I hope someone can help/suggest:
I'm currently maintaining a project that uses Apache Solr /Lucene. To be honest, I wold like to replace Solr with Postgre Full Text Search. However, there is a huge amount of documents involved - arround 200GB. Wondering, can Postgre handle this efficiently?
Does anyone have specific experience, and what should the infrastructure look like?

P.S. Not to be confused, the Sol works just fine, i just wanted to eliminate one component from the whole system (if Full text search can replace Solr at all)

I see you've gotten some answers but wanted to chime in...

We seach in ~15mill. emails and ~10 mill documents (extracted text from Word/PDF etc. using Java-tools), and use PG and FTS (gin, not rum) for the exact same reasons as Evergreen (it seems). We have to mix FTS with domain-specific logic/filtering and that is based on relational data in the database. I don't see how we could have done that using an external search-engine. Maybe it's easy, I don't have any experience with it.

Andreas Joseph Krogh

CTO / Partner - Visena AS

Mobile: +47 909 56 963

andreas@visena.com

www.visena.com

Вложения

msg-1561587-3.png

Re: Replacing Apache Solr with Postgre Full Text Search?

От

J2eeInside J2eeInside

Дата:

26 марта 2020 г., 19:05:27

Thanks again.

For the end, the finally question:

On Thu, Mar 26, 2020 at 4:18 PM Mike Rylander <mrylander@gmail.com> wrote:

On Thu, Mar 26, 2020 at 4:03 AM J2eeInside J2eeInside
<j2eeinside@gmail.com> wrote:
>

> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, is there any constraints in Ful Text search regarding those file types?
>

It can't handle those without some help -- it supports exactly text --
but you can extract the text using other tools.

- Can you recommend those tools you mention above/any useful resource on how to do that?

Re: Replacing Apache Solr with Postgre Full Text Search?

От

J2eeInside J2eeInside

Дата:

26 марта 2020 г., 19:06:18

You are wellcome Andreas, and thanks for useful answer ;-)

On Thu, Mar 26, 2020 at 4:33 PM Andreas Joseph Krogh <andreas@visena.com> wrote:

På onsdag 25. mars 2020 kl. 13:36:38, skrev J2eeInside J2eeInside <j2eeinside@gmail.com>:
Hi all,

I hope someone can help/suggest:
I'm currently maintaining a project that uses Apache Solr /Lucene. To be honest, I wold like to replace Solr with Postgre Full Text Search. However, there is a huge amount of documents involved - arround 200GB. Wondering, can Postgre handle this efficiently?
Does anyone have specific experience, and what should the infrastructure look like?

P.S. Not to be confused, the Sol works just fine, i just wanted to eliminate one component from the whole system (if Full text search can replace Solr at all)

I see you've gotten some answers but wanted to chime in...
We seach in ~15mill. emails and ~10 mill documents (extracted text from Word/PDF etc. using Java-tools), and use PG and FTS (gin, not rum) for the exact same reasons as Evergreen (it seems). We have to mix FTS with domain-specific logic/filtering and that is based on relational data in the database. I don't see how we could have done that using an external search-engine. Maybe it's easy, I don't have any experience with it.

--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com
www.visena.com

Вложения

noname

Re: Replacing Apache Solr with Postgre Full Text Search?

От

Artjom Simon

Дата:

26 марта 2020 г., 21:19:42

On 26.03.20 17:05, J2eeInside J2eeInside wrote:
 >> P.S. I need to index .pdf, .html and MS Word .doc/.docx files, is
 >> there any constraints in Ful Text search regarding those file types?
 >
 > - Can you recommend those tools you mention above/any useful resource 
on how to do that?

For PDFs, I know of at least two tools that can extract text. Try 
Ghostscript:

     gs -sDEVICE=txtwrite -o output.txt input.pdf

or a tool called 'pdftotext':

     pdftotext [options] [PDF-file [text-file]]

Both give slightly different results, mainly in terms of indentation and 
layout of the generated plain text, and how they deal with tabular layouts.

Note that PDF is a container format that can embed virtually anything: 
text, images, flash videos, ...
You'll get good results if the PDF input is plain text. If you're 
dealing with embedded images like scanned documents, you'll probably 
need a OCR pass with tools like 'tesseract' to extract the recognized text.

You'll need similar tools to extract the text from DOC and HTML files 
since you're only interested in their plain text representation, not the 
meta data and markup.
Finding converters from HTML/DOC to plain text shouldn't be too hard. 
You could also try to find a commercial document conversion vendor, or 
try to convert HTML and DOC both to PDF so you'll only have to deal with 
PDF-to-text extraction in the end.

Good luck!

Artjom

-- 
Artjom Simon

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Replacing Apache Solr with Postgre Full Text Search?

Вложения

Вложения