Discussion: Mass-Data question

Mass-Data question

From: Boris Köster
Date:
Hello friends,

I have a question. I am currently planning a new project that will
collect a really large amount of data. My question is:

What should I do if the disk space is not enough? Is there a way to
distribute data over several machines and to collect it with a single
select statement when required? The stored information needs to be
analyzed.

It is an enterprise-computing project, currently in development and
still partly in planning. I want to use PostgreSQL, but how should I
handle real mass data?

Don't tell me to add disk space; whatever we use, it won't be enough.
We need more than one machine, and we need to analyze the data across
several machines, if possible with one select statement... or is there
a better idea for handling really large amounts of data? It is
important to us to have real-time analysis, so we cannot make the user
wait.

Sorry for this question, but I am struggling with this.

--
Best regards,
 Boris Köster                         mailto:koester@x-itec.de


Re: Mass-Data question

From: Gunther Schadow
Date:
Boris Köster wrote:

> Hello friends,
>
> I have a question. I am currently planning a new project that will
> collect a really large amount of data. My question is:
>
> What should I do if the disk space is not enough? Is there a way to
> distribute data over several machines and to collect it with a single
> select statement when required? The stored information needs to be
> analyzed.
>
> It is an enterprise-computing project, currently in development and
> still partly in planning. I want to use PostgreSQL, but how should I
> handle real mass data?
>
> Don't tell me to add disk space; whatever we use, it won't be enough.
> We need more than one machine, and we need to analyze the data across
> several machines, if possible with one select statement... or is there
> a better idea for handling really large amounts of data? It is
> important to us to have real-time analysis, so we cannot make the user
> wait.
>
> Sorry for this question, but I am struggling with this.


Hmm, interesting. I have similar needs. Let's hear what the gurus
have to say. But, independently of PostgreSQL, what do you want
the RDBMS to do? You probably want virtual shared disk storage,
such as a RAID system to which you can connect multiple hosts. VMS
clusters have that feature. The disks are independent of the hosts.
But then of course it's non-trivial to use multiple server hosts on
the same database storage. Oracle can do something like that (but
you pay heavy $$$).

So, what is it you want the system to do? Parallelize a single query
over multiple hosts? I wouldn't count on that being available with
PostgreSQL any time soon.

-Gunther




--
Gunther Schadow, M.D., Ph.D.                    gschadow@regenstrief.org
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org



Re: Mass-Data question

From: Curt Sampson
Date:
> Boris Köster wrote:
>
> > What should I do if the disk space is not enough? Is there a way to
> > distribute data over several machines and to collect it with a
> > single select statement when required?

> Hmm, interesting. I have similar needs.

As do I. Unfortunately, I'm not a guru. But I'll be testing out
something like this in the next few weeks if all goes well. I was
planning to do some fairly simple data partitioning. My initial
plan is to drop the data into multiple tables across multiple
servers, partitioned by date, and have a master table indicating
the names of the various tables and the date ranges they cover.
The application will then deal with determining which tables the
query will be spread across, construct and submit the appropriate
queries (eventually in parallel, if I'm getting a lot of queries
crossing multiple tables), and collate the results.
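
Very roughly, the routing side could look something like this (a
Python sketch; the server names, partition names, and date column are
all made up for the example):

    from datetime import date

    # Hypothetical contents of the "master table": which physical table
    # lives on which server, and the date range it covers.
    PARTITIONS = [
        ("server1", "log_2002_q1", date(2002, 1, 1), date(2002, 3, 31)),
        ("server2", "log_2002_q2", date(2002, 4, 1), date(2002, 6, 30)),
    ]

    def queries_for_range(start, end):
        """Yield one query per partition whose date range overlaps [start, end]."""
        for server, table, lo, hi in PARTITIONS:
            if lo <= end and start <= hi:        # the ranges overlap
                yield server, ("SELECT * FROM %s WHERE logdate "
                               "BETWEEN '%s' AND '%s' ORDER BY logdate"
                               % (table, start, end))

    for server, sql in queries_for_range(date(2002, 3, 15), date(2002, 4, 15)):
        print(server, "->", sql)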

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Mass-Data question

From: Boris Köster
Date:
Hello Gunther,

Monday, April 15, 2002, 10:53:19 PM, you wrote:

GS> Boris Köster wrote:



GS> Hmm, interesting. I have similar needs. Let's hear what the gurus
GS> have to say. But, independently of PostgreSQL, what do you want
GS> the RDBMS to do? You probably want virtual shared disk storage,
GS> such as a RAID system to which you can connect multiple hosts. VMS
GS> clusters have that feature. The disks are independent of the hosts.

Yes, that sounds interesting as one option.

GS> But then of course it's non-trivial to use multiple server hosts on
GS> the same database storage. Oracle can do something like that (but
GS> you pay heavy $$$).

Hm yes.

GS> So, what is it you want the system to do? Parallelize a single query
GS> over multiple hosts? I wouldn't count on that being available with
GS> PostgreSQL any time soon.

I am collecting up to 6,000-7,000 records every 1-3 seconds, 24 hours
a day, 365 days a year. The hard drives may not be fast enough to
collect the data; that is really heavy. All of this data must be
available for real-time analysis whenever a customer requires it.
A parallelized query is an interesting idea.
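
(A rough back-of-the-envelope, assuming each record is on the order of
100 bytes: 2,000-7,000 records per second is only about 0.2-0.7 MB/s
of raw data, but it accumulates to roughly 200-600 million rows and
20-60 GB per day.)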

GS> -Gunther



--
Best regards,
 Boris                            mailto:koester@x-itec.de


Re: Mass-Data question

From: Boris Köster
Date:
Hello Curt,

Tuesday, April 16, 2002, 5:25:25 AM, you wrote:

>> Hmm, interesting. I have similar needs.

CS> As do I. Unfortunately, I'm not a guru. But I'll be testing out
CS> something like this in the next few weeks if all goes well. I was
CS> planning to do some fairly simple data partitioning. My initial
CS> plan is to drop the data into multiple tables across multiple
CS> servers, partitioned by date, and have a master table indicating
CS> the names of the various tables and the date ranges they cover.

Aha, interesting.

CS> The application will then deal with determining which tables the
CS> query will be spread across, construct and submit the appropriate
CS> queries (eventually in parallel, if I'm getting a lot of queries
CS> crossing multiple tables), and collate the results.

Parallel querying sounds very interesting to me. My current plan was
to do parallel writing, because the hard drives are not fast enough to
collect all the data; your idea of parallel reading is very
interesting.

I have written a C++ library to access MySQL and PostgreSQL databases.
My OS is FreeBSD, but it should work with other OSes too, I think.

Doing parallelized reading/writing does not sound very complex in
itself, but getting the results back in the right order is a problem.
Maybe I could collect data in parallel from several machines via
threads, writing the content to a (new) machine (?) as long as the
number of rows stays below some limit x, to avoid overrunning the
disks. The advantage would be that, if this works, the feature could
be used with both PostgreSQL and MySQL.

----------   ----------
rdbms1       rdbms[n]
----------   ----------
    |             |
    |             |
    ---------------
           |
           |distributed writing for logfiles or similar into databases
           |
           |         ----------
           |-------- rdbms-tmp  temporary db-server (?)
           |         ---------- to analyze the data for parallelized
           |              |      reading like a temporary space... ?
           |              |
           |              |---- > Customer-Access for analyzing
    --------------
     Machine with Memory-Queue implementation for fast reading/writing
     "Collector for writing and distributing the content"
    --------------
           |
           |
        Internet
----------   ----------
client1      client[n]
----------   ----------
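
The collector in the middle of that picture might, very roughly, look
like this (a Python sketch, not the real C++ library; the batch size,
row format, and the actual INSERT/COPY step are placeholders):

    import queue
    import threading

    BATCH = 500                    # flush every 500 rows (made-up number)
    incoming = queue.Queue()       # the in-memory queue the collector feeds

    def writer(name):
        """One writer per back-end database; drains the shared queue in batches."""
        batch = []
        while True:
            row = incoming.get()
            if row is None:        # shutdown marker
                break
            batch.append(row)
            if len(batch) >= BATCH:
                # A real implementation would INSERT (or COPY) the batch
                # into this writer's database (rdbms1 ... rdbms[n]) here.
                print("%s: flushing %d rows" % (name, len(batch)))
                batch = []
        if batch:
            print("%s: flushing %d rows" % (name, len(batch)))

    writers = [threading.Thread(target=writer, args=("rdbms%d" % i,))
               for i in (1, 2)]
    for w in writers:
        w.start()

    # Simulated input: 2,000 rows arriving from the clients.
    for i in range(2000):
        incoming.put(("2002-04-16", i))
    for _ in writers:
        incoming.put(None)         # one shutdown marker per writer
    for w in writers:
        w.join()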

What do the gurus think about this? I need this functionality within
the next 1-2 months, and I could try to code it as a C++ library. If
the concept is not bogus, the only question left is whether I should
give out the source for free or not; this is not a solution for a home
user *gg* I have no idea.

--
Best regards,
 Boris Köster                           mailto:koester@x-itec.de


Re: Mass-Data question

From: Alvaro Herrera
Date:
On Tue, 16 Apr 2002, Boris Köster wrote:

> Doing parallelized reading/writing does not sound very complex in
> itself, but getting the results back in the right order is a problem.
> Maybe I could collect data in parallel from several machines via
> threads, writing the content to a (new) machine (?) as long as the
> number of rows stays below some limit x, to avoid overrunning the
> disks. The advantage would be that, if this works, the feature could
> be used with both PostgreSQL and MySQL.

Maybe you can use dblink to retrieve the results from the various
"parallel servers" into one central server and then merge them (UNION,
maybe?). That would work for simple SELECTs, but when you have a couple
of triggers you start getting into trouble.
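
For the SELECT side, something along these lines (host names, tables,
and columns invented for the example; it assumes the dblink contrib
module on the central server and a Python driver such as psycopg2 on
the client):

    import psycopg2

    central = psycopg2.connect("dbname=analysis")   # the central server
    cur = central.cursor()

    # Each dblink() call runs a query on one of the "parallel servers";
    # UNION ALL glues the partial results together on the central server.
    cur.execute("""
        SELECT * FROM dblink('host=rdbms1 dbname=logs',
                             'SELECT logdate, hits FROM log_2002_q1')
                 AS t1(logdate date, hits int)
        UNION ALL
        SELECT * FROM dblink('host=rdbms2 dbname=logs',
                             'SELECT logdate, hits FROM log_2002_q2')
                 AS t2(logdate date, hits int)
    """)
    print(cur.fetchall())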

Obviously you would have to split UPDATEs and INSERTs appropriately.

Who knows, maybe you can even get it to actually work.

--
Alvaro Herrera (<alvherre[@]atentus.com>)
"On the other flipper, one wrong move and we're Fatal Exceptions"
(T.U.X.: Term Unit X  - http://www.thelinuxreview.com/TUX/)


Re: Mass-Data question

From: Curt Sampson
Date:
On Tue, 16 Apr 2002, Boris Köster wrote:

> Parallel querying sounds very interesting to me. My current plan was
> to do parallel writing, because the hard drives are not fast enough to
> collect all the data....

If it's really the hard drives that are not fast enough, you've
got a serious problem. The raw write speed of a hard drive is much,
much faster than the rate at which Postgres writes data.

But even so, it sounds like you have basically the same problem as
I do: how to get loads of data into the system really quickly.

> Doing parallelized reading/writing does not sound very complex in
> itself, but getting the results back in the right order is a
> problem.

I don't see why. Just run the queries in parallel and merge the
results as they come in. Just make sure you use the same ORDER BY
on all the queries so you can do a merge sort.
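
A minimal sketch of that merge step (in Python; in reality the two
lists would be cursors streaming rows from the individual servers):

    import heapq

    # Partial results from each server, each already sorted by the same
    # ORDER BY key (here: the first column).
    rows_server1 = [(1, 'a'), (4, 'd'), (5, 'e')]
    rows_server2 = [(2, 'b'), (3, 'c'), (6, 'f')]

    # Because every input is sorted on the same key, a merge gives the
    # globally ordered result without re-sorting anything.
    for row in heapq.merge(rows_server1, rows_server2):
        print(row)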

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC