Discussion: Using Postgres to store high-volume streams of sensor readings
Hello all!

I would like to ask some advice about the following problem (related to the Dehems project: http://www.dehems.eu/ ):

* there are some clients (the clients are in fact households);
* each device has a number of sensors (about 10), and not all the clients have the same sensors; also sensors might appear and disappear dynamically (the sensors are appliances);
* for each device and each sensor a reading is produced about every 6 seconds (the values could be power consumptions);
* I would like to store the following data: (client, sensor, timestamp, value);
* the usual queries are:
    * for a given client (and sensor), and time interval, I need the min, max, and avg of the values;
    * for a given time interval (and sensor), I need the min, max, and avg of the values;
    * other statistics.

Currently I'm benchmarking the following storage solutions for this:

* Hypertable (http://www.hypertable.org/) -- which has a good insert rate (about 250k inserts / s), but a slow read rate (about 150k reads / s); (the aggregates are computed manually, as Hypertable supports no queries other than scanning; in fact min and max are easy, being the first / last key in the ordered set, but avg must be done by a sequential scan;)
* BerkeleyDB -- quite an OK insert rate (about 50k inserts / s), but a fabulous read rate (about 2M reads / s); (the same issue with the aggregates;)
* Postgres -- which behaves quite poorly (see below)...
* MySQL -- next to be tested.

So what can I do / how could I optimize the use of Postgres for this usage?
(I'm aware that there could be optimizations for this problem (like computing the aggregates in memory and storing only these aggregates every 10 minutes, or at some other interval), but I want to have the full history (for data mining tasks, for example).)

I would also like to present how the Postgres benchmark was implemented:

* test machine: Linux (Ubuntu 8.04 x64), IBM x3750, 2 x 500Gb SCSI, RAID 0, 8Gb RAM, 2 x Intel Xeon 5160 x 2 cores (3GHz, 4Mb L2);
* Postgres version: 8.3.3;
* database schema:

> create table sds_benchmark_data (
>     client int4 not null,
>     sensor int4 not null,
>     timestamp int4 not null,
>     value int4
> );
>
> alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>
> create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor);

* postgresql.conf (the default values; I will list only what has been changed):

> max_connections = 20
> shared_buffers = 24MB
> work_mem = 64MB
> maintenance_work_mem = 256MB
> fsync = off
> checkpoint_segments = 32
> effective_cache_size = 1024MB

* inserts are done like this:
    * generated 100 million readings using the following rule:
        * client is randomly chosen between 0 and 10 thousand;
        * sensor is randomly chosen between 0 and 10;
        * the timestamp is always increasing by one;
    * the inserts are done in batches of 500 thousand (I've also tried 5, 25, 50 and 100 thousand without big impact);
    * the batch inserts are done through COPY sds_benchmark_data FROM STDIN via libpq (using UNIX (local) sockets).

What have I observed / tried:

* I've tested without the primary key and the index, and the results were the best for inserts (600k inserts / s), but the reads worked extremely slowly (due to the lack of indexing);
* with only the index (or only the primary key) the insert rate is good at the start (for the first 2 million readings), but then drops to about 200 inserts / s.

So could someone point out where I'm wrong, or what I can do to optimize Postgres for this particular task?

Thanks for your help,
Ciprian Dorin Craciun.

P.S.: I want to publish the benchmark results after they are done, and I want to squeeze as much power out of Postgres as possible.
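For concreteness, the two "usual queries" described above are plain SQL aggregates over the (client, sensor, timestamp, value) table; a sketch against the schema shown in the post (the client/sensor ids and interval bounds are placeholder values):

```sql
-- min/max/avg for a given client and sensor over a time interval
select min(value), max(value), avg(value)
  from sds_benchmark_data
 where client = 42 and sensor = 3
   and timestamp between 1000000 and 2000000;

-- min/max/avg for a given sensor over a time interval, across all clients
select min(value), max(value), avg(value)
  from sds_benchmark_data
 where sensor = 3
   and timestamp between 1000000 and 2000000;
```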
On Fri, Nov 21, 2008 at 02:50:45PM +0200, Ciprian Dorin Craciun wrote:
> [the full problem description, quoted in the original message, snipped]

For this case I would use RRDTool: http://oss.oetiker.ch/rrdtool/

Regards,
Gerhard
On Fri, Nov 21, 2008 at 2:55 PM, Gerhard Heift <ml-postgresql-20081012-3518@gheift.de> wrote:
> [...]
> For this case I would use RRDTool: http://oss.oetiker.ch/rrdtool/

Hi Gerhard, I know about RRDTool, but it has some limitations:

* I must know in advance the number of sensors;
* I must create a file for each client (and what if I have 10 thousand clients?);
* I have a limited amount of history;
* (I'm not sure about this one, but I think) I must insert each data point by executing a command;
* and also I cannot replicate (distribute) it easily.

Or have you used RRDTool in a context similar to mine? Do you have some benchmarks?

Ciprian.
you'll have to provide us with some sort of test-case to get some answers, please. (set of scripts, queries, etc).
On Fri, Nov 21, 2008 at 3:18 PM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> you'll have to provide us with some sort of test-case to get some answers,
> please. (set of scripts, queries, etc).

Below is the content of my original post. Inside I mention exactly the way the benchmark was conducted. In short, the data is inserted using COPY sds_benchmark_data FROM STDIN, in batches of 500 thousand data points. I'll also paste the important part here:

> * Postgres version: 8.3.3;
>
> * database schema:
>> create table sds_benchmark_data (
>>     client int4 not null,
>>     sensor int4 not null,
>>     timestamp int4 not null,
>>     value int4
>> );
>>
>> alter table sds_benchmark_data add primary key (client, sensor, timestamp);
>>
>> create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor);
>
> * postgresql.conf (the default values; I will list only what has been changed):
>> max_connections = 20
>> shared_buffers = 24MB
>> work_mem = 64MB
>> maintenance_work_mem = 256MB
>> fsync = off
>> checkpoint_segments = 32
>> effective_cache_size = 1024MB
>
> * inserts are done like this:
>     * generated 100 million readings using the following rule:
>         * client is randomly chosen between 0 and 10 thousand;
>         * sensor is randomly chosen between 0 and 10;
>         * the timestamp is always increasing by one;
>     * the inserts are done in batches of 500 thousand (I've also tried 5, 25, 50 and 100 thousand without big impact);
>     * the batch inserts are done through COPY sds_benchmark_data FROM STDIN through libpq (by using UNIX (local) sockets);

Unfortunately I don't know what more information to give...

Thanks,
Ciprian Dorin Craciun.

On Fri, Nov 21, 2008 at 2:50 PM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
> [the original post, quoted in full, snipped]
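The data generation rule above (random client and sensor, monotonically increasing timestamp, tab-separated COPY text format) amounts to something like the following sketch. The value range is an assumption, since the post doesn't specify the readings' distribution; feeding the result to libpq's COPY ... FROM STDIN is left out:

```python
import random

def generate_copy_batch(start_ts, batch_size=500_000, clients=10_000, sensors=10):
    """Produce one COPY-text batch: tab-separated (client, sensor,
    timestamp, value) lines with an ever-increasing timestamp, matching
    the benchmark's generation rule."""
    lines = []
    for i in range(batch_size):
        client = random.randrange(clients)   # 0 .. 10 thousand
        sensor = random.randrange(sensors)   # 0 .. 10
        ts = start_ts + i                    # timestamp always increases by one
        value = random.randrange(1000)       # stand-in power reading (assumed range)
        lines.append(f"{client}\t{sensor}\t{ts}\t{value}\n")
    return "".join(lines)

batch = generate_copy_batch(0, batch_size=3)
# each line looks like e.g. "4821\t7\t0\t133\n"
```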
see, I am afraid of the part where it says "randomly", because you probably used random(), which isn't the fastest thing on earth :)
Ciprian Dorin Craciun wrote:
[............]
> So what can I do / how could I optimize the use of Postgres for this usage?

Hello, here you have some comments that will probably help you to get more from this test machine......

>> * test machine: Linux (Ubuntu 8.04 x64), IBM x3750, 2 x 500Gb,
>> SCSI, RAID 0, 8Gb RAM, 2 x Intel Xeon 5160 x 2 core (3Ghz, 4Mb L2);
>> * Postgres version: 8.3.3;

(RAID 0 is never a good thing with databases if you don't have another redundant system that can be used to restore your data, or if you want to minimize your downtime.)

Putting the database transaction logs ($PGDATA/pg_xlog) on their own dedicated disk resource will probably increase write performance.

>> create index sds_benchmark_data__client_sensor__index on sds_benchmark_data (client, sensor);

You don't need this index if the primary key is (client, sensor, timestamp).

>> shared_buffers = 24MB

I would increase this to 25% of your RAM: 2GB on the test machine (if it is a dedicated postgres server). It will help the read rate. You will probably have to increase kernel.shmmax and kernel.shmall in /etc/sysctl.conf (linux).

>> fsync = off

Do you have the results with this on?

>> checkpoint_segments = 32

I would increase this to 128-256 if you work with large write loads (several GB of data). $PGDATA/pg_xlog will use some extra disk if you change this value.

>> effective_cache_size = 1024MB

50% of your RAM: 4GB on the test machine (if it is a dedicated postgres server). It will probably help the read rate.

In addition, I would try to change these parameters also:

wal_buffers = 64
random_page_cost = 2.0

In general, faster and more disks in a RAID 1+0 / 0+1 will help write performance. autovacuum should be on.

regards.
--
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway
PGP Public Key: http://folk.uio.no/rafael/
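Pulled together, these suggestions amount to roughly the following postgresql.conf fragment for the 8GB test machine (the values are Rafael's suggestions, not measured optima):

```
shared_buffers = 2GB          # ~25% of RAM; needs kernel.shmmax/shmall raised
effective_cache_size = 4GB    # ~50% of RAM
checkpoint_segments = 256     # for large write loads; uses extra pg_xlog disk
wal_buffers = 64              # in 8.3 a bare integer means 8kB pages, i.e. 512kB
random_page_cost = 2.0
autovacuum = on
```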
On Fri, Nov 21, 2008 at 02:50:45PM +0200, Ciprian Dorin Craciun wrote: > Currently I'm benchmarking the following storage solutions for this: > * Hypertable (http://www.hypertable.org/) -- which has good insert > rate (about 250k inserts / s), but slow read rate (about 150k reads / > s); (the aggregates are manually computed, as Hypertable does not > support other queries except scanning (in fact min, and max are easy > beeing the first / last key in the ordered set, but avg must be done > by sequential scan);) > * BerkeleyDB -- quite Ok insert rate (about 50k inserts / s), but > fabulos read rate (about 2M reads / s); (the same issue with > aggregates;) > * Postgres -- which behaves quite poorly (see below)... > * MySQL -- next to be tested; It's not quite what you're asking for; but have you checked out any of the databases that have resulted from the StreamSQL research? The Borealis engine[1] looks like the most recent development, but I'm not sure how you are with academic style code. I've never used it before, but it sounds as though it was designed for exactly your sort of problem. Sam [1] http://www.cs.brown.edu/research/borealis/public/
On Fri, Nov 21, 2008 at 3:29 PM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> see, I am afraid of the part where it says "randomly", because you probably
> used random(), which isn't the fastest thing on earth :)

I can assure you this is not the problem... The other storage engines work quite well, and also the generation speed is somewhere around 30 million records / second, which is 100 times greater than the speed achieved by the fastest store I've tested so far...

Ciprian.
You might want to look into how OpenNMS uses RRDTool. It is able to handle a huge number of nodes by queuing inserts into the RRDs and using JRobin.
I'm not sure if it is a great solution for what you are looking for, but I've found its performance scales quite well. I'm getting well over 500 updates per second using JRobin and an NFS disk. I'm sure I could do better, but my application is limited by Hibernate. Each of my files stores an MRTG's worth of data and keeps the average and max of four points in 136k.
Here is an iostat -dmx 1 of the rrd update:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
xvda 7.00 33.00 122.00 2093.00 0.48 8.34 8.16 150.70 69.92 0.45 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
xvda 493.00 26.00 248.00 540.00 2.84 2.21 13.13 43.70 55.42 1.26 99.60
My big problem with RRD is not being able to query it the way you can query a database, and its tendency not to keep exact values. Oh, and it being pretty cryptic.
Sorry to clutter up the list with RRD stuff. I just thought it might be pertinent to Ciprian.
On Fri, Nov 21, 2008 at 8:03 AM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
> [duplicate quote of the earlier RRDTool exchange snipped]
"Ciprian Dorin Craciun" <ciprian.craciun@gmail.com> writes: > In short the data is inserted by using COPY sds_benchmark_data > from STDIN, in batches of 500 thousand data points. Not sure if it applies to your real use-case, but if you can try doing the COPY from a local file instead of across the network link, it might go faster. Also, as already noted, drop the redundant index. regards, tom lane
Thanks for your info! Please see below...

On Fri, Nov 21, 2008 at 4:14 PM, Rafael Martinez <r.m.guerrero@usit.uio.no> wrote:
> Putting the database transaction logs ($PGDATA/pg_xlog) on its own
> dedicated disk resource will probably increase write performance.

Unfortunately this is a test machine shared with other projects, and I can't change (for now) the disk setup... When I have a dedicated machine I'll do this... For now, no.

> You don't need this index if the primary key is (client, sensor, timestamp).

Both the primary key and the index are listed here because I switched between them while testing... (Almost the same behaviour with only the primary key, or with only the index...)

> [shared_buffers:] I would increase this to 25% of your RAM. 2GB in the test
> machine (if it is a dedicated postgres server). It will help read-rate.

Modified it.

> [fsync = off:] Do you have the results with this on?

It doesn't help at all... I guess the problem is with the index building...

> [checkpoint_segments:] I would increase this to 128-256 if you work with
> large write loads (several GB of data).

Updated it to 256.

> [effective_cache_size:] 50% of your RAM. 4GB in the test machine (if it is
> a dedicated postgres server). It will probably help read-rate.

Updated it to 4096MB.

> In addition, I would try to change these parameters also:
>
> wal_buffers = 64
> random_page_cost = 2.0

Currently wal_buffers is 64kB; I've set it to 64MB??? random_page_cost was 4.0; I decreased it to 2.0???

> In general, faster and more disks in a RAID 1+0 / 0+1 will help write
> performance. autovacuum should be on.

So after the updates the results were better, but still under 1k inserts / second... Thanks again for your info!
On Fri, Nov 21, 2008 at 6:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Ciprian Dorin Craciun" <ciprian.craciun@gmail.com> writes:
>> In short the data is inserted by using COPY sds_benchmark_data
>> from STDIN, in batches of 500 thousand data points.
>
> Not sure if it applies to your real use-case, but if you can try doing
> the COPY from a local file instead of across the network link, it
> might go faster. Also, as already noted, drop the redundant index.
>
> regards, tom lane

Hi! It won't be that difficult to use a local file (right now I'm using the same computer), but will it really make a difference? (I mean, have you seen such issues?)

Thanks,
Ciprian Craciun.
"Ciprian Dorin Craciun" <ciprian.craciun@gmail.com> writes: > On Fri, Nov 21, 2008 at 6:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not sure if it applies to your real use-case, but if you can try doing >> the COPY from a local file instead of across the network link, it >> might go faster. Also, as already noted, drop the redundant index. > It won't be that difficult to use a local file (now I'm using the > same computer), but will it really make a difference? Yes. I'm not sure how much, but there is nontrivial protocol overhead. regards, tom lane
On Fri, Nov 21, 2008 at 7:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Ciprian Dorin Craciun" <ciprian.craciun@gmail.com> writes: >> On Fri, Nov 21, 2008 at 6:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Not sure if it applies to your real use-case, but if you can try doing >>> the COPY from a local file instead of across the network link, it >>> might go faster. Also, as already noted, drop the redundant index. > >> It won't be that difficult to use a local file (now I'm using the >> same computer), but will it really make a difference? > > Yes. I'm not sure how much, but there is nontrivial protocol overhead. > > regards, tom lane > Ok, I have tried it, and no improvements... (There is also the drawback that I must run the inserts as the superuser...) Ciprian Craciun.
On Fri, 21 Nov 2008, Tom Lane wrote:

> Not sure if it applies to your real use-case, but if you can try doing
> the COPY from a local file instead of across the network link, it
> might go faster.

The fact that the inserts are reported as fast initially but slow as the table and index size grow means it's probably a disk bottleneck rather than anything related to the client itself. If the network overhead were the issue, I wouldn't expect it to start fast like that. Ditto for concerns about the random function being slow. Either of those might speed up the initial, fast period a bit, but they're not going to impact the later slowdown.

Ciprian, the main interesting piece of data to collect here is a snapshot of a few sample lines from the output of "vmstat 1" during the initial, fast loading section versus the slower period. I think what's happening to you is that maintaining the index blocks on disk is turning into increasingly random I/O as the size of the table grows, and your disks just can't keep up with that. What I'd expect is that initially the waiting-for-I/O "wa" figure will be low, but it will creep up constantly, and at some point spike up hard after the working set of data operated on exceeds memory.

The fact that PostgreSQL performs badly here compared to the more lightweight databases you've used isn't that much of a surprise. There's a fair amount of overhead for the write-ahead log and the MVCC implementation in the database, and your application is suffering from all that robustness overhead but not really gaining much of a benefit from it. The main things that help in this sort of situation are increases in shared_buffers and checkpoint_segments, so that more database information is stored in RAM for longer rather than being pushed to disk too quickly, but what Rafael suggested already got you most of the possible improvement here. You might get an extra bit of boost by adjusting the index FILLFACTOR upwards (default is 90; if you're never updating you could try 100). I doubt that will be anything but a minor incremental improvement though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
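The FILLFACTOR suggestion could be applied like this (a sketch, assuming the default primary-key index name for the table from the original post):

```sql
-- fillfactor is an index storage parameter; the new value only applies
-- to pages (re)built afterwards, hence the REINDEX.
alter index sds_benchmark_data_pkey set (fillfactor = 100);
reindex index sds_benchmark_data_pkey;
```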
On Fri, Nov 21, 2008 at 7:42 PM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
> [...]
> Ok, I have tried it, and no improvements... (There is also the
> drawback that I must run the inserts as the superuser...)

Thinking about it some more, the protocol overhead is not important... because if I don't use indexes, I obtain 600k inserts / second... (So the test was useless... :) But I learn from my mistakes...)

Ciprian.
On Fri, Nov 21, 2008 at 7:45 PM, Greg Smith <gsmith@gregsmith.com> wrote: > On Fri, 21 Nov 2008, Tom Lane wrote: > >> Not sure if it applies to your real use-case, but if you can try doing >> the COPY from a local file instead of across the network link, it >> might go faster. > > The fact that the inserts are reported as fast initially but slow as the > table and index size grow means it's probably a disk bottleneck rather than > anything related to the client itself. If the network overhead was the > issue, I wouldn't expect it to start fast like that. Ditto for concerns > about the random function being slow. Either of those might speed up the > initial, fast period a bit, but they're not going to impact the later > slowdown. > > Ciprian, the main interesting piece of data to collect here is a snapshot of > a few samples lines from the output from "vmstat 1" during the initial, fast > loading section versus the slower period. I think what's happening to you > is that maintaining the index blocks on the disk is turning into > increasingly random I/O as the size of the table grows, and your disks just > can't keep up with that. What I'd expect is that initially the waiting for > I/O "wa" figure will be low, but it will creep up constantly and at some > point spike up hard after the working set of data operated on exceeds > memory. Ok. Silly question: how do I "vmstat 1"??? The problem is indeed the indexes... So If we analyze the insert patterns: client id's randomly distributed and sensor id's the same, and the index is created ontop of these two, it means that (probabilistically) speaking after 100 thousand inserts (10 thousand clients and 10 sensors), all the index pages would be dirty... Indeed I could prolongue the flush by using bigger and bigger memory, but this doesn't help for 100 million records... > The fact that PostgreSQL performs badly here compared to the more > lightweight databases you've used isn't that much of a surprise. 
There's a > fair amount of overhead for the write-ahead log and the MVCC implementation > in the database, and your application is suffering from all that robustness > overhead but not really gaining much of a benefit from it. The main things > that help in this sort of situation are increases in shared_buffers and > checkpoint_segments, so that more database information is stored in RAM for > longer rather than being pushed to disk too quickly, but what Rafael > suggested already got you most of the possible improvement here. You might > get an extra bit of boost by adjusting the index FILLFACTOR upwards (default > is 90, if you're never updating you could try 100). I doubt that will be > anything but a minor incremental improvement though. About the fillfactor: on the contrary, I think I should set it lower (as the index fills very quickly)... I've set it to 10% and it behaves a little better than with the default... I'll run a full 100 million test to see where it breaks... > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD Thanks, Ciprian Craciun.
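(Greg's "vmstat 1" suggestion above just means running `vmstat 1` in a shell while the load is running and watching the "wa" column, the percentage of CPU time spent waiting on I/O. The same figure can be derived from the aggregate "cpu" line in /proc/stat; below is a minimal Python sketch of that, given two samples taken one second apart. The field layout follows the Linux proc(5) man page; the helper names are my own.)

```python
# Sketch: compute the "wa" (I/O wait) percentage that "vmstat 1" reports,
# from two samples of the aggregate "cpu" line in /proc/stat.
# Field order (per Linux proc(5)): user nice system idle iowait irq softirq ...

def parse_cpu_line(line):
    return [int(x) for x in line.split()[1:]]

def iowait_percent(sample_a, sample_b):
    a, b = parse_cpu_line(sample_a), parse_cpu_line(sample_b)
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    iowait = deltas[4]          # 5th field is iowait
    return 100.0 * iowait / total if total else 0.0

if __name__ == "__main__":
    # Two synthetic samples: of the 100 ticks elapsed, 40 were spent in iowait.
    s1 = "cpu  100 0 50 800 10 0 0 0"
    s2 = "cpu  130 0 70 810 50 0 0 0"
    print(iowait_percent(s1, s2))  # 40.0
```

In practice one would read `/proc/stat` twice with a one-second sleep in between, but `vmstat 1` already does exactly that.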
On Fri, 21 Nov 2008, Sam Mason wrote: > It's not quite what you're asking for; but have you checked out any > of the databases that have resulted from the StreamSQL research? A streaming database approach is in fact ideally suited to handling this particular problem. Looking at the original request here: > * for a given client (and sensor), and time interval, I need the min, > max, and avg of the values; > * for a given time interval (and sensor), I need min, max, and avg of > the values; The most efficient way possible to compute these queries is to buffer the full interval worth of data in memory as the values are being inserted, compute these aggregates once the time window for the interval has ended, then write a summarized version of the data. Doing that sort of thing and then operating on the aggregated data, rather than maintaining a bulky index covering every single row, is exactly the sort of thing a good streaming database would handle for you. I can't comment on the current state of Borealis. But as the original focus of the streaming database research that spawned Truviso where I work was accelerating data capture from sensor networks, I know this general style of approach is quite beneficial here. For example, http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/ is an open-source implementation from that research that's integrated with an older version of PostgreSQL. If you look at the "windowed aggregates" example there, it shows what a streaming query similar to the requirements here would look like: an average and other statistics produced on a per-interval basis. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
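(The buffering scheme Greg describes, accumulating one interval's worth of readings in memory and emitting min/max/avg summaries when the window closes, can be sketched in a few lines of Python. The class and method names below are illustrative, not any streaming engine's actual API.)

```python
# Sketch of the windowed-aggregation idea: buffer readings per
# (client, sensor) for one time window, then emit (min, max, avg)
# summary rows when the window closes.
from collections import defaultdict

class WindowAggregator:
    def __init__(self, window=600):          # window length in seconds
        self.window = window
        self.current = None                  # start timestamp of open window
        self.values = defaultdict(list)

    def insert(self, client, sensor, timestamp, value):
        start = timestamp - timestamp % self.window
        flushed = []
        if self.current is not None and start != self.current:
            flushed = self.flush()           # window closed: emit summaries
        self.current = start
        self.values[(client, sensor)].append(value)
        return flushed

    def flush(self):
        out = []
        for (client, sensor), vals in sorted(self.values.items()):
            out.append((client, sensor, self.current,
                        min(vals), max(vals), sum(vals) / len(vals)))
        self.values.clear()
        return out

agg = WindowAggregator(window=60)
agg.insert(1, 7, 0, 10)
agg.insert(1, 7, 6, 20)
rows = agg.insert(1, 7, 61, 5)   # crosses into the next minute -> flush
print(rows)                       # [(1, 7, 0, 10, 20, 15.0)]
```

Only the summary rows would then be written to the database, which is what keeps the per-row index maintenance cost out of the hot path.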
On Fri, Nov 21, 2008 at 8:41 PM, Greg Smith <gsmith@gregsmith.com> wrote: > On Fri, 21 Nov 2008, Sam Mason wrote: > >> It's not quite what you're asking for; but have you checked out any >> of the databases that have resulted from the StreamSQL research? > > A streaming database approach is in fact ideally suited to handling this > particular problem. Looking at the original request here: > >> * for a given client (and sensor), and time interval, I need the min, max, >> and avg of the values; >> * for a given time interval (and sensor), I need min, max, and avg of the >> values; > > The most efficient way possible to compute these queries is to buffer the > full interval worth of data in memory as the values are being inserted, > compute these aggregates once the time window for the interval has ended, > then write a summarized version of the data. Doing that sort of thing and > then operating on the aggregated data, rather than maintaining a bulky index > covering every single row, is exactly the sort of thing a good streaming > database would handle for you. > > I can't comment on the current state of Borealis. But as the original focus > of the streaming database research that spawned Truviso where I work was > accelerating data capture from sensor networks, I know this general style of > approach is quite beneficial here. For example, > http://telegraph.cs.berkeley.edu/telegraphcq/v2.1/ is an open-source > implementation from that research that's integrated with an older version of > PostgreSQL. If you look at the "windowed aggregates" example there, it > shows what a streaming query similar to the requirements here would look > like: an average and other statistics produced on a per-interval basis. 
> > -- > * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general Thank you for your pointers, I'll check them out... I'm aware of the fact that the problem could be solved in other more efficient ways (for example as you've said, by aggregating the data in memory and flushing only the aggregates), but as I've said in the beginning, I also want to test different storage systems in such a situation... Ciprian Craciun.
On Fri, Nov 21, 2008 at 9:50 AM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
Currently I'm benchmarking the following storage solutions for this:
* Hypertable (http://www.hypertable.org/) -- which has good insert
rate (about 250k inserts / s), but slow read rate (about 150k reads /
s); (the aggregates are manually computed, as Hypertable does not
support other queries except scanning (in fact min and max are easy,
being the first / last key in the ordered set, but avg must be done
by sequential scan);)
* BerkeleyDB -- quite OK insert rate (about 50k inserts / s), but
fabulous read rate (about 2M reads / s); (the same issue with
aggregates;)
* Postgres -- which behaves quite poorly (see below)...
* MySQL -- next to be tested;
I think it'll also be interesting to see how SQLite 3 performs in this scenario. Any plans?
regards
diego
On Fri, Nov 21, 2008 at 10:26 PM, Diego Schulz <dschulz@gmail.com> wrote: > > > On Fri, Nov 21, 2008 at 9:50 AM, Ciprian Dorin Craciun > <ciprian.craciun@gmail.com> wrote: >> >> Currently I'm benchmarking the following storage solutions for this: >> * Hypertable (http://www.hypertable.org/) -- which has good insert >> rate (about 250k inserts / s), but slow read rate (about 150k reads / >> s); (the aggregates are manually computed, as Hypertable does not >> support other queries except scanning (in fact min and max are easy, >> being the first / last key in the ordered set, but avg must be done >> by sequential scan);) >> * BerkeleyDB -- quite OK insert rate (about 50k inserts / s), but >> fabulous read rate (about 2M reads / s); (the same issue with >> aggregates;) >> * Postgres -- which behaves quite poorly (see below)... >> * MySQL -- next to be tested; > > I think it'll also be interesting to see how SQLite 3 performs in this > scenario. Any plans? > > regards > > diego I would try it if I knew that it could handle the load... Do you have some info about this? Any pointers about the configuration issues? Ciprian.
On 21 Lis, 13:50, ciprian.crac...@gmail.com ("Ciprian Dorin Craciun") wrote: > Hello all! > > I would like to ask some advice about the following problem > (related to the Dehems project: http://www.dehems.eu/): > * there are some clients; (the clients are in fact households;) > * each device has a number of sensors (about 10), and not all the > clients have the same sensor; also sensors might appear and disappear > dynamically; (the sensors are appliances;) > * for each device and each sensor a reading is produced (at about > 6 seconds); (the values could be power consumptions;) > * I would like to store the following data: (client, sensor, > timestamp, value); > * the usual queries are: > * for a given client (and sensor), and time interval, I need > the min, max, and avg of the values; > * for a given time interval (and sensor), I need min, max, and > avg of the values; > * other statistics; How many devices do you expect? As I understand it, the expected insert rate is more or less: no. of devices * no. of sensors (about 10), every 6 seconds. Let's assume you have 100 devices; that means 1000 inserts per 6 s ≈ 166 inserts per second.
> * inserts are done like this: > * generated 100 million readings by using the following rule: > * client is randomly chosen between 0 and 10 thousand; > * sensor is randomly chosen between 0 and 10; > * the timestamp is always increasing by one; > * the insert is done in batches of 500 thousand inserts (I've > also tried 5, 25, 50 and 100 thousand without big impact); > * the batch inserts are done through COPY sds_benchmark_data > FROM STDIN through libpq (by using UNIX (local) sockets); > What have I observed / tried: > * I've tested without the primary key and the index, and the > results were the best for inserts (600k inserts / s), but the > readings worked extremely slowly (due to the lack of indexing); > * with only the index (or only the primary key) the insert rate is > good at start (for the first 2 million readings), but then drops to > about 200 inserts / s; Try periodically executing REINDEX on your index, and ANALYZE on your table. To be honest, it should not influence inserts, but it will influence selects. Michal Szymanski http://blog.szymanskich.net
On Fri, Nov 21, 2008 at 3:12 PM, Michal Szymanski <dyrex@poczta.onet.pl> wrote: > On 21 Lis, 13:50, ciprian.crac...@gmail.com ("Ciprian Dorin Craciun") > wrote: >> Hello all! >> >> I would like to ask some advice about the following problem >> (related to the Dehems project: http://www.dehems.eu/): >> * there are some clients; (the clients are in fact households;) >> * each device has a number of sensors (about 10), and not all the >> clients have the same sensor; also sensors might appear and disappear >> dynamically; (the sensors are appliances;) >> * for each device and each sensor a reading is produced (at about >> 6 seconds); (the values could be power consumptions;) >> * I would like to store the following data: (client, sensor, >> timestamp, value); >> * the usual queries are: >> * for a given client (and sensor), and time interval, I need >> the min, max, and avg of the values; >> * for a given time interval (and sensor), I need min, max, and >> avg of the values; >> * other statistics; > > How many devices do you expect? > As I understand it, the expected insert rate is more or less: > no. of devices * no. of sensors (about 10), > every 6 seconds. Let's assume you have 100 devices; that means 1000 > inserts per 6 s ≈ 166 inserts per second. Yes, the figures are like this: * average number of raw inserts / second (without any optimization or previous aggregation): #clients (~ 100 thousand) * #sensors (~ 10) / 6 seconds = 166 thousand inserts / second... * if I use sharding this number would drop linearly with the number of Postgres instances... so let's say I use about 10 thousand users / Postgres instance => 16 thousand inserts / second... (a figure which I wasn't able to reach in my Postgres benchmarks...) Either way, I would expect at least 2-3 thousand inserts per second...
>> * inserts are done like this: >> * generated 100 million readings by using the following rule: >> * client is randomly chosen between 0 and 10 thousand; >> * sensor is randomly chosen between 0 and 10; >> * the timestamp is always increasing by one; >> * the insert is done in batches of 500 thousand inserts (I've >> also tried 5, 25, 50 and 100 thousand without big impact); >> * the batch inserts are done through COPY sds_benchmark_data >> FROM STDIN through libpq (by using UNIX (local) sockets); > >> What have I observed / tried: >> * I've tested without the primary key and the index, and the >> results were the best for inserts (600k inserts / s), but the >> readings worked extremely slowly (due to the lack of indexing); >> * with only the index (or only the primary key) the insert rate is >> good at start (for the first 2 million readings), but then drops to >> about 200 inserts / s; > > > Try periodically executing REINDEX on your index, and ANALYZE on > your table. To be honest, it should not influence inserts, but it will > influence selects. I'll keep this in mind when I reach the select part... For the moment I'm struggling with inserts... (Actually I've kind of given up...) > Michal Szymanski > http://blog.szymanskich.net > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general Thanks, Ciprian Craciun.
Fwd: Using Postgres to store high volume streams of sensor readings
From: "Ciprian Dorin Craciun"
(I'm adding the discussion also to the Postgres list.) On Fri, Nov 21, 2008 at 11:19 PM, Dann Corbit <DCorbit@connx.com> wrote: > What is the schema for your table? > If you are using copy rather than insert, 1K rows/sec for PostgreSQL seems very bad unless the table is extremely wide. The schema is posted at the beginning of the thread. But in short it is a table with 4 columns: client, sensor, timestamp and value, all being int4 (integer). There is only one (compound) index on the client and sensor... I guess the problem is from the index... > Memory mapped database systems may be the answer to your need for speed. > If you have a single inserting process, you can try FastDB, but unless you use a 64 bit operating system and compiler, you will be limited to 2 GB file size. FastDB is single writer, multiple reader model. See: > http://www.garret.ru/databases.html > > Here is output from the fastdb test program testperf, when compiled in 64 bit mode (the table is ultra-simple with only a string key and a string value, with also a btree and a hashed index on key): > Elapsed time for inserting 1000000 record: 8 seconds > Commit time: 1 > Elapsed time for 1000000 hash searches: 1 seconds > Elapsed time for 1000000 index searches: 4 seconds > Elapsed time for 10 sequential search through 1000000 records: 2 seconds > Elapsed time for search with sorting 1000000 records: 3 seconds > Elapsed time for deleting all 1000000 records: 0 seconds > > Here is a bigger set so you can get an idea about scaling: > > Elapsed time for inserting 10000000 record: 123 seconds > Commit time: 13 > Elapsed time for 10000000 hash searches: 10 seconds > Elapsed time for 10000000 index searches: 82 seconds > Elapsed time for 10 sequential search through 10000000 records: 8 seconds > Elapsed time for search with sorting 10000000 records: 41 seconds > Elapsed time for deleting all 10000000 records: 4 seconds > > If you have a huge database, then FastDB may be problematic because you need free
memory equal to the size of your database. > E.g. a 100 GB database needs 100 GB memory to operate at full speed. In 4 GB allotments, at $10-$50/GB, 100 GB costs between $1000 and $5000. Unfortunately the database will (eventually) be too large to store all of it in memory... For the moment, I don't think I'll be able to try FastDB... I'll put it on my reminder list... > MonetDB is worth a try, but I had trouble getting it to work properly on 64 bit Windows: > http://monetdb.cwi.nl/ I've heard of MonetDB -- it's from the same family as Hypertable... Maybe I'll give it a try after I finish with SQLite... Ciprian Craciun.
Ciprian Dorin Craciun wrote: > > I would try it if I would know that it could handle the load... Do > you have some info about this? Any pointers about the configuration > issues? > > Ciprian. > Apart from the configure options at build time you should read - http://www.sqlite.org/pragma.html It was a few versions ago so may be changed by now, but I reckon it was the temp_store setting - which is described as temp tables and indexes but is (or was) also used for large query and sorting needs. Setting this to memory did make a difference for some queries. -- Shane Ambler pgSQL (at) Sheeky (dot) Biz Get Sheeky @ http://Sheeky.Biz
On Sat, Nov 22, 2008 at 8:04 PM, Shane Ambler <pgsql@sheeky.biz> wrote: > Ciprian Dorin Craciun wrote: > >> >> I would try it if I knew that it could handle the load... Do >> you have some info about this? Any pointers about the configuration >> issues? >> >> Ciprian. >> > > > Apart from the configure options at build time you should read - > http://www.sqlite.org/pragma.html > > It was a few versions ago so may be changed by now, but I reckon it > was the temp_store setting - which is described as temp tables and > indexes but is (or was) also used for large query and sorting needs. > Setting this to memory did make a difference for some queries. > > > -- > > Shane Ambler > pgSQL (at) Sheeky (dot) Biz > > Get Sheeky @ http://Sheeky.Biz Hello all! (This email is now about SQLite3, but it also relates to Postgres as a comparison.) I've tested SQLite3 as well, and it has the same behavior as Postgres... Meaning at the beginning it goes really nicely at 20k inserts / s, drops to about 10k inserts / s, but after a few million records the HDD led starts to blink non-stop, and then it drops to under 1k... I've used exactly the same schema as for Postgres, and the following pragmas: * page_size = 8192; * fullsync = 0; * synchronous = off; * journal_mode = off; (this has a 10 fold impact... from 1k inserts at the beginning to 10 or 20k...) * cache_size = 65536; (this is in pages, and it results in 512MB, but I don't see the memory being used during inserts...) * auto_vacuum = none; * analyze at the end of the inserts; So I would conclude that relational stores will not make it for this use case... I'll rerun the tests tomorrow and post a comparison between SQLite and Postgres. Ciprian Craciun.
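(For reference, the pragma setup and batched inserts described above look roughly like this with Python's standard sqlite3 module. This is a minimal sketch: an in-memory database and a tiny batch, where the real benchmark used an on-disk file and 100 million rows.)

```python
# Sketch of the SQLite test setup: same schema, the pragmas from the list
# above, and batched inserts inside one transaction via executemany.
import sqlite3

conn = sqlite3.connect(":memory:")
for pragma in ("page_size = 8192", "synchronous = OFF",
               "journal_mode = OFF", "cache_size = 65536"):
    conn.execute("PRAGMA " + pragma)

conn.execute("""
    create table sds_benchmark_data (
        client int not null, sensor int not null,
        timestamp int not null, value int,
        primary key (client, sensor, timestamp))""")

# One batch of synthetic readings; timestamps are strictly increasing.
batch = [(c % 100, c % 10, c, c * 2) for c in range(5000)]
with conn:                                   # one transaction per batch
    conn.executemany(
        "insert into sds_benchmark_data values (?, ?, ?, ?)", batch)

(n,) = conn.execute("select count(*) from sds_benchmark_data").fetchone()
print(n)  # 5000
```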
On Sat, Nov 22, 2008 at 2:37 PM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote: > > Hello all! SNIP > So I would conclude that relational stores will not make it for > this use case... I was wondering if you guys are having to do all individual inserts or if you can batch some number together into a transaction. Being able to put > 1 into a single transaction is a huge win for pgsql.
On Sat, Nov 22, 2008 at 11:51 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Sat, Nov 22, 2008 at 2:37 PM, Ciprian Dorin Craciun > <ciprian.craciun@gmail.com> wrote: >> >> Hello all! > SNIP >> So I would conclude that relational stores will not make it for >> this use case... > > I was wondering you guys are having to do all individual inserts or if > you can batch some number together into a transaction. Being able to > put > 1 into a single transaction is a huge win for pgsql. I'm aware of the performance issues between 1 insert vs x batched inserts in one operation / transaction. That is why in the case of Postgres I am using COPY <table> FROM STDIN, and using 5k batches... (I've tried even 10k, 15k, 25k, 50k, 500k, 1m inserts / batch and no improvement...) Ciprian Craciun.
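(For illustration, building one such COPY batch amounts to rendering tab-separated rows, one per line, into a text buffer. A minimal Python sketch follows; actually feeding the buffer to the server, e.g. via psycopg2's cursor.copy_expert, is assumed and not shown here.)

```python
# Sketch: render a batch of readings into the tab-separated, one-row-per-line
# text that COPY sds_benchmark_data FROM STDIN consumes.
import io

def copy_buffer(rows):
    buf = io.StringIO()
    for client, sensor, timestamp, value in rows:
        buf.write(f"{client}\t{sensor}\t{timestamp}\t{value}\n")
    buf.seek(0)
    return buf

batch = [(42, 3, 1227300000, 117), (42, 4, 1227300000, 9)]
print(copy_buffer(batch).read(), end="")
```

The point of the exercise is that the per-row protocol and parsing cost of COPY is tiny; the slowdown observed in the thread comes from index maintenance, not from how the batch is shipped.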
On Sat, Nov 22, 2008 at 4:54 PM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote: > On Sat, Nov 22, 2008 at 11:51 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: >> On Sat, Nov 22, 2008 at 2:37 PM, Ciprian Dorin Craciun >> <ciprian.craciun@gmail.com> wrote: >>> >>> Hello all! >> SNIP >>> So I would conclude that relational stores will not make it for >>> this use case... >> >> I was wondering you guys are having to do all individual inserts or if >> you can batch some number together into a transaction. Being able to >> put > 1 into a single transaction is a huge win for pgsql. > > I'm aware of the performance issues between 1 insert vs x batched > inserts in one operation / transaction. That is why in the case of > Postgres I am using COPY <table> FROM STDIN, and using 5k batches... > (I've tried even 10k, 15k, 25k, 50k, 500k, 1m inserts / batch and no > improvement...) I've had exactly the same experience with Postgres during an attempt to use it as a store for large-scale incoming streams of data at a rate very comparable to what you're looking at (~100k/sec). We eventually just ended up rolling our own solution. -- - David T. Wilson david.t.wilson@gmail.com
Ciprian Dorin Craciun wrote: > I've tested also Sqlite3 and it has the same behavior as > Postgres... Meaning at beginning it goes really nice 20k inserts, > drops to about 10k inserts, but after a few million records, the HDD > led starts to blink non-stop, and then it drops to unde 1k.... The problem is, most likely, on updating the indexes. Heap inserts should always take more or less the same time, but index insertion requires walking down the index struct for each insert, and the path to walk gets larger the more data you have. Postgres does not have bulk index insert, which could perhaps get you a huge performance improvement. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
> On 21 Lis, 13:50, ciprian.crac...@gmail.com ("Ciprian Dorin Craciun") > wrote: > > What have I observed / tried: > > * I've tested without the primary key and the index, and the > > results were the best for inserts (600k inserts / s), but the > > readings, worked extremly slow (due to the lack of indexing); > > * with only the index (or only the primary key) the insert rate is > > good at start (for the first 2 million readings), but then drops to > > about 200 inserts / s; I didn't read the thread so I don't know if this was suggested already: bulk index creation is a lot faster than retail index inserts. Maybe one thing you could try is to have an unindexed table to do the inserts, and a separate table that you periodically truncate, refill with the contents from the other table, then create index. Two main problems: 1. querying during the truncate/refill/reindex process (you can solve it by having a second table that you "rename in place"); 2. the query table is almost always out of date. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
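(Alvaro's staging-table idea can be sketched as follows. This uses Python's stdlib sqlite3 purely to keep the sketch runnable; the statement sequence is the same one would issue against Postgres, and the table names are illustrative.)

```python
# Sketch of the bulk-reindex pattern: inserts land in an unindexed staging
# table; periodically the query table is rebuilt from it, indexed in one
# pass, and swapped in by rename so readers keep a queryable table.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table staging (client, sensor, timestamp, value)")
db.execute("create table readings (client, sensor, timestamp, value)")

# Fast path: unindexed bulk inserts into the staging table.
db.executemany("insert into staging values (?, ?, ?, ?)",
               [(1, 1, t, t * 10) for t in range(1000)])

with db:  # rebuild + swap; queries see the old 'readings' until the rename
    db.execute("create table readings_new as select * from staging")
    db.execute("create index readings_new_idx "
               "on readings_new (client, sensor, timestamp)")
    db.execute("drop table readings")
    db.execute("alter table readings_new rename to readings")

(n,) = db.execute("select count(*) from readings").fetchone()
print(n)  # 1000
```

As Alvaro notes, the trade-off is that the indexed table lags behind the staging table by one rebuild cycle.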
Alvaro Herrera <alvherre@commandprompt.com> writes: > The problem is, most likely, on updating the indexes. Heap inserts > should always take more or less the same time, but index insertion > requires walking down the index struct for each insert, and the path to > walk gets larger the more data you have. It's worse than that: his test case inserts randomly ordered keys, which means that there's no locality of access during the index updates. Once the indexes get bigger than RAM, update speed goes into the toilet, because the working set of index pages that need to be touched also is bigger than RAM. That effect is going to be present in *any* standard-design database, not just Postgres. It's possible that performance in a real-world situation would be better, if the incoming data stream isn't so random; but it's hard to tell about that with the given facts. One possibly useful trick is to partition the data by timestamp with partition sizes chosen so that the indexes don't get out of hand. But the partition management might be enough of a PITA to negate any win. regards, tom lane
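(Tom's locality argument can be illustrated with a toy model: treat the index as fixed-size pages of consecutive keys and count how many distinct pages one batch of inserts touches. The page size and key space below are made-up numbers, not Postgres internals.)

```python
# Toy illustration: randomly ordered keys touch almost every leaf "page"
# of the index per batch, while sequentially ordered keys touch only a few.
import random

KEYS_PER_PAGE = 100
KEY_SPACE = 100_000          # e.g. 10k clients x 10 sensors

def pages_touched(keys):
    return len({k // KEYS_PER_PAGE for k in keys})

rng = random.Random(42)
random_batch = [rng.randrange(KEY_SPACE) for _ in range(5000)]
sequential_batch = list(range(5000))

print(pages_touched(random_batch))      # close to all 1000 pages
print(pages_touched(sequential_batch))  # exactly 50 pages
```

Once that working set of pages exceeds RAM, every batch forces random reads and writes across the whole index, which matches the "wa" spike and the drop to ~200 inserts / s reported earlier in the thread.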
Since you always need the timestamp in your selects, have you tried indexing only the timestamp field? Your selects would be slower, but since client and sensor don't have that many distinct values compared to the number of rows you are inserting, maybe the difference in selects would not be that huge.
On Sat, Nov 22, 2008 at 5:54 PM, Scara Maccai <m_lists@yahoo.it> wrote: > Since you always need the timestamp in your selects, have you tried indexing only the timestamp field? > Your selects would be slower, but since client and sensor don't have that many distinct values compared to the number of rows you are inserting maybe the difference in selects would not be that huge. Even better might be partitioning on the timestamp. IF all access is in a certain timestamp range it's usually a big win, especially because he can move to a new table every hour / day / week or whatever and merge the old one into a big "old data" table.
On Sun, Nov 23, 2008 at 1:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: >> The problem is, most likely, on updating the indexes. Heap inserts >> should always take more or less the same time, but index insertion >> requires walking down the index struct for each insert, and the path to >> walk gets larger the more data you have. > > It's worse than that: his test case inserts randomly ordered keys, which > means that there's no locality of access during the index updates. Once > the indexes get bigger than RAM, update speed goes into the toilet, > because the working set of index pages that need to be touched also > is bigger than RAM. That effect is going to be present in *any* > standard-design database, not just Postgres. > > It's possible that performance in a real-world situation would be > better, if the incoming data stream isn't so random; but it's > hard to tell about that with the given facts. > > One possibly useful trick is to partition the data by timestamp with > partition sizes chosen so that the indexes don't get out of hand. > But the partition management might be enough of a PITA to negate > any win. > > regards, tom lane Thanks for your feedback! This is just as I supposed, but I didn't have the Postgres experience to be certain. I'll include your conclusion in my report. Ciprian Craciun.
On Sun, Nov 23, 2008 at 12:26 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > Ciprian Dorin Craciun wrote: > >> I've tested also SQLite3 and it has the same behavior as >> Postgres... Meaning at the beginning it goes really nicely at 20k inserts, >> drops to about 10k inserts, but after a few million records, the HDD >> led starts to blink non-stop, and then it drops to under 1k... > > The problem is, most likely, on updating the indexes. Heap inserts > should always take more or less the same time, but index insertion > requires walking down the index struct for each insert, and the path to > walk gets larger the more data you have. > > Postgres does not have bulk index insert, which could perhaps get you a > huge performance improvement. > > -- > Alvaro Herrera http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc. I don't think the index depth is a problem. For example, in the case of BerkeleyDB with BTree storage, the tree height is 3 after 100m inserts... So this is not the problem. I think the problem is that after a certain amount of data, periodically the entire index is touched, and in this case the HDD becomes a bottleneck... (Demonstrated by the vmstat 1 output I've put in a previous email.) Ciprian.
On Sun, Nov 23, 2008 at 12:32 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote: > >> On 21 Lis, 13:50, ciprian.crac...@gmail.com ("Ciprian Dorin Craciun") >> wrote: > >> > What have I observed / tried: >> > * I've tested without the primary key and the index, and the >> > results were the best for inserts (600k inserts / s), but the >> > readings worked extremely slowly (due to the lack of indexing); >> > * with only the index (or only the primary key) the insert rate is >> > good at start (for the first 2 million readings), but then drops to >> > about 200 inserts / s; > > I didn't read the thread so I don't know if this was suggested already: > bulk index creation is a lot faster than retail index inserts. Maybe > one thing you could try is to have an unindexed table to do the inserts, > and a separate table that you periodically truncate, refill with the > contents from the other table, then create index. Two main problems: 1. > querying during the truncate/refill/reindex process (you can solve it by > having a second table that you "rename in place"); 2. the query table is > almost always out of date. > > -- > Alvaro Herrera http://www.CommandPrompt.com/ > The PostgreSQL Company - Command Prompt, Inc. > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general The concerns you have listed are very important to me... I will use the database not only for archival and offline analysis, but also for realtime queries (like what is the power consumption in the last minute)... Of course I could use Postgres only for archival like you've said, and some other solution for realtime queries, but this adds complexity to the application... Thanks, Ciprian Craciun.
On Sun, Nov 23, 2008 at 3:09 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Sat, Nov 22, 2008 at 5:54 PM, Scara Maccai <m_lists@yahoo.it> wrote: >> Since you always need the timestamp in your selects, have you tried indexing only the timestamp field? >> Your selects would be slower, but since client and sensor don't have that many distinct values compared to the number of rows you are inserting maybe the difference in selects would not be that huge. > > Even better might be partitioning on the timestamp. IF all access is > in a certain timestamp range it's usually a big win, especially > because he can move to a new table every hour / day / week or whatever > and merge the old one into a big "old data" table. > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general Yes, it would speed up the inserts tremendously... I've tested it and the insert speed is somewhere at 200k->100k. But unfortunately the query speed is not good at all, because most queries are for a specific client (and sensor) in a given time range... Ciprian Craciun.
* Ciprian Dorin Craciun (ciprian.craciun@gmail.com) wrote: > > Even better might be partitioning on the timestamp. IF all access is > > in a certain timestamp range it's usually a big win, especially > > because he can move to a new table every hour / day / week or whatever > > and merge the old one into a big "old data" table. > > Yes, If i would speed the inserts tremendously... I've tested it > and the insert speed is somewhere at 200k->100k. > > But unfortunately the query speed is not good at all because most > queries are for a specific client (and sensor) in a given time > range... Have you set up your partitions correctly (eg, with appropriate CHECK constraints and with constraint_exclusion turned on)? Also, you'd want to keep your indexes on the individual partitions, of course.. That should improve query time quite a bit since it should only be hitting the partitions where the data might be. Stephen
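(A sketch of generating the per-partition DDL Stephen describes: with inheritance-based partitioning as in Postgres 8.3, each child table carries a CHECK constraint on the timestamp range so that, with constraint_exclusion turned on, the planner skips irrelevant partitions. The table/index names and the day-sized ranges below are illustrative.)

```python
# Sketch: emit per-day child-table DDL with a CHECK constraint on the
# integer timestamp, plus the per-partition index on (client, sensor,
# timestamp) so range queries per client stay indexed.
DAY = 86400

def partition_ddl(day_index):
    lo, hi = day_index * DAY, (day_index + 1) * DAY
    name = f"sds_benchmark_data_d{day_index}"
    return (
        f"create table {name} (\n"
        f"    check (timestamp >= {lo} and timestamp < {hi})\n"
        f") inherits (sds_benchmark_data);\n"
        f"create index {name}__idx on {name} (client, sensor, timestamp);"
    )

print(partition_ddl(0).splitlines()[1].strip())
# check (timestamp >= 0 and timestamp < 86400)
```

Inserts would be routed to the child matching the reading's timestamp (by the application or a trigger), so only the current partition's index is hot and fits in RAM.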
> But unfortunately the query speed is not good at all > because most > queries are for a specific client (and sensor) in a given > time > range... Maybe I'm wrong, I don't know a lot about these things; but wouldn't defining the index as (timestamp, clientid, sensor) instead of (clientid, sensor, timestamp) give you more "locality of access" during index maintenance? I think it would make more sense too, since you are not going to query the db without the timestamp, but maybe you want to query it without the clientid or probe (to get aggregates for example). Plus: could you give us some numbers about the difference in performance of the selects between the index defined as (timestamp) and defined as (clientid, sensor, timestamp)?
While most of my experience is with Oracle/Informix, I would also recommend: a) partitioning on the DB level. Put partitions on separate hard disks, have the system be at least dual core, and have the disks attached via a SCSI controller (not IDE) for parallel performance. b) partitioning on the application level (that is, having the insert code dynamically figure out which DB and which table to go to; this complicates the application for inserts as well as for reports). c) maybe there is a chance to remove the index (if all you are doing is inserts) -- and then recreate it later? e) I did not see the type of index, but if the values of at least some of the indexed fields repeat a lot, Oracle has what's called a 'bitmap index'. Postgresql might have something similar, where that type of index is optimized for the fact that values are the same for the majority of the rows (it becomes much smaller, and therefore quicker to update). f) review that there are no insert triggers and constraints (either field or foreign) on those tables; if there are, validate why they are there and see if they can be removed -- the application would then need to guarantee correctness. VSP On Sun, 23 Nov 2008 08:34:57 +0200, "Ciprian Dorin Craciun" <ciprian.craciun@gmail.com> said: > On Sun, Nov 23, 2008 at 1:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@commandprompt.com> writes: > >> The problem is, most likely, on updating the indexes. Heap inserts > >> should always take more or less the same time, but index insertion > >> requires walking down the index struct for each insert, and the path to > >> walk gets larger the more data you have. > > > > It's worse than that: his test case inserts randomly ordered keys, which > > means that there's no locality of access during the index updates. Once > > the indexes get bigger than RAM, update speed goes into the toilet, > > because the working set of index pages that need to be touched also > > is bigger than RAM.
> > That effect is going to be present in *any*
> > standard-design database, not just Postgres.
> >
> > It's possible that performance in a real-world situation would be
> > better, if the incoming data stream isn't so random; but it's
> > hard to tell about that with the given facts.
> >
> > One possibly useful trick is to partition the data by timestamp with
> > partition sizes chosen so that the indexes don't get out of hand.
> > But the partition management might be enough of a PITA to negate
> > any win.
> >
> > regards, tom lane
>
> Thanks for your feedback! This is just as I supposed, but I didn't
> have the Postgres experience to be certain.
> I'll include your conclusion in my report.
>
> Ciprian Craciun.
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

--
V S P
toreason@fastmail.fm
On Sun, Nov 23, 2008 at 3:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Ciprian Dorin Craciun (ciprian.craciun@gmail.com) wrote:
>> > Even better might be partitioning on the timestamp. If all access is
>> > in a certain timestamp range it's usually a big win, especially
>> > because he can move to a new table every hour / day / week or whatever
>> > and merge the old one into a big "old data" table.
>>
>> Yes, it would speed the inserts tremendously... I've tested it
>> and the insert speed is somewhere at 200k->100k.
>>
>> But unfortunately the query speed is not good at all because most
>> queries are for a specific client (and sensor) in a given time
>> range...
>
> Have you set up your partitions correctly (eg, with appropriate CHECK
> constraints and with constraint_exclusion turned on)? Also, you'd want
> to keep your indexes on the individual partitions, of course... That
> should improve query time quite a bit since it should only be hitting
> the partitions where the data might be.
>
> Stephen

    Well, now that I've read the previous two emails better, I understand what Scott and Stephen are talking about...

    So if I understood it correctly: I should build indexes only for certain parts of the data (like the previous full hour and so on). But I see a problem: wouldn't this lead to a lot of indices being created (24 / day, ~150 / week, ...)? Another question: wouldn't the index creation impact the insertion and query speed while the indexes are being created?

    Either way, I don't think this is a very easy to implement solution...

    Ciprian Craciun.
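For what it's worth, a minimal sketch of the setup Stephen describes, using the inheritance-style partitioning available in Postgres 8.3 (the child-table names and the timestamp boundaries here are hypothetical; the columns follow the benchmark schema):

```sql
-- Parent table: holds no data itself, queries go through it.
create table sds_benchmark_data (
    client    int4 not null,
    sensor    int4 not null,
    timestamp int4 not null,
    value     int4
);

-- One child per time range. The CHECK constraint tells the planner
-- exactly which timestamps this child can contain; each child keeps
-- its own (smaller) index.
create table sds_benchmark_data_p1 (
    check (timestamp >= 0 and timestamp < 1000000)
) inherits (sds_benchmark_data);

create index sds_benchmark_data_p1__idx
    on sds_benchmark_data_p1 (client, sensor, timestamp);

-- With constraint_exclusion on, the planner skips every child whose
-- CHECK constraint rules it out of the query's timestamp range.
set constraint_exclusion = on;

select min(value), max(value), avg(value)
  from sds_benchmark_data
 where client = 42 and sensor = 3
   and timestamp between 500000 and 600000;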
    Thanks for your info! Please see my observations below.

    By the way, we are planning to also try Informix (the time series extension?)... Do you have some other tips about Informix?

    Ciprian Craciun.

On Sun, Nov 23, 2008 at 6:06 PM, V S P <toreason@fastmail.fm> wrote:
> While most of my experience is with oracle/informix,
> I would also recommend:
>
> a) partitioning on the DB level.
> Put partitions on separate hard disks, have the system be
> at least dual core, and make the disks attached via a SCSI
> controller (not IDE) for parallel performance.

    This I found out, but for now I'm not able to change the disk layout...

> b) partitioning on the application level (that is, having
> the insert code dynamically figure out what DB and what table to go to).
> This complicates the application for inserts as well as for reports.

    We wanted to do this, and in this circumstance the winner for the moment is BerkeleyDB, as it's super fast for readings. (This is what's currently called sharding, right?)

> c) maybe there is a chance to remove the index (if all you are doing
> is inserts) -- and then recreate it later?

    Not possible, as we would like to use the same database (and table) for both inserts and real-time queries... Otherwise the application would become a lot more complicated...

> e) I did not see the type of index, but if the values of at least
> some of the indexed fields repeat a lot -- Oracle had what's called
> a 'bitmap index'.
> Postgresql might have something similar, where that type of index
> is optimized for the fact that values are the same for the majority
> of the rows (it becomes much smaller, and therefore quicker to update).

    For the moment the index type is the default one (btree), and from the documentation I didn't see another one matching the current usage.
> f) review that there are no insert triggers and
> constraints (either field or foreign) on those tables.
> If there are -- validate why they are there and see if they can
> be removed -- and the application would then need to guarantee
> correctness.

    Nope, no triggers or constraints (other than not null).

> VSP
>
> On Sun, 23 Nov 2008 08:34:57 +0200, "Ciprian Dorin Craciun"
> <ciprian.craciun@gmail.com> said:
>> On Sun, Nov 23, 2008 at 1:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > Alvaro Herrera <alvherre@commandprompt.com> writes:
>> >> The problem is, most likely, on updating the indexes. Heap inserts
>> >> should always take more or less the same time, but index insertion
>> >> requires walking down the index struct for each insert, and the path to
>> >> walk gets larger the more data you have.
>> >
>> > It's worse than that: his test case inserts randomly ordered keys, which
>> > means that there's no locality of access during the index updates. Once
>> > the indexes get bigger than RAM, update speed goes into the toilet,
>> > because the working set of index pages that need to be touched also
>> > is bigger than RAM. That effect is going to be present in *any*
>> > standard-design database, not just Postgres.
>> >
>> > It's possible that performance in a real-world situation would be
>> > better, if the incoming data stream isn't so random; but it's
>> > hard to tell about that with the given facts.
>> >
>> > One possibly useful trick is to partition the data by timestamp with
>> > partition sizes chosen so that the indexes don't get out of hand.
>> > But the partition management might be enough of a PITA to negate
>> > any win.
>> >
>> > regards, tom lane
>>
>> Thanks for your feedback! This is just as I supposed, but I didn't
>> have the Postgres experience to be certain.
>> I'll include your conclusion in my report.
>>
>> Ciprian Craciun.
> If you watch the speed, you'll see that the insert speed is the
> same, but the scan speed is worse (from 32k to 200).

    As I said, I don't know a lot about these things. But I would like someone to comment on this (so that maybe I will learn something!):

    1) I thought the poor insert performance was due to a lack of "locality of access" during the index updates, hence I thought that since the timestamp is always increasing, putting it as the first column in the index should give a better insert speed, but it didn't: why?

    2) I thought that given a query like:

    select * from taba where clientid=2 and sensor=4 and timestamp between 'start_t' and 'end_t'

    there shouldn't be a huge difference in speed between an index defined as (timestamp, clientid, sensorid) and another one defined as (clientid, sensor, timestamp), but I was VERY wrong: it's 1000 times worse. How is that possible?

    It's obvious I don't know how multicolumn indexes work... Can someone explain?
On Sun, Nov 23, 2008 at 10:01 AM, Ciprian Dorin Craciun <ciprian.craciun@gmail.com> wrote:
> On Sun, Nov 23, 2008 at 3:28 PM, Stephen Frost <sfrost@snowman.net> wrote:
>> Have you set up your partitions correctly (eg, with appropriate CHECK
>> constraints and with constraint_exclusion turned on)? Also, you'd want
>> to keep your indexes on the individual partitions, of course... That
>> should improve query time quite a bit since it should only be hitting
>> the partitions where the data might be.
>
> Well, now that I've read the previous two emails better, I
> understand what Scott and Stephen are talking about...
>
> So if I understood it correctly: I should build indexes only for
> certain parts of the data (like the previous full hour and so on). But I
> see a problem: wouldn't this lead to a lot of indices being created
> (24 / day, ~150 / week, ...)?

    No, that's not exactly what I'm talking about. I'm talking about pre-creating partitions that the data will soon go into (let's say a new one every hour) with indexes in place, and having a trigger that fires on insert to put the data into the right partition.
    Once that partition is no longer being inserted into, and we aren't running a bunch of queries on it, we migrate its data to a historical partition. So, your table looks something like this all the time:

|**|^^|##|##|##|##|$$$$$$$$$$$$$$...|

    Where:
    ** is a partition we have created in advance of needing it.
    ^^ is the partition we are currently writing to.
    ## are the partitions we're still using in select queries a lot.
    $$$... is the old data stuffed into the monolithic history table.

    When it's time to switch to writing to the new partition (i.e. **), we make a new one ahead of that, and the trigger starts writing to what was a ** partition but is now the new ^^, and the ^^ becomes a ##. At the end of the day / week / whatever, we take all the old ## partitions and move their data into the $$$ and drop the ## partitions.

    Note that we only need to put data into an archive partition to keep from having hundreds or thousands of partitions. There's a limit of a few hundred partitions where things start getting slow again due to planner overhead.
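The insert-routing trigger for such a rotating scheme could look roughly like this in 8.3-era plpgsql (partition names and the boundary value are hypothetical; the function has to be recreated, or made table-driven, each time the set of live partitions rotates):

```sql
-- Route each insert into the child table covering its timestamp.
create or replace function sds_benchmark_data_route()
returns trigger as $$
begin
    if new.timestamp >= 2000000 then
        -- current ("^^") partition
        insert into sds_benchmark_data_p2 values (new.*);
    else
        -- older ("##") partition
        insert into sds_benchmark_data_p1 values (new.*);
    end if;
    -- returning null keeps the row out of the parent table.
    return null;
end;
$$ language plpgsql;

create trigger sds_benchmark_data_route_trg
    before insert on sds_benchmark_data
    for each row execute procedure sds_benchmark_data_route();
```

Note that COPY into the parent fires this trigger once per row, so some of the bulk-load advantage is lost; an alternative is to COPY directly into the current child partition.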
> Yes, the figures are like this:
> * average number of raw inserts / second (without any optimization
> or previous aggregation): #clients (~ 100 thousand) * #sensors (~ 10)
> / 6 seconds = 166 thousand inserts / second...

This is average?
166 000 * 20 bytes per record * 86400 seconds per day = 280GB / day, not counting indices.

What is the time span you want to have the data from?

Greetings
Marcin
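Marcin's estimate can be checked directly (the 20 bytes per record is his assumption; an actual Postgres tuple of four int4 columns carries more per-row header overhead, so the real on-disk volume would be larger):

```sql
-- Back-of-the-envelope raw data volume per day.
-- 166 000 records/s * 20 bytes/record * 86 400 s/day:
select 166000::int8 * 20 * 86400 as bytes_per_day;
-- 286848000000, i.e. roughly the 280 GB/day figure quoted above
```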
On Mon, Nov 24, 2008 at 3:42 AM, marcin mank <marcin.mank@gmail.com> wrote:
>> Yes, the figures are like this:
>> * average number of raw inserts / second (without any optimization
>> or previous aggregation): #clients (~ 100 thousand) * #sensors (~ 10)
>> / 6 seconds = 166 thousand inserts / second...
>
> This is average?
> 166 000 * 20 bytes per record * 86400 seconds per day = 280GB / day,
> not counting indices.
>
> What is the time span you want to have the data from?
>
> Greetings
> Marcin

    Well, I'm not sure about the archival period... Maybe a day, maybe a week... For the moment I'm just struggling with the insert speed.

    (We could also use sharding -- horizontal partitioning on different machines -- and this would reduce the load...)

    Ciprian.