Discussion: using a postgres table as a multi-writer multi-updater queue

using a postgres table as a multi-writer multi-updater queue

From
Chris Withers
Date:
Hi All,

I wondered if any of you could recommend best practices for using a postgres table as a queue. Roughly speaking, 100-200 workers will vomit rows, at rates of a few hundred per second, into the table, leaving the status as new, and then as many workers as needed to keep up with the load will plough through the queue, changing the status to something other than new.

My naive implementation would be something along the lines of:

CREATE TABLE event (
    ts        timestamp,
    event     char(40),
    status    char(10),
    CONSTRAINT pkey PRIMARY KEY(ts, event)
);

...with writers doing INSERT or COPY to get data into the table and readers doing something like:
SELECT * FROM event WHERE status='new' LIMIT 1000 FOR UPDATE;
...so, grabbing batches of 1,000, working on them and then setting their status.

But, am I correct in thinking that SELECT FOR UPDATE will not prevent multiple workers selecting the same rows?

Anyway, is this approach reasonable? If so, what tweaks/optimisations should I be looking to make?

If it's totally wrong, how should I be looking to approach the problem?

cheers,

Chris

Re: using a postgres table as a multi-writer multi-updater queue

From
Ladislav Lenart
Date:
Hello.


On 23.11.2015 11:41, Chris Withers wrote:
> Hi All,
>
> I wondered if any of you could recommend best practices for using a postgres
> table as a queue. Roughly speaking, 100-200 workers will vomit rows and rates of
> a few hundres per second into the table leaving the status as new and then as
> many workers as needed to keep up with the load will plough through the queue
> changing the status to something other than new.
>
> My naive implementation would be something along the lines of:
>
> CREATE TABLE event (
>     ts        timestamp,
>     event     char(40),
>     status    char(10),
>     CONSTRAINT pkey PRIMARY KEY(ts, event)
> );
>
>
> ...with writers doing INSERT or COPY to get data into the table and readers
> doing something like:
>
> SELECT FOR UPDATE * FROM event WHERE status='new' LIMIT 1000;
>
> ...so, grabbing batches of 1,000, working on them and then setting their status.
>
> But, am I correct in thinking that SELECT FOR UPDATE will not prevent multiple
> workers selecting the same rows?
>
> Anyway, is this approach reasonable? If so, what tweaks/optimisations should I
> be looking to make?
>
> If it's totally wrong, how should I be looking to approach the problem?

I suggest an excellent read on this topic:

http://www.depesz.com/2013/08/30/pick-a-task-to-work-on/

Highly recommended if you haven't read it yet.

Also, if you aim at 9.5 (not yet released), it will introduce:

SELECT...
FOR UPDATE
SKIP LOCKED -- this is new

which supports exactly this use-case (i.e. to implement a job queue).
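
For example, a consumer might grab a batch like this (a sketch only, using the
table from the original post; the 'taken' status value and the batch size are
illustrative):

WITH batch AS (
    SELECT ts, event
      FROM event
     WHERE status = 'new'
     ORDER BY ts
     LIMIT 1000
     FOR UPDATE SKIP LOCKED
)
UPDATE event e
   SET status = 'taken'
  FROM batch b
 WHERE e.ts = b.ts AND e.event = b.event
RETURNING e.*;

Rows locked by other workers are simply skipped, so concurrent consumers no
longer block each other or grab the same rows.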


HTH,

Ladislav Lenart



Re: using a postgres table as a multi-writer multi-updater queue

From
Jim Nasby
Date:
On 11/23/15 6:12 AM, Ladislav Lenart wrote:
> I suggest an excellent read on this topic:
>
> http://www.depesz.com/2013/08/30/pick-a-task-to-work-on/
>
> Highly recommended if you haven't read it yet.

One thing it doesn't mention that you need to be aware of is the vacuum
workload on a queue table. In a busy queue, it will be difficult or even
impossible for vacuum to keep the amount of dead rows down to something
manageable. That's why PgQ and Slony don't even attempt it; instead,
they rotate through a fixed set of tables. Once all the entries in a
table have been processed, the table is truncated.

If you go the delete route, make sure you don't index any fields in the
queue that get updated (otherwise you won't get HOT updates), and run a
very aggressive manual vacuum so the table stays small.
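
For example (the values here are only illustrative, not recommendations),
per-table storage parameters let you make autovacuum much more aggressive on
the queue table and leave room on each page for HOT updates:

ALTER TABLE event SET (
    fillfactor = 50,                      -- leave page space for HOT updates
    autovacuum_vacuum_scale_factor = 0.0, -- trigger on an absolute threshold...
    autovacuum_vacuum_threshold = 1000,   -- ...of dead rows
    autovacuum_vacuum_cost_delay = 0      -- and let vacuum run at full speed
);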
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com


Re: using a postgres table as a multi-writer multi-updater queue

From
Andy Colson
Date:
On 11/23/2015 4:41 AM, Chris Withers wrote:
> Hi All,
>
> I wondered if any of you could recommend best practices for using a
> postgres table as a queue. Roughly speaking, 100-200 workers will vomit
> rows and rates of a few hundres per second into the table leaving the
> status as new and then as many workers as needed to keep up with the
> load will plough through the queue changing the status to something
> other than new.
>
> My naive implementation would be something along the lines of:
>
> CREATE TABLE event (
>      ts        timestamp,
>      event     char(40),
>      status    char(10),
>      CONSTRAINT pkey PRIMARY KEY(ts, event)
> );
>
>
> ...with writers doing INSERT or COPY to get data into the table and
> readers doing something like:
>
> SELECT FOR UPDATE * FROM event WHERE status='new' LIMIT 1000;
>
> ...so, grabbing batches of 1,000, working on them and then setting their
> status.
>
> But, am I correct in thinking that SELECT FOR UPDATE will not prevent
> multiple workers selecting the same rows?
>
> Anyway, is this approach reasonable? If so, what tweaks/optimisations
> should I be looking to make?
>
> If it's totally wrong, how should I be looking to approach the problem?
>
> cheers,
>
> Chris

Have you tried Redis?  It's really good at that sort of thing.

-Andy


Re: using a postgres table as a multi-writer multi-updater queue

From
John R Pierce
Date:
On 11/23/2015 2:41 AM, Chris Withers wrote:
>
> If it's totally wrong, how should I be looking to approach the problem?


depending on where these queue entries are coming from, I'd consider
using a message queueing system like AMS, MQSeries, etc., rather than
trying to use a relational database table as a queue. your external data
source(s) would write messages to this queue, and you'd have
'subscriber' processes that listen to the queue and process the
messages, inserting persistent data into the database as needed.



--
john r pierce, recycling bits in santa cruz



Re: using a postgres table as a multi-writer multi-updater queue

From
Merlin Moncure
Date:
On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 11/23/2015 2:41 AM, Chris Withers wrote:
>>
>>
>> If it's totally wrong, how should I be looking to approach the problem?
>
> depending on where these queue entries are coming from, I'd considering
> using a message queueing system like AMS, MQseries, etc, rather than trying
> to use a relational database table as a queue. your external data source(s)
> would write messages to this queue, and you'd have 'subscriber' processes
> that listen to the queue and process the messages, inserting persistent data
> into the database as needed.

I just don't agree with this generalization.  Keeping the state of the
queue in the database has a lot of advantages and is a lot easier to
deal with from a programming perspective especially if SQL is your
core competency.  Being able to produce and consume in SQL based on
other relational datasources is...elegant.

Specialized queue systems are a very heavy dependency and adding a new
server to your platform to manage queues is not something to take
lightly.  This advice also applies to scheduling systems like quartz,
specialized search like Solr and Elasticsearch, and distributed data
platforms like hadoop.  I've used all of these things and have tended
to wish I had just used the database instead in just about every case.

Also, check out pgq (https://wiki.postgresql.org/wiki/PGQ_Tutorial).
Personally, I tend to roll my own queues.   It's not difficult.

merlin


Re: using a postgres table as a multi-writer multi-updater queue

From
John R Pierce
Date:
On 11/23/2015 2:51 PM, Merlin Moncure wrote:
> On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce<pierce@hogranch.com>  wrote:
> >
> >depending on where these queue entries are coming from, I'd considering
> >using a message queueing system like AMS, MQseries, etc, rather than trying
> >to use a relational database table as a queue. your external data source(s)
> >would write messages to this queue, and you'd have 'subscriber' processes
> >that listen to the queue and process the messages, inserting persistent data
> >into the database as needed.
> I just don't agree with this generalization.  Keeping the state of the
> queue in the database has a lot of advantages and is a lot easier to
> deal with from a programming perspective especially if SQL is your
> core competency.  Being able to produce and consume in SQL based on
> other relational datasources is...elegant.

our whole system at $job is message based, as it's a distributed
system.    clients send messages to middleware servers that talk to the
database servers.    no one talks directly to the database, instead they
use messaging.   also the several databases in our core cluster talk to
each other with messaging, where the front end database publishes events
that the other reporting database servers subscribe to.   it's a very
powerful model for building complex distributed systems and maintaining
quite a lot of implementation flexibility, as the exact nature of the
schema only needs to be known by a few publisher and subscriber modules.



--
john r pierce, recycling bits in santa cruz



Re: using a postgres table as a multi-writer multi-updater queue

From
Tim Uckun
Date:
I'll add my two cents.....

I set up something similar a while ago. Here are my suggestions for what they are worth.

You don't ever want to delete from such a table, so you need to set up something which allows you to truncate the tables when you no longer need them.

One way to accomplish this is with rules (not triggers, rules are blazingly fast compared to triggers).    Set up a table inheritance scheme based on whatever search criteria you have (date, sequence etc).   Set up a cron job to create the child tables well ahead and to set up the rule redirecting the insert. For example, let's say you have a date partitioned table and you want to keep a table for every day.  Your cron job would run once a day and would create the next seven days' worth of tables (just in case the cron job fails to run for some reason) and would rewrite the rule to insert into the table with an if-then-else type of logic.  This method is preferable to the dynamic creation of the table name with string concatenation because again it's significantly faster.
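
As a rough illustration of the rule-based redirection (the table is the one
from the original post; the names and dates are made up, and in practice the
cron job would create one conditional rule per child table):

CREATE TABLE event_20151123 () INHERITS (event);

CREATE RULE event_insert_20151123 AS
    ON INSERT TO event
    WHERE NEW.ts >= '2015-11-23' AND NEW.ts < '2015-11-24'
    DO INSTEAD INSERT INTO event_20151123 VALUES (NEW.*);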

Another method I tried was to have one "primary" child table and "archival" child tables and insert directly into the primary child table.  For example, say you have a table called "Q".  You set up a table called Q_in which inherits from Q.  Your code inserts into the Q_in table, you select from the Q table.  On a periodic basis you do this:

BEGIN TRANSACTION;
LOCK TABLE Q_IN IN EXCLUSIVE MODE;
ALTER TABLE Q_IN RENAME TO Q_SOME_DATETIME;
CREATE TABLE Q_IN
  (LIKE Q_SOME_DATETIME INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;
-- Set some constraints so the query optimizer knows what to do
END TRANSACTION;


There is one other method which is the Kafka approach( You can use this in addition to the above methods)

Create N tables for the incoming queue, each with its own sequence for its ID number. N should be determined by how many clients you expect to run.  Create a rule which uses some hash function, round robin, or randomness to insert data into one of the tables.   Create a different table which keeps track of client connections.   The clients use this table to keep track of the last id fetched.  For example, let's say I have three types of processes that run on the incoming data: p1, p2, p3.  The table logs the highest ID fetched from each table for each process. When a client connects, it connects to the table with the lowest used ID for that process, and it locks it for that client (not process, because you can have multiple clients running each process), it processes the records, it updates the id, it unlocks the table and it backs off for a few seconds.  The next client which woke up goes through the same process and so on.  Both Apache Kafka and Amazon Kinesis use this approach.  One nice thing about this approach is that you can put each table in its own tablespace on its own disk for higher performance.
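
A sketch of the bookkeeping table described above (entirely illustrative, the
names are invented):

CREATE TABLE queue_consumer_state (
    process_name  text    NOT NULL,            -- p1, p2, p3 ...
    queue_table   text    NOT NULL,            -- which of the N incoming tables
    last_id       bigint  NOT NULL DEFAULT 0,  -- highest id fetched so far
    locked_by     text    NULL,                -- client currently working this table
    PRIMARY KEY (process_name, queue_table)
);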

One other thing: this is dangerous, but you can turn off logging of the tables, which will make the inserts much faster.
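
In PostgreSQL terms that means an unlogged table, which skips WAL (so inserts
are faster) at the price of losing its contents after a crash and not being
replicated, e.g.:

CREATE UNLOGGED TABLE q_in (LIKE q_some_datetime INCLUDING ALL);

(9.5 will also add ALTER TABLE ... SET UNLOGGED for existing tables.)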

Finally:

There is no need to do any of this.  Kinesis is cheap, Kafka is pretty awesome, Rabbit is crazy useful.  

Hope this helps, ping me offline if you want more details.

Cheers.


On Tue, Nov 24, 2015 at 11:51 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 11/23/2015 2:41 AM, Chris Withers wrote:
>>
>>
>> If it's totally wrong, how should I be looking to approach the problem?
>
> depending on where these queue entries are coming from, I'd considering
> using a message queueing system like AMS, MQseries, etc, rather than trying
> to use a relational database table as a queue. your external data source(s)
> would write messages to this queue, and you'd have 'subscriber' processes
> that listen to the queue and process the messages, inserting persistent data
> into the database as needed.

I just don't agree with this generalization.  Keeping the state of the
queue in the database has a lot of advantages and is a lot easier to
deal with from a programming perspective especially if SQL is your
core competency.  Being able to produce and consume in SQL based on
other relational datasources is...elegant.

Specialized queue systems are a very heavy dependency and adding a new
server to your platform to mange queues is not something to take
lightly.  This advice also applies to scheduling systems like quartz,
specialized search like solr and elastisearch, and distributed data
platforms like hadoop.  I've used all of these things and have tended
to wish I had just used the database instead in just about every case.

Also, check out pgq (https://wiki.postgresql.org/wiki/PGQ_Tutorial).
Personally, I tend to roll my own queues.   It's not difficult.

merlin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: using a postgres table as a multi-writer multi-updater queue

From
Jeff Janes
Date:
On Mon, Nov 23, 2015 at 2:41 AM, Chris Withers <chris@simplistix.co.uk> wrote:
> Hi All,
>
> I wondered if any of you could recommend best practices for using a postgres
> table as a queue. Roughly speaking, 100-200 workers will vomit rows and
> rates of a few hundres per second into the table leaving the status as new
> and then as many workers as needed to keep up with the load will plough
> through the queue changing the status to something other than new.

Is that several hundred per second per worker, or just several hundred
per second in total?

What happens if the worker finds the database has crashed when it goes
to insert the records?  That would determine what kind of
transactional system you need.  On the consumer side, what would
happen if a record were processed twice?


>
> My naive implementation would be something along the lines of:
>
> CREATE TABLE event (
>     ts        timestamp,
>     event     char(40),
>     status    char(10),
>     CONSTRAINT pkey PRIMARY KEY(ts, event)
> );

How long are you going to keep these records around once processed?

Unless you delete them immediately, you will probably want a partial
index on (ts) where status='new'
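
For example (the index name is arbitrary):

CREATE INDEX event_new_ts_idx ON event (ts) WHERE status = 'new';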

> ...with writers doing INSERT or COPY to get data into the table and readers
> doing something like:
>
> SELECT FOR UPDATE * FROM event WHERE status='new' LIMIT 1000;
>
> ...so, grabbing batches of 1,000, working on them and then setting their
> status.

Why 1000 at a time?  Also, you probably need an ORDER BY

> But, am I correct in thinking that SELECT FOR UPDATE will not prevent
> multiple workers selecting the same rows?

You are correct, but the way it will prevent multiple workers from
selecting them at the same time is that the next worker will block
until the first one commits.  You would either need to use SKIP
LOCKED in the 9.5 release, or use
pg_try_advisory_xact_lock on lower versions, to avoid that.
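
A rough sketch of the advisory-lock variant for pre-9.5 servers (the hash key
choice is illustrative only, and hashtext() is an internal, undocumented
function; any stable mapping to a lock key would do. See the depesz article
mentioned earlier in the thread for caveats, e.g. the planner may evaluate the
lock function on more rows than it returns):

SELECT *
  FROM event
 WHERE status = 'new'
   AND pg_try_advisory_xact_lock(hashtext(ts::text || event))
 ORDER BY ts
 LIMIT 1000;

The advisory locks are released automatically when the transaction ends.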

And, how do you unlock the rows?  There are two general approaches.
One is to lock the row using PostgreSQL's FOR UPDATE type locks, and
hold the transaction open while processing, then updating the row to
mark it done and committing to release the lock.  This cleans up after
itself if a worker crashes, but there is no visibility into
what is going on.

The other is to claim the row for the worker by updating a status
field (for example, to have a hostname and pid), and committing that.
And then doing the processing, then updating it again to set it as
done, and committing that.  An abnormally terminated worker will need to
have someone or something clean up after it, but it gives you much
better visibility into what is happening.
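
A sketch of that second style against the table from the original post (the
status values are illustrative; note that the claiming UPDATE can still block
briefly against another worker doing the same thing, as described above):

-- claim up to 1000 rows for this worker
UPDATE event
   SET status = 'host1:4242'
 WHERE status = 'new'
   AND (ts, event) IN (
        SELECT ts, event
          FROM event
         WHERE status = 'new'
         LIMIT 1000);

-- ... process them, then mark them done
UPDATE event
   SET status = 'done'
 WHERE status = 'host1:4242';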

>
> Anyway, is this approach reasonable? If so, what tweaks/optimisations should
> I be looking to make?

The best optimization would be to not do it at all.  Why can't the 200
inserting workers just do the work themselves immediately (either
client side or server side), instead of queuing it?  If there is any
possibility of making the processing fast enough to do it that way,
I'd surely spend my time optimizing the actual work, rather than
optimizing a queuing system.

And depending on what failure modes you can tolerate, consider a
best-effort dedicated queuing system rather than a perfectly ACID one
built on PostgreSQL.

Cheers,

Jeff


Re: using a postgres table as a multi-writer multi-updater queue

From
"Steve Petrie, P.Eng."
Date:

(Kindly forgive my top-post. My stupid email client software (Microsoft Outlook Express on Win XP) refuses to respect its own "Reply" option settings for inline reply text. I've been looking for a replacement email client but so far without success.)
 
* * *
* * *
 
> You don't ever want to delete from such a table so you need to set up something which allows you to truncate the tables when you no longer need them.
 
I am migrating a web PHP application (called ITS-ETO) from mysql to postgres. The app INSERTs a row into a postgres table <eto_sql_tb_session_www> to manage each session with a web browser. Here is the DDL for the session table:
CREATE TABLE its_eto.eto_sql_tb_session_www
(
  session_www_code       char(32)      NOT NULL UNIQUE PRIMARY KEY,
 
  session_www_type       int           NOT NULL,
  session_www_state      int           NOT NULL,
  session_verify_code    char(7)       NOT NULL,
 
  session_www_serno      SERIAL        NOT NULL UNIQUE,
 
  session_target_serno   int           NULL,
  session_target_data    varchar(1000) NULL,
 
  session_www_init_utc   timestamp     NOT NULL,
  session_www_last_utc   timestamp     NOT NULL,
  session_www_expiry_utc timestamp     NOT NULL,
  session_www_delete_utc timestamp     NOT NULL,
  session_www_hit_count  int           NOT NULL,
  session_www_act_seqno  int           NULL
 
);
 
CREATE INDEX ON its_eto.eto_sql_tb_session_www (session_www_serno);
Using a "fuzzy" probability mechanism, some randomly-selected fraction of the HTTP requests that initiate a new session, also SELECT and DELETE expired rows from the session table. I naively assumed that the database server would automatically recycle the storage space dynamically released in this way.
 
Now, I'm reading in this forum that in fact, postgres does not efficiently automatically recycle storage space released by row DELETion.
 
* * *
* * *
 
My application is quite simple and will be supporting a modest workload, using a small amount of storage space, compared to the massive transaction rates and gigantic space usages, I'm reading about in this forum.
 
I do have the luxury of being able to shut down the application for a few minutes periodically e.g every 24 hours.
 
My plan was always to avoid eventual exhaustion of the SERIAL sequence number integer range, by swapping in a freshly truncated postgres <eto_sql_tb_session_www> table during the periodic app shutdown.
 
Before going online, the freshly truncated postgres <eto_sql_tb_session_www> table will receive INSERTs of any active session rows copied over from the old <eto_sql_tb_session_www> table. The copied session rows will get new sequence numbers, but that won't matter, because a session row is referenced within each incoming HTTP request, not by its row serial number column <session_www_serno> integer value, but by a randomly-generated (MD5) 32-character unique key column <session_www_code> value. 
 
So my question to this postgres forum is -- should I just remove from the online app the "fuzzy" probability mechanism, that DELETEs expired rows from the session table -- because the postgres server is not going to dynamically recycle the released storage space anyway?
 
Any comments appreciated.
 
Steve
 
----- Original Message -----
From: Tim Uckun
Sent: Wednesday, November 25, 2015 3:50 PM
Subject: Re: [GENERAL] using a postgres table as a multi-writer multi-updater queue

I'll add my two cents.....

I set up something similar a while ago. Here are my suggestions for what they are worth.

You don't ever want to delete from such a table so you need to set up something which allows you to truncate the tables when you no longer need them.  

One way to accomplish this is with rules (not triggers, rules are blazingly fast compared to triggers).    Set up a table inheritance scheme based on whatever search criteria you have (date, sequence etc).   Set up a cron job to create the child tables well ahead and to set up the rule redirecting the insert. For example let's say you have a date partitioned table and you want to  keep a table for every day.  Your cron job would run once a day and would create the next seven days worth of tables (just incase the cron job fails to run for some reason) and would rewrite the rule to insert into the table with a if then else type of logic.  This method is preferable to the dynamic creation of the table name with string concatenation because again it's significantly faster.  

Another method I tried was to have one "primary" child table and "archival" child tables and insert directly into the primal child table.  For example say you have a table called "Q".  You set up a table called Q_in which inherits from Q.  Your code inserts into the Q_in table, you select from the Q table.  On a periodic basis you do this 

BEGIN TRANSACTION
LOCK TABLE Q_IN IN EXCLUSIVE MODE;
ALTER TABLE Q_IN RENAME TO Q_SOME_DATETIME;
CREATE TABLE Q_IN
  (LIKE Q_SOME_DATETIME INCLUDING DEFAULTS INCLUDING CONSTRAINTS INCLUDING INDEXES);
ALTER SEQUENCE q_in_id_seq OWNED BY q_in.id;
 -- Set some constraints so the query optimizer knows what to do
END TRANSACTION


There is one other method which is the Kafka approach( You can use this in addition to the above methods)

Create N tables for incoming queue, each one has a sequence for their ID number. N should be determined by how many clients you expect to run.  Create a rule which uses some hash function or round robin or randomly to insert data into one of the tables.   Create a different table which keeps track of client connections.   The clients use this table to keep track of the last id fetched.  For example let's have I have three types of processes that run on the incoming data p1,p2, p3  The table logs the highest ID fetched from each table for each process. When a client connects it connects to the table with the lowest used ID for that process,  and it locks it for that client (not process because you can multiple clients running each process), it processes the records, it updates the id, it unlocks the table and it backs off for a few seconds.  The next client which woke up goes through the same process and so on.  Both Apache Kafka and Amazon kinesis use this approach.  One nice thing about this approach is that you can put each table in it's own tablespace in it's own disk for higher performance.

One other thing. This is dangerous but you can turn off logging of the tables, this will make the inserts much faster.

Finally:

There is no need to do any of this.  Kinesis is cheap, Kafka is pretty awesome, Rabbit is crazy useful.  

Hope this helps, ping me offline if you want more details.

Cheers.


On Tue, Nov 24, 2015 at 11:51 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
On Mon, Nov 23, 2015 at 12:54 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 11/23/2015 2:41 AM, Chris Withers wrote:
>>
>>
>> If it's totally wrong, how should I be looking to approach the problem?
>
> depending on where these queue entries are coming from, I'd considering
> using a message queueing system like AMS, MQseries, etc, rather than trying
> to use a relational database table as a queue. your external data source(s)
> would write messages to this queue, and you'd have 'subscriber' processes
> that listen to the queue and process the messages, inserting persistent data
> into the database as needed.

I just don't agree with this generalization.  Keeping the state of the
queue in the database has a lot of advantages and is a lot easier to
deal with from a programming perspective especially if SQL is your
core competency.  Being able to produce and consume in SQL based on
other relational datasources is...elegant.

Specialized queue systems are a very heavy dependency and adding a new
server to your platform to mange queues is not something to take
lightly.  This advice also applies to scheduling systems like quartz,
specialized search like solr and elastisearch, and distributed data
platforms like hadoop.  I've used all of these things and have tended
to wish I had just used the database instead in just about every case.

Also, check out pgq (https://wiki.postgresql.org/wiki/PGQ_Tutorial).
Personally, I tend to roll my own queues.   It's not difficult.

merlin


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: using a postgres table as a multi-writer multi-updater queue

From
Jeff Janes
Date:
On Wed, Nov 25, 2015 at 8:39 PM, Steve Petrie, P.Eng.
<apetrie@aspetrie.net> wrote:
>
>> You don't ever want to delete from such a table so you need to set up
>> something which allows you to truncate the tables when you no longer need
>> them.
>
> I am migrating a web PHP application (called ITS-ETO) from mysql to
> postgres. The app INSERTs a row into a postgres table
> <eto_sql_tb_session_www> to manage each session with a web browser. Here is
> the DDL for the session table:
>
> CREATE TABLE its_eto.eto_sql_tb_session_www
> (
>   session_www_code       char(32)      NOT NULL UNIQUE PRIMARY KEY,
>
>   session_www_type       int           NOT NULL,
>   session_www_state      int           NOT NULL,
>   session_verify_code    char(7)       NOT NULL,
>
>   session_www_serno      SERIAL        NOT NULL UNIQUE,
>
>   session_target_serno   int           NULL,
>   session_target_data    varchar(1000) NULL,
>
>   session_www_init_utc   timestamp     NOT NULL,
>   session_www_last_utc   timestamp     NOT NULL,
>   session_www_expiry_utc timestamp     NOT NULL,
>   session_www_delete_utc timestamp     NOT NULL,
>   session_www_hit_count  int           NOT NULL,
>   session_www_act_seqno  int           NULL
>
> );
>
> CREATE INDEX ON its_eto.eto_sql_tb_session_www (session_www_serno);
>
> Using a "fuzzy" probability mechanism, some randomly-selected fraction of
> the HTTP requests that initiate a new session, also SELECT and DELETE
> expired rows from the session table. I naively assumed that the database
> server would automatically recycle the storage space dynamically released in
> this way.
>
> Now, I'm reading in this forum that in fact, postgres does not efficiently
> automatically recycle storage space released by row DELETion.

> My application is quite simple and will be supporting a modest workload,
> using a small amount of storage space, compared to the massive transaction
> rates and gigantic space usages, I'm reading about in this forum.


Truncation is far more efficient than deletion + vacuuming.  If you
are running on the edge of your hardware's capabilities, this
efficiency is important.  But if you are not on the edge, then it is
not worth worrying about.  Just make sure your autovacuum settings are
at least as aggressive as the default settings.


>
> I do have the luxury of being able to shut down the application for a few
> minutes periodically e.g every 24 hours.
>
> My plan was always, to avoid eventual exhaustion of the SERIAL sequence
> number integer value series, by swapping in during the periodic app
> shutdown, a freshly truncated postgres <eto_sql_tb_session_www> table.

I'd make the serial column a bigserial, and then forget about it.
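
If the table already exists, that is a one-time column type change, e.g.:

ALTER TABLE its_eto.eto_sql_tb_session_www
  ALTER COLUMN session_www_serno TYPE bigint;

(It rewrites the table, but that is cheap at a modest size; the sequence
behind a serial column is already 64-bit, so nothing else needs to change.)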

Cheers,

Jeff


Re: using a postgres table as a multi-writer multi-updater queue

From
George Neuner
Date:
On Wed, 25 Nov 2015 23:39:39 -0500, "Steve Petrie, P.Eng."
<apetrie@aspetrie.net> wrote:


>Now, I'm reading in this forum that in fact, postgres does not efficiently
>automatically recycle storage space released by row DELETion.

Yes and no.

Space resulting from deleted rows is not given back to the OS unless
you perform a VACUUM FULL.  A plain (auto)vacuum marks the space used
by dead rows as reusable within the table, so new insertions can fill it.

If the table is heavily used, you will reuse the free space quickly
anyway.  The problem you may run into is needing to autovacuum too
often to control the size of the table.

Space from truncated or dropped tables is immediately given back to
the OS.

The issue with heavy deletion usually is performance.  Deleting scans
the rows and fires any relevant triggers ... truncating or dropping
the table does not.


>My plan was always, to avoid eventual exhaustion of the SERIAL
>sequence number integer value series, by swapping in during the
>periodic app shutdown, a freshly truncated postgres
> <eto_sql_tb_session_www> table.

Is there a logical problem with letting the sequence wrap around?


>So my question to this postgres forum is -- should I just remove from
>the online app the "fuzzy" probability mechanism, that DELETEs expired
>rows from the session table -- because the postgres server is not going
>to dynamically recycle the released storage space anyway?

I'm not sure I understand the reason for "fuzzy" deletion.  There are
a number of timestamps in your data ... is it not possible to delete
deterministically based on one of them?


Hope this helps,
George

Re: using a postgres table as a multi-writer multi-updater queue

From
George Neuner
Date:
On Wed, 25 Nov 2015 23:39:39 -0500, "Steve Petrie, P.Eng."
<apetrie@aspetrie.net> wrote:

>My stupid email client software (Microsoft Outlook Express on Win XP)
>refuses to respect its own "Reply" option settings for inline reply text.
>I've been looking for a replacement email client but so far without success.)

Without further comment about a 15 year old, unsupported OS ...

My vote for an email client would be Thunderbird.  It runs on XP or
higher and you can import Outlook's PST files so as to keep your mail
archives.  Importing PST files directly requires Outlook be available
on the same system [there is also a less friendly way to do it via EML
files exported from Outlook where Outlook is not on the same system].

Thunderbird directly supports net news, so you don't have to get news
mixed with your mail (unless you want to).


Hope this helps,
George

Re: using a postgres table as a multi-writer multi-updater queue

From
"Steve Petrie, P.Eng."
Date:
Thanks to Jeff for the helpful response. My remarks are below.

----- Original Message -----
From: "Jeff Janes" <jeff.janes@gmail.com>
To: "Steve Petrie, P.Eng." <apetrie@aspetrie.net>
Cc: "Tim Uckun" <timuckun@gmail.com>; "Merlin Moncure" <mmoncure@gmail.com>;
"John R Pierce" <pierce@hogranch.com>; "PostgreSQL General"
<pgsql-general@postgresql.org>
Sent: Thursday, November 26, 2015 2:07 AM
Subject: Re: [GENERAL] using a postgres table as a multi-writer
multi-updater queue


> On Wed, Nov 25, 2015 at 8:39 PM, Steve Petrie, P.Eng.
> <apetrie@aspetrie.net> wrote:
>>
>>> You don't ever want to delete from such a table so you need to set up
>>> something which allows you to truncate the tables when you no longer
>>> need
>>> them.
...
> Truncation is far more efficient than deletion + vacuuming.  If you
> are running on the edge of your hardware's capabilities, this
> efficiency is important.  But if you are not on the edge, then it is
> not worth worrying about.

This is what I was hoping to learn -- because I doubt my app workload will
ever approach the edge of hardware capability.

> Just make sure your autovacuum settings are
> at least as aggressive as the default settings.
>

I'll keep that in mind. And by a happy coincidence, in another recent forum
thread (26 Nov 2015, subject "Re: [GENERAL] Re: [webmaster] How to
commit/vacuum a batch of delete statements in a postgresql function") there
is advice from Adrian Klaver about the need to execute VACUUM outside of
a transaction block.

>> My plan was always, to avoid eventual exhaustion of the SERIAL sequence
>> number integer value series, by swapping in during the periodic app
>> shutdown, a freshly truncated postgres <eto_sql_tb_session_www> table.
>
> I'd make the serial column and bigserial, and then forget about it.

I considered using bigint, but decided against doing so for three reasons.
1. int rep is already going to be way more precision than is needed, 2.
avoid the extra resource consumption incurred by bigint as compared to int,
and 3. keep strictly 32-bit compatible with my 32-bit Win XP / PHP
development system. In fact the PHP app is programmed to terminate
abnormally on detecting a postgres SERIAL sequence number that exceeds the
maximum positive value of a strictly 32-bit signed integer.

>
> Cheers,
>
> Jeff

Steve Petrie, P.Eng.

ITS-ETO Consortium
Oakville, Ontario, Canada
apetrie@aspetrie.net



Re: using a postgres table as a multi-writer multi-updater queue

From
"Steve Petrie, P.Eng."
Date:
Thanks to George for the helpful comments.  My remarks are below.

"George Neuner" <gneuner2@comcast.net> wrote in message
news:gvad5bllba9slstdhkn6ql2jbplgd78p98@4ax.com...
> On Wed, 25 Nov 2015 23:39:39 -0500, "Steve Petrie, P.Eng."
> <apetrie@aspetrie.net> wrote:
>
>
>

>>My plan was always, to avoid eventual exhaustion of the SERIAL
>>sequence number integer value series, by swapping in during the
>>periodic app shutdown, a freshly truncated postgres
>> <eto_sql_tb_session_www> table.
>
> Is there a logical problem with letting the sequence wrap around?
>

I can't think of any reason why letting the sequence wrap would be a logical
problem, but for operational simplicity I prefer to reset the sequence
number back to one, during each brief daily postgres server shutdown window.

I don't relish the challenge of mentally grappling with gigantic session row
sequence numbers, when I'm investigating some future operational anomaly.

>
> I'm not sure I understand the reason for "fuzzy" deletion.  There are
> a number of timestamps in your data ... is it not possible to delete
> deterministically based on one of them?
>

You are correct. The timestamps are there in the session row, and the PHP
app's session row deletion process does use them to select rows for
deletion. There is also a decrementing integer "hit count" limit column
(presently initialized == 25) that kills a session when its "budget" of HTTP
requests is used up.

I should have clarified that I use the term "fuzzy" to refer to the
probability mechanism, that hooks a small burst of session row deletion
activity, to each one of a randomly-selected portion (presently 1/5) of the
HTTP requests that cause a new session row to be INSERTed into the session
table.

This means that on average, only every 5th HTTP request that creates a new
session row, will also incur the session deletion workload. When the session
row deletion process occurs, its (aggressive) limit for deletion workload is
2X as many expired rows as needed on average, to keep up with the rate of
session row creation (so the 2X DELETE limit presently == 10 rows).
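
(For reference, since DELETE has no LIMIT clause, the bounded delete can be
written with the limit in a sub-select; the column names are from the DDL
posted earlier, the 10 is the 2X budget mentioned above, and which timestamp
to test is of course up to the app:)

DELETE FROM its_eto.eto_sql_tb_session_www
 WHERE session_www_code IN (
        SELECT session_www_code
          FROM its_eto.eto_sql_tb_session_www
         WHERE session_www_expiry_utc < (now() AT TIME ZONE 'UTC')
         LIMIT 10);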

The idea is to make the process of DELETing expired session rows,
automatically scale its activity, to closely and aggressively match the rate
of session row creation. A heavy burst of new sessions being created will
work proportionately more aggressively to DELETE expired session rows. This
way, the (HTTP request-driven) PHP app will be self-tuning its own session
table space recycling. And there is no process (e.g. cron-driven),
external to the PHP app itself, that is doing session row deletion.

Based on what I've learned from this forum (but before I have studied
AUTOVACUUM in any detail) my thinking is to include an AUTOVACUUM command
(outside of any SQL transaction block) in the HTTP request-driven PHP app,
immediately following any time the PHP app completes a session row DELETE
command.

Or maybe the AUTOVACUUM request should occur less frequently?

>
> Hope this helps,
> George
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Steve Petrie, P.Eng.

ITS-ETO Consortium
Oakville, Ontario, Canada
apetrie@aspetrie.net



Re: using a postgres table as a multi-writer multi-updater queue

From
"Steve Petrie, P.Eng."
Date:
Thanks to George for the Thunderbird email client recommendation. My remarks
are below.

"George Neuner" <gneuner2@comcast.net> wrote in message
news:kaed5btl92qr4v8ndevlgtv0f28qaaeju7@4ax.com...
> On Wed, 25 Nov 2015 23:39:39 -0500, "Steve Petrie, P.Eng."
> <apetrie@aspetrie.net> wrote:
>
>>My stupid email client software (Microsoft Outlook Express on Win XP)
>>refuses to respect its own "Reply" option settings for inline reply text.
>>I've been looking for a replacement email client but so far without
>>success.)
>
> Without further comment about a 15 year old, unsupported OS ...
>

(Suddenly, my MS Outlook Express is letting me do inline comments.)

George, your self-restraint is admirable :) And my plan is to move from Win
XP to Linux in the (somewhat near) future.

Before I lose all credibility with this excellent forum -- be assured that
the PHP website app (not yet online) I'm migrating to postgres from mysql,
will ABSOLUTELY NOT be running in production under any kind of Microsoft
server software. Right now I'm planning to use as a server o/s, DragonFlyBSD
with its robust HAMMER filesystem.


> My vote for an email client would be Thunderbird.  It runs on XP or
> higher and you can import Outlook's PST files so as to keep your mail
> archives.  Importing PST files directly requires Outlook be available
> on the same system [there is also a less friendly way to do it via EML
> files exported from Outlook where Outlook is not on the same system].
>

It's a common misconception that MS Outlook Express is compatible with MS
Outlook. But in fact the two products are architecturally unrelated. Much
the same way that Java and Javascript are unrelated.

MS OE does not use .PST files, but there are open source utilities that will
extract the contents of MS OE mail folders for migration to alternate email
clients.

I am considering Thunderbird as an MS OE replacement, but my understanding
is that Mozilla has abandoned all but security-related support for
Thunderbird. I have been kicking the (email client functionality) tires of
SeaMonkey under my Win XP. I believe that much of SeaMonkey is built on a
Mozilla code base.

>
>
> Hope this helps,
> George
>
>
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

--
Steve Petrie, P.Eng.

ITS-ETO Consortium
Oakville, Ontario, Canada
(905) 847-3253
apetrie@aspetrie.net



Re: using a postgres table as a multi-writer multi-updater queue

From
Adrian Klaver
Date:
On 11/29/2015 02:04 AM, Steve Petrie, P.Eng. wrote:
> Thanks to George for the Thuderbird email client recommendation. My remarks
> are below.
>
> "George Neuner" <gneuner2@comcast.net> wrote in message
> news:kaed5btl92qr4v8ndevlgtv0f28qaaeju7@4ax.com...
>> On Wed, 25 Nov 2015 23:39:39 -0500, "Steve Petrie, P.Eng."
>> <apetrie@aspetrie.net> wrote:
>>
>>> My stupid email client software (Microsoft Outlook Express on Win XP)
>>> refuses to respect its own "Reply" option settings for inline reply
>>> text.
>>> I've been looking for a replacement email client but so far without
>>> success.)
>>
>> Without further comment about a 15 year old, unsupported OS ...
>>
>
> (Suddenly, my MS Outlook Express is letting me do inline comments.)
>
> George, your self-restraint is admirable :) And my plan is to move from Win
> XP to Linux in the (somewhat near) future.
>
> Before I lose all credibility with this excellent forum -- be assured that
> the PHP website app (not yet online) I'm migrating to postgres from mysql,
> will ABSOLUTELY NOT be running in prodution under any kind of Microsoft
> server software. Right now I'm planning to use as a server o/s,
> DragonFlyBSD
> with its robust HAMMER filesystem.
>
>
>> My vote for an email client would be Thunderbird.  It runs on XP or
>> higher and you can import Outlook's PST files so as to keep your mail
>> archives.  Importing PST files directly requires Outlook be available
>> on the same system [there is also a less friendly way to do it via EML
>> files exported from Outlook where Outlook is not on the same system].
>>
>
> It's a common misconception that MS Outlook Express is compatible with MS
> Outlook. But in fact the two products are architecturally unrelated. Much
> the same way that Java and Javascript are unrelated.
>
> MS OE does not use .PST files, but there are open source utilities that
> will
> extract the contents of MS OE mail folders for migration to alternate email
> clients.
>
> I am considering Thunderbird as an MS OE replacement, but my understanding
> is that Mozilla has abandoned all but security-related support for
> Thundebird. I have been kicking the (email client functionality) tires of
> SeaMonkey under my Win XP. I believe that much of SeaMonkey is built on a
> Mozilla code base.

My take is that SeaMonkey is Thunderbird with a bunch of unneeded features. I
use Thunderbird on a variety of machines and it works just fine. To me
an email client is a solved problem and security fixes are all that are
required; still, Mozilla is making changes. For the latest version 38:

https://support.mozilla.org/en-US/kb/new-thunderbird-38
>
>>
>>
>> Hope this helps,
>> George
>>
>>
>>
>> --
>> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-general
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: using a postgres table as a multi-writer multi-updater queue

From
Adrian Klaver
Date:
On 11/29/2015 01:59 AM, Steve Petrie, P.Eng. wrote:
> Thanks to Jeff for the helpful response. My remarks are below.
>
> ----- Original Message ----- From: "Jeff Janes" <jeff.janes@gmail.com>
> To: "Steve Petrie, P.Eng." <apetrie@aspetrie.net>
> Cc: "Tim Uckun" <timuckun@gmail.com>; "Merlin Moncure"
> <mmoncure@gmail.com>;
> "John R Pierce" <pierce@hogranch.com>; "PostgreSQL General"
> <pgsql-general@postgresql.org>
> Sent: Thursday, November 26, 2015 2:07 AM
> Subject: Re: [GENERAL] using a postgres table as a multi-writer
> multi-updater queue
>
>
>> On Wed, Nov 25, 2015 at 8:39 PM, Steve Petrie, P.Eng.
>> <apetrie@aspetrie.net> wrote:
>>>
>>>> You don't ever want to delete from such a table so you need to set up
>>>> something which allows you to truncate the tables when you no longer
>>>> need
>>>> them.
> ...
>> Truncation is far more efficient than deletion + vacuuming.  If you
>> are running on the edge of your hardware's capabilities, this
>> efficiency is important.  But if you are not on the edge, then it is
>> not worth worrying about.
>
> This is what I was hoping to learn -- because I doubt my app workload will
> ever approach the edge of hardware capability.
>
>> Just make sure your autovacuum settings are
>> at least as aggressive as the default settings.
>>
>
> I'll keep that in mind. And by a happy coincidence, in another recent forum
> thread (26 Nov 2015, subject "Re: [GENERAL] Re: [webmaster] How to
> commit/vacuum a batch of delete statements in a postgresql function") there
> is advice from Adrian Klaver to about the need to execute VACUUM outside of
> a transaction block.

Just to be clear my reference was to someone running VACUUM manually.
Jeff's reference was to the autovacuum daemon, which runs VACUUM and
ANALYZE as background processes:

http://www.postgresql.org/docs/9.4/interactive/routine-vacuuming.html#AUTOVACUUM

>
>>> My plan was always, to avoid eventual exhaustion of the SERIAL sequence
>>> number integer value series, by swapping in during the periodic app
>>> shutdown, a freshly truncated postgres <eto_sql_tb_session_www> table.
>>
>> I'd make the serial column and bigserial, and then forget about it.
>
> I considered using bigint, but decided against doing so for three reasons.
> 1. int rep is already going to be way more precision than is needed, 2.
> avoid the extra resource consumption incurred by bigint as compared to int,
> and 3. keep strictly 32-bit compatible with my 32-bit Win XP / PHP
> development system. In fact the PHP app is programmed to terminate
> abnormally on detecting a postgres SERIAL sequence number that exceeds the
> maximum positive value of a strictly 32-bit signed integer.
>
>>
>> Cheers,
>>
>> Jeff
>
> Steve Petrie, P.Eng.
>
> ITS-ETO Consortium
> Oakville, Ontario, Canada
> apetrie@aspetrie.net
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: using a postgres table as a multi-writer multi-updater queue

From
George Neuner
Date:
On Sun, 29 Nov 2015 05:04:42 -0500, "Steve Petrie, P.Eng."
<apetrie@aspetrie.net> wrote:

>"George Neuner" <gneuner2@comcast.net> wrote in message
>news:kaed5btl92qr4v8ndevlgtv0f28qaaeju7@4ax.com...
>
>> My vote for an email client would be Thunderbird.  It runs on XP or
>> higher and you can import Outlook's PST files so as to keep your mail
>> archives.  Importing PST files directly requires Outlook be available
>> on the same system [there is also a less friendly way to do it via EML
>> files exported from Outlook where Outlook is not on the same system].
>>
>
>It's a common misconception that MS Outlook Express is compatible with MS
>Outlook. But in fact the two products are architecturally unrelated.

My understanding was that OE was based on the old (Win9x) Outlook.  I
know it isn't the same as the "enterprise" version.

I wasn't aware that OE used a different file format.  But, AFAIK, it
does still export EML files, so you can move your mailboxes into
Thunderbird (or whatever).


>I am considering Thunderbird as an MS OE replacement, but my understanding
>is that Mozilla has abandoned all but security-related support for
>Thundebird. I have been kicking the (email client functionality) tires of
>SeaMonkey under my Win XP. I believe that much of SeaMonkey is built on a
>Mozilla code base.

Yes and no.  Mozilla has Thunderbird on a slow development track.  It
does occasionally get new features, but mostly now by having some very
popular extension become built in.

Seamonkey was a fork from a discontinued Mozilla application suite. It
is not a Mozilla project, although it does incorporate Mozilla code
from Firefox and Thunderbird.

The problem I have with Seamonkey is that it tries to be all things to
all web users.  "Jack of all trades, master of none" is a truism.
YMMV, but I would rather have very reliable purpose specific tools
than an integrated suite which may do more but be less reliable
overall.

I'm not knocking Seamonkey per se - it seems to be quite well done -
I'm just making a general observation re: integrated application
suites.  Netscape failed in part because it bit off too much, trying
to do mail and news on top of the browser [and not doing them well - I
loved the Netscape browser, but its mail and news interface was just
bad].  Mozilla discontinued its web application suite because too few
people wanted it.

George

Re: using a postgres table as a multi-writer multi-updater queue

From
Magnus Hagander
Date:


On Nov 29, 2015 18:34, "George Neuner" <gneuner2@comcast.net> wrote:
>
> On Sun, 29 Nov 2015 05:04:42 -0500, "Steve Petrie, P.Eng."
> <apetrie@aspetrie.net> wrote:
>
> >"George Neuner" <gneuner2@comcast.net> wrote in message
> >news:kaed5btl92qr4v8ndevlgtv0f28qaaeju7@4ax.com...
> >
> >> My vote for an email client would be Thunderbird.  It runs on XP or
> >> higher and you can import Outlook's PST files so as to keep your mail
> >> archives.  Importing PST files directly requires Outlook be available
> >> on the same system [there is also a less friendly way to do it via EML
> >> files exported from Outlook where Outlook is not on the same system].
> >>
> >
> >It's a common misconception that MS Outlook Express is compatible with MS
> >Outlook. But in fact the two products are architecturally unrelated.
>
> My understanding was that OE was based on the old (Win9x) Outlook.  I
> know it isn't the same as the "enterprise" version.

This is fantastically off topic but no, it was not. OE was based on the old "Internet Mail and News".  The actual Outlook product has always been separate.

/Magnus

Re: using a postgres table as a multi-writer multi-updater queue

From
George Neuner
Date:
On Sun, 29 Nov 2015 05:02:58 -0500, "Steve Petrie, P.Eng."
<apetrie@aspetrie.net> wrote:

>I should have clarified, that I use the the term "fuzzy" to refer to the
>probability mechanism, that hooks a small burst of session row deletion
>activity, to each one of a randomly-selected portion (presently 1/5) of the
>HTTP requests that cause a new session row to be INSERTed into the session
>table.
>
>This means that on average, only every 5th HTTP request that creates a new
>session row, will also incur the session deletion workload. When the session
>row deletion process occurs, its (aggressive) limit for deletion workload is
>2X as many expired rows as needed on average, to keep up with the rate of
>session row creation (so the 2X DELETE limit presently == 10 rows).
>
>The idea is to make the process of DELETing expired session rows,
>automatically scale its activity, to closely and aggressively match the rate
>of session row creation.

There's nothing really wrong with that, but I wouldn't do it that way
... I would bulk delete old records from a separate scheduled task.

Another way to do it would be to have each new session delete exactly
one old session.  1:1 scales perfectly and spreads the delete load
evenly across all users.

Not that deleting a handful of records is a lengthy process, but it
seems unfair to burden some users with it but not others.  I would
burden (or not) everyone equally.


>A heavy burst of new sessions being created will
>work proportionately more aggressively to DELETE expired session rows. This
>way, the (HTTP request-driven) PHP app will be self-tuning its own session
>table space recycling. And there is no process (e.g. cron-driven),
>external to the PHP app itself, that is doing session row deletion.
>
>Based on what I've learned from this forum (but before I have studied
>AUTOVACUUM in any detail) my thinking is to include an AUTOVACUUM command
>(outside of any SQL transaction block) in the HTTP request-driven PHP app,
>immediately following any time the PHP app completes a session row DELETE
>command.
>
>Or maybe the AUTOVACUUM request should occur less frequently?

Reducing the frequency will result in a larger table space on disk.
Insertions are made at the end of the table so the table keeps growing
in size regardless of deletions until (some kind of) vacuum is run.

Autovacuum doesn't shrink the table space on disk; it marks the space
occupied by dead rows as reusable so later inserts and updates can fill it.

If you want to tightly control the growth of the table space, you need
to run autovacuum _more_ often, not less.

George

Re: using a postgres table as a multi-writer multi-updater queue

From
"Steve Petrie, P.Eng."
Date:
Thanks to George and Adrian for their latest responses regarding a DELETE /
AUTOVACUUM / VACUUM approach to recycling disk space allocated for rows in a
postgres table that is used to manage sessions (e.g. HTTP sessions with
visitor web browsers).

My comments are below. In them, I propose an alternative strategy, using a
partial index, instead of a DELETE / AUTOVACUUM / VACUUM approach.

The comments are a bit lengthy, but I hope forum members will consider them,
and that they will perhaps help lead to a breakthrough in using a postgres
table for high-performance session management.

"George Neuner" <gneuner2@comcast.net> wrote in message
news:ukdm5b1ni7lv393coa71vf8d0i1mi0phuh@4ax.com...
> On Sun, 29 Nov 2015 05:02:58 -0500, "Steve Petrie, P.Eng."
> <apetrie@aspetrie.net> wrote:
>
>>I should have clarified, that I use the the term "fuzzy" to refer to the
>>probability mechanism, that hooks a small burst of session row deletion
>>activity, to each one of a randomly-selected portion (presently 1/5) of
>>the
>>HTTP requests that cause a new session row to be INSERTed into the session
>>table.
>>
>>This means that on average, only every 5th HTTP request that creates a new
>>session row, will also incur the session deletion workload. When the
>>session
>>row deletion process occurs, its (aggressive) limit for deletion workload
>>is
>>2X as many expired rows as needed on average, to keep up with the rate of
>>session row creation (so the 2X DELETE limit presently == 10 rows).
>>
>>The idea is to make the process of DELETing expired session rows,
>>automatically scale its activity, to closely and aggressively match the
>>rate
>>of session row creation.
>
> There's nothing really wrong with that, but I wouldn't do it that way
> ... I would bulk delete old records from a separate scheduled task.
>
> Another way to do it would be to have each new session delete exactly
> one old session.  1:1 scales perfectly and spreads the delete load
> evenly across all users.
>
> Not that deleting a handful of records is a lengthy process, but it
> seems unfair to burden some users with it but not others.  I would
> burden (or not) everyone equally.
>
>
>>A heavy burst of new sessions being created will
>>work proportionately more aggressively to DELETE expired session rows.
>>This
>>way, the (HTTP request-driven) PHP app will be self-tuning its own session
>>table space recycling. And there is no process (e.g. cron-driven),
>>external to the PHP app itself, that is doing session row deletion.
>>
>>Based on what I've learned from this forum (but before I have studied
>>AUTOVACUUM in any detail) my thinking is to include an AUTOVACUUM command
>>(outside of any SQL transaction block) in the HTTP request-driven PHP app,
>>immediately following any time the PHP app completes a session row DELETE
>>command.
>>
>>Or maybe the AUTOVACUUM request should occur less frequently?
>
> Reducing the frequency will result in a larger table space on disk.
> Insertions are made at the end of the table so the table keeps growing
> in size regardless of deletions until (some kind of) vacuum is run.
>
> Autovacuum doesn't shrink the table space on disk, it merely compacts
> the table's live data so that any free space is at the end.
>
> If you want to tightly control the growth of the table space, you need
> to run autovacuum _more_ often, not less.
>

Before I start in to implement a DELETE / AUTOVACUUM / VACUUM approach, to
recycling disk space used for a session management table, I would like to
propose, for consideration by this forum, an idea for a different approach.

A row in a session management table, represents a significant "sunk cost" in
both computing time used to create the row, and in disk space allocated.
Postgres has to use a lot of resources to create that row in the first
place.

When the session that originally caused that row to be allocated, eventually
expires -- why delete the associated session managent row ??

Instead of using a DELETE command to destroy the row (and a
resource-intensive AUTOVACUUM / VACUUM process to reclaim the storage
space), why not instead, simply mark that session management row as "free"
(with an UPDATE command) ??

In brief -- Why not just leave the expired session management row allocated
in place, already "hot" and ready to be reused for a new session ??

But -- when the app needs to initiate a new session, it must be able,
quickly, to: 1. know whether any "free" rows are available for re-use; 2. if
so, select and allocate a specific "free" row; 3. if no "free" row is
available, INSERT a new session row.

* * *
* * *

Would the postgres Partial Index facility be useful here?

Here are quotes from the excellent postgres documentation (9.3.5):

"11.8. Partial Indexes

A partial index is an index built over a subset of a table; the subset is
defined by a conditional expression (called the predicate of the partial
index). The index contains entries only for those table rows that satisfy
the predicate. ... This reduces the size of the index, which will speed up
those queries that do use the index. It will also speed up many table update
operations because the index does not need to be updated in all cases."

"11.2. Index Types

PostgreSQL provides several index types: B-tree ... By default, the CREATE
INDEX command creates B-tree indexes, which fit the most common situations.

B-trees can handle equality and range queries on data that can be sorted
into some ordering. In particular, the PostgreSQL query planner will
consider using a B-tree index whenever an indexed column is involved in a
comparison using one of these operators:

<
<=
=
>=
>

... Also, an IS NULL or IS NOT NULL condition on an index column can be used
with a B-tree index."

* * *
* * *

In the case of my particular app, I propose to add to the session management
table eto_sql_tb_session_www:

1. a new int column session_www_free, and
2. a partial (B-tree) index -- session_www_free_ndx -- on the new column.

CREATE DATABASE eto_sql_db WITH ENCODING 'UTF8';

CREATE SCHEMA its_eto;

CREATE TABLE its_eto.eto_sql_tb_session_www
(
  session_www_code       char(32)      NOT NULL UNIQUE PRIMARY KEY,

  session_www_type       int           NOT NULL,
  session_www_state      int           NOT NULL,
  session_verify_code    char(7)       NOT NULL,

  session_www_serno      SERIAL        NOT NULL UNIQUE,
  session_www_free       int           NULL,

  session_target_serno   int           NULL,
  session_target_data    varchar(1000) NULL,

  session_www_init_utc   timestamp     NOT NULL,
  session_www_last_utc   timestamp     NOT NULL,
  session_www_expiry_utc timestamp     NOT NULL,
  session_www_delete_utc timestamp     NOT NULL,
  session_www_hit_count  int           NOT NULL,
  session_www_act_seqno  int           NULL

);

CREATE INDEX session_www_serno_ndx ON its_eto.eto_sql_tb_session_www
(session_www_serno);

CREATE INDEX session_www_free_ndx ON its_eto.eto_sql_tb_session_www
(session_www_free)
  WHERE session_www_free IS NOT NULL;

* * *
* * *

The app starts with an empty table eto_sql_tb_session_www that has empty
indexes.

When the app goes to initiate a new session, it checks for availability of a
"free" session row, using a command like:

SELECT session_www_code, session_www_serno, ... FROM eto_sql_tb_session_www
  WHERE (session_www_free IS NOT NULL) LIMIT 1 FOR UPDATE;

(I am presuming that the postgres query planner will decide to use the
session_www_free_ndx partial index, to very quickly satisfy this query.)
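
That presumption is easy to check with EXPLAIN -- a quick sketch (the exact
plan will vary, and on a very small table the planner may prefer a
sequential scan even though the partial index exists):

EXPLAIN
SELECT session_www_code, session_www_serno
  FROM eto_sql_tb_session_www
  WHERE (session_www_free IS NOT NULL)
  LIMIT 1 FOR UPDATE;

If the partial index is chosen, the plan will show an index scan on
session_www_free_ndx.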

If a "free" session row is found (e.g. with session_www_code ==
'F032...40D9'), the app will allocate that row to the new session, with a
command like:

UPDATE eto_sql_tb_session_www
  SET ..., session_www_free = NULL, ...
  WHERE (session_www_code = 'F032...40D9');

When performing the UPDATE command, postgres will (MUST!!) remove the entry
for that row from the session_www_free_ndx partial index, because that
partial index excludes rows WHERE column session_www_free IS NULL.
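
For illustration, here is a minimal sketch of the whole claim step as one
transaction. The $1 placeholder stands for the session_www_code value just
returned by the SELECT, and the particular columns reset in the UPDATE are
only my assumption about how the app would re-initialize a reused row:

BEGIN;

SELECT session_www_code, session_www_serno
  FROM eto_sql_tb_session_www
  WHERE (session_www_free IS NOT NULL)
  LIMIT 1 FOR UPDATE;

-- if a row came back, claim it ($1 = the session_www_code just selected)
UPDATE eto_sql_tb_session_www
  SET session_www_free      = NULL,
      session_www_init_utc  = (now() AT TIME ZONE 'UTC'),
      session_www_last_utc  = (now() AT TIME ZONE 'UTC'),
      session_www_hit_count = 0,
      session_www_act_seqno = 0
  WHERE (session_www_code = $1);

COMMIT;

Note that under concurrency, a second worker blocked on the same row can get
zero rows back from the SELECT ... FOR UPDATE once the first worker commits;
in that case the app simply falls through to the INSERT path below.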

However, when the app goes to initiate a new session, if there are no
entries in the session_www_free_ndx partial index, the app then creates a
new session row with a unique (MD5) session_www_code value (e.g.
'26AD...50C9'):

INSERT INTO eto_sql_tb_session_www
  (session_www_code, ..., session_www_free, ...)
  VALUES ('26AD...50C9', ..., NULL, ...);

In that INSERT command, the value of the session_www_free column is set to
NULL, so postgres will not (MUST NOT!!) create an entry for that row in the
session_www_free_ndx partial index.

When a session (e.g. with session_www_code == '26AD...50C9') expires, the
app UPDATEs the corresponding session row:

UPDATE eto_sql_tb_session_www
  SET ..., session_www_free = session_www_serno, ...
  WHERE (session_www_code = '26AD...50C9');

Since the UPDATEd value of column session_www_free IS NOT NULL, postgres
will (MUST!!) create an entry for that row in the session_www_free_ndx
partial index.
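
If the app also wants to sweep expired sessions in bulk (rather than one at
a time), a minimal sketch of marking them "free" instead of DELETing them --
assuming column session_www_expiry_utc holds the UTC expiry time -- would be:

UPDATE eto_sql_tb_session_www
  SET session_www_free = session_www_serno
  WHERE (session_www_free IS NULL)
    AND (session_www_expiry_utc < (now() AT TIME ZONE 'UTC'));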

Every row in table eto_sql_tb_session_www always has a UNIQUE value for
column session_www_serno, so there can never be an attempt to create a
duplicate key value in the session_www_free_ndx partial index.

* * *
* * *

The app uses the 32-character (MD5) key value in column session_www_code as
the session ID parameter in dynamically-generated HTML pages. So an HTTP
request generated for that session by a web browser from that page can be
associated by the app with the correct unique row in table
eto_sql_tb_session_www.

To protect itself from processing an HTTP request coming from an outdated
HTML page, the app uses an additional 6-character hash value, generated from
two columns in table eto_sql_tb_session_www:

session_www_init_utc timestamp
session_www_act_seqno  int

The session_www_init_utc column gets a new timestamp value every time the
session row is allocated to a new session. Then the timestamp value stays
the same in that column for the duration of that session.

The int value in column session_www_act_seqno starts at 0, and simply
increments by 1 with every UPDATE to that session row. A session row gets
updated every time the app receives an HTTP request for that row. The app
will wrap the value in the row back to zero if it ever reaches its maximum
possible value.

This 6-character hash value is a second parameter, provided in every
dynamically-generated HTML page.
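
The app's actual hash formula is not shown here; purely as a hypothetical
illustration, a 6-character value of that kind could be derived from the two
columns in SQL like this:

SELECT left(md5(session_www_init_utc::text || '/' ||
                session_www_act_seqno::text), 6) AS verify_hash
  FROM eto_sql_tb_session_www
  WHERE (session_www_code = '26AD...50C9');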

So there should be no logical problem with reusing the same session row,
with the same 32-character session_www_code value, for more than one session
(but only ever for one session at a time).

* * *
* * *

Given the above proposed approach, the number of rows in table
eto_sql_tb_session_www will grow as large as necessary to service the
maximum number of concurrent sessions required.

But the app can reuse rows of expired sessions, again and again, instead of
DELETing every row after it is used for only one session (and also
periodically AUTOVACUUMing or VACUUMing table eto_sql_tb_session_www to
reclaim the storage space).

Suppose there is a large burst of session activity, and table
eto_sql_tb_session_www grows in size to e.g. 100,000 "active" session rows.
Eventually, when system activity slows down (or ceases) again, most (or all) of
the "active" rows will become "free" and available for reuse during the next
burst of activity.

And as the system gets busier again, the session_www_free_ndx partial index
will get smaller (because rows allocated to sessions have session_www_free
set back to NULL and so drop out of the index), so postgres should need
fewer resources to search this partial index for a "free" session row to
reuse -- fewer resources at the most advantageous time, when the system is
busy.

* * *
* * *

In the case of my app, I do still plan to take the postgres server offline
once per day, for a maintenance window of a few minutes. But instead of just
forcing the postgres server offline at an arbitrary time every day, the app
could analyze counts of entries in the two indexes:

session_www_serno_ndx
session_www_free_ndx

to detect when there are not too many active sessions, so postgres can be
taken offline with minimum inconvenience to website visitors.
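
Rather than inspecting the indexes directly, a simple count over the table
gives the same information, because count(column) only counts non-NULL
values -- a sketch:

SELECT count(*)                           AS total_rows,
       count(session_www_free)            AS free_rows,
       count(*) - count(session_www_free) AS active_rows
  FROM eto_sql_tb_session_www;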

During the brief offline maintenance window, the app can allocate a fresh
empty eto_sql_tb_session_www table, and copy over to it from the old
eto_sql_tb_session_www table, only rows of active sessions.

As a result of this table copy operation, the SERIAL numbers in column
session_www_serno will start again at 1, but this is no problem, because
this SERIAL number is only ever used as a key by the app for the duration of
a single HTTP request, once the (MD5) key value in the HTTP request has been
used to find the row (by the session_www_code PRIMARY KEY). The value of the
session_www_code column won't get changed when a row is copied across to
the fresh eto_sql_tb_session_www table.
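
A minimal sketch of that maintenance-window copy (the _new names and the
separate sequence are my assumptions; session_www_serno is deliberately
omitted from the copy so that the fresh SERIAL numbering restarts at 1):

BEGIN;

CREATE TABLE its_eto.eto_sql_tb_session_www_new
  (LIKE its_eto.eto_sql_tb_session_www INCLUDING ALL);

-- give the new table its own sequence, so serial numbers restart at 1
CREATE SEQUENCE its_eto.eto_sql_tb_session_www_new_serno_seq;
ALTER TABLE its_eto.eto_sql_tb_session_www_new
  ALTER COLUMN session_www_serno
  SET DEFAULT nextval('its_eto.eto_sql_tb_session_www_new_serno_seq');
ALTER SEQUENCE its_eto.eto_sql_tb_session_www_new_serno_seq
  OWNED BY its_eto.eto_sql_tb_session_www_new.session_www_serno;

-- copy only the rows of active sessions, letting session_www_serno be
-- reassigned from the fresh sequence
INSERT INTO its_eto.eto_sql_tb_session_www_new
  (session_www_code, session_www_type, session_www_state, session_verify_code,
   session_www_free, session_target_serno, session_target_data,
   session_www_init_utc, session_www_last_utc, session_www_expiry_utc,
   session_www_delete_utc, session_www_hit_count, session_www_act_seqno)
SELECT
   session_www_code, session_www_type, session_www_state, session_verify_code,
   session_www_free, session_target_serno, session_target_data,
   session_www_init_utc, session_www_last_utc, session_www_expiry_utc,
   session_www_delete_utc, session_www_hit_count, session_www_act_seqno
  FROM its_eto.eto_sql_tb_session_www
  WHERE (session_www_free IS NULL);

DROP TABLE its_eto.eto_sql_tb_session_www;
ALTER TABLE its_eto.eto_sql_tb_session_www_new
  RENAME TO eto_sql_tb_session_www;

COMMIT;

(The copied indexes and the new sequence keep auto-generated names based on
the _new table name; they can be renamed afterwards if desired.)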

* * *
* * *

For this approach to succeed, it will be important to ensure that, once
storage space for a row in table eto_sql_tb_session_www has been allocated,
postgres is in fact going to be able to reuse the same storage space for
that row, even as values in that row get changed.

I would expect this proviso to be more likely to hold for columns containing
fixed-length values (e.g. char(32), int, char(7), SERIAL, timestamp).

But what about the session_target_data varchar(1000) NULL column in table
eto_sql_tb_session_www ??

Here are some snippets from the postgres doc (9.3.5):

8.3. Character Types

"... Values of type character are physically padded with spaces to the
specified width n, and are stored and displayed that way."

"... The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case of
character. Longer strings have 4 bytes of overhead instead of 1. Long
strings are compressed by the system automatically, so the physical
requirement on disk might be less."

"... There is no performance difference among these three types, apart from
increased storage space when using the blank-padded type, and a few extra
CPU cycles to check the length when storing into a length-constrained
column. While character(n) has performance advantages in some other database
systems, there is no such advantage in PostgreSQL; in fact character(n) is
usually the slowest of the three because of its additional storage costs. In
most situations text or character varying should be used instead."

Perhaps someone with postgres internals knowledge could advise about how
much storage space reuse the proposed partial index approach might actually
accomplish with table eto_sql_tb_session_www ... ??

* * *
* * *

If forum members think this approach of using a partial index might work to
implement an efficient session row reuse strategy, I will proceed to try it.

> George
>

Steve Petrie, P.Eng.

ITS-ETO Consortium
Oakville, Ontario, Canada
apetrie@aspetrie.net



Re: using a postgres table as a multi-writer multi-updater queue

От
Alban Hertroys
Дата:
> Before I start in to implement a DELETE / AUTOVACUUM / VACUUM approach, to
> recycling disk space used for a session management table, I would like to
> propose, for consideration by this forum, an idea for a different approach.
>
> A row in a session management table, represents a significant "sunk cost" in
> both computing time used to create the row, and in disk space allocated.
> Postgres has to use a lot of resources to create that row in the first
> place.
>
> When the session that originally caused that row to be allocated eventually
> expires -- why delete the associated session management row ??
>
> Instead of using a DELETE command to destroy the row (and a
> resource-intensive AUTOVACUUM / VACUUM process to reclaim the storage
> space), why not instead simply mark that session management row as "free"
> (with an UPDATE command) ??

An UPDATE is a combination of an INSERT and a DELETE command.

However, rows marked as deleted will be reused at some point after
autovacuum (or manual VACUUM) has made sure they are no longer in use
by any DB session.

So your approach can still work, as long as you vacuum that table
frequently enough. The actual solution isn't the UPDATE instead of the
DELETE though, but rather the partial index and an increased
autovacuum frequency.
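
For example, autovacuum can be made much more eager for just this one table
through its storage parameters -- a sketch with illustrative numbers only
(tune them to the table's actual churn rate):

ALTER TABLE its_eto.eto_sql_tb_session_www SET (
  autovacuum_vacuum_scale_factor  = 0.0,
  autovacuum_vacuum_threshold     = 1000,
  autovacuum_analyze_scale_factor = 0.0,
  autovacuum_analyze_threshold    = 1000
);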

Alban.
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: using a postgres table as a multi-writer multi-updater queue

От
George Neuner
Дата:
On Mon, 30 Nov 2015 23:07:36 -0500, "Steve Petrie, P.Eng."
<apetrie@aspetrie.net> wrote:

>Instead of using a DELETE command to destroy the row (and a
>resource-intensive AUTOVACUUM / VACUUM process to reclaim the storage
>space), why not instead, simply mark that session management row as "free"
>(with an UPDATE command) ??

Alban beat me to the answer  8-)
But to expand a little:

Your plan won't work because Postgresql does not update in place - it
inserts a new changed row and marks the old as deleted.  It does not
physically overwrite the old row until the table is vacuumed.
[If even then - a row which lies beyond the logical end-of-table when
vacuum is finished won't be overwritten until its space is recycled.]

This behavior, known as MVCC (multiversion concurrency control),
is integral to transaction isolation: selects which are running
concurrently with the update may already have seen the old row and
must continue to see it until they complete, even if the update
completes first.

Postgresql doesn't support "dirty read" isolation.  A row can't be
physically dropped or its space overwritten while any transaction that
can "see" it is still running.


For more:
https://devcenter.heroku.com/articles/postgresql-concurrency
https://momjian.us/main/writings/pgsql/mvcc.pdf

Actually lots of great stuff in the presentation section on Bruce
Momjian's site:  https://momjian.us


George