Обсуждение: is there a way to firmly cap postgres worker memory consumption?
I'm running into some scenarios where errant postgres queries are consuming all memory and bringing down production boxes. I'm running Postgres 9.3.4 on CentOS 5.x VMs under XenServer. My custom config is here: https://gist.github.com/skehlet/9984666
The incidents may be related to reports of an array_agg memory leak, and another recent incident seems like it was due to the xpath() function provided by --with-xml. There's some more information on the former in my post at: http://dba.stackexchange.com/questions/62587/why-does-this-query-cause-the-postgres-worker-process-to-consume-so-much-memory, if it's interesting. I'd like to dig into that further, maybe on another thread.
Thanks for any help/ideas!
But for now, the devs are asking me for some way to put a cap on a postgres query's total memory consumption. I'm familiar with the available settings on the "Resource Consumption" docs (and you can see my settings in my gist above, including work_mem turned way down to 1MB for testing), but it seems like there are things like Materialize that remain uncappable, since they're not constrained by the shared_buffers and work_mem limits. For example in my post to dba.stackexchange above, I found by doing a "set enable_material=false;" I could stop a particularly nasty query from consuming all memory, narrowing the problem somewhat, but that query is one we have to live with in the near term.
With all that laid out, can anyone help with the following questions:
- Is there any way to set to total memory cap on a worker processes memory consumption?
- Are there other (even undocumented) settings to cap memory usage?
- Other ideas?
In the meantime, to stop the bleeding I'm looking into using the postgres feature to coerce the OOM killer to kill rogue child processes first (-DLINUX_OOM_ADJ). Unfortunately, another setback, even the latest kernel on CentOS 5.x doesn't allow non-root processes to write to /proc/self/oom_adj, so it isn't working. So I'll either need to patch the CentOS kernel and rebuild, move to a newer kernel, or maybe do something hacky like run a background process to reassign the oom_adj value for all postgres workers found.
Thanks for any help/ideas!
Hi Steve,
Did you check if your OS allows you to limit memory usage per user / process basis?When memory consumption grows , Doesn't ever swap out?
Thanks,On Tue, Apr 8, 2014 at 10:48 AM, Steve Kehlet <steve.kehlet@gmail.com> wrote:
I'm running into some scenarios where errant postgres queries are consuming all memory and bringing down production boxes. I'm running Postgres 9.3.4 on CentOS 5.x VMs under XenServer. My custom config is here: https://gist.github.com/skehlet/9984666The incidents may be related to reports of an array_agg memory leak, and another recent incident seems like it was due to the xpath() function provided by --with-xml. There's some more information on the former in my post at: http://dba.stackexchange.com/questions/62587/why-does-this-query-cause-the-postgres-worker-process-to-consume-so-much-memory, if it's interesting. I'd like to dig into that further, maybe on another thread.But for now, the devs are asking me for some way to put a cap on a postgres query's total memory consumption. I'm familiar with the available settings on the "Resource Consumption" docs (and you can see my settings in my gist above, including work_mem turned way down to 1MB for testing), but it seems like there are things like Materialize that remain uncappable, since they're not constrained by the shared_buffers and work_mem limits. For example in my post to dba.stackexchange above, I found by doing a "set enable_material=false;" I could stop a particularly nasty query from consuming all memory, narrowing the problem somewhat, but that query is one we have to live with in the near term.With all that laid out, can anyone help with the following questions:- Is there any way to set to total memory cap on a worker processes memory consumption?- Are there other (even undocumented) settings to cap memory usage?- Other ideas?In the meantime, to stop the bleeding I'm looking into using the postgres feature to coerce the OOM killer to kill rogue child processes first (-DLINUX_OOM_ADJ). Unfortunately, another setback, even the latest kernel on CentOS 5.x doesn't allow non-root processes to write to /proc/self/oom_adj, so it isn't working. So I'll either need to patch the CentOS kernel and rebuild, move to a newer kernel, or maybe do something hacky like run a background process to reassign the oom_adj value for all postgres workers found.
Thanks for any help/ideas!
Did you check if your OS allows you to limit memory usage per user / process basis?
I'll definitely dig more into this. I found some stuff saying it isn't as easy as one would expect, i.e. not just a ulimit and done. Or maybe it is, in this case. I will play around.
Did you either profiled or debugged it to see what is going on?
I would love to learn more about how to do this, to get to the bottom of the memory usage. I can google around, or can you suggest any reads?
When memory consumption grows , Doesn't ever swap out?
Yes, it consumes all swap.
Steve Kehlet <steve.kehlet@gmail.com> writes: > But for now, the devs are asking me for some way to put a cap on a postgres > query's total memory consumption. I'm familiar with the available settings > on the "Resource Consumption" docs (and you can see my settings in my gist > above, including work_mem turned way down to 1MB for testing), but it seems > like there are things like Materialize that remain uncappable, since > they're not constrained by the shared_buffers and work_mem limits. Materialize nodes should honor the work_mem limit. That's per node of course, but you only have half a dozen of them in this query, so I doubt that's where the problem is. You mentioned array_agg ... how are you using that exactly, and how large a result are you expecting it to create? If you had an array_agg being used in a HashAggregate plan node, I'd figure for sure that was the source of your problem, but the explain output shows it's a GroupAggregate so there should only be one array_agg active at a time. Anyway, what I would try doing is starting the postmaster with a "ulimit -d" value corresponding to the max per-process data segment size you want. Something in the range of 50-100MB would probably be reasonable if your queries aren't too complex. regards, tom lane
Steve Kehlet <steve.kehlet@gmail.com> writes: >> Did you either profiled or debugged it to see what is going on? > I would love to learn more about how to do this, to get to the bottom of > the memory usage. I can google around, or can you suggest any reads? Once you've got a ulimit in place so that malloc eventually fails with ENOMEM, the backend process should print a memory context dump on stderr when it hits that. Make sure your logging setup captures the process stderr someplace (logging_collector works for this, syslog does not). Post the dump here when you've got it. regards, tom lane
On Tue, Apr 8, 2014 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Once you've got a ulimit in place so that malloc eventually fails withENOMEM, the backend process should print a memory context dump on stderr
when it hits that. Make sure your logging setup captures the process
stderr someplace (logging_collector works for this, syslog does not).
Post the dump here when you've got it.
Thank you. For some reason I couldn't get it to trip with "ulimit -d 51200", but "ulimit -v 1572864" (1.5GiB) got me this in serverlog. I hope this is readable, if not it's also here: https://gist.github.com/skehlet/10206741
TopMemoryContext: 123760 total in 16 blocks; 2720 free (7 chunks); 121040 used
TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
Type information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 used
Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
TopTransactionContext: 8192 total in 1 blocks; 7368 free (0 chunks); 824 used
MessageContext: 4186112 total in 9 blocks; 633960 free (19 chunks); 3552152 used
JoinRelHashTable: 8192 total in 1 blocks; 624 free (0 chunks); 7568 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
smgr relation table: 24576 total in 2 blocks; 9808 free (4 chunks); 14768 used
TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
PortalHeapMemory: 1024 total in 1 blocks; 816 free (0 chunks); 208 used
ExecutorState: 566208 total in 12 blocks; 195256 free (2 chunks); 370952 used
TupleSort: 17817440 total in 36 blocks; 9326832 free (8801 chunks); 8490608 used
TupleSort: 17500568 total in 86 blocks; 12016040 free (2896 chunks); 5484528 used
TupleSort: 17616584 total in 86 blocks; 9975520 free (3649 chunks); 7641064 used
TupleSort: 10825056 total in 14 blocks; 3896400 free (45082 chunks); 6928656 used
TupleSort: 17459720 total in 67 blocks; 8495600 free (2226 chunks); 8964120 used
TupleSort: 10899360 total in 23 blocks; 2112448 free (23706 chunks); 8786912 used
TupleSort: 11908776 total in 15 blocks; 8344064 free (145642 chunks); 3564712 used
printtup: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 812638208 total in 108 blocks; 183520 free (171 chunks); 812454688 used
AggContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
Relcache by OID: 24576 total in 2 blocks; 13872 free (3 chunks); 10704 used
CacheMemoryContext: 1342128 total in 21 blocks; 424872 free (1 chunks); 917256 used
pg_toast_16646_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
d_mc2_fki: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
d_mc2_pkey: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
d_mcm2_pkey: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
d_mm2_fki: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
d_mm2_index2: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
d_mm2_index1: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
d_mm2_pkey: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
d_m2_index1: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
d_m2_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_index_indrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_user_mapping_user_server_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_user_mapping_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_type_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_type_typname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_ts_template_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_ts_template_tmplname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_parser_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_ts_parser_prsname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_dict_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_ts_dict_dictname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_config_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_ts_config_cfgname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_ts_config_map_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_tablespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_statistic_relid_att_inh_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_class_relname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_range_rngtypid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_proc_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_proc_proname_args_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_opfamily_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_opfamily_am_name_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_operator_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_operator_oprname_l_r_n_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_namespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_namespace_nspname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_language_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_language_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_foreign_table_relid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_foreign_server_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_foreign_server_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_event_trigger_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_event_trigger_evtname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_enum_typid_label_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_enum_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_default_acl_role_nsp_obj_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_conversion_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_constraint_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_conversion_name_nsp_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_conversion_default_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_collation_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_collation_name_enc_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
pg_opclass_am_name_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_cast_source_target_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_auth_members_role_member_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_amop_fam_strat_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_amop_opr_fam_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_aggregate_fnoid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_auth_members_member_role_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_authid_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_authid_rolname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_database_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_database_datname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
pg_opclass_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_index_indexrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
pg_class_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
MdSmgr: 8192 total in 1 blocks; 6112 free (0 chunks); 2080 used
ident parser context: 0 total in 0 blocks; 0 free (0 chunks); 0 used
hba parser context: 3072 total in 2 blocks; 80 free (1 chunks); 2992 used
LOCALLOCK hash: 24576 total in 2 blocks; 13920 free (4 chunks); 10656 used
Timezones: 83472 total in 2 blocks; 3744 free (0 chunks); 79728 used
ErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used
2014-04-08 23:00:14 GMT [31229]: [1-1] ERROR: out of memory
2014-04-08 23:00:14 GMT [31229]: [2-1] DETAIL: Failed on request of size 597.
2014-04-08 23:00:14 GMT [31229]: [3-1] STATEMENT: SELECT MESSAGE.ID,
MESSAGE.SERVER_ID,
MESSAGE.RECEIVED_DATE,
MESSAGE.PROCESSED,
MESSAGE.ORIGINAL_ID,
MESSAGE.IMPORT_ID,
MESSAGE.IMPORT_CHANNEL_ID,
METADATA_IDS
FROM
( SELECT M.ID,
ARRAY_TO_STRING(ARRAY_AGG(MM.ID::CHARACTER VARYING), ',') AS METADATA_IDS
FROM D_M2 M
JOIN D_MM2 MM ON MM.MESSAGE_ID = M.ID
LEFT JOIN D_MCM2 MCM ON MCM.MESSAGE_ID = MM.MESSAGE_ID
AND MCM.METADATA_ID = MM.ID
LEFT JOIN D_MC2 MC_1 ON MC_1.MESSAGE_ID = M.ID
AND MC_1.IS_ENCRYPTED = FALSE
AND MC_1.METADATA_ID = 0
AND ((MM.ID = 0
AND MC_1.CONTENT_TYPE = 1)
OR (MM.ID > 0
AND MC_1.CONTENT_TYPE = 4))
LEFT JOIN D_MC2 MC_2 ON MC_2.MESSAGE_ID = M.ID
AND MC_2.IS_ENCRYPTED = FALSE
AND MC_2.METADATA_ID = MM.ID
AND MC_2.CONTENT_TYPE = '2'
LEFT JOIN D_MC2 MC_3 ON MC_3.MESSAGE_ID = M.ID
AND MC_3.IS_ENCRYPTED = FALSE
AND MC_3.METADATA_ID = MM.ID
AND MC_3.CONTENT_TYPE = '3'
LEFT JOIN D_MC2 MC_4 ON MC_4.MESSAGE_ID = M.ID
AND MC_4.IS_ENCRYPTED = FALSE
AND MC_4.METADATA_ID = MM.ID
AND MC_4.CONTENT_TYPE = '4'
LEFT JOIN D_MC2 MC_5 ON MC_5.MESSAGE_ID = M.ID
AND MC_5.IS_ENCRYPTED = FALSE
AND MC_5.METADATA_ID = MM.ID
AND MC_5.CONTENT_TYPE = '5'
LEFT JOIN D_MC2 MC_6 ON MC_6.MESSAGE_ID = M.ID
AND MC_6.IS_ENCRYPTED = FALSE
AND MC_6.METADATA_ID = MM.ID
AND MC_6.CONTENT_TYPE = '6'
LEFT JOIN D_MC2 MC_7 ON MC_7.MESSAGE_ID = M.ID
AND MC_7.IS_ENCRYPTED = FALSE
AND MC_7.METADATA_ID = MM.ID
AND MC_7.CONTENT_TYPE = '7'
LEFT JOIN D_MC2 MC_8 ON MC_8.MESSAGE_ID = M.ID
AND MC_8.IS_ENCRYPTED = FALSE
AND MC_8.METADATA_ID = MM.ID
AND MC_8.CONTENT_TYPE ='8'
WHERE M.ID <= '230716'
AND ( M.ID::VARCHAR ILIKE '%' || 'rwar mea' || '%'
OR MM.CONNECTOR_NAME ILIKE '%' || 'rwar mea' || '%'
OR MC_1.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_2.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_3.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_4.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_5.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_6.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_7.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MC_8.CONTENT ILIKE '%' || 'rwar mea' || '%'
OR MCM."SOURCE" ILIKE '%' || 'rwar mea' || '%'
OR MCM."TYPE" ILIKE '%' || 'rwar mea' || '%' )
GROUP BY M.ID
ORDER BY M.ID DESC
OFFSET 0 LIMIT 21 ) AS SUBQUERY
ORDER BY MESSAGE.ID DESC
;
2014-04-08 23:00:14 GMT [31229]: [4-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.7", size 124600320
2014-04-08 23:00:14 GMT [31229]: [5-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.6", size 36167680
2014-04-08 23:00:14 GMT [31229]: [6-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.5", size 1073741824
2014-04-08 23:00:14 GMT [31229]: [7-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.4", size 823541760
2014-04-08 23:00:14 GMT [31229]: [8-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.3", size 5324800
2014-04-08 23:00:14 GMT [31229]: [9-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.2", size 402513920
2014-04-08 23:00:14 GMT [31229]: [10-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.1", size 38617088
2014-04-08 23:00:14 GMT [31229]: [11-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.0", size 5079040
I should add I had to move testing to a different VM, with less RAM and different tunings; they are here: https://gist.github.com/skehlet/10207086
On Tue, Apr 8, 2014 at 4:08 PM, Steve Kehlet <steve.kehlet@gmail.com> wrote:
On Tue, Apr 8, 2014 at 12:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:Once you've got a ulimit in place so that malloc eventually fails withENOMEM, the backend process should print a memory context dump on stderr
when it hits that. Make sure your logging setup captures the process
stderr someplace (logging_collector works for this, syslog does not).
Post the dump here when you've got it.Thank you. For some reason I couldn't get it to trip with "ulimit -d 51200", but "ulimit -v 1572864" (1.5GiB) got me this in serverlog. I hope this is readable, if not it's also here: https://gist.github.com/skehlet/10206741TopMemoryContext: 123760 total in 16 blocks; 2720 free (7 chunks); 121040 usedTableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 usedType information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 usedOperator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 usedTopTransactionContext: 8192 total in 1 blocks; 7368 free (0 chunks); 824 usedMessageContext: 4186112 total in 9 blocks; 633960 free (19 chunks); 3552152 usedJoinRelHashTable: 8192 total in 1 blocks; 624 free (0 chunks); 7568 usedOperator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 usedsmgr relation table: 24576 total in 2 blocks; 9808 free (4 chunks); 14768 usedTransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 usedPortal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 usedPortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 usedPortalHeapMemory: 1024 total in 1 blocks; 816 free (0 chunks); 208 usedExecutorState: 566208 total in 12 blocks; 195256 free (2 chunks); 370952 usedTupleSort: 17817440 total in 36 blocks; 9326832 free (8801 chunks); 8490608 usedTupleSort: 17500568 total in 86 blocks; 12016040 free (2896 chunks); 5484528 usedTupleSort: 17616584 total in 86 blocks; 9975520 free (3649 chunks); 7641064 usedTupleSort: 10825056 total in 14 blocks; 3896400 free (45082 chunks); 6928656 usedTupleSort: 17459720 total in 67 blocks; 8495600 free (2226 chunks); 8964120 usedTupleSort: 10899360 total in 23 blocks; 2112448 free (23706 chunks); 8786912 usedTupleSort: 11908776 total in 15 blocks; 8344064 free (145642 chunks); 3564712 usedprinttup: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 812638208 total in 108 blocks; 183520 free (171 chunks); 812454688 usedAggContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 usedRelcache by OID: 24576 total in 2 blocks; 13872 free (3 chunks); 10704 usedCacheMemoryContext: 1342128 total in 21 blocks; 424872 free (1 chunks); 917256 usedpg_toast_16646_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedd_mc2_fki: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedd_mc2_pkey: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedd_mcm2_pkey: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedd_mm2_fki: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedd_mm2_index2: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedd_mm2_index1: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedd_mm2_pkey: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedd_m2_index1: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedd_m2_pkey: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_index_indrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedpg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_user_mapping_user_server_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_user_mapping_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_type_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_type_typname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedpg_ts_template_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_ts_template_tmplname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_ts_parser_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_ts_parser_prsname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_ts_dict_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_ts_dict_dictname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_ts_config_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_ts_config_cfgname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_ts_config_map_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 usedpg_tablespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_statistic_relid_att_inh_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_class_relname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedpg_range_rngtypid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_proc_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_proc_proname_args_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_opfamily_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_opfamily_am_name_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 usedpg_operator_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_operator_oprname_l_r_n_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_namespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_namespace_nspname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_language_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_language_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_foreign_table_relid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_foreign_server_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_foreign_server_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_event_trigger_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_event_trigger_evtname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_enum_typid_label_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_enum_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_default_acl_role_nsp_obj_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 usedpg_conversion_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_constraint_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_conversion_name_nsp_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_conversion_default_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 usedpg_collation_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_collation_name_enc_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 usedpg_opclass_am_name_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_cast_source_target_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedpg_auth_members_role_member_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_attribute_relid_attnam_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_amop_fam_strat_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_amop_opr_fam_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_aggregate_fnoid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_auth_members_member_role_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_authid_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_authid_rolname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_database_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_database_datname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 usedpg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedpg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 usedpg_amproc_fam_proc_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 usedpg_opclass_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_index_indexrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedpg_attribute_relid_attnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 usedpg_class_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 usedMdSmgr: 8192 total in 1 blocks; 6112 free (0 chunks); 2080 usedident parser context: 0 total in 0 blocks; 0 free (0 chunks); 0 usedhba parser context: 3072 total in 2 blocks; 80 free (1 chunks); 2992 usedLOCALLOCK hash: 24576 total in 2 blocks; 13920 free (4 chunks); 10656 usedTimezones: 83472 total in 2 blocks; 3744 free (0 chunks); 79728 usedErrorContext: 8192 total in 1 blocks; 8160 free (3 chunks); 32 used2014-04-08 23:00:14 GMT [31229]: [1-1] ERROR: out of memory2014-04-08 23:00:14 GMT [31229]: [2-1] DETAIL: Failed on request of size 597.2014-04-08 23:00:14 GMT [31229]: [3-1] STATEMENT: SELECT MESSAGE.ID,MESSAGE.SERVER_ID,MESSAGE.RECEIVED_DATE,MESSAGE.PROCESSED,MESSAGE.ORIGINAL_ID,MESSAGE.IMPORT_ID,MESSAGE.IMPORT_CHANNEL_ID,METADATA_IDSFROM( SELECT M.ID,ARRAY_TO_STRING(ARRAY_AGG(MM.ID::CHARACTER VARYING), ',') AS METADATA_IDSFROM D_M2 MJOIN D_MM2 MM ON MM.MESSAGE_ID = M.IDLEFT JOIN D_MCM2 MCM ON MCM.MESSAGE_ID = MM.MESSAGE_IDAND MCM.METADATA_ID = MM.IDLEFT JOIN D_MC2 MC_1 ON MC_1.MESSAGE_ID = M.IDAND MC_1.IS_ENCRYPTED = FALSEAND MC_1.METADATA_ID = 0AND ((MM.ID = 0AND MC_1.CONTENT_TYPE = 1)OR (MM.ID > 0AND MC_1.CONTENT_TYPE = 4))LEFT JOIN D_MC2 MC_2 ON MC_2.MESSAGE_ID = M.IDAND MC_2.IS_ENCRYPTED = FALSEAND MC_2.METADATA_ID = MM.IDAND MC_2.CONTENT_TYPE = '2'LEFT JOIN D_MC2 MC_3 ON MC_3.MESSAGE_ID = M.IDAND MC_3.IS_ENCRYPTED = FALSEAND MC_3.METADATA_ID = MM.IDAND MC_3.CONTENT_TYPE = '3'LEFT JOIN D_MC2 MC_4 ON MC_4.MESSAGE_ID = M.IDAND MC_4.IS_ENCRYPTED = FALSEAND MC_4.METADATA_ID = MM.IDAND MC_4.CONTENT_TYPE = '4'LEFT JOIN D_MC2 MC_5 ON MC_5.MESSAGE_ID = M.IDAND MC_5.IS_ENCRYPTED = FALSEAND MC_5.METADATA_ID = MM.IDAND MC_5.CONTENT_TYPE = '5'LEFT JOIN D_MC2 MC_6 ON MC_6.MESSAGE_ID = M.IDAND MC_6.IS_ENCRYPTED = FALSEAND MC_6.METADATA_ID = MM.IDAND MC_6.CONTENT_TYPE = '6'LEFT JOIN D_MC2 MC_7 ON MC_7.MESSAGE_ID = M.IDAND MC_7.IS_ENCRYPTED = FALSEAND MC_7.METADATA_ID = MM.IDAND MC_7.CONTENT_TYPE = '7'LEFT JOIN D_MC2 MC_8 ON MC_8.MESSAGE_ID = M.IDAND MC_8.IS_ENCRYPTED = FALSEAND MC_8.METADATA_ID = MM.IDAND MC_8.CONTENT_TYPE ='8'WHERE M.ID <= '230716'AND ( M.ID::VARCHAR ILIKE '%' || 'rwar mea' || '%'OR MM.CONNECTOR_NAME ILIKE '%' || 'rwar mea' || '%'OR MC_1.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_2.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_3.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_4.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_5.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_6.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_7.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MC_8.CONTENT ILIKE '%' || 'rwar mea' || '%'OR MCM."SOURCE" ILIKE '%' || 'rwar mea' || '%'OR MCM."TYPE" ILIKE '%' || 'rwar mea' || '%' )GROUP BY M.IDORDER BY M.ID DESCOFFSET 0 LIMIT 21 ) AS SUBQUERYORDER BY MESSAGE.ID DESC;2014-04-08 23:00:14 GMT [31229]: [4-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.7", size 1246003202014-04-08 23:00:14 GMT [31229]: [5-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.6", size 361676802014-04-08 23:00:14 GMT [31229]: [6-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.5", size 10737418242014-04-08 23:00:14 GMT [31229]: [7-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.4", size 8235417602014-04-08 23:00:14 GMT [31229]: [8-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.3", size 53248002014-04-08 23:00:14 GMT [31229]: [9-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.2", size 4025139202014-04-08 23:00:14 GMT [31229]: [10-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.1", size 386170882014-04-08 23:00:14 GMT [31229]: [11-1] LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp31229.0", size 5079040
Steve Kehlet <steve.kehlet@gmail.com> writes: > Thank you. For some reason I couldn't get it to trip with "ulimit -d > 51200", but "ulimit -v 1572864" (1.5GiB) got me this in serverlog. I hope > this is readable, if not it's also here: Well, here's the problem: > ExprContext: 812638208 total in 108 blocks; 183520 free (171 > chunks); 812454688 used So something involved in expression evaluation is eating memory. Looking at the query itself, I'd have to bet on this: > ARRAY_TO_STRING(ARRAY_AGG(MM.ID::CHARACTER VARYING), ',') My guess is that this aggregation is being done across a lot more rows than you were expecting, and the resultant array/string therefore eats lots of memory. You might try replacing that with COUNT(*), or even better SUM(LENGTH(MM.ID::CHARACTER VARYING)), just to get some definitive evidence about what the query is asking to compute. Meanwhile, it seems like ulimit -v would provide the safety valve you asked for originally. I too am confused about why -d didn't do it, but as long as you found a variant that works ... regards, tom lane
On Tue, Apr 8, 2014 at 6:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, here's the problem:> ExprContext: 812638208 total in 108 blocks; 183520 free (171So something involved in expression evaluation is eating memory.
> chunks); 812454688 used
Looking at the query itself, I'd have to bet on this:My guess is that this aggregation is being done across a lot more rows
> ARRAY_TO_STRING(ARRAY_AGG(MM.ID::CHARACTER VARYING), ',')
than you were expecting, and the resultant array/string therefore eats
lots of memory. You might try replacing that with COUNT(*), or even
better SUM(LENGTH(MM.ID::CHARACTER VARYING)), just to get some definitive
evidence about what the query is asking to compute.
The devs have moved on and want to stick with their new query, so I'll just chalk this up to a bad query and let it go. But I'm glad to have learned a few new tricks, thanks.
Meanwhile, it seems like ulimit -v would provide the safety valve
you asked for originally.
Thank you Amador and Tom for the ulimit solution, that's exactly what I needed.