Yep, roughly 420 million of the 540 million records have a titleid
of 1. There are about 880,000 other unique values, but most of the
records are 1. Yet n_distinct is only 292. I'm surprised
it's not eliminating the duplicates while it builds that hash table.
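For reference, this is roughly how I compared the real data to the
planner's statistics (a sketch; the count(DISTINCT) pass is slow on
540M rows, and pg_stats is the standard statistics view):

```sql
-- Actual number of distinct titleids (full scan, slow on a big table)
SELECT count(DISTINCT titleid) FROM log_raw;

-- What the planner believes after ANALYZE
SELECT n_distinct, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'log_raw'
   AND attname   = 'titleid';
```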
Here is what I'm doing as a workaround right now. Getting
n_distinct right seems to keep the planner from breaking.
It will be executed once a week during the weekly maintenance.
It sets the n_distinct of each column to the number of rows in
the associated lookup table.
CREATE OR REPLACE FUNCTION patch_ndistinct(_table varchar, _column varchar,
                                           _string_table varchar)
RETURNS real AS
$$
DECLARE
    _cnt real;
BEGIN
    -- reltuples is the planner's row-count estimate from the last
    -- ANALYZE/VACUUM; casting through regclass resolves the schema
    -- instead of matching any relation with that name
    SELECT reltuples INTO _cnt
      FROM pg_class
     WHERE oid = _string_table::regclass;
    -- A positive n_distinct is taken as an absolute distinct count;
    -- format() with %I quotes the identifiers safely
    EXECUTE format('ALTER TABLE %I ALTER COLUMN %I SET (n_distinct = %s)',
                   _table, _column, _cnt);
    RETURN _cnt;
END
$$ LANGUAGE plpgsql;
select patch_ndistinct('log_raw', 'titleid', 'titles');
select patch_ndistinct('log_raw', 'urlid', 'urls');
select patch_ndistinct('log_raw', 'hostid', 'hosts');
ANALYZE log_raw;
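To confirm the overrides stuck after the maintenance run, something
like this works (a sketch; attoptions on pg_attribute is where the
per-column override is stored, and pg_stats shows what ANALYZE
actually recorded):

```sql
-- The stored per-column override
SELECT attname, attoptions
  FROM pg_attribute
 WHERE attrelid = 'log_raw'::regclass
   AND attname IN ('titleid', 'urlid', 'hostid');

-- What the planner will now use
SELECT attname, n_distinct
  FROM pg_stats
 WHERE tablename = 'log_raw'
   AND attname IN ('titleid', 'urlid', 'hostid');
```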
On Thu, Feb 16, 2017 at 10:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Hinkle <hinkle@cipafilter.com> writes:
>> Tom, there are three columns in this table that exhibit the problem,
>> here is the statistics data after an analyze, and the real data to
>> compare it to.
>
>> attname | n_distinct | most_common_freqs
>
>> titleid | 292 | {0.767167}
>
> Ouch. That's saying there's some single value of titleid that accounts
> for more than three-quarters of the entries ... does that square with
> reality? That'd certainly explain why a hash join goes nuts.
>
> regards, tom lane
--
David Hinkle
Senior Software Developer
Phone: 800.243.3729x3000
Email: hinkle@cipafilter.com
Hours: Mon-Fri 8:00AM-5:00PM (CT)