Discussion: designing a time dimension for a star schema
Hello everybody,

I was wondering if anyone had any experiences they can share when designing the time dimension for a star schema and the like. I'm curious about how well it would work to use a timestamp for the attribute key, as opposed to a surrogate key, and populating the time dimension with triggers on insert to the fact tables. This is something that would have data streaming in (as opposed to bulk loading), and I think we want time granularity to the minute.

A simplified example:

-- Time dimension
CREATE TABLE time (
    datetime TIMESTAMP WITH TIME ZONE NOT NULL,
    day_of_week SMALLINT NOT NULL
);
CREATE UNIQUE INDEX ON time (datetime);

-- Fact
CREATE TABLE fact (
    datetime TIMESTAMP WITH TIME ZONE NOT NULL,
    FOREIGN KEY (datetime) REFERENCES time (datetime)
);

-- Function to populate the time dimension
CREATE OR REPLACE FUNCTION decompose_timestamp() RETURNS TRIGGER AS $$
BEGIN
    NEW.datetime = date_trunc('minute', NEW.datetime);
    INSERT INTO time (datetime, day_of_week)
    VALUES (NEW.datetime, date_part('dow', NEW.datetime));
    RETURN NEW;
EXCEPTION
    WHEN unique_violation THEN
        -- Do nothing if the timestamp already exists in the dimension table.
        RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER populate_time BEFORE INSERT
ON fact FOR EACH ROW
EXECUTE PROCEDURE decompose_timestamp();

Regards,
Mark
I've done a lot of DSS architecture. A couple of thoughts:

- In most cases the ETL process figures out the time IDs as part of the preparation and then does bulk loads into the fact tables. I would be very concerned about the performance of a trigger that fires for every row inserted into the fact table.

  You mention you want to do data streaming instead of bulk loads; can you elaborate?

- When querying a star schema, one of the performance features is the fact that all joins to the dimension tables are performed via a numeric key, such as:

      SELECT * FROM fact, time_dim, geo_dim
      WHERE fact.time_id = time_dim.time_id ...

  If the key is a timestamp instead, I suspect performance would take a hit. Depending on the size of your fact table and the scope/volume of your DSS queries, this could easily be a showstopper, based on the assumption that the database can do a numeric binary search much faster than a timestamp search.

On 2/10/14, 9:45 AM, Mark Wong wrote:
> Hello everybody,
>
> I was wondering if anyone had any experiences they can share when
> designing the time dimension for a star schema and the like. I'm
> curious about how well it would work to use a timestamp for the
> attribute key, as opposed to a surrogate key, and populating the time
> dimension with triggers on insert to the fact tables. This is
> something that would have data streaming in (as opposed to bulk
> loading) and I think we want time granularity to the minute.
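For concreteness, the surrogate-key layout described above might look something like the following. This is only a sketch; the table and column names (time_dim, time_id, the batch file path) are illustrative, not anything from this thread:

```sql
-- Illustrative surrogate-key time dimension (names are hypothetical).
CREATE TABLE time_dim (
    time_id     integer PRIMARY KEY,   -- assigned by the ETL process
    datetime    timestamp with time zone NOT NULL UNIQUE,
    day_of_week smallint NOT NULL
);

CREATE TABLE fact (
    time_id integer NOT NULL REFERENCES time_dim (time_id),
    value   numeric
);

-- The ETL step resolves each row's time_id up front, then bulk loads
-- the prepared batch in one shot, e.g.:
--   COPY fact (time_id, value) FROM '/path/to/batch.csv' WITH (FORMAT csv);
```

Queries then join fact to time_dim on the numeric time_id, and no trigger fires during the load.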
On Mon, Feb 10, 2014 at 9:20 AM, CS DBA <cs_dba@consistentstate.com> wrote:
> I've done a lot of DSS architecture. A couple of thoughts:
>
> - in most cases the ETL process figures out the time id's as part of the
>   preparation and then does bulk loads into the fact tables
>   I would be very concerned about performance of a trigger that
>   fired for every row on the fact table
>
>   you mention you want to do data streaming instead of bulk loads,
>   can you elaborate?

We have processes inserting data from log files as they are written.

> - When querying a star schema one of the performance features is the
>   fact that all joins to the dimension tables are performed via a numeric
>   key, such as:
>   "select * from fact, time_dim, geo_dim
>   where fact.time_id = time_dim.time_id..."
>
>   In the case of this being a timestamp I suspect the performance would
>   take a hit, depending on the size of your fact table and the
>   scope/volume of your DSS queries this could easily be a show stopper
>   based on the assumption that the database can do a numeric binary search
>   much faster than a timestamp search

I guess I was hoping the extra 4 bytes from a timestamp, compared to a bigint, wouldn't be too significant, even with 1 billion rows, but I didn't consider that Postgres might do a binary search faster on an integer type than on a timestamp. Maybe that's wishful thinking. Maybe a regular integer at 4 bytes would be good enough. I would estimate a query would touch on the order of 1 million rows at a time.

Regards,
Mark
Mark Wong <markwkm@gmail.com> writes:
> On Mon, Feb 10, 2014 at 9:20 AM, CS DBA <cs_dba@consistentstate.com> wrote:
>> In the case of this being a timestamp I suspect the performance would
>> take a hit, depending on the size of your fact table and the
>> scope/volume of your DSS queries this could easily be a show stopper
>> based on the assumption that the database can do a numeric binary search
>> much faster than a timestamp search

> I guess I was hoping the extra 4 bytes from a timestamp, compared to a
> bigint, wouldn't be too significant yet I didn't consider postgres
> might do a binary search faster on an integer type than a timestamp.

Actually, Postgres timestamps *are* bigints under the hood, and comparisons will be about the same speed as for bigints. Now, I/O is a lot more expensive for timestamps ... but not index searches.

                        regards, tom lane
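Tom's point is easy to check from SQL. Assuming a server built with integer datetimes (the default since PostgreSQL 8.4), a timestamptz and a bigint occupy the same 8 bytes:

```sql
-- timestamptz is stored internally as an 8-byte integer (microseconds
-- since 2000-01-01), the same width as bigint.
SELECT pg_column_size(now())      AS timestamptz_bytes,
       pg_column_size(1::bigint)  AS bigint_bytes;
-- Both columns return 8.
```

So there is no "extra 4 bytes" in the key at all; the difference only shows up when converting timestamps to and from text.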
On Mon, Feb 10, 2014 at 8:45 AM, Mark Wong <markwkm@gmail.com> wrote:
> Hello everybody,
>
> I was wondering if anyone had any experiences they can share when
> designing the time dimension for a star schema and the like. I'm
> curious about how well it would work to use a timestamp for the
> attribute key, as opposed to a surrogate key, and populating the time
> dimension with triggers on insert to the fact tables. This is
> something that would have data streaming in (as opposed to bulk
> loading) and I think we want time granularity to the minute.
>
> [snip example schema and trigger]

Failing to inline the date/time in the fact table, and pushing it into a dimension table instead, is a terrible practice IMO. You add a lookup and an expensive subtransaction to each insert, and when querying the fact table you tack a join onto every query that does not need to be there (for no benefit I can see).

merlin
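If I read Merlin correctly, the inlined alternative would look something like this (a sketch with illustrative names, not code from this thread):

```sql
-- Keep the timestamp directly in the fact table; no dimension lookup
-- or trigger fires on insert.
CREATE TABLE fact (
    datetime timestamp with time zone NOT NULL,
    value    numeric
);

-- Derive attributes like day-of-week at query time rather than
-- joining to a time dimension:
SELECT date_part('dow', datetime) AS day_of_week,
       count(*)
FROM fact
WHERE datetime >= '2014-02-01' AND datetime < '2014-03-01'
GROUP BY 1;
```

The trade-off is recomputing such expressions per query (an index on `date_trunc('minute', datetime)` or similar can help) in exchange for cheap inserts and one less join.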
On Mon, Feb 10, 2014 at 8:45 AM, Mark Wong <markwkm@gmail.com> wrote:
> Hello everybody,
>
> I was wondering if anyone had any experiences they can share when
> designing the time dimension for a star schema and the like. I'm
> curious about how well it would work to use a timestamp for the
> attribute key, as opposed to a surrogate key, and populating the time
> dimension with triggers on insert to the fact tables. This is
> something that would have data streaming in (as opposed to bulk
> loading) and I think we want time granularity to the minute.

Hello everybody,

I did a simple experiment and just wanted to share; hopefully it wasn't too simple. On a 72GB 15K RPM 2.5" drive, I measured how long it takes to insert 100,000 rows, committing after each insert, keyed by bigints, by timestamps with time zone, and by timestamps with time zone with the insert trigger. The bigint and plain timestamp cases each took about 10 minutes to insert the 100,000 rows; adding the trigger increased that to about 11 minutes.

Regards,
Mark