Discussion: Optimal time series sampling.
As a prelude to where I really want to go, please consider the following SELECT statement:

SELECT close_price FROM stockprices A
WHERE price_date =
  (SELECT MAX(price_date) FROM stockprices B
   WHERE A.stock_id = B.stock_id AND A.stock_id = id);

stockprices has a primary key comprised of stock_id and price_date, and I tried the same query with an extra index on price_date (but that index made no difference in apparent performance as seen on the clock on the wall).

I have been advised (on the MySQL board) to use the following, with the claim, unsupported as far as I can tell, that it is both correct and much faster:

SELECT A.`close` AS close_price FROM stockprices A
LEFT JOIN stockprices B
  ON A.stock_id = B.stock_id AND B.price_date > A.price_date
WHERE B.price_date IS NULL AND A.stock_id = id;

It appears to do the right thing. I certainly get the right answer, but I am not seeing a significant difference in performance. Worse, when I invoke something like it for a suite of about two dozen stocks, it takes about ten minutes to complete. (I may try a variant in which the last clause used in WHERE is replaced by IN followed by a trivial select that gets the same two dozen stock_ids, to see if that helps.)

Now, I am concerned with performance because, and this is where I really want to go, I want to adapt this logic to create new time series of closing prices, but at the granularity of a week, a month or a quarter, and there is no predicting a priori how long the series is. IBM's data goes back decades, while I have data for other stocks that goes back only a couple of years.

Now, a junior programmer here had suggested just doing a simple select, at least for weekly granularity, by selecting a value if its day of the week computes to Friday. That can't work correctly because in some weeks there are statutory holidays that land on Fridays, resulting in the last actual trading day for that week being Thursday.
His simple approach guarantees that many records that ought to be included will be ignored. I need a more dynamic and flexible approach, which allows me to work on the basis that I have prices for all trading days for a given stock from the time my data for it begins. So I need a more complex select statement that will just select the most recent price for a given stock for each week (or month or quarter or year).

Now, I can get the full time series for two dozen stocks, as slow and brain-dead as doing a select for each stock ID, AND have my Java code construct and display a chart, in less than 20 seconds (and Java does not have a reputation for being fast). I need whatever solution I use to be that quick.

Any thoughts about how best to attack this in order to get the correct results as fast as possible? What options would you consider, WRT defining the SQL statements you would benchmark, in order to design your benchmark testing?

Thanks,

Ted
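[Editor's sketch of the per-period sampling Ted is describing, assuming the schema from his first message — stockprices(stock_id, price_date, close_price) with primary key (stock_id, price_date) — and MySQL's YEAR()/WEEKOFYEAR() functions; this is essentially the form Ted settles on later in the thread. Swapping the grouping expressions, e.g. to YEAR()/MONTH() or YEAR()/QUARTER(), changes the granularity.]

```sql
-- Last closing price per (year, week) for one stock: group the stock's rows
-- by calendar week, take the latest trading date in each group, then pull
-- the matching price rows back out.
SELECT price_date, close_price
FROM stockprices A
WHERE A.stock_id = 1
  AND price_date IN
      (SELECT MAX(B.price_date)
       FROM stockprices B
       WHERE B.stock_id = A.stock_id
       GROUP BY YEAR(B.price_date), WEEKOFYEAR(B.price_date));
```

Because the last trading day of the week is computed from the data, a Friday holiday simply means the week's MAX(price_date) falls on the Thursday — the holiday problem disappears.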
"Ted Byers" <r.ted.byers@rogers.com> writes: > As a prelude to where I really want to go, please > consider the following SELECT statement. > > SELECT close_price FROM stockprices A > WHERE price_date = > (SELECT MAX(price_date) FROM stockprices B > WHERE A.stock_id = B.stock_id AND A.stock_id = id); I assume you're missing another "stock_id = id" on the outer query? I think you'll have to post the actual explain analyze output you're getting and the precise schema you have. You might need an index on <stock_id,price_date>. > It appears to do the right thing. I certainly get the > right answer, but I am not seeing a significant > difference in performance. Worse, when I invoke > something like it for a suite of about two dozen > stocks, it takes about ten minutes to complete. That would be an entirely different ball of wax than trying to pull out a single stock's closing price. I suspect you're going to want to use Postgres's "DISTINCT ON" SQL extension. Something like: SELECT DISTINCT ON (stock_id,price_date) * FROM stockprices ORDER BY stock_id, price_date DESC And you may want an index on < stock_id, price_date DESC > I believe MySQL does have a similar extension where you can use GROUP BY and have columns listed in the select target list which aren't included in the grouping sets. > So I need a more complex select statement that will just select the most > recent price for a given stock for each week (or month or quarter or year). Do you care what happens if there were no trades for a given stock in the time period? The query you give above using MAX would still work but the query I described using DISTINCT ON would not emit a record for the stock at all. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's 24x7 Postgres support!
--- Gregory Stark <stark@enterprisedb.com> wrote:

> "Ted Byers" <r.ted.byers@rogers.com> writes:
>
> > As a prelude to where I really want to go, please
> > consider the following SELECT statement.
> >
> > SELECT close_price FROM stockprices A
> > WHERE price_date =
> >   (SELECT MAX(price_date) FROM stockprices B
> >    WHERE A.stock_id = B.stock_id AND A.stock_id = id);
>
> I assume you're missing another "stock_id = id" on the outer query?

Right.

> I think you'll have to post the actual explain analyze output you're getting
> and the precise schema you have.

OK, it is challenging to present it in plain text, but here is the HTML exported by MySQL Query Browser. If you cut between the "====" lines and paste the content into a file with an html extension, it will look fine and be easy to read.

======================================
<html>
<head>
<title>Query EXPLAIN SELECT price_date,`close` AS close_price FROM stockprices A WHERE A.stock_id = 1 AND price_date IN (SELECT MAX(price_date) FROM stockprices B WHERE A.stock_id = B.stock_id AND A.stock_id = 1 GROUP BY YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov 09 11:12:46 2007</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1>Query EXPLAIN SELECT price_date,`close` AS close_price FROM stockprices A WHERE A.stock_id = 1 AND price_date IN (SELECT MAX(price_date) FROM stockprices B WHERE A.stock_id = B.stock_id AND A.stock_id = 1 GROUP BY YEAR(B.price_date),WEEKOFYEAR(B.price_date));, Fri Nov 09 11:12:46 2007</h1>
<table border=1 cellspacing=1 cellpadding=0>
<tr><th>id</th><th>select_type</th><th>table</th><th>type</th><th>possible_keys</th><th>key</th><th>key_len</th><th>ref</th><th>rows</th><th>Extra</th></tr>
<tr><td>1</td><td>PRIMARY</td><td>A</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using where</td></tr>
<tr><td>2</td><td>DEPENDENT SUBQUERY</td><td>B</td><td>ref</td><td>PRIMARY</td><td>PRIMARY</td><td>4</td><td>const</td><td>17442</td><td>Using where; Using index; Using temporary; Using filesort</td></tr>
</table>
</body></html>
=======================================================

> You might need an index on <stock_id,price_date>.

That is the definition of the primary key. That is why both the main query and the subquery are shown using the primary key.

> That would be an entirely different ball of wax than trying to pull out a
> single stock's closing price. I suspect you're going to want to use Postgres's
> "DISTINCT ON" SQL extension. Something like:
>
> SELECT DISTINCT ON (stock_id,price_date) *
> FROM stockprices
> ORDER BY stock_id, price_date DESC
>
> And you may want an index on < stock_id, price_date DESC >

That pair, as I mentioned, forms the primary key for the stockprices table. Here is my SQL for subsampling a time series:

SELECT price_date, close_price FROM stockprices A
WHERE A.stock_id = 1 AND price_date IN
  (SELECT MAX(price_date) FROM stockprices B
   WHERE A.stock_id = B.stock_id AND A.stock_id = 1
   GROUP BY YEAR(B.price_date), WEEKOFYEAR(B.price_date));

This performs better than the outer-join algorithm for getting only the last price for a couple dozen stocks. This particular statement crawls to completion in about 4 or 5 minutes, as compared to over ten minutes to get just the last price for a couple dozen stocks. Not too surprisingly, EXPLAIN gives identical results for this query as it did for the simpler SELECT above.

> Do you care what happens if there were no trades for a given stock in the time
> period? The query you give above using MAX would still work but the query I
> described using DISTINCT ON would not emit a record for the stock at all.

No. But then I haven't yet analyzed the data to learn what the prices for a given period really mean if there haven't been any trades within the period.
I have yet to see a series of prices for which the volume is 0. That may be an artifact of how my colleagues selected stocks and ETFs to use to test our algorithm. I do not yet know if it will be a significant issue for us, since I don't see how a stock or ETF that can go for a while without any trades at all would be of interest, given the kind of information we will be producing for our clients.

I am presently focussed on developing ways of looking at the data, to help my colleagues better understand the data and what our algorithm does with it. One of the properties of interest, and of relevance to our algorithm, is the common property that the series seem to be self-affine (as described most notably by B. Mandelbrot): hence the need to sample with different degrees of granularity. My colleagues have worked primarily with finance data (esp. commodities), while my background is more focussed on risk management in environmental science.

Thanks,

Ted
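[Editor's note: the EXPLAIN output above shows the grouped MAX subquery planned as a DEPENDENT SUBQUERY over ~17442 rows with "Using temporary; Using filesort", i.e. re-evaluated per outer row. A hedged sketch of one way to decorrelate it — evaluate the per-week MAX(price_date) list once in a derived table and join against it; standard SQL, using the thread's assumed schema:]

```sql
-- Compute each stock's last trading date per (year, week) exactly once,
-- then join back to stockprices to fetch the matching closing prices.
SELECT A.price_date, A.close_price
FROM stockprices A
JOIN (SELECT stock_id, MAX(price_date) AS last_date
      FROM stockprices
      GROUP BY stock_id, YEAR(price_date), WEEKOFYEAR(price_date)) W
  ON A.stock_id = W.stock_id AND A.price_date = W.last_date
WHERE A.stock_id = 1;
```

Whether this actually beats the correlated form depends on the optimizer and version, which is exactly the point made later in the thread — it belongs in the benchmark suite rather than being assumed faster.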
Ted Byers wrote:
> --- Gregory Stark <stark@enterprisedb.com> wrote:
>
> > I think you'll have to post the actual explain analyze output you're getting
> > and the precise schema you have.
>
> OK, it is challenging to present it in plain text, but here is the HTML
> exported by MySQL Query Browser. If you cut between the "====" lines and
> paste the content into a file with an html extension, it will look fine
> and be easy to read.

How about *you* cut & paste into a file, save that, then open it in a browser, then cut & paste the resulting text into a mail? And if you happen to be using an operating system that refuses to let go of formatting instructions when copying plaintext, there's always Notepad.

brian
Ted Byers <r.ted.byers@rogers.com> writes: > OK, it is challenging to present it in plain text, but > here is the HTML exported by MySQL Query Browser. Why are you asking this list for help with a MySQL performance problem? regards, tom lane
--- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ted Byers <r.ted.byers@rogers.com> writes:
> > OK, it is challenging to present it in plain text, but
> > here is the HTML exported by MySQL Query Browser.
>
> Why are you asking this list for help with a MySQL performance problem?

Because my question isn't really about MySQL, but rather about how best to construct the SQL required to get the job done, regardless of what database is used. I have seen some claims that it is better to use joins instead of correlated subqueries, and others that say the opposite. And I do not, at this stage, know if there are other options in SQL that may or may not be better. At this time, the database in use is irrelevant (I want to stick as close to the ANSI standard as practicable, so the rewriting required will be minimal should we decide to change the database later, for whatever reason).

Thanks,

Ted
Ted Byers <r.ted.byers@rogers.com> writes:
> --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Why are you asking this list for help with a MySQL
> > performance problem?
>
> because my question isn't really about MySQL, but
> rather about how best to construct the SQL required to
> get the job done, regardless of what database is used.
> I have seen some claims that it is better to use
> joins instead of correlated subqueries and others that
> say the opposite.

The problem is that the answer may well be different for different databases. You'd get better answers about MySQL on a MySQL list than you'll get here.

			regards, tom lane
On Nov 9, 2007 11:47 AM, Ted Byers <r.ted.byers@rogers.com> wrote:
> --- Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Why are you asking this list for help with a MySQL
> > performance problem?
>
> because my question isn't really about MySQL, but
> rather about how best to construct the SQL required to
> get the job done, regardless of what database is used.
> I have seen some claims that it is better to use
> joins instead of correlated subqueries and others that
> say the opposite. And I do not, at this stage, know
> if there are other options in SQL that may or may not
> be better.

Which is better depends largely on how your database is built. MySQL still uses loops for all subselects, so with large numbers of tuples in the subselect method, it will be slow. But they might fix this in a later release.

Fairly recent versions of PostgreSQL could make some bad choices when doing joins for certain datasets that would be much faster with a correlated subquery (specifically, the old "left join where righttable.field is null" trick made some pgsql versions choose an inefficient join method).

So, the "right" way is a question of which db, and even which version of that DB you're on.

> At this time, the database in use is
> irrelevant (I want to stick as close to the ANSI
> standard as practicable so the rewriting required will
> be minimal should we decide to change the database
> later, for whatever reason).

If you want to stick with ANSI, MySQL tends to be more divergent from the spec than pgsql and other rdbms.

Most people would consider the correlated subquery the better method. But it's also likely to be the slowest on MySQL.
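[Editor's note: for completeness, the same "latest row per stock" predicate discussed here can also be written with NOT EXISTS, which is ANSI-standard and semantically identical to the LEFT JOIN ... IS NULL trick; a hedged sketch using the thread's assumed schema:]

```sql
-- A row is the latest for its stock iff no later-dated row exists for the
-- same stock. Equivalent to the anti-join (LEFT JOIN ... WHERE B.price_date
-- IS NULL) form, and a third candidate worth including in any benchmark.
SELECT A.stock_id, A.price_date, A.close_price
FROM stockprices A
WHERE NOT EXISTS
      (SELECT 1 FROM stockprices B
       WHERE B.stock_id = A.stock_id
         AND B.price_date > A.price_date);
```

Which of the three formulations (correlated MAX, anti-join, NOT EXISTS) wins is precisely the version- and engine-dependent question raised above.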
--- Scott Marlowe <scott.marlowe@gmail.com> wrote:
> [snip]
> Which is better depends largely on how your database is built. MySQL
> still uses loops for all subselects, so with large numbers of tuples
> in the subselect method, it will be slow. But they might fix this in
> a later release.
>
> Fairly recent versions of PostgreSQL could make some bad choices when
> doing joins for certain datasets that would be much faster with a
> correlated subquery (specifically the old left join where
> righttable.field is null trick made some pgsql versions choose an
> inefficient join method)
>
> So, the "right" way is a question of which db, and even which version
> of that DB you're on.

My life just got soooooooo much more complicated. Oh well, I guess that will be useful when providing advice to management when they start seeing performance issues. Thanks ;-)

Have you looked at version 5.0.45? I keep seeing the claim that the left-join trick is so much faster than the correlated subquery, especially if a function like MAX() is used, but the numbers I am seeing with real stock price data have the correlated subquery running, on average, about three times faster than the left join. So I assumed I was doing something wrong, in a manner that would get me the right answer the slowest way possible.

> > At this time, the database in use is
> > irrelevant (I want to stick as close to the ANSI
> > standard as practicable so the rewriting required will
> > be minimal should we decide to change the database
> > later, for whatever reason).
>
> If you want to stick with ANSI, MySQL tends to be more divergent from
> the spec than pgsql and other rdbms.
The books I am using, which describe the SQL language, don't seem to mention or illustrate much difference among any of the rdbms (including my references that talk about Oracle and MS SQL Server). The SQL I try from those books seems to work reasonably well in all of them (I can't check against Oracle, though, since I don't have that), and I try most of my SQL against MySQL, Postgres and MS SQL Server (the biggest divergences seem to be in how bulk loading of data happens). Maybe I haven't explored enough of the SQL language, with large enough datasets, to see the differences you mention; or perhaps things are improving with all of them.

> Most people would consider the correlated subquery the better method.
> But it's also likely to be the slowest on MySQL.

Right now, with this particular query, the correlated subquery is the one that gets me the right answers about 3 times faster than any other method I have tried. But it still takes several minutes to get the results for only a few dozen stocks. And yet I can get several megabytes of data from the following query in about a quarter of the time:

SELECT price_date, close_price FROM stockprices A
WHERE A.stock_id = 1 AND price_date IN
  (SELECT MAX(B.price_date) FROM stockprices B
   WHERE A.stock_id = B.stock_id AND A.stock_id = 1
   GROUP BY YEAR(B.price_date), WEEKOFYEAR(B.price_date));

I had expected this to take many times longer than the "simple" select that gets only the last price for a given stock, but I was surprised to see it so much faster than the query that gets just the last prices for only a couple dozen stocks.

Thanks a lot. I learned a lot from your reply.

Ted
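[Editor's note: the "Using temporary; Using filesort" in the EXPLAIN output earlier in the thread comes from grouping on the YEAR()/WEEKOFYEAR() expressions, which MySQL 5.0 cannot index. One hedged option, not discussed in the thread, is to materialize those grouping keys as ordinary columns and index them; the column and index names below are invented for illustration:]

```sql
-- Hypothetical denormalization: store the grouping keys explicitly so the
-- GROUP BY can walk an index instead of building a temporary table and
-- filesorting it. The columns must be maintained on every insert/update.
ALTER TABLE stockprices
  ADD COLUMN price_year SMALLINT NOT NULL,
  ADD COLUMN price_week TINYINT NOT NULL;

UPDATE stockprices
   SET price_year = YEAR(price_date),
       price_week = WEEKOFYEAR(price_date);

CREATE INDEX idx_stock_year_week
    ON stockprices (stock_id, price_year, price_week, price_date);
```

The weekly query would then group by (price_year, price_week) directly; whether the maintenance cost is worth it is a question for the benchmark suite Ted mentions.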
Is there an existing way to enforce password complexity for password authentication? I am not seeing anything in the docs, and I can only turn up this reference to a pending patch for 8.2 (bottom of page): http://www.postgresql.org/community/weeklynews/pwn20061210 Thanks in advance for any suggestions. Paul