Discussion: Google Summer of Code 2008
Hi PostgreSQL!

Although this year's GSoC is just starting, I thought getting in touch a bit earlier would only be of benefit.

I study Computer Science at the Faculty of Mathematics, Informatics and Mechanics of Warsaw University. I'm currently in my fourth year of studies. Having chosen Databases for my degree course, I plan to write my thesis concentrating at least partially on PostgreSQL. This will (hopefully) be my first GSoC.

For the past one and a half years I've also been working at a privately held company, Fiok LLP. The company deals, among other things, in developing custom Web applications, all of which use PostgreSQL as their database solution. During my time at Fiok I have taken part in creating an accounting system for a large Polish university, capable of generating the financial reports required by the European Union, a publishing platform for editors working at the Polish Catholic Press Agency, and a custom-tailored CRM application. All of these projects use unique PostgreSQL features, like PITR and full-text search, to name a few. You can glimpse the implemented FTS functionality by looking here:

http://system.ekai.pl/kair/?_tw_DepeszeKlientaTable_0__search_plainfulltext=kalendarz&_tw_DepeszeKlientaTable_0__search_rank_orderby=on&screen=depesze

It's the public part of the publishing platform, which allows subscribed readers to view published messages. The link takes you to search results for the word 'kalendarz' (which is Polish for calendar), ordered by rank() and highlighted by headline() (our client uses 8.2, hence the old function names).

I do my work at Fiok almost exclusively from home, showing up at the office once every two or three weeks, so working in a distributed environment using SCM tools is natural to me.

I'm also engaged in an open source project called Kato, being one of its key developers. It's a small project that started as my company's requirement for a new Web application framework and ended up being released under the New BSD License. Of course, its native database engine is PostgreSQL. You can take a look at the source here: http://kato.googlecode.com/ or play around with a simple demo here: http://sahara.fiok.pl/~jurbanski/kato-demo/kato-demo.en.php

Speaking of open source contributions, I also wrote an FTS-related patch for Postgres that made its way into 8.3: http://archives.postgresql.org/pgsql-patches/2007-11/msg00081.php

I try to follow -patches, occasionally read -hackers, and sometimes make excursions around the pgsql source, trying to learn more and more of it.

As for my programming skills, particularly in C: one piece of code I'd like to show you was written for an Operating Systems course. It's a kernel patch implementing I/O operation throttling on a per-process basis through a /proc-based interface. The code lacks comments, as they were in Polish, but it's just to assure you I'm able to write some good C: http://students.mimuw.edu.pl/~wulczer/linux-2.6.17.13-iolimits-ju219721.patch

And now for the SoC. As this year's PostgreSQL ideas page is not set up yet, I thought I'd give you the two projects floating through my mind.

1. WAL segment file explorer / mangler

While preparing a presentation about PITR and warm standby in PostgreSQL for my degree course, I thought it would be nice to have a command-line tool to examine the contents of a WAL segment file and determine, for example, what commands were recorded in it, what transaction IDs they ran under, and so on.
This would allow one, for instance, to replay the WAL sequence up until a function went haywire and wrecked one's data, without needing to know *when* the accident happened. It could also be useful as an alternative method of logging operations: since WAL files are written anyway, one could imagine a process periodically looking through them and reporting (perhaps not all) operations to some external listener. If, for instance, you were curious which column in a table is updated most, instead of writing a trigger to log updates to it, you could use the WAL explorer to find updates to that column and log them over the network, thus reducing disk I/O.

Being even bolder, I thought about allowing one to edit the contents of a WAL file, so if the proverbial junior DBA drops a crucial table and gets caught the next morning, you don't have to throw away all the transactions that got committed overnight. Maybe you could *overwrite* his DROP TABLE with something neutral and replay the WAL up to its end.

2. Implement better selectivity estimates for FTS.

If I'm not mistaken, the @@ operator still uses the contsel selectivity function, returning 0.001 * <total_row_count> as the expected number of rows left after applying the @@ operator. I have in the past been bitten by performance problems that I think could be traced back to row count estimates being horribly wrong (i.e. much too low) for FTS queries asking for a very popular word. Maybe we could do better than just returning one-thousandth?

I myself am more for the first idea, but both seem like good concepts to me. Also, both are implementable as contrib modules, with the WAL explorer possibly requiring modifications to the WAL structure, and thus having to wait for 8.4 to get into core. As of now these are just loose ideas, but ones I believe are possible to implement within the time frame of GSoC coding. Before digging deeper into the source and giving them more thought, I wanted to consult some more experienced PostgreSQL hackers and get their opinions - after all, that's what the community is for.

To wrap it up: do you find any of these ideas worthwhile? Could they be good candidates for a GSoC project? Of course doing some stuff from the TODO list would still be fun, if you believe that to be more promising/needed. Basically, any kind of involvement in PostgreSQL is something that gets me excited.

Hope to hear from you,
Cheers,
Jan Urbanski

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
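To make idea 1 a bit more concrete, here is a minimal standalone sketch of the kind of command-line tool being proposed. Everything in it is hypothetical: the struct mirrors only a few fields of the real XLogRecord header (which also carries a back-pointer to the previous record and a CRC), and the reader ignores WAL page headers and records that cross page boundaries, so it illustrates the intended interface rather than parsing real segments correctly.

    #include <stdio.h>
    #include <stdint.h>

    /* Deliberately simplified stand-in for the on-disk record header. */
    typedef struct SimpleWalRecord
    {
        uint32_t tot_len;   /* total record length, header included */
        uint32_t xid;       /* transaction that produced the record */
        uint8_t  info;      /* resource-manager-specific flag bits */
        uint8_t  rmid;      /* resource manager (heap, btree, xact, ...) */
    } SimpleWalRecord;

    int
    main(int argc, char **argv)
    {
        FILE *f;
        SimpleWalRecord rec;
        long offset = 0;

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s wal-segment-file\n", argv[0]);
            return 1;
        }
        if ((f = fopen(argv[1], "rb")) == NULL)
        {
            perror(argv[1]);
            return 1;
        }
        /* One line per (assumed contiguous) record: xid, rmgr, flags, length. */
        while (fread(&rec, sizeof(rec), 1, f) == 1 && rec.tot_len >= sizeof(rec))
        {
            printf("%08lx: xid=%u rmid=%u info=0x%02x len=%u\n",
                   (unsigned long) offset, rec.xid, rec.rmid, rec.info,
                   rec.tot_len);
            offset += rec.tot_len;
            if (fseek(f, offset, SEEK_SET) != 0)
                break;
        }
        fclose(f);
        return 0;
    }

A real tool would additionally decode the resource-manager-specific payload, which is what would let it report "this record is a DROP TABLE" and support the replay-up-to-here and overwrite scenarios described above.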
Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> 2. Implement better selectivity estimates for FTS.

+1 for that one ...

			regards, tom lane
Tom Lane wrote:
> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>> 2. Implement better selectivity estimates for FTS.
>
> +1 for that one ...

OK, this one might very well be the more useful one. And I can always reuse the other idea for my thesis, after expanding it a bit. Speaking of @@ selectivity, I even mailed about it once on the -performance list, but the mail somehow got lost in the processing queue and never reached the list. Anyway, the idea has been on my mind for some time now.

So, are you considering putting FTS selectivity estimates on this year's SoC ideas list? That is, if PostgreSQL is planning to participate in SoC this year, which I'm sure it is ;)

cheers,
Jan Urbanski

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
On Tue, Mar 4, 2008 at 4:47 PM, Jan Urbański <j.urbanski@students.mimuw.edu.pl> wrote:
> Tom Lane wrote:
> > Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
> >> 2. Implement better selectivity estimates for FTS.
> >
> > +1 for that one ...
>
> OK, this one might very well be the more useful one. And I can always
> reuse the other idea for my thesis, after expanding it a bit.
> Speaking of @@ selectivity, I even mailed about it once on the
> -performance list, but the mail somehow got lost in the processing queue
> and never reached the list. Anyway, the idea has been on my mind for
> some time now.
> So, are you considering putting FTS selectivity estimates on this year's
> SoC ideas list? That is, if PostgreSQL is planning to participate in SoC
> this year, which I'm sure it is ;)

We are - but the idea doesn't need to be on the list for us to consider it. Just write up a good project outline and plan, ready to submit when the doors open.

--
Dave Page
EnterpriseDB UK Ltd: http://www.enterprisedb.com
PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk
Jan,

> OK, this one might very well be the more useful one.

Well, you should submit *both* once SoC opens for applications. The mentors will decide which.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Jan,

the problem is known and a fix is frequently requested. From your description, though, it's not clear what the actual idea is?

Oleg

On Tue, 4 Mar 2008, Jan Urbański wrote:

> Tom Lane wrote:
>> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>>> 2. Implement better selectivity estimates for FTS.
>>
>> +1 for that one ...
>
> OK, this one might very well be the more useful one. And I can always
> reuse the other idea for my thesis, after expanding it a bit. Speaking of
> @@ selectivity, I even mailed about it once on the -performance list, but
> the mail somehow got lost in the processing queue and never reached the
> list. Anyway, the idea has been on my mind for some time now.
> So, are you considering putting FTS selectivity estimates on this year's
> SoC ideas list? That is, if PostgreSQL is planning to participate in SoC
> this year, which I'm sure it is ;)
>
> cheers,
> Jan Urbanski

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov wrote:
> Jan,
>
> the problem is known and a fix is frequently requested. From your
> description, though, it's not clear what the actual idea is?

I guess the first approach could be to populate some more columns in pg_statistic for tables with tsvectors. I see there are some statistics already being gathered (the pg_stats view's histogram_bounds are populated for tsvector columns), so maybe one could use that? Even remembering a few of the most frequently appearing lexemes could, in my opinion, help. I plotted distinct lexemes against the number of documents containing them (basically the output of stat()) in one of our databases and came up with this:

http://www.fiok.pl/~jurbanski/kaired-depesze.png

The three high values are really stopwords, and partially because of that I wrote my first FTS patch, but this shows that if we remembered the ~40 most frequent lexemes, we could give much better estimates for popular queries (and I think the estimates that hurt performance most are the ones that undershoot the row count).

As for a more general solution, I'd have to read deeper into the tsearch code to understand how the tsvector type and the @@ operator work, and give it a bit more thought. I'm planning to do that in the next three weeks (read: before the student application period starts). Maybe some kind of heuristic could be implemented? Possibly someone could load some information specific to her language, which would tell the planner how common (more or less) a given word is? Another attempt at it would be to return lower estimates for tsqueries consisting of more parts: 'X'::tsquery is usually far less selective than 'X & Y & Z & V'::tsquery.

I searched through the list archives but couldn't find any other attempts at this problem - were there any?

Jan

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
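That last heuristic, shrinking the estimate as the number of ANDed terms grows, could be as small as the sketch below. The constants are invented purely for illustration, and real code would of course walk the tsquery tree rather than take a bare term count.

    #include <stdio.h>

    /*
     * Crude structural heuristic: assume each ANDed term independently
     * filters rows with some fixed selectivity, so 'X & Y & Z & V' comes
     * out as more selective than plain 'X', instead of both getting the
     * flat 0.001 that contsel returns today.
     */
    static double
    tsquery_structural_sel(int nterms)
    {
        const double per_term_sel = 0.01;   /* invented for illustration */
        double result = 1.0;
        int i;

        for (i = 0; i < nterms; i++)
            result *= per_term_sel;
        return result;
    }

    int
    main(void)
    {
        int n;

        for (n = 1; n <= 4; n++)
            printf("%d term(s): %g\n", n, tsquery_structural_sel(n));
        return 0;
    }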
Oleg Bartunov wrote:
> Jan,
>
> the problem is known and a fix is frequently requested. From your
> description, though, it's not clear what the actual idea is?
>> Tom Lane wrote:
>>> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>>>> 2. Implement better selectivity estimates for FTS.

OK, after reading through some of the code, the idea is to write a custom typanalyze function for tsvector columns. It could look inside the tsvectors, compute the most commonly appearing lexemes, and store that information in pg_statistic. Then there should be a custom selectivity function for @@ and friends that would look at the lexemes in pg_statistic, see if the tsquery it got matches some/any of them, and return a result based on that.

I have a feeling that in many cases identifying the top 50 to 300 lexemes would be enough to talk about text search selectivity with a degree of confidence. At least we wouldn't give overly low estimates for queries looking for very popular words, which I believe is worse than giving an overly high estimate for an obscure query (am I wrong here?).

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
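The lookup half of that design could have roughly this shape. This is a sketch only: the flat arrays stand in for the stavalues1/stanumbers1 entries a real selectivity function would fetch from pg_statistic, and the function name is made up.

    #include <stdio.h>
    #include <string.h>

    #define DEFAULT_TS_SEL 0.001    /* the flat default returned today */

    /*
     * Look a single query lexeme up in the most-common-lexeme statistics
     * gathered by the proposed typanalyze function; fall back to the
     * current hard-wired default when the lexeme was not frequent enough
     * to have been recorded.
     */
    static double
    lexeme_selectivity(const char *lexeme,
                       const char *const *mcv_lexemes,  /* stavalues1 analogue */
                       const double *mcv_freqs,         /* stanumbers1 analogue */
                       int nmcv)
    {
        int i;

        for (i = 0; i < nmcv; i++)
        {
            if (strcmp(lexeme, mcv_lexemes[i]) == 0)
                return mcv_freqs[i];    /* fraction of rows containing it */
        }
        return DEFAULT_TS_SEL;          /* unknown lexeme: assume it's rare */
    }

    int
    main(void)
    {
        const char *const lexemes[] = { "dog", "cat" };
        const double freqs[] = { 0.7, 0.5 };

        printf("dog     -> %.3f\n", lexeme_selectivity("dog", lexemes, freqs, 2));
        printf("axolotl -> %.3f\n", lexeme_selectivity("axolotl", lexemes, freqs, 2));
        return 0;
    }

Note the asymmetry this buys: a very popular word gets an estimate near its true frequency, while anything outside the MCV list degrades to exactly today's behaviour.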
On Sat, 8 Mar 2008, Jan Urbański wrote:

> Oleg Bartunov wrote:
>> Jan,
>>
>> the problem is known and a fix is frequently requested. From your
>> description, though, it's not clear what the actual idea is?
>>> Tom Lane wrote:
>>>> Jan Urbański <j.urbanski@students.mimuw.edu.pl> writes:
>>>>> 2. Implement better selectivity estimates for FTS.
>
> OK, after reading through some of the code, the idea is to write a custom
> typanalyze function for tsvector columns. It could look inside the
> tsvectors, compute the most commonly appearing lexemes, and store that
> information in pg_statistic. Then there should be a custom selectivity
> function for @@ and friends that would look at the lexemes in
> pg_statistic, see if the tsquery it got matches some/any of them, and
> return a result based on that.

Such a function already exists: it's ts_stat(). The problem with ts_stat() is its performance, since it sequentially scans ALL tsvectors. It's possible to write a special function for the tsvector data type which would be used by ANALYZE, but I'm not sure sampling is a good approach here.

The way we could improve the performance of gathering stats using ts_stat() is to process only new documents. It may not be as fast as it looks because of lots of updates, so one needs to think more about it.

> I have a feeling that in many cases identifying the top 50 to 300 lexemes
> would be enough to talk about text search selectivity with a degree of
> confidence. At least we wouldn't give overly low estimates for queries
> looking for very popular words, which I believe is worse than giving an
> overly high estimate for an obscure query (am I wrong here?).

Unfortunately, selectivity estimation for a query is much more difficult than just estimating the frequency of an individual word.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
Oleg Bartunov <oleg@sai.msu.su> writes:
> On Sat, 8 Mar 2008, Jan Urbański wrote:
>> I have a feeling that in many cases identifying the top 50 to 300 lexemes
>> would be enough to talk about text search selectivity with a degree of
>> confidence. At least we wouldn't give overly low estimates for queries
>> looking for very popular words, which I believe is worse than giving an
>> overly high estimate for an obscure query (am I wrong here?).

> Unfortunately, selectivity estimation for a query is much more difficult
> than just estimating the frequency of an individual word.

It'd be an oversimplification, sure, but almost any degree of smarts would be a huge improvement over what we have now ...

			regards, tom lane
Oleg Bartunov wrote:
> On Sat, 8 Mar 2008, Jan Urbański wrote:
>> OK, after reading through some of the code, the idea is to write a
>> custom typanalyze function for tsvector columns. It could look inside
> Such a function already exists: it's ts_stat(). The problem with
> ts_stat() is its performance, since it sequentially scans ALL tsvectors.
> It's possible to write a special function for the tsvector data type
> which would be used by ANALYZE, but I'm not sure sampling is a good
> approach here.

I too was concerned about whether looking through the tsvectors of 100 random documents would be enough to get satisfying results. But maybe it would? I haven't seen many databases in my life, but I have a feeling that in many cases the documents stored in them have many things in common; they use a similar vocabulary, and so on. For example, a database containing the PostgreSQL documentation would have a lot of records with the word 'tuple' or 'column', and a sampling approach should catch at least these most common words.

By the way, does ALTER TABLE SET STATISTICS influence the number of tuples chosen by ANALYZE to do its work, or just the amount of statistics extracted from that sample? It wasn't obvious to me while reading the code.

> The way we could improve the performance of gathering stats using
> ts_stat() is to process only new documents. It may not be as fast as it
> looks because of lots of updates, so one needs to think more about it.

That's the second approach I thought about. Suppose we let the user create a table built by doing an INSERT ... SELECT * FROM ts_stat(...) and associate it somehow with the table containing the indexed documents. We could then give quite precise estimates for text search queries by looking up lexeme frequencies in that table. We could also give the users a trigger function, similar to tsvector_update_trigger, that could be installed for inserts, updates and deletes and would "merge" the changes to a document into this statistics table. Or just cram that data into pg_statistic and make the trigger operate on that?

This looks like a more complicated approach. It would surely give a whole lot better results, but I'm worried about the additional overhead on every non-readonly operation. Anyway, I'm not even sure this approach is sane: if we're to do it for tsvectors (essentially keep an extra table with the computed frequencies of values from a column), then why not do it for integers?

> Unfortunately, selectivity estimation for a query is much more difficult
> than just estimating the frequency of an individual word.

Sure, given something like 'cats & dogs'::tsquery, the frequencies of 'cat' and 'dog' won't suffice. But at least it's a starting point, and if we estimate that 80% of the documents have 'dog' and 70% have 'cat', then we can tell for sure that at least 50% have both, and that's a lot better than the 0.1% that's being returned now.

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin
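The 50% figure is the inclusion-exclusion lower bound: rows can escape the conjunction only by lacking 'dog' (20% of them) or lacking 'cat' (30%), so at most half the table is excluded. A tiny sketch contrasting that guaranteed bound with the independence assumption the planner ordinarily makes when combining ANDed clauses:

    #include <stdio.h>

    /*
     * Lower bound on the fraction of rows matching (A & B), given the
     * fractions matching A and B individually.
     */
    static double
    conjunction_lower_bound(double pa, double pb)
    {
        double bound = pa + pb - 1.0;

        return bound > 0.0 ? bound : 0.0;
    }

    int
    main(void)
    {
        double dog = 0.8, cat = 0.7;

        printf("guaranteed minimum: %.2f\n", conjunction_lower_bound(dog, cat));
        printf("independence guess: %.2f\n", dog * cat);
        return 0;
    }

Either combination rule (0.50 or 0.56 here) is off by at most a factor of two in this example, versus the factor of several hundred that the flat 0.001 default produces.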
On Sat, 8 Mar 2008, Tom Lane wrote:

> Oleg Bartunov <oleg@sai.msu.su> writes:
>> On Sat, 8 Mar 2008, Jan Urbański wrote:
>>> I have a feeling that in many cases identifying the top 50 to 300
>>> lexemes would be enough to talk about text search selectivity with a
>>> degree of confidence. At least we wouldn't give overly low estimates
>>> for queries looking for very popular words, which I believe is worse
>>> than giving an overly high estimate for an obscure query (am I wrong
>>> here?).
>
>> Unfortunately, selectivity estimation for a query is much more difficult
>> than just estimating the frequency of an individual word.
>
> It'd be an oversimplification, sure, but almost any degree of smarts
> would be a huge improvement over what we have now ...

Yes, given that the most popular queries are just one word long.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
On Sat, 8 Mar 2008, Jan Urbański wrote:

>> Unfortunately, selectivity estimation for a query is much more difficult
>> than just estimating the frequency of an individual word.
>
> Sure, given something like 'cats & dogs'::tsquery, the frequencies of
> 'cat' and 'dog' won't suffice. But at least it's a starting point, and if
> we estimate that 80% of the documents have 'dog' and 70% have 'cat', then
> we can tell for sure that at least 50% have both, and that's a lot better
> than the 0.1% that's being returned now.

Certainly yes, and given that the most popular queries are single-word queries, this would be very helpful in most cases. The reason I thought about improving ts_stat() is that we could use its statistics for the incomplete-search feature people have requested, where an AND query like (a & b & c) gets rewritten into a set of AND|OR queries depending on how frequently the terms occur.

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
OK, here's a more detailed description of the FTS selectivity improvement idea:

=== Write a typanalyze function for columns of type tsvector ===

The function would go through the tuples returned by the BlockSampler and compute the number of times each distinct lexeme appears inside the tsvectors of those tuples. It would then store the most common lexemes, along with their frequencies, in pg_statistic. This will likely require adding a new STATISTIC_KIND_* constant, as with tsvector statistics we won't store the most common values of the tsvector column based on some kind of equality operator, but rather the most common lexemes appearing in those tsvectors. The frequencies will be the fraction of total rows that contain a particular lexeme in the tsvector column being analyzed. The most frequent lexemes would be stored as text[] in stavalues1 and their frequencies as real[] in stanumbers1.

XXXs:

- Will looking for the most common lexemes in just a few sample rows be enough to do useful selectivity estimation? Maybe the minimum number of rows returned by the sampler should be raised for this kind of stat-gathering. Or maybe it should be documented that it is advisable to SET STATISTICS for a tsvector column to something fairly large if you are to get good results from the planner.

- There are typically very few deletes, or none at all, in tables storing indexed documents. This means that when we're regularly sampling rows and computing the most common lexemes, maybe we shouldn't throw away the previous results? Maybe it would be smart to "merge" the previous results with the freshly obtained ones? If we assume no deletes were made between ANALYZE runs, we could do the maths and base the MCV and frequency estimates on that assumption and the previous results.

- Right now there seems to be some duplicate code in compute_minimal_stats and compute_scalar_stats; maybe this could be cleaned up as a side effect. The custom typanalyze function would also need to estimate the number of nonnull entries and the average width of the column; perhaps these could be made into separate functions and called from all three places (compute_minimal_stats, compute_scalar_stats, tsvector_typanalyze).

- Maybe there are other interesting statistics we could collect for tsvectors, something fancier than just the most common lexemes?

=== Write a selectivity estimation function for the @@ operator ===

The function would look at the tsquery and the statistics gathered by the function described earlier, and return a selectivity estimate based on them. For example, given

    SELECT * FROM documents WHERE doc_vector @@ to_tsquery('dog')

if the lexeme 'dog' appears among the MCVs of doc_vector and has a frequency of 0.7, we would get an estimate of 0.7 of the rows returned. Of course this is a very simple example; it'll be much harder than this. First, the function would have to walk the tsquery and take its structure into consideration. For example,

    SELECT * FROM documents WHERE doc_vector @@ to_tsquery('!dog')

would have to return an estimate of 0.3 (or something more subtle than just 1 - 0.7?). The same goes for the other operators, like & and |. If no lexemes from the tsquery are among the MCVs, we return an arbitrary 0.001, as is done currently for all queries.

=== Deploy these functions ===

This could at first be deployed as a contrib module that would define tsvector_typanalyze (or maybe ts_typanalyze, to be consistent with the other ts_* functions) and tsvectorsel, and update pg_operator and pg_type so that tsvector columns would be ANALYZEd and @@ restricted with the new method.
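Returning to the typanalyze step above, the counting could look roughly like the following. This is a self-contained sketch, not backend code: the NULL-terminated string arrays stand in for sampled tsvector datums, the linear scan stands in for a proper hash table, and a real implementation would finish by sorting on ndocs and keeping only the statistics-target top entries for stavalues1/stanumbers1.

    #include <stdio.h>
    #include <string.h>

    #define MAX_LEXEMES 1024    /* toy bound; the backend would use a hash table */

    struct lexeme_count
    {
        const char *lexeme;
        int         ndocs;      /* sampled rows whose tsvector contains it */
    };

    /*
     * Count document frequencies over a sample.  Each row is modelled as
     * a NULL-terminated array of distinct lexemes, standing in for a
     * tsvector.
     */
    static int
    count_lexemes(const char *const **rows, int nrows,
                  struct lexeme_count *counts)
    {
        int ncounts = 0;
        int r;

        for (r = 0; r < nrows; r++)
        {
            const char *const *lex;

            for (lex = rows[r]; *lex != NULL; lex++)
            {
                int i;

                /* linear probe; a real implementation would hash the lexeme */
                for (i = 0;
                     i < ncounts && strcmp(counts[i].lexeme, *lex) != 0;
                     i++)
                    ;
                if (i == ncounts)
                {
                    if (ncounts == MAX_LEXEMES)
                        continue;   /* toy overflow handling: drop the lexeme */
                    counts[ncounts].lexeme = *lex;
                    counts[ncounts].ndocs = 0;
                    ncounts++;
                }
                counts[i].ndocs++;
            }
        }
        return ncounts;
    }

    int
    main(void)
    {
        /* three sampled "documents", lexemes already deduplicated per row */
        const char *const doc1[] = { "dog", "cat", NULL };
        const char *const doc2[] = { "dog", NULL };
        const char *const doc3[] = { "dog", "tuple", NULL };
        const char *const *rows[] = { doc1, doc2, doc3 };
        struct lexeme_count counts[MAX_LEXEMES];
        int i, n;

        n = count_lexemes(rows, 3, counts);
        /* ndocs / nrows is the frequency that would go into stanumbers1 */
        for (i = 0; i < n; i++)
            printf("%-8s %.2f\n", counts[i].lexeme, counts[i].ndocs / 3.0);
        return 0;
    }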
So much for the idea, but I might very well have missed some crucial things that'd have to be done in order to pull this off.

Comments, suggestions, criticism?

Regards,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2
ouden estin