Re: many tables in db

Поиск
Список
Период
Сортировка
От Kenneth Been
Тема Re: many tables in db
Дата
Msg-id 3B57547F.4010708@telocity.com
обсуждение исходный текст
Ответ на many tables in db  (Kenneth Been <kennethb@telocity.com>)
Ответы Re: many tables in db  (Tom Lane <tgl@sss.pgh.pa.us>)
Список pgsql-general
>>I am considering a db schema with many tables (in the
>>thousands).
>>
>
> Just out of curiosity, why would you need so many tables?  Couldn't you
> represent the same info in fewer tables, maybe with more key columns?
> If every table has a different set of columns, that wouldn't work, but
> I'm wondering what the organizing principle really is.


I would love to just use one table, but I am worried about
performance, both in building the db and in querying it.
(Once built, the db is static.)  So each table contains a
subset of the rows that would be in the hypothetical single
table.

Specifically, I am making an interactive map server for the
web.  So the tables store the geographic information, lines
and polygons, indexed by the (latitude,longitude) range of
each feature (geometric type box).  So a one table solution
would have a query like

select stuff from lines where range && query_range

which would be replaced in the multi-table solution with

(select stuff from lines_43 where range && query_range)
union
(select stuff from lines_46 where range && query_range)

I can easily store in memory the range (bounding box) for
each table, so no disk accesses are required to know which
tables need to be searched.

>>2. One of the reasons I am considering breaking the data
>>into multiple tables is that I want to cluster a large table
>>(on an rtree index)
>>
>
> This isn't really relevant to your main point, but: since an rtree
> doesn't have an associated sort order, it's not clear to me that this
> operation makes any sense.  Have you determined that you'll actually get
> any performance improvement as a result of the clustering?  I suspect
> you may find that you're just rearranging the table into a different
> random order.


The rtree doesn't have an exact order, but things that are
closer to each other in the tree (on the same branch) will
generally be closer to each other in the geography.  The
documentation for the cluster command says that the table is
copied to a temporary order "in index order" and then
renamed.  If this is really what is happening, then it
should put things that are close to each other in geography
close to each other on disk.

I haven't done any rigorous experiments, but I've eyeballed
it and I think the clustering helps.  But you're right, I
should do some more serious experiments to be sure.

>>3. Another reason for splitting up the data is that I am
>>worried about the size of the rtree.  I have some suspicion
>>that the rtree size grows more than linearly with the size
>>of the table.  Is that the case?
>>
>
> Dunno.  The rtree code isn't very well maintained, because none of
> the current crop of developers uses it or knows much about it.
> It could well have some bugs that cause it to waste space.  (I realized
> only a few days ago that it probably behaves completely horribly when
> asked to store NULLs, for example.  Do you have any NULLs in the column
> you're indexing?)


No, no nulls.

>>4. Another possible reason is query speed.  Any guesses on
>>which would be faster, two or three queries on tables of
>>about 40K records, or one query on a table of about 10M
>>records?  (All queries are geometric range searches on an
>>rtree index.)
>>
>
> Given the size difference, I'd suppose that the 10M table would actually
> be the equivalent of a materialized join of the smaller tables?  I'd
> guess that recomputing the join on the fly is better because it takes
> so much less I/O, but that's just a guess.  You'd be well advised to
> test both ways and see.

I'm not sure what you mean by "materialized join".  The big
table would be the union of the smaller tables.

Again, I'm sure you're right, I just need to try it.  But in
this case it is harder to try, because in order to build the
  big table, and cluster it, I have to get past that problem
of running out of disk space during the cluster!  I'll do
the experiment to see if clustering does anything first.

Thanks.

Ken


В списке pgsql-general по дате отправления:

Предыдущее
От: "Keith G. Murphy"
Дата:
Сообщение: Interesting discussion
Следующее
От: missive@frontiernet.net (Lee Harr)
Дата:
Сообщение: Re: Client connection problem...