Discussion: Complex database for testing, U.S. Census Tiger/UA
The U.S. Census provides a database of street polygons and other data about landmarks, elevation, etc. This was discussed in a separate thread.

The main URL is here:
http://www.census.gov/geo/www/tiger/index.html

My loader was written for the 2000 version; the 2002 version has some differences, but it should be easy enough to add the fields. On my site, in the downloads section, at the bottom is the tigerua loader. It is very raw, just hacked together to load the data. It may take a little work to function with 2002 files; I have not looked at that yet.

My site: http://www.mohawksoft.com
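[Editor's note: the loader described above boils down to slicing fixed-width TIGER/Line records into fields and loading them. A minimal sketch of that parsing step follows; the field names and byte offsets here are illustrative placeholders, not the real Record Type layout, which is in the Census documentation.]

```python
# Minimal sketch of a fixed-width TIGER/Line-style record parser.
# The field names and offsets below are ILLUSTRATIVE ONLY -- consult
# the Census Record Type documentation for the actual layout.
FIELDS = {
    "tlid":   (0, 10),   # hypothetical line-ID slice
    "fename": (10, 30),  # hypothetical feature-name slice
    "state":  (30, 32),  # hypothetical state-FIPS slice
}

def parse_record(line):
    """Slice one fixed-width record into a dict of stripped fields."""
    return {name: line[start:end].strip()
            for name, (start, end) in FIELDS.items()}

# A fabricated sample record matching the illustrative layout above:
sample = "0000012345" + "Main St".ljust(20) + "25"
rec = parse_record(sample)
print(rec)  # {'tlid': '0000012345', 'fename': 'Main St', 'state': '25'}
```

From here, each parsed dict can be written out as a COPY-format row for bulk loading.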
mlw wrote:
> The U.S. Census provides a database of street polygons and other data
> about landmarks, elevation, etc. This was discussed in a separate thread.
>
> The main URL is here:
> http://www.census.gov/geo/www/tiger/index.html

While yes, the TIGER database (or rather its content) is interesting, I don't think that it can be counted as a "complex database". Just because something is big doesn't make it complex.

> My loader was written for the 2000 version; the 2002 version has some
> differences, but it should be easy enough to add the fields.

OT: Just out of curiosity, do you plan more on this? I was playing around with the 2000 version a while back, but the Garmin GPS units unfortunately use a proprietary map format, so one cannot generate one's own detail maps for download. The waypoint and route data protocol is well known, though.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck wrote:
> While yes, the TIGER database (or rather its content) is interesting,
> I don't think that it can be counted as a "complex database". Just
> because something is big doesn't make it complex.

I guess you are right, but there are a lot of related tables. I wouldn't call it simple, though. It can get huge, however.

> OT: Just out of curiosity, do you plan more on this? I was playing
> around with the 2000 version a while back, but the Garmin GPS units
> unfortunately use a proprietary map format, so one cannot generate
> one's own detail maps for download. The waypoint and route data
> protocol is well known, though.

I'm not sure what a Garmin GPS unit is, but the TigerUA DB uses longitude and latitude. Any reasonable geographical system must somehow map to lat/long.

Actually, I am going to download the latest version and get it installed on a system. There is a project I plan to work on in the near future, after all the other stuff I have to do, that will make use of the data.
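[Editor's note: since the point above is that everything ultimately maps to lat/long, the standard computation over such data is great-circle distance. A small haversine sketch, with the city coordinates below being approximate illustration values:]

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Roughly Boston to New York (approximate coordinates):
print(round(haversine_km(42.36, -71.06, 40.71, -74.01)), "km")
```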
Jan Wieck wrote:
> While yes, the TIGER database (or rather its content) is interesting,
> I don't think that it can be counted as a "complex database". Just
> because something is big doesn't make it complex.

Just so.

There are doubtless interesting cases that may be tested by virtue of having a data set that is large, and perhaps "deeply interlinked."

But that only covers cases that have to do with "largeness." It doesn't help ensure that PostgreSQL plays well when it gets hit by nested sets of updates, where the challenges involve ensuring the system performs well and does not deadlock when hit by complex sets of transactions.

So an "interesting" test might involve not only a database, but also a set of transactions hitting multiple tables to update that database. In effect, something like the "readers/writers" workloads that get used to test locking semantics.

This is something that could not consist solely of a set of tables; it would have to include streams of updates. Something like one of the TPC benchmarks...

--
output = reverse("moc.enworbbc@" "enworbbc")
http://www3.sympatico.ca/cbbrowne/rdbms.html
"If I could find a way to get [Saddam Hussein] out of there, even
putting a contract out on him, if the CIA still did that sort of a
thing, assuming it ever did, I would be for it." -- Richard M. Nixon
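[Editor's note: the "readers/writers" stream of updates described above can be sketched as a toy workload driver. This is only the shape of such a driver, using an in-process lock as a stand-in for row locks; a real test would issue equivalent statements against PostgreSQL from each thread.]

```python
import threading

# Toy readers/writers workload: writer threads update a shared balance
# table while a reader polls it. The Lock stands in for row-level locks.
balances = {"acct_a": 0, "acct_b": 0}
lock = threading.Lock()

def writer(n):
    for _ in range(n):
        with lock:                   # serialize conflicting updates
            balances["acct_a"] += 1
            balances["acct_b"] -= 1  # the total is invariant (always 0)

def reader(snapshots):
    for _ in range(100):
        with lock:                   # each read sees a consistent pair
            snapshots.append(balances["acct_a"] + balances["acct_b"])

snaps = []
threads = [threading.Thread(target=writer, args=(1000,)) for _ in range(4)]
threads.append(threading.Thread(target=reader, args=(snaps,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balances)  # {'acct_a': 4000, 'acct_b': -4000}
```

The interesting assertion is that no reader ever observes a half-applied update, which is exactly the locking semantics such a workload exercises.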
Around 11:24 on Apr 8, 2003, cbbrowne@cbbrowne.com said:
# This is something that would not be able to solely consist of a set
# of tables; it would have to include streams of updates. Something
# like one of the TPC benchmarks...

I think it was my first application written in Python that parsed the zip files containing these data and shoved them into a postgres system. I had multiple clients on four or five computers running nonstop for about two weeks to get it all populated.

By the time I was done and got my first index created, I began to run out of disk space. I think I only had about 70GB to work with on the RAID array.
--
SPY                      My girlfriend asked me which one I like better.
pub  1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
|    Key fingerprint =  87 02 57 08 02 D0 DA D6  C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________
Dustin Sallings wrote:
> I think it was my first application written in Python that parsed the
> zip files containing these data and shoved them into a postgres
> system. I had multiple clients on four or five computers running
> nonstop for about two weeks to get it all populated.
>
> By the time I was done and got my first index created, I began to run
> out of disk space. I think I only had about 70GB to work with on the
> RAID array.

But this does not establish that this data represents a meaningful "transactional" load. Based on the sources, which presumably involve unique data, the "transactions" are all touching independent sets of data, and are likely to be totally uninteresting from the perspective of seeing how the system works under /TRANSACTION/ load.

Transactional loading involves doing updates that actually have some opportunity to trample on one another. Multiple transactions concurrently updating a single balance table. Multiple transactions concurrently trying to attach links to a table entry. That sort of thing.

I remember a while back when MSFT did an "enterprise scalability day," where they were trumpeting SQL Server performance on "hundreds of millions of transactions." At the time, I was at Sabre, who actually do tens of millions of transactions per day for passenger reservations across lots of airlines. Microsoft was making loud noises to the effect that NT Server was wonderful for "enterprise transaction" work; the guys at work just laughed, because the kind of performance they got involved considerable amounts of 370 assembler to tune vital bits of the systems.

What happened in the "scalability tests" was that Microsoft did much the same thing you did; they had hordes of transactions going through that were, well, basically independent of one another. They could "scale" things up trivially by adding extra boxes. Need to handle 10x the transactions? Well, since they don't actually modify any shared resources, you just need to put in 10x as many servers.

And that's essentially what happens any time TPC-? benchmarks reach the point of irrelevance; it happens every time someone figures out some "hack" that successfully partitions the workload. At that point, they merely need to add a bit of extra hardware, and increasing performance is as easy as adding extra processor boards.

The real world doesn't scale so easily...

--
(concatenate 'string "cbbrowne" "@acm.org")
http://cbbrowne.com/info/emacs.html
Send messages calling for fonts not available to the recipient(s).
This can (in the case of Zmail) totally disable the user's machine
and mail system for up to a whole day in some circumstances.
-- from the Symbolics Guidelines for Sending Mail
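[Editor's note: the contended case described above, where multiple transfers touch the SAME balances in opposite directions, is also the classic deadlock scenario. A toy sketch of the usual fix, acquiring the per-row locks in a fixed global order; this is a workload-shape illustration, not a real benchmark driver.]

```python
import threading

# Two transfer threads hammer the same two balances in opposite
# directions. Taking the per-row locks in sorted (fixed) order keeps
# this contended workload from deadlocking.
locks = {"acct_a": threading.Lock(), "acct_b": threading.Lock()}
balances = {"acct_a": 100, "acct_b": 100}

def transfer(src, dst, amount, times):
    for _ in range(times):
        first, second = sorted([src, dst])  # fixed global lock order
        with locks[first], locks[second]:
            balances[src] -= amount
            balances[dst] += amount

t1 = threading.Thread(target=transfer, args=("acct_a", "acct_b", 1, 5000))
t2 = threading.Thread(target=transfer, args=("acct_b", "acct_a", 1, 5000))
t1.start(); t2.start()
t1.join(); t2.join()
print(balances)  # money conserved: {'acct_a': 100, 'acct_b': 100}
```

Had each thread locked `src` before `dst` instead, the two opposing transfers could block each other forever, which is exactly the behavior a "complex transactional" test should be probing for.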
MLW,

> The U.S. Census provides a database of street polygons and other data
> about landmarks, elevation, etc. This was discussed in a separate
> thread.

Yeah, this was me. We decided to go with the FCC database because it is more manageably sized and has extensive schema documentation.

Personally, I'd be happy to see someone put together a "huge table" test using the Tiger database, but for general tests we're aiming more at the 50-100 MB size.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
Josh Berkus wrote:
> Personally, I'd be happy to see someone put together a "huge table"
> test using the Tiger database, but for general tests we're aiming
> more at the 50-100 MB size.

The Tiger US street level data would be an excellent test of the polygon storage and extraction routines. My information might no longer be current, but the last time I checked, Tiger gave the street level data out on CD (er, CDs) as one huge table of disconnected road 'segments' broken up by state. Connecting the segments into longer streets for more meaningful processing is a good benchmarking procedure.

I think this is interesting strictly on that level. It doesn't test the optimizer or esoteric features much (except for geo features), but it is a good test of index/cache/random tuple access. It's typical of the scientific/data-processing problem domain that is much less common (but much more interesting!) than your average business-based app. I definitely understand mlw's thinking.

That being said, since major competitors lack the robust geo types/indices of postgres (the only way, IMHO, to do this type of thing), it wouldn't be a very fair test. I would like to point out that the problem is scalable by picking one state, e.g. Rhode Island :), and building off that. One thing at a time, though.

It's no accident we have a 'PostGIS' and not a 'MyGIS' :)

Merlin
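[Editor's note: the segment-connection exercise Merlin describes can be sketched in a few lines. The tuple format here is hypothetical, not the real TIGER schema; the point is just the chaining of segments whose endpoints meet.]

```python
from collections import defaultdict

# Group disconnected road segments by street name and greedily join
# ones whose endpoints meet. Segment format (hypothetical, NOT the
# real TIGER schema): (street_name, start_point, end_point).
segments = [
    ("Main St", (0, 0), (1, 0)),
    ("Main St", (1, 0), (2, 1)),
    ("Main St", (5, 5), (6, 5)),  # disconnected piece of the same street
    ("Oak Ave", (0, 1), (0, 2)),
]

def chain_segments(segs):
    """Greedily join segments that share endpoints into polylines."""
    by_street = defaultdict(list)
    for name, a, b in segs:
        by_street[name].append([a, b])
    out = {}
    for name, pieces in by_street.items():
        chains = []
        for piece in pieces:
            for chain in chains:
                if chain[-1] == piece[0]:  # extend an existing chain
                    chain.extend(piece[1:])
                    break
            else:
                chains.append(piece)       # start a new chain
        out[name] = chains
    return out

chains = chain_segments(segments)
print(chains["Main St"])  # one joined polyline plus one isolated piece
```

As a benchmark, the same join would be expressed as repeated index lookups on endpoint columns, which is precisely the index/cache/random-access pattern described above.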