Discussion: Efficiency vs. code bloat for SELECT wrappers
Greetings. I am working on a PostgreSQL-backed mod_perl web application that's just in its infancy.

Let's say I have a users table that holds about 15 columns of data about each user.

If I write one Perl sub for each operation on the table (e.g. one that gets the username and password hash, another that gets the last name and first name, etc.), there will be a whole lot of subs, each of which performs one very specific task.

If I write one larger Perl sub that grabs the whole row, and then I deal with the contents of the row in Perl, ignoring columns as I please, it will require fewer subs and, in turn, imply cleaner code.

My concern is that I don't know what efficiency I would be forfeiting on the PostgreSQL side of the application by always querying entire rows if my transaction occurs entirely within a single table. Of course, I would want to handle more complicated JOINs and other more intensive operations on the PostgreSQL side.

I don't think the application will ever store a tuple larger than about 512 bytes in any table, so network speed wouldn't really come into play.

Thanks.

Colin
Colin Wetherbee <cww@denterprises.org> writes:
> Let's say I have a users table that holds about 15 columns of data about
> each user.
>
> If I write one Perl sub for each operation on the table (e.g. one that
> gets the username and password hash, another that gets the last name and
> first name, etc.), there will be a whole lot of subs, each of which
> performs one very specific task.
>
> If I write one larger Perl sub that grabs the whole row, and then I deal
> with the contents of the row in Perl, ignoring columns as I please, it
> will require fewer subs and, in turn, imply cleaner code.
>
> My concern is that I don't know what efficiency I would be forfeiting on
> the PostgreSQL side of the application by always querying entire rows if
> my transaction occurs entirely within a single table.

Not nearly as much as you would lose anytime you perform two independent queries to fetch different fields of the same row. What you really need to worry about here is making sure you only fetch the row once, regardless of which field(s) you want out of it. It's not clear to me whether your second design concept handles that, but if it does then I think it'd be fine.

The only case where custom field sets might be important is if you have fields that are wide enough to potentially get TOASTed (i.e. more than a kilobyte or so apiece). Then it'd be worth the trouble to not fetch those when you don't need them. But that apparently isn't the case with this table.

regards, tom lane
Tom Lane wrote:
> Colin Wetherbee <cww@denterprises.org> writes:
>> Let's say I have a users table that holds about 15 columns of data about
>> each user.
>>
>> If I write one Perl sub for each operation on the table (e.g. one that
>> gets the username and password hash, another that gets the last name and
>> first name, etc.), there will be a whole lot of subs, each of which
>> performs one very specific task.
>>
>> If I write one larger Perl sub that grabs the whole row, and then I deal
>> with the contents of the row in Perl, ignoring columns as I please, it
>> will require fewer subs and, in turn, imply cleaner code.
>>
>> My concern is that I don't know what efficiency I would be forfeiting on
>> the PostgreSQL side of the application by always querying entire rows if
>> my transaction occurs entirely within a single table.
>
> Not nearly as much as you would lose anytime you perform two independent
> queries to fetch different fields of the same row. What you really need
> to worry about here is making sure you only fetch the row once,
> regardless of which field(s) you want out of it. It's not clear to me
> whether your second design concept handles that, but if it does then
> I think it'd be fine.

Yes, the second design concept would handle that.

> The only case where custom field sets might be important is if you have
> fields that are wide enough to potentially get TOASTed (i.e. more than a
> kilobyte or so apiece). Then it'd be worth the trouble to not fetch
> those when you don't need them. But that apparently isn't the case
> with this table.

Sounds good. Thanks.

Colin
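[Editor's note: Tom's "fetch the row once, whatever fields you later need" advice can be sketched in a few lines. The following is an illustrative sketch in Python with the standard sqlite3 module rather than Perl/DBI; the schema, the `get_user` helper, and the cache are all hypothetical stand-ins for the 15-column users table discussed above.]

```python
import sqlite3

# Hypothetical miniature of the users table from the thread.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows row["column_name"] access
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, pw_hash TEXT, "
             "first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO users VALUES ('colin', 'abc123', 'Colin', 'Wetherbee')")

_row_cache = {}
fetch_count = 0  # instrumentation: counts actual SELECTs issued

def get_user(username):
    """Fetch the whole row once and cache it; later field lookups are free."""
    global fetch_count
    if username not in _row_cache:
        fetch_count += 1
        cur = conn.execute("SELECT * FROM users WHERE username = ?", (username,))
        _row_cache[username] = cur.fetchone()
    return _row_cache[username]

# Two logically separate "operations" share one SELECT.
auth = (get_user("colin")["username"], get_user("colin")["pw_hash"])
name = (get_user("colin")["first_name"], get_user("colin")["last_name"])
print(fetch_count)  # 1
```

The point is the cache, not the SQL: however many field-oriented accessors sit on top, the row is read from the database exactly once per request.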
Sam Mason wrote:
> On Sun, Dec 16, 2007 at 06:31:56PM -0500, Colin Wetherbee wrote:
>> If I write one Perl sub for each operation on the table (e.g. one that
>> gets the username and password hash, another that gets the last name and
>> first name, etc.), there will be a whole lot of subs, each of which
>> performs one very specific task.
>>
>> If I write one larger Perl sub that grabs the whole row, and then I deal
>> with the contents of the row in Perl, ignoring columns as I please, it
>> will require fewer subs and, in turn, imply cleaner code.
>
> It sounds as though you're just treating the database as a relatively
> dumb datastore. They can be used as this, and Tom's comments are as
> always good, but relational databases come into their own when you're
> writing more complex queries. When I'm writing my code I tend to put
> the SQL statements directly in with the rest of the code; abstracting
> away from the database tends to make things more complicated than they
> need to be.

Because I know Perl a whole lot better than SQL, PostgreSQL, and even the Perl DBI, I'm always inclined to wrap the database stuff in a nice little package and forget about it. This method has worked well for me in the past, but the project I'm starting is much bigger in terms of database use than anything else I've written.

You say you write SQL directly in your application code, rather than writing task-oriented wrappers. I like that idea, but I can see it getting really messy. What is your opinion on how it affects code maintenance and things like that?

Thanks.

Colin
On Mon, Dec 17, 2007 at 12:49:46PM -0500, Colin Wetherbee wrote:
> Because I know Perl a whole lot better than SQL, PostgreSQL, and even
> the Perl DBI, I'm always inclined to wrap the database stuff in a nice
> little package and forget about it. This method has worked well for me
> in the past, but the project I'm starting is much bigger in terms of
> database use than anything else I've written.

SQL allows you to express some sorts of problems very succinctly and efficiently; the problem is that you need some experience with SQL before knowing when it is best to use imperative vs. relational tools. Without knowing what sort of thing you're trying to do, I'm unable to give any appropriate examples.

I'd definitely say that spending time getting to know databases has made me much more productive in total, mainly because with relational databases you just don't have to worry about lots of the mostly distracting details that you do in imperative languages.

> You say you write SQL directly in your application code, rather than
> writing task-oriented wrappers. I like that idea, but I can see it
> getting really messy. What is your opinion on how it affects code
> maintenance and things like that?

It's normally very manageable, but it'll depend on where you like to put your machinery. I seem to be relying more and more on (database) server-side code, and most of the UI code just tends to do some initial simple input validation and then hand all the data over to the database. My UI code then takes any results back from the database and feeds them into the appropriate places.

I'd guess that maybe 5 to 10% of the lines in my programs contain some SQL, i.e. two or three lines of SQL for every twenty or thirty lines. I don't think I'd save many lines in total, if any, by putting some database abstraction layer in the middle, and I can say with reasonable certainty that it would be much more complicated to keep everything working as efficiently.

Luckily I've been able to design most of the programs I work on as relatively simple layers over a database; I'm not sure if you're able to work like this.

Sam
--- Colin Wetherbee <cww@denterprises.org> wrote:
> Sam Mason wrote:
>> On Sun, Dec 16, 2007 at 06:31:56PM -0500, Colin Wetherbee wrote:
>>> If I write one Perl sub for each operation on the table (e.g. one that
>>> gets the username and password hash, another that gets the last name and
>>> first name, etc.), there will be a whole lot of subs, each of which
>>> performs one very specific task.

Right. The first rule of software engineering is to keep functions as small as possible, focused on one thing wherever practicable. It doesn't matter whether the language is Perl or C++ or Java, or a stored procedure in an RDBMS. One can always create additional driver functions that use the elemental simple functions to do more complex tasks (bearing in mind the complexities that will inevitably arise in multi-user situations).

>>> If I write one larger Perl sub that grabs the whole row, and then I deal
>>> with the contents of the row in Perl, ignoring columns as I please, it
>>> will require fewer subs and, in turn, imply cleaner code.

Define "cleaner code." The more data, and the more complex that data, the more code you have to write, regardless of whether that is in one function or several. Either way, done badly, it can be a problem for both maintenance and performance.

>> It sounds as though you're just treating the database as a relatively
>> dumb datastore. They can be used as this, and Tom's comments are as
>> always good, but relational databases come into their own when you're
>> writing more complex queries. When I'm writing my code I tend to put
>> the SQL statements directly in with the rest of the code; abstracting
>> away from the database tends to make things more complicated than they
>> need to be.
>
> Because I know Perl a whole lot better than SQL, PostgreSQL, and even
> the Perl DBI, I'm always inclined to wrap the database stuff in a nice
> little package and forget about it. This method has worked well for me
> in the past, but the project I'm starting is much bigger in terms of
> database use than anything else I've written.

I routinely keep my SQL code distinct from my Perl, Java, or C++ code. When a client program needs to do something with the database, either a child process executes a script I have written (if the client program doesn't need to do anything with data drawn from the database), or I have all the SQL code in one or more stored procedures and use the appropriate client interface to invoke the stored procedure(s). Whether the SQL is in a specific script or in a stored procedure, my SQL code is kept distinct from the client code, regardless of the language I have used for that. I find this even MORE useful as my projects get bigger.

> You say you write SQL directly in your application code, rather than
> writing task-oriented wrappers. I like that idea, but I can see it
> getting really messy. What is your opinion on how it affects code
> maintenance and things like that?

From what I have seen, even in small, almost trivial client programs, this gets messy real quick. I, therefore, hate the idea of mixing SQL in with client code (it is all, in my view, application code). I like the use of middleware objects since, if well designed, they can make developing the required code all the simpler. It is only when badly done that an abstraction leads to complicated code that is a nightmare to maintain; worse if it is inadequately documented.

The whole purpose of abstraction, whether one is using object-oriented development of middleware, or generic programming, or structured programming, or functional programming, is to analyze a complex problem into more manageable parts. The parts themselves become easier to code (witness Java beans of various kinds, such as backing beans: my beans get their data either from the database or from the user interface, and in either case they make interaction between a web-based interface and the database back end MUCH simpler to code), and the relationships among the parts are easier to understand.

Each kind of abstraction has its place. It is up to the analyst or architect to figure out how many layers and what abstractions are appropriate for a given project.

HTH

Ted
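[Editor's note: Ted's "keep the SQL distinct from the client code" approach can be sketched as a thin data-access layer that is the only place SQL appears, with the rest of the program calling named methods. This is an illustrative sketch in Python with the stdlib sqlite3 module; the `UserStore` class and its method names are hypothetical, not from the thread.]

```python
import sqlite3

# --- data-access layer: the ONLY place that contains SQL -----------------
class UserStore:
    """Task-oriented wrapper; client code never sees a SQL string."""

    def __init__(self, conn):
        self.conn = conn

    def add_user(self, username, pw_hash):
        self.conn.execute(
            "INSERT INTO users (username, pw_hash) VALUES (?, ?)",
            (username, pw_hash))

    def password_hash(self, username):
        cur = self.conn.execute(
            "SELECT pw_hash FROM users WHERE username = ?", (username,))
        row = cur.fetchone()
        return row[0] if row else None

# --- client code: no SQL in sight ----------------------------------------
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, pw_hash TEXT)")
store = UserStore(conn)
store.add_user("colin", "abc123")
print(store.password_hash("colin"))  # abc123
```

Whether the bodies of those methods are inline SQL (Sam's style) or calls to stored procedures (Ted's style) is invisible to the client code, which is exactly the maintenance benefit being argued for.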
Sam Mason wrote:
> Luckily I've been able to design most of the programs I work on as
> relatively simple layers over a database; I'm not sure if you're able to
> work like this.

I'm not at liberty to divulge much of the application concept, but consider, if you will, an application like Gmail or any other web-based mail service.

I'm venturing a guess that the database code in Gmail scripts isn't overly complicated, and there are probably a handful of "task categories" that get executed by the application. Folder operations might be lumped into one category, and SMTP operations into another, for example. Each category probably has a few variations, like retrieving an email with or without full headers.

Overall, though, I would wager that the front-end, UI-type stuff in Gmail is much more complicated than the database code, especially with all the JavaScript it uses (though I suspect most of that is relatively static code).

This is roughly the distribution of code I'm implementing: lots of web stuff with only a few database hits per page, most of which are SELECT queries. So, really, I don't think my application would be considered "relatively simple layers over a database", since the UI part will be so full-featured. I doubt I would ever see 5 to 10% of the lines accessing the database in this application; a better estimate would probably be around 1% or 1.5%.

My guess, having written this, is that your approach might be more useful for applications that rely heavily on interaction with a database. I'd appreciate any more comments you have on this, though.

Colin
Ted Byers wrote:
> --- Colin Wetherbee <cww@denterprises.org> wrote:
>> Sam Mason wrote:
>>> On Sun, Dec 16, 2007 at 06:31:56PM -0500, Colin Wetherbee wrote:
>>>> If I write one Perl sub for each operation on the table (e.g. one that
>>>> gets the username and password hash, another that gets the last name and
>>>> first name, etc.), there will be a whole lot of subs, each of which
>>>> performs one very specific task.
>
> Right. The first rule of software engineering is to keep functions as
> small as possible, focused on one thing wherever practicable. It doesn't
> matter whether the language is Perl or C++ or Java, or a stored procedure
> in an RDBMS. One can always create additional driver functions that use
> the elemental simple functions to do more complex tasks (bearing in mind
> the complexities that will inevitably arise in multi-user situations).

I've programmed in this way for years, basically ever since I learned object-oriented programming. I find it "cleaner" to keep functional elements separate and access them sequentially from larger, more broadly focused functions.

>>>> If I write one larger Perl sub that grabs the whole row, and then I deal
>>>> with the contents of the row in Perl, ignoring columns as I please, it
>>>> will require fewer subs and, in turn, imply cleaner code.
>
> Define "cleaner code." The more data, and the more complex that data, the
> more code you have to write, regardless of whether that is in one
> function or several. Either way, done badly, it can be a problem for both
> maintenance and performance.

See above.

> I routinely keep my SQL code distinct from my Perl, Java, or C++ code.
> When a client program needs to do something with the database, either a
> child process executes a script I have written (if the client program
> doesn't need to do anything with data drawn from the database), or I have
> all the SQL code in one or more stored procedures and use the appropriate
> client interface to invoke the stored procedure(s). Whether the SQL is in
> a specific script or in a stored procedure, my SQL code is kept distinct
> from the client code, regardless of the language I have used for that. I
> find this even MORE useful as my projects get bigger.

This seems like quite a departure from Sam's recommendation. Now, I'm torn!

Colin
On Mon, Dec 17, 2007 at 01:29:10PM -0500, Ted Byers wrote:
> I routinely keep my SQL code distinct from my Perl, Java, or C++ code.
> When a client program needs to do something with the database, either a
> child process executes a script I have written (if the client program
> doesn't need to do anything with data drawn from the database), or I have
> all the SQL code in one or more stored procedures and use the appropriate
> client interface to invoke the stored procedure(s). Whether the SQL is in
> a specific script or in a stored procedure, my SQL code is kept distinct
> from the client code, regardless of the language I have used for that. I
> find this even MORE useful as my projects get bigger.

Humm, this discussion is horribly domain specific. Abstractions are the bread and butter of programming, and designing them appropriately makes the difference between things turning into a mess later on or not. The only solid rule I stick to is that when I'm working on a problem I've not solved before, I will get it wrong the first time (that is, the way I go about solving the problem; getting the right answer is generally pretty easy) and probably only get it reasonable the third or fourth time. Luckily, big pieces of software take a while to accumulate, so it's possible to do the naive thing first and then, once you know how things work, find and incorporate better abstractions.

For example, I'd be tempted to replace your "get password hash" with a "verify password" procedure that actually hashes the password client side and then sends it over to the database for actual validation. However, you may actually care what the password hash is, at which point the hashing should be broken out and revealed to the appropriate parts of the software.

> Each kind of abstraction has its place. It is up to the analyst or
> architect to figure out how many layers and what abstractions are
> appropriate for a given project.

Indeed. But to be able to choose a suitable set of abstractions, it helps for the designer to know the ins and outs of the tools being used. I was trying to encourage Colin to look at using databases in a different way; different styles of programming suit different applications, and hiding the database can have detrimental effects as well as positive effects.

Sam
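[Editor's note: Sam's suggestion of exposing "verify password" instead of "get password hash" is about where to draw the abstraction boundary: callers get a yes/no answer and never see the stored hash. A minimal sketch in Python with the stdlib hashlib and sqlite3 modules; the toy unsalted hash and the `verify_password` name are illustrative assumptions, not a production design.]

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, pw_hash TEXT)")

def _hash(password):
    # Toy hash for illustration only; real code should use a salted, slow KDF.
    return hashlib.sha256(password.encode()).hexdigest()

conn.execute("INSERT INTO users VALUES (?, ?)", ("colin", _hash("s3cret")))

def verify_password(username, password):
    """Expose a yes/no answer; the stored hash never leaves this function."""
    cur = conn.execute("SELECT pw_hash FROM users WHERE username = ?",
                       (username,))
    row = cur.fetchone()
    return row is not None and row[0] == _hash(password)

print(verify_password("colin", "s3cret"))  # True
print(verify_password("colin", "wrong"))   # False
```

If some other part of the system later turns out to need the raw hash, the hashing step can be broken out and exposed separately, which is exactly the trade-off Sam describes.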
--- Sam Mason <sam@samason.me.uk> wrote:
> On Mon, Dec 17, 2007 at 01:29:10PM -0500, Ted Byers wrote:
>> I routinely keep my SQL code distinct from my Perl, Java, or C++ code.
>> When a client program needs to do something with the database, either a
>> child process executes a script I have written (if the client program
>> doesn't need to do anything with data drawn from the database), or I
>> have all the SQL code in one or more stored procedures and use the
>> appropriate client interface to invoke the stored procedure(s). Whether
>> the SQL is in a specific script or in a stored procedure, my SQL code
>> is kept distinct from the client code, regardless of the language I
>> have used for that. I find this even MORE useful as my projects get
>> bigger.
>
> Humm, this discussion is horribly domain specific. Abstractions are the
> bread and butter of programming, and designing them appropriately makes
> the difference between things turning into a mess later on or not. The
> only solid rule I stick to is that when I'm working on a problem I've
> not solved before, I will get it wrong the first time and probably only
> get it reasonable the third or fourth time.

Right. But there is a quicker way. I like to work with people whose experience is different from mine, expecting that they will have seen a different suite of problems and found solutions beyond those I have come to rely on. That way, in discussing my current problems, I can learn something new, and much faster than by trial and error. Of course, there are always test or proof-of-concept programs, where I test ideas.

For example, I have routinely been advised that left joins are faster than correlated subqueries. In the past month, I have found two problems requiring either a correlated subquery or a left join. In one case the correlated subquery was more than ten times faster, while in the other the left join was about twice as fast. In both cases, the results returned by the two approaches were identical, but there were significant differences in performance; most astonishing in the one case that proved to be quite different than expected.

>> Each kind of abstraction has its place. It is up to the analyst or
>> architect to figure out how many layers and what abstractions are
>> appropriate for a given project.
>
> Indeed. But to be able to choose a suitable set of abstractions, it
> helps for the designer to know the ins and outs of the tools being used.
> I was trying to encourage Colin to look at using databases in a
> different way; different styles of programming suit different
> applications, and hiding the database can have detrimental effects as
> well as positive effects.

I find the opportunity to look at problems in a different light priceless. But some practices generally turn out to be counterproductive. Producing overly long functions, or mixing code in different languages within the same file, often leads to an unmaintainable mess; more often than not, such practices are a consequence of poor design. Even here, though, there is no hard and fast rule, since some algorithms, such as numeric quadrature or QR factorization of general real matrices, cannot be written as a short, simple function, although they logically ought to be one function in the simplest cases; and even these admit enhancements that warrant more interesting data structures and ancillary functions. For these algorithms, which admittedly have nothing to do with databases, the top consideration is the speed of provably correct code. If that means it is too complex for junior or even intermediate programmers, so be it. I have seen and used library code for number crunching that only a specialist in numeric methods has a hope of easily understanding, and been glad I didn't have to write those functions myself. ;-)

For Colin's situation, what I would suggest is a simple extension of what he normally does: an additional layer that gets the data from the database as efficiently as possible and caches it there, providing it to the UI layer as required. Similarly, perhaps in a different layer, get user data from the interface and cache it, sending it to the database back end at the most appropriate time in the most efficient way. And perhaps, with all this, some form of database connection pooling; although this latter option will depend on how many concurrent users there may be in the worst-case scenario (or maybe that is the best-case scenario, since it would mean there is significant demand for what the app does ;-).

I haven't attempted this in Perl, but it is dirt simple in Java/J2EE. In a web application based, e.g., on Tomcat, it is easy to set this up with as many layers as one can justify, especially with an IDE like NetBeans 6. Not having thought about how to do this using Perl, I wouldn't know what to advise regarding how to put it into practice if one wants to stick with Perl.

HTH

Ted
Colin Wetherbee wrote:
> My guess, having written this, is that your approach might be more
> useful for applications that rely heavily on interaction with a
> database. I'd appreciate any more comments you have on this, though.

Tom, Sam, and Ted (a lovely assortment of three-letter names), thank you very much for your input on my situation. You've given me quite a lot to consider.

For now, I will stick with Tom's and Ted's recommendations for a while, but I'll also keep Sam's ideas in mind if this application becomes more heavily reliant on the database.

Colin