j4: (work)
[personal profile] j4
Crowdsourcing my brain....

I am looking for a good introduction to sensible relational database design.

I have written Perl to interface with databases; I am happy with constructing reasonably simple SQL queries and can, with a bit of headdesking, understand the seriously mangled SQL which makes the course booking system work; but I have never built a complex database (ie anything much above "hello world" kind of level) from scratch.

Pretty much everything I know about databases has been inferred from querying them in one way or another, usually in fairly narrow ways; I feel a bit like two and a half blind men with approximately three eighths of an elephant (<pred>but where can I get one at this time of night?</pred>), and I am worried that my incomplete understanding will result in me getting trampled on by the elephant sooner or later.

What I want is to be able to look at a moderately complex database and, with some poking around (hopefully knowing what I was looking for!), understand a) how it all interlinks, and b) whether if I was building it again from scratch I would change it. I suspect that the latter will probably be partly a question of personal style.

I'm not really expecting clear rules about what makes a sensible database (though that would be nice), so much as something that will point me towards 'best practice' or at least 'how to avoid tying oneself in unspeakable knots'.

I would also like to know how people who do understand databases describe them, document them, map them so that the representation of them a) makes sense to normal people, and b) is unambiguous to abnormal people anybody who's dealing with the back end of the database (which is, I fear, a bit like the back end of an elephant). I suspect there will be more than one right answer to a) and/or b).

I have a map on my wall of how our course booking database works, though it's only the basic tables and not all the complicated logic. I suspect this is not how real people draw database structures.

Any suggestions, comments, constructive ridicule etc gratefully received.

Date: 2008-08-18 05:53 pm (UTC)
emperor: (Default)
From: [personal profile] emperor
I could try and find my copy of the compsci databases notes? They're a bit theoretical, but that might leave you with a fair idea of the underlying design issues (and the jargon)...

Date: 2008-08-19 12:35 am (UTC)
ext_3241: (Default)
From: [identity profile] pizza.maircrosoft.com (from livejournal.com)
my notes: http://cs.maircrosoft.com/notes/thumbs.cgi?1B/Part-1B-Databases/

(can't remember much about the course, although I got terribly excited that year about the idea of having a website entirely generated from a database and thus never going out of date. Now everybody's doing it).

Date: 2008-08-18 06:19 pm (UTC)
vatine: Generated with some CL code and a hand-designed blackletter font (Default)
From: [personal profile] vatine
Let'sd prefix with "I am not a database expert, but..."

Try to avoid duplication of data, until duplication of data is speed-critical.

Try to store each conceptual "thing" in a table row of its own, using many-to-many tie-tables when there's assorted interlinking between related "things" (typical example where I did the DB design myself and am not entirely displeased with the result; one table for books that hold "id", "title", "isbn", "classification" [this is for my own sheer enjoyment], "publisher" [an ID reference to a publisher table]; one for authors that hold "non-sur name(s)", "surname" and assorted bio data when known and a book-author tie table that holds tuples of book-id and author-id).

Having loked at the image, it' snot too far off what I've mostly seen. In general, database sketches tend to describe teh tables and their direct inter-linking.

Date: 2008-08-18 07:43 pm (UTC)
From: [identity profile] barnacle.livejournal.com
Normalization - which is the de-duplication of data - definitely helps optimize your databases performance, but at the cost of complicating your application. It has to hold more table joins and conditions in its declarative layer, to reassemble the data into something meaningful. The level of normalization should support the application, and you can always (with some pain) normalize more in future.

The Art of SQL is a fascinating, surprisingly hard-headed book. But it's not a great beginning book, and there's only one or two large bits that I've had the patience to sit down and read straight through. The O'Reilly books for the SQL variants are generally OK.

Your db diagram looks fine, j4. You might want to flip the arrowheads, so they look like cartoon snowmen arms and hands. Then the single line means "one from this table" and the multiple line means "can be joined to many from this". Putting a line across the single end (so making it a cross) means it's specifically one-to-many (a line being a stylized 1) and a circle means zero-to-many. But that's just disambiguation.

Outside of just getting data in and out, hardcore optimization comes from knowing about indexes (and sequences) and, if adding indexes on foreign keys doesn't work its magic, being able to figure out the query plan output from the SQL client of your choice, and maybe just changing the order of your table joins to give the planner the hint it needs.

If you're using PostgreSQL, learn about these things; if you're using MySQL, just keep plugging away until performance starts to suffer in twelve months' time, then learn about them; if you're using Oracle or SQL Server, then set fire to the server and head to G&D's while the sirens come and go.

Date: 2008-08-19 09:10 am (UTC)
From: (Anonymous)
I'll second The Art of SQL as an excellent book to read eventually, though not beginner-friendly.

Date: 2008-08-18 07:04 pm (UTC)
From: [identity profile] rysmiel.livejournal.com
I have been doing this for a living for nineteen years OMG how did I get so damned old, and I only don't have a good reference text to point you at because the one I wrote a chapter for last year is taking forever to grind through the presses, so, summary of the summary of the summary:

The point from which you start is, what questions are people going to be asking it most often ? Any design that needs horribly mangled SQL to do things people are doing with any regularity is broken.

If you will have something like a regular web interface, with specific questions people will ask with some regularity, it helps a lot to organise the set of data that match each standard answer format into one table, so that the query that produces the answer you want is simple; helps with both speed and resilience.

If your incoming data is in a different format, regular or no, somewhere in the process there will be a rearrangement step. Insofar as is humanly possible this should be done between one table and another under your control, rather than when your user queries things. That way if something breaks it breaks in a situation under your control.

The number one bad habit I have to train people working with me on this out of is that they are still being taught from textbooks which say "duplicating data BAD" and they want to make complex cross-queries and so forth, which, once your dataset reaches any reasonable size is a net loss. Almost all the time these days, disk space is less of an issue than speed of response; so build the tables that your ussers will query in order to make the questions they will ask simple queries. [ The more control you exercise over what your users can do, the better. Ideally one convinces them that this was their idea. ]

I can probably provide you with something more concrete, actually; expect mail shortly.

Date: 2008-08-18 07:29 pm (UTC)
vatine: Generated with some CL code and a hand-designed blackletter font (Default)
From: [personal profile] vatine
I've always interpreted "duplicating data BAD" as meaning that it's too damned easy ending up with inconsistent data sets if you have data duplicated everywhere. If you need to duplicate data to get sufficient speed, by all means do this, but...

Date: 2008-08-18 07:51 pm (UTC)
From: [identity profile] rysmiel.livejournal.com
This is what table constraints, integrity checks, and well-defined population procedures are for.

Date: 2008-08-18 07:55 pm (UTC)
From: [identity profile] mobbsy.livejournal.com

The number one bad habit I have to train people working with me on this out of is that they are still being taught from textbooks which say "duplicating data BAD" and they want to make complex cross-queries and so forth, which, once your dataset reaches any reasonable size is a net loss. Almost all the time these days, disk space is less of an issue than speed of response; so build the tables that your ussers will query in order to make the questions they will ask simple queries. [ The more control you exercise over what your users can do, the better. Ideally one convinces them that this was their idea. ]


Gak! Normalised database design isn't about saving disk space, it's about data integrity. With data duplicated all across the schema, you're asking for inconsistency. If you need to denormalise for performance, do so in materialized views not in the basic schema.

Date: 2008-08-18 08:07 pm (UTC)
From: (Anonymous)
Yes! Please listen to Mobbsy here. Database schemas should represent the data and its existing relationships, not be built 'in order to make the questions they will ask simple queries'. You exercise control over users by building them applications (or at the very least procedures and views) so they don't query and manipulate tables directly.

Date: 2008-08-18 08:11 pm (UTC)
From: [identity profile] mobbsy.livejournal.com
This (http://en.wikipedia.org/wiki/Entity-relationship_model#Crow.27s_Foot) is a common notation for schema diagrammes that I've used and seen used quite frequently.

Again, I can't really think of a single good text on database design. If you ask, somebody will usually crop up and say Date's Introduction to Database Systems (http://www.amazon.co.uk/Introduction-Data-Base-Systems-Pie/dp/0321189566/) - it's the classic reference in the field, however it's not a terribly good tutorial.

Understand normalisation (http://en.wikipedia.org/wiki/Database_normalization), understand why a design being 3NF (http://en.wikipedia.org/wiki/Third_normal_form) is worth aiming for, and only then think about breaking the rules.

Date: 2008-08-18 08:17 pm (UTC)
From: [identity profile] mobbsy.livejournal.com
http://www2.cs.uregina.ca/~bernatja/crowsfoot.html - better page for Crow's Foot notation than Wikipedia

Date: 2008-08-19 09:45 am (UTC)
From: [identity profile] caramel-betty.livejournal.com
I second the caution about Date. He's very thorough, but sometimes too thorough. He's also incredibly dry to read.

March 2024

S M T W T F S
     12
3456789
101112 13141516
17181920212223
24252627282930
31      

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Jun. 2nd, 2025 10:46 am
Powered by Dreamwidth Studios