
Ask HN: What are some examples of good database schema designs?

https://news.ycombinator.com/item?id=22324691

PostGraphile [1] is a framework for generating a GraphQL API based on the tables in your database; as a result, good database design is crucial. Graphile Starter [2] is a quickstart project that demonstrates best practices for getting up and running with PostGraphile. In particular, check out the SQL migration file in that project [3]. It demonstrates:

1. Dividing up tables so that one user can have more than one email address

2. Using PostgreSQL’s “row-level security” system to restrict database results based on the logged-in user (sketched below)

3. Dividing tables across multiple schemas for additional security

4. SQL comments for documentation

5. SQL functions for common functionality, such as logging in or verifying an email

It’s a fascinating project, and well worth a look!
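
For a taste of what the row-level security point looks like in practice, here’s a minimal sketch (hypothetical table names and a hypothetical current_user_id() helper; see the real migration for the details):

    -- Tables live in a dedicated schema (point 3 above).
    CREATE SCHEMA app_public;

    -- Hypothetical helper: reads the logged-in user's id from a session
    -- setting that the API layer is assumed to set per request.
    CREATE FUNCTION app_public.current_user_id() RETURNS int AS $$
      SELECT nullif(current_setting('jwt.claims.user_id', true), '')::int;
    $$ LANGUAGE sql STABLE;

    CREATE TABLE app_public.users (
      id       serial PRIMARY KEY,
      username text NOT NULL UNIQUE
    );

    ALTER TABLE app_public.users ENABLE ROW LEVEL SECURITY;

    -- Each user can only see and update their own row:
    CREATE POLICY select_own ON app_public.users
      FOR SELECT USING (id = app_public.current_user_id());
    CREATE POLICY update_own ON app_public.users
      FOR UPDATE USING (id = app_public.current_user_id());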

[1](https://www.graphile.org/postgraphile/)

[2](https://github.com/graphile/starter)

[3](https://github.com/graphile/starter/blob/master/%40app/db/mi…)

I’d highly recommend reading SQL Antipatterns. It’s a very approachable book that illustrates how to design a database schema by taking commonly encountered scenarios and first showing you the naive approach. After explaining why this is bad, it then shows you the recommended way of doing it and why this way works better.

I think ‘learning from how not to do something’ is a really powerful pedagogical technique that should be more widespread.
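
For a taste of the format: the book’s “Jaywalking” chapter tackles comma-separated lists, roughly like this (simplified from memory):

    -- Naive: a comma-separated list crammed into one column. You can't
    -- join on it, index it, or enforce integrity.
    CREATE TABLE products (
      product_id serial PRIMARY KEY,
      tag_ids    text  -- e.g. '12,34,56'
    );

    -- Recommended: an intersection table, one row per product/tag pair
    -- (and drop the tag_ids column).
    CREATE TABLE tags (
      tag_id serial PRIMARY KEY,
      name   text NOT NULL
    );
    CREATE TABLE product_tags (
      product_id int NOT NULL REFERENCES products,
      tag_id     int NOT NULL REFERENCES tags,
      PRIMARY KEY (product_id, tag_id)
    );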

I don’t think this is a legal copy of the book (under US law). It’s still under copyright, and the publisher has not released it as a free download.

Please do not share links to pirated copies of resources such as this. I can’t stop you from pirating it yourself, but it shouldn’t be actively encouraged.

I’ve spent 20 years as an amateur SQL database designer, and last year I designed my first production NoSQL schema on Mongo. NoSQL is a different world… collections, embedded documents. I’m not sure it’s 100% “correct”, but it’s working great for me so far.

My project is a license server for my electron app. The tech stack is a joy: mongo/mongoose, express, GraphQL, JWT, all with a React web front end. Payments by Stripe. The actual licenses are signed JWTs vended by the GraphQL API (not to be confused with the totally separate JWTs used for auth).

The main goal is to sell software so I license by machine fingerprint (node module on electron).

It’s been running for over 6 months without issue. I’m just starting to architect an update that allows a subscription model similar to JetBrains’ Perpetual Fallback license, but slightly different in favor of the user. I’ve taken a lot of notes from comments at https://news.ycombinator.com/item?id=17007036

Here’s what I’m thinking so far:

A) Non-Expiring License at $97.99. Unlimited bug fixes for that one version.
or
B) Expiring License at $9.79/month, and you always have access to the latest version. After 12 consecutive payments you are granted a Non-Expiring License for the most recent version at the time of each payment.

Now, to model this…
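
To make the shape concrete, here’s a minimal relational sketch of option B (hypothetical names; my real schema is mongo collections and embedded documents, but the entities are the same):

    CREATE TABLE customers (
      customer_id serial PRIMARY KEY,
      email       text NOT NULL UNIQUE
    );

    -- One row per monthly payment; "12 consecutive payments" is derived
    -- by walking adjacent rows per customer.
    CREATE TABLE payments (
      payment_id  serial PRIMARY KEY,
      customer_id int NOT NULL REFERENCES customers,
      paid_at     timestamptz NOT NULL,
      app_version text NOT NULL  -- latest version at the time of payment
    );

    -- Licenses are issued rows, never mutated: a non-expiring license
    -- pins max_version; an expiring one carries expires_at instead.
    CREATE TABLE licenses (
      license_id  serial PRIMARY KEY,
      customer_id int  NOT NULL REFERENCES customers,
      fingerprint text NOT NULL,  -- machine fingerprint being licensed
      max_version text,           -- NULL for subscription licenses
      expires_at  timestamptz,    -- NULL for non-expiring licenses
      CHECK (max_version IS NOT NULL OR expires_at IS NOT NULL)
    );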

10 years ago, I’d have said “at least third normal form”… but today: whatever gets the job done. When the application doesn’t really depend on weird queries (e.g. it’s just a blog), screw the normal forms and design your schema to use the fewest queries for a given task. Nobody understands a three-line query full of left and right joins.

On the other hand, if your bookkeeping application uses a database try to keep things as tidy as possible.

Just don’t break first normal form. Nowadays, with JSON columns available, breaking first normal form is a really dumb move.

I’d rather say: use an ORM! It will design the DB schema better and faster than you will, and the result is still comprehensible enough.

If you’re doing any moderately complex analysis of the data in your database, the ORM will quickly start falling down. Abstracting the query layer into the application codebase is nice, and mapping entities to objects is nice, but ORM is not a silver bullet. Learning what makes for a good schema vs a bad schema and how to avoid N+1 loading or query problems is important no matter what.

ORMs aren’t bad, but learn their escape hatches or else you’ll have a hard time doing more complicated things.
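
For example, here’s what the N+1 problem looks like at the SQL level, with hypothetical orders/order_items tables (the first block is roughly what a naive ORM loop emits):

    -- N+1: one query for the parent rows, then one query per row.
    SELECT * FROM orders WHERE customer_id = 42;
    SELECT * FROM order_items WHERE order_id = 1001;
    SELECT * FROM order_items WHERE order_id = 1002;
    -- ...and so on, once per order.

    -- What you want instead (or what the ORM's eager-loading escape
    -- hatch should compile to): a single join.
    SELECT o.order_id, i.*
    FROM orders o
    JOIN order_items i ON i.order_id = o.order_id
    WHERE o.customer_id = 42;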

It also usually forces your design towards the entities themselves rather than the specific way they’re stored, which positions you better for switching to a completely different storage system in the future if, for instance, it becomes too slow or expensive to maintain everything in a traditional big-name RDBMS.

> forces your design towards the entities themselves

I agree that it’s very important to not let the physical schema leak into the rest of the system, and to have a strong conceptual model (aka entities and relations). This has been well understood for almost half a century: https://en.wikipedia.org/wiki/Three-schema_approach

But I don’t think ORMs are in any special position to help with this. They typically introduce so much other confusion that they tend to divert attention from designing both a good physical schema and a good conceptual model, and from maintaining a sensible mapping between the two. This can be done with RDBMS views, for example, at a fraction of the overhead of an ORM. Most ORM-based code bases I’ve seen leak tons of db-level details.
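
A minimal sketch of what I mean (hypothetical names): the physical schema stays normalized, and a view presents the conceptual entity to the rest of the system.

    -- Physical schema: users and their (possibly many) email addresses.
    CREATE TABLE users (
      user_id serial PRIMARY KEY,
      name    text NOT NULL
    );
    CREATE TABLE user_emails (
      user_id    int  NOT NULL REFERENCES users,
      email      text NOT NULL,
      is_primary boolean NOT NULL DEFAULT false
    );

    -- Conceptual model: "a user has a primary email". The application
    -- codes against the view and never sees the two-table layout.
    CREATE VIEW app_users AS
    SELECT u.user_id, u.name, e.email AS primary_email
    FROM users u
    LEFT JOIN user_emails e
      ON e.user_id = u.user_id AND e.is_primary;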

> switching to a completely different storage system in the future

Designing for this eventuality is not healthy IMO. If you get there, it will be a so-called “good problem to have”, and you will have to deal with whatever unique challenges you face at that level. We might as well write code with the possibility of “switching to a completely different programming language in the future” in mind. Yes, clean, modular code will help, but beyond that, not committing to the capabilities of the tools you have chosen will harm your system.

I’d say this applies to virtually all best practices, patterns, architectures, etc. If you’re doing something very simple, who cares about modularity or any kind of code hygiene? I don’t. But what happens in reality? Simple and small systems or experiments grow, one little addition at a time, and we all know the mess that ensues.

So in my understanding, the question posed only applies to at least moderately complex systems, which is where engineering skills matter. And in that context, learning what distinguishes a good database design is obviously very valuable, not to say crucial.

> Nobody understands a three-line query full of left and right joins.

Not sure if you’re being flippant, but a) this is not true, and more importantly b) why is it that we don’t expect programmers to be at least as fluent in SQL as in other, less important, languages?

I’ve been using Len Silverston’s Universal Data Models for 15 years. You’ll be writing to lots of tables and will want views for your common aggregates. But you’ll have the common tables you need, the patterns for those you don’t, and you’ll be able to handle new requirements with minimal change.

There is no Customer table.

“The Data Model Resource Book, Vol. 1: A Library of Universal Data Models for All Enterprises”
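
For those who haven’t seen the pattern, a rough sketch from memory (heavily simplified): there’s a generic party table, and “customer” is just a role a party plays.

    -- A party is a person or an organisation.
    CREATE TABLE party (
      party_id   serial PRIMARY KEY,
      party_type text NOT NULL CHECK (party_type IN ('PERSON', 'ORGANIZATION')),
      name       text NOT NULL
    );

    -- "Customer", "supplier", "employee"... are roles a party plays
    -- over some period, not tables of their own.
    CREATE TABLE party_role (
      party_id  int  NOT NULL REFERENCES party,
      role_type text NOT NULL,  -- e.g. 'CUSTOMER', 'SUPPLIER'
      from_date date NOT NULL,
      thru_date date,           -- NULL while the role is current
      PRIMARY KEY (party_id, role_type, from_date)
    );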

In a similar vein, I recommend David Hay’s “Data Model Patterns: Conventions of Thought”. It’s more focused on the conceptual-model side, but with awareness of how models are implemented in db schemas and the trade-offs involved.

Kinda tough to give a good answer without more context, IMO. What I mean is that a good e-commerce schema that serves a single small store and runs off a single database server would look quite different than a multi-tenant or distributed data store for an e-commerce site at scale.

The one you linked is a pretty typical relational model and isn’t bad, but it has trade-offs that I’d personally not make; however, that doesn’t make it bad.

In the end, context, scale, and usage all determine a good schema design. Sometimes what would be a good relational design on paper turns out tragically horrid in practice once you get beyond a small dataset.

On this topic: I’m in the process of making a compiler for a DSL I designed to help with the schema design process.

https://gist.github.com/nomsolence/69bc0b5fe1ba943d82cd37fdb…

Pictured are the compiler internals; the attached .syrup is the DSL. (It started as a tangent on a project I’m doing for my girlfriend; the schema described in the .syrup has been improved a bit since.)

Note: things after # are comments.

I find even just defaulting to NOT NULL and not having to worry about commas is a boon when I create schemas.

The DSL will of course support things like compound primary keys and SQLite’s WITHOUT ROWID.

I’ll post the code here, likely before the weekend: https://github.com/nomsolence/syrup

Schemas that reflect reality, not the current specs. Flexibility is key. In my experience, adding tables and migrating existing data to them is hard; adding columns is easy. So spend extra time at the start on what tables there should be.

Spec:
Product has a supplier
[tables:product, supplier]

Reality:
Product can be bought from multiple suppliers
[tables:product, supplier, product_supplier]
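
In SQL terms (illustrative names; the price column is just an example of a fact that belongs to the pair, not to either side):

    CREATE TABLE supplier (
      supplier_id serial PRIMARY KEY,
      name        text NOT NULL
    );
    CREATE TABLE product (
      product_id serial PRIMARY KEY,
      name       text NOT NULL
    );

    -- The junction table is the "joint" where reality bends: one product,
    -- many suppliers (and vice versa).
    CREATE TABLE product_supplier (
      product_id  int NOT NULL REFERENCES product,
      supplier_id int NOT NULL REFERENCES supplier,
      price       numeric(10,2),
      PRIMARY KEY (product_id, supplier_id)
    );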

100%

The importance of having db schemas and other software entities reflect real-world things is not appreciated enough. It looks unimportant at first, but soon all intuition becomes useless and the system tends not to have “joints” in the right places, i.e. it doesn’t have flexibility in the same places that reality does.

There are good reasons to denormalise, but as a rule of thumb… yeah, this. I don’t think you can go that far wrong with schemas as long as you have an idea of your entities and their cardinalities. It’s much easier than designing, say, the associated Java classes, because there are clear rules about how to do it and it’s just obvious when you’ve done it wrong (your cardinality is all messed up).

Disclaimer: theoretical opinion.

I think the primary problem with giving examples here is similar to teaching software engineering: it needs complex projects solving complex problems, which are too big for a semester project.

A good schema depends on the problem it’s solving.

A secondary problem is similar to code that has sacrificed clarity for performance. The tweaks made for performance are not intrinsic to the logical problem, but are an additional constraint.

For performance on common queries, the schema can be flattened or “denormalized”. The ability to do so was one of the original motivations for Codd’s relational algebra.
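
A small hypothetical example of the trade-off: the normalized pair keeps each fact in one place; the flattened table answers the common query without a join, at the cost of redundancy to keep in sync.

    -- Normalized: the customer's name lives in exactly one place.
    CREATE TABLE customers (
      customer_id int PRIMARY KEY,
      name        text NOT NULL
    );
    CREATE TABLE orders (
      order_id    int PRIMARY KEY,
      customer_id int NOT NULL REFERENCES customers,
      total       numeric(10,2) NOT NULL
    );

    -- Flattened for a hot read path: no join needed, but every customer
    -- rename must now be propagated (by trigger or application code).
    CREATE TABLE orders_flat (
      order_id      int PRIMARY KEY,
      customer_id   int NOT NULL,
      customer_name text NOT NULL,  -- redundant copy
      total         numeric(10,2) NOT NULL
    );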

> what kind of problem could possibly require so many tables

CRMs often have hundreds of tables, and ERPs have thousands, tens of thousands, or more.

Dumb question, but does anyone have a recommendation for good software for generating schema diagrams like the ones in the Drupal link, but for Redshift?

An SAP partner once told me (the company I was working at was considering using SAP) that the deployment would have ~180K tables. I don’t know whether they got the figure wrong, I misremembered (I did check when they said it), or that’s for a “fully loaded” instance.

Edit: Not SAP, but certain other ERP products have an alarming habit of not using foreign keys – which makes working out the structure of the database quite interesting…

How do you perform joins without foreign keys? Do you just have a column that is effectively a foreign key but is not marked as such?

Sure it is. We can perform DB joins on any column as long as the data types and values match.
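
A made-up illustration: a join condition is just an expression over columns, so nothing requires a declared constraint.

    -- No FOREIGN KEY is declared anywhere; the join still works because
    -- the values happen to match. The constraint only enforces integrity;
    -- it isn't what makes joins possible.
    CREATE TABLE orders    (order_id int, customer_no int);
    CREATE TABLE customers (customer_no int, name text);

    SELECT o.order_id, c.name
    FROM orders o
    JOIN customers c ON c.customer_no = o.customer_no;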

> When I read about database schemas with thousands of tables I wonder what kind of problem could possibly require so many tables.

They’re probably mostly small (in terms of columns), many of them done just to have foreign key constraints, separate indexes, and arguably easier querying (joining) on just the data you need.

But I think it’s a particular style, rather than a ‘problem’ that ‘requires’ so many.

(IANA database expert though, just my tuppence.)

Interesting to see the denormalization of user Display names on the most important tables, but not everywhere.

That made me laugh.

My 2 cents, after doing this long enough to recognise the patterns:

– Aim for 3NF but not religiously. Still, if you need a flat table, try a view.

– Any ternary relationship can be modeled as a pair of binary relations (you’ll never regret keeping it simpler)

– You don’t need EAV (Magento is a good example of why you shouldn’t)

– On the other hand don’t serialize data (looking at you WordPress)

– XML and JSON data types though are perfectly fine when you need to store an object

– Every table should have a primary key (preferably an integer)

– If you really want a string for your primary key, make it a candidate key instead (why? because someone will insist on changing it; see the sketch after this list)

– E/R diagrams are your friend

– So are Venn diagrams for visualizing a complex select
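
To make the primary-key and flat-view points concrete, a minimal sketch (hypothetical tables):

    -- Integer surrogate primary key; the "natural" string key stays as a
    -- UNIQUE candidate key, so when someone inevitably renames a SKU no
    -- foreign keys have to change.
    CREATE TABLE products (
      product_id serial PRIMARY KEY,
      sku        text NOT NULL UNIQUE,  -- candidate key, safe to change
      name       text NOT NULL
    );
    CREATE TABLE prices (
      product_id int  NOT NULL REFERENCES products,
      valid_from date NOT NULL,
      amount     numeric(10,2) NOT NULL,
      PRIMARY KEY (product_id, valid_from)
    );

    -- "If you need a flat table, try a view": 3NF underneath, flat on top.
    CREATE VIEW product_prices AS
    SELECT p.sku, p.name, pr.valid_from, pr.amount
    FROM products p
    JOIN prices pr ON pr.product_id = p.product_id;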
