Rules of Thumb for MongoDB Schema Design

This is a summary of 6 Rules of Thumb for MongoDB Schema Design, which details how MongoDB schemas should be organized in three separate blog posts. So please take a look if this summarization is not sufficient.

If you are new to MongoDB it is natural to ask how should one structure the schema for One-to-Many relationship.

Basics

Relationships can be in three different forms:

One-to-Few
One-to-Many
One-to-Squillions

Each method for structuring has its pros and cons. So the user should know how to decide which one is better than the other in the given situation.

One-to-Few

// person
{
  name: "Keon Kim",
  hometown: "Seoul",
  addresses: [
    { city: 'Manhattan', state: 'NY', cc: 'USA' },
    { city: 'Jersey City', state: 'NJ', cc: 'USA' }
  ]
}

Pros

You can call all the information in one query

Cons

It is impossible to search the contained entity independently.

One-to-Many

// parts
{
  _id: ObjectID('AAAA'),
  partno: '123-aff-456',
  name: 'Awesometel 100Ghz CPU',
  qty: 102,
  cost: 1.21,
  price: 3.99
}
 
// products
{
  name: 'Weird Computer WC-3020',
  manufacturer: 'Haruair Eng.',
  catalog_number: 1234,
  parts: [
    ObjectID('AAAA'),
    ObjectID('DEFO'),
    ObjectID('EJFW')
  ]
}

Parent holds the list of ObjectID of the child documents. This requires an application level join, not the database level join.

// find product based on category_number.
> product = db.products.findOne({catalog_number: 1234});
 
// find all parts in the product's parts list.
> product_parts = db.parts.find({
                      _id: { $in : product.parts }
                   }).toArray();

Pros

It is easy to handle insert, delete on each documents independently.
It has flexibility for implementing N-to-N relationship because it is an application level join.

Cons

Performance drops as you call documents multiple times.

One-to-Squillions

If you need to store tons of data (ie. event logs), you need to use a different approach since a document cannot be larger than 16MB in size. You need to use 'parent-referencing'.

// host
{
  _id : ObjectID('AAAB'),
  name : 'goofy.example.com',
  ipaddr : '127.66.66.66'
}
 
// logmsg
{
  time : ISODate("2015-09-02T09:10:09.032Z"),
  message : 'cpu is on fire!',
  host: ObjectID('AAAB')       // references Host document
}

Later you can join like below:

// Search parent host document
> host = db.hosts.findOne({ipaddr : '127.66.66.66'});
 
// Search the most recent 5000 logs
// for a host based on host's ObjectID
> last_5k_msg = db.logmsg.find({host: host._id})
                         .sort({time : -1})
                         .limit(5000)
                         .toArray()

Denormalization

In order to speed up the query, you can denormalize the data.

// logmsg - before
{
  time : ISODate("2015-09-02T09:10:09.032Z"),
  message : 'cpu is on fire!',
  host: ObjectID('AAAB')
}
 
// logmsg - after
{
  time : ISODate("2015-09-02T09:10:09.032Z"),
  message : 'cpu is on fire!',
  host: ObjectID('AAAB'),
  ipaddr : '127.66.66.66'
}

> last_5k_msg = db.logmsg.find({ipaddr : '127.66.66.66'})
                         .sort({time : -1})
                         .limit(5000)
                         .toArray()

In fact, you can merge the two documents:

{
    time : ISODate("2015-09-02T09:10:09.032Z"),
    message : 'cpu is on fire!',
    ipaddr : '127.66.66.66',
    hostname : 'goofy.example.com'
}

It would be used like this in the code:

// receive log from monitoring program.
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;
 
// get timestamp
now = new Date();
 
// find host's id for update
host_doc = db.hosts.findOne({ ipaddr: log_ip },{ _id: 1 });
host_id = host_doc._id;
 
// save denoramlized data
db.logmsg.save({
    time : now,
    message : log_message_here,
    ipaddr : log_ip,
    host : host_id ) });
 
// Push the denoramlized data in 'one'
db.hosts.update( {_id: host_id }, {
    $push : {
      logmsgs : {
        $each:  [ { time : now, message : log_message_here } ],
        $sort:  { time : 1 },  // sort by time
        $slice: -1000          // get top 1000
      }
    }
  });

6 Rules of Thumb

Here are some "rules of thumb" to guide you through these innumerable (but not infinite) choices:

Favor embedding unless there is a compelling reason not to
Needing to access an object on its own is a compelling reason not to embed it
Arrays should not grow without bound. If there are more than a couple of hundred documents on the "many" side, don't embed them; if there are more than a few thousand documents on the "many" side, don't use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
Don't be afraid of application-level joins: if you index correctly and use the projection specifier then application-level joins are barely more expensive than server-side joins in a relational database.
Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
As always with MongoDB, how you model your data depends – entirely – on your particular application's data access patterns. You want to structure your data to match the ways that your application queries and updates it.