This is a summary of 6 Rules of Thumb for MongoDB Schema Design, which details how MongoDB schemas should be organized in three separate blog posts. So please take a look if this summarization is not sufficient.
If you are new to MongoDB it is natural to ask how should one structure the schema for One-to-Many relationship.
Basics
Relationships can be in three different forms:
- One-to-Few
- One-to-Many
- One-to-Squillions
Each method for structuring has its pros and cons. So the user should know how to decide which one is better than the other in the given situation.
One-to-Few
// person
{
name: "Keon Kim",
hometown: "Seoul",
addresses: [
{ city: 'Manhattan', state: 'NY', cc: 'USA' },
{ city: 'Jersey City', state: 'NJ', cc: 'USA' }
]
}Pros
- You can call all the information in one query
Cons
- It is impossible to search the contained entity independently.
One-to-Many
// parts
{
_id: ObjectID('AAAA'),
partno: '123-aff-456',
name: 'Awesometel 100Ghz CPU',
qty: 102,
cost: 1.21,
price: 3.99
}
// products
{
name: 'Weird Computer WC-3020',
manufacturer: 'Haruair Eng.',
catalog_number: 1234,
parts: [
ObjectID('AAAA'),
ObjectID('DEFO'),
ObjectID('EJFW')
]
}Parent holds the list of ObjectID of the child documents. This requires an application level join, not the database level join.
// find product based on category_number.
> product = db.products.findOne({catalog_number: 1234});
// find all parts in the product's parts list.
> product_parts = db.parts.find({
_id: { $in : product.parts }
}).toArray();Pros
- It is easy to handle insert, delete on each documents independently.
- It has flexibility for implementing N-to-N relationship because it is an application level join.
Cons
- Performance drops as you call documents multiple times.
One-to-Squillions
If you need to store tons of data (ie. event logs), you need to use a different approach since a document cannot be larger than 16MB in size. You need to use 'parent-referencing'.
// host
{
_id : ObjectID('AAAB'),
name : 'goofy.example.com',
ipaddr : '127.66.66.66'
}
// logmsg
{
time : ISODate("2015-09-02T09:10:09.032Z"),
message : 'cpu is on fire!',
host: ObjectID('AAAB') // references Host document
}Later you can join like below:
// Search parent host document
> host = db.hosts.findOne({ipaddr : '127.66.66.66'});
// Search the most recent 5000 logs
// for a host based on host's ObjectID
> last_5k_msg = db.logmsg.find({host: host._id})
.sort({time : -1})
.limit(5000)
.toArray()Denormalization
In order to speed up the query, you can denormalize the data.
// logmsg - before
{
time : ISODate("2015-09-02T09:10:09.032Z"),
message : 'cpu is on fire!',
host: ObjectID('AAAB')
}
// logmsg - after
{
time : ISODate("2015-09-02T09:10:09.032Z"),
message : 'cpu is on fire!',
host: ObjectID('AAAB'),
ipaddr : '127.66.66.66'
}> last_5k_msg = db.logmsg.find({ipaddr : '127.66.66.66'})
.sort({time : -1})
.limit(5000)
.toArray()In fact, you can merge the two documents:
{
time : ISODate("2015-09-02T09:10:09.032Z"),
message : 'cpu is on fire!',
ipaddr : '127.66.66.66',
hostname : 'goofy.example.com'
}It would be used like this in the code:
// receive log from monitoring program.
logmsg = get_log_msg();
log_message_here = logmsg.msg;
log_ip = logmsg.ipaddr;
// get timestamp
now = new Date();
// find host's id for update
host_doc = db.hosts.findOne({ ipaddr: log_ip },{ _id: 1 });
host_id = host_doc._id;
// save denoramlized data
db.logmsg.save({
time : now,
message : log_message_here,
ipaddr : log_ip,
host : host_id ) });
// Push the denoramlized data in 'one'
db.hosts.update( {_id: host_id }, {
$push : {
logmsgs : {
$each: [ { time : now, message : log_message_here } ],
$sort: { time : 1 }, // sort by time
$slice: -1000 // get top 1000
}
}
});6 Rules of Thumb
Here are some "rules of thumb" to guide you through these innumerable (but not infinite) choices:
- Favor embedding unless there is a compelling reason not to
- Needing to access an object on its own is a compelling reason not to embed it
- Arrays should not grow without bound. If there are more than a couple of hundred documents on the "many" side, don't embed them; if there are more than a few thousand documents on the "many" side, don't use an array of ObjectID references. High-cardinality arrays are a compelling reason not to embed.
- Don't be afraid of application-level joins: if you index correctly and use the projection specifier then application-level joins are barely more expensive than server-side joins in a relational database.
- Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
- As always with MongoDB, how you model your data depends – entirely – on your particular application's data access patterns. You want to structure your data to match the ways that your application queries and updates it.