“Can’t we all just get along?”
I assert that the explosion of so-called NoSQL database management systems (DBMS) is not displacing the well-known relational DBMS (RDBMS) that we love and admire. There is room for each, sometimes within one application. Why? Visits by three spirits could enlighten us …
1. Spirit of DBMS Past
The DBMS was invented before the mid-1960s. Those early DBMS had no SQL. They were not even relational. Oddly, they are not considered as NoSQL.
In 1970, E.F. Codd of IBM published the relational model on which the RDBMS is built. That company was slow to adopt the technology, probably due to investment in its IMS hierarchical DBMS product. IMS contained no SQL. IBM researchers developed a query language, SEQUEL, in the mid-seventies; the name was later shortened to SQL. In 1979, Larry Ellison’s company shipped Oracle, an independent implementation of that language and the first commercially available SQL RDBMS. IBM eventually released DB2. The ACID RDBMS was off to the races.
2. Spirit of DBMS Present
Today’s RDBMS technology contains tons of sand. I’ve seen the following: “The number of data items in relational databases matches the number of grains of sand on all the beaches.” The point is that it is unimaginable to migrate all RDBMS data to another home.
RDBMS databases are often integration databases for multiple applications. Such databases have rigid schemas. Schema governance can make individual application change painful. The recent decade saw a mitigating move to “data as a service” web service integration points. Here, each application may conform to its unique application data transfer schema, avoiding conformance to the central database schema. That gives wiggle-room for easier service client application change. Services don’t solve all schema dependency issues. One application may require a fast query index on a column, where another application may need fast insertions, inhibited by that very index.
The 21st century has spawned newcomers such as Amazon and Google. “Cloud,” no longer means “vapor.” The buzz phrases “big data,” “relaxed consistency,” and “eventual consistency” are in our faces.
Argh! Here be demons! RDBMS technology hits the wall under big data conditions. I’m talking BIG data. No, I mean BIG DATA – entire beaches of sand processed daily. Scaling out implies nodes and clusters. Alas, clusters of RDBMS nodes must share a single disk space to fulfill instantaneous ACID requirements. Even a failure-resistant RAID is a geographic single point of failure.
What about master-slave remote node replication? That means relaxing instantaneous slave consistency across a replication event. Many RDBMS shops have long used multi-version concurrency control (MVCC) for update performance. They assume a given item will not be updated often. This buys a chance at blazing read performance across large sets of data. That kind of “C” in ACID also implies detecting conflict caused by a stale read for an update … eventually.
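The stale-read detection mentioned above can be sketched with a version number per item. This is a minimal illustration of optimistic version checking, not a full MVCC engine; the `VersionedStore` class and its method names are hypothetical and belong to no particular DBMS.

```python
class StaleWriteError(Exception):
    """Raised when an update is based on an out-of-date read."""


class VersionedStore:
    """Toy in-memory store that keeps a version number per key (hypothetical)."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        # Returns (version, value); a missing key reads as version 0.
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        current_version, _ = self._rows.get(key, (0, None))
        if current_version != expected_version:
            # Someone else updated the item after this writer read it.
            raise StaleWriteError(
                f"{key}: read v{expected_version}, store is at v{current_version}"
            )
        self._rows[key] = (current_version + 1, value)
        return current_version + 1


store = VersionedStore()
store.write("balance", 100, expected_version=0)

version_a, value_a = store.read("balance")  # client A reads v1
version_b, value_b = store.read("balance")  # client B reads v1
store.write("balance", value_a - 30, version_a)  # A wins; store moves to v2

try:
    store.write("balance", value_b - 50, version_b)  # B's read is now stale
    conflict_detected = False
except StaleWriteError:
    conflict_detected = True
```

Reads never wait here; the cost is that the losing writer learns about the conflict only when it tries to update.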
Consistency through MVCC opens a door toward acceptance of the word “eventual.” There are more mundane demons in scaling a relational database. Many RDBMS products are licensed per-server or per-processor. Do the math. Another scaling tactic is to insert queue-driven workers that combine multiple requests into single requests to the database. This resilient approach works until it, too, cannot keep up with growth. The next technique is to shard keys across nodes, but sharding needs participation from the application. Additionally, when one shard node for a range of key values invariably goes belly up, that portion of the data becomes unavailable.
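To make the sharding point concrete, here is a rough sketch of application-side range sharding. The router class, node names, and key boundaries are all made up for illustration; a real deployment would add replication and rebalancing.

```python
import bisect


class ShardUnavailable(Exception):
    """Raised when the node that owns a key range is down."""


class RangeShardRouter:
    """Hypothetical application-side router that maps key ranges to nodes."""

    def __init__(self, boundaries, nodes):
        # boundaries[i] is the first key owned by nodes[i + 1].
        if len(nodes) != len(boundaries) + 1:
            raise ValueError("need exactly one more node than boundaries")
        self._boundaries = list(boundaries)
        self._nodes = list(nodes)
        self._down = set()

    def mark_down(self, node):
        self._down.add(node)

    def node_for(self, key):
        node = self._nodes[bisect.bisect_right(self._boundaries, key)]
        if node in self._down:
            raise ShardUnavailable(f"{node} owns the range for {key!r} and is down")
        return node


router = RangeShardRouter(boundaries=["h", "q"], nodes=["node-a", "node-b", "node-c"])
owner_before = router.node_for("mauget")  # falls in the "h".."q" range

router.mark_down("node-b")
try:
    router.node_for("mauget")
    range_lost = False
except ShardUnavailable:
    range_lost = True

still_served = router.node_for("alice")  # other ranges keep working
```

The application has to carry this routing logic itself, which is the participation the paragraph above refers to.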
These are RDBMS Band-Aid scaling techniques. Fault-tolerance, tolerance of human error, and maintenance complexity eventually overrun realistic limits. Additionally, we have the well-known impedance mismatch between tables and objects. ORMs manage this to a degree, but they introduce an additional mapping layer while often not performing efficiently in high throughput conditions. The spirit paints a bleak picture of the huge data RDBMS, but there’s a lot of life left in that old dog. Let’s look at NoSQL before we try to put the RDBMS on the shelf.
New database systems emerged during this millennium. Startups run by bright young helmsmen have sprung to life. Here, open source is not as scary as it is to legal departments of some entrenched enterprises. Many startups must deal with big data, or new computational problems having unique data requirements. These applications need application data, not integration data.
Who knew the term, NoSQL, prior to 2009? It seems to correctly mean a “non-relational DBMS.” Some say NoSQL means “Not only SQL.” That smacks of acronym damage control. Some NoSQL products even provide a SQL subset. Let’s grant that the tag, #NoSQL, tweets well. These new DBMS appeared after the “turn of the century,” meaning 2000, not 1900. Apparently, only software written subsequent to the birth of its users qualifies for nutty acronyms.
A NoSQL database generally stores data as an aggregate, not a set of flat tuples or rows. A write to the aggregate is atomic. Outlier NoSQL paradigms exist, such as a graph DBMS, or a column-family DBMS. A NoSQL schema for an auction would be enforced by the auction application, not by the auction’s private application database. Two applications using a single NoSQL database as an integration point need to have solid agreement on its schema. An RDBMS is aggregate-ignorant, meaning that it has no clue how its data is used by an application.
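A small sketch may help contrast an aggregate with flat rows. The auction fields below are invented for illustration, and a plain dictionary stands in for the aggregate store.

```python
# One auction stored as a single aggregate: the bids travel with the auction.
auction = {
    "auction_id": "a-100",
    "item": "vintage guitar",
    "seller": "lou",
    "bids": [
        {"bidder": "ann", "amount": 900},
        {"bidder": "bob", "amount": 950},
    ],
}


def to_relational_rows(aggregate):
    """Split the aggregate into the flat tuples a relational schema would hold."""
    auction_row = (aggregate["auction_id"], aggregate["item"], aggregate["seller"])
    bid_rows = [
        (aggregate["auction_id"], bid["bidder"], bid["amount"])
        for bid in aggregate["bids"]
    ]
    return auction_row, bid_rows


def place_bid(store, auction_id, bidder, amount):
    """Build the updated aggregate, then swap it in with a single assignment."""
    current = store[auction_id]
    new_bids = current["bids"] + [{"bidder": bidder, "amount": amount}]
    store[auction_id] = {**current, "bids": new_bids}


auction_row, bid_rows = to_relational_rows(auction)

store = {"a-100": auction}  # stand-in for the application's aggregate store
place_bid(store, "a-100", "cy", 1000)
```

Note that nothing in `store` checks the shape of the aggregate; the application code is the only schema enforcement, as described above.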
A NoSQL DBMS may use document-oriented values, as opposed to opaque values. Document-oriented NoSQL can query and update a value based on introspection of items within the document. Think of JSON or XML aggregates here. Aggregate orientation grants enough information to the DBMS to enable it to organize data items to reside together on a given node. Scaling out to clusters is the fruit of the vine of NoSQL. The ascendance of NoSQL follows the rise of young organizations that have mind-bogglingly huge data requirements. Sometimes eventual consistency is good enough. Scaling involves tradeoffs.
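Here is a minimal sketch of what “introspection of items within the document” could look like, using JSON strings in a dictionary. The helper names and sample records are assumptions for illustration; real document stores offer their own query languages and indexes.

```python
import json


def get_path(document, dotted_path):
    """Follow a dotted path such as 'birth.place' into nested dictionaries."""
    current = document
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def find(documents, dotted_path, expected):
    """Return the keys of all stored documents whose field equals `expected`."""
    return sorted(
        key
        for key, raw in documents.items()
        if get_path(json.loads(raw), dotted_path) == expected
    )


# Made-up sample data: values are JSON text the store can parse and inspect.
documents = {
    "p1": json.dumps({"name": "Ann", "birth": {"year": 1901, "place": "Raleigh"}}),
    "p2": json.dumps({"name": "Bob", "birth": {"year": 1903, "place": "Durham"}}),
    "p3": json.dumps({"name": "Cy", "birth": {"year": 1910, "place": "Raleigh"}}),
}

raleigh_born = find(documents, "birth.place", "Raleigh")
```

A pure key-value store could only hand back the value for `"p1"`; it could not answer the question about birthplace without the application reading every value itself.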
We see a prospect of relaxing consistency to get increased scaling. There’s a theorem for that: The CAP Theorem. Of the three properties of data (1) Consistency, (2) Availability, and (3) Partition tolerance, we’re limited to choosing any two.
We covered consistency. Availability means that a server must always answer a request in some fashion, in order to be deemed available. A partition in CAP is a section of the DBMS that has no communication with any other section. A network breakage within a cluster severs it into parts – partitions – that cannot communicate with one another. Partition tolerance is the measure of the ability to survive partitioning. CAP properties are not discrete. We can trade some of one to get more of another.
There is no time property associated with availability. Does a five-minute response time from a server mean that it is available? This is latency. We’re usually interested in the trade-off between latency and consistency. Many NoSQL implementations process distributed big data at blazing speed – low latency – at the expense of infrequent detectable failed updates.
Alternatively, some trade away durability rather than consistency. Really? Consider a data logging application where the trend of the data is more important than logging the last few items before a server failure. Or, consider a DBMS that maintains session data in a responsive application that has a high number of simultaneous users. Updates have to be instantly consistent, but if the DBMS server crashes, the end-users simply lose sessions, to their minor annoyance, at worst. We may categorize a DBMS as to which two legs of the CAP Theorem it provides. For example, “CouchDB is AP,” while “PostgreSQL is CA.”
There is no consistent definition for what constitutes NoSQL. It’s a wild bunch ranging from embedded systems to huge HUGE distributed systems. A general list of the characteristics of NoSQL follows:
- Generally open source
- No relational model
- No schema – schema in the mind of the app programmer
- Distributed, fault-tolerant
- No full ACID guarantee
- Atomically updates a given value
Common kinds of NoSQL DBMS:
|Kind||Description||Examples|
|Key-Value||A unique key identifies a value aggregate that is meaningful to the application. The DBMS does not care or understand what is inside the value.||Riak; Redis; Memcached|
|Document||A unique key identifies a value that the DBMS can understand at some level, so as to provide a value query capability. Values are stored as JSON, XML, or another well-known structured data format enforced by the DBMS. The DBMS can query items by key or by document content.||CouchDB; Couchbase; MongoDB; Lotus Notes (old, MVCC)|
|Column-family||Stores data as columns in a column family defined at creation time. Adds new columns without upsetting the application or the DBMS. Efficient for computing an aggregate over a subset of rows, or for all values of a column.||Cassandra; BigTable; HBase|
|Graph||Data stored as nodes and links. Each node or link may have arbitrary properties attached. Useful for modeling deeply nested relationships such as networks or geographic data. Usually transactional across multiple operations.||Neo4j; OrientDB (has SQL)|
Advantages of NoSQL
- Scales out
- Deals with explosions of data outstripping RDBMS capability
- Fewer expensive DBAs required
- Uses clusters of low-cost servers
- Promotes application evolution without external schema change
- Is cheaper or free to obtain
Disadvantages of NoSQL
- Lacks maturity
- No big-name enterprise support providers
- Difficult business analytics – these databases are not integration points
- No zero-admin to ease installation and maintenance
- Lower expertise – NoSQL is young
- The application governs its own schema
3. Spirit of DBMS to Come
Humans prefer an “all or nothing” answer to binary questions such as “Is this the end of SQL?” The answer is another question: “Why can’t architects choose from a DBMS palette that contains both RDBMS and NoSQL?” This is the notion of polyglot persistence.
There are examples of an RDBMS coexisting with NoSQL within enterprise applications. A trivial example is Memcached, a NoSQL embedded in-memory key-value store often used to accelerate applications through keyed value caching of RDBMS results. The majority of applications need extremely low-latency reads. Some applications also need immediate, consistent propagation of updates. Think of an inventory update after clicking “order now” or a financial transfer from savings to checking. Other applications are happy enough with hours of update propagation latency. Think of adding a user review to a movie database where that user originates in a separate social network.
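The caching arrangement described above is often called cache-aside. The sketch below assumes an in-memory SQLite database as the RDBMS and a plain dictionary standing in for Memcached; the table, key format, and function names are invented for the example.

```python
import sqlite3

# The RDBMS side: an in-memory SQLite table of products.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, stock INTEGER)")
db.execute("INSERT INTO product VALUES (1, 'widget', 12)")
db.commit()

cache = {}  # stand-in for a key-value cache such as Memcached
stats = {"db_reads": 0}


def get_product(product_id):
    """Serve from the cache when possible; fall back to the database on a miss."""
    key = f"product:{product_id}"
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    row = db.execute(
        "SELECT name, stock FROM product WHERE id = ?", (product_id,)
    ).fetchone()
    cache[key] = row
    return row


def order_one(product_id):
    """Update the system of record, then drop the cached copy so it cannot go stale."""
    db.execute("UPDATE product SET stock = stock - 1 WHERE id = ?", (product_id,))
    db.commit()
    cache.pop(f"product:{product_id}", None)


first = get_product(1)   # miss: goes to the database
second = get_product(1)  # hit: served from the cache
order_one(1)             # the "order now" click
third = get_product(1)   # miss again: sees the new stock level
```

Dropping the cache entry on every write suits the inventory case; an application that tolerates hours of lag could let entries expire on a timer instead.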
The Spirit and I agree that the RDBMS has no end-of-life in sight. Certainly, it holds beaches of sand that nobody wants to relocate, but it has future relevance. We’ll continue to need lightning-fast coordinated consistent financial and e-commerce transactions. The governance imposed by the RDBMS schema and DBAs enhances the maintainability and safety of such applications. On the other hand, I want instant access to huge amounts of data, even where I know some items may be long-in-the-tooth.
For example, I expect instant, accurate routes from an Earth-load of data fronted by Google Maps, but I am tolerant of zooming into a picture of my driveway that shows a car I sold a year ago. Beyond caching, more than one kind of DBMS may persist parts of a single application that uses polyglot persistence.
Imagine that we create a genealogy site that consolidates data purchased from other services. It integrates those slowly changing sources through data-as-a-service, periodically updating its own huge data NoSQL document store. This data store supports imaginatively varied genealogy queries based on items within NoSQL documents. Those reside in a graph-oriented NoSQL DBMS. We pick OrientDB. Imagine snappy response time for concurrent users querying information about families. One user could query for an instant list of her third cousins without the costly deep join that an RDBMS would require. Everything is responsive. Everybody is happy. Our genealogy site is a for-profit business. It must recover development costs, data purchase fees, and operating expenses, while turning a profit. Its customers are a set of registered users that pay a fee. The application ties a NoSQL user registry to Facebook or Twitter, but associated fees, accounting, and financial reporting reside in an RDBMS.
That’s polyglot persistence, friends.
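For the curious, the third-cousin query from the genealogy example can be sketched as a walk over parent links rather than a chain of joins. This uses a plain Python dictionary, not the OrientDB API; the family data and function names are made up, and each person has one recorded parent to keep the tree small.

```python
# child -> list of parents (hypothetical family tree)
parents = {
    "ann": ["m1"], "m1": ["p1"], "p1": ["g1"], "g1": ["gg"],
    "bea": ["m1b"], "m1b": ["p1b"], "p1b": ["g1"],
    "zed": ["m2"], "m2": ["p2"], "p2": ["g2"], "g2": ["gg"],
}


def invert(parent_links):
    """Build parent -> set of children from child -> parents."""
    children = {}
    for child, ps in parent_links.items():
        for parent in ps:
            children.setdefault(parent, set()).add(child)
    return children


def walk(links, start, generations):
    """Follow `links` a fixed number of hops from every member of `start`."""
    level = set(start)
    for _ in range(generations):
        level = {nxt for member in level for nxt in links.get(member, ())}
    return level


def cousins(parent_links, person, degree):
    """People sharing an ancestor degree + 1 generations up, but none closer."""
    children = invert(parent_links)
    shared = walk(parent_links, {person}, degree + 1)
    candidates = walk(children, shared, degree + 1)
    closer = walk(children, walk(parent_links, {person}, degree), degree)
    return candidates - closer - {person}


third_cousins = cousins(parents, "ann", 3)
second_cousins = cousins(parents, "ann", 2)
```

Each hop is a direct link lookup, so the cost grows with the size of the family neighborhood rather than with the size of the whole data set.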
I expect that most people with skin in the game will not be serious NoSQL users for a while. This should not discourage developers and architects from experimenting with various NoSQL DBMS now, so as to make choices based on knowledge.
You don’t always need a download. There are clouds with free entry-level access (e.g. search for Cloudant, Heroku, MongoHQ, or NuVola). The NoSQL world is changing rapidly. In most cases you’re currently better off sticking with RDBMS, unless you are dealing with big data, or if you have a case for a polyglot application.
Beware of insanity or substance abuse caused by trying to decide on a “favorite” NoSQL package. It IS a wild bunch. I have not said much about specific NoSQL solutions, nor have I mentioned MapReduce, a conceptual friend of big data. These are fodder for subsequent posts. SQL has a good prognosis, but there’s plenty of room for the emerging NoSQL wave.
– Louis Mauget, firstname.lastname@example.org