In the last years we follow the rising of NoSQL technology and its employ in a even more extended set of application. This article aim to make an objective comparison from SQL and NO-SQL technology and try to clarify some unclear aspect to help people to choose it’s backend knowledgeably.
- What is NoSQL
- NoSQL implementations
- Advantages of NoSQL
- Limits of NoSQL
- Performance comparison
- Points of interest
What is NoSQL
In simple words, NoSQL is a new data storage backends that do not follow the relational DB model. That means we are speaking about a “container” that works differently from traditional SQL based backends.
NoSQL technologies born to cover a set of new need came out while traditional database were already mature. Of course in last years application need changes and became even more discerning (big data, clustering, file repository), so this new storage system was designed keeping in mind this new requirements.
But, what I mean for “requirements”? Here a set of case that NoSQL is designed to support.
- Application works with large amount of data (Big Data)
- Application works with data that quickly change relationship and data types (semi-structured, unstructured and polymorphic data).
- Developers work in a small team using agile approach: lot of small spring against long term waterfall iterations
- Application served as service, may be published on the web,
- Application delivered to thousands of user instead of few people inside the company
- No certain about the future load of application: need to be scalable and dynamic, need base software on a backend cluster easily
There are many NoSQL solution offered by the market, open-source or not. Each of them works a little bit differently, maybe specializing so some particular need, but the basic idea and the common feature is to offer better scalability and performance. To do this they give up to some features of generic RDBMS, introducing new ones, but keeping enough functionality to be useful.
One of the breaking changes from SQL DBs is that, meanwhile SQL backends are general purpose storage system, NoSQL distribution are focused on a specific kind of data. This allow DBs to be more efficient on their scope and allow to have more performant system. In this section are reported some kind of NoSQL database to give an idea of what application they have. Note that they can be used together (also with traditional SQL systems) to get the best from all systems.
This type of database doesn't need to have a consistent data structure, so they are useful when you have to work with polymorphic data, or data structure changes constantly. This kind of backend can convert normalized entities like key-value dataset or EAV models to simple documents set.
- Goal: Storage non typed set of “records”, called “documents"
- Examples: MongoDB, CouchDB
- Target: Heterogeneous data, working object-oriented, agile development
We told NoSQL database remove the concept of relationship to achieve better performance. In this kind of db this is not true. Instead, these databases enforce the concept of relationship.
Their goal is to define data by its relation with other data. This kind of database can be useful when most of data structure is designed to keep relationship with entities (i.e.. if you have a lot of tables when there are mostly foreign keys columns).
- Goal: describe data relations
- Examples: Neo4j, GiraffeDB.
- Target: Data Mining
This is a kind of db designed to store a big amount of key-value pair data. This can be useful when database is used to store properties, translations, or caching purpose.
- Goal: Store data in key-value form
- Examples: Redis, Cassandra, MemcacheDB
- Target: key-value storage
Advantages of NOSQL
We are aware that NoSQL db have some interesting advantages and they can resolve easily simple problems that traditional RDMS can't. Nowadays their large employment in critical system, like big cloud systems and some large Saas product, confirm they are mature and useful. But question is, why i should move to them? And in this case, when the move is profitable? We can’t base such kind of decision only on our impression, and read some vendor brochure where it’s granted that NoSQL is very cool is not enough. Moreover we cannot stay on inadequate platform just because we fear the change.
In this chapter I’ll try to explain why this solution could be good enough to move on it and what use case make it more profitable.
As we said NoSQL databases were created in response to the limitations of traditional relational database technology. This means we will find some improvement, or better, some feature that in traditional rdbms are not present and can’t be added, even if producer would implement them.
The advantages of NoSQL include the capabilities to handle easily:
- Big data: where with this term we describe data sets containing very large volumes of data.
- Mutable data: Data could be structured, semi-structured, and unstructured as well. NoSQL can also manage transformation of data.
- Dynamic development: In contexts where we need agile sprints, quick iteration, frequent code pushes, and in summary to be responsive to changes, having a database that embraces dynamism is very helpful.
- Object-oriented: programming that is easy to use and flexible
- Expandable: we can easily implement and efficient, scalable architecture instead of expensive, monolithic architecture. Even in traditional db we can do that but it is harder and limited.
- Open Source: most of solution are open source, so with no cost for licenses
NoSQL databases are more scalable and provide better performances, and their data model it’s closer to the domain model used inside application. Today, companies starting project basing on NoSQL databases are growing. NoSQL databases also tend to be open-source and that means a relatively low-cost way of developing, implementing and sharing software.
Limits of NO-SQL
In evaluating the limitations of NoSQL databases, it's important to bear in mind that the NoSQL world is a diverse ecosystem. Not all NoSQL storage products are subject to all drawbacks to the same extent. And that's a good thing, since it means that organizations have a lot of options when weighing the pros and cons of different NoSQL solutions in order to decide which one is best for their specific needs. In this chapter are summarized some feature you may miss using a NoSQL solution.
By reading the article you'll find this chapter expanded more than the advantages one. This won't be a way to discourage to use NoSQL. This chapter would be an impartial description of all limits of NoSQL technologies and simply want to let you know every possible problem you could find employing them. Many point could be vary by the implementation (i.e.. when I said there are few tool in support, it could be right for most of these but not for all ), so consider them just an overview that will alert you about possible risk you could find. What I expect is that, after you choose a NoSQL product to employ, you can use this chapter as checklist to understand if this issue are present in you specific db and if are relevant for the application.
Security is something everybody wants but that’s hard to reach. Theoretically, there could be security issues on every technologies There was, and maybe there are security issue also on SQL system. So why i mark this a possible issues on NoSQL? There is no real issue on NoSQL “concept” related with security, but we can have security issues related with maturity of the product we employ. Security issues come out while the product grow and they are fixed. Intuitively, a young product could have many unknown security issues. Moreover, young product are on the market since few time, so consultants hadn’t time to make experience on them and many security constraint could be neglected. So, the problem is the youngness of most NoSQL platform. For business use I suggest to employ only mature solutions, with a vendor behind.
When we start learning RDBMS, they have taught us that ACID transaction is the best option to make operations that remains consistent across the whole database. Well, most of NoSQL technology doesn’t implement such kind of transaction. NoSQL systems are based on the principle of Eventual consistency . In practices, embracing a little risk of consistency (one node could be out of sync with others), they gain some performance boost. Yes it is a compromise, but we cannot have everything.
I have to mention that some NoSQL implementation, like FoundationDB, allow ACID-like transactions keeping NoSQL performances high. By the way, while we stay on NoSQL, data consistency remains a critical part: basing on application you are developing this could be an issues or not.
When you speak to someone that is trying to convert you to NoSQL technology, one of the first benefit you could hear from they is the benefit of performance due to removal of relations. All we agree that relation may bring performance reduction, but what we lose removing them? It’s like you was in dip rise with a big and heavy backpack on your back. Of course, dropping it you will go faster. It is convenient to do that? Depends on what this pack contains, depends on what is the value of the backpack content has for you. If it contains the tent for the night, maybe is better to reach destination one hour later but sleep warm then go faster. If you are bringing useful disposable things, maybe you can do the opposite.
Following this parallelism, can we accept to loose consistency to get performance? It is convenient?
To make a step back, I’ll start from the origin of joins. RDBMS use relation to link data from one table to another in order to keep data in a single place and do not replicate them. Join is construct that allow to reconnect them in query. Of course make a join between tables is an additional computational cost, respect to find data directly inside the table you are querying. But this cost is necessary to keep relation (no replication, consistency).
It’s clear that while such construct has an acceptable overhead, this is ok, an probably the best choice. But what when it slow down everything or require too much hardware? This issue allow NoSQL developer to claim the lack of JOIN to a feature, but NoSQL is the solution?
Not always. Some time we just need to redesign database structure, maybe removing some relation or make restructure data. Yes, we will loose some relation o we will replicate some part of date, but it could be acceptable ( in NoSQL we will lose all of relationship).
Another problem is with consistency. Think about category and products. We may have a nested tree of category, with many product as leaf of the tree. In traditional RDMS, change the category tree is just an update on a foreign key (self relationship) on the category table. This changes automatically reflect on all child categories and products. In the NoSQL way we could have redundant data on all category\product and a change will need a massive update on child elements.
Let me assume that our application can renounce to JOIN for gaining speed and in our case it is an acceptable trade off. We yet said that in may NoSQL implementations it is hard to keep the various entries consistent. When you work without transaction you can make many operation in sequence but after a crush you get inconsistency. This is true for firsts implementations of NoSQL and some new technology try to give out of the box something that offer out of the box transactions. You can also think to manage transaction at application level, trying to rollback dirty data, but it could be very hard to manage in many case.
Missing standard across vendor technologies
SQL is a standard language. There could be many variation that bring to specific dialects, but this in complex doesn’t forbid to abstract data access. Think about Hibernate, NHibernate, Doctrine, Entity Framework or the other ORM you like: they are the proof that difference between SQL dialects are not so relevant. We can conclude that SQL is a standard language, even if many vendor implement different database technologies. Also in case you are not based on ORM layer, if you produce a query for a db, most of the code could be reused in others. This make migration easier and developers can adapt quickly to different db solutions.
On the other side, on NoSQL world, there is more confusion. Each vendor implement their specific syntax without involving any shared standard. This means it is harder to migrate an application between different NoSQL implementations. This means it is harder to find a programmer that knows well many NoSQL technologies.
Schema flexibility could be a trouble
One of the peculiarities of NoSQL system is that they do not require a schema. In practices is the programmer that decide data structure in the moment he save it. So there is no place where it’s written how data is structured of what is the meaning of data. Even if you could easily recreate a db model starting from data relation using some automated tool, this could be something missing in traditional applications. Moreover, what if a bug occur? We know there could be situations where something is wrong with the code. Traditional RDMS are scaffolded, so in case you switch some fields or you are wrong with field format they protect you from inconsistency. In NoSQL case there is no help from the db, because without any schema defined, there isn’t any information about data should be saved: nobody can say if data is wrong or not. The worst side effect is that process bring lot of power and lot of responsibilities to the developer, that often doesn’t know all the process or the structure.
Moreover, even in case you now know what is saved where, do you think you’ll remember everything the next month? and the next year? Not all project are subject to continuous development, there could be business application that stay as-is for year, before we need to make some changes. Anyway in IT, company often commission a project to some supplier so this part have to be taken in account to ensure an easy handover at the end of the project, maybe asking for an accurate documentation about data is structured and what each field\collection means. Last problem related with schema flexibility is that every member of the team could not work in the project for all its life, so turnover is critical on little teams were not all the members have full knowledge of data structure or there isn't an adequate documentation.
Saving many nested data inside single documents you may lost analytic features like “SUM”, “COUNT” and so on. The bad things that this could not be a problem during first application development, but someone could ask later for some report, so what to do in this case? It is hard to change data structure after database is filled, and doing that could have may unpredictable effects due to the leak of a well defined data structure. Analytics is an hard point for NoSQL.
Moreover, while there are may commercial tool you can connect to your traditional db to manage analytic part, there is a limited support for NoSQL systems.
Another solution could be taken is to replicate some sort of "relationship" with unstructured data inside NoSQL db, maybe creating many collection and linking object one with others. If you are planning to follow this path to allow analytics reporting, keep in mind that this could slow down performances to be comparable with standard SQL systems. This could be acceptable when the part involved in this db is minimal and record count is limited. Anyway, even from my experience, construct that allow to join data on NoSQL query, because there aren't well defined relationship behind, are very limited and performance are not so good as we expect (i.e.. at the time this article was written, MongoDB do not support inner join and can evolve only to table each time without creating temp tables).
We spoke about the lack of standardization on NoSQL query language and syntax. This problem may be reflected also in tools, together with youngness of most platforms. I’m speaking about tools for querying, but also for migrate data between database, manage backup etc..
Of course, most of NoSQL project are growing, and we expect tools will grow with them, so this problem will be automatically solved in the near future.
The lack of standardization makes hard for third party vendors build tools that can support multiple NoSQL solution. Moreover young platforms means less users, less customer, and less time to develop mature tools.
It is important to specify how comparison are made. First of all I needed to place both solution in same condition. This means, for example to use the same hardware and have same level of tuning. So I installed MongoDB (last version) and SQLServer Express on same machine. Because we are not interested on performance inside the database itself I built my benchmark using C# code basing on standard framework.
Upon this two way to save data everything is shared (entities, logic, data generation) to ensure equity.
The list of all operation we will compare:
- mass insert
- transaction (or better, in NoSQL case, transaction simulation)
Mass operation on a single entity
This benchmark consist in a big set of object to insert to be performed in less time is possible. This test is replicated using a growing number of items to save to prove how performance scale in both systems. This benchmark measure execution time in ms and insist on a single table\collection.
This benchmark is focused on query feature. We separate following pattern:
- CASE 1 get one entity using primary key: this pattern is used to fetch a single entity from db using is unique identifier
- CASE 2 full scan with fail: when you are looking for a deleted element and database have to scan all index before reply “no”.
- CASE 3 Paged query: a complex query where you have some filters, one order condition, and you want to take just a page of data.
I created some benchmark simulating different ratio of patterns above. In example the first benchmark assume 5% of queries of kind 1, 70% of kind 2 and 25% of kind 3. This benchmark measure execution time in ms. This benchmark insist in a single table \ collection.
You can find all the code used to perform these test on git-hub.
| ||CASE 1||CASE 2||CASE 3|
First test is on a “small” set of data, about 2.500.000 rows.
Second test on a “bigger” data set, about 5M rows.
This benchmark highlights a big improvement of performance on query over index, but when MongoDB is used to read set of data the gain is reduced and kept stable over data increase.
We know transaction on NoSQL world are mostly unsupported. We also understood that renouncing to transaction could benefit on performance, an question is: how much I gain from this? I built this benchmark to compare insert in transaction of one master row related with many child. Benchmark is focused on execution time, expressed in ms.
|#transactions||SQL (ms)||NoSQL (ms)|
This benchmark is focused on analytics. Suppose we have a categorized master-detail data model, where you want to:
- export: whole join over all data tree
- report: sum all items in category for all category, i.e. give invoice amount for all customers
- KPI: sum all mater totals summing detail subtotals
On a base of 4M row after inner join:
|test||SQL (ms)||NoSQL (ms)|
Points of interest
Revolutions are inevitable in tech context. A new technology come and bring some revolutionary features, but often have to defeat developer preconceptions. Sometime are misunderstood so their weaknesses come out after they are employed.
About innovation we have to "opposite" cluster of people:
- "enthusiastic" people: that want to embrace change unconditionally and are ready to dispose everything done in the past to work with last technology;
- "conservatives" people: that hate changes and prefer to stay on its habits, rejecting any new technology.
On real life we have to stay on the middle, so it is important to know and understand what new technology could do for us and be ready to employ new technology before they are needed on projects. "Trial on the job" is a a bad habit that can lead to bad results.
Same principle is applied to the NoSQL technologies. Because we looked at NoSQL some times ago, now we know pro and cons, so we can take advantages from such kind of tools. When we analyze that technology, we cannot stop looking at the things we miss from our traditional habits, like transactions, schemas, and standards. We need to study and familiarize with these technology that until some years ago was young and new, but now are a concrete option. Study, learn, understand, employ: this is the nature of progress.
When I should use a NoSQL Db?
Why NoSQL would it be better than using a SQL Database? After you have read this article I'm sure you'll understood that NoSQL aren't replacement of SQL database, but just different storage system with different features and are usefully in some particular fields. So answer cannot be different from "it depends". Because it really depends on many characteristics of the project.
To be honest and cautious, NoSQL is the best solution when all following statement are true:
- when your project needs to scale, or it could in the future.
- when you have to work on big data, or your data will be big in the closer future
- when analytics component in application is simple, or not so important
- when your application needs fits database purpose (i.e.. you are saving data in graph and database do it)
In some cases, NoSQL could be a good alternative, but not essential to build a durable infrastructure. Of course, if in your application a NoSQL system cover the 99% of all needs, there are no reason to couple it with a RDBMS. But if you need relations, transaction, and other features of standard RDBMS maybe it's better to employ them as main storage system and use NoSQL to cover only the critical parts (maybe derived from size of data).
How much better are performance in the case above?
It depends on the specific use case. From one side we have lots of benefits on large table or massive usage, but from the other side we have some performance loss using lookup instead of join on small dataset. A realistic estimation, if there is the basis to employ a NoSQL db, we can improve performance from a 10 to 100 factor. Of course, this estimation take in account all aspect of an application and it is related to the final user experience. That means you could measure better speedup on db layer, but the final user experience is distorted by many factor that reduce the gap (cache, network delay, page rendering). Just to explain what I’m telling, take in example the limit case when there is a page that make a query and returns data. Assume for this page 500 ms for getting result using traditional database, 50ms using NoSQL,and 200ms for rendering the page and 1 second for transfer data over internet. Improvement of performance on db layer is -90%, but for final user there is only 450ms gained on 1700, so only 26%. With this example, I would explain that is hard to measure improvement on complex system and in many cases NoSQL is not enough to resolve performance issue. To be more direct, if you think to resolve performance issue due to bad design in an application moving to NoSQL, you are not in the right path.
But the big question is: what I loose to get this performance? Because in some case it is not possible to renounce to some feature like transaction or relationship. This is very important to understand before the move.
NoSQL system are mature to be employed on production environment?
Depends mainly on your needs, or better on the project requirements. We can state that NoSQL is certainly mature enough to use. So, if you need, you can use without any fear. But not all application needs to work with big data or to to scale that massively. Most of Saas product does, also many critical application inside the enterprise context does, but most of application nowadays are still very simple. On my experience, it’s hard to find tables with more than 100 000 rows in a database. Think to your db, exclude the 2-3 bigger table you have on it and look at the row count. How much big they are? Usual db structure on db application count lot of “little” table (less than 100 000 rows) related. For such kind of application, a traditional RDBMS is sufficient and it will be forever. What is important, instead of starting to employ it, is to understand benefit and development pattern to be ready when they will be needed on you case.
Is SQL obsolete?
When men invented airplane, cars became obsolete? No, of course. Also if airplanes are faster than cars. They simply are two different system to move people, with different characteristics. Basing on the kind of travel you are starting, the time you have to spend on the travel and the budget, you'll decide what’s the best suitable alternative for you. In the same way, NoSQL coming don’t make obsolete SQL. They are simply two different way to store data, with different characteristics. You’ll decide what’s the best for you basing on your need.
There are problems SQL isn't suitable for, so You don't have to start big data project with it. It will be like to try to reach an island with car instead of by plane. But SQL still got its strengths. Lots of data models are simply best represented as a collection of tables which reference each other. It will be like try to go to buy the milk using a plane. NoSQL databases aren't a replacement for SQL, but they are an alternative.
Is the market ready to NoSQL?
The key point to answer to this question is closer to the experience that developer achieve. Most database programmers were trained for year to think of data in a relational way. How they could change to thought differently in so few time? It is not so easy, especially when developer have to work in many projects together and some of them are SQL and other NoSQL. The temptation of reproduce the same pattern on SQL to NoSQL system is hard to defeat and often lead to bad results.
So actually, there is much more know-how for SQL around, there are more developer well skilled on RDBMS than on NoSQL. Meanwhile there are DBA that spend most of their time into focusing on relational databases, we can't expect to find the same on technology born less than ten year ago. SQL is reached at school and in universities, NoSQL is starting to be.
After this first point, a secondary one is that because these system are newer there are less development tools, or they are not so advanced like others, but I’m sure this is not a real problems. There are some “enterprise ready” solution, that come out with tools that manage all basics need and all we hope that these tools will be grow with the growing of their platform.
What is the best solutions?
There is not a best solution that cover any case. Answer is simple and still the same: "it depends". With this article I hope to have given you an overview of all capabilities of such systems and some basics to understand when they could be useful
- 2016-12-11: added charts to compare performances
- 2016-12-06: First version of this article.