I am looking for projects to work on
Please contact with me at!

Saturday, April 28, 2012

some notes on Big Data, NoSQL, and RDBMS

Recently, I read a blog post from Chris Swang. I pretty much agree with him. I particularly like his diagram that describes Big Data in a very straightforward way:

Clearly shown on the above diagram, Big Data, especially from the view of those NoSQL database, is the problem that has very large data volume and need relatively simple algorithm for analysis.

In other words, according to the above definition, it may not be called as Big Data problem if the problem can not be solved with simple algorithms (NoSQL and map/reduce ?). To help myself to have clear clue,I make a diagram as below,
I classify database techniques into four groups, which map to the "problem types" diagram.
  1. Structural database  I think those traditional RDBMSs should still be used in operational database, which asks for good support for transaction process. If the data volume is small, it will be ok to use RDBMS in data warehouse application as well. After all, structure database has been developed for decades to deal with even table that have to be partitioned.
  2.  Quant If advanced mathematics knowledge is required, quant analysis is demanded. 
  3. NoSQL NoSQL becomes popular as developer find that for some problems, they can dramatically speed up data I/O operation if they model the data in an nonstructural way, which against the rule of normalization in a RDBMS. Although traditional RDBMS also has techniques like materialized table, materialized view, and proxy table, they all have limitation and are all still behind SQL optimization engine.So, why dont we just directly got to hit data storage engine if we do not need SQL at all. Then, NoSQL comes into the stage. 

  4. Hybrid In fact, besides those popular NoSQL database like MongoDB, coutchDB, HBase etc, those traditional RDBMS vendors are also developing their NoSQL products too. For examples, MySQL cluster has key/value memcached and sockethandler that avoid overhead SQL stuff. In a real project, which needs to handle big data volume and transaction as well, I believe both structural database and NoSQL database are needed. 
 We can see that some NoSQL platform is developing their own query language, which looks like a structural query language too. However, we can also see that the result of query should be unstructured data. Otherwise, you are using NoSQL platform to do SQL platform's job. It will be interesting to know how those emerging products can handle the data better than traditional database vendor. If a distributed file system and mp/reduce can help on this, I can foresee traditional database vendor will add NoSQL platform into their products. In fact, we can see they are doing on this now. I even hope they can integrate NoSQL and SQL database in a more seamless way in terms of security, performance, cost etc.

I always try to be careful not to be fooled by marketing language. NoSQL and  RDBMS should have their own position in a large system, where hybrid solution is required. We need to choose right tools for different tasks.

HBase is Hadoop's NoSQL database

No comments:

Post a Comment