Right NoSQL Database: Comparing MongoDB, HBase, and Hypertable

Started by ImilGot, Oct 09, 2023, 12:19 AM


ImilGot (Topic starter)

Currently, my server is running MongoDB 2.4.1 with two collections in different databases. However, performance seems to be lacking: certain find operations take between 180 and 600 milliseconds, and they appear to be waiting for the write lock to be released.

The workload consists mainly of findOne and insert operations, with each inserted record no larger than 120 bytes. I anticipate the data volume growing significantly, to 500 million to 1 billion records, and the read rate increasing to up to 10 thousand requests per second.

I am looking for a database that can provide very low latency for key-based reads without complex queries; it will also act as the backing store behind a cache. Additionally, I want to avoid MongoDB's per-database write lock, as it negatively impacts performance.

One option I have considered is reading only from the slave node and writing to the master, but I am concerned about potential delays in data replication, which may lead to inconsistent data.
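Roughly what I had in mind, sketched with a modern pymongo driver (the hosts, replica set name, and collection are placeholders, not my real setup):

```python
from pymongo import MongoClient, ReadPreference

# Replica set connection; hostnames and replica set name are placeholders.
client = MongoClient(
    "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0"
)
db = client["mydb"]

# Writes always go to the primary.
db["events"].insert_one({"user_id": 42, "payload": "..."})

# Reads routed to secondaries, accepting possible replication lag (stale reads).
events_stale_ok = db.get_collection(
    "events", read_preference=ReadPreference.SECONDARY_PREFERRED
)
doc = events_stale_ok.find_one({"user_id": 42})
```

The worry is exactly the window between the insert on the primary and the moment a secondary has applied it.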

Can you recommend any alternatives such as HBase or Hypertable? My primary requirement is to achieve a maximum read latency of 2-5 milliseconds.


Jerry

Given your requirements, there are multiple solutions you could consider:

Apache Cassandra: This is ideal for applications that can't afford to lose data. With its masterless design, it provides constant availability, and every node in the cluster has the same role. As your load increases, you can add more machines to the cluster to handle the extra load. It offers tunable consistency, which means you can choose between stronger consistency and higher availability per operation. Moreover, it has built-in support for data replication across multiple datacenters.
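For a rough sense of what tunable consistency looks like from the Python driver (contact points, keyspace, and table names are placeholders):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Contact points and keyspace are placeholders for illustration.
cluster = Cluster(["10.0.0.1", "10.0.0.2"])
session = cluster.connect("mykeyspace")

# Reads at ONE favour latency and availability.
read = SimpleStatement(
    "SELECT payload FROM events WHERE user_id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read, (42,)).one()

# Writes at QUORUM favour consistency across replicas.
write = SimpleStatement(
    "INSERT INTO events (user_id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, (42, "..."))
```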

Amazon DynamoDB: DynamoDB is a good NoSQL database service for applications that need consistent, single-digit millisecond latency at any scale. It's a managed database, i.e., you don't need to handle setup, maintenance, or backups yourself. However, be aware of its data model and pricing since it can get expensive with heavy loads. DynamoDB also makes it easy to scale by automatically distributing data and traffic for tables over sufficient servers to handle throughput and storage requirements.
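A minimal boto3 sketch of the key-based access pattern (table name, key schema, and region are assumptions):

```python
import boto3

# Table name, key attribute, and region are assumptions for illustration.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("events")

# Items of ~120 bytes are far below DynamoDB's 400 KB item limit.
table.put_item(Item={"user_id": 42, "payload": "..."})

# Key-based read; ConsistentRead=True avoids stale data at double the read cost.
resp = table.get_item(Key={"user_id": 42}, ConsistentRead=True)
item = resp.get("Item")
```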

Google Cloud Spanner: This is a globally distributed relational database that provides strong consistency across regions and continents. It's a good pick if you have complex transactional requirements that are typically not well handled by NoSQL systems.

Redis: It's a fast, open-source, in-memory key-value data store for use as a database, cache, message broker, and queue. Assuming your entire dataset can fit in memory, it could provide blazing-fast reads and writes. You can also use tools like Redisson to turn your Redis store into a distributed system and get most of the load balancing and fail-safe benefits.
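The findOne/insert style workload maps almost directly onto Redis commands; a minimal redis-py sketch (host and key scheme are placeholders):

```python
import redis

# Host/port are placeholders; decode_responses returns str instead of bytes.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Insert: a small value under a key, optionally with a TTL for cache semantics.
r.set("user:42", '{"payload": "..."}', ex=3600)

# Key-based read: typically well under a millisecond over a local network.
value = r.get("user:42")
```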

Apache HBase: If Apache Hadoop is part of your current or future technology stack, it may be beneficial to use Apache HBase, which integrates with the Hadoop ecosystem and scales linearly to handle large data sets. However, HBase can be complex to set up and maintain, and you might need additional components (like Apache Phoenix for SQL-like operations) depending on your use case.
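For illustration, key-based access through the happybase Thrift client looks roughly like this (host, table, and column-family names are placeholders, and the HBase Thrift server must be running):

```python
import happybase  # Thrift-based client; assumes the HBase Thrift server is enabled

# Host and table/column-family names are placeholders.
connection = happybase.Connection("hbase-thrift-host")
table = connection.table("events")

# Writes and key-based reads both address a single row key.
table.put(b"user:42", {b"cf:payload": b"..."})
row = table.row(b"user:42")   # {b"cf:payload": b"..."}
```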

ScyllaDB: ScyllaDB is a high-performance NoSQL database that is fully compatible with Apache Cassandra. It is designed to handle large volumes of data across many commodity servers with high read and write speeds. ScyllaDB uses a shard-per-core design on each node, which improves concurrent throughput and reduces latency, and it relies on asynchronous operations to maximize performance.

Aerospike: Aerospike is a distributed NoSQL database and key-value store designed for speed. It's capable of processing over a million transactions per second per node with sub-millisecond latency, while guaranteeing ACID properties (Atomicity, Consistency, Isolation, Durability). It also fits your description as it's designed for use cases that need to make simple, direct key-value queries.
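A small sketch with the Aerospike Python client (seed host, namespace, and set names are assumptions):

```python
import aerospike

# Seed host, namespace, and set are assumptions for illustration.
client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# Keys are (namespace, set, user_key) tuples; records are flat dicts of bins.
key = ("test", "events", "user:42")
client.put(key, {"payload": "..."})

(_, meta, bins) = client.get(key)   # direct key-value read
client.close()
```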

Couchbase: Couchbase Server is a NoSQL document database with a distributed architecture. It's designed for ease of development coupled with powerful features for managing data distribution. With N1QL, it provides SQL for JSON, combining the power and familiarity of SQL with the flexibility and agility of the JSON format.

Tarantool: Tarantool is an in-memory computing platform that includes an in-memory database. It combines a Lua application server with a multi-model NoSQL database management system. Tarantool is designed to be lightweight, start quickly, and use computing resources efficiently.

RocksDB: RocksDB is an embeddable persistent key-value store for fast storage. It's developed by Facebook's database engineering team, and it's used in production to serve many kinds of workloads. RocksDB is highly customizable and can be used for a variety of workloads, whether they're I/O-bound or CPU-bound.


Comparing MongoDB, HBase, and Hypertable also brings into focus three distinct types of NoSQL databases: a document-oriented database (MongoDB), a column-oriented database (HBase), and a Bigtable-style wide-column store (Hypertable). Let's delve into each, comparing them along several dimensions:

Data Model: MongoDB is a document-oriented NoSQL database and uses the BSON (binary JSON) format. It's schema-free, and data can be stored as nested documents, which makes it extremely flexible. HBase and Hypertable are both wide-column stores where data is stored in column families and identified by a unique row key. This data model is best suited for analytics and workloads that involve scanning or writing large amounts of data at once.

Consistency: MongoDB is strongly consistent by default, since reads and writes go to the primary and operate on a single document; clients can relax this to eventual consistency by reading from secondaries. HBase provides strong consistency at the row level, and Hypertable, which follows the same Bigtable design, behaves similarly.

Scalability and Distribution: MongoDB supports automatic sharding and can scale horizontally across many servers. However, within each replica set one node acts as the primary for writes, while the secondaries mostly serve read operations. HBase and Hypertable, thanks to their integration with Hadoop and Google's Bigtable design respectively, are built to scale effectively and handle huge data sets distributed across large clusters of commodity servers.
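For reference, enabling hashed sharding in MongoDB comes down to a couple of admin commands through pymongo, run against a mongos router (database, collection, and shard key are placeholders):

```python
from pymongo import MongoClient

# Must connect to a mongos router; names and shard key are placeholders.
client = MongoClient("mongodb://mongos-host:27017")

client.admin.command("enableSharding", "mydb")
client.admin.command(
    "shardCollection",
    "mydb.events",
    key={"user_id": "hashed"},  # a hashed key spreads inserts evenly across shards
)
```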

Performance: This depends a lot on your workload and hardware setup. With appropriate hardware and indexes, MongoDB typically performs well for many use cases, including write-heavy workloads as well as balanced read and write workloads. HBase and Hypertable are optimized for high-volume, large-scale read and write throughput, but their latency for random lookups can trail MongoDB's.

Community and Support: MongoDB has a wide user base and a strong community, providing lots of online resources and professional services. Due to its popularity, it's generally easier to find developers with MongoDB experience as opposed to HBase or Hypertable. While HBase is closely integrated with the Hadoop community, Hypertable has a smaller community and is not as actively maintained.

Secondary Indexing: MongoDB supports secondary indexes, allowing fast searches on any field within a document, which can greatly improve query performance. HBase does not natively support secondary indexes; instead, you would typically handle indexing manually or leverage tools like Apache Phoenix. Hypertable has limited support for secondary indexing.
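In MongoDB that is a one-liner with pymongo (collection and field names are placeholders):

```python
from pymongo import ASCENDING, MongoClient

# Collection and field names are placeholders.
events = MongoClient()["mydb"]["events"]

# A secondary index lets find_one({"user_id": ...}) avoid a full collection scan.
events.create_index([("user_id", ASCENDING)], name="user_id_idx")
doc = events.find_one({"user_id": 42})
```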

Transactions: MongoDB (since version 4.0) provides multi-document transactions with ACID (atomicity, consistency, isolation, and durability) properties. This is an advantage for use cases that require modifying multiple documents atomically. HBase natively supports single-row transactions, while multi-row transactions need to be implemented manually (or with third-party solutions). Hypertable does not directly support transactions.
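A minimal pymongo sketch of a multi-document transaction (requires a replica set on MongoDB 4.0 or newer; the collection and documents are placeholders):

```python
from pymongo import MongoClient

# Requires a replica set and MongoDB 4.0+; names are placeholders.
client = MongoClient("mongodb://mongo1:27017/?replicaSet=rs0")
accounts = client["bank"]["accounts"]

with client.start_session() as session:
    with session.start_transaction():
        # Both updates commit atomically, or neither does.
        accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -100}}, session=session)
        accounts.update_one({"_id": "bob"}, {"$inc": {"balance": 100}}, session=session)
```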

Query Language: MongoDB offers a rich query language with support for secondary indexes and join-like lookups. With MongoDB, you can perform complex queries, including aggregations, much as with SQL databases. HBase and Hypertable are accessed by key and do not provide a SQL-like query mechanism out of the box; you would typically fetch data through get/put API calls on specific row keys.
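To make the contrast concrete, here is a rough sketch of both styles of access (collection, table, and row-key names are placeholders):

```python
import happybase
from pymongo import MongoClient

# MongoDB: an aggregation with a join-like $lookup stage (names are placeholders).
orders = MongoClient()["shop"]["orders"]
pipeline = [
    {"$match": {"status": "paid"}},
    {"$lookup": {
        "from": "customers",
        "localField": "customer_id",
        "foreignField": "_id",
        "as": "customer",
    }},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
]
results = list(orders.aggregate(pipeline))

# HBase: no query language here, just a get on a known row key.
table = happybase.Connection("hbase-thrift-host").table("orders")
row = table.row(b"order:1001")
```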

Real-time Processing: Both HBase and Hypertable are excellent choices for real-time read/write access to large datasets. MongoDB also provides fast real-time access, but it tends to work best for smaller datasets (compared with HBase and Hypertable) or when the working set fits in the server's RAM.

Data Durability and Recovery: MongoDB's WiredTiger storage engine provides journaling and checkpointing to help ensure data durability. HBase offers durability through its write-ahead log (WAL) and automated failover. Hypertable relies on the underlying distributed filesystem (such as HDFS) for data durability.

API and Client Libraries: MongoDB has an array of client libraries and a well-documented API available in many languages, so it's relatively easy to integrate with a variety of programming languages. HBase provides a native Java client API plus REST, Thrift, and Avro APIs for non-Java front ends, but it has fewer client libraries than MongoDB. Hypertable also provides APIs for languages like Python, Ruby, Java, and PHP, but its development community is not as large as MongoDB's or HBase's.

RonaldVance

As we already know, storing 1 billion keys takes about 166 gigabytes, which already fills most of a typical 256 GB server. However, if we also store a value for each key, that capacity will no longer be sufficient, and the most RAM we can install on the motherboard is 512 GB.
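Rough back-of-the-envelope with the figures above (the value size comes from the topic starter's 120-byte records; Redis per-entry overhead would add more on top):

```python
# Approximate capacity math; all figures are rough.
keys = 1_000_000_000
bytes_per_key = 166      # ~166 GB observed for 1e9 bare keys
value_bytes = 120        # record size mentioned by the topic starter

keys_only_gb = keys * bytes_per_key / 1e9                    # ~166 GB
with_values_gb = keys * (bytes_per_key + value_bytes) / 1e9  # ~286 GB

# Already past a 256 GB box even before allocator overhead, though under 512 GB.
print(keys_only_gb, with_values_gb)
```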

One potential solution to consider is application-level sharding across Redis instances, or using memcached. This is something we are currently discussing.

On the other hand, it's not necessary to keep the entire billion keys in memory. Only about 100 million of them are active at any time; the rest can live in the database, and when needed we can quickly read them back and put them in the cache.

Perhaps it's worth considering switching to a different type of database? We require delayed write (insert only) and fast read speeds (no more than 2 ms). We are ready to use SSD if it could help meet these requirements.

After all, the insert operation is initially done in the cache, and all nodes retrieve data from the cache. Therefore, the time it takes for the database to write the data doesn't have a significant impact. However, minimizing the delay in case of a cache miss is crucial.
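Roughly the pattern we have in mind, sketched with redis-py and pymongo (hosts, key scheme, and TTL are placeholders; in practice the database write could also be queued asynchronously):

```python
import json

import redis
from pymongo import MongoClient

# Hosts, names, and TTL are placeholders.
cache = redis.Redis(host="localhost", decode_responses=True)
events = MongoClient()["mydb"]["events"]
TTL = 3600

def get_event(user_id):
    cache_key = f"event:{user_id}"
    cached = cache.get(cache_key)
    if cached is not None:
        return json.loads(cached)                      # hot path: served from memory
    doc = events.find_one({"user_id": user_id}, {"_id": 0})
    if doc is not None:
        cache.set(cache_key, json.dumps(doc), ex=TTL)  # repopulate on a cache miss
    return doc

def put_event(user_id, payload):
    doc = {"user_id": user_id, "payload": payload}
    cache.set(f"event:{user_id}", json.dumps(doc), ex=TTL)  # cache is written first
    events.insert_one(dict(doc))  # the database write is off the read hot path
```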

agelinajohly

The issue with write operations in general is that every insertion requires the indexes to be updated, which poses a problem. Therefore, the only feasible way to address this is to divide the database into multiple parts (sharding). The records are then distributed across all shards, ideally evenly, depending on the key-selection algorithm. This leads to the equation: records_per_node = total_records / number_of_nodes. Thus, for 10,000 operations per second and 100 nodes, the calculation becomes 10000/100 = 100 write operations per second per node.
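A quick sketch of that arithmetic plus a simple hash-based shard router (the key scheme and node count are just placeholders):

```python
import hashlib

# Per-node load under an even key distribution (figures from this thread).
total_ops_per_sec = 10_000
nodes = 100
print(total_ops_per_sec / nodes)   # 100.0 write operations per second per node

# A trivial hash-based router: the same key always lands on the same shard.
def shard_for(key, nodes=100):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % nodes

print(shard_for("user:42"))
```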

It seems there are no other viable ways to scale writes, as opium has correctly suggested. The simplest approach would be to shard within MongoDB itself, so that locking only affects part of the data. Cassandra/Riak could be more suitable alternatives, but they are still cluster solutions that require a larger number of nodes to achieve better performance.

Additionally, it is worth noting that keeping such a large volume of records on a single server is not feasible.

