Best Database for High Volume Unique Rows

Started by isoconsultantgmg, Sep 01, 2023, 06:59 AM


isoconsultantgmg (Topic starter)

As we continue to develop our project, we have encountered an interesting challenge.

Input data:
- 150 million unique strings, each 1 to 2048 characters long (URLs in this case, but they could be arbitrary binary strings)
- Daily growth of about 1 million rows, with the total expected to reach 5-10 billion
- Currently stored in PostgreSQL with the following layout: id (primary key), link bytea (btree index). The table occupies 22 GB and the index 32 GB (a rough sketch of the schema is at the end of this post).

Task:
- Select a better-suited storage solution that reduces the physical size of the index and improves performance.

Requirements for the new database:
- Preferably a highly specialized and optimized storage solution for this type of data (unique string + its id)
- Efficient lookups in both directions: finding the id by its key (the link) and retrieving a link by its id
- Ability to scale horizontally
- Disk storage (while it is understood that storing everything in memory would provide the best performance, this option is not currently being considered)

Previous attempts:
- None of the key/value stores we tried (LevelDB, RocksDB, etc.) support lookups by value (where the key is the link and the value is its id)
- An extensive search was conducted, but unfortunately, no suitable solution was found.

I would love to discuss this further and hear the opinions of those who have encountered similar challenges and how they were solved.
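
For reference, a rough sketch of what we have today. This is only illustrative; the connection string and all names are placeholders, not our real setup:

```python
# Rough sketch of the current PostgreSQL layout described above.
# The DSN, table, and index names are illustrative placeholders.
import psycopg2

conn = psycopg2.connect("dbname=links_db user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS links (
            id   bigserial PRIMARY KEY,
            link bytea NOT NULL
        )
    """)
    # The btree index on the raw link -- this is what currently takes ~32 GB.
    cur.execute("CREATE INDEX IF NOT EXISTS links_link_btree ON links (link)")

    # The two lookups that have to stay fast:
    cur.execute("SELECT id FROM links WHERE link = %s", (b"https://example.com/page",))
    cur.execute("SELECT link FROM links WHERE id = %s", (42,))
```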


radiom

The challenge you're facing requires a specialized storage solution that can efficiently handle large volumes of unique strings and provide efficient search functionality. While there isn't a one-size-fits-all answer, there are a few potential approaches you can consider:

1. Distributed Key-Value Stores: Consider distributed key-value stores like Cassandra or ScyllaDB. These databases are optimized for write-heavy workloads and can handle very large data sets. You can store the string as the key and the ID as the value; because these stores only query efficiently by key, keep a second table keyed by id for the reverse lookup (a minimal sketch follows this list). They also offer built-in horizontal scalability, which helps accommodate future growth.

2. Log-Structured Merge Trees (LSM Trees): Another option is to explore databases based on LSM Trees like Apache HBase or Apache Kudu. LSM Trees are well-suited for write-intensive workloads and can efficiently handle large data sets. You can store the strings as row keys and IDs as row values. These databases also provide fast random read access and can handle high write throughput.

3. Custom Indexing Solutions: If off-the-shelf databases do not meet your requirements, consider building a custom indexing solution. You can create an inverted index where the strings are mapped to IDs and optimize it specifically for your use case. This approach may involve developing your own storage engine or leveraging existing technologies like Apache Lucene or Elasticsearch for indexing.

4. Hybrid Approach: Another option is to consider a hybrid approach that combines multiple storage technologies. You can continue using PostgreSQL for persistence and utilize an in-memory caching layer like Redis or Memcached for faster lookup access. This way, you can benefit from PostgreSQL's flexibility while improving performance with the caching layer.

5. Trie-based Data Structures: Trie data structures are well-suited for efficient storage and retrieval of strings. You could explore in-memory stores such as Redis with a custom trie module, or implement your own trie-based solution. Tries provide fast lookup by string and also support prefix and wildcard searches.

6. Distributed File Systems: Distributed file systems like the Apache Hadoop Distributed File System (HDFS), possibly paired with HBase's HFile storage format, may be worth considering. These systems are designed for storing and managing large amounts of data across multiple nodes, offering scalability, fault tolerance, and efficient storage.

7. Custom Partitioning and Sharding: Depending on your requirements, you can design a custom partitioning and sharding strategy for your data. This involves splitting the dataset into multiple smaller subsets and distributing them across different database instances. By carefully selecting the partitioning scheme, you can optimize both storage and performance.

8. Object Storage: Object storage systems like Amazon S3 or Google Cloud Storage could be suitable for storing your unique strings. While they may not provide direct search functionality like databases, you can leverage external indexing mechanisms (e.g., Elasticsearch) or maintain indexes separately to facilitate efficient lookups.
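
To make point 1 concrete: since Cassandra and ScyllaDB only query efficiently by key, the usual pattern is one table per lookup direction. Here is a minimal, hedged sketch using the DataStax Python driver; the contact point, keyspace, and table names are assumptions for illustration:

```python
# Sketch of the two-table pattern from point 1 (DataStax Python driver).
# Contact point, keyspace, and table names are illustrative assumptions.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("links_ks")  # assumed, pre-created keyspace

# One table per lookup direction, since these stores query efficiently only by key.
session.execute("CREATE TABLE IF NOT EXISTS id_by_link (link blob PRIMARY KEY, id bigint)")
session.execute("CREATE TABLE IF NOT EXISTS link_by_id (id bigint PRIMARY KEY, link blob)")

def insert_link(link: bytes, link_id: int) -> None:
    # Write both directions so either lookup stays a single-partition read.
    session.execute("INSERT INTO id_by_link (link, id) VALUES (%s, %s)", (link, link_id))
    session.execute("INSERT INTO link_by_id (id, link) VALUES (%s, %s)", (link_id, link))

def id_for_link(link: bytes):
    row = session.execute("SELECT id FROM id_by_link WHERE link = %s", (link,)).one()
    return row.id if row else None

def link_for_id(link_id: int):
    row = session.execute("SELECT link FROM link_by_id WHERE id = %s", (link_id,)).one()
    return row.link if row else None
```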

When dealing with high volumes of unique rows, several databases are known for their ability to handle such scenarios efficiently. Some of the best databases for high-volume and high-cardinality workloads include:

1. Apache Cassandra: Cassandra is a highly scalable and distributed NoSQL database. It is designed to handle massive amounts of data across multiple nodes while maintaining fast read and write performance. Cassandra's decentralized architecture and partitioning scheme make it an excellent choice for high-volume unique rows.

2. Apache HBase: HBase is another distributed NoSQL database that excels in handling high-cardinality data. Built on top of Hadoop, it provides scalability, fault tolerance, and fast random read/write access. HBase's columnar storage format and ability to handle wide tables make it suitable for storing and querying high-volume unique rows.

3. Elasticsearch: While primarily known as a search engine, Elasticsearch can also serve as a high-performance store for unique rows. It provides full-text search capabilities, powerful indexing, and efficient document retrieval. Elasticsearch is particularly useful when you need to perform complex search queries on high-volume unique rows (a small lookup sketch follows this list).

4. ClickHouse: ClickHouse is a columnar database optimized for analytics workloads. It can handle high ingestion rates and efficiently store and query large volumes of unique rows. ClickHouse's compression techniques and data partitioning options make it a good fit for high-cardinality data.

5. Amazon Redshift: Redshift is a fully managed data warehousing service provided by AWS. It is built for handling large datasets and complex queries at scale. With its columnar storage and parallel processing capabilities, Redshift can handle high-volume unique rows efficiently.

6. MySQL or PostgreSQL: While not specifically designed for high-cardinality workloads, both MySQL and PostgreSQL have proven to handle large volumes of data effectively. They offer robust indexing capabilities and support various optimization techniques that can be leveraged to manage high-volume unique rows.
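
To illustrate the Elasticsearch option (point 3 above): mapping the link as a keyword field keeps exact-match lookups cheap, and using the numeric id as the document _id makes id lookups a direct GET. A minimal sketch with the 8.x-style Python client; the host, index name, and field name are assumptions:

```python
# Sketch of exact-match link lookups in Elasticsearch (point 3 above).
# Host, index name, and field name are illustrative assumptions (8.x-style client).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# "keyword" keeps the URL unanalyzed, so a term query is an exact match.
if not es.indices.exists(index="links"):
    es.indices.create(index="links",
                      mappings={"properties": {"link": {"type": "keyword"}}})

# The numeric id doubles as the document _id.
es.index(index="links", id=42, document={"link": "https://example.com/page"})

# Lookup by link (exact match on the keyword field).
hits = es.search(index="links", query={"term": {"link": "https://example.com/page"}})

# Lookup by id.
doc = es.get(index="links", id=42)
```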

Here are a few more databases known for their ability to handle high volumes of unique rows:

1. Apache Druid: Druid is a columnar store designed for real-time analytics. It can efficiently handle high-cardinality data by utilizing memory-mapped storage and indexing techniques. Druid's distributed architecture and ability to scale horizontally make it suitable for high-volume unique rows.

2. TimescaleDB: TimescaleDB is an extension of PostgreSQL specifically built for time-series data. It provides scalability, automatic partitioning, and optimized query performance for high-cardinality time-series data. TimescaleDB's hypertables allow for efficient storage and querying of high-volume unique rows over time.

3. InfluxDB: InfluxDB is another popular time-series database designed for handling high volumes of timestamped data. It offers high write and query performance, efficient compression, and flexible data retention policies. InfluxDB's indexing and built-in query language make it well-suited for working with high-cardinality unique rows in time-series data.

4. Google Bigtable: Bigtable is a distributed NoSQL database provided by Google Cloud. It is highly scalable and able to handle massive volumes of data with low latency. Bigtable's columnar storage, automatic sharding, and efficient data compression make it a good fit for storing and retrieving high-volume unique rows (a short sketch follows this list).

5. Apache Arrow Flight: While not a database itself, Apache Arrow Flight is a framework for high-performance data transfer across different systems. It can be used to efficiently exchange large volumes of unique rows between applications or databases, enabling fast and scalable data movement.
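
For the Bigtable option (point 4 above), a natural modeling is to make the link itself the row key and store the id in a cell; the reverse direction (id to link) would need a second table keyed by id. A hedged sketch with the google-cloud-bigtable client; the project, instance, table, and column-family names are assumptions:

```python
# Sketch of link-keyed rows in Bigtable (point 4 above).
# Project, instance, table, and column-family names are illustrative assumptions;
# the table and its "meta" column family are presumed to exist already.
from google.cloud import bigtable

client = bigtable.Client(project="my-project", admin=True)
instance = client.instance("my-instance")
table = instance.table("links")

link = b"https://example.com/page"  # row keys are limited to a few KB, so 2048-char URLs fit

# Write: row key = link, cell meta:id = 8-byte big-endian id.
row = table.direct_row(link)
row.set_cell("meta", b"id", (42).to_bytes(8, "big"))
row.commit()

# Read: lookup by link is a direct row read.
got = table.read_row(link)
if got is not None:
    link_id = int.from_bytes(got.cells["meta"][b"id"][0].value, "big")
```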

DusFriesteLet

1) If a hash of the link is used as the id, the separate index on the link becomes unnecessary (see the sketch below).

Instead of using a btree, consider a hash index, since it eliminates the need for sort order and stores fixed-size hash codes rather than the full keys.

2) Storing variable-length data in tables can be inefficient because of fragmentation and partially filled pages; overall, it increases storage requirements and slows performance.
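
One way point 1 could look in the existing PostgreSQL setup is to index a fixed-length SHA-256 digest of the link instead of the variable-length link itself (here the digest is a separate unique column rather than the id proper). A hedged sketch; the DSN and names are placeholders:

```python
# Sketch of the digest idea: index a fixed 32-byte SHA-256 of the link
# instead of the variable-length link. DSN and names are placeholders.
import hashlib
import psycopg2

conn = psycopg2.connect("dbname=links_db user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS links (
            id       bigserial PRIMARY KEY,
            link     bytea NOT NULL,
            link_sha bytea NOT NULL UNIQUE  -- fixed 32 bytes, much cheaper to index
        )
    """)
    # Alternatively, a hash index on the raw link avoids btree sort order entirely:
    # cur.execute("CREATE INDEX IF NOT EXISTS links_link_hash ON links USING hash (link)")

    link = b"https://example.com/page"
    digest = hashlib.sha256(link).digest()
    cur.execute(
        "INSERT INTO links (link, link_sha) VALUES (%s, %s) ON CONFLICT (link_sha) DO NOTHING",
        (link, digest),
    )
    cur.execute("SELECT id FROM links WHERE link_sha = %s", (digest,))
```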
  •  

sobata

You can compress links efficiently by using prefix matching: find the longest known prefix of each link and replace that prefix with its ID. This approach can yield significant resource savings.
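
A minimal Python sketch of that idea; the prefix table and IDs below are made up purely for illustration:

```python
# Minimal sketch of prefix compression: replace the longest known prefix
# of each link with a short numeric ID. The prefix table is illustrative.
PREFIXES = {
    1: "https://example.com/products/",
    2: "https://example.com/",
    3: "http://",
}
# Try the longest prefixes first so the most specific match wins.
ORDERED = sorted(PREFIXES.items(), key=lambda kv: len(kv[1]), reverse=True)

def compress(link: str) -> tuple[int, str]:
    for pid, prefix in ORDERED:
        if link.startswith(prefix):
            return pid, link[len(prefix):]
    return 0, link  # 0 = no known prefix

def decompress(pid: int, rest: str) -> str:
    return PREFIXES.get(pid, "") + rest

pid, rest = compress("https://example.com/products/42?ref=home")
assert decompress(pid, rest) == "https://example.com/products/42?ref=home"
```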