Handling Massive Data Tables

Started by johnadam, May 23, 2023, 12:16 AM


johnadam (Topic starter)

Greetings!

Prior to my current project, I had never worked with companies that dealt with a large amount of data.

I am faced with the challenge of managing a table with an indeterminate amount of data, possibly tens of millions of records, weighing in at over a dozen gigabytes including the indexes. Adding further indexes would require significant additional disk space.

In addition, there is a grid that needs to display and output this data, complete with filtering options for various fields, export capabilities with unlimited rows, and dynamic dropdown lists generated from queries to our database service via API.

To meet these requirements, we rely on a series of queries: retrieving ten records at a time while calculating pagination, counting the total number of records, exporting all data filtered by the user's selection, and generating dropdown lists for each filter field.

As a result, queries are processed multiple times for a table with millions of records, leading to timeouts and crashes during filtering and sorting.

I have experimented with indexing by ID for better performance, but there are many variable fields, such as dates, GUIDs, project names, and even JSON data, that make it difficult to optimize performance without putting undue stress on the system.
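
For context, a single grid filter query looks roughly like this (the table and column names below are placeholders, not our real schema):

    -- rough shape of one grid query: variable filters, a jsonb field, and offset pagination
    SELECT *
    FROM   events
    WHERE  created_at BETWEEN '2023-01-01' AND '2023-03-31'
      AND  project_name ILIKE '%alpha%'
      AND  payload ->> 'status' = 'done'
    ORDER  BY created_at DESC
    LIMIT  10 OFFSET 200;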

I welcome feedback from anyone who has experience working with complex datasets on a PostgreSQL server, and any suggestions for improving query performance without sacrificing functionality.


ZvyagizevE

The question is poorly constructed and mixes real issues with unrealistic ideas, and I don't want to resort to profanity. It's not a matter of lacking knowledge about how to handle large databases, but rather of not being able to work with the database as a whole.

When it comes to indexing, you shouldn't just recklessly add an index to every field being searched; an index should be added only where analysis shows it is needed. Composite indexes may also be necessary once you have analyzed the actual queries.
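
As a rough illustration (the table and column names are made up, not from the original question), a composite index matching a frequent filter-plus-sort pattern looks like this:

    -- hypothetical composite index for queries that filter by project and sort by date
    CREATE INDEX idx_events_project_created
        ON events (project_id, created_at DESC);

    -- it can serve queries shaped like:
    -- SELECT ... FROM events WHERE project_id = 42 ORDER BY created_at DESC LIMIT 10;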

Queries with LIKE '%...%' are not recommended; consider full-text search instead. Even then, it's often best to offload search to an external service such as Elasticsearch. SELECT DISTINCT for building filter lists is the worst option, since it points to a lack of understanding of database design principles such as normalization.
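
If you do stay inside PostgreSQL, a minimal built-in full-text setup looks roughly like this (assumes PostgreSQL 12+ for the generated column; table and column names are illustrative):

    -- keep a precomputed tsvector and index it with GIN
    ALTER TABLE events
        ADD COLUMN search_vector tsvector
            GENERATED ALWAYS AS (to_tsvector('simple', coalesce(project_name, ''))) STORED;

    CREATE INDEX idx_events_search ON events USING gin (search_vector);

    -- replaces LIKE '%alpha%' with an index-backed search
    SELECT * FROM events WHERE search_vector @@ plainto_tsquery('simple', 'alpha');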

It's unclear why people think gigabyte-sized indexes are necessary when most fields in a database are only a few bytes. Any specific query that times out should be analyzed with EXPLAIN.
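
For example (the query below is only an illustration):

    -- show the real plan, timing, and buffer usage of a slow query
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM   events
    WHERE  project_id = 42
    ORDER  BY created_at DESC
    LIMIT  10;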

To work with large volumes, you simply need a firm grasp of the database fundamentals: the relational model, normalization, indexes, and query optimization. For the grid, consider using Elasticsearch or Sphinx to index all the filterable fields and serve selections through the search service instead of querying the database directly.

rnelmilaz9

To improve database performance, avoid '%...%' queries and consider an external full-text search engine. It's important to create indexes where they are actually needed so that filtering by fields is efficient.
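
For instance, if the grid only ever shows non-deleted rows, a partial index keeps the index small (the deleted flag and table name here are assumptions for illustration):

    -- index only the rows the grid actually filters on
    CREATE INDEX idx_events_active_created
        ON events (created_at)
        WHERE deleted = false;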

In some cases, pre-fetching several pages of data can improve pagination speed, as long as it doesn't overload the database. Another approach is to run the full filtered query only when the filters change and save just the matching identifiers in a server-side table. Caching query results can also improve client-side performance, but the cache must be invalidated carefully to avoid serving stale data.
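
A rough sketch of the "identifiers only" idea (all names and the session id below are placeholders):

    -- when the filters change, materialize just the matching ids once
    CREATE TABLE filter_result (
        session_id  uuid,
        position    bigint,
        event_id    bigint,
        PRIMARY KEY (session_id, position)
    );

    INSERT INTO filter_result (session_id, position, event_id)
    SELECT '00000000-0000-0000-0000-000000000001'::uuid,
           row_number() OVER (ORDER BY created_at DESC),
           id
    FROM   events
    WHERE  project_id = 42;

    -- each page is then a cheap primary-key join
    SELECT e.*
    FROM   filter_result f
    JOIN   events e ON e.id = f.event_id
    WHERE  f.session_id = '00000000-0000-0000-0000-000000000001'
      AND  f.position BETWEEN 21 AND 30
    ORDER  BY f.position;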

Filtering data on the client-side may seem unusual, but it can be faster than doing so on the server in some cases. However, this depends on the data and may complicate pagination algorithms.

When choosing between HDD and SSD servers, setting up a small SSD as a cache for a slow HDD can improve performance significantly. Moving the file system journal to an SSD can also speed up writes. For reliability, use RAID1 for any SSDs that act as a write cache.

ipnesterov

70 GB may seem insignificant compared to the terabytes of data people handle these days. However, the real issue lies in performing a full scan of the table while executing a query, not its size. The wildcard condition '%word%' requires scanning each row, making regular indexes less useful. Although full-text indexes exist, they need proper preparation to function well. The optimal solution depends on the specific task. For instance, if the keywords are in a string format with spaces or separators, they can be stored separately in a table with indexed rows. This approach eliminates the need for a full-text search.
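
A sketch of that side-table idea (names are placeholders):

    -- one row per keyword instead of '%word%' scans over the big table
    CREATE TABLE event_keyword (
        event_id bigint NOT NULL,
        keyword  text   NOT NULL
    );
    CREATE INDEX idx_event_keyword ON event_keyword (keyword);

    SELECT e.*
    FROM   event_keyword k
    JOIN   events e ON e.id = k.event_id
    WHERE  k.keyword = 'invoice';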

When dealing with large amounts of data, it's crucial to optimize query execution to minimize processing time. One way to achieve this is to use efficient indexing strategies designed for the specific dataset. Another approach is to restructure the data to reduce the need for full scans. By considering the requirements of the task at hand and the available resources, developers can often find creative solutions that improve query performance without sacrificing accuracy.

prorasa

The key to optimizing query performance lies in a comprehensive approach that addresses indexing, database schema design, caching, and system architecture.

Indexing: While indexing by ID is a good start, identify the specific fields that are frequently used in filtering and sorting and consider adding indexes to them. It's important to carefully balance the number and type of indexes to avoid excessive disk usage.
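
Before adding new indexes, it can help to check which existing ones are large but rarely used (standard statistics views, no assumptions about your schema):

    -- index sizes versus how often the planner actually uses them
    SELECT indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
           idx_scan
    FROM   pg_stat_user_indexes
    ORDER  BY pg_relation_size(indexrelid) DESC;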

Partitioning: Implementing database partitioning can help manage large tables by dividing them into smaller, more manageable chunks. Range or list partitioning may be particularly useful for tables with date fields, allowing for easier management and improved query performance.
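
A minimal range-partitioning sketch by date (PostgreSQL 10+; the table and columns are illustrative):

    CREATE TABLE events_partitioned (
        id          bigint       NOT NULL,
        created_at  timestamptz  NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (created_at);

    -- one partition per quarter, for example
    CREATE TABLE events_2023_q2 PARTITION OF events_partitioned
        FOR VALUES FROM ('2023-04-01') TO ('2023-07-01');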

Caching: Consider implementing caching mechanisms using tools such as Redis or Memcached to store the results of frequently executed queries. This can significantly reduce the need to reprocess the same data repeatedly, improving overall system performance.
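
Inside PostgreSQL itself, a materialized view can play a similar caching role for expensive aggregates (a sketch with illustrative names, not a substitute for an external cache like Redis or Memcached):

    CREATE MATERIALIZED VIEW project_counts AS
    SELECT project_id, count(*) AS total
    FROM   events
    GROUP  BY project_id;

    -- refresh on a schedule or after bulk loads
    REFRESH MATERIALIZED VIEW project_counts;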

Database Schema Optimization: Evaluate the database schema and consider denormalizing certain fields, particularly those containing JSON data. Denormalization can reduce the need for complex joins and data manipulation during retrieval, leading to improved query performance.
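
One common form of this is pulling a frequently filtered key out of the JSON into its own indexed column (assumes PostgreSQL 12+ generated columns; names are illustrative):

    ALTER TABLE events
        ADD COLUMN status text GENERATED ALWAYS AS (payload ->> 'status') STORED;

    CREATE INDEX idx_events_status ON events (status);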

Client-Side Processing: For the grid that needs to display and output the data, consider implementing client-side pagination and filtering using front-end frameworks like React or Angular. This can offload some processing from the server and provide a smoother user experience.

Vertical and Horizontal Scaling: Depending on the system's requirements and performance demands, vertical scaling by upgrading hardware resources or horizontal scaling through the implementation of sharding or distributed database systems may be necessary to distribute the workload across multiple servers.

Thorough performance testing and benchmarking are essential to assess the impact of these optimizations. Consider utilizing tools like pg_stat_statements, EXPLAIN plans, and pgbench to analyze query performance and system behavior under different load scenarios.
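
For example, with pg_stat_statements enabled (it must be listed in shared_preload_libraries, which requires a restart), the most expensive queries can be pulled straight from the database (column names as of PostgreSQL 13+):

    CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

    -- top queries by total execution time
    SELECT query, calls, total_exec_time, mean_exec_time
    FROM   pg_stat_statements
    ORDER  BY total_exec_time DESC
    LIMIT  10;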

phrimall

To improve your PostgreSQL performance without sacrificing functionality, consider these adjustments:

Tune Configuration: Optimize your PostgreSQL configuration settings, such as shared_buffers, effective_cache_size, and work_mem (see the sketch after this list).
Vacuum and Analyze: Regularly run VACUUM and ANALYZE to maintain table statistics and reclaim storage occupied by dead tuples.
Connection Pooling: Implement connection pooling to manage database connections efficiently.
Read Replicas: Set up read replicas to offload read-heavy workloads and improve query performance.
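
A rough sketch of the first two points (the values below are examples only; the right settings depend on your hardware and workload, and the table name is a placeholder):

    ALTER SYSTEM SET shared_buffers = '4GB';          -- needs a server restart to take effect
    ALTER SYSTEM SET effective_cache_size = '12GB';
    ALTER SYSTEM SET work_mem = '64MB';
    SELECT pg_reload_conf();                          -- applies the reloadable settings

    -- refresh planner statistics and reclaim dead tuples for the big table
    VACUUM (ANALYZE) events;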

