Expert Advice on Database Management

Started by drnagwaniseo, Feb 22, 2023, 12:08 AM

Previous topic - Next topic

drnagwaniseoTopic starter

I am seeking the input of database experts or individuals who have experience working with Big Data databases.

There is an analytical system in place that operates on multiple sites and saves statistics for each page and element. The data is aggregated in memory and transferred to a database table once a day. Analysts can generate reports for specific periods and segments based on site, page, and element.

The table structure includes fields for site ID, page ID, element ID, date, user segment ID, and other data fields. We are currently using Amazon Aurora with the InnoDB engine, and about 80 million rows are added to the table daily, in blocks of 1000 rows per INSERT statement.

However, we are facing several issues with this approach. Firstly, insert speed drops significantly after about a month as the indexes grow. Secondly, report generation for pages with many elements takes too long. Lastly, deleting data for a specific site blocks the entire database.

What suggestions do you have to address these challenges?


Is everything sitting on a single disk? Cloud databases allow independent disks to be attached, with tables spread across them to increase throughput. This works even with SSDs, and can improve performance by separating index storage from data, or old data from current data.

While the day's analytics are being loaded, durability can be relaxed: file-system journaling can be disabled and flushing (fsync) for the table files turned off to speed up writes. This poses a risk of data loss if the OS crashes, but the likelihood is small. And since the data is written to the database from some other storage anyway, if the server does fail, processing for the current day can simply be restarted.

Dropping indexes before bulk insertion can speed up the process significantly: first insert the data without indexes, then create the indexes (building an index in one pass is much faster than maintaining it row by row) and run the analytics.
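As a sketch of that load pattern in MySQL/InnoDB (the table and index names here are hypothetical, for illustration only):

```sql
-- 1. Drop the secondary index before the daily bulk load.
ALTER TABLE page_stats DROP INDEX idx_site_page_date;

-- 2. Run the day's bulk INSERTs here, index-free.

-- 3. Rebuild the index in one pass, which is much cheaper
--    than maintaining it row by row during the load.
ALTER TABLE page_stats
  ADD INDEX idx_site_page_date (site_id, page_id, stat_date);
```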

For general analytics, it's best to work with daily aggregates rather than the raw data. Hand-rolled summary tables can serve as a kind of index: basic queries like COUNT, MAX, and MIN can be answered by summing up the daily values into a global result. For queries with conditions and complex groupings, aggregate by the hour, storing hourly values within each day's summary. For per-page reports, store a count for each page for each day, then aggregate accordingly.
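A minimal sketch of such a daily summary table (the table and column names are assumptions, not from the original post):

```sql
-- One row per site/page/day instead of one per raw event.
CREATE TABLE daily_page_stats (
  site_id   INT    NOT NULL,
  page_id   INT    NOT NULL,
  stat_date DATE   NOT NULL,
  views     BIGINT NOT NULL,
  PRIMARY KEY (site_id, page_id, stat_date)
);

-- Populate it once per day from the raw table...
INSERT INTO daily_page_stats (site_id, page_id, stat_date, views)
SELECT site_id, page_id, stat_date, SUM(views)
FROM page_stats
WHERE stat_date = '2023-02-21'
GROUP BY site_id, page_id, stat_date;

-- ...so a monthly report sums a few thousand daily rows
-- instead of scanning millions of raw ones.
SELECT page_id, SUM(views)
FROM daily_page_stats
WHERE site_id = 1
  AND stat_date BETWEEN '2023-02-01' AND '2023-02-28'
GROUP BY page_id;
```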


A database is not always necessary. When interviewing programmers, almost all of them mention MySQL, yet they rarely mention transactions. But transactions are the main reason to use a database at all; without them, all you really need is an append-only write.

A regular log file can be used instead, and it doesn't require any additional software: it's easy to parse and rotate. Besides, both Yandex and Google have analytics tools for exactly this purpose; Google even provides reports and other features, so there may be no need to store the data locally at all.


Indexes and big data usually don't mix well; they work against each other. It's better to move away from indexes and towards partitioning, which is more complex to set up but far better suited to analytical queries.

Ideally, all data should be replicated to another database with a different table structure, or even a different type of system. At this scale, a purpose-built big data system can have a lower total cost of ownership than a general-purpose DBMS.


To help choose the right type of DBMS, we'll briefly describe the databases available in our cloud and when to use them.

PostgreSQL is a popular Open Source OLTP DBMS that supports the SQL standard and ACID principles, making it well suited to processing transactions in real time. It is available in versions 9.6, 10, 11, and 12, in Single-Instance, Master-Slave, and Cluster configurations. It doesn't scale horizontally out of the box, however, so problems may arise as data volumes grow. PostgreSQL also has many built-in functions suitable for analytical tasks.

Postgres Pro Standard is an Object-Relational Database Management System (ORDBMS) based on PostgreSQL, with new features implemented by Postgres Professional and third-party improvements. It's worth choosing when additional functionality is needed that's not available in PostgreSQL.

MySQL is another Open Source OLTP system that provides high performance for read operations. Compared to PostgreSQL, it is less compliant with the SQL standard. Choose it when you need a DBMS for a transactional load without complex internal logic or heavy analytical queries.

MongoDB is one of the most popular NoSQL systems. It's document-oriented, providing significant flexibility since there is no need to rigidly define a table schema in advance. It also scales horizontally, so it can withstand very high loads. It's recommended for situations requiring high scalability and flexibility.


Here are some suggestions to address the challenges you're facing:

1. Partitioning: Consider partitioning your database table based on date or other relevant criteria. This will allow you to distribute the data across multiple physical storage resources, improving query performance and maintenance operations. Partition pruning can significantly speed up report generation as only relevant partitions are scanned.
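As a sketch, range partitioning by date in MySQL/Aurora might look like this (table, column, and partition names are illustrative; note that MySQL requires the partitioning column to be part of every unique key):

```sql
CREATE TABLE page_stats (
  site_id    INT    NOT NULL,
  page_id    INT    NOT NULL,
  element_id INT    NOT NULL,
  stat_date  DATE   NOT NULL,
  segment_id INT    NOT NULL,
  views      BIGINT NOT NULL,
  PRIMARY KEY (stat_date, site_id, page_id, element_id, segment_id)
)
PARTITION BY RANGE COLUMNS (stat_date) (
  PARTITION p2023_01 VALUES LESS THAN ('2023-02-01'),
  PARTITION p2023_02 VALUES LESS THAN ('2023-03-01'),
  PARTITION pmax     VALUES LESS THAN (MAXVALUE)
);

-- Queries filtered by date touch only the relevant partitions,
-- and dropping a month is a near-instant metadata operation
-- rather than a long, blocking DELETE:
ALTER TABLE page_stats DROP PARTITION p2023_01;
```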

2. Index Optimization: Review the indexing strategy for your table. Ensure that you have appropriate indexes on columns used in filtering and join conditions. Avoid unnecessary indexes that may slow down INSERT operations. Consider using composite indexes if your queries commonly involve multiple columns.

3. Batch INSERT Operations: You're already batching 1000 rows per INSERT, which is good. Experiment with larger batches using the multi-row "INSERT INTO ... VALUES (...), (...), ..." syntax, and consider a bulk load (LOAD DATA) for the daily dump; a single bulk load is usually much faster than many individual statements.
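For example, against the same hypothetical `page_stats` table:

```sql
-- Many rows per statement amortizes parsing and commit overhead.
INSERT INTO page_stats
  (site_id, page_id, element_id, stat_date, segment_id, views)
VALUES
  (1, 10, 100, '2023-02-21', 5, 42),
  (1, 10, 101, '2023-02-21', 5, 17),
  (1, 11, 102, '2023-02-21', 6,  8);

-- For the full daily dump, staging a file and bulk-loading it
-- is usually faster still:
-- LOAD DATA LOCAL INFILE '/tmp/stats_2023-02-21.csv'
--   INTO TABLE page_stats
--   FIELDS TERMINATED BY ',';
```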

4. Analyze Query Performance: Identify the specific queries that are causing slow report generation for pages with many elements. Use tools like EXPLAIN or query profiling to understand the query execution plan and identify areas for optimization. Depending on the findings, you might need to refactor queries, optimize indexes, or consider query caching strategies.
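A quick way to check whether a report query actually uses indexes or partition pruning (hypothetical names again):

```sql
EXPLAIN
SELECT page_id, SUM(views) AS total_views
FROM page_stats
WHERE site_id = 1
  AND stat_date BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY page_id;
-- Inspect the "type", "key", "rows", and (on MySQL 5.7+)
-- "partitions" columns: a full scan across all partitions is
-- the red flag to fix first.
```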

5. Archive or Purge Old Data: Consider archiving or purging old data that is no longer actively used for reporting. This can help reduce the size of the table and improve overall database performance. You can create an archival process to move the old data to separate tables or a different database to keep it accessible if needed.
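One way to sketch the archival step (the archive table name is an assumption):

```sql
-- Copy a closed-out period into an archive table...
INSERT INTO page_stats_archive
SELECT * FROM page_stats
WHERE stat_date < '2023-01-01';

-- ...then delete it from the hot table in bounded chunks, so no
-- single statement holds locks for long:
DELETE FROM page_stats
WHERE stat_date < '2023-01-01'
LIMIT 10000;   -- repeat until 0 rows are affected
```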

6. Database Replication and Read Replica: To prevent blocking the entire database when deleting data for a specific site, consider implementing database replication. With replication, you can have a read replica where reporting operations can be performed independently without impacting the main database. This helps to distribute the workload and avoid blocking scenarios.

7. Scaling Options: If the above optimizations don't provide sufficient performance improvements, consider scaling up your database infrastructure. This can involve increasing the compute and memory resources allocated to Amazon Aurora or exploring other Amazon Web Services (AWS) database solutions like Amazon Redshift or Amazon DynamoDB, depending on your specific requirements.

8. Denormalization: Consider denormalizing your database schema by duplicating some data across tables or introducing derived columns. This can help improve query performance by reducing the need for joins and aggregations. However, be cautious with denormalization as it can increase redundancy and maintenance complexity.

9. Materialized Views: Explore the use of materialized views to pre-aggregate and store frequently accessed or computationally expensive reports. A materialized view stores the results of a query as a physical table, allowing fast retrieval, and can be refreshed periodically to stay up to date. Note that MySQL and Aurora MySQL have no native materialized views, so this is typically emulated with summary tables refreshed by a scheduled job.
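A periodically refreshed summary table behaves much like a materialized view (names are illustrative; REPLACE assumes the summary table has a unique key on site_id, page_id, stat_date so reruns are idempotent):

```sql
-- Refresh yesterday's slice of the "materialized view".
REPLACE INTO report_page_daily (site_id, page_id, stat_date, views)
SELECT site_id, page_id, stat_date, SUM(views)
FROM page_stats
WHERE stat_date = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY site_id, page_id, stat_date;

-- The MySQL event scheduler can run the refresh automatically:
CREATE EVENT refresh_report_page_daily
  ON SCHEDULE EVERY 1 DAY
  DO REPLACE INTO report_page_daily (site_id, page_id, stat_date, views)
     SELECT site_id, page_id, stat_date, SUM(views)
     FROM page_stats
     WHERE stat_date = CURRENT_DATE - INTERVAL 1 DAY
     GROUP BY site_id, page_id, stat_date;
```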

10. Data Compression: Evaluate whether compressing your data can reduce storage requirements and potentially improve query performance. Check what your engine actually supports: stock InnoDB offers table compression (ROW_FORMAT=COMPRESSED) and page-level compression, but Aurora MySQL's support for these features differs from community MySQL, so verify before relying on them. Choose a compression approach based on your data characteristics and query patterns.

11. Query Optimization: Review and optimize the queries that are used for generating reports. Ensure that the queries are well-structured, use appropriate join techniques, and make efficient use of indexes. Look for opportunities to rewrite queries or break them down into smaller, more optimized parts.

12. Horizontal Sharding: Consider implementing horizontal sharding, where data is partitioned across multiple database instances or clusters. Each shard can manage a subset of the data, which can improve scalability and parallelism for both read and write operations.

13. Automated Database Administration: Explore tools and services that can automate database administration tasks, such as performance monitoring, query optimization, and index maintenance. AWS provides services like Amazon RDS Performance Insights and Amazon CloudWatch that can help in optimizing database performance and identifying bottlenecks.

14. Distributed Caching: Introduce a distributed caching layer, such as Amazon ElastiCache, to cache frequently accessed data. Caching can drastically improve query performance by reducing the need to fetch data from the database. However, it requires careful design and consideration to ensure data consistency.

15. Regular Database Maintenance: Perform regular maintenance tasks like index rebuilding, statistics updating (ANALYZE TABLE), and space reclamation (OPTIMIZE TABLE in MySQL, VACUUM in PostgreSQL). These activities keep performance steady by eliminating fragmentation, keeping optimizer statistics current, and reclaiming unused space.