Hosting & Domaining Forum

Hosting Discussion => Managed Hosting and Services => Topic started by: JPinto on Mar 15, 2023, 07:25 AM

Title: How do I update a huge amount of data?
Post by: JPinto on Mar 15, 2023, 07:25 AM
I have a user database that requires weekly data updates from external sites. However, the update script takes too long to run within the 30-second hosting limit.

If I update one user's information at a time, the script finishes within 10-20 seconds. However, manually creating hundreds of cron rules just to trigger the script every minute is not ideal.

Is there a solution to this problem without altering the script's execution time?
Title: Re: How do I update a large amount of data?
Post by: Crevand on Mar 15, 2023, 08:26 AM
To address the issue, consider making the parsing and data-updating work asynchronous. The parser can build a pool of update tasks, which a second script then works through to update the required data; a rough sketch of that split is below.

Implementing trigger-based updates could also be beneficial: as soon as the parser extracts and inserts the data into the database, the dependent tables are updated automatically.
Note, however, that this approach is not always ideal, since triggers can burden the database with constant re-indexing on every insert.
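A rough sketch of that producer/worker split, using SQLite from the standard library purely as a stand-in for whatever database the hosting actually provides (the update_tasks table and the batch size are illustrative, not from the post):

```python
import sqlite3

# Stand-in queue table; on real hosting this would live in the main database
conn = sqlite3.connect("tasks.db")
conn.execute("CREATE TABLE IF NOT EXISTS update_tasks "
             "(id INTEGER PRIMARY KEY, user_id INTEGER, done INTEGER DEFAULT 0)")

def enqueue(user_ids):
    """Producer: the parser drops one task per user into the pool."""
    conn.executemany("INSERT INTO update_tasks (user_id) VALUES (?)",
                     [(uid,) for uid in user_ids])
    conn.commit()

def work(limit=5):
    """Worker: a second script, run by cron, picks up a few tasks and marks them done."""
    rows = conn.execute("SELECT id, user_id FROM update_tasks WHERE done = 0 LIMIT ?",
                        (limit,)).fetchall()
    for task_id, user_id in rows:
        # ... fetch the external data and update this user's row here ...
        conn.execute("UPDATE update_tasks SET done = 1 WHERE id = ?", (task_id,))
        conn.commit()
```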
Title: Re: How do I update a large amount of data?
Post by: diyasaini on Mar 15, 2023, 10:39 AM
Consider running a script with unlimited execution time on a separate machine, connecting it to the hosting's database to perform all the necessary tasks.

To ensure security, configure the database to accept remote connections only from specific IP addresses in addition to local calls (in MySQL, this means creating a user account restricted to the worker machine's host and opening the server's bind-address/firewall for that address only).

To optimize heavy operations within the database itself, consider writing utility stored procedures. This approach can help minimize data transfer over the network and enhance overall efficiency.
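For instance, the worker on the separate machine could connect remotely and push the heavy lifting into a stored procedure. This is only a sketch: the credentials, host, and the refresh_user_stats procedure are placeholders, and it assumes the mysql-connector-python driver.

```python
import mysql.connector

# Connect from the external worker machine; credentials and host are placeholders
cnx = mysql.connector.connect(user='updater', password='secret',
                              host='db.example-host.com', database='your_database')
cursor = cnx.cursor()

# Do the heavy work inside the database so little data crosses the network.
# refresh_user_stats is a hypothetical stored procedure created beforehand.
cursor.callproc('refresh_user_stats')
cnx.commit()

cursor.close()
cnx.close()
```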
Title: Re: How do I update a large amount of data?
Post by: blazonbazaar on Mar 15, 2023, 01:38 PM
To work around set_time_limit() being disabled on the host, I build a simple task queue system.

First, I retrieve a limited set of active tasks from the update_list table using a simple SELECT statement:

SELECT id, task FROM update_list WHERE active = 1 LIMIT 3

Once processing is completed for a given task, I update the corresponding row in the update_list table to deactivate it:

UPDATE update_list SET active = 0 WHERE id = $obj->id

The crucial aspect here is to ensure that each operation does not exceed the 30-second limit.

If a task does exceed the limit, it becomes necessary to split the tasks into smaller pieces and optimize the database accordingly. One approach could be dividing a single table into multiple tables based on record ID, such as:

users_10k
users_20k

By doing so, each run stays within the time limit and the database remains easy to manage; a rough worker loop for this queue is sketched below.
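As an illustration, here is a minimal worker in Python (the mysql-connector-python driver, the 25-second budget, and the process() stub are my assumptions; the queries mirror the ones above):

```python
import time
import mysql.connector  # assumed driver; any DB-API connector works similarly

TIME_BUDGET = 25  # stay safely under the 30-second host limit

def process(task):
    """Placeholder for the actual per-user update work."""
    pass

def run_batch():
    start = time.monotonic()
    cnx = mysql.connector.connect(user='username', password='password',
                                  host='localhost', database='your_database')
    cursor = cnx.cursor()
    # Grab a small slice of pending work, mirroring the SELECT above
    cursor.execute("SELECT id, task FROM update_list WHERE active = 1 LIMIT 3")
    for task_id, task in cursor.fetchall():
        if time.monotonic() - start > TIME_BUDGET:
            break  # leave the rest for the next cron run
        process(task)
        # Deactivate the finished task, mirroring the UPDATE above
        cursor.execute("UPDATE update_list SET active = 0 WHERE id = %s", (task_id,))
        cnx.commit()
    cursor.close()
    cnx.close()

if __name__ == "__main__":
    run_batch()
```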
Title: Re: How do I update a huge amount of data?
Post by: pmyualxlg on Sep 09, 2023, 02:25 AM
Yes, there are a few possible solutions to your problem.

One option is to modify your script to process multiple user updates at once. This can be done by fetching the necessary data for multiple users in a single request to the external sites, rather than making individual requests for each user. By optimizing the data retrieval process, you can reduce the overall execution time of the script.
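A hedged sketch of that batching idea; the endpoint, its ids parameter, and the use of the requests library are assumptions, since the original post does not name the external sites:

```python
import requests  # assumed HTTP client

def fetch_users_batch(user_ids, batch_size=50):
    """Fetch many users per request instead of one request per user."""
    results = {}
    for i in range(0, len(user_ids), batch_size):
        chunk = user_ids[i:i + batch_size]
        # Hypothetical endpoint that accepts a comma-separated list of IDs
        resp = requests.get("https://example.com/api/users",
                            params={"ids": ",".join(map(str, chunk))},
                            timeout=10)
        resp.raise_for_status()
        for record in resp.json():
            results[record["id"]] = record
    return results
```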

Another approach is to use a task scheduling service or job queue system. Instead of relying on cron to run the script every minute, you can use a more sophisticated solution that allows you to schedule jobs with finer control over timing and execution. For example, you could use a service like Celery, which provides distributed task queues and supports scheduling tasks to run at specific intervals. This way, you can schedule the script to run every minute without manually creating hundreds of cron rules.
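If Celery is an option (it needs a long-running worker process and a message broker such as Redis, which shared hosting may not allow), the per-minute schedule might look roughly like this; the tasks.py module name, broker URL, and task body are illustrative:

```python
from celery import Celery

app = Celery('updates', broker='redis://localhost:6379/0')  # broker URL is a placeholder

@app.task
def update_next_batch():
    # Fetch a small slice of users and update them; body omitted in this sketch
    pass

# One beat entry replaces hundreds of cron rules: run the task every 60 seconds
app.conf.beat_schedule = {
    'update-users-every-minute': {
        'task': 'tasks.update_next_batch',
        'schedule': 60.0,
    },
}
```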

Additionally, you may want to consider caching frequently accessed data. If the external sites provide an API, you can retrieve the required data and store it locally in a cache. This can help reduce the number of requests made to the external sites, speeding up the data update process.
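As a rough illustration of the caching idea (in-memory only, so across separate cron runs you would need a persistent store such as a file, a database table, or Redis; the six-hour TTL is arbitrary):

```python
import time

_cache = {}
TTL = 6 * 60 * 60  # keep external responses for six hours (arbitrary choice)

def fetch_user_data_cached(user_id, fetch_func):
    """Return cached data while it is fresh, otherwise call the real fetcher."""
    entry = _cache.get(user_id)
    if entry and time.time() - entry[0] < TTL:
        return entry[1]
    data = fetch_func(user_id)
    _cache[user_id] = (time.time(), data)
    return data
```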

Lastly, it might be worth exploring alternative hosting options with longer execution time limits. If your current hosting provider imposes a 30-second limit, you could consider switching to a different provider that offers more flexible execution time constraints.

A few more potential solutions:

1. Implement incremental updates: Rather than updating all user data every week, you could modify your script to only fetch and update data that has changed since the last update. By tracking the timestamp of the last update for each user, you can significantly reduce the amount of data being processed and improve the overall execution time.

2. Use parallel processing: If your system has the resources, you can consider parallelizing the script execution. This involves dividing the workload across multiple threads or processes, allowing multiple users' information to be updated simultaneously. This approach can be challenging to implement correctly and efficiently, but it can lead to substantial improvements in execution time.

3. Optimize your database queries: Analyze and optimize the database queries used in your script. Ensure that you have proper indexes on relevant columns and eliminate any unnecessary or redundant queries. By optimizing your queries, you can reduce the time spent retrieving and updating data from your database.

4. Consider using asynchronous programming: If your script relies on network requests to fetch data from external sites, using asynchronous programming can help improve efficiency. Asynchronous programming allows multiple requests to be made concurrently, reducing the waiting time for each request and potentially speeding up the overall execution.


Here are some code examples to illustrate the solutions I mentioned earlier:

1. Implementing incremental updates:

```python
# Pseudocode to fetch and update only changed data
for user in users:
    last_update_time = user.last_update_time
    updated_data = fetch_updated_data(user.id, last_update_time)
    if updated_data:
        update_user_data(user.id, updated_data)
        user.last_update_time = current_time()
        save_user(user)
```

2. Using parallel processing:

```python
import concurrent.futures

# Pseudocode for parallel processing
def update_user_data(user_id):
    # Fetch and update user data here (placeholder body)
    pass

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    user_ids = [user.id for user in users]
    executor.map(update_user_data, user_ids)
```

3. Optimizing database queries:

```python
# Sketch of an ORM-based update (assumes a Flask-SQLAlchemy User model and db session)
def update_user_data(user_id):
    user = User.query.get(user_id)
    updated_data = fetch_updated_data(user.id)
    if updated_data:
        # Copy the changed fields onto the model, then commit once
        for field, value in updated_data.items():
            setattr(user, field, value)
        db.session.commit()
```

4. Using asynchronous programming:

```python
import asyncio
import aiohttp

# Pseudocode for making asynchronous requests
async def fetch_user_data(session, user_id):
    # api_url is assumed to hold the external site's endpoint for this user
    async with session.get(api_url) as response:
        return await response.json()

async def update_user_data(session, user):
    data = await fetch_user_data(session, user.id)
    if data:
        # Process and update user data here
        pass

# Example of running the asynchronous tasks concurrently
async def main():
    async with aiohttp.ClientSession() as session:
        tasks = [update_user_data(session, user) for user in users]
        await asyncio.gather(*tasks)

asyncio.run(main())
```

Here's a more concrete example of the database update step, using the MySQL connector directly:

```python
import mysql.connector

# Establish a connection to MySQL
cnx = mysql.connector.connect(user='username', password='password',
                              host='localhost', database='your_database')

cursor = cnx.cursor()

# Pseudocode for optimizing database queries
def update_user_data(user_id):
    query = "SELECT id, name, email FROM users WHERE id = %s"
    cursor.execute(query, (user_id,))
    user_data = cursor.fetchone()

    updated_data = fetch_updated_data(user_data)
    if updated_data:
        # Update the user data
        update_query = "UPDATE users SET name = %s, email = %s WHERE id = %s"
        cursor.execute(update_query, (updated_data['name'], updated_data['email'], user_id))
        cnx.commit()

# Example invocation
for user in users:
    update_user_data(user.id)

# Close the cursor and connection
cursor.close()
cnx.close()
```

This example demonstrates how to establish a connection to a MySQL database, execute queries, and perform updates based on the fetched data. You would need to replace `'username'`, `'password'`, `'localhost'`, and `'your_database'` with your MySQL credentials and database details.

Title: Re: How do I update a huge amount of data?
Post by: MariaL on May 23, 2025, 09:29 AM
Stop faffing about with manual cron rules—that's a noob trap and a scalability disaster. Write a bloody batch script to process one user per cycle, stash the state in a DB or file, and trigger it with a single cron job.

It's basic job queuing 101, not rocket science. If your hosting can't handle this lightweight workload, it's a junk-tier setup, and you're just polishing a turd. Upgrade to a proper VPS or cloud instance where you're not babysitting timeouts like some script kiddie.
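A minimal sketch of that single-cron batch pattern, assuming the state lives in a JSON file next to the script and that update_single_user() wraps the existing per-user logic (file name, batch size, and function names are all illustrative):

```python
import json
import os

STATE_FILE = "update_state.json"   # hypothetical state file kept next to the script
BATCH_SIZE = 20                    # users processed per cron run

def update_single_user(user_id):
    """Placeholder for the existing per-user update logic."""
    pass

def load_offset():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f).get("offset", 0)
    return 0

def save_offset(offset):
    with open(STATE_FILE, "w") as f:
        json.dump({"offset": offset}, f)

def run_once(all_user_ids):
    offset = load_offset()
    batch = all_user_ids[offset:offset + BATCH_SIZE]
    if not batch:
        offset = 0                     # wrap around and start the next full pass
        batch = all_user_ids[:BATCH_SIZE]
    for user_id in batch:
        update_single_user(user_id)
    save_offset(offset + len(batch))
```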