Uploading Large Data Sets

Started by iekkmooeca, Jun 19, 2023, 06:47 AM


iekkmooeca (Topic starter)

Good day. I was given the task of importing a data set that includes product photos, an Excel file with 800k products, and an empty SQL table. The plan is to connect a library for working with Excel from PHP, parse the information from the document, upload the photos, and fill the SQL table. However, I am concerned about whether the hosting server can handle the load of uploading all 800,000 products at once.

One solution could be to break the process into smaller parts, say 50k products at a time. To implement this, I am thinking of using a GET parameter to determine which slice of records to import, for example site.com/import?step=2 to process the slice from product 50,001 to 100,000.

I am wondering if this approach is the right way to go or if there is a better alternative. Perhaps there are other ways to optimize the upload process or to upgrade the server's capabilities to handle the load more efficiently.
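For what it's worth, here is a rough sketch of that step-based idea using PhpSpreadsheet's read filter, so each request only loads its own slice of rows into memory. The file name products.xlsx, the products table and its columns, and the database credentials are placeholders, and the exact IReadFilter method signature can differ between library versions.

<?php
require 'vendor/autoload.php';

use PhpOffice\PhpSpreadsheet\IOFactory;
use PhpOffice\PhpSpreadsheet\Reader\IReadFilter;

// Keep only the rows that belong to the current chunk.
class ChunkReadFilter implements IReadFilter
{
    private $startRow;
    private $endRow;

    public function __construct($startRow, $chunkSize)
    {
        $this->startRow = $startRow;
        $this->endRow   = $startRow + $chunkSize - 1;
    }

    public function readCell($columnAddress, $row, $worksheetName = '')
    {
        return $row >= $this->startRow && $row <= $this->endRow;
    }
}

$chunkSize = 50000;
$step      = max(1, (int)($_GET['step'] ?? 1));
$startRow  = ($step - 1) * $chunkSize + 2;                 // +2 skips the header row

$reader = IOFactory::createReader('Xlsx');
$reader->setReadDataOnly(true);                            // ignore formatting to save memory
$reader->setReadFilter(new ChunkReadFilter($startRow, $chunkSize));
$sheet = $reader->load('products.xlsx')->getActiveSheet();
$rows  = $sheet->rangeToArray('A' . $startRow . ':D' . ($startRow + $chunkSize - 1));

$pdo  = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO products (article, name, price, picture) VALUES (?, ?, ?, ?)');

$pdo->beginTransaction();
foreach ($rows as $row) {
    if ($row[0] === null) {                                // past the end of the data
        continue;
    }
    $stmt->execute([$row[0], $row[1], $row[2], $row[3]]);
}
$pdo->commit();

echo 'Imported step ' . $step;

With this layout, site.com/import?step=2 covers rows 50,001 to 100,000, and the steps can be triggered one after another from a small loop or the command line.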

thietkeweb3s

Is the source file on your desktop an Excel file or a CSV? If it's Excel, does it contain anything beyond a plain table, such as layouts or formulas? If it's just a table, could you convert it into a CSV file?

With CSV-formatted data, MySQL can load everything itself, without PHP or its libraries. This is usually significantly faster than parsing the file in PHP and then sending the data to MySQL row by row.

When I had to load a product file containing millions of items into a database, the LOAD DATA command proved to be the best option.

Here is a sample of my previous MySQL code for loading files:

-- Load the file
LOAD DATA LOCAL INFILE '/srv/cms_cpa/files/adimport_items.csv' INTO TABLE adimport_tmp
    CHARACTER SET utf8 FIELDS TERMINATED BY '|' ENCLOSED BY "'" LINES TERMINATED BY '\n' IGNORE 1 LINES
    (id_adimport,article,available,currencyId,delivery,description,id,name,oldprice,param,picture,price,url,vendor,advcampaign_id,advcampaign_name);

-- Load only the required fields: columns read into @variables are discarded instead of inserted
LOAD DATA LOCAL INFILE '/srv/cms_cpa/files/adimport_items.csv' INTO TABLE adimport_tmp
    CHARACTER SET utf8 FIELDS TERMINATED BY '|' ENCLOSED BY "'" LINES TERMINATED BY '\n' IGNORE 1 LINES
    (id_adimport,@ISBN,@adult,@age,article,@attrs,@author,available,@barcode,@binding,@brand,@categoryId,@country_of_origin,currencyId,delivery,description,@downloadable,@format,@gender,id,@local_delivery_cost,@manufacturer_warranty,@market_category,@model,@modified_time,name,oldprice,@orderingTime,@page_extent,param,@performed_by,@pickup,picture,price,@publisher,@sales_notes,@series,@store,@syns,@topseller,@type,@typePrefix,url,vendor,@vendorCode,@weight,@year,advcampaign_id,advcampaign_name,@deeplink);

-- All fields
LOAD DATA LOCAL INFILE '/srv/cms_cpa/files/adimport_items.csv' INTO TABLE adimport_tmp
    CHARACTER SET utf8 FIELDS TERMINATED BY '|' ENCLOSED BY "'" LINES TERMINATED BY '\n' IGNORE 1 LINES
    (id_adimport,ISBN,adult,age,article,attrs,author,available,barcode,binding,brand,categoryId,country_of_origin,currencyId,delivery,description,downloadable,format,gender,id,local_delivery_cost,manufacturer_warranty,market_category,model,modified_time,name,oldprice,orderingTime,page_extent,param,performed_by,pickup,picture,price,publisher,sales_notes,series,store,syns,topseller,type,typePrefix,url,vendor,vendorCode,weight,year,advcampaign_id,advcampaign_name,deeplink);
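If you still want to drive this from PHP (as in the original setup), the same statement can be executed through PDO with LOCAL INFILE enabled. This is only a sketch: the DSN and credentials are placeholders, and local_infile must also be allowed on the MySQL server itself.

<?php
$pdo = new PDO(
    'mysql:host=localhost;dbname=shop;charset=utf8',
    'user',
    'pass',
    [PDO::MYSQL_ATTR_LOCAL_INFILE => true]     // allow LOAD DATA LOCAL on this connection
);

$sql = <<<'SQL'
LOAD DATA LOCAL INFILE '/srv/cms_cpa/files/adimport_items.csv'
INTO TABLE adimport_tmp
CHARACTER SET utf8
FIELDS TERMINATED BY '|' ENCLOSED BY "'"
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(id_adimport, article, available, currencyId, delivery, description,
 id, name, oldprice, param, picture, price, url, vendor,
 advcampaign_id, advcampaign_name)
SQL;

$affected = $pdo->exec($sql);                   // number of rows loaded
echo "Loaded {$affected} rows into adimport_tmp";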

lotuscontainer05

To avoid wasting server resources, it may be prudent to convert XLS files to CSV.

When sending files via POST, check the file size against the server's upload limits (upload_max_filesize and post_max_size in PHP).

Instead of processing the entire file at once, consider breaking it into smaller parts.

To move heavy processing out of the web request, consider running the import from cron.

Setting a longer timeout (max_execution_time, or set_time_limit() in the script) is possible and often necessary.

For a task like this, it can also help to drive the breakdown from the client via a web interface: use AJAX to send small parts of the file, wait for the response, track the percentage completed, and continue. A sketch of the server side of this approach follows this post.

Addition: It's important to ensure efficient and secure file processing, especially when dealing with large files. Taking steps such as those mentioned can help streamline the process and protect against potential issues.
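Here is a rough sketch of that server side, assuming the spreadsheet has already been exported to CSV. The file path, the products table and its columns, the database credentials, and the fixed total of 800,000 rows are placeholders.

<?php
// The client POSTs the offset it has reached; this endpoint processes one
// chunk, then returns JSON so the browser can update a progress bar and
// immediately request the next chunk.
set_time_limit(300);                             // allow a longer run per chunk

$chunkSize = 5000;
$file      = '/srv/import/products.csv';         // placeholder path
$offset    = max(0, (int)($_POST['offset'] ?? 0));

$csv = new SplFileObject($file);
$csv->setFlags(SplFileObject::READ_CSV);
$csv->seek($offset + 1);                         // +1 skips the header line

$pdo  = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');
$stmt = $pdo->prepare('INSERT INTO products (article, name, price) VALUES (?, ?, ?)');

$processed = 0;
$pdo->beginTransaction();
while (!$csv->eof() && $processed < $chunkSize) {
    $row = $csv->current();
    $csv->next();
    $processed++;                                // count every consumed line for the offset
    if ($row === false || $row === [null]) {     // skip blank lines
        continue;
    }
    $stmt->execute([$row[0], $row[1], $row[2]]);
}
$pdo->commit();

$totalRows = 800000;                             // or count the data rows once up front
$done      = min($offset + $processed, $totalRows);

header('Content-Type: application/json');
echo json_encode([
    'offset'   => $done,
    'percent'  => round($done / $totalRows * 100, 1),
    'finished' => $csv->eof(),
]);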

balkonreont

To prepare 800k products for sale, we recommend installing www.mysqlfront.de and using it to manage your data. Instead of dumping all the photos into one folder, sort them into folders of around 1-3k photos each. A simple way to do this with any file manager is to name each folder after the first three characters of the photo's file name, which also avoids collisions (see the sketch below).

Since you have already named your photos after their article numbers, do this sorting before packing everything into an archive and sending it via FTP to your hosting provider. Note that at an estimated average of 50 KB per product photo, 800k photos could total around 40 GB, which raises the question of where the previously mentioned 2GB came from.
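A small sketch of that sorting step in PHP, with placeholder source and target paths:

<?php
// Move each photo into a subdirectory named after the first three
// characters of its file name, so no single folder grows too large.
$source = '/srv/import/photos';
$target = '/srv/www/uploads/products';

foreach (new DirectoryIterator($source) as $file) {
    if (!$file->isFile()) {
        continue;
    }
    $prefix = substr($file->getFilename(), 0, 3);
    $dir    = $target . '/' . $prefix;

    if (!is_dir($dir)) {
        mkdir($dir, 0755, true);                 // create e.g. .../products/abc
    }
    rename($file->getPathname(), $dir . '/' . $file->getFilename());
}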

ZoriWser

Uploading a large volume of data (photos plus an 800k-row Excel file) and populating an empty SQL table can indeed put a strain on server resources, potentially leading to performance issues or even server crashes.

The approach you mentioned, breaking the process into smaller parts and using GET parameters to determine the slice of records to upload, is a practical solution to address this challenge. By implementing this approach, you can effectively distribute the load and optimize the upload process.

To further optimize the upload process and enhance the server's capabilities, there are several considerations to take into account:

1. Use Chunked Uploads: Implement a chunked file upload process where the large Excel file is divided into smaller chunks and uploaded sequentially. This can prevent timeouts and memory issues that may arise from trying to process the entire file at once.

2. Utilize Background Processing: Implement a background processing mechanism, such as asynchronous processing or job queues, to handle the parsing of Excel data, photo uploading, and SQL table population. This can free up server resources for other tasks and improve overall system responsiveness.

3. Implement Caching Mechanisms: Utilize caching mechanisms to store parsed data and avoid repetitive parsing of the same Excel file. This can reduce the processing overhead and improve the overall efficiency of the upload process.

4. Optimize SQL Table Population: Consider optimizing the SQL table population process by using batch inserts or bulk operations to efficiently insert a large number of records into the database. This can reduce the number of individual SQL queries and improve database performance (see the sketch after this list).

5. Server Capacity Planning: Evaluate the current server capacity and consider upgrading the server resources, such as increasing CPU, memory, or storage capacity, to better accommodate the anticipated load from the data upload process.
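Regarding point 4, here is a minimal sketch of batched inserts with PDO; the products table, its columns, the credentials, and the sample input rows are placeholders.

<?php
// Build one multi-row INSERT per batch instead of one query per record.
$pdo = new PDO('mysql:host=localhost;dbname=shop;charset=utf8mb4', 'user', 'pass');

function insertBatch(PDO $pdo, array $rows): void
{
    if ($rows === []) {
        return;
    }
    // One "(?, ?, ?)" placeholder group per row.
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?)'));
    $stmt = $pdo->prepare("INSERT INTO products (article, name, price) VALUES $placeholders");
    $stmt->execute(array_merge(...$rows));       // flatten [[a, n, p], ...] into one flat value list
}

// In practice these rows would come from the Excel/CSV parser.
$parsedRows = [
    ['A-001', 'First product', 9.90],
    ['A-002', 'Second product', 19.50],
];

$batch = [];
foreach ($parsedRows as $row) {
    $batch[] = $row;
    if (count($batch) === 1000) {                // flush every 1,000 rows
        insertBatch($pdo, $batch);
        $batch = [];
    }
}
insertBatch($pdo, $batch);                       // flush the remainder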

Muhammad

Using GET parameters for pagination can lead to URL length issues and may not be SEO-friendly. Also, it doesn't scale well if you need to import more than 800k products. A better approach would be to use POST requests with JSON data.
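A minimal sketch of that idea, assuming the client sends a JSON body with hypothetical start/limit fields:

<?php
// Read the import slice from a JSON POST body instead of the query string.
$payload = json_decode(file_get_contents('php://input'), true) ?: [];

$start = max(0, (int)($payload['start'] ?? 0));
$limit = max(1, (int)($payload['limit'] ?? 50000));

// ... load rows $start .. $start + $limit - 1 and insert them using one of
// the approaches shown earlier in the thread ...

header('Content-Type: application/json');
echo json_encode(['next' => $start + $limit]);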

