If you like DNray Forum, you can support it by - BTC: bc1qppjcl3c2cyjazy6lepmrv3fh6ke9mxs7zpfky0 , TRC20 and more...


PHP Parser Showdown

Started by chpolaxvm, Feb 06, 2024, 12:16 AM


chpolaxvm (Topic starter)

I've been asked about the most effective and efficient way to write a parser in PHP. There are several options to consider:

1. phpQuery:
   + It supports a variety of selectors.
   - However, it operates at a relatively slow speed.

2. Simple HTML Dom:
   + It offers good documentation and is easy to learn.
   - Unfortunately, it has been reported to have memory leaks, making it unsuitable for parsing large files.
   - It also has issues with parsing speed.

3. Nokogiri:
   + Nokogiri boasts high-speed operation.
   - On the downside, its documentation is considered terrible.

There might be other options that I haven't mentioned, but based on these considerations, which parser would you recommend? And do you have any additional insights or suggestions on this topic?
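As a baseline before picking any of these, it's worth noting that PHP's bundled DOM extension already covers a lot of this ground with no third-party dependency. A minimal sketch (the markup and selectors here are made up for illustration):

```php
<?php
// Baseline: parse HTML with the bundled DOM extension and query it via XPath.
$html = '<div class="item"><a href="/a">First</a></div>'
      . '<div class="item"><a href="/b">Second</a></div>';

libxml_use_internal_errors(true);   // tolerate sloppy, non-standard HTML quietly
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query('//div[@class="item"]/a') as $a) {
    // Collect href => link text for every matching anchor.
    $links[$a->getAttribute('href')] = trim($a->textContent);
}
print_r($links);   // ['/a' => 'First', '/b' => 'Second']
```

Any of the libraries above is essentially a convenience layer over this kind of code, so it makes a useful speed and memory reference point.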


apolice9

Considering the factors at hand, I would lean towards the PHP Simple HTML Dom parser due to its comprehensive documentation and ease of learning. While it has been reported to have memory leaks and parsing speed issues, these drawbacks can be mitigated by careful implementation and by considering the size of the HTML files being parsed. For smaller HTML files, the Simple HTML Dom parser can be a suitable and relatively straightforward choice.

However, it's imperative to acknowledge that the landscape of PHP parsing libraries is diverse, and alternatives such as Goutte and Symfony DomCrawler offer unique advantages. Goutte, for instance, provides a user-friendly API for web scraping, while leveraging the powerful features of the Symfony DomCrawler component for more advanced DOM manipulation.

In addition to library selection, it's paramount to optimize the parsing process itself. Techniques such as caching parsed results, minimizing unnecessary DOM traversals, and utilizing asynchronous processing can significantly enhance performance and efficiency.

While the Simple HTML Dom parser may be an appropriate choice for many scenarios, it's crucial to carefully evaluate project-specific needs and consider alternative libraries to ensure the most effective and efficient parsing method is employed. Each project may benefit from a tailored approach, taking into account factors such as file size, complexity, and the specific nature of the data being extracted.
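On the memory-leak point: the usual workaround with Simple HTML Dom is to release each parsed tree explicitly once you're done with it. A sketch, assuming the classic simple_html_dom.php library file is available in your include path (the helper name and markup are illustrative):

```php
<?php
// Sketch for the classic Simple HTML DOM library; the ->clear()/unset()
// pair is the commonly recommended workaround for its memory leaks.
include 'simple_html_dom.php';

function scrape_links(string $html): array
{
    $dom = str_get_html($html);      // parse HTML from a string
    $out = [];
    foreach ($dom->find('a') as $a) {
        $out[$a->href] = $a->plaintext;
    }
    $dom->clear();                   // break internal circular references
    unset($dom);                     // let the garbage collector reclaim the tree
    return $out;
}
```

Calling clear() after every document keeps long-running batch jobs from growing without bound, though for genuinely large files a streaming approach is still safer.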

jamesanderson11

The idea is to avoid reinventing the wheel, right? We can use cURL to fetch the content, which has documentation available at php.net/manual/book.curl.php. Another option could be the object-oriented wrapper at https://github.com/php-curl-class/php-curl-class. For parsing the document, we can make use of SimpleXML, which comes bundled with PHP and provides a unified, documented, and well-known interface.
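A rough sketch of that pipeline: since real-world HTML is rarely well-formed XML, it's common to load it through DOMDocument first and hand the tree to SimpleXML via simplexml_import_dom(). The fetch helper and sample markup below are illustrative:

```php
<?php
// Fetch a page with cURL (see php.net/manual/book.curl.php).
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    $body = curl_exec($ch);
    curl_close($ch);
    return $body !== false ? $body : '';
}

// Parse possibly-broken HTML into SimpleXML via DOMDocument.
function parse_html(string $html): SimpleXMLElement
{
    libxml_use_internal_errors(true);   // suppress warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    return simplexml_import_dom($dom);
}

// Works on a string too, no network needed:
foreach (parse_html('<ul><li>one</li><li>two</li></ul>')->xpath('//li') as $li) {
    echo $li, "\n";   // one, two
}
```

The DOMDocument detour matters because SimpleXML alone would reject most pages in the wild as malformed XML.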

Another useful component is the Symfony DomCrawler, documented at symfony.com/doc/current/components/dom_crawler.html. In case we only need to extract text, we can use DOMDocument instead of phpQuery. Based on my personal tests, it's approximately 4 times faster, although phpQuery does offer a wider variety of selectors.

I also explored some other options: Simple HTML Dom is quite slow, while Nokogiri is primarily just a parser and offers no speed advantage. It seems to be an overgrown version of DOMDocument with some hacks, so speed-wise there's no reason to use it. So, why not take advantage of the existing efficient solutions instead of reinventing the wheel?

missveronica

I'd like to share with you my thoughts on an amazing web parsing tool I recently came across. The tool I'm referring to is called Goutte, and it's truly impressive when it comes to its ability to navigate through website links effortlessly. It has a simple and intuitive interface, making it user-friendly even for beginners.

One interesting point to note is that it lists symfony/dom-crawler among its Composer dependencies, which, as a fellow developer suggested, may impact its performance slightly.

However, it compensates with a low entry threshold: quick sampling by CSS selectors, data and attribute retrieval using jQuery-like methods such as .text() and .attr(), and convenient iteration through .each().
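To illustrate, a typical Goutte session looks roughly like this (the target URL is a placeholder, and the package/namespace names assume the classic fabpot/goutte distribution installed via Composer):

```php
<?php
// Sketch: crawl a page with Goutte and read links jQuery-style.
require 'vendor/autoload.php';

use Goutte\Client;

$client  = new Client();
$crawler = $client->request('GET', 'https://example.com');   // placeholder target

// The jQuery-like methods come from the underlying symfony/dom-crawler:
$crawler->filter('a')->each(function ($node) {
    printf("%s -> %s\n", $node->text(), $node->attr('href'));
});
```

The filter()/text()/attr()/each() calls shown are exactly the DomCrawler methods mentioned above, which is why the learning curve feels so familiar to anyone coming from jQuery.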

