Searching Legal dоcuments with Regex Filters: Backend vs. SQL Approaches

jesusbond · Feb 27, 2023, 04:05 AM

I possess a database containing one million dоcuments with text that includes information on court decisions, such as names and types of lawsuits. I aim to develop a web service for users to explore this database based on various data/parameters from these court decisions. Essentially, there is a need to work with text in order to execute a complex regex search, which can be achieved by implementing sphinx or ElasticSearch.

Users will be able to search for specific judges, such as "Judge Ivanov", and using regular expressions, the search will be conducted in such a way that "/Judge at the meeting: (.*?)/s", where $1 = "Ivanova".

To achieve the desired goal, it is essential to determine the optimal database (postgresql, mongodb, redis, etc.) and the appropriate language (php, ruby, python, etc.). The selected methods and technologies need to facilitate quick searches on large datasets for a million dоcuments.

In case it is not possible to execute the required regex search, it is still possible to ensure effective search functionality by searching for the entire set of a million dоcuments and applying a regex filter to each resulting dоcument on the backend, rather than conducting the filter on sql. This approach ensures that only relevant dоcuments are displayed.

AmitVermaSPS · Feb 27, 2023, 05:44 AM

In my opinion, ES is a highly appropriate solution here since the search is based on topics of words, making it possible to search for related phrases such as "Judges in the presidency of Ivanova" or "Judge Ivanova", in addition to "Judge at the meeting: Ivanov". Although ES has some advanced features, it is still user-friendly.

Using ES will help narrow down the list of dоcuments to a manageable level, allowing for effective software regexp filtering. After searching for all dоcuments with judges, meetings, and Ivanova, we can apply the software regexp filter to results. Additionally, ES supports scripts that can perform this function within the platform itself, and Linux search provides support for exact phrase search functionality.

rafiazafar · Feb 27, 2023, 08:05 AM

Consider Elastic for your search needs. It offers features like fuzzy search (Iavnov instead of Ivanov), faceted search (search by dоcument attributes, such as the judge's surname), regular expression search (add your own regular expression to the search query), synonym search (searching for "car" brings up dоcuments that mention "vehicle"), and more.

This technology is known as full-text search and can quickly handle millions of dоcuments. Applications like elastic and sphinx are widely used to search large datasets on popular platforms such as StackOverflow and Instagram. Consider using a postgres database, which supports efficient sharding and replication. Additionally, choose a language that is comfortable for you and ensure that you have a powerful server capable of handling the search engine workload. While PHP 8's RAM usage has reportedly doubled, Python's Django framework is more user-friendly and better suited for long-term support of existing code.

john.vanis5 · Feb 27, 2023, 09:48 AM

Sphinx would be an ideal choice for fine-tuning search functionality specifically for this task. Although it requires specialized knowledge, Sphinx is designed for this type of application. Programming language is not a significant factor, and I would prefer to use PHP since it is a language I am familiar with. However, it is important to identify specific requirements such as the type of data, its volume, update frequency, and frequency of new additions.

Regarding database selection, MySQL would be my recommendation. Another option is to index text files directly using Sphinx. While this approach may be suitable for some scenarios, it is important to evaluate if it aligns with specific needs and requirements.

puviyamilla · May 26, 2023, 11:31 AM

The MySQL DBMS provides several search tools, including the LIKE operator for simple searches and the RLIKE operator for regular expression-based searches.

Using the LIKE operator with the WHERE construction returns 1 or 0 and is often supplemented by special characters '%' and '_'. These allow matching any number of characters and one character, respectively. The operator also supports negation via NOT LIKE.

In contrast, the RLIKE (or REGEXP) operator offers more flexible search conditions using regular expressions, which are templates applied to text from left to right. MySQL DBMS implements dialects of regular expressions that adhere to the POSIX standard. The operator supports '^' and '$' to bind expressions to the beginning and end of a string, respectively, and '[[:<:]]' and '[[:>:]]' to bind patterns to the beginning and end of a word.

Special characters such as '|' and character classes like [0-9] can also be used in regular expressions. However, it is important to note that the RLIKE operator works slower than the LIKE operator due to its complex search functionality.

chloetanner · Jul 13, 2023, 09:28 AM

Based on your requirements, it seems that you need a database and language that can handle large datasets efficiently and allow for complex regex searches. Here are some recommendations:

1. Database: Both PostgreSQL and MongoDB can handle large datasets effectively. PostgreSQL is a relational database management system (RDBMS) that offers powerful features and supports complex queries. MongoDB is a NoSQL database that provides high scalability and flexibility for handling unstructured data. Choose the one that aligns better with your data model and query needs.

2. Language: Python is a popular choice for text processing and has excellent support for regex. It offers various libraries and frameworks, such as Flask or Django, for developing web services. Python's extensive ecosystem and ease of use make it a suitable language for your project.

3. Search Engine: Sphinx or Elasticsearch are both capable search engines that can handle complex regex searches efficiently. Sphinx is a full-text search engine designed specifically for indexing and querying large dоcument collections. Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene, offering scalability and flexibility. Both options can be integrated with your chosen database and programming language.

Considering your need for quick searches on a large dataset, Elasticsearch might be a more suitable choice, as it is known for its fast and scalable search capabilities.

In summary, you can consider using PostgreSQL or MongoDB as your database, Python as the programming language, and Elasticsearch as the search engine to achieve your goals efficiently.

Johnny007 · Jun 25, 2025, 09:25 AM

I'd recommend Elasticsearch as the core search engine since it's a beast for full-text search and can handle complex queries way better than raw SQL. Pair it with PostgreSQL for structured data storage (like metadata or indexed fields) to keep things tidy. For the backend, Python with Django or Flask is my go-to - super flexible for integrating Elasticsearch and handling regex post-processing if needed.

Sphinx could work too, but it's kinda old-school compared to ES's scalability and indexing power. If regex searches can't be natively optimized in ES, pulling raw results and filtering on the backend is a legit workaround, though it's a bit of a performance hog. Focus on caching frequent queries with Redis to avoid hammering the DB. This setup should keep your search snappy even with a massive dataset

Searching Legal dоcuments with Regex Filters: Backend vs. SQL Approaches

jesusbond

AmitVermaSPS

rafiazafar

john.vanis5

puviyamilla

chloetanner

Johnny007