Searching Legal Documents with Regex Filters: Backend vs. SQL Approaches

Started by jesusbond, Feb 27, 2023, 04:05 AM


jesusbond (topic starter)

I have a database of one million documents containing the text of court decisions, including information such as names and types of lawsuits. I want to build a web service that lets users explore this database by various data/parameters from those court decisions. Essentially, I need to run complex regex searches over the text, which could be implemented with Sphinx or Elasticsearch.

Users will be able to search for a specific judge, such as "Judge Ivanova"; with a regular expression like "/Judge at the meeting: (.*?)/s", the capture group $1 would be "Ivanova".
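A minimal sketch of that capture-group extraction in Python (the sample text and the `\w+` surname pattern are illustrative assumptions; real patterns would need to allow initials, hyphens, and so on):

```python
import re

# Illustrative document snippet; real court texts will vary.
doc = "Case No. 123. Judge at the meeting: Ivanova. Civil lawsuit."

# Capture the surname that follows the fixed phrase.
match = re.search(r"Judge at the meeting:\s*(\w+)", doc)
judge = match.group(1) if match else None
```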

To achieve this, I need to determine the optimal database (PostgreSQL, MongoDB, Redis, etc.) and the appropriate language (PHP, Ruby, Python, etc.). The chosen stack must support fast searches across the full set of one million documents.

If the required regex search cannot be run in the search engine itself, effective search is still possible: retrieve candidate documents from the full million-document set and apply the regex filter to each result on the backend, rather than filtering in SQL. That way only the relevant documents are displayed.
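The fallback described above can be sketched as a backend filter (a minimal sketch; in practice the documents would be streamed from the database rather than held in a list):

```python
import re

def regex_filter(docs, pattern):
    """Keep only documents whose text matches the pattern.

    `docs` is any iterable of strings; compiling the pattern once
    avoids re-parsing it for every one of the million documents.
    """
    rx = re.compile(pattern)
    return [d for d in docs if rx.search(d)]

# Hypothetical candidate set, e.g. pre-narrowed by a full-text query.
candidates = [
    "Judge at the meeting: Ivanova. Civil lawsuit.",
    "Judge at the meeting: Petrov. Criminal case.",
]
hits = regex_filter(candidates, r"Judge at the meeting: Ivanova")
```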


In my opinion, ES is a highly appropriate solution here, since its search works on words and their forms, making it possible to match related phrasings such as "Judges in the presidency of Ivanova" or "Judge Ivanova" in addition to "Judge at the meeting: Ivanova". Although ES has some advanced features, it is still user-friendly.

Using ES will narrow the list of documents down to a manageable size for software regexp filtering. After searching for all documents mentioning judges, meetings, and Ivanova, we can apply a software regexp filter to the results. ES also supports scripts that can perform this filtering inside the engine itself, and the underlying Lucene index supports exact-phrase search.
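One way to express that two-stage idea in Elasticsearch's query DSL (the index mapping and field names here are assumptions, not a fixed recipe): a full-text `match` narrows the candidates, and a `regexp` filter runs inside the engine. ES anchors `regexp` patterns to the whole term, hence the leading and trailing `.*` on the raw keyword field.

```python
# Hypothetical mapping: `text` is an analyzed field with a
# `text.keyword` sub-field holding the raw value.
query = {
    "bool": {
        "must": {"match": {"text": "judge meeting Ivanova"}},
        "filter": {
            "regexp": {"text.keyword": ".*Judge at the meeting: Ivanova.*"}
        },
    }
}
```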


Consider Elasticsearch for your search needs. It offers features like fuzzy search ("Iavnov" instead of "Ivanov"), faceted search (filtering by document attributes, such as the judge's surname), regular-expression search (adding your own regular expression to the search query), synonym search (a query for "car" also brings up documents that mention "vehicle"), and more.
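For instance, the fuzzy case could look like this in the query DSL (the `judge_name` field is a hypothetical attribute extracted at index time):

```python
# "Iavnov" differs from "Ivanov" by one transposed pair of letters,
# which fuzziness "AUTO" tolerates for terms of this length.
fuzzy_query = {
    "fuzzy": {
        "judge_name": {
            "value": "Iavnov",
            "fuzziness": "AUTO",
        }
    }
}
```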

This technology is known as full-text search, and it can handle millions of documents quickly. Engines like Elasticsearch and Sphinx are widely used to search large datasets on popular platforms such as Stack Overflow and Instagram. Consider a PostgreSQL database, which supports replication and can be partitioned for scale. Beyond that, choose a language you are comfortable with, and make sure you have a server powerful enough for the search-engine workload. PHP 8's RAM usage has reportedly grown considerably, while Python's Django framework is user-friendly and well suited to long-term maintenance of existing code.


Sphinx would be an ideal choice for fine-tuning search functionality for this particular task. It requires specialized knowledge, but Sphinx is designed for exactly this type of application. The programming language is not a significant factor; I would prefer PHP since it is the language I am familiar with. It is important, however, to pin down the specific requirements: the type of data, its volume, how often it is updated, and how often new documents are added.

Regarding database selection, MySQL would be my recommendation. Another option is to index text files directly using Sphinx. While this approach may be suitable for some scenarios, it is important to evaluate if it aligns with specific needs and requirements.


The MySQL DBMS provides several search tools, including the LIKE operator for simple searches and the RLIKE operator for regular expression-based searches.

Used in a WHERE clause, LIKE evaluates to 1 or 0 and is usually combined with the wildcard characters '%' and '_', which match any sequence of characters and exactly one character, respectively. Negation is available via NOT LIKE.

In contrast, the RLIKE (or REGEXP) operator offers more flexible search conditions using regular expressions, i.e. patterns applied to the text from left to right. Before version 8.0, MySQL implemented a POSIX-style regular-expression dialect (the Henry Spencer library); MySQL 8.0 switched to the ICU library. The operator supports '^' and '$' to anchor a pattern to the beginning and end of a string, and '[[:<:]]' and '[[:>:]]' to anchor it to the beginning and end of a word.

Special characters such as '|' (alternation) and character classes like [0-9] can also be used in regular expressions. Note, however, that RLIKE is generally slower than LIKE, because regular-expression matching is more expensive, and neither operator can use an ordinary index when the pattern begins with a wildcard.
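The two operators can be tried out with SQLite as a stand-in (MySQL ships RLIKE/REGEXP built in; SQLite needs a user-defined function, so the regex semantics below come from Python's `re`, not MySQL's engine; the table and column names are made up):

```python
import re
import sqlite3

# In-memory SQLite stand-in. SQLite evaluates `X REGEXP Y` by
# calling the user function regexp(Y, X), i.e. (pattern, text).
conn = sqlite3.connect(":memory:")
conn.create_function(
    "REGEXP", 2, lambda pat, s: int(re.search(pat, s) is not None)
)

conn.execute("CREATE TABLE decisions (id INTEGER, body TEXT)")
conn.executemany(
    "INSERT INTO decisions VALUES (?, ?)",
    [
        (1, "Judge at the meeting: Ivanova. Civil lawsuit."),
        (2, "Judge at the meeting: Petrov. Criminal case."),
    ],
)

# LIKE: simple substring match using the % wildcard.
like_rows = conn.execute(
    "SELECT id FROM decisions WHERE body LIKE '%Ivanova%'"
).fetchall()

# REGEXP (MySQL: RLIKE): pattern-based match.
regexp_rows = conn.execute(
    "SELECT id FROM decisions WHERE body REGEXP 'Judge at the meeting: Iv\\w+'"
).fetchall()
```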


Based on your requirements, it seems that you need a database and language that can handle large datasets efficiently and allow for complex regex searches. Here are some recommendations:

1. Database: Both PostgreSQL and MongoDB can handle large datasets effectively. PostgreSQL is a relational database management system (RDBMS) that offers powerful features and supports complex queries. MongoDB is a NoSQL database that provides high scalability and flexibility for handling unstructured data. Choose the one that aligns better with your data model and query needs.

2. Language: Python is a popular choice for text processing and has excellent support for regex. It offers various libraries and frameworks, such as Flask or Django, for developing web services. Python's extensive ecosystem and ease of use make it a suitable language for your project.

3. Search Engine: Sphinx or Elasticsearch are both capable search engines that can handle complex regex searches efficiently. Sphinx is a full-text search engine designed specifically for indexing and querying large document collections. Elasticsearch is a distributed search and analytics engine built on top of Apache Lucene, offering scalability and flexibility. Both options can be integrated with your chosen database and programming language.

Considering your need for quick searches on a large dataset, Elasticsearch might be a more suitable choice, as it is known for its fast and scalable search capabilities.

In summary, you can consider using PostgreSQL or MongoDB as your database, Python as the programming language, and Elasticsearch as the search engine to achieve your goals efficiently.