MySQL mapping products with similar names

SmutSia · May 30, 2023, 06:25 AM

Hello.

My current task is to create some sort of automapping for products in a database to products in a price list. The goal is to link items with similar names together.

For example, if there is a product in the database called "Apple Macbook Pro 15", and in the price list there are "Apple Macbook Pro 15" and "Apple Macbook Pro 14", I need to link "Apple Macbook Pro 15" in the database to "Apple Macbook Pro 15" in the price list.

I have tried using MySQL full-text search, but it is not effective because it returns both "Apple Macbook Pro 15" and "Apple Macbook Pro 14" with the same relevance.

I have also attempted to use the Jaro-Winkler distance algorithm. Here's how it works: I take a product name from the database and break it into words, creating an array. Then I do the same thing for the name of the item from the price list. Next, I go through the array of words for the product in the database and look for a distance for it according to the Jaro-Winkler algorithm in the array of product words from the price list. If I find a distance greater than 0.95, I note that the corresponding word from the product that is in the price list was found for this word. Then, I check to see if each word in the product from the database has a corresponding word in the product from the price list, and consider them the same.

However, this method did not produce satisfactory results. For instance, if I have a product named "Apple Macbook Pro 15", and in the price list there is a product named "Apple Macbook Pro 15 bag is blue," the similarity score will be 100%, even though these are completely different products.

I'm not sure what other algorithms to try, so I would appreciate any advice or recommendations.

bidabings · May 30, 2023, 08:40 AM

The topic at hand revolves around intelligent searching and natural language processing, going beyond a simple SELECT query.

From the task description, it's clear that we need to perform semantic parsing of text in order to cluster product names.

To start with, it would be wise to look into the Sphynx search engine which has pre-built stemming capabilities for both Russian and English. It could also be fruitful to peruse presentations from Yandex and other companies about this subject.

centigon · May 30, 2023, 10:06 AM

The theme of discussion is "Fuzzy Search", also known as "Error Tolerant Search" or previously referred to as Fuzzy Search.

There are two important components to consider: the technology used to search and the methodology used to conduct the search.

If your data set is small and can be easily stored in memory, then use an inverted index. If not, consider using a search index like solr, elasticsearch or pure lucene.

One approach to enhancing the search process is by tuning one of the similarity coefficients. I would suggest considering the Sørensen coefficient (https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient ) or the inverse of the Dice measure. The Levenshtein distance takes longer to solve.

The basic steps involved in fuzzy search include data cleaning and indexing, followed by search query and ranking by relevance.

allricjohnson · May 30, 2023, 12:03 PM

Only a person can make an honest comparison with a 100% guarantee, as it requires an analysis of the meaning behind the letters and words.

To improve your AI, consider taking into account not just the name but also the price range or other attributes to differentiate between products like the MacBook and the bag.

Another suggestion is to use more specific descriptions for accessories, such as "bag" or "case", to avoid confusion. This approach should solve most issues.

berto · Jul 08, 2024, 02:07 AM

The key is to leverage a combination of natural language processing (NLP) techniques and domain-specific heuristics to achieve accurate and robust product matching.

1. Tokenization and Normalization:
- Begin by tokenizing the product names from both the database and the price list into individual words or n-grams (sequences of n words). This can be done using standard text processing libraries like NLTK (Natural Language Toolkit) or spaCy.
- Normalize the tokens by converting them to lowercase, removing stop words (common words like "the," "a," "and," etc.), and stemming or lemmatizing the words to their base forms. This helps to reduce the impact of minor variations in product naming conventions.

2. Similarity Scoring:
- Utilize a more advanced similarity metric than Jaro-Winkler, such as Cosine Similarity or Semantic Similarity.
- Cosine Similarity measures the cosine of the angle between two vectors (product name representations), providing a score between 0 and 1 that indicates the similarity between the two. This can be implemented using libraries like scikit-learn or TensorFlow.
- Semantic Similarity leverages pre-trained language models (e.g., BERT, Word2Vec) to capture the contextual meaning of words and phrases, which can be more effective in handling synonyms and related terms. You can use libraries like spaCy or Hugging Face Transformers to compute semantic similarity.

3. Weighted Scoring:
- Assign different weights to different parts of the product name based on their importance. For example, give higher weights to brand names, model numbers, and other key identifiers, and lower weights to more general descriptors.
- This can help differentiate between products with similar names but different core characteristics. You can experiment with various weighting schemes and evaluate their impact on the matching accuracy.

4. Domain-Specific Heuristics:
- Leverage domain knowledge about the products to enhance the matching process. For instance, you could maintain a dictionary of known product models, features, and variants to better identify and match similar products.
- Incorporate rules to handle common product naming conventions, such as the presence of prefixes, suffixes, or abbreviations. This can involve creating regular expression patterns or decision trees to capture these patterns.
- Consider incorporating external data sources, such as product catalogs or industry-specific taxonomies, to enrich your understanding of the product domain and improve the matching accuracy.

5. Ensemble Approach:
- Combine multiple similarity scoring techniques (e.g., Cosine Similarity, Semantic Similarity, string-based metrics) into an ensemble model. This can be done by training a meta-model (e.g., a random forest or a neural network) that takes the individual similarity scores as inputs and outputs a final matching probability.
- The ensemble approach can help improve the overall accuracy and robustness of the product matching by leveraging the strengths of different algorithms and reducing the impact of individual weaknesses.

6. Iterative Refinement:
- Continuously evaluate the performance of your product mapping solution and identify areas for improvement. This can involve running the system on a set of test cases, manually reviewing the results, and gathering feedback from domain experts.
- Implement active learning techniques, where the system learns from user feedback and adjusts its models accordingly. This can involve fine-tuning the language models, adjusting the weighting schemes, or adding new domain-specific rules.
- Maintain a feedback loop to continuously improve the system and adapt to changes in the product data or naming conventions.

7. Scalability and Performance:
- Ensure your solution can handle large-scale product databases and price lists efficiently.
- Explore indexing and caching strategies to optimize the lookup and matching processes.
- Consider implementing the solution as a microservice or a distributed system to leverage parallel processing and improve scalability.

By combining these techniques, you can develop a robust and accurate product mapping solution that can handle the complexities of your specific use case. Remember to continuously monitor and refine the system to maintain high performance and accuracy as your product databases evolve over time.

poojakryptonsoft · Dec 17, 2024, 11:26 AM

Your Jaro-Winkler approach is a solid start, but it seems overly simplistic for nuanced product names. Consider leveraging Natural Language Processing (NLP) techniques, such as tokenization and semantic similarity measures like Word2Vec or BERT embeddings.

These models can capture contextual relationships between words, allowing for more accurate comparisons. Instead of just checking for matching words, compute a similarity score based on the entire product description, which will help differentiate between "Apple Macbook Pro 15" and "Apple Macbook Pro 15 bag is blue."

MySQL mapping products with similar names

SmutSia

bidabings

centigon

allricjohnson

berto

poojakryptonsoft