Searching for Similar Images in Large Databases Using Neurons and Hashing

Started by JimyChen, May 02, 2023, 12:10 AM


JimyChen (Topic starter)

The task at hand is finding similar images in a database. The current approach uses a neural network (Keras) to extract features from the images with the VGG16 network (the 4096-dimensional fully connected layer) and compares them with cosine distance, which works for small volumes. However, the database is very large: roughly 1 million images. The feature vectors for 100,000 images take about 5-7 GB, so 1 million images would need 50-70 GB. As a result the search slows down and the vectors no longer fit in RAM; reading them from disk takes too long, and storing everything as a single HDF5 file is inconvenient because the database changes frequently.
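For reference, here is a minimal sketch of the kind of pipeline described above, assuming TensorFlow/Keras and scikit-learn are available; the file names are placeholders, not real data.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from sklearn.metrics.pairwise import cosine_similarity

# Use the 4096-dimensional 'fc1' layer of VGG16 as the feature extractor.
base = VGG16(weights="imagenet", include_top=True)
extractor = Model(inputs=base.input, outputs=base.get_layer("fc1").output)

def extract_features(path):
    """Load one image, preprocess it, and return its 4096-d float32 feature vector."""
    img = image.load_img(path, target_size=(224, 224))
    x = image.img_to_array(img)[np.newaxis, ...]
    x = preprocess_input(x)
    return extractor.predict(x, verbose=0)[0].astype("float32")

# Brute-force cosine comparison, which only scales to small collections.
query = extract_features("query.jpg")                              # placeholder file name
db = np.stack([extract_features(p) for p in ["a.jpg", "b.jpg"]])   # placeholder files
scores = cosine_similarity(query[np.newaxis, :], db)[0]
print(scores.argsort()[::-1])   # indices of the most similar images first
```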

For now we have adopted the following approach: all vectors are stored in a database (currently MariaDB, with plans to move to PostgreSQL). The vectors are then loaded in batches and hashed with a solution such as https://github.com/pixelogik/NearPy. The hashes and picture IDs are kept in memory, which makes searches fast and lightweight, but not entirely accurate because the comparison method changes.
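A rough sketch of that in-memory hashing step with NearPy, under the assumption that the vectors have already been loaded from the database; the IDs, random vectors, and projection count are purely illustrative.

```python
import numpy as np
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

DIM = 4096
rbp = RandomBinaryProjections('rbp', 12)   # 12 random hyperplanes -> 12-bit bucket keys
engine = Engine(DIM, lshashes=[rbp])

# Store each feature vector under its picture ID (here: fake random vectors).
for picture_id in range(1000):
    vec = np.random.rand(DIM).astype('float32')
    engine.store_vector(vec, picture_id)

# Query: only vectors landing in the same bucket are compared, so results are approximate.
query = np.random.rand(DIM).astype('float32')
for vec, picture_id, distance in engine.neighbours(query):
    print(picture_id, distance)
```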

Are there any other solutions worth considering?


diakKisdisy

To summarize, my plan would be to retain, for each image, both the 4096-component feature vector and the 1000-component output (classification) vector. An additional layer could reduce the dimensionality further, but that would require retraining the network.

To search, we would extract both the 1000- and 4096-component vectors from the query image, first compute cosine distance on the smaller vector and discard candidates beyond a chosen threshold, then compute the distance on the larger vector, discard the remaining distant candidates, and compare whatever is left.

Eliminating images with very different classification results should, in theory, reduce the number of calculations required, but that would have to be verified by experiment or estimation. And, of course, parallelizing the task would be essential.
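A rough sketch of that two-stage filter; the threshold, array names, and top-k value are made up just to show the control flow. In practice db_small and db_large would be the stored 1000-d and 4096-d matrices for the collection, loaded in batches.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def two_stage_search(q_small, q_large, db_small, db_large, ids,
                     coarse_threshold=0.5, top_k=10):
    # Stage 1: keep only candidates whose cheap 1000-d vectors are close enough to the query.
    candidates = [i for i in range(len(ids))
                  if cosine(q_small, db_small[i]) >= coarse_threshold]
    # Stage 2: rank the survivors by the expensive 4096-d comparison.
    scored = sorted(candidates,
                    key=lambda i: cosine(q_large, db_large[i]),
                    reverse=True)
    return [ids[i] for i in scored[:top_k]]
```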

gasgrill

To shard effectively, it is crucial to ensure sufficient memory on each of the N servers. Also keep in mind that every vector has to be checked individually, since ordinary database indexes do not help here, so each query amounts to a linear scan over the vectors. As the database grows, the search time would normally increase (unlike the time to load from disk), but sharding mitigates this: with parallel search across shards, the time spent on a query does not have to grow with the number of images.

Sharding can have a significant impact on the scalability and efficiency of databases. By dividing data across multiple servers, it distributes the workload more evenly, which can mean faster response times and better performance. Sharding can also reduce the risk of system-wide failures, since a problem with one shard does not necessarily affect the entire system. That said, it is important to be aware of the drawbacks, such as increased complexity and the need for proper management and maintenance.
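As an illustration of the idea, here is a toy version of a sharded, parallel linear scan; in a real deployment each shard would live on its own server and be queried over the network, rather than in a local thread pool, and the data below is random.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def search_shard(query, shard_vectors, shard_ids, k):
    # Cosine similarity of the query against every vector in this shard (linear scan).
    norms = np.linalg.norm(shard_vectors, axis=1) * np.linalg.norm(query) + 1e-9
    scores = shard_vectors @ query / norms
    top = np.argsort(scores)[::-1][:k]
    return [(shard_ids[i], float(scores[i])) for i in top]

def sharded_search(query, shards, k=10):
    # 'shards' is a list of (vectors, ids) pairs, one per shard/server.
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = [pool.submit(search_shard, query, vecs, ids, k) for vecs, ids in shards]
        partial = [hit for f in futures for hit in f.result()]
    # Merge the per-shard top-k lists into a global top-k.
    return sorted(partial, key=lambda hit: hit[1], reverse=True)[:k]

# Tiny demo with two random shards of 500 vectors each.
rng = np.random.default_rng(0)
shards = [(rng.random((500, 4096), dtype=np.float32),
           list(range(i * 500, (i + 1) * 500))) for i in range(2)]
print(sharded_search(rng.random(4096, dtype=np.float32), shards, k=5))
```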

Guargomum

Investing in a server with 64-94 GB of memory is a substantial commitment; however, such an investment pays off in speed and performance down the line.

Alternatively, one could consider using a cloud server with similar specifications to reduce initial costs. This way, you can test it out for a few months and determine if it meets your needs before committing to purchasing a physical server.

Having a powerful server with ample memory can greatly improve the efficiency and overall performance of your system. It is worth investing in such technology, but it is important to weigh the costs and consider all options before making a decision.

irvine

There are several alternative solutions worth considering to optimize the search process in your scenario. Here are a few suggestions:

1. Approximate Nearest Neighbor (ANN) Algorithms: Instead of using exact nearest neighbor search, you can utilize ANN algorithms like Locality Sensitive Hashing (LSH) or Random Projection Trees. These algorithms provide approximate results with significantly lower computational requirements, allowing you to speed up the search process.

2. Dimensionality Reduction Techniques: You can apply dimensionality reduction techniques like Principal Component Analysis (PCA) or t-Distributed Stochastic Neighbor Embedding (t-SNE) on your feature vectors. This reduces the dimensionality of your data while preserving the important structure, thereby reducing the memory footprint and speeding up the search process.

3. Distributed Computing: If the database no longer fits in RAM, you can consider distributing the computation across multiple machines using frameworks like Apache Spark. This allows you to scale horizontally and process larger volumes of data efficiently.

4. Image Embeddings: Instead of computing features using neural networks on-the-fly during each search, you can precompute and store image embeddings. These embeddings are lower-dimensional representations of the images that capture important characteristics. By using these embeddings, you can potentially speed up the search process while reducing computational requirements.

5. Indexing Methods: Explore different indexing methods like Inverted Indexing or Annoy Indexing that are designed for efficient similarity search. These methods create indexes to quickly retrieve similar images based on specific criteria, reducing the need for exhaustive comparison.

6. Product Quantization: Product Quantization is a technique that partitions high-dimensional vectors into subvectors and quantizes them separately. This method can significantly reduce storage requirements while providing fast approximate nearest neighbor search.

7. Faiss Library: Faiss is a popular library for efficient similarity search and clustering of large-scale datasets. It is specifically designed to handle billions of vectors and provides several indexing methods optimized for different scenarios; a short, hedged sketch follows after this list.

8. Incremental Index Update: If your database changes frequently, you can explore incremental index update techniques. Instead of rebuilding the entire index each time, you can incrementally update the index with new images, saving computation time and avoiding the need to store the entire database as a single file.

9. Use SSD or NVMe Storage: If slow disk retrieval is a bottleneck, consider using Solid-State Drives (SSD) or Non-Volatile Memory Express (NVMe) storage instead of traditional magnetic hard drives. These storage options offer faster read and write speeds, reducing the disk access time during retrieval.

10. Clustered Search: If the search latency is still a concern with the current approach, you can distribute the search workload across multiple machines using a clustered search setup. Each machine can handle a subset of the database, allowing for parallel search operations and further reducing response times.

11. Use Approximate Similarity Search Libraries: There are specialized libraries available that provide efficient approximate similarity search, such as Annoy, NMSLIB, and HNSW. These libraries offer different indexing structures and algorithms optimized for high-dimensional data, allowing for fast retrieval of similar images.

12. Fine-tune the Neural Network Model: Instead of using a pre-trained VGG16 network, you can fine-tune the model on your specific dataset. By training the network with your data and task-specific objectives, you may be able to extract more compact and discriminative features that require less memory and computation.

13. Use GPU Acceleration: If you have access to GPUs, consider leveraging their parallel processing capabilities to speed up feature extraction and similarity computation. GPUs can significantly accelerate these operations, reducing the overall search time.

14. Consider Data Partitioning: If the size of your database is a challenge, you can partition the data into smaller subsets based on certain criteria (e.g., time, category, or other relevant attributes). This enables faster retrieval from disk and reduces memory requirements for each subset.

15. Cloud Computing: If you have budget flexibility, consider utilizing cloud computing resources such as Amazon EC2 or Google Cloud Platform. Cloud providers offer scalable infrastructure and services that can help mitigate resource limitations and handle large-scale image databases efficiently.

16. Query Expansion: Incorporate query expansion techniques to improve the search results. By expanding the query with relevant terms or features, you can potentially increase the accuracy of the search and retrieve more similar images.

17. Caching: Implement a caching mechanism to store frequently accessed or recently retrieved images in memory. This can help reduce the disk retrieval time by keeping commonly used images readily available.

18. Use GPUs for Computation: Utilize Graphics Processing Units (GPUs) for feature extraction and similarity calculations. GPUs are highly parallel processors that can significantly speed up the computation time compared to CPUs.

19. Data Preprocessing: Consider preprocessing the image data to reduce its size or remove redundant information. Techniques like image compression, downscaling, or cropping can help decrease the storage and retrieval time without significantly impacting the search results.

20. Hybrid CPU-GPU Approach: Instead of relying solely on GPUs, you can consider a hybrid CPU-GPU approach where you offload specific computational tasks to the GPU while utilizing the CPU for other operations. This approach can leverage the strengths of both CPU and GPU to optimize performance.

21. Sampling and Indexing: If the search is still slow even with optimization techniques, you can consider sampling the dataset to create a smaller index of representative images. This can help speed up the search process, although at the cost of potentially missing some less-representative images.

22. Cloud-based Image Search Services: Explore cloud-based image search services that provide pre-built infrastructure and APIs for efficient image search. Companies like Google Cloud Vision, Amazon Rekognition, and Microsoft Azure Cognitive Services offer image recognition and search capabilities that can be leveraged.
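To make item 7 concrete (and, indirectly, the product quantization of item 6 and the incremental additions of item 8), here is a hedged Faiss sketch. The dataset is random and all parameters (nlist, m, nbits, nprobe) are illustrative, not tuned values.

```python
import numpy as np
import faiss

d = 4096                       # VGG16 fc-layer dimensionality
nlist, m, nbits = 128, 64, 8   # coarse cells, PQ subvectors, bits per subvector

# L2-normalised vectors make inner product equivalent to cosine similarity.
xb = np.random.rand(10_000, d).astype('float32')
faiss.normalize_L2(xb)

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                # learn the coarse centroids and PQ codebooks once
index.add(xb)                  # add vectors; later batches can be added without retraining

xq = np.random.rand(1, d).astype('float32')
faiss.normalize_L2(xq)
index.nprobe = 16              # number of coarse cells to visit per query
scores, ids = index.search(xq, 10)
print(ids[0])                  # positions of the 10 most similar stored vectors
```

The nprobe parameter trades speed for recall: visiting more cells per query is slower but closer to exact search.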

maabuft

Your entire approach is flawed, and you're just trying to force a square peg into a round hole. By using a neural network to extract features and then storing them in a database, you're creating a system that's overly complex and difficult to maintain. Instead, I'd recommend starting from scratch and exploring alternative approaches, such as using a single, monolithic neural network that can handle the entire dataset, or using a distributed computing framework like Apache Spark to process the data in parallel.