TY - JOUR
T1 - Fast retrieval of multi-modal embeddings for e-commerce applications
AU - Abluton, Alessandro
AU - Ciarlo, Daniele
AU - Portinale, Luigi
PY - 2024
Y1 - 2024
N2 - In this paper, we introduce a retrieval framework designed for e-commerce applications, which employs a multi-modal approach to represent items of interest. This approach incorporates both textual descriptions and images of products, alongside a locality-sensitive hashing (LSH) indexing scheme for rapid retrieval of potentially relevant products. Our focus is on a data-independent methodology, where the indexing mechanism remains unaffected by the specific dataset, while the multi-modal representation is learned beforehand. Specifically, we utilize a multi-modal architecture, CLIP, to learn a latent representation of items by combining text and images in a contrastive manner. The resulting item embeddings encapsulate both the visual and textual information of the products and are then indexed with several variants of LSH to balance result quality against retrieval speed. We present the findings of our experiments conducted on two real-world datasets sourced from e-commerce platforms, comprising both product images and textual descriptions. Promising results were achieved, with favorable retrieval times and average precision. These results were obtained by testing the approach both with a specifically selected set of queries and with synthetic queries generated by a Large Language Model.
AB - In this paper, we introduce a retrieval framework designed for e-commerce applications, which employs a multi-modal approach to represent items of interest. This approach incorporates both textual descriptions and images of products, alongside a locality-sensitive hashing (LSH) indexing scheme for rapid retrieval of potentially relevant products. Our focus is on a data-independent methodology, where the indexing mechanism remains unaffected by the specific dataset, while the multi-modal representation is learned beforehand. Specifically, we utilize a multi-modal architecture, CLIP, to learn a latent representation of items by combining text and images in a contrastive manner. The resulting item embeddings encapsulate both the visual and textual information of the products and are then indexed with several variants of LSH to balance result quality against retrieval speed. We present the findings of our experiments conducted on two real-world datasets sourced from e-commerce platforms, comprising both product images and textual descriptions. Promising results were achieved, with favorable retrieval times and average precision. These results were obtained by testing the approach both with a specifically selected set of queries and with synthetic queries generated by a Large Language Model.
KW - Multi-modal embeddings
KW - e-commerce applications
KW - locality-sensitive hashing
UR - https://iris.uniupo.it/handle/11579/184422
U2 - 10.3233/kes-240006
DO - 10.3233/kes-240006
M3 - Article
SN - 1327-2314
VL - 28
SP - 765
EP - 779
JO - International Journal of Knowledge-Based and Intelligent Engineering Systems
JF - International Journal of Knowledge-Based and Intelligent Engineering Systems
IS - 4
ER -