Multimodal representation learning for tourism recommendation with two-tower architecture

Resource type
Journal Article
Authors/contributors
Cui, Y.; Liang, S.; Zhang, Y.
Title
Multimodal representation learning for tourism recommendation with two-tower architecture
Abstract
Personalized recommendation plays an important role in many online services. In tourism recommendation, tourist attractions carry rich context and content information; these implicit features include not only text but also images and videos. To make better use of these features, researchers typically introduce richer feature information or more efficient feature representation methods, but introducing large amounts of feature information without restriction inevitably degrades the performance of the recommendation system. We propose a novel heterogeneous multimodal representation learning method for tourism recommendation. The proposed model is based on a two-tower architecture in which the item tower handles multimodal latent features: a Bidirectional Long Short-Term Memory (Bi-LSTM) network extracts the text features of items, an External Attention Transformer (EANet) extracts their image features, and these feature vectors are concatenated with item IDs to enrich the item representation. To increase the expressiveness of the model, we introduce a deep fully connected stack layer that fuses the multimodal feature vectors and captures the hidden relationships among them. Tested on three different datasets, our model outperforms the baseline models in NDCG and precision.
Publication
PLOS ONE
Volume
19
Issue
2
Date
2024-02-23
Language
en
ISSN
1932-6203
Accessed
11/11/25, 9:02 AM
Library Catalog
dspace.usj.edu.mo
Extra
Publisher: Public Library of Science (PLoS)
Citation
Cui, Y., Liang, S., & Zhang, Y. (2024). Multimodal representation learning for tourism recommendation with two-tower architecture. PLOS ONE, 19(2), e0299370. https://doi.org/10.1371/journal.pone.0299370
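
Note: The abstract describes the item tower's design (Bi-LSTM text features, EANet image features, concatenation with item IDs, and a deep fully connected fusion stack). Below is a minimal PyTorch sketch of that two-tower layout. All layer sizes, the simplified external-attention block, the mean-pooling choices, and the dot-product scoring are illustrative assumptions, not the paper's exact implementation.

# Minimal sketch of the two-tower layout described in the abstract.
# Dimensions, pooling, and scoring are assumptions for illustration.
import torch
import torch.nn as nn


class ExternalAttention(nn.Module):
    # Simplified external attention: two small learnable memories shared
    # across all samples (softmax over patches, then L1 norm over memory).
    def __init__(self, d_model, mem_size=64):
        super().__init__()
        self.mk = nn.Linear(d_model, mem_size, bias=False)
        self.mv = nn.Linear(mem_size, d_model, bias=False)

    def forward(self, x):                              # x: (B, patches, d_model)
        attn = torch.softmax(self.mk(x), dim=1)        # normalize over patches
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)
        return self.mv(attn)                           # (B, patches, d_model)


class ItemTower(nn.Module):
    def __init__(self, n_items, vocab_size, txt_dim=128, img_dim=128,
                 id_dim=64, out_dim=64):
        super().__init__()
        self.id_emb = nn.Embedding(n_items, id_dim)
        self.word_emb = nn.Embedding(vocab_size, txt_dim, padding_idx=0)
        # Bi-LSTM over the item's text description (output dim = txt_dim).
        self.bilstm = nn.LSTM(txt_dim, txt_dim // 2,
                              batch_first=True, bidirectional=True)
        # External-attention block over pre-extracted image patch features.
        self.ea = ExternalAttention(img_dim)
        # Deep fully connected stack fusing ID + text + image vectors.
        self.fuse = nn.Sequential(
            nn.Linear(id_dim + txt_dim + img_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, item_ids, text_tokens, img_patches):
        txt, _ = self.bilstm(self.word_emb(text_tokens))  # (B, T, txt_dim)
        txt = txt.mean(dim=1)                             # pool over tokens
        img = self.ea(img_patches).mean(dim=1)            # pool over patches
        return self.fuse(
            torch.cat([self.id_emb(item_ids), txt, img], dim=-1))


class TwoTowerRec(nn.Module):
    def __init__(self, n_users, n_items, vocab_size, out_dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, 128)
        self.user_mlp = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, out_dim))
        self.item_tower = ItemTower(n_items, vocab_size, out_dim=out_dim)

    def forward(self, user_ids, item_ids, text_tokens, img_patches):
        u = self.user_mlp(self.user_emb(user_ids))
        v = self.item_tower(item_ids, text_tokens, img_patches)
        return (u * v).sum(dim=-1)  # dot-product relevance score


# Usage with hypothetical sizes: 8 users scored against 8 items, each item
# carrying 20 text tokens and 49 (e.g. 7x7) image patch feature vectors.
model = TwoTowerRec(n_users=1000, n_items=500, vocab_size=5000)
scores = model(torch.randint(0, 1000, (8,)),
               torch.randint(0, 500, (8,)),
               torch.randint(0, 5000, (8, 20)),
               torch.randn(8, 49, 128))

A practical property of this two-tower shape is that the user and item towers are computed independently, so item vectors can be precomputed and indexed for fast retrieval at serving time, with scores over sampled negatives feeding a softmax or pairwise ranking loss during training.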