Collaborators: Seung Wook Kim, Sanja Fidler, Anyi Rao
Modality translation is an application of machine learning that transforms data between modalities (e.g., image and text) while preserving the content. It has become popular recently due to its many applications in multimedia and accessibility technologies. Current approaches often require large amounts of paired data, which is non-trivial to collect, and typically learn translation in only one direction. This work proposes a model that leverages readily available unpaired (unimodal) datasets together with only a small amount of paired data to learn cross-modal generation.
The proposed method has generative models in both the image and text modalities, and is able to generate both modalities from a single shared representation. Crucially, we propose to use a memory module in the model to learn a latent space represented by multiple discrete indices, while modality-specific information is modelled with a continuous latent space, as shown in the figure. The final model on CUB uses only 4 memory blocks, each with 16 dictionary vectors, which condense the entire dataset into 16 bits of information (16^4 = 2^16).
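To make the discrete bottleneck concrete, here is a minimal sketch of how such a memory module could quantize a continuous feature: 4 blocks, each snapping its chunk of the feature to the nearest of 16 dictionary vectors, yielding 4 indices (4 × log2(16) = 16 bits). All names, dimensions, and the nearest-neighbour lookup are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Illustrative sketch (not the authors' code): a discrete memory module
# with 4 blocks, each holding a dictionary of 16 vectors. A continuous
# feature is split into 4 chunks; each chunk is replaced by its nearest
# dictionary entry, so the shared code is 4 indices = 16 bits total.

rng = np.random.default_rng(0)

NUM_BLOCKS = 4   # memory blocks
NUM_CODES = 16   # dictionary vectors per block
CODE_DIM = 8     # dimensionality of each chunk (arbitrary choice here)

# One dictionary (codebook) per block; in practice these are learned.
dictionaries = rng.normal(size=(NUM_BLOCKS, NUM_CODES, CODE_DIM))

def quantize(features):
    """Map a continuous vector of shape (NUM_BLOCKS * CODE_DIM,) to
    discrete indices and the corresponding quantized reconstruction."""
    chunks = features.reshape(NUM_BLOCKS, CODE_DIM)
    indices = np.empty(NUM_BLOCKS, dtype=int)
    quantized = np.empty_like(chunks)
    for b in range(NUM_BLOCKS):
        # Nearest dictionary vector by Euclidean distance.
        dists = np.linalg.norm(dictionaries[b] - chunks[b], axis=1)
        indices[b] = int(np.argmin(dists))
        quantized[b] = dictionaries[b, indices[b]]
    return indices, quantized.reshape(-1)

x = rng.normal(size=NUM_BLOCKS * CODE_DIM)
idx, xq = quantize(x)
print(idx)  # 4 indices, each in [0, 16): a 16-bit shared representation
```

Both modalities' decoders would then condition on the same 4 indices, while a separate continuous latent carries modality-specific style.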
We evaluate the proposed model on standard benchmarks on two datasets. We demonstrate that the memory module improves image-to-text translation performance (red line above the blue line in the figure above), and that the model can learn translations from a small amount of paired data while generating outputs with diverse styles. Furthermore, we show that the memory module learns a meaningful shared representation that aligns the two modalities. This work contributes to the larger body of research in multimodal representation learning by demonstrating the effectiveness of a discrete multimodal representation.