Vision, Language and Visual Retrieval

Completed 2017–2020

Multimodal methods connecting images and language: large-scale visual retrieval, semantic art understanding, and image caption generation.

This project sits at the intersection of computer vision and natural language processing, building multimodal systems that connect images with text and retrieve visual content at scale. The work spans semantic understanding, large-scale retrieval, and caption generation.

Representative directions include semantic art understanding (the SemArt dataset and Text2Art challenge, which move art analysis beyond style classification towards relating paintings to their textual commentary); asymmetric spatio-temporal embeddings for image-to-video retrieval, which learn features that match a still-image query against video collections; retrieval of fashion products from film and television footage; and deep models for automatically captioning news images, with applications ranging from multimedia management to accessibility for visually impaired users.

Collaborators

Related publications