Davide Moltisanti

Research fellow at Nanyang Technological University, Singapore · davide.moltisanti@ntu.edu.sg

Hello all, I’m Davide, a research fellow at Nanyang Technological University (NTU), Singapore.
I joined NTU in March 2020, where I work with Prof. Chen Change Loy.
I completed my PhD in Computer Science at the University of Bristol (UK) in November 2019, supervised by Dr. Dima Damen.
I received my MSc in Computer Science from the University of Catania (Italy) in November 2013.

My area of research is Computer Vision.
I did my PhD on action recognition in videos (you can find my thesis here).


Publications

The EPIC-KITCHENS Dataset: Collection, Challenges and Baselines

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, Michael Wray

Since its introduction in 2018, EPIC-KITCHENS has attracted attention as the largest egocentric video benchmark, offering a unique viewpoint on people's interaction with objects, their attention, and even intention. In this paper, we detail how this large-scale dataset was captured by 32 participants in their native kitchen environments, and densely annotated with actions and object interactions. Our videos depict non-scripted daily activities, as recording was started every time a participant entered their kitchen. Recording took place in 4 countries by participants belonging to 10 different nationalities, resulting in highly diverse kitchen habits and cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.2K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos after recording, thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens. We introduce new baselines that highlight the multimodal nature of the dataset and the importance of explicit temporal modelling to discriminate fine-grained actions, e.g. 'closing a tap' from 'opening' it.

Project webpage | Download paper

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) - 2020

Action Recognition from Single Timestamp Supervision in Untrimmed Videos

Davide Moltisanti, Sanja Fidler, Dima Damen

Recognising actions in videos relies on labelled supervision during training, typically the start and end times of each action instance. This supervision is not only subjective, but also expensive to acquire. Weak video-level supervision has been successfully exploited for recognition in untrimmed videos; however, it is challenged when the number of different actions in training videos increases. We propose a method that is supervised by single timestamps located around each action instance in untrimmed videos. We replace expensive action bounds with sampling distributions initialised from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments.
We evaluate our method on three datasets for fine-grained recognition, with an increasing number of different actions per video, and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision. Our update method improves top-1 test accuracy by up to 5.4% across the evaluated datasets.
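To give a rough flavour of the idea (a simplified toy sketch, not the exact formulation in the paper, which uses a different parametric distribution and a real classifier), the snippet below represents each action by a sampling distribution centred on its single timestamp, draws training frames from it, and nudges the centre and width towards the frames a (here simulated) classifier scores highest. All names and numbers are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frames(center, width, n_frames, video_len, rng):
    """Draw candidate training frames from a Gaussian centred on the timestamp."""
    frames = rng.normal(loc=center, scale=width, size=n_frames)
    return np.clip(np.round(frames), 0, video_len - 1).astype(int)

def update_distribution(center, width, frames, scores, lr=0.5):
    """Move the sampling distribution towards frames the classifier is confident about."""
    weights = scores / (scores.sum() + 1e-8)
    new_center = (1 - lr) * center + lr * np.sum(weights * frames)
    spread = np.sqrt(np.sum(weights * (frames - new_center) ** 2) + 1e-8)
    new_width = (1 - lr) * width + lr * spread
    return new_center, max(new_width, 1.0)

# Toy example: one untrimmed video, a single timestamp annotation at frame 120.
video_len, timestamp = 300, 120
center, width = float(timestamp), 30.0
for it in range(5):
    frames = sample_frames(center, width, n_frames=16, video_len=video_len, rng=rng)
    # Stand-in for classifier confidence: pretend the true action peaks around frame 125.
    scores = np.exp(-((frames - 125) ** 2) / (2 * 15.0 ** 2))
    center, width = update_distribution(center, width, frames, scores)
    print(f"iter {it}: center={center:.1f}, width={width:.1f}")
```

With this kind of update the distribution drifts towards the frames that look most discriminative, which is the intuition behind replacing full start/end annotations with single timestamps.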

Project webpage | Download paper

Conference on Computer Vision and Pattern Recognition (CVPR) - 2019

Scaling Egocentric Vision: The EPIC-Kitchens Dataset

Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, William Price, Michael Wray

First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-Kitchens, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits.

Dataset webpage | Download paper

European Conference on Computer Vision (ECCV) - 2018

Trespassing the Boundaries: Labelling Temporal Bounds for Object Interactions in Egocentric Video

Davide Moltisanti, Michael Wray, Walterio Mayol-Cuevas, Dima Damen

Manual annotations of temporal bounds for object interactions (i.e. start and end times) are typical training input to recognition, localisation and detection algorithms. For three publicly available egocentric datasets, we uncover inconsistencies in ground truth temporal bounds within and across annotators and datasets. We systematically assess the robustness of state-of-the-art approaches to changes in labelled temporal bounds, for object interaction recognition. As boundaries are trespassed, a drop of up to 10% is observed for both Improved Dense Trajectories and Two-Stream Convolutional Neural Network. We demonstrate that such disagreement stems from a limited understanding of the distinct phases of an action, and propose annotating based on the Rubicon Boundaries, inspired by a similarly named cognitive model, for consistent temporal bounds of object interactions. Evaluated on a public dataset, we report a 4% increase in overall accuracy, and an increase in accuracy for 55% of classes when Rubicon Boundaries are used for temporal annotations.
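As an illustration of the kind of robustness analysis described above (a simplified stand-in, not the paper's exact protocol), the sketch below jitters labelled start/end times and measures how much the perturbed segments still overlap the originals. Segment values and shift magnitudes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_bounds(start, end, max_shift, rng):
    """Perturb the labelled temporal bounds of an action segment (in seconds)."""
    s = start + rng.uniform(-max_shift, max_shift)
    e = end + rng.uniform(-max_shift, max_shift)
    s, e = min(s, e), max(s, e)
    return max(s, 0.0), e

def overlap(a_start, a_end, b_start, b_end):
    """Temporal intersection-over-union between two segments."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0

# Toy ground-truth segments (start, end) in seconds.
segments = [(3.0, 5.5), (10.2, 12.0), (20.0, 21.5)]
for max_shift in (0.5, 1.0, 2.0):
    ious = [overlap(s, e, *jitter_bounds(s, e, max_shift, rng)) for s, e in segments]
    print(f"max shift {max_shift:.1f}s -> mean IoU with original bounds: {np.mean(ious):.2f}")
```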

Project webpage | Download paper

International Conference on Computer Vision (ICCV) - 2017

SEMBED: Semantic Embedding of Egocentric Action Videos

Michael Wray*, Davide Moltisanti*, Walterio Mayol-Cuevas, Dima Damen
(*equal contribution)

We present SEMBED, an approach for embedding an egocentric object interaction video in a semantic-visual graph to estimate the probability distribution over its potential semantic labels. When object interactions are annotated using unbounded choice of verbs, we embrace the wealth and ambiguity of these labels by capturing the semantic relationships as well as the visual similarities over motion and appearance features. We show how SEMBED can interpret a challenging dataset of 1225 freely annotated egocentric videos, outperforming SVM classification by more than 5%.
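To sketch the general idea (a toy example, not SEMBED's actual graph construction or features), the snippet below combines visual nearest neighbours with a hand-crafted verb relatedness matrix to output a distribution over labels rather than a single hard prediction. The features, verbs, and similarity values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: visual features (stand-ins for motion/appearance descriptors)
# and free-form verb labels.
train_feats = rng.normal(size=(6, 8))
train_labels = ["open", "open", "turn-on", "pull", "close", "close"]

# Hand-crafted semantic relatedness between verbs (stand-in for a lexical similarity measure).
verbs = ["open", "turn-on", "pull", "close"]
semantic_sim = np.array([
    [1.0, 0.6, 0.5, 0.2],
    [0.6, 1.0, 0.3, 0.2],
    [0.5, 0.3, 1.0, 0.3],
    [0.2, 0.2, 0.3, 1.0],
])
verb_idx = {v: i for i, v in enumerate(verbs)}

def label_distribution(query_feat, k=3):
    """Estimate a probability distribution over verbs for a query video.

    Visual similarity selects the k nearest training videos; their votes are
    then spread over semantically related verbs instead of a single hard label.
    """
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    visual_sim = np.exp(-dists)          # soft visual similarity
    neighbours = np.argsort(dists)[:k]   # neighbourhood of the query
    scores = np.zeros(len(verbs))
    for n in neighbours:
        scores += visual_sim[n] * semantic_sim[verb_idx[train_labels[n]]]
    return scores / scores.sum()

query = rng.normal(size=8)
probs = label_distribution(query)
print({v: round(float(p), 2) for v, p in zip(verbs, probs)})
```

Returning a distribution rather than a single verb is what lets ambiguous, freely annotated labels (e.g. 'open' vs 'turn-on') share probability mass instead of being treated as unrelated classes.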

Project webpage | Download paper

European Conference on Computer Vision Workshops (ECCVW) - 2016

Monitoring Accropodes Breakwaters Using RGB-D Cameras

Davide Moltisanti, Giovanni Maria Farinella, Rosaria Ester Musumeci, Enrico Foti, Sebastiano Battiato
International Conference on Computer Vision Theory and Applications (VISAPP) - 2015

Web Scraping of Online Newspapers via Image Matching

Davide Moltisanti, Giovanni Maria Farinella, Sebastiano Battiato, Giovanni Giuffrida
European Consortium for Mathematics in Industry (ECMI) - 2014

Interests

In my free time I love taking pictures, playing the drums and the guitar, watching films and reading (I am a big National Geographic fan)

I am also a cycling and hiking enthusiast

My photo gallery