Executive Summary: This study explores how to exploit the redundancy present in videos and how to leverage pretrained foundation models for efficient visual scene understanding. Redundant features, which arise from the high correlation of information across frames, are crucial for successful deep video models. However, they also entail wasted computation, so the researchers aim to reduce this cost by computing only a few non-redundant features; the remaining features will be derived from these via cheap operations. The division between heavy and cheap operations will be decided on the fly by a dynamic policy network.

The study also explores prompt learning towards resource-efficient AI models. With the rise of powerful pretrained deep learning models, standard finetuning on smaller datasets is prone to overfitting. In Natural Language Processing (NLP), prompt learning has been shown to help when labelled data is limited. Starting from a pretrained vision-language model, the researchers propose to learn task-relevant prompts, instead of relying on manual prompt engineering, for resource-efficient training of video scene understanding models. This approach introduces only a small number of trainable parameters while keeping the large vision-language model fixed. However, it still requires tediously annotated supervised training data in the form of video-action pairs. Contrastive semi-supervised learning will therefore be used to perform prompt tuning, leveraging a large corpus of unlabelled videos and only a handful of labelled examples. This approach not only improves performance but also makes models more cost- and resource-effective, moving a step closer to Green AI.
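To make the heavy/cheap split concrete, the following is a minimal PyTorch-style sketch of one way a per-frame policy could route computation, assuming pre-extracted frame embeddings. All module names (PolicyNet, RedundancyAwareVideoModel) and the hard 0.5 threshold are illustrative assumptions, not the authors' implementation; a real system would need a differentiable relaxation (e.g. Gumbel-Softmax) to train the policy end to end.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Tiny network that decides, per frame, whether a full (heavy) feature
    computation is needed or a cheap update of the cached feature suffices."""
    def __init__(self, dim):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frame_embed):
        # Probability that this frame carries enough new information to recompute.
        return torch.sigmoid(self.scorer(frame_embed))

class RedundancyAwareVideoModel(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.heavy = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.cheap = nn.Linear(dim, dim)   # lightweight correction reusing the cache
        self.policy = PolicyNet(dim)

    def forward(self, frames):             # frames: (T, dim) pre-extracted embeddings
        feats, cached = [], None
        for x in frames:
            recompute = (cached is None) or (self.policy(x) > 0.5)
            if recompute:
                cached = self.heavy(x)             # expensive, non-redundant feature
            else:
                cached = cached + self.cheap(x)    # cheap update from the cached feature
            feats.append(cached)
        return torch.stack(feats)
```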
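The prompt-learning idea can be illustrated with a CoOp-style sketch: a small set of learnable context vectors is prepended to the frozen class-name token embeddings, and classification is done by cosine similarity between image and text features. The names `class_token_embeds`, `PromptLearner`, and `clip_style_logits` are hypothetical placeholders rather than a specific library API; only the context vectors would be trained, keeping the large vision-language backbone fixed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, n_ctx=16, dim=512, n_classes=10):
        super().__init__()
        # Only these context vectors are trainable; the backbone stays frozen.
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        self.n_classes = n_classes

    def forward(self, class_token_embeds):
        # class_token_embeds: (n_classes, n_name_tokens, dim) frozen class-name embeddings
        ctx = self.ctx.unsqueeze(0).expand(self.n_classes, -1, -1)
        return torch.cat([ctx, class_token_embeds], dim=1)  # learned prompt + class name

def clip_style_logits(image_feats, text_feats, temperature=0.01):
    # Cosine-similarity classification over the prompt-conditioned text features.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature
```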
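Finally, the contrastive semi-supervised objective can be sketched as a standard cross-entropy term on the handful of labelled videos plus an InfoNCE-style contrastive term that pulls two augmented views of the same unlabelled video together. The exact loss used in the study is not specified in the summary; this only illustrates the general recipe, with `unsup_weight` as an assumed balancing hyperparameter.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    # z1, z2: (B, dim) features of two augmented views of the same unlabelled clips.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)          # matching view is the positive

def semi_supervised_loss(logits_labelled, labels, z1, z2, unsup_weight=1.0):
    sup = F.cross_entropy(logits_labelled, labels)   # few labelled video-action pairs
    unsup = contrastive_loss(z1, z2)                 # large corpus of unlabelled videos
    return sup + unsup_weight * unsup
```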