Learning Video Representations from Large Language Models
Yue Zhao, Ishan Misra, Philipp Krähenbühl, Rohit Girdhar
CVPR 2023 (Highlight, acceptance rate≈2.5%)
arxiv | bibtex | colab | 🤗 demo | website
LaViLa (Language augmented Video Language Pretraining) is a new approach to learning video representations from Large Language Models (LLMs). We repurpose LLMs to be visually conditioned "Narrators", and use them to automatically generate video-language paired data. We then use this data to learn a video-language representation, outperforming prior work by large margins.
Sample Generations:
Video | Generation 1 | Generation 2 |
---|---|---|
(video) | so now we're going to slice the bread | now i'm going to do is just slice this up into a nice chunk and then we're going to place it on the plate |
Try out our Narrator to generate text descriptions for your own videos! You can also try out a web demo here:
The resulting video-language model sets a new state-of-the-art on a number of popular video tasks!
LaViLa leverages Large Language Models (LLMs) as "NARRATOR"s (and "REPHRASER"s) to densely narrate long videos, and uses these narrations to train strong dual-encoder models.
See INSTALL.md to install this code.
NARRATOR is a visually conditioned LLM that takes video frames as input and pseudo-labels the clip with narrations.
We provide some generated samples by our NARRATOR:
Run the narrator demo using Colab (no GPU needed):
or on the web using 🤗 Spaces: (thanks to @nateraw!)
Since a free Colab account offers very limited RAM, please run ./demo_narrator.py locally if you'd like to run the demo with a larger model. For more technical details, please refer to Sec 4.1 in our paper.
# CPU mode
python demo_narrator.py [--video-path $TEST_VIDEO]
# GPU mode
python demo_narrator.py --cuda
Our narrator also works on third-person videos! Below are several examples generated by our NARRATOR, which is pre-trained on HowTo100M Auto-Aligned (HTM-AA) and applied to some stock footage video clips. Note that since the text corpus in HowTo100M consists of ASR transcriptions, the style of narration is slightly different from that of ground-truth captions. However, the generated results are generally reasonable.
Below is a demo for 3rd-person videos.
python demo_narrator_3rd_person.py [--video-path $TEST_VIDEO] [--cuda]
The dual-encoder model contains a video encoder and a text encoder. It learns a video-language representation from both human annotations and generated narrations using a CLIP-style contrastive loss.
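As a rough illustration of what a CLIP-style contrastive loss computes, here is a minimal pure-Python sketch. It is not the training code in this repository: the function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration only. Each video embedding is scored against every text embedding in the batch, and the loss rewards the matching pair (the diagonal) in both the video-to-text and text-to-video directions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    video_emb[i] and text_emb[i] are assumed to describe the same clip.
    The temperature default is an illustrative assumption.
    """
    v = [l2_normalize(x) for x in video_emb]
    t = [l2_normalize(x) for x in text_emb]
    # logits[i][j]: similarity of video i and text j, scaled by temperature
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy batch: two clips with 2-d embeddings (hypothetical values).
video = [[1.0, 0.0], [0.0, 1.0]]
loss_aligned = clip_style_loss(video, video)        # texts match their videos
loss_swapped = clip_style_loss(video, video[::-1])  # texts are swapped
```

With aligned pairs the loss is close to zero, while swapping the texts makes it large, which is the signal that pulls matching video and text embeddings together during training.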
LaViLa's dual-encoder achieves excellent zero-shot performance on a wide range of egocentric benchmarks, outperforming previous state-of-the-art video-language pretraining methods by a large margin.
Method | Backbone | EK-100 MIR avg. mAP^ | EK-100 MIR avg. nDCG^ | Charades-Ego mAP | EGTEA mean acc. | EgoMCQ intra-video acc. |
---|---|---|---|---|---|---|
Prev. SOTA^^ | TSF-B | 22.1/23.3 | 22.1/27.9 | 25.2 | 17.6 | 57.2 |
LAVILA | TSF-B | 29.7/30.9 | 31.5/32.0 | 26.8 | 28.9 | 59.9 |
LAVILA | TSF-L | 35.0/36.1 | 34.2/34.6 | 28.9 | 34.1 | 63.1 |

^ The two numbers are obtained by using different numbers of frames as input (4-frame and 16-frame).
^^ We use the checkpoints released by EgoVLP and convert them to be compatible with this codebase. Also note that our reproduced numbers are better than the reported numbers, especially on EK-100 MIR since we evaluate on raw videos directly (for more details, check out Appendix F & Table 10 in our paper).
For details on how to get the numbers, please refer to MODEL_ZOO.md.
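For readers unfamiliar with how a dual-encoder is used zero-shot, the sketch below shows the general idea under stated assumptions: each candidate label is embedded as text, the clip is embedded as video, and the label with the highest cosine similarity wins. The helper names and the toy 3-d embeddings are hypothetical stand-ins for real encoder outputs; the actual evaluation procedure is described in MODEL_ZOO.md.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(video_emb, class_text_embs):
    """Return the best-matching label and the per-label similarity scores."""
    scores = {label: cosine(video_emb, emb)
              for label, emb in class_text_embs.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical embeddings standing in for text-encoder and video-encoder outputs.
class_text_embs = {
    "cut bread": [0.9, 0.1, 0.0],
    "knead dough": [0.1, 0.9, 0.2],
}
label, scores = zero_shot_classify([0.8, 0.2, 0.1], class_text_embs)
print(label)
```

No task-specific training is involved in this step, which is why the quality of the shared video-text embedding space determines zero-shot accuracy.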
Once fine-tuned on a downstream dataset, LaViLa's dual-encoder also achieves state-of-the-art results on it. We show some key results as follows.
Method | EK-100 MIR avg. mAP | EK-100 MIR avg. nDCG | EK-100 CLS Action top-1 | Charades-Ego mAP | EGTEA mean acc. |
---|---|---|---|---|---|
Prev. SOTA | 45.0 | 59.4 | 50.5 | 32.1 | 65.9 |
LAVILA | 50.9 | 66.5 | 50.9 | 36.1 | 76.0 |

For details on how to fine-tune the pre-trained dual-encoder on downstream datasets, please refer to MODEL_ZOO.md.
The majority of LAVILA is licensed under the MIT License; however, portions of the project are available under separate license terms:
- https://github.com/EGO4D/episodic-memory is licensed under the MIT license.
- The videos of cutting a loaf, kneading a dough, and preparing a sauce in a blender are licensed under the Mixkit Stock Video Free License.
@inproceedings{zhao2023lavila,
title={Learning Video Representations from Large Language Models},
author={Zhao, Yue and Misra, Ishan and Kr{\"a}henb{\"u}hl, Philipp and Girdhar, Rohit},
booktitle={CVPR},
year={2023}
}