MiVE: Multiscale Vision-language Features for Reference-guided Video Editing

Supplementary Videos


Task Definition

Reference-guided video editing takes three inputs: a source video, a text instruction, and an edited reference image (the first frame of the video, modified according to the instruction). The goal is to propagate the visual edits in the reference image to every frame of the video while preserving the original motion and all unedited content.

Complex Scene Comparisons


Simple Scene Comparisons
