Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models


Hyeonho Jeong, Jong Chul Ye
KAIST
ICLR 2024
TL;DR: Edit Multiple Attributes of your video using pre-trained text-to-image models, Without Any Training.

"A rabbit is eating a watermelon on the table."


rabbit → rat



watermelon → pink moon




table → snow


rabbit → dog
watermelon → basketball
table → moss


rabbit → dog
watermelon → basketball
table → sand


rabbit → kangaroo
watermelon → avocado
table → snow


rabbit → squirrel
watermelon → orange
table → grass
+ under the aurora


rabbit → dog
watermelon → basketball
table → sand
+ at sunrise


rabbit → kangaroo
watermelon → avocado
table → snow
+ under the sky


+ Chinese painting


rabbit → dog
watermelon → basketball
+ Cezanne painting

Abstract

In this work, we introduce Ground-A-Video, a novel groundings-guided video-to-video translation framework. Recent endeavors in video editing have showcased promising results in single-attribute editing or style transfer tasks, either by training T2V models on text-video data or by adopting training-free methods. However, when confronted with the complexities of multi-attribute editing scenarios, they exhibit shortcomings such as omitting or overlooking intended attribute changes, modifying the wrong elements of the input video, or failing to preserve regions of the input video that should remain intact. Ground-A-Video attains temporally consistent multi-attribute editing of input videos in a training-free manner without the aforementioned shortcomings. Central to our method is the introduction of Cross-Frame Gated Attention, which incorporates grounding information into the latent representations in a temporally consistent fashion, along with Modulated Cross-Attention and optical flow guided inverted latents smoothing. Extensive experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit accuracy and frame consistency.
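For readers curious how grounding tokens can be injected in a temporally consistent way, the module below is only a minimal sketch of one plausible design for the Cross-Frame Gated Attention: a GLIGEN-style gated self-attention (visual and grounding tokens concatenated, with a tanh-gated residual) inflated so that tokens of all frames attend to each other. The class name, tensor shapes, and the exact set of frames attended to are our assumptions for illustration, not the paper's released implementation.

import torch
import torch.nn as nn

class CrossFrameGatedAttention(nn.Module):
    """Sketch: GLIGEN-style gated attention inflated across frames (assumed design)."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed, as in GLIGEN

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        # visual:    (B, F, N, D) spatial latent tokens per frame
        # grounding: (B, F, M, D) grounding tokens (box coordinates + phrase embeddings)
        b, f, n, d = visual.shape
        m = grounding.shape[2]
        # Concatenate per frame, then flatten frames so attention spans the whole clip,
        # which is what keeps the injected grounding temporally consistent.
        x = self.norm(torch.cat([visual, grounding], dim=2).reshape(b, f * (n + m), d))
        out, _ = self.attn(x, x, x, need_weights=False)
        out = out.reshape(b, f, n + m, d)[:, :, :n]     # keep only the visual tokens
        return visual + torch.tanh(self.gate) * out     # gated residual update

# Toy usage: 8 frames, a 16x16 latent grid (256 tokens), 3 grounded boxes.
layer = CrossFrameGatedAttention(dim=320)
edited = layer(torch.randn(1, 8, 256, 320), torch.randn(1, 8, 3, 320))  # (1, 8, 256, 320)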



Method



Given a series of input video frames, we automatically obtain video groundings via GLIP. Subsequently, the groundings and the source prompt are manually refined into target groundings and a target prompt, commonly following our Δτ (a list of semantic edits). On the other branch, the input frames undergo per-frame DDIM inversion and null-text optimization, followed by our proposed optical flow smoothing, to form the latent features. These latents are separately fed into the inflated Stable Diffusion backbone and ControlNet, both modified with a sequence of attentions to achieve temporally consistent video editing, with the latter taking depth maps as an additional input. Additionally, the target groundings are directed to the Cross-Frame Gated Attentions, while the target prompt conditioning and the frame-independently optimized null-embeddings are channeled into the Modulated Cross-Attentions. Finally, if a binary mask exists (the intersection of the outer regions of the target bounding boxes, i.e., the area outside every edited box), it is used for inpainting before each denoising step. This optional binary mask helps preserve regions that are not the target of editing.
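For orientation, the pseudocode below strings the branches above together end to end. It is a hypothetical sketch, not the released code: every helper callable (glip, edit_groundings, invert, flow_smooth, denoise_step) is a placeholder for the corresponding model or procedure, and their names, signatures, and the exact inpainting rule are assumptions.

from typing import Callable, Optional
import torch

def ground_a_video_edit(
    frames: torch.Tensor,                  # (F, C, H, W) input video frames
    target_prompt: str,                    # source prompt after applying the edits in Δτ
    glip: Callable,                        # frames -> per-frame (boxes, phrases)
    edit_groundings: Callable,             # source groundings -> target groundings (apply Δτ)
    invert: Callable,                      # per-frame DDIM inversion + null-text optimization
    flow_smooth: Callable,                 # optical flow guided smoothing of inverted latents
    denoise_step: Callable,                # one step of the inflated SD UNet + ControlNet
    depth_maps: torch.Tensor,              # (F, 1, H, W) depth condition for ControlNet
    preserve_mask: Optional[torch.Tensor] = None,  # latent-resolution mask, 1 outside every target box
    num_steps: int = 50,
) -> torch.Tensor:
    # 1) Automatic grounding of the source video, then refinement following Δτ.
    source_groundings = glip(frames)
    target_groundings = edit_groundings(source_groundings)

    # 2) Per-frame DDIM inversion with null-text optimization, then flow smoothing.
    #    `invert` is assumed to also return the optimized null-embeddings and the
    #    per-timestep source latents reused for inpainting.
    latents, null_embeddings, source_latents = invert(frames)
    latents = flow_smooth(latents, frames)

    # 3) Grounded, depth-controlled denoising over all frames jointly.
    for t in reversed(range(num_steps)):
        # Optional inpainting before each step: keep non-target regions from the source.
        if preserve_mask is not None:
            latents = preserve_mask * source_latents[t] + (1 - preserve_mask) * latents
        latents = denoise_step(
            latents, t,
            prompt=target_prompt,
            groundings=target_groundings,     # routed to the Cross-Frame Gated Attentions
            null_embeddings=null_embeddings,  # routed to the Modulated Cross-Attentions
            depth=depth_maps,                 # routed to the ControlNet branch
        )
    return latents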




Comparison with state-of-the-art approaches

"A cat is roaring." ⇨ "A dog is roaring on the beach, under the sky."
Input Video
Tune A Video w/ ControlNet
Control A Video
ControlVideo
Gen-1
Ground A Video (ours)


"A squirrel eating a carrot." ⇨ "A fennec fox eating a sausage on the dune, in desert."
Input Video
Tune A Video w/ ControlNet
Control A Video
ControlVideo
Gen-1
Ground A Video (ours)


"Brown bear walking on the rock, against a wall." ⇨ "Pink bear walking on snowy alpine, next to lake, against a blue wall."
Input Video
Tune A Video w/ ControlNet
Control A Video
ControlVideo
Gen-1
Ground A Video (ours)


"A silver jeep is driving down a curvy road in the countryside." ⇨ "An orange jeep is driving down a curvy road in the countryside, fireworks on the sky."
Input Video
Tune A Video w/ ControlNet
Control A Video
ControlVideo
Gen-1
Ground A Video (ours)


"A bird flying above lake in the forest." ⇨ "A crow flying above desert in the forest on fire."
Input Video
Tune A Video w/ ControlNet
Control A Video
Ground A Video (ours)


"A man is walking a dog on the road." ⇨ "Iron Man is walking a sheep on the lake."
Input Video
Gen-1
Control A Video
Ground A Video (ours)



From Single-Attribute to Multi-Attribute Editing

Input Video
cat → puppy
cat → tiger
cat → puppy
+ lake
cat → tiger
+ lawn
cat → dog
+ on the beach
+ under the sky
cat → tiger
+ on the snow
+ in the mountains
cat → white puppy
+ on the sand
+ in the dune
Input Video
bird → blue bird
bird → dragonfly
lake → emerald lake
bird → crow
lake → desert
forest → forest on fire
Input Video
swan → pink swan
bushes → white roses
swan → blue swan
bushes → snowy bushes
swan → blue swan
bushes → snowy bushes
lake → lagoon
Input Video
dog → lego dog
man → Homer Simpson
road → pond
road → ice
man → Iron Man
dog → sheep
man → Iron Man
dog → sheep
road → lake
dog → goat
man → farmer
road → snow road


Video Style Transfer

Input Video
+ Ukiyo-e art
Input Video
+ Monet style
Input Video
+ Cezanne style
Input Video
+ Starry night style

Video Style Transfer with Attribute Changes

Input Video
+ Chinese painting style
+ Van Gogh starry night style
+ Van Gogh starry night style
dog, man → tiger, Superman
+ Van Gogh starry night style
dog, man → cheetah, Superman

Text-to-Video Generation with Pose Control

Input Pose
Input Background
Stormtrooper is dancing on Mars.
Input Pose
Input Background
Iron Man is dancing on the sand.


BibTeX


@article{jeong2023ground,
  title={Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models},
  author={Jeong, Hyeonho and Ye, Jong Chul},
  journal={arXiv preprint arXiv:2310.01107},
  year={2023}
}