ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

1University of Illinois Urbana-Champaign 2Adobe
CVPR 2025

Abstract

Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions, where the same character performs distinct activities across the same or different backgrounds. To address this limitation, we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach generates a multi-shot video as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins, and a local attention masking strategy that controls the transition token's effect and enables shot-specific prompting. To obtain training data, we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for only a few thousand iterations is enough to generate multi-shot videos with shot-specific control, outperforming the baselines.
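
As a rough illustration of the shot-specific conditioning described above, the sketch below shows one way n-1 learnable transition tokens could be appended to per-shot text embeddings to form a single conditioning sequence. The module name TransitionTokens and the arguments max_shots and dim are illustrative assumptions made for this sketch, not names from the paper or its implementation.

import torch
import torch.nn as nn

class TransitionTokens(nn.Module):
    """Minimal sketch: n-1 learnable transition-token embeddings appended to the
    per-shot text embeddings of an n-shot prompt (illustrative, not the released code)."""
    def __init__(self, max_shots: int, dim: int):
        super().__init__()
        # One learnable embedding per possible shot boundary.
        self.tokens = nn.Parameter(torch.randn(max_shots - 1, dim) * 0.02)

    def forward(self, shot_text_embeds):
        # shot_text_embeds: list of n tensors, each [L_i, dim] (one prompt per shot).
        n = len(shot_text_embeds)
        trans = self.tokens[: n - 1]  # use the first n-1 transition tokens
        return torch.cat(list(shot_text_embeds) + [trans], dim=0)

# Hypothetical usage for a 2-shot prompt with 4096-dimensional text embeddings:
# cond = TransitionTokens(max_shots=4, dim=4096)
# seq = cond([torch.randn(20, 4096), torch.randn(18, 4096)])  # shape [20 + 18 + 1, 4096]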

Methodology

Figure: Model overview.

(a) ShotAdapter fine-tunes a pre-trained T2V model by incorporating "transition tokens" (highlighted in light blue). For an n-shot video with shot-specific prompts, we insert n-1 transition tokens, initialized as learnable parameters, and feed the resulting sequence through the pre-trained T2V model. (b) The model processes the concatenated input token sequence, guided by a "local attention mask", through joint attention layers within DiT blocks. (c) The local attention mask is structured so that transition tokens interact only with the visual tokens of the frames where transitions occur, while each textual token interacts exclusively with the visual tokens of its corresponding shot.
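
To make the masking rules in (c) concrete, the following sketch builds a boolean local attention mask under an assumed token layout of [per-shot text tokens | transition tokens | visual tokens]; the function and argument names (build_local_attention_mask, text_lens, vis_lens, boundary_vis_idx) are our own for illustration and do not come from the official code.

import torch

def build_local_attention_mask(text_lens, vis_lens, boundary_vis_idx):
    # text_lens        : per-shot text-token counts, length n
    # vis_lens         : per-shot visual-token counts, length n
    # boundary_vis_idx : n-1 lists; boundary_vis_idx[k] holds the indices (within the
    #                    flattened visual block) of the frame tokens where transition k occurs
    # Returns M of shape [L, L]; M[q, k] = True means query q may attend to key k.
    n = len(text_lens)
    n_text, n_trans, n_vis = sum(text_lens), n - 1, sum(vis_lens)
    L = n_text + n_trans + n_vis
    vis_start = n_text + n_trans
    mask = torch.zeros(L, L, dtype=torch.bool)

    # 1) Full attention among all visual tokens, across every frame of every shot.
    mask[vis_start:, vis_start:] = True

    # 2) Each shot's text tokens attend only to themselves and to that shot's visual tokens.
    t_off, v_off = 0, vis_start
    for t_len, v_len in zip(text_lens, vis_lens):
        t_sl, v_sl = slice(t_off, t_off + t_len), slice(v_off, v_off + v_len)
        mask[t_sl, t_sl] = True   # text self-attention within the shot prompt
        mask[t_sl, v_sl] = True   # text -> its own shot's visual tokens
        mask[v_sl, t_sl] = True   # visual -> its own shot's text tokens
        t_off, v_off = t_off + t_len, v_off + v_len

    # 3) Transition token k interacts only with the visual tokens of the frames
    #    where transition k occurs (plus itself).
    for k, idx in enumerate(boundary_vis_idx):
        tok = n_text + k
        mask[tok, tok] = True
        for j in idx:
            mask[tok, vis_start + j] = True
            mask[vis_start + j, tok] = True
    return mask

In a DiT-style joint attention layer, a mask like this would typically be converted to an additive bias (0 where allowed, negative infinity where blocked) and added to the attention logits before the softmax.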

Dataset Collection

Figure: Dataset collection pipeline.

A high-level overview of this pipeline is presented in (a). Our first method (gray box in (b)) samples videos with large motion, randomly splits each of them into n shots of varied duration, and concatenates the shots into a multi-shot video. Our second method (yellow box in (b)) randomly samples n videos from pre-clustered groups containing videos of the same identity and concatenates them to form a multi-shot video. Finally, we post-process the multi-shot videos (c) to ensure identity consistency and obtain shot-specific captions using LLaVA-NeXT.
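
As an illustration of the first strategy, the sketch below cuts one single-shot video tensor into n shots of varied duration and returns the resulting shot boundaries; the function name, the [T, C, H, W] frame layout, and the minimum shot length are assumptions made for this example rather than details taken from the paper.

import torch

def split_into_shots(video, n_shots, min_len=16, generator=None):
    # video: [T, C, H, W] tensor of a single-shot clip with large motion.
    # Randomly split it into n_shots consecutive segments, each at least min_len frames,
    # to be treated as the shots of one multi-shot training video.
    T = video.shape[0]
    assert T >= n_shots * min_len, "video too short for the requested number of shots"

    # Distribute the spare frames (beyond the per-shot minimum) via sorted random cut points.
    free = T - n_shots * min_len
    cuts = torch.sort(torch.randint(0, free + 1, (n_shots - 1,), generator=generator)).values
    lengths, prev = [], 0
    for c in cuts.tolist() + [free]:
        lengths.append(min_len + c - prev)
        prev = c

    shots, start = [], 0
    for ln in lengths:
        shots.append(video[start:start + ln])
        start += ln

    # Cumulative lengths mark where a new shot begins, i.e. where transition tokens act.
    boundaries = torch.cumsum(torch.tensor(lengths), dim=0)[:-1]
    return shots, boundaries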

Qualitative Results

Here we include the complete videos of the examples shown in Figure 1 (teaser) and Figure 5 (qualitative results) of the main paper, along with additional multi-shot video and text pairs grouped by the number of shots generated. You can find examples where background consistency is maintained across shots (e.g., the generated 4-shot video), as well as examples where the background changes between shots (e.g., the generated 3-shot video). Generated 2-, 3-, and 4-shot video results with ShotAdapter are shown below; each row displays one generated multi-shot video, and each shot is displayed separately in the columns following the first column. For more results, please refer to the supplementary material.

Generated 2-shot Video
Shot-1 Prompt: "a young girl paints at an easel in her bedroom"
Shot-2 Prompt: "she then reads a comic book in her bed"

Generated 3-shot Video
Shot-1 Prompt: "a man sketches in a notebook at a quiet cafe, his hand moving quickly across the page"
Shot-2 Prompt: "he pauses, looking up thoughtfully before continuing his drawing"
Shot-3 Prompt: "later, the man steps outside, his notebook tucked under his arm as he takes in the city around him"

Generated 4-shot Video
Shot-1 Prompt: "scientist in lab coat examines a specimen"
Shot-2 Prompt: "she writes notes on a clipboard"
Shot-3 Prompt: "she adjusts dials on a machine"
Shot-4 Prompt: "she pours a liquid into a beaker"

Comparison

Shot-1 Prompt: "a man reads a book under tree"
Shot-2 Prompt: "a man walks from the forest towards lake"
Methods compared: ShotAdapter (ours), MEVG [1], FreeNoise [2], Gen-L-Video [3], SEINE [4]

BibTeX

@inproceedings{kara2025shotadapter,
  title={ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models},
  author={Ozgur Kara and Krishna Kumar Singh and Feng Liu and Duygu Ceylan and James M. Rehg and Tobias Hinz},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

[1] Oh, G., Jeong, J., Kim, S., Byeon, W., Kim, J., Kim, S., and Kim, S. MEVG: Multi-event Video Generation with Text-to-Video Models. In European Conference on Computer Vision (ECCV), pp. 401-418, 2024.

[2] Qiu, H., Xia, M., Zhang, Y., He, Y., Wang, X., Shan, Y., and Liu, Z. FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling. In The Twelfth International Conference on Learning Representations (ICLR), 2024. https://openreview.net/forum?id=ijoqFqSC7p

[3] Wang, F. Y., Chen, W., Song, G., Ye, H. J., Liu, Y., and Li, H. Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising. arXiv preprint arXiv:2305.18264, 2023.

[4] Chen, X., Wang, Y., Zhang, L., Zhuang, S., Ma, X., Yu, J., ... and Liu, Z. SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction. In The Twelfth International Conference on Learning Representations (ICLR), 2024.