Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

1University of Chinese Academy of Sciences,  2Alibaba Group
3Institute of Automation, Chinese Academy of Sciences
4Fudan University,  5Nanyang Technological University
(* Work was done during internship at Alibaba Group. † Corresponding author.)

Figure 1: Comparison of visual quality and efficiency (measured by latency) with competing methods.

Abstract

As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs with the timestep embeddings so that their differences better approximate those of the model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and uses them to decide when to cache outputs. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% VBench score) degradation of visual quality.

Timestep Embedding Aware Cache (TeaCache)

Motivation

Outputs of a diffusion model are similar between consecutive denoising timesteps, and previous methods reduce this redundancy by caching model outputs at uniformly spaced timesteps. However, the output difference between consecutive timesteps varies, so a uniform caching strategy lacks the flexibility to maximize cache utilization. A better strategy is to reuse the cached output more frequently when the difference between the cached output and the current output is small. Unfortunately, this difference cannot be known before the current output is computed. To overcome this challenge, TeaCache leverages the following prior: there exists a strong correlation between a model's inputs and outputs.


Figure 2: Timestep embedding modulates the magnitude of block input and output, and thus has the potential to indicate the variation of the output.

Figure 3: Visualization of input differences and output differences at consecutive timesteps of Open-Sora, Latte, and Open-Sora-Plan.

Implementation

Our study reveals two key observations of attention mechanisms in video diffusion transformers. Firstly, the differences of outputs exhibit distinct patterns across models: in Open-Sora, the pattern forms a 'U' shape, whereas in Latte and Open-Sora-Plan it resembles a horizontally flipped 'L'. Additionally, Open-Sora-Plan features multiple peaks because its scheduler samples certain timesteps twice. Secondly, the noisy input changes minimally across consecutive timesteps and shows little correlation with the model output. In contrast, both the timestep embedding and the timestep-embedding-modulated noisy input demonstrate a strong correlation with the model output across multiple models.
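
The correlations in Figure 3 can be quantified with the relative L1 distance between consecutive timesteps. Below is a minimal sketch of such a measurement; the helper names, tensor shapes, and random placeholder data are our assumptions for illustration, not the authors' measurement code:

```python
import numpy as np
import torch

def relative_l1(prev: torch.Tensor, curr: torch.Tensor) -> float:
    """Relative L1 distance between tensors at consecutive timesteps."""
    return ((curr - prev).abs().mean() / prev.abs().mean()).item()

def trajectory_diffs(tensors: list) -> np.ndarray:
    """Per-step relative L1 distances along a denoising trajectory."""
    return np.array([relative_l1(a, b) for a, b in zip(tensors, tensors[1:])])

# `inputs` and `outputs` would be recorded during one full denoising run;
# random tensors stand in here so the snippet is self-contained.
inputs = [torch.randn(1, 16, 64, 64) for _ in range(50)]
outputs = [torch.randn(1, 16, 64, 64) for _ in range(50)]

input_diffs = trajectory_diffs(inputs)
output_diffs = trajectory_diffs(outputs)
print("Pearson r:", np.corrcoef(input_diffs, output_diffs)[0, 1])
```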


Figure 4: TeaCache is capable of selectively caching informative intermediate model outputs during the inference process.
Building on these insights, we propose the Timestep Embedding Aware Cache (TeaCache). Rather than computing new outputs at every timestep, we reuse cached outputs from previous informative timesteps. Informative timesteps are selected by leveraging the difference of the timestep embedding or of the timestep-embedding-modulated noisy input. Further, to reduce the estimation error between the model output difference and the timestep embedding difference, we apply polynomial fitting to rescale the timestep embedding difference.
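
A condensed sketch of this caching loop is shown below, assuming hypothetical hooks `time_embedding`, `modulate`, and `blocks` on the transformer (real models wire this differently); it is an illustration of the idea, not the official implementation:

```python
import numpy as np

class TeaCacheState:
    """Minimal sketch of TeaCache-style output reuse (not the official code).

    The expensive transformer forward is skipped while the accumulated,
    polynomial-rescaled difference of the modulated input stays below a
    threshold; otherwise the forward runs and its residual is re-cached.
    """

    def __init__(self, poly_coeffs, threshold=0.1):
        self.poly_coeffs = poly_coeffs  # fitted offline, see the sketch below
        self.threshold = threshold
        self.prev_modulated = None
        self.cached_residual = None
        self.accumulated = 0.0

    def forward(self, model, x, t):
        emb = model.time_embedding(t)          # hypothetical hook
        modulated = model.modulate(x, emb)     # embedding-modulated noisy input
        if self.prev_modulated is not None:
            raw = ((modulated - self.prev_modulated).abs().mean()
                   / self.prev_modulated.abs().mean()).item()
            # Rescale the cheap input difference into an estimate of the
            # output difference, then accumulate it over skipped steps.
            self.accumulated += float(np.polyval(self.poly_coeffs, raw))
        self.prev_modulated = modulated
        if self.cached_residual is None or self.accumulated >= self.threshold:
            out = model.blocks(x, emb)         # full (expensive) computation
            self.cached_residual = out - x     # cache the residual
            self.accumulated = 0.0
        return x + self.cached_residual        # otherwise reuse cached residual
```

In this sketch the threshold directly trades visual quality for latency: a larger threshold lets more consecutive steps reuse the cached residual before a full forward is triggered.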

Figure 5: Visualization of the correlation between rescaled input differences and output differences at consecutive timesteps.
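
The rescaling polynomial itself can be fitted offline from a handful of calibration runs. A minimal sketch follows; the polynomial degree and the calibration values are placeholders we assume for illustration, not numbers from the paper:

```python
import numpy as np

# Placeholder calibration pairs measured once over a few prompts:
# x = relative L1 difference of the modulated input between consecutive steps,
# y = relative L1 difference of the corresponding model outputs.
x = np.array([0.02, 0.05, 0.08, 0.12, 0.20, 0.35])
y = np.array([0.01, 0.04, 0.09, 0.16, 0.30, 0.60])

coeffs = np.polyfit(x, y, deg=4)       # fit the rescaling polynomial
estimate = np.polyval(coeffs, 0.10)    # rescale a new input difference
print(coeffs, estimate)
```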

Evaluations

Quantitative Results

Visual Results

Open-Sora, 44.56s
PAB, 31.85s
TeaCache, 28.78s
Open-Sora, 51-frame, 480P.

Latte, 26.90s
PAB, 19.98s
TeaCache, 14.46s
Latte, 16-frame, 512x512.

Open-Sora-Plan, 99.65s
PAB, 73.41s
TeaCache, 22.62s
Open-Sora-Plan, 65-frame, 512x512.

Quality-Latency Trade-off



Scaling to multiple GPUs



Performance at different resolutions and lengths



BibTeX

@misc{liu2024timestep,
      title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
      author={Feng Liu and Shiwei Zhang and Xiaofeng Wang and Yujie Wei and Haonan Qiu and Yuzhong Zhao and Yingya Zhang and Qixiang Ye and Fang Wan},
      year={2024},
      eprint={2411.19108},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19108},
}