As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings so that their differences better approximate those of the model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to decide when to cache and reuse model outputs. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% VBench score) degradation of visual quality.
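To make the modulation step concrete, below is a minimal sketch of how a timestep-embedding-modulated noisy input can be formed. The adaLN-style scale/shift used here is a common DiT-style design and is an assumption for illustration, not an excerpt from TeaCache's code; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class TimestepModulation(nn.Module):
    """Form a timestep-embedding-modulated noisy input via adaLN-style scale/shift."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # The timestep embedding predicts a per-channel scale and shift (assumed design).
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 2 * hidden_dim))

    def forward(self, noisy_latent: torch.Tensor, timestep_emb: torch.Tensor) -> torch.Tensor:
        # noisy_latent: (batch, tokens, hidden_dim); timestep_emb: (batch, hidden_dim)
        scale, shift = self.ada_ln(timestep_emb).chunk(2, dim=-1)
        # The step-to-step change of this modulated input serves as a cheap proxy
        # for the change of the (expensive) model output.
        return self.norm(noisy_latent) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```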
Outputs of a diffusion model are similar between consecutive denoising timesteps, and previous methods exploit this redundancy by caching model outputs at uniformly selected timesteps. However, the output difference between consecutive timesteps varies, so a uniform caching strategy lacks the flexibility to maximize cache utilization. A better strategy is to reuse the cached output more often when the difference between the cached output and the current output is small. Unfortunately, this difference cannot be known before the current output is computed. To address this challenge, TeaCache leverages the following prior: there exists a strong correlation between a model's inputs and outputs.
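The paragraph above describes an adaptive decision made at each timestep. The following minimal sketch illustrates the idea: accumulate an estimated output difference from the cheap modulated inputs and recompute only when the estimate exceeds a threshold. The relative L1 metric, the class and argument names, the default threshold, and the identity rescaling are assumptions for illustration, not TeaCache's exact implementation.

```python
import torch

def relative_l1(curr: torch.Tensor, prev: torch.Tensor) -> float:
    """Relative L1 distance between two tensors (assumed difference metric)."""
    return ((curr - prev).abs().mean() / prev.abs().mean()).item()

class AdaptiveOutputCache:
    """Reuse the cached model output until the estimated difference grows too large."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold     # hypothetical reuse budget
        self.accumulated = 0.0         # accumulated estimated output difference
        self.prev_modulated = None     # modulated input at the previous timestep
        self.cached_output = None      # last actually computed model output

    def step(self, modulated_input: torch.Tensor, compute_fn):
        if self.prev_modulated is not None:
            # Estimate the per-step output change from the change of the input proxy.
            # A rescaling (e.g. a fitted polynomial) would refine this estimate;
            # the identity rescaling is assumed here.
            self.accumulated += relative_l1(modulated_input, self.prev_modulated)
        self.prev_modulated = modulated_input

        if self.cached_output is None or self.accumulated >= self.threshold:
            self.cached_output = compute_fn()   # run the expensive transformer forward
            self.accumulated = 0.0              # reset after an actual computation
        return self.cached_output
```

At timesteps where the accumulated estimate stays below the threshold, `compute_fn` is never called and the cached output is reused, which is where the acceleration comes from.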
Our study reveals two key observations about video diffusion transformers. First, the differences of model outputs exhibit distinct patterns across models: in Open-Sora the pattern forms a 'U' shape, whereas in Latte and Open-Sora-Plan it resembles a horizontally flipped 'L'; Open-Sora-Plan additionally features multiple peaks because its scheduler samples certain timesteps twice. Second, the noisy input changes minimally across consecutive timesteps and shows little correlation with the model output, whereas both the timestep embedding and the timestep-embedding-modulated noisy input demonstrate a strong correlation with the model output across multiple models.
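As a rough illustration of how such a correlation could be probed offline, one can compare the per-step relative differences of a candidate indicator with those of the model output. This sketch is an assumption, not the paper's exact measurement protocol; the function names and the commented usage are hypothetical.

```python
import numpy as np

def relative_l1_series(tensors):
    """Relative L1 differences between consecutive arrays recorded over one denoising run."""
    return np.array([
        np.abs(curr - prev).mean() / np.abs(prev).mean()
        for prev, curr in zip(tensors[:-1], tensors[1:])
    ])

def correlation_with_output(indicator_per_step, output_per_step):
    """Pearson correlation between an indicator's difference curve and the output's."""
    return np.corrcoef(relative_l1_series(indicator_per_step),
                       relative_l1_series(output_per_step))[0, 1]

# Hypothetical usage with arrays recorded during one sampling run:
# correlation_with_output(noisy_inputs, model_outputs)      -> expected to be low
# correlation_with_output(modulated_inputs, model_outputs)  -> expected to be high
```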
@misc{liu2024timestep,
      title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
      author={Feng Liu and Shiwei Zhang and Xiaofeng Wang and Yujie Wei and Haonan Qiu and Yuzhong Zhao and Yingya Zhang and Qixiang Ye and Fang Wan},
      year={2024},
      eprint={2411.19108},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19108},
}