As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising. Previous methods speed up the models by caching and reusing model outputs at uniformly selected timesteps. However, such a strategy neglects the fact that differences among model outputs are not uniform across timesteps, which hinders selecting the appropriate model outputs to cache, leading to a poor balance between inference efficiency and visual quality. In this study, we introduce Timestep Embedding Aware Cache (TeaCache), a training-free caching approach that estimates and leverages the fluctuating differences among model outputs across timesteps. Rather than directly using the time-consuming model outputs, TeaCache focuses on model inputs, which have a strong correlation with the model outputs while incurring negligible computational cost. TeaCache first modulates the noisy inputs using the timestep embeddings so that their differences better approximate those of the model outputs. TeaCache then introduces a rescaling strategy to refine the estimated differences and utilizes them to decide when to cache and reuse model outputs. Experiments show that TeaCache achieves up to 4.41x acceleration over Open-Sora-Plan with negligible (-0.07% VBench score) degradation of visual quality.
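To make the modulation step concrete, below is a minimal sketch of how a timestep-embedding-modulated noisy input can be formed. The adaLN-style scale/shift used here is a common DiT-style design and is an assumption for illustration, not an excerpt from TeaCache's code; the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class TimestepModulation(nn.Module):
    """Form a timestep-embedding-modulated noisy input via adaLN-style scale/shift."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # The timestep embedding predicts a per-channel scale and shift (assumed design).
        self.ada_ln = nn.Sequential(nn.SiLU(), nn.Linear(hidden_dim, 2 * hidden_dim))

    def forward(self, noisy_latent: torch.Tensor, timestep_emb: torch.Tensor) -> torch.Tensor:
        # noisy_latent: (batch, tokens, hidden_dim); timestep_emb: (batch, hidden_dim)
        scale, shift = self.ada_ln(timestep_emb).chunk(2, dim=-1)
        # The step-to-step change of this modulated input serves as a cheap proxy
        # for the change of the (expensive) model output.
        return self.norm(noisy_latent) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```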
Outputs of a diffusion model are similar between consecutive denoising timesteps, and previous methods exploit this redundancy by caching model outputs at uniformly selected timesteps. However, the output difference between consecutive timesteps varies, so a uniform caching strategy lacks the flexibility to maximize cache utilization. A better strategy is to reuse the cached output more often when the difference between the cached output and the current output is small. Unfortunately, this difference cannot be known before the current output is computed. To address this challenge, TeaCache leverages the following prior: there exists a strong correlation between a model's inputs and outputs.
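The paragraph above describes an adaptive decision made at each timestep. The following minimal sketch illustrates the idea: accumulate an estimated output difference from the cheap modulated inputs and recompute only when the estimate exceeds a threshold. The relative L1 metric, the class and argument names, the default threshold, and the identity rescaling are assumptions for illustration, not TeaCache's exact implementation.

```python
import torch

def relative_l1(curr: torch.Tensor, prev: torch.Tensor) -> float:
    """Relative L1 distance between two tensors (assumed difference metric)."""
    return ((curr - prev).abs().mean() / prev.abs().mean()).item()

class AdaptiveOutputCache:
    """Reuse the cached model output until the estimated difference grows too large."""

    def __init__(self, threshold: float = 0.1):
        self.threshold = threshold     # hypothetical reuse budget
        self.accumulated = 0.0         # accumulated estimated output difference
        self.prev_modulated = None     # modulated input at the previous timestep
        self.cached_output = None      # last actually computed model output

    def step(self, modulated_input: torch.Tensor, compute_fn):
        if self.prev_modulated is not None:
            # Estimate the per-step output change from the change of the input proxy.
            # A rescaling (e.g. a fitted polynomial) would refine this estimate;
            # the identity rescaling is assumed here.
            self.accumulated += relative_l1(modulated_input, self.prev_modulated)
        self.prev_modulated = modulated_input

        if self.cached_output is None or self.accumulated >= self.threshold:
            self.cached_output = compute_fn()   # run the expensive transformer forward
            self.accumulated = 0.0              # reset after an actual computation
        return self.cached_output
```

At timesteps where the accumulated estimate stays below the threshold, `compute_fn` is never called and the cached output is reused, which is where the acceleration comes from.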
Our study reveals two key observations about video diffusion transformers. First, the differences of model outputs exhibit distinct patterns across models: in Open-Sora the pattern forms a 'U' shape, whereas in Latte and Open-Sora-Plan it resembles a horizontally flipped 'L'; Open-Sora-Plan additionally features multiple peaks because its scheduler samples certain timesteps twice. Second, the noisy input changes minimally across consecutive timesteps and shows little correlation with the model output, whereas both the timestep embedding and the timestep-embedding-modulated noisy input demonstrate a strong correlation with the model output across multiple models.
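As a rough illustration of how such a correlation could be probed offline, one can compare the per-step relative differences of a candidate indicator with those of the model output. This sketch is an assumption, not the paper's exact measurement protocol; the function names and the commented usage are hypothetical.

```python
import numpy as np

def relative_l1_series(tensors):
    """Relative L1 differences between consecutive arrays recorded over one denoising run."""
    return np.array([
        np.abs(curr - prev).mean() / np.abs(prev).mean()
        for prev, curr in zip(tensors[:-1], tensors[1:])
    ])

def correlation_with_output(indicator_per_step, output_per_step):
    """Pearson correlation between an indicator's difference curve and the output's."""
    return np.corrcoef(relative_l1_series(indicator_per_step),
                       relative_l1_series(output_per_step))[0, 1]

# Hypothetical usage with arrays recorded during one sampling run:
# correlation_with_output(noisy_inputs, model_outputs)      -> expected to be low
# correlation_with_output(modulated_inputs, model_outputs)  -> expected to be high
```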
@misc{liu2024timestep,
      title={Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model},
      author={Feng Liu and Shiwei Zhang and Xiaofeng Wang and Yujie Wei and Haonan Qiu and Yuzhong Zhao and Yingya Zhang and Qixiang Ye and Fang Wan},
      year={2024},
      eprint={2411.19108},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.19108},
}