Unveiling CogVideoX-5B: Insights into Cutting-Edge Video Generation Models
📄 Read in Chinese | 🤗 Hugging Face Space | 🌐 GitHub | 📜 arXiv
📍 Check out QingYing and API Platform for hands-on experience with commercial video generation models.
Demo Showcase
Description: A vibrant garden buzzes with life as butterflies flutter among blossoms, casting gentle shadows. A majestic fountain bubbles softly in the background, its rhythmic sound enhancing the serene ambiance. A wooden chair offers a perfect spot for contemplation under the shade of a grand tree.
Description: A determined young boy races through pouring rain, illuminated by flashes of lightning. The rain’s fury creates a chaotic but mesmerizing choreography, while a cozy silhouette of home beckons in the distance, showcasing the resilience of childhood spirit against nature’s might.
Description: An astronaut in a suit, dusted in Martian red, engages in a momentous handshake with a shimmering blue alien beneath a dreamy pink sky. The sleek silver rocket stands tall in the background, symbolizing human ingenuity amidst the stark beauty of Mars.
Description: A serene elderly gentleman sits at the water’s edge, painting an exquisite oil masterpiece as he enjoys a steaming cup of tea. The breeze tousles his silver hair, and the soft colors of the sunset reflect off the tranquil sea, creating a calm atmosphere filled with inspiration.
Description: In a dimly lit bar, a thoughtful man contemplates life’s mysteries, his expression revealed in close-up as purple light bathes his face, fading shadows lending depth to the intimate atmosphere.
Description: A playful golden retriever, donning black sunglasses, joyfully races across a rain-kissed rooftop terrace. The dog’s energetic leaps bring the scene to life as it bounds towards the camera, exuberance radiating from its wagging tail.
Description: By a lakeshore swaying with willow trees, elegant swans glide effortlessly over shimmering waters reflecting the azure sky, gently crafting ripples that disturb the lake’s tranquil facade, exemplifying nature’s serene beauty.
Description: A Chinese mother, enveloped in a pastel robe, rocks gently in a cozy chair in a softly lit nursery. Whimsical mobiles dance overhead as her swaddled baby coos contentedly against her chest, the two sharing a tender moment filled with love and tranquility.
Model Overview
CogVideoX is an open-source video generation model originating from QingYing. Below is an overview of the video generation models available to developers, along with their key specifications.
| Model Name | CogVideoX-2B | CogVideoX-5B (This Repository) |
|---|---|---|
| Model Description | Entry-level model focusing on compatibility and cost-efficiency for development. | Advanced model providing high-quality video generation and enhanced visual effects. |
| Inference Precision | FP16 (recommended), BF16, FP32, FP8*, INT8; INT4 not supported. | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported. |
| Single-GPU Inference VRAM | SAT FP16: 18 GB<br>diffusers FP16: from 4 GB*<br>diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB<br>diffusers BF16: from 5 GB*<br>diffusers INT8 (torchao): from 4.4 GB* |
| Multi-GPU Inference VRAM | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers |
| Inference Speed (steps = 50, FP/BF16) | Single A100: ~90 seconds<br>Single H100: ~45 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 |
| Fine-tuning VRAM (per GPU) | 47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT) |
| Prompt Language | English* | English* |
| Prompt Length Limit | 226 tokens | 226 tokens |
| Video Length | 6 seconds | 6 seconds |
| Frame Rate | 8 frames per second | 8 frames per second |
| Video Resolution | 720 × 480; other resolutions not supported (including fine-tuning). | 720 × 480; other resolutions not supported (including fine-tuning). |
| Positional Encoding | 3d_sincos_pos_embed | 3d_rope_pos_embed |
Data Usage Guidelines
- Testing with the `diffusers` library enabled all available optimizations. Actual VRAM/memory usage may vary on devices other than the NVIDIA A100 / H100, and peak VRAM usage rises significantly if the following optimizations are disabled:

  ```python
  pipe.enable_model_cpu_offload()
  pipe.enable_sequential_cpu_offload()
  pipe.vae.enable_slicing()
  pipe.vae.enable_tiling()
  ```

- For multi-GPU inference, `enable_model_cpu_offload()` must be disabled (see the sketch after this list).
- INT8 models run with reduced inference speed; this trade-off allows the model to fit on lower-VRAM devices while preserving output quality.
- The 2B model is trained in `FP16` and the 5B model in `BF16`. For best results, run inference in the same precision used during training.
- PytorchAO and Optimum-quanto can quantize the text encoder, Transformer, and VAE, making it possible to run CogVideoX on GPUs with smaller VRAM.
- The inference speed tests used the VRAM optimization scheme above; without these optimizations, inference speed increases by around 10%.
- Only the `diffusers` variant of the model supports quantization.
- Currently, only English input is supported; prompts in other languages should be translated into English first.
Important Notes
- Use SAT for inference and fine-tuning of the SAT-version models. More information is available on our GitHub page.
Quick Start: Deploying with 🤗
The following steps outline the deployment process using the Hugging Face diffusers library.
For an enhanced experience, visit our GitHub for details on prompt optimization and conversion.
- Install required dependencies:
```shell
# diffusers>=0.30.1
# transformers>=4.44.2
# accelerate>=0.33.0 (source installation suggested)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
```
- Execute the code:
```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda dressed in a small red jacket and a tiny hat sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously, some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
Quantized Inference Techniques
PytorchAO and Optimum-quanto can quantize the Text Encoder, Transformer, and VAE modules, considerably reducing CogVideoX's memory usage. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with limited VRAM. Additionally, TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed.
```python
# Follow these steps to begin, with PytorchAO installed from GitHub source and using PyTorch Nightly.
# Nightly installation is necessary only until the next official release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda dressed in a small red jacket and a tiny hat sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously, some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The backdrop includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
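Because TorchAO quantization is compatible with `torch.compile` (as noted above), the quantized transformer can also be compiled for faster repeated inference. A minimal sketch that continues from the pipeline above; the `mode` and `fullgraph` settings are illustrative choices, not values prescribed by this repository:

```python
# Optional: compile the (quantized) transformer.
# The first generation is slower while compilation runs; later calls are faster.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```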
Furthermore, these models can be serialized and stored in a quantized data type to save disk space. Refer to the PytorchAO and Optimum-quanto documentation for examples and benchmarks.
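A rough sketch of the serialization point above, under the assumption that your torchao and diffusers versions support saving and reloading pickled tensor subclasses; the output directory name is hypothetical:

```python
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

# Quantize the transformer once, then store it on disk in its quantized form.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# torchao-quantized weights are tensor subclasses, so safetensors is skipped here
# and the weights are pickled instead.
transformer.save_pretrained("cogvideox-5b-transformer-int8", safe_serialization=False)

# Later, reload the quantized weights and pass them to CogVideoXPipeline via the
# `transformer=` argument, as in the quantized-inference example above.
transformer_int8 = CogVideoXTransformer3DModel.from_pretrained(
    "cogvideox-5b-transformer-int8", torch_dtype=torch.bfloat16, use_safetensors=False
)
```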
Explore the Model Further
Visit our GitHub for additional resources:
- In-depth technical details and code explanations.
- Prompt word optimization and conversion strategies.
- Model inference and fine-tuning information, including pre-release details.
- Updates on project developments and interactive opportunities.
- Comprehensive toolchain for utilizing CogVideoX efficiently.
- Support for INT8 model inference code.
Model Licensing Information
The CogVideoX model is shared under the CogVideoX LICENSE.
Citation
```bibtex
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
```