Unveiling CogVideoX-5B: Insights into Cutting-Edge Video Generation Models
📄 Read in Chinese | 🤗 Hugging Face Space | 🌐 GitHub | 📜 arXiv
📍 Check out QingYing and API Platform for hands-on experience with commercial video generation models.
Demo Showcase
Description: A vibrant garden buzzes with life as butterflies flutter among blossoms, casting gentle shadows. A majestic fountain bubbles softly in the background, its rhythmic sound enhancing the serene ambiance. A wooden chair offers a perfect spot for contemplation under the shade of a grand tree.
Description: A determined young boy races through pouring rain, illuminated by flashes of lightning. The rain’s fury creates a chaotic but mesmerizing choreography, while a cozy silhouette of home beckons in the distance, showcasing the resilience of childhood spirit against nature’s might.
Description: An astronaut in a suit, dusted in Martian red, engages in a momentous handshake with a shimmering blue alien beneath a dreamy pink sky. The sleek silver rocket stands tall in the background, symbolizing human ingenuity amidst the stark beauty of Mars.
Description: A serene elderly gentleman sits at the water’s edge, painting an exquisite oil masterpiece as he enjoys a steaming cup of tea. The breeze tousles his silver hair, and the soft colors of the sunset reflect off the tranquil sea, creating a calm atmosphere filled with inspiration.
Description: In a dimly lit bar, a thoughtful man contemplates life’s mysteries, his expression revealed in close-up as purple light bathes his face, fading shadows lending depth to the intimate atmosphere.
Description: A playful golden retriever, donning black sunglasses, joyfully races across a rain-kissed rooftop terrace. The dog’s energetic leaps bring the scene to life as it bounds towards the camera, exuberance radiating from its wagging tail.
Description: By a lakeshore swaying with willow trees, elegant swans glide effortlessly over shimmering waters reflecting the azure sky, gently crafting ripples that disturb the lake’s tranquil facade, exemplifying nature’s serene beauty.
Description: A Chinese mother, enveloped in a pastel robe, rocks gently in a cozy chair in a softly lit nursery. Whimsical mobiles dance overhead as her swaddled baby coos contentedly against her chest, the two sharing a tender moment filled with love and tranquility.
Model Overview
CogVideoX is an open-source video generation model originating from QingYing. Below is an overview of the video generation models available to developers, along with their key specifications.
| Model Name | CogVideoX-2B | CogVideoX-5B (This Repository) |
|---|---|---|
| Model Description | Entry-level model focusing on compatibility and cost-efficiency for development. | Advanced model providing high-quality video generation and enhanced visual effects. |
| Inference Precision | FP16 (recommended), BF16, FP32, FP8*, INT8; INT4 not supported. | BF16 (recommended), FP16, FP32, FP8*, INT8; INT4 not supported. |
| Single-GPU Inference VRAM | SAT FP16: 18 GB<br>diffusers FP16: from 4 GB*<br>diffusers INT8 (torchao): from 3.6 GB* | SAT BF16: 26 GB<br>diffusers BF16: from 5 GB*<br>diffusers INT8 (torchao): from 4.4 GB* |
| Multi-GPU Inference VRAM | FP16: 10 GB* using diffusers | BF16: 15 GB* using diffusers |
| Inference Speed (steps = 50, FP/BF16) | Single A100: ~90 seconds<br>Single H100: ~45 seconds | Single A100: ~180 seconds<br>Single H100: ~90 seconds |
| Fine-tuning Precision | FP16 | BF16 |
| Fine-tuning VRAM (per GPU) | 47 GB (bs=1, LoRA)<br>61 GB (bs=2, LoRA)<br>62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA)<br>80 GB (bs=2, LoRA)<br>75 GB (bs=1, SFT) |
| Prompt Language | English* | English* |
| Prompt Length Limit | 226 tokens | 226 tokens |
| Video Length | 6 seconds | 6 seconds |
| Frame Rate | 8 frames per second | 8 frames per second |
| Video Resolution | 720 × 480; other resolutions not supported (including fine-tuning). | 720 × 480; other resolutions not supported (including fine-tuning). |
| Positional Encoding | 3d_sincos_pos_embed | 3d_rope_pos_embed |
Data Usage Guidelines
- Testing with the `diffusers` library enabled all available optimizations. Actual VRAM/memory usage may vary on devices other than the NVIDIA A100 / H100, and peak VRAM usage rises significantly if the following optimizations are disabled:

  ```python
  pipe.enable_model_cpu_offload()
  pipe.enable_sequential_cpu_offload()
  pipe.vae.enable_slicing()
  pipe.vae.enable_tiling()
  ```

- For multi-GPU inference, `enable_model_cpu_offload()` must be disabled (see the sketch after this list).
- INT8 models run with reduced inference speed; this trade-off allows the model to fit on lower-VRAM devices while preserving output quality.
- The 2B model is trained in `FP16` and the 5B model in `BF16`. For best results, run inference in the same precision used during training.
- PytorchAO and Optimum-quanto can quantize the text encoder, Transformer, and VAE, making it possible to run CogVideoX on GPUs with smaller VRAM.
- The inference speed tests used the VRAM optimization scheme above; without these optimizations, inference speed increases by around 10%.
- Only the `diffusers` variant of the model supports quantization.
- Currently, only English input is supported; prompts in other languages should be translated into English first.
Important Notes
- Use SAT for inference and fine-tuning of the SAT-version models. More information is available on our GitHub page.
Quick Start: Deploying with 🤗
The following steps outline the deployment process using the Hugging Face diffusers library.
For an enhanced experience, visit our GitHub for details on prompt optimization and conversion.
- Install required dependencies:
```shell
# diffusers>=0.30.1
# transformers>=4.44.2
# accelerate>=0.33.0 (source installation suggested)
# imageio-ffmpeg>=0.5.1
pip install --upgrade transformers accelerate diffusers imageio-ffmpeg
```
- Execute the code:
```python
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda dressed in a small red jacket and a tiny hat sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously, some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
Quantized Inference Techniques
PytorchAO and Optimum-quanto can quantize the Text Encoder, Transformer, and VAE modules, considerably reducing CogVideoX's memory usage. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with limited VRAM. Additionally, TorchAO quantization is fully compatible with `torch.compile`, which can significantly improve inference speed.
```python
# Follow these steps to begin, with PytorchAO installed from GitHub source and using PyTorch Nightly.
# Nightly installation is necessary only until the next official release.

import torch
from diffusers import AutoencoderKLCogVideoX, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import T5EncoderModel
from torchao.quantization import quantize_, int8_weight_only, int8_dynamic_activation_int8_weight

quantization = int8_weight_only

text_encoder = T5EncoderModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="text_encoder", torch_dtype=torch.bfloat16)
quantize_(text_encoder, quantization())

transformer = CogVideoXTransformer3DModel.from_pretrained("THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16)
quantize_(transformer, quantization())

vae = AutoencoderKLCogVideoX.from_pretrained("THUDM/CogVideoX-5b", subfolder="vae", torch_dtype=torch.bfloat16)
quantize_(vae, quantization())

# Create pipeline and run inference
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    text_encoder=text_encoder,
    transformer=transformer,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()

prompt = "A panda dressed in a small red jacket and a tiny hat sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously, some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The backdrop includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
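Because TorchAO quantization is compatible with `torch.compile` (as noted above), the quantized transformer can also be compiled for faster repeated inference. A minimal sketch that continues from the pipeline above; the `mode` and `fullgraph` settings are illustrative choices, not values prescribed by this repository:

```python
# Optional: compile the (quantized) transformer.
# The first generation is slower while compilation runs; later calls are faster.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```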
Furthermore, these models can be serialized and stored in a quantized data type to save disk space. Refer to the PytorchAO and Optimum-quanto documentation for examples and benchmarks.
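A rough sketch of the serialization point above, under the assumption that your torchao and diffusers versions support saving and reloading pickled tensor subclasses; the output directory name is hypothetical:

```python
import torch
from diffusers import CogVideoXTransformer3DModel
from torchao.quantization import quantize_, int8_weight_only

# Quantize the transformer once, then store it on disk in its quantized form.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-5b", subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize_(transformer, int8_weight_only())

# torchao-quantized weights are tensor subclasses, so safetensors is skipped here
# and the weights are pickled instead.
transformer.save_pretrained("cogvideox-5b-transformer-int8", safe_serialization=False)

# Later, reload the quantized weights and pass them to CogVideoXPipeline via the
# `transformer=` argument, as in the quantized-inference example above.
transformer_int8 = CogVideoXTransformer3DModel.from_pretrained(
    "cogvideox-5b-transformer-int8", torch_dtype=torch.bfloat16, use_safetensors=False
)
```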
Explore the Model Further
Visit our GitHub for additional resources:
- In-depth technical details and code explanations.
- Prompt word optimization and conversion strategies.
- Model inference and fine-tuning information, including pre-release details.
- Updates on project developments and interactive opportunities.
- Comprehensive toolchain for utilizing CogVideoX efficiently.
- Support for INT8 model inference code.
Model Licensing Information
The CogVideoX model is shared under the CogVideoX LICENSE.
Citation
```bibtex
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
```