hpcaitech/Open-Sora: a new open-source video generation model

Open-Sora is an open-source video generation project from the ColossalAI team. It aims to produce high-quality videos efficiently, pairing cost-effective architectures with a full end-to-end training and inference toolkit and pre-trained weights for easy access. From the 1.3 release in February 2025 to the 2.0 launch (an 11B-parameter model) on March 12, 2025, Open-Sora has matched leading models on VBench and human preference benchmarks while keeping training costs around $200K. The repository is released under the Apache-2.0 license, has over 26.4K stars, and is backed by an active community that regularly updates both the code and the technical reports. Users can also try the free Hugging Face demo, browse the demo gallery, and follow step-by-step guides on data preprocessing, model training, and high-compression autoencoder techniques.


Official website: https://video.luchentech.com/
Source code: https://github.com/hpcaitech/Open-Sora

Advanced Usage

Motion Score

During training, we inject a motion score into the text prompt. During inference, you can use the following command to generate videos with a given motion score (the default is 4):

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --motion-score 4
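
The conditioning itself amounts to simple text tagging. Below is a minimal illustrative sketch in Python, not Open-Sora's actual code; the tag format and the 1-7 range are assumptions inferred from the score-1/4/7 samples shown further down:

def add_motion_score(prompt: str, motion_score: int = 4) -> str:
    # Hypothetical helper: append a motion-score tag to the caption so the
    # diffusion model can condition on the desired motion strength.
    # The 1-7 range is an assumption based on the sample scores below.
    motion_score = max(1, min(7, motion_score))
    return f"{prompt} motion score: {motion_score}."

print(add_motion_score("raining, sea"))  # -> "raining, sea motion score: 4."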

We also provide a dynamic motion score evaluator. After setting your OpenAI API key, you can use the following command to generate a video with a dynamically evaluated motion score:

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --motion-score dynamic
[Sample videos generated with motion scores 1, 4, and 7]
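
The repository does not show the evaluator's internals here; the following is a minimal sketch of one plausible implementation, assuming it asks an OpenAI model to infer an appropriate score from the prompt. The model name and instruction text are illustrative, not Open-Sora's actual settings:

from openai import OpenAI

def dynamic_motion_score(prompt: str) -> int:
    # Hypothetical evaluator: ask an OpenAI model to rate how much motion
    # the described scene implies, on the same 1-7 scale used above.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice, not Open-Sora's setting
        messages=[{
            "role": "user",
            "content": f"Rate the amount of motion implied by this video "
                       f"prompt on a scale of 1 (static) to 7 (highly "
                       f"dynamic). Reply with a single integer.\n\n{prompt}",
        }],
    )
    return int(resp.choices[0].message.content.strip())

print(dynamic_motion_score("raining, sea"))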

Prompt Refine

We use ChatGPT to refine the prompt. You can enable refinement with the following command; it works for both text-to-video and image-to-video generation.

export OPENAI_API_KEY=sk-xxxx
torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --refine-prompt True
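
A comparable sketch of what --refine-prompt True might do under the hood, again assuming a single OpenAI chat call; the instruction text is an assumption, not Open-Sora's actual system prompt:

from openai import OpenAI

def refine_prompt(prompt: str) -> str:
    # Hypothetical refiner: expand a terse prompt into a detailed caption
    # better suited to conditioning a video diffusion model.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice
        messages=[
            {"role": "system",
             "content": "Rewrite the user's short video prompt as one "
                        "detailed, visually concrete caption."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content.strip()

print(refine_prompt("raining, sea"))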

Reproducibility

To make the results reproducible, you can set the random seed by:

torchrun --nproc_per_node 1 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --sampling_option.seed 42 --seed 42

Use --num-sample k to generate k samples for each prompt.
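
Under the hood, a seed flag like this typically seeds every random number generator that influences sampling. A minimal sketch of the standard PyTorch recipe (not Open-Sora's exact code):

import random
import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    # Seed Python, NumPy, and PyTorch (CPU and all GPUs) so that repeated
    # runs draw the same noise and therefore produce the same video.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)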

Computational Efficiency

We test the computational efficiency of text-to-video generation on H100/H800 GPUs. For 256×256, we use ColossalAI's tensor parallelism with --offload True; for 768×768, we use ColossalAI's sequence parallelism. All runs use 50 sampling steps. The results are reported as total time (s) / peak GPU memory (GB):

Resolution   1x GPU      2x GPUs     4x GPUs     8x GPUs
256×256      60/52.5     40/44.3     34/44.3     -
768×768      1656/60.3   863/48.3    466/44.3    276/44.3
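
Multi-GPU runs reuse the same inference script with more processes. A plausible 4-GPU invocation for 256×256, built only from the flags shown above (the parallelism itself is configured by Open-Sora, so additional config options may apply):

torchrun --nproc_per_node 4 --standalone scripts/diffusion/inference.py configs/diffusion/inference/t2i2v_256px.py --save-dir samples --prompt "raining, sea" --offload True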
