Commit f25d17e by LanguageBind (1 parent: 9e40191)

Update README.md

Files changed (1):
  1. README.md +11 -3
README.md CHANGED
```diff
@@ -2,6 +2,9 @@
 license: mit
 ---
 
+# Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.
+
+# We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.
 
 
 <h1 align="left"> <a href="https://github.com/PKU-YuanGroup/Open-Sora-Plan">Open-Sora Plan</a></h1>
@@ -170,14 +173,19 @@ Similar to previous work, we use a multi-stage training approach. With the 3D Di
 
 The video model is initialized with weights from a 480p image model. We first train 480p videos with 29 frames. Next, we adapt the weights to 720p resolution, training on approximately 7 million samples from panda70m, filtered for aesthetic quality and motion. Finally, we refine the model with a higher-quality (HQ) subset of 1 million samples for fine-tuning 93-frame 720p videos. Below is our training card.
 
+
+
 | Name | Stage 1 | Stage 2 | Stage 3 | Stage 4 |Stage 5 |
-|:---|:---|:---|:---|:---|:---|
+|---|---|---|---|---|---|
 | Training Video Size | 1×320×240 | 1×640×480 | 29×640×480 | 29×1280×720 | 93×1280×720 |
 | Training Step| 146k | 200k | 30k | 21k | 3k |
 | Compute (#Num x #Hours) | 32 Ascend × 81 | 32 Ascend × 142 | 128 Ascend × 38 | 256 H100 × 64 | 256 H100 × 84 |
-| Checkpoint | - | - | - | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0) |
+| Checkpoint | - | - | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x720p) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x720p) |
 | Log | - | - | [wandb](https://api.wandb.ai/links/1471742727-Huawei/trdu2kba) | [wandb](https://api.wandb.ai/links/linbin/vvxvcd7s) | [wandb](https://api.wandb.ai/links/linbin/easg3qkl)
-| Training Data | 10M SAM | 5M internal image data | 4M Panda70M | 7M Panda70M | 1M HQ Panda70M |
+| Training Data | [10M SAM](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/sam_image_11185255_resolution.json) | 5M internal image data | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [1M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ1M.json) and [100k HQ data](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/anno_json) (collected in v1.1.0) |
+
+Additionally, we fine-tuned 3.5k steps from the final 93×720p to get [93×480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x480p) for community research use.
+
 
 ### Training Image-to-Video Diffusion Model
 
```
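The Compute row of the training card in this diff lists accelerator count × hours for each stage, and the Training Video Size row gives frames × spatial size per clip. The following minimal Python sketch turns those entries into accelerator-hours and raw pixels per training clip. The values are transcribed from the table. The `Stage` class and helper names are hypothetical. Reading "#Hours" as wall-clock hours for the whole allocation is an assumption.

```python
"""Back-of-the-envelope numbers from the Open-Sora Plan v1.2.0 training card.

Assumption: "#Num x #Hours" means (number of accelerators) x (wall-clock hours),
so their product is accelerator-hours. Stage/helper names are hypothetical.
"""

from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    frames: int   # first factor of "Training Video Size"
    width: int    # spatial dims as written in the card, e.g. 640x480
    height: int
    devices: int  # "#Num" in the Compute row
    device: str   # accelerator type: "Ascend" or "H100"
    hours: int    # "#Hours" in the Compute row

    @property
    def device_hours(self) -> int:
        # accelerator-hours = number of devices x hours
        return self.devices * self.hours

    @property
    def pixels_per_sample(self) -> int:
        # raw pixel count of one training clip: frames x width x height
        return self.frames * self.width * self.height


# Values copied from the training card (Stage 1 .. Stage 5).
STAGES = [
    Stage("Stage 1", 1, 320, 240, 32, "Ascend", 81),
    Stage("Stage 2", 1, 640, 480, 32, "Ascend", 142),
    Stage("Stage 3", 29, 640, 480, 128, "Ascend", 38),
    Stage("Stage 4", 29, 1280, 720, 256, "H100", 64),
    Stage("Stage 5", 93, 1280, 720, 256, "H100", 84),
]


def totals_by_device(stages):
    """Sum accelerator-hours separately for each accelerator type."""
    totals = {}
    for s in stages:
        totals[s.device] = totals.get(s.device, 0) + s.device_hours
    return totals


if __name__ == "__main__":
    for s in STAGES:
        print(f"{s.name}: {s.device_hours:>6} {s.device}-hours, "
              f"{s.pixels_per_sample:>11,} pixels/clip")
    print(totals_by_device(STAGES))
```

Ascend and H100 hours are totalled separately because the two accelerators are not interchangeable units of compute. Under the stated assumption, the three Ascend stages come to 12,000 Ascend-hours and the two H100 stages to 37,888 H100-hours. A 93×1280×720 clip in Stage 5 carries roughly 9.6 times the raw pixels of a 29×640×480 clip in Stage 3.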