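Technical report

User 5287

Modified February 28, 2024

This page is a translation of OpenAI's technical report. For further commentary, see: Reports and Opinions

Sora: a visual generation model that can serve as a world simulator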

Original: https://openai.com/research/video-generation-models-as-world-simulators




Video generation models as world simulators


We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks [1, 2, 3], generative adversarial networks [4, 5, 6, 7], autoregressive transformers [8, 9], and diffusion models [10, 11, 12]. These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data: it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.
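The report does not publish model or implementation details, so the following is only a minimal sketch, assuming NumPy, of what cutting a latent video into spacetime patches could look like. The function name `to_spacetime_patches`, the `(T, H, W, C)` latent layout, the patch sizes, and the example shapes are all hypothetical and are not taken from the report. The point it illustrates is that clips of different durations and aspect ratios become token sequences of different lengths but the same token width.

```python
import numpy as np


def to_spacetime_patches(latent: np.ndarray, pt: int, ph: int, pw: int) -> np.ndarray:
    """Split a latent video of shape (T, H, W, C) into flattened spacetime patches.

    Returns an array of shape (num_patches, pt * ph * pw * C).
    The layout and patch sizes are illustrative assumptions, not values from the report.
    """
    t, h, w, c = latent.shape
    if t % pt or h % ph or w % pw:
        raise ValueError("latent dimensions must be divisible by the patch size")
    # Expose the patch grid: (T/pt, pt, H/ph, ph, W/pw, pw, C).
    grid = latent.reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
    # Move the three grid axes to the front, followed by the within-patch axes.
    grid = grid.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch; each row can serve as one token for a sequence model.
    return grid.reshape(-1, pt * ph * pw * c)


# Hypothetical latent shapes: a longer widescreen clip and a shorter vertical clip.
widescreen = np.zeros((16, 18, 32, 4))
vertical = np.zeros((8, 32, 18, 4))

tokens_wide = to_spacetime_patches(widescreen, pt=2, ph=2, pw=2)
tokens_vert = to_spacetime_patches(vertical, pt=2, ph=2, pw=2)

print(tokens_wide.shape)  # (1152, 32)
print(tokens_vert.shape)  # (576, 32)
```

Under these assumptions, the two clips yield 1152 and 576 tokens respectively, each of width 32, so a single sequence model could in principle consume both without resizing or cropping either one.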

Turning visual data into patches
