Source: compiled from 數(shù)字生命卡茲克, 新智元, 騰訊科技, and 每日經(jīng)濟(jì)新聞. Published: 2024-02-20 16:01:51.
On February 15 local time, OpenAI unveiled Sora, its text-to-video model, on its website. According to the official demos, a user types a text prompt and Sora generates a 60-second video with a cinematic feel.

OpenAI showcased 48 Sora-generated videos on the site. Close-ups of people, animals and objects are rendered in fine detail, with rich backgrounds, vivid detail and fluid camera movement; some shots even convey a palpable sense of emotion.
01. A GPT-3 moment for text-to-video: Sora's technical report reveals six core strengths
According to foreign media reports, Sora's launch marks a milestone in AI research. With its capacity to simulate and understand the real world, it lays groundwork for future artificial general intelligence (AGI). In essence, Sora is not merely generating video; it is pushing the boundary of what AI can do.

OpenAI CEO Sam Altman revealed on X that Sora is currently open to red teamers (experts in areas such as misinformation, hateful content and bias) and to a group of creative professionals.

Jim Fan, principal research scientist at NVIDIA AI, posted on X: "If you still think of Sora as a generative toy like DALL·E, think again: it is a data-driven physics engine. It is a simulation of many worlds, real or fantastical." In his view, Sora is a learnable simulator, or "world model."

He sees Sora as the GPT-3 moment for text-to-video. Addressing claims that "Sora is not learning physics, it is just manipulating pixels in 2D," he argued that the soft-body physics Sora exhibits is an emergent property that appears with scale. To model video pixels accurately, Sora must learn implicit text-to-3D mappings, 3D transformations, ray-traced rendering and physical rules, and it must internalize something like a game engine's notion of the world to generate such videos.

Asked about Sora's greatest strength, Zhou Hongyi, founder and chairman of 360 Group, said: "This time OpenAI used its advantage in large language models to give Sora two layers of capability, understanding the real world and simulating it. Only then are the resulting videos realistic, able to go beyond 2D and model the real physical world." He added: "Once AI is hooked up to cameras and has watched every movie and every video on YouTube and TikTok, its understanding of the world will far exceed what text alone can teach; a picture is worth a thousand words. AGI would then be genuinely close, not a matter of 10 or 20 years but perhaps just one or two."
Alongside the release, OpenAI published a detailed technical report on Sora. The report points to six core strengths:

(1) Accuracy and diversity: One of Sora's signature features is its ability to interpret prompts as long as 135 words. It faithfully follows user-provided text and generates high-quality video clips spanning a wide variety of scenes and characters. The tool turns short text descriptions into high-definition videos up to one minute long, covering subjects from people and animals to lush landscapes, city scenes, gardens and even an underwater New York City, delivering diverse content on demand.

(2) Strong language understanding: OpenAI applies the re-captioning technique from DALL·E 3 to generate descriptive captions for its visual training data, which improves both textual fidelity and overall video quality. As with DALL·E 3, OpenAI also uses GPT to expand short user prompts into longer, detailed captions that are sent to the video model, allowing Sora to generate high-quality videos that precisely follow user prompts.

(3) Video generation from images or video: Beyond text-to-video, Sora accepts other kinds of input prompts, such as pre-existing images or video. This lets it perform a broad range of image and video editing tasks: creating perfectly looping videos, animating static images, extending videos forward or backward in time, and more. In the report, OpenAI shows demo videos generated from DALL·E 2 and DALL·E 3 images, demonstrating both Sora's power and its potential in image and video editing.

(4) Video extension: Because Sora accepts diverse input prompts, users can create videos from images or build on existing footage. As a Transformer-based diffusion model, it can also extend videos forward or backward along the timeline. The four demo videos OpenAI provides all start from the same generated segment and are extended backward in time, so although they open differently, all four reach the same ending.

(5) Excellent device adaptability: Sora samples anything from widescreen 1920x1080 to vertical 1080x1920 and every size in between, so it can generate content that matches the native aspect ratio of any device. It can also quickly prototype content at smaller sizes before rendering at full resolution.

(6) Scene and object consistency and continuity: Sora can generate videos with dynamic camera motion, so people and scene elements move naturally through three-dimensional space, and it handles occlusion well. One failure mode of existing models is losing track of objects once they leave the field of view; by predicting many frames at once, Sora keeps subjects intact even when they temporarily leave the frame.
Report link: https://openai.com/research/video-generation-models-as-world-simulators
02. OpenAI's Technical Report on the Sora Video Generation Model
Video generation models as world simulators
We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.
Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks, generative adversarial networks,[4-7] autoregressive transformers,[8,9] and diffusion models.[10-12] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.
Turning visual data into patches
We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.[13,14] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.[15-18] We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.
At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,[19] and subsequently decomposing the representation into spacetime patches.
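For intuition, here is a minimal sketch of that decomposition in Python. The latent shape, patch size and token width are illustrative assumptions; the report gives no sizes.

```python
import numpy as np

# A sketch of spacetime patchification. A compressed latent clip of shape
# (T, H, W, C) is cut into non-overlapping t x p x p blocks, and each block
# is flattened into one token, mirroring how an LLM tokenizes text.
def to_spacetime_patches(latent, t=2, p=4):
    T, H, W, C = latent.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "pad to a multiple first"
    x = latent.reshape(T // t, t, H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)      # (nT, nH, nW, t, p, p, C)
    return x.reshape(-1, t * p * p * C)       # one row per transformer token

latent = np.random.randn(8, 32, 32, 4)        # e.g. an 8-frame latent clip
tokens = to_spacetime_patches(latent)
print(tokens.shape)                           # (256, 128)
```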
Video compression network
We train a network that reduces the dimensionality of visual data.[20] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.
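The report gives no architectural details, so the following is only a rough sketch of such an encoder/decoder pair, with small 3D convolutions standing in for whatever OpenAI actually trains.

```python
import torch
import torch.nn as nn

# A hedged sketch: 3D convolutions downsample raw video in time and space,
# and a mirrored decoder maps latents back to pixels. Channel counts and
# compression factors are assumptions.
class VideoAutoencoder(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(                  # input: (B, 3, T, H, W)
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(64, latent_channels, kernel_size=3, stride=2, padding=1),
        )                                              # 4x smaller in T, H and W
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, 64, 4, stride=2, padding=1),
            nn.SiLU(),
            nn.ConvTranspose3d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, video):
        z = self.encoder(video)                        # compressed spacetime latent
        return self.decoder(z), z

model = VideoAutoencoder()
video = torch.randn(1, 3, 16, 64, 64)                  # one 16-frame clip
recon, z = model(video)
print(z.shape, recon.shape)                            # latent, then reconstruction
```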
Spacetime Latent Patches
Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
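As a toy illustration of that inference-time control, the sketch below lays out randomly initialized noise tokens in a grid sized to the requested clip; every size in it is an assumption.

```python
import numpy as np

# The requested duration and resolution are chosen simply by how many noise
# patches are laid out; the same model denoises whichever grid it is given.
def init_noise_tokens(latent_frames, latent_h, latent_w, t=2, p=4, dim=128, seed=0):
    rng = np.random.default_rng(seed)
    grid = (latent_frames // t, latent_h // p, latent_w // p)
    return rng.standard_normal((int(np.prod(grid)), dim)), grid

# A widescreen request and a vertical one differ only in the grid layout.
wide, grid_w = init_noise_tokens(16, 72, 128)   # roughly 16:9 latent canvas
tall, grid_t = init_noise_tokens(16, 128, 72)   # roughly 9:16 latent canvas
print(grid_w, grid_t)                           # (8, 18, 32) (8, 32, 18)
```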
Scaling transformers for video generation
Sora is a diffusion model;[21-25] given input noisy patches (and conditioning information like text prompts), it's trained to predict the original "clean" patches. Importantly, Sora is a diffusion transformer.[26] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,[13,14] computer vision,[15-18] and image generation.[27-29]
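A minimal sketch of that objective, under stated assumptions: an off-the-shelf transformer encoder stands in for Sora's undisclosed architecture, and the noise-level conditioning a real diffusion model needs is omitted for brevity.

```python
import torch
import torch.nn as nn

# Noise clean latent patches, then train a transformer to predict the clean
# patches given the noisy ones plus conditioning tokens.
class DenoisingTransformer(nn.Module):
    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, cond_tokens):
        x = torch.cat([cond_tokens, noisy_tokens], dim=1)   # prepend conditioning
        x = self.backbone(x)
        return self.out(x[:, cond_tokens.shape[1]:])        # one prediction per patch

model = DenoisingTransformer()
clean = torch.randn(2, 256, 128)     # batch of clean patch sequences
cond = torch.randn(2, 8, 128)        # stand-in for text-prompt embeddings
sigma = torch.rand(2, 1, 1)          # random noise level per example
noisy = clean + sigma * torch.randn_like(clean)
loss = nn.functional.mse_loss(model(noisy, cond), clean)
loss.backward()
print(float(loss))
```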
In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.
Variable durations, resolutions, aspect ratios
Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4-second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.
Sampling flexibility
Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.
Improved framing and composition
We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
Language understanding
Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3[30] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.
Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.
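A sketch of that prompt-upsampling step. The instruction text and the `call_llm` interface are hypothetical stand-ins; OpenAI has not published either.

```python
# Expand a short user prompt into a detailed caption before it reaches the
# video model. `call_llm` is a placeholder for any instruction-following LLM.
UPSAMPLE_INSTRUCTIONS = (
    "Rewrite the user's short video prompt as one long, detailed caption. "
    "Describe the subject, setting, lighting, camera motion and mood."
)

def expand_prompt(short_prompt, call_llm):
    return call_llm(system=UPSAMPLE_INSTRUCTIONS, user=short_prompt)

def toy_llm(system, user):   # stand-in so the sketch runs without an API
    return f"{user}, golden-hour light, slow dolly-in, shallow depth of field"

detailed = expand_prompt("a corgi surfing at sunset", toy_llm)
print(detailed)              # the video model sees this caption, not the original
```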
Prompting with images and videos
All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.
Animating DALL·E images
Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2[31] and DALL·E 3[30] images.
Extending generated videos
Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts differently from the others, yet all four videos lead to the same ending.
We can use this method to extend a video both forward and backward to produce a seamless infinite loop.
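OpenAI does not disclose how extension is implemented. One common diffusion recipe that matches the described behavior is temporal inpainting: at each denoising step, clamp the tokens of the known segment and let the sampler fill in the rest. A toy sketch under that assumption:

```python
import numpy as np

# Temporal inpainting with a stand-in sampler: the token positions of the
# known segment are overwritten with a re-noised copy of the ground truth at
# every step, so only the missing earlier frames are invented.
def extend_backward(known_tokens, n_new, denoise_step, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    shape = (n_new + known_tokens.shape[0], known_tokens.shape[1])
    x = rng.standard_normal(shape)
    for i in range(steps):
        sigma = 1.0 - i / steps                    # toy noise schedule
        x[n_new:] = known_tokens + sigma * rng.standard_normal(known_tokens.shape)
        x = denoise_step(x, sigma)                 # one sampler step
    x[n_new:] = known_tokens                       # keep the given clip exact
    return x

denoise_step = lambda x, sigma: x * (1.0 - 0.02 * sigma)   # stand-in sampler
known = np.random.randn(128, 64)        # tokens of the generated segment
full = extend_backward(known, n_new=64, denoise_step=denoise_step)
print(full.shape)                       # (192, 64): new frames, then the clip
```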
Video-to-video editing
Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,[32] to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
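The core idea of SDEdit is to start sampling not from pure noise but from a partially noised copy of the input, so coarse structure and motion survive while style and setting change. A toy sketch; the linear blend simplifies the real noise schedule, and `denoise` is a hypothetical stand-in for the sampler.

```python
import numpy as np

# SDEdit in miniature: partially noise the input video's latent tokens, then
# denoise under the new prompt, resuming the sampler part-way along the
# noise schedule.
def sdedit(input_tokens, prompt, denoise, strength=0.6, seed=0):
    rng = np.random.default_rng(seed)
    noisy = (1 - strength) * input_tokens + strength * rng.standard_normal(input_tokens.shape)
    return denoise(noisy, prompt, start=strength)

denoise = lambda x, prompt, start: x                # stand-in for the sampler
video_tokens = np.random.randn(256, 128)            # tokens of the input video
edited = sdedit(video_tokens, "set it in a lush jungle", denoise)
print(edited.shape)
```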
Connecting videos
We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.
Image generation capabilities
Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
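In the patch scheme above, that amounts to a token grid whose temporal extent is a single latent frame. A minimal sketch with assumed sizes:

```python
import numpy as np

# An image is a video whose token grid spans one latent frame; patch size,
# token width and resolution are assumptions.
def init_image_tokens(latent_h, latent_w, p=4, dim=128, seed=0):
    rng = np.random.default_rng(seed)
    grid = (1, latent_h // p, latent_w // p)        # temporal extent: one frame
    return rng.standard_normal((int(np.prod(grid)), dim)), grid

tokens, grid = init_image_tokens(256, 256)          # e.g. a square latent canvas
print(grid, tokens.shape)                           # (1, 64, 64) (4096, 128)
```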
Close-up portrait shot of a woman in autumn, extreme detail, shallow depth of field
Vibrant coral reef teeming with colorful fish and sea creatures
Digital art of a young tiger under an apple tree in a matte painting style with gorgeous details
A snowy mountain village with cozy cabins and a northern lights display, high detail and photorealistic dslr, 50mm f/1.2
Emerging simulation capabilities
We find that video models exhibit a number of interesting emergent capabilities when trained at scale. These capabilities enable Sora to simulate some aspects of people, animals and environments from the physical world. These properties emerge without any explicit inductive biases for 3D, objects, etc.—they are purely phenomena of scale.
3D consistency. Sora can generate videos with dynamic camera motion. As the camera shifts and rotates, people and scene elements move consistently through three-dimensional space.
Long-range coherence and object permanence. A significant challenge for video generation systems has been maintaining temporal consistency when sampling long videos. We find that Sora is often, though not always, able to effectively model both short- and long-range dependencies. For example, our model can persist people, animals and objects even when they are occluded or leave the frame. Likewise, it can generate multiple shots of the same character in a single sample, maintaining their appearance throughout the video.
Interacting with the world. Sora can sometimes simulate actions that affect the state of the world in simple ways. For example, a painter can leave new strokes along a canvas that persist over time, or a man can eat a burger and leave bite marks.
Simulating digital worlds. Sora is also able to simulate artificial processes–one example is video games. Sora can simultaneously control the player in Minecraft with a basic policy while also rendering the world and its dynamics in high fidelity. These capabilities can be elicited zero-shot by prompting Sora with captions mentioning “Minecraft.”
These capabilities suggest that continued scaling of video models is a promising path towards the development of highly-capable simulators of the physical and digital world, and the objects, animals and people that live within them.
Discussion
Sora currently exhibits numerous limitations as a simulator. For example, it does not accurately model the physics of many basic interactions, like glass shattering. Other interactions, like eating food, do not always yield correct changes in object state. We enumerate other common failure modes of the model—such as incoherencies that develop in long duration samples or spontaneous appearances of objects—in our landing page.
We believe the capabilities Sora has today demonstrate that continued scaling of video models is a promising path towards the development of capable simulators of the physical and digital world, and the objects, animals and people that live within them.
Editor in charge: 張薇