China’s ByteDance Reveals World’s First Video Generation Tool Trained on Raw Visual Data
Lv Qian
DATE:  7 hours ago
/ SOURCE:  Yicai

(Yicai) Feb. 11 -- Chinese tech giant ByteDance yesterday made public VideoWorld, an experimental open-source video generation model. Unlike mainstream multimodal models such as US artificial intelligence firm OpenAI's Sora and DALL-E, which generate videos and images from text prompts, VideoWorld is the first in the industry to recognize and understand the world through purely visual input, such as unlabeled videos, without relying on text or language models.

Developed by ByteDance’s Doubao Large Language Model team, Beijing Jiaotong University and the University of Science and Technology of China, the AI tool is part of an academic research project that is exploring new technical approaches and has not yet been released as a finished product, company insiders told Yicai.

Models extract knowledge from video sequences far less efficiently than from text, mainly because videos contain a lot of redundant information, the Doubao team said. This led the team to develop VideoWorld, which learns efficiently from video by retaining rich visual information while compressing the visual changes that result from key decisions and actions.

VideoWorld is not the first video-based AI tool developed by ByteDance. Last week, ByteDance said it would soon release another multimodal video generation tool called OmniHuman, which can generate an AI video from a single picture and a single audio clip. OmniHuman is a closed-source model developed in-house by ByteDance, the parent company of TikTok.

The Beijing-based firm has also previously released the text-to-video generative model MagicVideo-V2 and the general multimodal large model UniDoc.

Other internet behemoths such as Alibaba Group Holding, Tencent Holdings and Kuaishou Technology have also recently launched video generation tools and disclosed their developments in the multimodal domain.

Competition among leading developers of multimodal LLMs is fierce, according to a research report by CITIC Securities. Video aligns better with the entertainment needs of end users, especially given its strong compatibility with the short-video industry, so it has greater potential to produce popular applications, although model performance still needs to improve.

Editor: Kim Taylor
 

Keywords:   ByteDance,Doubao,VideoWorld,LLM