Alibaba Drops Another Open-Source Bombshell! Wan-S2V Film-Grade Generation: Camera Moves, Lip Sync, Expressions, and Gestures All Synced to Audio

快播影视 · Mainland Film · 2025-08-29 00:24


Fresh off open-sourcing Wan2.2, one of the strongest image- and text-to-video models around, Alibaba has pulled another big move.

Today's video generation models are already absurdly good, but true film-grade generation is still missing one piece: flawless lip sync.

On the evening of the 27th, Alibaba's Tongyi Lab suddenly released Wan-S2V, an audio-driven cinematic video generation model.

And this one... actually looks usable.

Give it a single image and an audio clip, and it generates a film-grade video in which lip movements, facial expressions, and hand gestures are all synchronized with the audio, with control over camera movement and shot language on top.

Let's look at two official examples.

Prompt: "In the video, a woman stood on the deck of a sailing boat and sang loudly. The background was the choppy sea and the thundering sky. It was raining heavily in the sky, the ship swayed, the camera swayed, and the waves splashed everywhere, creating a heroic atmosphere. The woman has long dark hair, part of which is wet by rain. Her expression is serious and firm, her eyes are sharp, and she seems to be staring at the distance or thinking."

Prompt: "In the video, it is raining heavily. It shows a man who is topless and has clear muscle lines, showing good physical fitness and strength. The rain wet his whole body. His arms were open and he was singing happily. His expression was engaged and his hands were slowly extended. The man's head is slightly raised, his eyes are upward, and his mouth is slightly open. His expression was full of surprise and expectation, giving a feeling that something important was about to happen."

How is this achieved?

Through a collaborative "text + audio" control mechanism.

Text handles the big picture: how the camera moves, roughly where characters walk, how entities interact. Audio handles the details: how the lips move, whether the eyes blink, how the hands gesture, which way the head tilts.
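The division of labor above can be sketched as a simple conditioning scheme. This is an illustrative assumption, not the actual Wan-S2V architecture: the text prompt contributes one clip-level condition, while the audio contributes one feature vector per frame, and every generated frame sees both.

```python
# Hedged sketch (assumed, simplified): one global text condition per clip,
# one local audio feature per frame; each frame is conditioned on both.

def build_frame_conditions(text_cond, audio_feats):
    """Pair the clip-level text condition with per-frame audio features."""
    return [{"global": text_cond, "local": feat, "frame": i}
            for i, feat in enumerate(audio_feats)]

conds = build_frame_conditions(
    text_cond="camera sways with the ship, woman sings on deck",
    audio_feats=[[0.2, 0.7], [0.3, 0.6], [0.1, 0.9]],  # e.g. mel-style features
)
print(len(conds), conds[0]["frame"], conds[-1]["local"])
```

The point of the split is that coarse, slow-changing instructions (camera, blocking) live in one shared condition, while fast, frame-accurate signals (lip shape, blinks) ride on the audio track.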

Even more impressive, it can generate long videos.

Wan-S2V adopts a FramePack-like compression technique that efficiently encodes historical frame information, enabling long-term consistency. For example, you can ask it for a character who turns around, walks, and then sits down in one continuous clip, and it actually keeps the motion fluid, the character undistorted, and the background stable.
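The core FramePack idea can be sketched in a few lines. This is a simplified assumption, not Wan-S2V's actual encoder: recent history frames keep a full token budget while older frames are compressed progressively harder, so the total context stays bounded as the clip grows.

```python
# Hedged sketch of FramePack-style history compression (assumed):
# the newest frame gets the full token budget, and each older frame
# gets half the budget of the one after it.

def pack_history(frames, base_tokens=256):
    """Assign a token budget to each past frame: halve it per step of age."""
    packed = []
    for age, frame in enumerate(reversed(frames)):   # age 0 = newest frame
        budget = max(base_tokens >> age, 1)          # 256, 128, 64, ...
        packed.append((frame, budget))
    return list(reversed(packed))                    # back to original order

for frame, budget in pack_history(["f0", "f1", "f2", "f3"]):
    print(frame, budget)  # f0 32, f1 64, f2 128, f3 256
```

With a geometric schedule like this, doubling the number of history frames adds only a constant number of extra tokens, which is what makes long clips tractable.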

That means scenarios that demand long-horizon temporal consistency, such as short-drama production and virtual-presenter videos, suddenly have a solution.

On the EMTD dataset, the research team benchmarked Wan-S2V against nearly every mainstream audio-driven video generation method. It leads across almost every metric: frame quality (FID), video coherence (FVD), structural similarity (SSIM), and identity consistency (CSIM).

Particularly notable is the Sync-C metric, which measures lip-sync quality: Wan-S2V scores 4.51, essentially tying EMO2, a model built specifically for lip sync. And it does so inside a full video generation model that is simultaneously controlling camera, motion, and expression.

How do you train a model this strong?

The answer is a bit of brute force: they built an unprecedentedly large, high-quality human-video dataset.

The team not only filtered millions of videos from open datasets such as OpenHumanVid and Koala36M, but also manually annotated a large number of clips featuring complex human activities such as public speaking, singing, and dancing.

And that's not all. They also ran five rounds of quality filtering:

- Clarity assessment (Dover)
- Motion stability (UniMatch optical-flow analysis)
- Face and hand sharpness (Laplacian operator)
- Aesthetic quality (an improved aesthetic predictor)
- Subtitle-occlusion detection (OCR + face recognition)
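A cascade like the one above can be sketched as a sequence of filter stages that every clip must pass in order. The real filters (Dover, UniMatch, and so on) are heavyweight models; the predicates and field names below are purely illustrative stand-ins.

```python
# Illustrative sketch of a multi-round filter cascade (assumed, simplified):
# each stage drops clips that fail its check; only clips passing every
# stage survive, mirroring the five-round pipeline described above.

def run_filters(clips, stages):
    """Keep only clips that survive every (name, predicate) stage in order."""
    survivors = list(clips)
    for name, predicate in stages:
        survivors = [c for c in survivors if predicate(c)]
        print(f"after {name}: {len(survivors)} clips")
    return survivors

stages = [
    ("clarity",   lambda c: c["sharpness"] > 0.5),   # stand-in for Dover
    ("motion",    lambda c: c["flow_ok"]),           # stand-in for UniMatch
    ("subtitles", lambda c: not c["has_overlay"]),   # stand-in for OCR check
]
clips = [
    {"sharpness": 0.9, "flow_ok": True, "has_overlay": False},
    {"sharpness": 0.3, "flow_ok": True, "has_overlay": False},
    {"sharpness": 0.8, "flow_ok": True, "has_overlay": True},
]
kept = run_filters(clips, stages)
print(len(kept))  # 1
```

Ordering cheap checks first is the usual design choice in such cascades, since each stage only pays for clips that survived the previous ones.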

Finally, they used QwenVL2.5-72B to generate detailed captions for every video, covering camera angles, action breakdowns, environmental features, and more.

The dataset is not just big; it's high quality.

Of course, Wan-S2V is not without weaknesses.

In extremely complex multi-person interaction scenes, for instance, it still occasionally produces hand artifacts or unfocused gazes.

Given how diligently Alibaba iterates, I expect an even more polished film-grade generation model before long.

And crucially, it's fully open source.

Source: 算泥社区
