Text-to-video AI is here, but it’s far from perfect.
Jiebo Luo, the Albert Arendt Hopeman Professor of Engineering and professor of computer science at the University of Rochester, identifies a solution to one of its key challenges: depicting physical transformation.
Jiebo Luo is the Albert Arendt Hopeman Professor of Engineering and Professor of Computer Science at the University of Rochester. He is a Fellow of IEEE, ACM, AAAI, SPIE, NAI, IAPR, and AIMBE, as well as a foreign member of Academia Europaea. His research focuses on computer vision, multimodal AI, data-driven social computing, and digital health. He has authored over 600 technical papers and holds more than 90 U.S. patents.
Text-to-Video AI Blossoms With New Metamorphic Video Capabilities
Imagine asking AI to generate a video that illustrates a seed emerging from the earth or an ice cube melting into a single droplet. Conventional text-to-video systems rarely manage that kind of intrinsic change. Trained largely on general videos that show little physical variation, they default to smooth camera moves or fades instead of showing matter transform.
To teach models the dynamics of transformation, our research team curated the ChronoMagic dataset, which includes more than 2,000 time-lapse videos with meticulous captions depicting dough rising, steel rusting, flowers blossoming, and more.
Building on that dataset, we developed MagicTime, a diffusion model purpose-built for illustrating metamorphosis. Training unfolds in two stages. A lightweight spatial adapter first learns static state information; a temporal adapter then encodes dynamic stage changes, guided by a dynamic frame selection strategy that highlights decisive instants, such as the moment a crust splits or a leaf uncurls. A text encoder deepens the model’s grasp of verbs such as “sprout,” “melt,” and “rust,” aligning visual change with the corresponding captions.
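To make the frame selection idea concrete, here is a minimal sketch of one plausible heuristic: sample frames so they cluster where the scene changes fastest. The function name and the pixel-difference score are illustrative assumptions, not MagicTime’s published implementation.

```python
import numpy as np

def select_dynamic_frames(video: np.ndarray, k: int = 16) -> np.ndarray:
    """Pick up to k frame indices concentrated on moments of greatest change.

    video: array of shape (T, H, W, C) with pixel values in [0, 255].
    Illustrative heuristic only; the actual MagicTime strategy may differ.
    """
    frames = video.astype(np.float32)
    T = frames.shape[0]

    # Per-step change score: mean absolute pixel difference between
    # consecutive frames.
    diffs = np.abs(frames[1:] - frames[:-1]).mean(axis=(1, 2, 3))

    # Cumulative change, normalized to [0, 1]. Fast transformations
    # (a crust splitting, a leaf uncurling) occupy a large share of
    # this curve; static stretches occupy almost none.
    cum = np.concatenate([[0.0], np.cumsum(diffs)])
    cum /= max(cum[-1], 1e-8)

    # Invert the curve: place k evenly spaced targets on the change
    # axis and take the first frame that reaches each one, so sampled
    # frames cluster around the decisive instants.
    targets = np.linspace(0.0, 1.0, k)
    idx = np.clip(np.searchsorted(cum, targets, side="left"), 0, T - 1)
    return np.unique(idx)  # may return fewer than k on short clips
```

Uniform sampling would spend most of its budget on the long, nearly static phases of a time-lapse; weighting by accumulated change spends it where the metamorphosis actually happens.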
Prompted with “a yellow ranunculus bud opening into full bloom,” MagicTime delivers a coherent two-second video at 512×512 resolution in which petals peel back, stretch, and settle, without abrupt changes or a drifting viewpoint. In blind tests involving 200 volunteers, MagicTime outperformed eight other advanced video generation models.
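For readers who want to try a comparable workflow, the sketch below shows a hypothetical invocation in the style of a Hugging Face diffusers text-to-video pipeline. The model id, frame count, and frame rate are illustrative assumptions, not the project’s documented interface.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Placeholder model id; whether MagicTime loads through the generic
# DiffusionPipeline entry point is an assumption of this sketch.
pipe = DiffusionPipeline.from_pretrained(
    "your-org/metamorphic-t2v",
    torch_dtype=torch.float16,
).to("cuda")

frames = pipe(
    "a yellow ranunculus bud opening into full bloom",
    height=512,
    width=512,
    num_frames=16,            # roughly two seconds at 8 fps (assumed rate)
    num_inference_steps=25,
).frames[0]

export_to_video(frames, "bloom.mp4")
```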
By modeling not only how things look but also how they change over time, MagicTime brings AI a step closer to reasoning about the physical world. Metamorphic video simulation offers more than visual spectacle. With accurate generated transformations, researchers such as biologists and chemists could rapidly explore hypotheses with AI models before committing to costly physical trials, accelerating iteration cycles and reducing the number of live tests required.
Read More:
[Rochester] – Text-to-video AI blossoms with new metamorphic video capabilities
MagicTime

