Unlike text, image and video data are often high-dimensional and carry inherent spatial structure. As a result, applying LLMs to image/video data is not straightforward.

In this blog, we will introduce diffusion models and their variants, which have been widely used for image and video generation.
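
Before going section by section, it helps to pin down what "the diffusion model" refers to. The sketch below shows the two halves of a DDPM-style model: a fixed forward process that gradually corrupts an image with Gaussian noise, and a learned reverse process that removes that noise step by step. `eps_model` is a hypothetical stand-in for a trained noise-prediction network, so treat this as a minimal sketch of the math rather than a working generator.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # a_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
    `t` is a LongTensor of timesteps, one per image in the batch."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def p_sample(eps_model, x_t, t):
    """One reverse step at integer timestep `t`: recover the mean from the
    predicted noise, then add schedule-scaled noise (except at t = 0)."""
    eps = eps_model(x_t, t)
    a_bar = alphas_cumprod[t]
    mean = (x_t - betas[t] / (1.0 - a_bar).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```

Training reduces to a simple regression: sample `t` and `noise`, form `q_sample(x0, t, noise)`, and minimize the mean-squared error between `noise` and the network's prediction. Common variants (latent diffusion, rectified flow, few-step distillation) change the space, the noising path, or the number of reverse steps, but keep this basic structure.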

Table of contents

  • Image Synthesis
  • Video Synthesis
  • Text to 3D
  • Conclusion

Image Synthesis

Video Synthesis

Text to 3D

Although 2D synthesis has made significant progress, view consistency remains a challenge: independently generated views of the same object do not necessarily agree with one another. To bridge this gap, some works propose generating 3D objects directly from text descriptions, most commonly by distilling a pretrained 2D diffusion prior into a 3D representation (sketched after the list below).

3D synthesis is usually a challenging task for two main reasons:

  • High-quality 3D training data is limited.
  • It is hard to align different modalities (e.g., text, images, and geometry) in 3D space.
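
Most text-to-3D works sidestep these two problems by not training a 3D generative model at all: they optimize a single 3D representation against a frozen, pretrained 2D diffusion model. Below is a minimal sketch of score distillation sampling (SDS), the optimization loop popularized by DreamFusion that most of these pipelines build on. `render` and `diffusion_eps` are hypothetical placeholders for a differentiable renderer and a text-conditioned 2D noise predictor; this is an illustrative sketch under those assumptions, not any particular paper's implementation.

```python
import torch

# Same linear noise schedule as in the DDPM sketch above.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def sds_step(params, optimizer, render, diffusion_eps, text_emb):
    """One SDS update: render a random view of the 3D scene, noise it, and
    nudge the 3D parameters so the rendering agrees with the 2D prior."""
    optimizer.zero_grad()
    image = render(params)                    # differentiable render, (1, 3, H, W)
    t = int(torch.randint(20, 980, (1,)))     # random mid-range timestep
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(image)
    # Forward-noise the rendering exactly as in q(x_t | x_0).
    noisy = a_bar.sqrt() * image + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():                     # the 2D prior stays frozen
        eps_pred = diffusion_eps(noisy, t, text_emb)
    # SDS gradient: w(t) * (predicted noise - true noise), with w(t) = 1 - a_bar_t.
    grad = (1.0 - a_bar) * (eps_pred - noise)
    image.backward(gradient=grad)             # flows into the 3D params only
    optimizer.step()
```

The key design choice is that gradients never touch the diffusion model itself: only `params` (NeRF weights, Gaussian-splat attributes, etc.) are updated, over many iterations and random camera poses, so the 2D prior's knowledge is distilled into a view-consistent 3D asset.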

Conclusion

