Unlike text, image and video data are often high-dimensional and carry inherent spatial structure. As a result, applying LLMs to image/video data is not straightforward.

In this blog, we will introduce diffusion models and their variants, which have been widely used for image and video generation.
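
Before going section by section, it helps to pin down what "the diffusion model" refers to. The sketch below shows the two halves of a DDPM-style model: a fixed forward process that gradually corrupts an image with Gaussian noise, and a learned reverse process that removes that noise step by step. `eps_model` is a hypothetical stand-in for a trained noise-prediction network, so treat this as a minimal sketch of the math rather than a working generator.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)  # a_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, noise):
    """Forward process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
    `t` is a LongTensor of timesteps, one per image in the batch."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def p_sample(eps_model, x_t, t):
    """One reverse step at integer timestep `t`: recover the mean from the
    predicted noise, then add schedule-scaled noise (except at t = 0)."""
    eps = eps_model(x_t, t)
    a_bar = alphas_cumprod[t]
    mean = (x_t - betas[t] / (1.0 - a_bar).sqrt() * eps) / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)
```

Training reduces to a simple regression: sample `t` and `noise`, form `q_sample(x0, t, noise)`, and minimize the mean-squared error between `noise` and the network's prediction. Common variants (latent diffusion, rectified flow, few-step distillation) change the space, the noising path, or the number of reverse steps, but keep this basic structure.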

Table of contents

  • Image Synthesis
  • Video Synthesis
  • Text to 3D
  • Conclusion

Image Synthesis

Video Synthesis

Text to 3D

Although 2D synthesis has made significant progress, view consistency remains a challenge: independently generated views of the same object do not necessarily agree with one another. To bridge this gap, some works propose generating 3D objects directly from text descriptions, most commonly by distilling a pretrained 2D diffusion prior into a 3D representation (sketched after the list below).

3D synthesis is usually a challenging task for two main reasons:

  • High-quality 3D training data is limited.
  • It is hard to align different modalities (e.g., text, images, and geometry) in 3D space.
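
Most text-to-3D works sidestep these two problems by not training a 3D generative model at all: they optimize a single 3D representation against a frozen, pretrained 2D diffusion model. Below is a minimal sketch of score distillation sampling (SDS), the optimization loop popularized by DreamFusion that most of these pipelines build on. `render` and `diffusion_eps` are hypothetical placeholders for a differentiable renderer and a text-conditioned 2D noise predictor; this is an illustrative sketch under those assumptions, not any particular paper's implementation.

```python
import torch

# Same linear noise schedule as in the DDPM sketch above.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def sds_step(params, optimizer, render, diffusion_eps, text_emb):
    """One SDS update: render a random view of the 3D scene, noise it, and
    nudge the 3D parameters so the rendering agrees with the 2D prior."""
    optimizer.zero_grad()
    image = render(params)                    # differentiable render, (1, 3, H, W)
    t = int(torch.randint(20, 980, (1,)))     # random mid-range timestep
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(image)
    # Forward-noise the rendering exactly as in q(x_t | x_0).
    noisy = a_bar.sqrt() * image + (1.0 - a_bar).sqrt() * noise
    with torch.no_grad():                     # the 2D prior stays frozen
        eps_pred = diffusion_eps(noisy, t, text_emb)
    # SDS gradient: w(t) * (predicted noise - true noise), with w(t) = 1 - a_bar_t.
    grad = (1.0 - a_bar) * (eps_pred - noise)
    image.backward(gradient=grad)             # flows into the 3D params only
    optimizer.step()
```

The key design choice is that gradients never touch the diffusion model itself: only `params` (NeRF weights, Gaussian-splat attributes, etc.) are updated, over many iterations and random camera poses, so the 2D prior's knowledge is distilled into a view-consistent 3D asset.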

Conclusion

