Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models
June 08, 2023
Nan Liu, Yilun Du, Shuang Li, Joshua B. Tenenbaum, Antonio Torralba
Text-to-image generative models have enabled high-resolution image synthesis
across different domains, but require users to specify the content they wish to
generate. In this paper, we consider the inverse problem – given a collection
of different images, can we discover the generative concepts that represent
each image? We present an unsupervised approach to discover generative concepts
from a collection of images, disentangling different art styles in paintings,
objects, and lighting from kitchen scenes, and discovering image classes given
ImageNet images. We show how such generative concepts can accurately represent
the content of images, be recombined and composed to generate new artistic and
hybrid images, and be further used as a representation for downstream
classification tasks.
PriSampler: Mitigating Property Inference of Diffusion Models
June 08, 2023
Hailong Hu, Jun Pang
Diffusion models have been remarkably successful in data synthesis. Such
successes have also driven the application of diffusion models to sensitive
data, such as human face data, which may raise severe privacy concerns. In this
work, we systematically present the first privacy study about property
inference attacks against diffusion models, in which adversaries aim to extract
sensitive global properties of the training set from a diffusion model, such as
the proportion of the training data for certain sensitive properties.
Specifically, we consider the most practical attack scenario: adversaries are
only allowed to obtain synthetic data. Under this realistic scenario, we
evaluate the property inference attacks on different types of samplers and
diffusion models. A broad range of evaluations shows that various diffusion
models and their samplers are all vulnerable to property inference attacks.
Furthermore, one case study on off-the-shelf pre-trained diffusion models also
demonstrates the effectiveness of the attack in practice. Finally, we propose a
new model-agnostic plug-in method PriSampler to mitigate the property inference
of diffusion models. PriSampler can be directly applied to well-trained
diffusion models and supports both stochastic and deterministic sampling.
Extensive experiments demonstrate the effectiveness of our defense: it drives
adversaries' inferred property proportions toward random guesses. PriSampler
also significantly outperforms diffusion models trained with differential
privacy in both model utility and defense performance.
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
June 08, 2023
Yuseung Lee, Kunho Kim, Hyunjin Kim, Minhyuk Sung
The remarkable capabilities of pretrained image diffusion models have been
utilized not only for generating fixed-size images but also for creating
panoramas. However, naive stitching of multiple images often results in visible
seams. Recent techniques have attempted to address this issue by performing
joint diffusions in multiple windows and averaging latent features in
overlapping regions. However, these approaches, which focus on seamless montage
generation, often yield incoherent outputs by blending different scenes within
a single image. To overcome this limitation, we propose SyncDiffusion, a
plug-and-play module that synchronizes multiple diffusions through gradient
descent from a perceptual similarity loss. Specifically, we compute the
gradient of the perceptual loss using the predicted denoised images at each
denoising step, providing meaningful guidance for achieving coherent montages.
Our experimental results demonstrate that our method produces significantly
more coherent outputs compared to previous methods (66.35% vs. 33.65% in our
user study) while still maintaining fidelity (as assessed by GIQA) and
compatibility with the input prompt (as measured by CLIP score).
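To make the mechanism concrete, the following is a minimal sketch of one synchronization step, not the authors' implementation: each window's latent is nudged down the gradient of a perceptual loss between its predicted clean image and that of an anchor window. The helpers predict_x0, decode, and lpips are assumed placeholders.

```python
import torch

def sync_guidance_step(latents, t, predict_x0, decode, lpips, anchor_idx=0, weight=20.0):
    """Nudge every window's latent toward perceptual agreement with an anchor window.

    latents: list of per-window latents x_t; predict_x0(latent, t) -> predicted clean latent;
    decode(latent) -> image; lpips(img_a, img_b) -> perceptual distance. All placeholders.
    """
    latents = [l.detach().clone().requires_grad_(True) for l in latents]
    x0_imgs = [decode(predict_x0(l, t)) for l in latents]      # "predicted denoised images"
    anchor = x0_imgs[anchor_idx].detach()
    updated = []
    for i, (lat, x0) in enumerate(zip(latents, x0_imgs)):
        if i == anchor_idx:
            updated.append(lat.detach())
            continue
        loss = lpips(x0, anchor).mean()                        # perceptual similarity loss
        grad = torch.autograd.grad(loss, lat)[0]               # gradient w.r.t. this window's latent
        updated.append((lat - weight * grad).detach())         # gradient-descent synchronization
    return updated
```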
Multi-Architecture Multi-Expert Diffusion Models
June 08, 2023
Yunsung Lee, Jin-Young Kim, Hyojun Go, Myeongho Jeong, Shinhyeok Oh, Seungtaek Choi
Diffusion models have achieved impressive results in generating diverse and
realistic data by employing multi-step denoising processes. However, the need
for accommodating significant variations in input noise at each time-step has
led to diffusion models requiring a large number of parameters for their
denoisers. We have observed that diffusion models effectively act as filters
for different frequency ranges at each noise level (time step). While some previous
works have introduced multi-expert strategies, assigning denoisers to different
noise intervals, they overlook the importance of specialized operations for
high and low frequencies. For instance, self-attention operations are effective
at handling low-frequency components (low-pass filters), while convolutions
excel at capturing high-frequency features (high-pass filters). In other words,
existing diffusion models employ denoisers with the same architecture, without
considering the optimal operations for the noise at each time step. To address this
limitation, we propose a novel approach called Multi-architecturE Multi-Expert
(MEME), which consists of multiple experts with specialized architectures
tailored to the operations required at each time-step interval. Through
extensive experiments, we demonstrate that MEME outperforms large competitors
in terms of both generation performance and computational efficiency.
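The routing idea, assigning a differently designed expert denoiser to each timestep interval, can be sketched roughly as follows; this is an illustration of interval-based expert dispatch, not the MEME architecture itself, and the expert modules are assumed placeholders.

```python
import torch
import torch.nn as nn

class MultiExpertDenoiser(nn.Module):
    """Dispatch each timestep to the expert responsible for its interval.

    experts: list of denoiser modules (possibly with different architectures);
    boundaries: sorted interval edges, e.g. [250, 500, 750] splits T=1000 into four bins.
    """
    def __init__(self, experts, boundaries):
        super().__init__()
        assert len(experts) == len(boundaries) + 1
        self.experts = nn.ModuleList(experts)
        self.register_buffer("boundaries", torch.as_tensor(boundaries))

    def forward(self, x_t, t):
        idx = torch.bucketize(t, self.boundaries)   # interval index per sample
        out = torch.empty_like(x_t)
        for i, expert in enumerate(self.experts):
            mask = idx == i
            if mask.any():
                out[mask] = expert(x_t[mask], t[mask])
        return out
```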
Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning
June 08, 2023
Jifeng Hu, Yanchao Sun, Sili Huang, SiYuan Guo, Hechang Chen, Li Shen, Lichao Sun, Yi Chang, Dacheng Tao
Recent works have shown the potential of diffusion models in computer vision
and natural language processing. Apart from the classical supervised learning
fields, diffusion models have also shown strong competitiveness in
reinforcement learning (RL) by formulating decision-making as sequential
generation. However, incorporating temporal information of sequential data and
utilizing it to guide diffusion models to perform better generation is still an
open challenge. In this paper, we take one step forward to investigate
controllable generation with temporal conditions that are refined from temporal
information. We observe the importance of temporal conditions in sequential
generation in sufficiently explorative scenarios and provide a comprehensive
discussion and comparison of different temporal conditions. Based on the
observations, we propose an effective temporally-conditional diffusion model
coined Temporally-Composable Diffuser (TCD), which extracts temporal
information from interaction sequences and explicitly guides generation with
temporal conditions. Specifically, we separate the sequences into three parts
according to time expansion and identify historical, immediate, and prospective
conditions accordingly. Each condition preserves non-overlapping temporal
information of sequences, enabling more controllable generation when we jointly
use them to guide the diffuser. Finally, we conduct extensive experiments and
analysis to reveal the favorable applicability of TCD in offline RL tasks,
where our method matches the best performance of prior SOTA baselines.
Interpreting and Improving Diffusion Models Using the Euclidean Distance Function
June 08, 2023
Frank Permenter, Chenyang Yuan
cs.LG, cs.CV, math.OC, stat.ML
Denoising is intuitively related to projection. Indeed, under the manifold
hypothesis, adding random noise is approximately equivalent to orthogonal
perturbation. Hence, learning to denoise is approximately learning to project.
In this paper, we use this observation to reinterpret denoising diffusion
models as approximate gradient descent applied to the Euclidean distance
function. We then provide a straightforward convergence analysis of the DDIM
sampler under simple assumptions on the projection-error of the denoiser.
Finally, we propose a new sampler based on two simple modifications to DDIM
using insights from our theoretical results. In as few as 5-10 function
evaluations, our sampler achieves state-of-the-art FID scores on pretrained
CIFAR-10 and CelebA models and can generate high quality samples on latent
diffusion models.
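Under this view, a deterministic DDIM step can be read as forming an approximate projection (the predicted clean image) and then moving toward it along the noise direction. A minimal sketch, assuming a standard epsilon-prediction network and cumulative schedule alpha_bar:

```python
import torch

@torch.no_grad()
def ddim_step(x_t, t, t_prev, alpha_bar, eps_model):
    """One deterministic DDIM step in the 'denoising as projection' reading.

    alpha_bar: 1-D tensor of cumulative schedule values; eps_model(x, t) predicts noise.
    """
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # approximate projection onto the data manifold
    # Move toward the projection while rescaling the retained noise to the new level.
    return a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
```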
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
June 07, 2023
Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang
The rapid advancement of generative models, facilitating the creation of
hyper-realistic images from textual descriptions, has concurrently escalated
critical societal concerns such as misinformation. Traditional fake detection
mechanisms, although providing some mitigation, fall short in attributing
responsibility for the malicious use of synthetic images. This paper introduces
a novel approach to model fingerprinting that assigns responsibility for the
generated images, thereby serving as a potential countermeasure to model
misuse. Our method modifies generative models based on each user’s unique
digital fingerprint, imprinting a unique identifier onto the resultant content
that can be traced back to the user. This approach, incorporating fine-tuning
into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates
near-perfect attribution accuracy with a minimal impact on output quality. We
rigorously scrutinize our method’s secrecy under two distinct scenarios: one
where a malicious user attempts to detect the fingerprint, and another where a
user possesses a comprehensive understanding of our method. We also evaluate
the robustness of our approach against various image post-processing
manipulations typically executed by end-users. Through extensive evaluation of
the Stable Diffusion models, our method presents a promising and novel avenue
for accountable model distribution and responsible use.
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
June 07, 2023
Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang
The ability to understand visual concepts and replicate and compose these
concepts from images is a central goal for computer vision. Recent advances in
text-to-image (T2I) models have led to high-definition and realistic image
quality generation by learning from large databases of images and their
descriptions. However, the evaluation of T2I models has focused on photorealism
and limited qualitative measures of visual understanding. To quantify the
ability of T2I models in learning and synthesizing novel visual concepts, we
introduce ConceptBed, a large-scale dataset that consists of 284 unique visual
concepts, 5K unique concept compositions, and 33K composite text prompts. Along
with the dataset, we propose an evaluation metric, Concept Confidence Deviation
(CCD), that uses the confidence of oracle concept classifiers to measure the
alignment between concepts generated by T2I generators and concepts contained
in ground truth images. We evaluate visual concepts that are either objects,
attributes, or styles, and also evaluate four dimensions of compositionality:
counting, attributes, relations, and actions. Our human study shows that CCD is
highly correlated with human understanding of concepts. Our results point to a
trade-off between learning the concepts and preserving the compositionality
which existing approaches struggle to overcome.
On the Design Fundamentals of Diffusion Models: A Survey
June 07, 2023
Ziyi Chang, George A. Koulieris, Hubert P. H. Shum
Diffusion models are generative models, which gradually add and remove noise
to learn the underlying distribution of training data for data generation. The
components of diffusion models have gained significant attention with many
design choices proposed. Existing reviews have primarily focused on
higher-level solutions, thereby covering less of the design fundamentals of
components. This study seeks to address this gap by providing a comprehensive
and coherent review on component-wise design choices in diffusion models.
Specifically, we organize this review according to the three key components of
diffusion models, namely the forward process, the reverse process, and the
sampling procedure.
This allows us to provide a fine-grained perspective of diffusion models,
benefiting future studies in the analysis of individual components, the
applicability of design choices, and the implementation of diffusion models.
Multi-modal Latent Diffusion
June 07, 2023
Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Multi-modal datasets are ubiquitous in modern applications, and multi-modal
Variational Autoencoders are a popular family of models that aim to learn a
joint representation of the different modalities. However, existing approaches
suffer from a coherence-quality tradeoff, where models with good generation
quality lack generative coherence across modalities, and vice versa. We discuss
the limitations underlying the unsatisfactory performance of existing methods,
to motivate the need for a different approach. We propose a novel method that
uses a set of independently trained, uni-modal, deterministic autoencoders.
Individual latent variables are concatenated into a common latent space, which
is fed to a masked diffusion model to enable generative modeling. We also
introduce a new multi-time training method to learn the conditional score
network for multi-modal diffusion. Our methodology substantially outperforms
competitors in both generation quality and coherence, as shown through an
extensive experimental campaign.
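As a rough illustration of the masked-diffusion input described above (not the paper's code), the concatenated latent can be noised while the latents of conditioning modalities are kept clean; all names are placeholders.

```python
import torch

def masked_noising(z_list, t, alpha_bar, cond_mask):
    """Build the masked-diffusion input for a set of uni-modal latents.

    z_list: clean latents from independently trained uni-modal autoencoders;
    cond_mask: per-modality booleans, True = conditioning modality kept clean,
    False = modality to be generated (noised). Names are illustrative only.
    """
    z = torch.cat(z_list, dim=-1)                        # shared concatenated latent
    noise = torch.randn_like(z)
    a_t = alpha_bar[t].view(-1, *([1] * (z.dim() - 1)))
    z_noisy = a_t.sqrt() * z + (1 - a_t).sqrt() * noise  # standard forward noising
    pieces, start = [], 0
    for z_i, keep in zip(z_list, cond_mask):
        end = start + z_i.shape[-1]
        pieces.append(z_i if keep else z_noisy[..., start:end])
        start = end
    return torch.cat(pieces, dim=-1)
```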
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance
June 07, 2023
Gihyun Kwon, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion models have shown significant progress in image translation tasks
recently. However, due to their stochastic nature, there’s often a trade-off
between style transformation and content preservation. Current strategies aim
to disentangle style and content, preserving the source image’s structure while
successfully transitioning from a source to a target domain under text or
one-shot image conditions. Yet, these methods often require computationally
intensive fine-tuning of diffusion models or additional neural networks. To
address these challenges, here we present an approach that guides the reverse
process of diffusion sampling by applying asymmetric gradient guidance. This
results in quicker and more stable image manipulation for both text-guided and
image-guided image translation. Our model’s adaptability allows it to be
implemented with both image- and latent-diffusion models. Experiments show that
our method outperforms various state-of-the-art models in image translation
tasks.
ScoreCL: Augmentation-Adaptive Contrastive Learning via Score-Matching Function
June 07, 2023
JinYoung Kim, Soonwoo Kwon, Hyojun Go, Yunsung Lee, Seungtaek Choi
Self-supervised contrastive learning (CL) has achieved state-of-the-art
performance in representation learning by minimizing the distance between
positive pairs while maximizing that of negative ones. Recently, it has been
verified that the model learns better representations with diversely augmented
positive pairs because they enable the model to be more view-invariant.
However, only a few studies on CL have considered the differences between
augmented views, and they have not gone beyond hand-crafted heuristics. In this
paper, we first observe that the score-matching function can measure how much
data has changed from the original through augmentation. With the observed
property, every pair in CL can be weighted adaptively by the difference in
score values, boosting the performance of existing CL methods.
We show the generality of our method, referred to as ScoreCL, by consistently
improving various CL methods, SimCLR, SimSiam, W-MSE, and VICReg, up to 3%p in
k-NN evaluation on CIFAR-10, CIFAR-100, and ImageNet-100. Moreover, we have
conducted exhaustive experiments and ablations, including results on diverse
downstream tasks, comparison with possible baselines, and improvement when used
with other proposed augmentation methods. We hope our exploration will inspire
more research in exploiting the score matching for CL.
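A minimal sketch of the adaptive weighting idea, assuming access to a pretrained score network score_model(x, sigma); the exact weighting scheme here is illustrative, not the ScoreCL formulation:

```python
import torch

def scorecl_weights(score_model, view1, view2, sigma=0.1):
    """Adaptive per-pair weights from a score function.

    The score norm serves as a proxy for how far each augmented view drifted from
    the data distribution; pairs whose views differ more receive larger weights.
    """
    with torch.no_grad():
        s1 = score_model(view1, sigma).flatten(1).norm(dim=1)
        s2 = score_model(view2, sigma).flatten(1).norm(dim=1)
    diff = (s1 - s2).abs()
    return 1.0 + diff / (diff.mean() + 1e-8)   # batch-normalized weights >= ~1

# Example usage: loss = (scorecl_weights(score_model, v1, v2) * per_pair_loss).mean()
```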
A Survey on Generative Diffusion Models for Structured Data
June 07, 2023
Heejoon Koo
In recent years, generative diffusion models have achieved a rapid paradigm
shift in deep generative models by showing groundbreaking performance across
various applications. Meanwhile, structured data, encompassing tabular and time
series data, has received comparatively limited attention from the deep
learning research community, despite its omnipresence and extensive
applications. Thus, there is still a lack of literature reviewing
structured data modelling via diffusion models, compared to other data
modalities such as computer vision and natural language processing. Hence, in
this paper, we present a comprehensive review of recently proposed diffusion
models in the field of structured data. First, this survey provides a concise
overview of the score-based diffusion model theory, subsequently proceeding to
the technical descriptions of the majority of pioneering works using structured
data in both data-driven general tasks and domain-specific applications.
Thereafter, we analyse and discuss the limitations and challenges shown in
existing works and suggest potential research directions. We hope this review
serves as a catalyst for the research community, promoting the developments in
generative diffusion models for structured data.
Phoenix: A Federated Generative Diffusion Model
June 07, 2023
Fiona Victoria Stanley Jothiraj, Afra Mashhadi
Generative AI has made impressive strides in enabling users to create diverse
and realistic visual content such as images, videos, and audio. However,
training generative models on large centralized datasets can pose challenges in
terms of data privacy, security, and accessibility. Federated learning (FL) is
an approach that uses decentralized techniques to collaboratively train a
shared deep learning model while retaining the training data on individual edge
devices to preserve data privacy. This paper proposes a novel method for
training a Denoising Diffusion Probabilistic Model (DDPM) across multiple data
sources using FL techniques. Diffusion models, a newly emerging class of
generative models, show promising results, achieving superior image quality
compared to Generative Adversarial Networks (GANs). Our proposed method Phoenix is an
unconditional diffusion model that leverages strategies to improve the data
diversity of generated samples even when trained on data with statistical
heterogeneity or Non-IID (Non-Independent and Identically Distributed) data. We
demonstrate how our approach outperforms the default diffusion model in an FL
setting. These results indicate that high-quality samples can be generated by
maintaining data diversity, preserving privacy, and reducing communication
between data sources, offering exciting new possibilities in the field of
generative AI.
Conditional Diffusion Models for Weakly Supervised Medical Image Segmentation
June 06, 2023
Xinrong Hu, Yu-Jen Chen, Tsung-Yi Ho, Yiyu Shi
Recent advances in denoising diffusion probabilistic models have shown great
success in image synthesis tasks. While there are already works exploring the
potential of this powerful tool in image semantic segmentation, its application
in weakly supervised semantic segmentation (WSSS) remains relatively
under-explored. Observing that conditional diffusion models (CDMs) are capable of
generating images subject to specific distributions, in this work, we utilize
the category-aware semantic information underlying CDMs to obtain the prediction
mask of the target object with only image-level annotations. More specifically,
we locate the desired class by approximating the derivative of the CDM's output
w.r.t. the input condition. Our method differs from previous diffusion
model methods guided by an external classifier, which accumulate
noise in the background during the reconstruction process. Our method
outperforms state-of-the-art CAM and diffusion model methods on two public
medical image segmentation datasets, which demonstrates that CDM is a promising
tool in WSSS. Also, experiments show that our method is more time-efficient than
existing diffusion model methods, making it practical for wider applications.
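A rough way to approximate the derivative of the conditional model's output with respect to its condition is a finite difference between class-conditioned and null-conditioned predictions; the sketch below illustrates this idea and is not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def class_sensitivity_mask(cdm, x_t, t, class_cond, null_cond):
    """Finite-difference sensitivity of a conditional diffusion model to its class condition.

    Pixels whose prediction changes most when the class condition is replaced by a
    null condition are taken as the object mask (image-level supervision only).
    """
    eps_c = cdm(x_t, t, class_cond)
    eps_0 = cdm(x_t, t, null_cond)
    sensitivity = (eps_c - eps_0).abs().mean(dim=1, keepdim=True)  # per-pixel magnitude
    flat = sensitivity.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    return ((flat - lo) / (hi - lo + 1e-8)).view_as(sensitivity)   # normalized [0, 1] mask
```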
Logic Diffusion for Knowledge Graph Reasoning
June 06, 2023
Xiaoying Xie, Biao Gong, Yiliang Lv, Zhen Han, Guoshuai Zhao, Xueming Qian
Most recent works focus on answering first order logical queries to explore
the knowledge graph reasoning via multi-hop logic predictions. However,
existing reasoning models are limited by the circumscribed logical paradigms of
training samples, which leads to weak generalization to unseen logic. To
address these issues, we propose a plug-in module called Logic Diffusion (LoD)
that discovers unseen queries from surroundings and achieves dynamic equilibrium
between different kinds of patterns. The basic idea of LoD is relation
diffusion and sampling sub-logic by random walking as well as a special
training mechanism called gradient adaption. Besides, LoD is accompanied by a
novel loss function to further achieve robust logical diffusion when facing
noisy data in training or testing sets. Extensive experiments on four public
datasets demonstrate that mainstream knowledge graph reasoning models equipped
with LoD surpass the state of the art. Moreover, our ablation study proves the
general effectiveness of LoD on the noise-rich knowledge graph.
DFormer: Diffusion-guided Transformer for Universal Image Segmentation
June 06, 2023
Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
This paper introduces an approach, named DFormer, for universal image
segmentation. The proposed DFormer views universal image segmentation task as a
denoising process using a diffusion model. DFormer first adds various levels of
Gaussian noise to ground-truth masks, and then learns a model to predict
denoising masks from corrupted masks. Specifically, we take deep pixel-level
features along with the noisy masks as inputs to generate mask features and
attention masks, employing a diffusion-based decoder to perform mask prediction
gradually. At inference, our DFormer directly predicts the masks and
corresponding categories from a set of randomly-generated masks. Extensive
experiments reveal the merits of our proposed contributions on different image
segmentation tasks: panoptic segmentation, instance segmentation, and semantic
segmentation. Our DFormer outperforms the recent diffusion-based panoptic
segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val2017 set.
Further, DFormer achieves promising semantic segmentation performance
outperforming the recent diffusion-based method by 2.2% on ADE20K val set. Our
source code and models will be publicly available at https://github.com/cp3wan/DFormer
Protecting the Intellectual Property of Diffusion Models by the Watermark Diffusion Process
June 06, 2023
Sen Peng, Yufei Chen, Cong Wang, Xiaohua Jia
Diffusion models have emerged as state-of-the-art deep generative
architectures with the increasing demands for generation tasks. Training large
diffusion models for good performance requires high resource costs, making them
valuable intellectual properties to protect. However, most existing ownership
solutions, including watermarking, mainly focus on discriminative models. This
paper proposes WDM, a novel watermarking method for diffusion
models, including watermark embedding, extraction, and verification. WDM embeds
the watermark data through training or fine-tuning the diffusion model to learn
a Watermark Diffusion Process (WDP), different from the standard diffusion
process for the task data. The embedded watermark can be extracted by sampling
using the shared reverse noise from the learned WDP without degrading
performance on the original task. We also provide theoretical foundations and
analysis of the proposed method by connecting the WDP to the diffusion process
with a modified Gaussian kernel. Extensive experiments are conducted to
demonstrate its effectiveness and robustness against various attacks.
DreamSparse: Escaping from Plato’s Cave with 2D Diffusion Model Given Sparse Views
June 06, 2023
Paul Yoo, Jiaxian Guo, Yutaka Matsuo, Shixiang Shane Gu
Synthesizing novel view images from a few views is a challenging but
practical problem. Existing methods often struggle with producing high-quality
results or necessitate per-object optimization in such few-view settings due to
the insufficient information provided. In this work, we explore leveraging the
strong 2D priors in pre-trained diffusion models for synthesizing novel view
images. 2D diffusion models, nevertheless, lack 3D awareness, leading to
distorted image synthesis and compromising the identity. To address these
problems, we propose DreamSparse, a framework that enables the frozen
pre-trained diffusion model to generate geometry- and identity-consistent novel
view images. Specifically, DreamSparse incorporates a geometry module designed
to capture 3D features from sparse views as a 3D prior. Subsequently, a spatial
guidance model is introduced to convert these 3D feature maps into spatial
information for the generative process. This information is then used to guide
the pre-trained diffusion model, enabling it to generate geometrically
consistent images without tuning it. Leveraging the strong image priors in the
pre-trained diffusion models, DreamSparse is capable of synthesizing
high-quality novel views for both object and scene-level images and
generalising to open-set images. Experimental results demonstrate that our
framework can effectively synthesize novel view images from sparse views and
outperforms baselines in both trained and open-set category images. More
results can be found on our project page:
https://sites.google.com/view/dreamsparse-webpage.
Brain tumor segmentation using synthetic MR images – A comparison of GANs and diffusion models
June 05, 2023
Muhammad Usman Akbar, Måns Larsson, Anders Eklund
Large annotated datasets are required for training deep learning models, but
in medical imaging data sharing is often complicated due to ethics,
anonymization and data protection legislation (e.g. the general data protection
regulation (GDPR)). Generative AI models, such as generative adversarial
networks (GANs) and diffusion models, can today produce very realistic
synthetic images, and can potentially facilitate data sharing as GDPR should
not apply for medical images which do not belong to a specific person. However,
in order to share synthetic images it must first be demonstrated that they can
be used for training different networks with acceptable performance. Here, we
therefore comprehensively evaluate four GANs (progressive GAN, StyleGAN 1-3)
and a diffusion model for the task of brain tumor segmentation. Our results
show that segmentation networks trained on synthetic images reach Dice scores
that are 80%-90% of Dice scores when training with real images, but that
memorization of the training images can be a problem for diffusion models if
the original dataset is too small. Furthermore, we demonstrate that common
metrics for evaluating synthetic images, Fréchet inception distance (FID) and
inception score (IS), do not correlate well with the obtained performance when
using the synthetic images for training segmentation networks.
Faster Training of Diffusion Models and Improved Density Estimation via Parallel Score Matching
June 05, 2023
Etrit Haxholli, Marco Lorenzi
In Diffusion Probabilistic Models (DPMs), the task of modeling the score
evolution via a single time-dependent neural network necessitates extended
training periods and may potentially impede modeling flexibility and capacity.
To counteract these challenges, we propose leveraging the independence of
learning tasks at different time points inherent to DPMs. More specifically, we
partition the learning task by utilizing independent networks, each dedicated
to learning the evolution of scores within a specific time sub-interval.
Further, inspired by residual flows, we extend this strategy to its logical
conclusion by employing separate networks to independently model the score at
each individual time point. As empirically demonstrated on synthetic and image
datasets, our approach not only significantly accelerates the training process
by introducing an additional layer of parallelization atop data
parallelization, but it also enhances density estimation performance when
compared to the conventional training methodology for DPMs.
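The key point is that each sub-interval network sees only timesteps in its own range, so the training jobs are independent. A minimal per-expert training loop, assuming an epsilon-prediction parameterization and a loader yielding (image, label) batches:

```python
import torch

def train_interval_expert(expert, optimizer, data_loader, t_lo, t_hi, alpha_bar, steps=1000):
    """Train one score network only on timesteps in [t_lo, t_hi).

    Because the sub-intervals are disjoint, each expert can be trained on a separate
    worker with no communication, on top of ordinary data parallelism.
    """
    for _, (x0, _) in zip(range(steps), data_loader):
        t = torch.randint(t_lo, t_hi, (x0.shape[0],))
        noise = torch.randn_like(x0)
        a_t = alpha_bar[t].view(-1, 1, 1, 1)
        x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise
        loss = (expert(x_t, t) - noise).pow(2).mean()   # standard denoising score-matching loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```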
Enhance Diffusion to Improve Robust Generalization
June 05, 2023
Jianhui Sun, Sanchit Sinha, Aidong Zhang
Deep neural networks are susceptible to human imperceptible adversarial
perturbations. One of the strongest defense mechanisms is Adversarial
Training (AT). In this paper, we aim to address two predominant problems in
AT. First, there is still little consensus on how to set hyperparameters with a
performance guarantee for AT research, and customized settings impede a fair
comparison between different model designs in AT research. Second, the robustly
trained neural networks struggle to generalize well and suffer from tremendous
overfitting. This paper focuses on the primary AT framework - Projected
Gradient Descent Adversarial Training (PGD-AT). We approximate the dynamic of
PGD-AT by a continuous-time Stochastic Differential Equation (SDE), and show
that the diffusion term of this SDE determines the robust generalization. An
immediate implication of this theoretical finding is that robust generalization
is positively correlated with the ratio between learning rate and batch size.
We further propose a novel approach, Diffusion Enhanced Adversarial
Training (DEAT), to manipulate the diffusion term to improve robust
generalization with virtually no extra computational burden. We theoretically
show that DEAT obtains a tighter generalization bound than PGD-AT. Our
empirical investigation is extensive and firmly attests that DEAT universally
outperforms PGD-AT by a significant margin.
SwinRDM: Integrate SwinRNN with Diffusion Model towards High-Resolution and High-Quality Weather Forecasting
June 05, 2023
Lei Chen, Fei Du, Yuan Hu, Fan Wang, Zhibin Wang
cs.AI, cs.CV, physics.ao-ph
Data-driven medium-range weather forecasting has attracted much attention in
recent years. However, the forecasting accuracy at high resolution is currently
unsatisfactory. Pursuing high-resolution and high-quality weather
forecasting, we develop a data-driven model SwinRDM which integrates an
improved version of SwinRNN with a diffusion model. SwinRDM performs
predictions at 0.25-degree resolution and achieves superior forecasting
accuracy to IFS (Integrated Forecast System), the state-of-the-art operational
NWP model, on representative atmospheric variables including 500 hPa
geopotential (Z500), 850 hPa temperature (T850), 2-m temperature (T2M), and
total precipitation (TP), at lead times of up to 5 days. We propose to leverage
a two-step strategy to achieve high-resolution predictions at 0.25-degree
resolution, considering the trade-off between computational memory and
forecasting accuracy.
Recurrent predictions for future atmospheric fields are firstly performed at
1.40625-degree resolution, and then a diffusion-based super-resolution model is
leveraged to recover the high spatial resolution and finer-scale atmospheric
details. SwinRDM pushes forward the performance and potential of data-driven
models by a large margin towards operational applications.
Stable Diffusion is Unstable
June 05, 2023
Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu
Recently, text-to-image models have been thriving. Despite their powerful
generative capacity, our research has uncovered a lack of robustness in this
generation process. Specifically, the introduction of small perturbations to
the text prompts can result in the blending of primary subjects with other
categories or their complete disappearance in the generated images. In this
paper, we propose Auto-attack on Text-to-image Models (ATM), a gradient-based
approach, to effectively and efficiently generate such perturbations. By
learning a Gumbel Softmax distribution, we can make the discrete process of
word replacement or extension continuous, thus ensuring the differentiability
of the perturbation generation. Once the distribution is learned, ATM can
sample multiple attack samples simultaneously. These attack samples can prevent
the generative model from generating the desired subjects without compromising
image quality. ATM has achieved a 91.1% success rate in short-text attacks and
an 81.2% success rate in long-text attacks. Further empirical analysis revealed
four attack patterns based on: 1) the variability in generation speed, 2) the
similarity of coarse-grained characteristics, 3) the polysemy of words, and 4)
the positioning of words.
Video Diffusion Models with Local-Global Context Guidance
June 05, 2023
Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang, You He
Diffusion models have emerged as a powerful paradigm in video synthesis tasks
including prediction, generation, and interpolation. Due to the limitation of
the computational budget, existing methods usually implement conditional
diffusion models with an autoregressive inference pipeline, in which the future
fragment is predicted based on the distribution of adjacent past frames.
However, conditions from only a few previous frames cannot capture the
global temporal coherence, leading to inconsistent or even outrageous results
in long-term video prediction. In this paper, we propose a Local-Global Context
guided Video Diffusion model (LGC-VD) to capture multi-perception conditions
for producing high-quality videos in both conditional/unconditional settings.
In LGC-VD, the UNet is implemented with stacked residual blocks with
self-attention units, avoiding the undesirable computational cost of 3D convolutions.
We construct a local-global context guidance strategy to capture the
multi-perceptual embedding of the past fragment to boost the consistency of
future prediction. Furthermore, we propose a two-stage training strategy to
alleviate the effect of noisy frames for more stable predictions. Our
experiments demonstrate that the proposed method achieves favorable performance
on video prediction, interpolation, and unconditional video generation. We
release code at https://github.com/exisas/LGC-VD.
PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
June 05, 2023
Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Josh Susskind, Navdeep Jaitly
Autoregressive models for text sometimes generate repetitive and low-quality
output because errors accumulate during the steps of generation. This issue is
often attributed to exposure bias - the difference between how a model is
trained, and how it is used during inference. Denoising diffusion models
provide an alternative approach in which a model can revisit and revise its
output. However, they can be computationally expensive and prior efforts on
text have led to models that produce less fluent output compared to
autoregressive models, especially for longer text and paragraphs. In this
paper, we propose PLANNER, a model that combines latent semantic diffusion with
autoregressive generation, to generate fluent text while exercising global
control over paragraphs. The model achieves this by combining an autoregressive
“decoding” module with a “planning” module that uses latent diffusion to
generate semantic paragraph embeddings in a coarse-to-fine manner. The proposed
method is evaluated on various conditional generation tasks, and results on
semantic generation, text completion and summarization show its effectiveness
in generating high-quality long-form text in an efficient manner.
Temporal Dynamic Quantization for Diffusion Models
June 04, 2023
Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, Eunhyeok Park
The diffusion model has gained popularity in vision applications due to its
remarkable generative performance and versatility. However, high storage and
computation demands, resulting from the model size and iterative generation,
hinder its use on mobile devices. Existing quantization techniques struggle to
maintain performance even in 8-bit precision due to the diffusion model’s
unique property of temporal variation in activations. We introduce a novel
quantization method that dynamically adjusts the quantization interval based on
time step information, significantly improving output quality. Unlike
conventional dynamic quantization techniques, our approach has no computational
overhead during inference and is compatible with both post-training
quantization (PTQ) and quantization-aware training (QAT). Our extensive
experiments demonstrate substantial improvements in output quality with the
quantized diffusion model across various datasets.
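The core idea, a quantization scale indexed by the timestep so that the interval selection is a pure lookup at inference, can be sketched as below; this is a simplified fake-quantization module, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TimestepActQuant(nn.Module):
    """Activation fake-quantizer whose interval is indexed by the diffusion timestep.

    The per-timestep scales are fixed after calibration or QAT, so selecting the
    interval at inference is a table lookup with no extra computation.
    """
    def __init__(self, num_timesteps, n_bits=8):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_timesteps))
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x, t):
        s = self.scale[t].view(-1, *([1] * (x.dim() - 1)))
        x_q = torch.clamp(torch.round(x / s), -self.qmax - 1, self.qmax)
        # round() is non-differentiable; QAT would use a straight-through estimator here.
        return x_q * s
```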
Training Data Attribution for Diffusion Models
June 03, 2023
Zheng Dai, David K Gifford
Diffusion models have become increasingly popular for synthesizing
high-quality samples based on training datasets. However, given the oftentimes
enormous sizes of the training datasets, it is difficult to assess how training
data impact the samples produced by a trained diffusion model. The difficulty
of relating diffusion model inputs and outputs poses significant challenges to
model explainability and training data attribution. Here we propose a novel
solution that reveals how training data influence the output of diffusion
models through the use of ensembles. In our approach individual models in an
encoded ensemble are trained on carefully engineered splits of the overall
training data to permit the identification of influential training examples.
The resulting model ensembles enable efficient ablation of training data
influence, allowing us to assess the impact of training data on model outputs.
We demonstrate the viability of these ensembles as generative models and the
validity of our approach to assessing influence.
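One simple way to turn an ensemble trained on engineered data splits into influence estimates (a generic sketch, not necessarily the authors' estimator) is to compare a sample's score under members that saw a given training example versus those that did not:

```python
import numpy as np

def ensemble_influence(scores, membership):
    """Estimate per-example influence from an ensemble trained on data splits.

    scores[m]       : scalar for a generated sample under ensemble member m
                      (e.g. a likelihood or feature-similarity score);
    membership[m,i] : True if training example i was in member m's split.
    Influence of i  = mean score of members trained with i minus those trained without it.
    """
    scores = np.asarray(scores, dtype=float)
    membership = np.asarray(membership, dtype=bool)
    n_examples = membership.shape[1]
    with_i = np.array([scores[membership[:, i]].mean() for i in range(n_examples)])
    without_i = np.array([scores[~membership[:, i]].mean() for i in range(n_examples)])
    return with_i - without_i
```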
Variational Gaussian Process Diffusion Processes
June 03, 2023
Prakhar Verma, Vincent Adam, Arno Solin
Diffusion processes are a class of stochastic differential equations (SDEs)
providing a rich family of expressive models that arise naturally in dynamic
modelling tasks. Probabilistic inference and learning under generative models
with latent processes endowed with a non-linear diffusion process prior are
intractable problems. We build upon work within variational inference
approximating the posterior process as a linear diffusion process, point out
pathologies in the approach, and propose an alternative parameterization of the
Gaussian variational process using a continuous exponential family description.
This allows us to trade a slow inference algorithm with fixed-point iterations
for a fast algorithm for convex optimization akin to natural gradient descent,
which also provides a better objective for the learning of model parameters.
DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting
June 03, 2023
Salva Rühling Cachay, Bo Zhao, Hailey James, Rose Yu
While diffusion models can successfully generate data and make predictions,
they are predominantly designed for static images. We propose an approach for
training diffusion models for dynamics forecasting that leverages the temporal
dynamics encoded in the data, directly coupling it with the diffusion steps in
the network. We train a stochastic, time-conditioned interpolator and a
backbone forecaster network that mimic the forward and reverse processes of
conventional diffusion models, respectively. This design choice naturally
encodes multi-step and long-range forecasting capabilities, allowing for highly
flexible, continuous-time sampling trajectories and the ability to trade-off
performance with accelerated sampling at inference time. In addition, the
dynamics-informed diffusion process imposes a strong inductive bias, allowing
for improved computational efficiency compared to traditional Gaussian
noise-based diffusion models. Our approach performs competitively on
probabilistic skill score metrics in complex dynamics forecasting of sea
surface temperatures, Navier-Stokes flows, and spring mesh systems.
Unlearnable Examples for Diffusion Models: Protect Data from Unauthorized Exploitation
June 02, 2023
Zhengyue Zhao, Jinhao Duan, Xing Hu, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen
Diffusion models have demonstrated remarkable performance in image generation
tasks, paving the way for powerful AIGC applications. However, these
widely-used generative models can also raise security and privacy concerns,
such as copyright infringement, and sensitive data leakage. To tackle these
issues, we propose a method, Unlearnable Diffusion Perturbation, to safeguard
images from unauthorized exploitation. Our approach involves designing an
algorithm to generate sample-wise perturbation noise for each image to be
protected. This imperceptible protective noise makes the data almost
unlearnable for diffusion models, i.e., diffusion models trained or fine-tuned
on the protected data cannot generate high-quality and diverse images related
to the protected training data. Theoretically, we frame this as a max-min
optimization problem and introduce EUDP, a noise scheduler-based method to
enhance the effectiveness of the protective noise. We evaluate our methods on
both Denoising Diffusion Probabilistic Model and Latent Diffusion Models,
demonstrating that training diffusion models on the protected data leads to a
significant reduction in the quality of the generated images. In particular, the
experimental results on Stable Diffusion demonstrate that our method
effectively safeguards images from being used to train Diffusion Models in
various tasks, such as training specific objects and styles. This achievement
holds significant importance in real-world scenarios, as it contributes to the
protection of privacy and copyright against AI-generated content.
DiffECG: A Generalized Probabilistic Diffusion Model for ECG Signals Synthesis
June 02, 2023
Nour Neifar, Achraf Ben-Hamadou, Afef Mdhaffar, Mohamed Jmaiel
In recent years, deep generative models have gained attention as a promising
data augmentation solution for heart disease detection using deep learning
approaches applied to ECG signals. In this paper, we introduce a novel approach
based on denoising diffusion probabilistic models for ECG synthesis that covers
three scenarios: heartbeat generation, partial signal completion, and full
heartbeat forecasting. Our approach represents the first generalized
conditional approach for ECG synthesis, and our experimental results
demonstrate its effectiveness for various ECG-related tasks. Moreover, we show
that our approach outperforms other state-of-the-art ECG generative models and
can enhance the performance of state-of-the-art classifiers.
Denoising Diffusion Semantic Segmentation with Mask Prior Modeling
June 02, 2023
Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang
The evolution of semantic segmentation has long been dominated by learning
more discriminative image representations for classifying each pixel. Despite
the prominent advancements, the priors of segmentation masks themselves, e.g.,
geometric and semantic constraints, are still under-explored. In this paper, we
propose to ameliorate the semantic segmentation quality of existing
discriminative approaches with a mask prior modeled by a recently-developed
denoising diffusion generative model. Beginning with a unified architecture
that adapts diffusion models for mask prior modeling, we focus this work on a
specific instantiation with discrete diffusion and identify a variety of key
design choices for its successful application. Our exploratory analysis
revealed several important findings, including: (1) a simple integration of
diffusion models into semantic segmentation is not sufficient, and a
poorly-designed diffusion process might lead to degradation in segmentation
performance; (2) during the training, the object to which noise is added is
more important than the type of noise; (3) during the inference, the strict
diffusion denoising scheme may not be essential and can be relaxed to a simpler
scheme that even works better. We evaluate the proposed prior modeling with
several off-the-shelf segmentors, and our experimental results on ADE20K and
Cityscapes demonstrate that our approach achieves competitive quantitative
performance and more appealing visual quality.
GANs Settle Scores!
June 02, 2023
Siddarth Asokan, Nishanth Shetty, Aadithya Srikanth, Chandra Sekhar Seelamantula
Generative adversarial networks (GANs) comprise a generator, trained to learn
the underlying distribution of the desired data, and a discriminator, trained
to distinguish real samples from those output by the generator. A majority of
GAN literature focuses on understanding the optimality of the discriminator
through integral probability metric (IPM) or divergence based analysis. In this
paper, we propose a unified approach to analyzing the generator optimization
through a variational approach. In $f$-divergence-minimizing GANs, we show that
the optimal generator is the one that matches the score of its output
distribution with that of the data distribution, while in IPM GANs, we show
that this optimal generator matches score-like functions, involving the
flow-field of the kernel associated with a chosen IPM constraint space.
Further, the IPM-GAN optimization can be seen as one of smoothed
score-matching, where the scores of the data and the generator distributions
are convolved with the kernel associated with the constraint. The proposed
approach serves to unify score-based training and existing GAN flavors,
leveraging results from normalizing flows, while also providing explanations
for empirical phenomena such as the stability of non-saturating GAN losses.
Based on these results, we propose novel alternatives to $f$-GAN and IPM-GAN
training based on score and flow matching, and discriminator-guided Langevin
sampling.
PolyDiffuse: Polygonal Shape Reconstruction via Guided Set Diffusion Models
June 02, 2023
Jiacheng Chen, Ruizhi Deng, Yasutaka Furukawa
This paper presents PolyDiffuse, a novel structured reconstruction algorithm
that transforms visual sensor data into polygonal shapes with Diffusion Models
(DM), an emerging machinery amid exploding generative AI, while formulating
reconstruction as a generation process conditioned on sensor data. The task of
structured reconstruction poses two fundamental challenges to DM: 1) A
structured geometry is a "set" (e.g., a set of polygons for a floorplan
geometry), where a sample of $N$ elements has $N!$ different but equivalent
representations, making the denoising highly ambiguous; and 2) A
"reconstruction" task has a single solution, where an initial noise needs to
be chosen carefully, while any initial noise works for a generation task. Our
technical contribution is the introduction of a Guided Set Diffusion Model
where 1) the forward diffusion process learns guidance networks to control
noise injection so that one representation of a sample remains distinct from
its other permutation variants, thus resolving denoising ambiguity; and 2) the
reverse denoising process reconstructs polygonal shapes, initialized and
directed by the guidance networks, as a conditional generation process subject
to the sensor data. We have evaluated our approach for reconstructing two types
of polygonal shapes: floorplan as a set of polygons and HD map for autonomous
cars as a set of polylines. Through extensive experiments on standard
benchmarks, we demonstrate that PolyDiffuse significantly advances the current
state of the art and enables broader practical applications.
Quantifying Sample Anonymity in Score-Based Generative Models with Adversarial Fingerprinting
June 02, 2023
Mischa Dombrowski, Bernhard Kainz
Recent advances in score-based generative models have led to a huge spike in
the development of downstream applications using generative models, ranging
from data augmentation through image and video generation to anomaly detection. Despite
publicly available trained models, their potential to be used for privacy
preserving data sharing has not been fully explored yet. Training diffusion
models on private data and disseminating the models and weights rather than the
raw dataset paves the way for innovative large-scale data-sharing strategies,
particularly in healthcare, where safeguarding patients’ personal health
information is paramount. However, publishing such models without individual
consent of, e.g., the patients from whom the data was acquired, necessitates
guarantees that identifiable training samples will never be reproduced, thus
protecting personal health data and satisfying the requirements of policymakers
and regulatory bodies. This paper introduces a method for estimating the upper
bound of the probability of reproducing identifiable training images during the
sampling process. This is achieved by designing an adversarial approach that
searches for anatomic fingerprints, such as medical devices or dermal art,
which could potentially be employed to re-identify training images. Our method
harnesses the learned score-based model to estimate the probability of the
entire subspace of the score function that may be utilized for one-to-one
reproduction of training samples. To validate our estimates, we generate
anomalies containing a fingerprint and investigate whether generated samples
from trained generative models can be uniquely mapped to the original training
samples. Overall, our results show that privacy-breaching images are reproduced
at sampling time if the models were trained without care.
Privacy Distillation: Reducing Re-identification Risk of Multimodal Diffusion Models
June 02, 2023
Virginia Fernandez, Pedro Sanchez, Walter Hugo Lopez Pinaya, Grzegorz Jacenków, Sotirios A. Tsaftaris, Jorge Cardoso
Knowledge distillation in neural networks refers to compressing a large model
or dataset into a smaller version of itself. We introduce Privacy Distillation,
a framework that allows a text-to-image generative model to teach another model
without exposing it to identifiable data. Here, we are interested in the
privacy issue faced by a data provider who wishes to share their data via a
multimodal generative model. A question that immediately arises is "How can a
data provider ensure that the generative model is not leaking identifiable
information about a patient?". Our solution consists of (1) training a first
diffusion model on real data, (2) generating a synthetic dataset using this
model and filtering it to exclude images with a re-identifiability risk, and (3)
training a second diffusion model on the filtered synthetic data only. We
showcase that datasets sampled from models trained with privacy distillation
can effectively reduce re-identification risk whilst maintaining downstream
performance.
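The three-step pipeline reads naturally as a short procedure; the sketch below assumes hypothetical helpers train_diffusion, sample, and reid_risk and is only an outline of the described workflow.

```python
def privacy_distillation(real_data, train_diffusion, sample, reid_risk, threshold=0.5):
    """Outline of the three-step pipeline with hypothetical helpers:
    train_diffusion(data) -> model, sample(model, n) -> images,
    reid_risk(image) -> re-identification score in [0, 1].
    """
    teacher = train_diffusion(real_data)                             # (1) train on real data
    synthetic = sample(teacher, len(real_data))
    safe = [img for img in synthetic if reid_risk(img) < threshold]  # (2) filter risky images
    student = train_diffusion(safe)                                  # (3) retrain on safe synthetics only
    return student
```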
Diffusion Self-Guidance for Controllable Image Generation
June 01, 2023
Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, Aleksander Holynski
Large-scale generative models are capable of producing high-quality images
from detailed text descriptions. However, many aspects of an image are
difficult or impossible to convey through text. We introduce self-guidance, a
method that provides greater control over generated images by guiding the
internal representations of diffusion models. We demonstrate that properties
such as the shape, location, and appearance of objects can be extracted from
these representations and used to steer sampling. Self-guidance works similarly
to classifier guidance, but uses signals present in the pretrained model
itself, requiring no additional models or training. We show how a simple set of
properties can be composed to perform challenging image manipulations, such as
modifying the position or size of objects, merging the appearance of objects in
one image with the layout of another, composing objects from many images into
one, and more. We also show that self-guidance can be used to edit real images.
For results and an interactive demo, see our project page at
https://dave.ml/selfguidance/
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
June 01, 2023
Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren
Text-to-image diffusion models can create stunning images from natural
language descriptions that rival the work of professional artists and
photographers. However, these models are large, with complex network
architectures and tens of denoising iterations, making them computationally
expensive and slow to run. As a result, high-end GPUs and cloud-based inference
are required to run diffusion models at scale. This is costly and has privacy
implications, especially when user data is sent to a third party. To overcome
these challenges, we present a generic approach that, for the first time,
unlocks running text-to-image diffusion models on mobile devices in less than
$2$ seconds. We achieve this by introducing an efficient network architecture
and improving step distillation. Specifically, we propose an efficient UNet by
identifying the redundancy of the original model and reducing the computation
of the image decoder via data distillation. Further, we enhance the step
distillation by exploring training strategies and introducing regularization
from classifier-free guidance. Our extensive experiments on MS-COCO show that
our model with $8$ denoising steps achieves better FID and CLIP scores than
Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation
by bringing powerful text-to-image diffusion models to the hands of users.
Extracting Reward Functions from Diffusion Models
June 01, 2023
Felipe Nuti, Tim Franzmeyer, João F. Henriques
Diffusion models have achieved remarkable results in image generation, and
have similarly been used to learn high-performing policies in sequential
decision-making tasks. Decision-making diffusion models can be trained on
lower-quality data, and then be steered with a reward function to generate
near-optimal trajectories. We consider the problem of extracting a reward
function by comparing a decision-making diffusion model that models low-reward
behavior and one that models high-reward behavior; a setting related to inverse
reinforcement learning. We first define the notion of a relative reward
function of two diffusion models and show conditions under which it exists and
is unique. We then devise a practical learning algorithm for extracting it by
aligning the gradients of a reward function – parametrized by a neural network
– to the difference in outputs of both diffusion models. Our method finds
correct reward functions in navigation environments, and we demonstrate that
steering the base model with the learned reward functions results in
significantly increased performance in standard locomotion benchmarks. Finally,
we demonstrate that our approach generalizes beyond sequential decision-making
by learning a reward-like function from two large-scale image generation
diffusion models. The extracted reward function successfully assigns lower
rewards to harmful images.
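A minimal sketch of the gradient-alignment objective described above, assuming both diffusion models expose an epsilon-style prediction and reward_net returns a scalar per sample; this is an illustration of the idea, not the paper's exact loss.

```python
import torch

def relative_reward_loss(reward_net, model_hi, model_lo, x_t, t):
    """Align the input-gradient of a reward network with the difference between the
    outputs of a high-reward and a low-reward diffusion model at the same (x_t, t).
    """
    x_t = x_t.detach().requires_grad_(True)
    r = reward_net(x_t, t).sum()
    grad_r = torch.autograd.grad(r, x_t, create_graph=True)[0]   # d r / d x_t
    with torch.no_grad():
        target = model_hi(x_t, t) - model_lo(x_t, t)             # output difference of the two models
    return (grad_r - target).pow(2).mean()
```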
Intriguing Properties of Text-guided Diffusion Models
June 01, 2023
Qihao Liu, Adam Kortylewski, Yutong Bai, Song Bai, Alan Yuille
Text-guided diffusion models (TDMs) are widely applied but can fail
unexpectedly. Common failures include: (i) natural-looking text prompts
generating images with the wrong content, or (ii) different random samples of
the latent variables that generate vastly different, and even unrelated,
outputs despite being conditioned on the same text prompt. In this work, we aim
to study and understand the failure modes of TDMs in more detail. To achieve
this, we propose SAGE, an adversarial attack on TDMs that uses image
classifiers as surrogate loss functions, to search over the discrete prompt
space and the high-dimensional latent space of TDMs to automatically discover
unexpected behaviors and failure cases in the image generation. We make several
technical contributions to ensure that SAGE finds failure cases of the
diffusion model, rather than the classifier, and verify this in a human study.
Our study reveals four intriguing properties of TDMs that have not been
systematically studied before: (1) We find a variety of natural text prompts
producing images that fail to capture the semantics of input texts. We
categorize these failures into ten distinct types based on the underlying
causes. (2) We find samples in the latent space (which are not outliers) that
lead to distorted images independent of the text prompt, suggesting that parts
of the latent space are not well-structured. (3) We also find latent samples
that lead to natural-looking images which are unrelated to the text prompt,
implying a potential misalignment between the latent and prompt spaces. (4) By
appending a single adversarial token embedding to an input prompt we can
generate a variety of specified target objects, while only minimally affecting
the CLIP score. This demonstrates the fragility of language representations and
raises potential safety concerns.
The Hidden Language of Diffusion Models
June 01, 2023
Hila Chefer, Oran Lang, Mor Geva, Volodymyr Polosukhin, Assaf Shocher, Michal Irani, Inbar Mosseri, Lior Wolf
Text-to-image diffusion models have demonstrated an unparalleled ability to
generate high-quality, diverse images from a textual concept (e.g., “a doctor”,
“love”). However, the internal process of mapping text to a rich visual
representation remains an enigma. In this work, we tackle the challenge of
understanding concept representations in text-to-image models by decomposing an
input text prompt into a small set of interpretable elements. This is achieved
by learning a pseudo-token that is a sparse weighted combination of tokens from
the model’s vocabulary, with the objective of reconstructing the images
generated for the given concept. Applied over the state-of-the-art Stable
Diffusion model, this decomposition reveals non-trivial and surprising
structures in the representations of concepts. For example, we find that some
concepts such as “a president” or “a composer” are dominated by specific
instances (e.g., “Obama”, “Biden”) and their interpolations. Other concepts,
such as “happiness” combine associated terms that can be concrete (“family”,
“laughter”) or abstract (“friendship”, “emotion”). In addition to peering into
the inner workings of Stable Diffusion, our method also enables applications
such as single-image decomposition to tokens, bias detection and mitigation,
and semantic image manipulation. Our code will be available at:
https://hila-chefer.github.io/Conceptor/
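A minimal sketch of the decomposition idea follows, assuming access to the frozen text encoder's vocabulary embeddings and a callable that evaluates the diffusion reconstruction loss for a candidate embedding; the softmax parameterization and the entropy-based sparsity penalty are illustrative choices, not necessarily the paper's.

```python
import torch

class PseudoToken(torch.nn.Module):
    """Pseudo-token expressed as a weighted combination of vocabulary tokens."""
    def __init__(self, vocab_embeddings: torch.Tensor):
        super().__init__()
        self.vocab = vocab_embeddings                    # (V, d), frozen
        self.logits = torch.nn.Parameter(torch.zeros(vocab_embeddings.shape[0]))

    def forward(self):
        w = torch.softmax(self.logits, dim=0)            # non-negative, sums to 1
        return w @ self.vocab, w                         # (d,) embedding, weights

def conceptor_loss(pseudo_token, recon_loss_fn, sparsity_coeff=1e-3):
    emb, w = pseudo_token()
    # recon_loss_fn: denoising loss of the frozen text-to-image model when the
    # concept token is replaced by `emb` (assumed callable).
    entropy = -(w * (w + 1e-12).log()).sum()             # low entropy => few dominant tokens
    return recon_loss_fn(emb) + sparsity_coeff * entropy
```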
Differential Diffusion: Giving Each Pixel Its Strength
June 01, 2023
Eran Levin, Ohad Fried
cs.CV, cs.AI, cs.GR, cs.LG, I.3.3
Text-based image editing has advanced significantly in recent years. With the
rise of diffusion models, image editing via textual instructions has become
ubiquitous. Unfortunately, current models lack the ability to customize the
quantity of the change per pixel or per image fragment, resorting to changing
the entire image in an equal amount, or editing a specific region using a
binary mask. In this paper, we suggest a new framework which enables the user
to customize the quantity of change for each image fragment, thereby enhancing
the flexibility and verbosity of modern diffusion models. Our framework does
not require model training or fine-tuning, but instead performs everything at
inference time, making it easily applicable to an existing model. We show both
qualitatively and quantitatively that our method allows better controllability
and can produce results which are unattainable by existing models. Our code is
available at: https://github.com/exx8/differential-diffusion
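The per-pixel control can be pictured as a blending rule applied at every reverse step: pixels whose requested change strength has not yet been reached are re-anchored to the re-noised original. The sketch below is an illustrative approximation of that idea (the thresholding schedule is an assumption, not the paper's exact rule).

```python
import torch

@torch.no_grad()
def differential_blend(x_t_edit, x0_orig, strength_map, t, T, alphas_cumprod):
    """Blend edited and original content per pixel at reverse step t.

    strength_map in [0, 1]: 1 = fully editable, 0 = keep the original pixel.
    A pixel is free to change only once the sampler reaches a timestep below
    its requested strength; otherwise it is re-anchored to the re-noised original.
    """
    a_bar = alphas_cumprod[t]
    renoised_orig = a_bar.sqrt() * x0_orig + (1 - a_bar).sqrt() * torch.randn_like(x0_orig)
    editable_now = (strength_map >= t / T).float()
    return editable_now * x_t_edit + (1.0 - editable_now) * renoised_orig
```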
Inserting Anybody in Diffusion Models via Celeb Basis
June 01, 2023
Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng
There is strong demand for customizing pretrained large text-to-image
models, $\textit{e.g.}$, Stable Diffusion, to generate innovative concepts, such
as the users themselves. However, concepts newly added by previous
customization methods often show weaker combination abilities than the
original ones, even when several images are given during training. We thus
propose a new personalization method that allows for the seamless integration
of a unique individual into the pre-trained diffusion model using just
$\textbf{one facial photograph}$ and only $\textbf{1024 learnable parameters}$,
in under $\textbf{3 minutes}$. This lets us effortlessly generate stunning
images of this person in any pose or position, interacting with anyone and
doing anything imaginable from text prompts. To achieve this, we first analyze
and build a well-defined
celeb basis from the embedding space of the pre-trained large text encoder.
Then, given one facial photo as the target identity, we generate its own
embedding by optimizing the weight of this basis and locking all other
parameters. Empowered by the proposed celeb basis, the new identity in our
customized model showcases a better concept combination ability than previous
personalization methods. Besides, our model can learn several new identities
at once, and these identities can interact with each other, which previous
customization methods fail to achieve. The code will be released.
Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation
June 01, 2023
Nico Giambi, Giuseppe Lisanti
Deep generative models have shown impressive results in generating realistic
images of faces. GANs managed to generate high-quality, high-fidelity images
when conditioned on semantic masks, but they still lack the ability to
diversify their output. Diffusion models partially solve this problem and are
able to generate diverse samples given the same condition. In this paper, we
propose a multi-conditioning approach for diffusion models via cross-attention
exploiting both attributes and semantic masks to generate high-quality and
controllable face images. We also studied the impact of applying
perceptual-focused loss weighting in the latent space instead of the pixel
space. Our method extends the previous approaches by introducing conditioning
on more than one set of features, guaranteeing a more fine-grained control over
the generated face images. We evaluate our approach on the CelebA-HQ dataset,
and we show that it can generate realistic and diverse samples while allowing
for fine-grained control over multiple attributes and semantic regions.
Additionally, we perform an ablation study to evaluate the impact of different
conditioning strategies on the quality and diversity of the generated images.
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning
June 01, 2023
Xiao Dong, Runhui Huang, Xiaoyong Wei, Zequn Jie, Jianxing Yu, Jian Yin, Xiaodan Liang
Recent advances in vision-language pre-training have enabled machines to
perform better in multimodal object discrimination (e.g., image-text semantic
alignment) and image synthesis (e.g., text-to-image generation). On the other
hand, fine-tuning pre-trained models with discriminative or generative
capabilities, such as CLIP and Stable Diffusion, on domain-specific datasets has
been shown to be effective in various tasks by adapting to specific domains.
However, few studies have explored the possibility of learning both
discriminative and generative capabilities and leveraging their synergistic
effects to create a powerful and personalized multimodal model during
fine-tuning. This paper presents UniDiff, a unified multi-modal model that
integrates image-text contrastive learning (ITC), text-conditioned image
synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff effectively learns aligned semantics and mitigates the issue of
semantic collapse during fine-tuning on small datasets by leveraging RSC on
visual features from CLIP and diffusion models, without altering the
pre-trained model’s basic architecture. UniDiff demonstrates versatility in
both multi-modal understanding and generative tasks. Experimental results on
three datasets (Fashion-man, Fashion-woman, and E-commercial Product) showcase
substantial enhancements in vision-language retrieval and text-to-image
generation, illustrating the advantages of combining discriminative and
generative fine-tuning. The proposed UniDiff model establishes a robust
pipeline for personalized modeling and serves as a benchmark for future
comparisons in the field.
Dissecting Arbitrary-scale Super-resolution Capability from Pre-trained Diffusion Generative Models
June 01, 2023
Ruibin Li, Qihua Zhou, Song Guo, Jie Zhang, Jingcai Guo, Xinyang Jiang, Yifei Shen, Zhenhua Han
Diffusion-based Generative Models (DGMs) have achieved unparalleled
performance in synthesizing high-quality visual content, opening up the
opportunity to improve image super-resolution (SR) tasks. Recent solutions for
these tasks often train architecture-specific DGMs from scratch, or require
iterative fine-tuning and distillation on pre-trained DGMs, both of which take
considerable time and hardware investments. More seriously, since the DGMs are
established with a discrete pre-defined upsampling scale, they cannot well
match the emerging requirements of arbitrary-scale super-resolution (ASSR),
where a unified model adapts to arbitrary upsampling scales, instead of
preparing a series of distinct models for each case. These limitations beg an
intriguing question: can we identify the ASSR capability of existing
pre-trained DGMs without the need for distillation or fine-tuning? In this
paper, we take a step towards resolving this matter by proposing Diff-SR, a
first ASSR attempt based solely on pre-trained DGMs, without additional
training efforts. It is motivated by an exciting finding that a simple
methodology, which first injects a specific amount of noise into the
low-resolution images before invoking a DGM’s backward diffusion process,
outperforms current leading solutions. The key insight is determining a
suitable amount of noise to inject, i.e., small amounts lead to poor low-level
fidelity, while over-large amounts degrade the high-level signature. Through a
fine-grained theoretical analysis, we propose the Perceptual Recoverable
Field (PRF), a metric that achieves the optimal trade-off between these two
factors. Extensive experiments verify the effectiveness, flexibility, and
adaptability of Diff-SR, demonstrating superior performance to state-of-the-art
solutions under diverse ASSR environments.
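The inject-then-denoise recipe described above is straightforward to sketch. In the snippet below, `reverse_sampler` (a frozen pre-trained DGM sampler that accepts a start timestep) and the chosen noise level `t_star` are assumptions; the Perceptual Recoverable Field metric that the paper uses to pick `t_star` is not reproduced here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def assr_by_noise_injection(reverse_sampler, lr_image, scale, t_star, alphas_cumprod):
    # 1) Naively upsample the low-resolution input to the requested scale.
    x0_guess = F.interpolate(lr_image, scale_factor=scale, mode="bicubic",
                             align_corners=False)
    # 2) Inject forward-diffusion noise up to timestep t_star (the knob that
    #    trades low-level fidelity against high-level structure).
    a_bar = alphas_cumprod[t_star]
    x_t = a_bar.sqrt() * x0_guess + (1 - a_bar).sqrt() * torch.randn_like(x0_guess)
    # 3) Denoise with the frozen pre-trained DGM from t_star down to 0.
    return reverse_sampler(x_t, start_t=t_star)
```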
DiffRoom: Diffusion-based High-Quality 3D Room Reconstruction and Generation
June 01, 2023
Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, Hongsheng Li
We present DiffRoom, a novel framework for tackling the problem of
high-quality 3D indoor room reconstruction and generation, both of which are
challenging due to the complexity and diversity of the room geometry. Although
diffusion-based generative models have previously demonstrated impressive
performance in image generation and object-level 3D generation, they have not
yet been applied to room-level 3D generation due to their computationally
intensive costs. In DiffRoom, we propose a sparse 3D diffusion network that is
efficient and possesses strong generative performance for Truncated Signed
Distance Field (TSDF), based on a rough occupancy prior. Inspired by
KinectFusion’s incremental alignment and fusion of local SDFs, we propose a
diffusion-based TSDF fusion approach that iteratively diffuses and fuses TSDFs,
facilitating the reconstruction and generation of an entire room environment.
Additionally, to ease training, we introduce a curriculum diffusion learning
paradigm that speeds up the training convergence process and enables
high-quality reconstruction. According to the user study, the mesh quality
generated by our DiffRoom can even outperform the ground truth mesh provided by
ScanNet.
Image generation with shortest path diffusion
June 01, 2023
Ayan Das, Stathi Fotiadis, Anil Batra, Farhang Nabiei, FengTing Liao, Sattar Vakili, Da-shan Shiu, Alberto Bernacchia
The field of image generation has made significant progress thanks to the
introduction of Diffusion Models, which learn to progressively reverse a given
image corruption. Recently, a few studies introduced alternative ways of
corrupting images in Diffusion Models, with an emphasis on blurring. However,
these studies are purely empirical and it remains unclear what is the optimal
procedure for corrupting an image. In this work, we hypothesize that the
optimal procedure minimizes the length of the path taken when corrupting an
image towards a given final state. We propose the Fisher metric for the path
length, measured in the space of probability distributions. We compute the
shortest path according to this metric, and we show that it corresponds to a
combination of image sharpening, rather than blurring, and noise deblurring.
While the corruption was chosen arbitrarily in previous work, our Shortest Path
Diffusion (SPD) determines uniquely the entire spatiotemporal structure of the
corruption. We show that SPD improves on strong baselines without any
hyperparameter tuning, and outperforms all previous Diffusion Models based on
image blurring. Furthermore, any small deviation from the shortest path leads
to worse performance, suggesting that SPD provides the optimal procedure to
corrupt images. Our work sheds new light on observations made in recent works
and provides a new approach to improve diffusion models on images and other
types of data.
DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing
June 01, 2023
Yangtian Zhang, Zuobai Zhang, Bozitao Zhong, Sanchit Misra, Jian Tang
Proteins play a critical role in carrying out biological functions, and their
3D structures are essential in determining their functions. Accurately
predicting the conformation of protein side-chains given their backbones is
important for applications in protein structure prediction, design and
protein-protein interactions. Traditional methods are computationally intensive
and have limited accuracy, while existing machine learning methods treat the
problem as a regression task and overlook the restrictions imposed by the
constant covalent bond lengths and angles. In this work, we present DiffPack, a
torsional diffusion model that learns the joint distribution of side-chain
torsional angles, the only degrees of freedom in side-chain packing, by
diffusing and denoising on the torsional space. To avoid issues arising from
simultaneous perturbation of all four torsional angles, we propose
autoregressively generating the four torsional angles from $\chi_1$ to $\chi_4$
and training diffusion models for each torsional angle. We evaluate the method
on several benchmarks for protein side-chain packing and show that our method
achieves improvements of 11.9% and 13.5% in angle accuracy on CASP13 and
CASP14, respectively, with a significantly smaller model size (60x fewer
parameters). Additionally, we show the effectiveness of our method in enhancing
side-chain predictions in the AlphaFold2 model. Code will be available upon
acceptance.
Controllable Motion Diffusion Model
June 01, 2023
Yi Shi, Jingbo Wang, Xuekun Jiang, Bo Dai
Generating realistic and controllable motions for virtual characters is a
challenging task in computer animation, and its implications extend to games,
simulations, and virtual reality. Recent studies have drawn inspiration from
the success of diffusion models in image generation, demonstrating the
potential for addressing this task. However, the majority of these studies have
been limited to offline applications that target sequence-level generation,
producing all steps simultaneously. To enable real-time motion synthesis
with diffusion models in response to time-varying control signals, we propose
the framework of the Controllable Motion Diffusion Model (COMODO). Our
framework begins with an auto-regressive motion diffusion model (A-MDM), which
generates motion sequences step by step. In this way, simply using the standard
DDPM algorithm without any additional complexity, our framework is able to
generate high-fidelity motion sequences over extended periods with different
types of control signals. Then, we propose our reinforcement learning-based
controller and controlling strategies on top of the A-MDM model, so that our
framework can steer the motion synthesis process across multiple tasks,
including target reaching, joystick-based control, goal-oriented control, and
trajectory following. The proposed framework enables the real-time generation
of diverse motions that react adaptively to user commands on-the-fly, thereby
enhancing the overall user experience. Besides, it is compatible with the
inpainting-based editing methods and can predict much more diverse motions
without additional fine-tuning of the basic motion generation models. We
conduct comprehensive experiments to evaluate the effectiveness of our
framework in performing various tasks and compare its performance against
state-of-the-art methods.
Addressing Negative Transfer in Diffusion Models
June 01, 2023
Hyojun Go, JinYoung Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, Seungtaek Choi
Diffusion-based generative models have achieved remarkable success in various
domains. They train a model on denoising tasks that encompass different noise
levels simultaneously, representing a form of multi-task learning (MTL).
However, analyzing and improving diffusion models from an MTL perspective
remains under-explored. In particular, MTL can sometimes lead to the well-known
phenomenon of $\textit{negative transfer}$, which results in the performance
degradation of certain tasks due to conflicts between tasks. In this paper, we
aim to analyze diffusion training from an MTL standpoint, presenting two key
observations: $\textbf{(O1)}$ the task affinity between denoising tasks
diminishes as the gap between noise levels widens, and $\textbf{(O2)}$ negative
transfer can arise even in the context of diffusion training. Building upon
these observations, our objective is to enhance diffusion training by
mitigating negative transfer. To achieve this, we propose leveraging existing
MTL methods, but the huge number of denoising tasks makes calculating the
necessary per-task loss or gradient computationally expensive.
To address this challenge, we propose clustering the denoising tasks into small
task clusters and applying MTL methods to them. Specifically, based on
$\textbf{(O2)}$, we employ interval clustering to enforce temporal proximity
among denoising tasks within clusters. We show that interval clustering can be
solved with dynamic programming and utilize signal-to-noise ratio, timestep,
and task affinity for clustering objectives. Through this, our approach
addresses the issue of negative transfer in diffusion models by allowing for
efficient computation of MTL methods. We validate the proposed clustering and
its integration with MTL methods through various experiments, demonstrating
improved sample quality of diffusion models.
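Interval clustering of the ordered denoising tasks can indeed be solved exactly with dynamic programming. The sketch below partitions a 1D sequence (e.g., per-timestep log-SNR values) into k contiguous clusters minimizing within-cluster variance; it is a generic O(k·T²) DP, not the paper's exact objective, which also considers timestep and task-affinity criteria.

```python
import numpy as np

def interval_cluster(values, k):
    """Split an ordered sequence into k contiguous intervals minimizing
    the total within-interval sum of squared deviations."""
    v = np.asarray(values, dtype=float)
    T = len(v)
    prefix = np.concatenate([[0.0], np.cumsum(v)])
    prefix_sq = np.concatenate([[0.0], np.cumsum(v ** 2)])

    def seg_cost(i, j):
        # Sum of squared deviations from the mean for v[i:j].
        n = j - i
        s = prefix[j] - prefix[i]
        sq = prefix_sq[j] - prefix_sq[i]
        return sq - s * s / n

    dp = np.full((k + 1, T + 1), np.inf)
    split = np.zeros((k + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, T + 1):
            for i in range(c - 1, j):
                cand = dp[c - 1, i] + seg_cost(i, j)
                if cand < dp[c, j]:
                    dp[c, j], split[c, j] = cand, i

    # Backtrack contiguous (start, end) index intervals covering 0..T.
    bounds, j = [], T
    for c in range(k, 0, -1):
        i = split[c, j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]

# Example: group 1000 timesteps into 5 clusters by their log-SNR values.
# intervals = interval_cluster(log_snr_per_timestep, k=5)
```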
Low-Light Image Enhancement with Wavelet-based Diffusion Models
June 01, 2023
Hai Jiang, Ao Luo, Songchen Han, Haoqiang Fan, Shuaicheng Liu
Diffusion models have achieved promising results in image restoration tasks,
yet suffer from slow inference, excessive computational resource consumption,
and unstable restoration. To address these issues, we propose a robust and
efficient Diffusion-based Low-Light image enhancement approach, dubbed DiffLL.
Specifically, we present a wavelet-based conditional diffusion model (WCDM)
that leverages the generative power of diffusion models to produce results with
satisfactory perceptual fidelity. Additionally, it takes advantage of the
strengths of wavelet transformation to greatly accelerate inference and reduce
computational resource usage without sacrificing information. To avoid chaotic
content and diversity, we perform both forward diffusion and reverse denoising
in the training phase of WCDM, enabling the model to achieve stable denoising
and reduce randomness during inference. Moreover, we further design a
high-frequency restoration module (HFRM) that utilizes the vertical and
horizontal details of the image to complement the diagonal information for
better fine-grained restoration. Extensive experiments on publicly available
real-world benchmarks demonstrate that our method outperforms the existing
state-of-the-art methods both quantitatively and visually, and it achieves
remarkable improvements in efficiency compared to previous diffusion-based
methods. In addition, we empirically show that the application for low-light
face detection also reveals the latent practical values of our method.
May 31, 2023
Peyman Gholami, Robert Xiao
cs.CV, cs.AI, cs.CL, cs.LG
Text-to-image generative models have made remarkable advancements in
generating high-quality images. However, generated images often contain
undesirable artifacts or other errors due to model limitations. Existing
techniques to fine-tune generated images are time-consuming (manual editing),
produce poorly-integrated results (inpainting), or result in unexpected changes
across the entire image (variation selection and prompt fine-tuning). In this
work, we present Diffusion Brush, a Latent Diffusion Model-based (LDM) tool to
efficiently fine-tune desired regions within an AI-synthesized image. Our
method introduces new random noise patterns at targeted regions during the
reverse diffusion process, enabling the model to efficiently make changes to
the specified regions while preserving the original context for the rest of the
image. We evaluate our method’s usability and effectiveness through a user
study with artists, comparing our technique against other state-of-the-art
image inpainting techniques and editing software for fine-tuning AI-generated
imagery.
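The core idea, regenerating only a painted region while keeping the rest of the image fixed, can be sketched as a RePaint-style masked reverse step. This is an illustrative approximation under assumed interfaces (`reverse_step`, a per-step denoiser), not the authors' exact procedure.

```python
import torch

@torch.no_grad()
def masked_refine_step(reverse_step, x_t, x0_orig, mask, t, alphas_cumprod):
    """One reverse step that re-imagines only the masked region (mask = 1)."""
    # Inside the mask: take the model's reverse-diffusion update, driven by the
    # fresh noise pattern introduced when refinement sampling starts.
    x_prev = reverse_step(x_t, t)                        # assumed: t -> t-1 update
    # Outside the mask: re-anchor to the original latent, noised to level t-1,
    # so the surrounding context is preserved.
    a_bar_prev = alphas_cumprod[max(t - 1, 0)]
    known = a_bar_prev.sqrt() * x0_orig + (1 - a_bar_prev).sqrt() * torch.randn_like(x0_orig)
    return mask * x_prev + (1.0 - mask) * known
```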
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
May 31, 2023
Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, Daniela Rus
cs.LG, cs.RO, cs.SY, eess.SY
Diffusion model-based approaches have shown promise in data-driven planning,
but they offer no safety guarantees, making them hard to apply to
safety-critical applications. To address these challenges, we propose a new
method, called SafeDiffuser, to ensure diffusion probabilistic models satisfy
specifications by using a class of control barrier functions. The key idea of
our approach is to embed the proposed finite-time diffusion invariance into the
denoising diffusion procedure, which enables trustworthy diffusion data
generation. Moreover, we demonstrate that our finite-time diffusion invariance
method through generative models not only maintains generalization performance
but also creates robustness in safe data generation. We test our method on a
series of safe planning tasks, including maze path generation, legged robot
locomotion, and 3D space manipulation, with results showing the advantages of
robustness and guarantees over vanilla diffusion models.
Understanding and Mitigating Copying in Diffusion Models
May 31, 2023
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein
Images generated by diffusion models like Stable Diffusion are increasingly
widespread. Recent works and even lawsuits have shown that these models are
prone to replicating their training data, unbeknownst to the user. In this
paper, we first analyze this memorization problem in text-to-image diffusion
models. While it is widely believed that duplicated images in the training set
are responsible for content replication at inference time, we observe that the
text conditioning of the model plays a similarly important role. In fact, we
see in our experiments that data replication often does not happen for
unconditional models, while it is common in the text-conditional case.
Motivated by our findings, we then propose several techniques for reducing data
replication at both training and inference time by randomizing and augmenting
image captions in the training set.
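One way to operationalize the training-time mitigation is to randomize captions so that duplicated images are never seen with identical text. The recipe below (word dropout plus an occasional generic prefix) is illustrative only, not the specific set of augmentations studied in the paper.

```python
import random

def randomize_caption(caption: str, p_aug: float = 0.2,
                      generic_prefixes=("a photo of", "an image of")) -> str:
    words = caption.split()
    # Randomly drop a word so repeated images get slightly different captions.
    if len(words) > 1 and random.random() < p_aug:
        words.pop(random.randrange(len(words)))
    # Occasionally prepend a generic phrase to dilute highly specific captions.
    if random.random() < p_aug:
        words = random.choice(generic_prefixes).split() + words
    return " ".join(words)

# The same training image is paired with a different caption on each epoch,
# weakening the text-conditioned memorization shortcut.
print(randomize_caption("Eiffel Tower illuminated at night, Paris"))
```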
Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor
May 31, 2023
Ruizhi Shao, Jingxiang Sun, Cheng Peng, Zerong Zheng, Boyao Zhou, Hongwen Zhang, Yebin Liu
Recent years have witnessed considerable achievements in editing images with
text instructions. When applying these editors to dynamic scene editing, the
new-style scene tends to be temporally inconsistent due to the frame-by-frame
nature of these 2D editors. To tackle this issue, we propose Control4D, a novel
approach for high-fidelity and temporally consistent 4D portrait editing.
Control4D is built upon an efficient 4D representation with a 2D
diffusion-based editor. Instead of using direct supervisions from the editor,
our method learns a 4D GAN from it and avoids the inconsistent supervision
signals. Specifically, we employ a discriminator to learn the generation
distribution based on the edited images and then update the generator with the
discrimination signals. For more stable training, multi-level information is
extracted from the edited images and used to facilitate the learning of the
generator. Experimental results show that Control4D surpasses previous
approaches and achieves more photo-realistic and consistent 4D editing
performances. The link to our project website is
https://control4darxiv.github.io.
A Unified Conditional Framework for Diffusion-based Image Restoration
May 31, 2023
Yi Zhang, Xiaoyu Shi, Dasong Li, Xiaogang Wang, Jian Wang, Hongsheng Li
Diffusion Probabilistic Models (DPMs) have recently shown remarkable
performance in image generation tasks, which are capable of generating highly
realistic images. When adopting DPMs for image restoration tasks, the crucial
aspect lies in how to integrate the conditional information to guide the DPMs
to generate accurate and natural output, which has been largely overlooked in
existing works. In this paper, we present a unified conditional framework based
on diffusion models for image restoration. We leverage a lightweight UNet to
predict initial guidance and the diffusion model to learn the residual of the
guidance. By carefully designing the basic module and integration module for
the diffusion model block, we integrate the guidance and other auxiliary
conditional information into every block of the diffusion model to achieve
spatially-adaptive generation conditioning. To handle high-resolution images,
we propose a simple yet effective inter-step patch-splitting strategy to
produce arbitrary-resolution images without grid artifacts. We evaluate our
conditional framework on three challenging tasks: extreme low-light denoising,
deblurring, and JPEG restoration, demonstrating its significant improvements in
perceptual quality and the generalization to restoration tasks.
Protein Design with Guided Discrete Diffusion
May 31, 2023
Nate Gruver, Samuel Stanton, Nathan C. Frey, Tim G. J. Rudner, Isidro Hotzel, Julien Lafrance-Vanasse, Arvind Rajpal, Kyunghyun Cho, Andrew Gordon Wilson
A popular approach to protein design is to combine a generative model with a
discriminative model for conditional sampling. The generative model samples
plausible sequences while the discriminative model guides a search for
sequences with high fitness. Given its broad success in conditional sampling,
classifier-guided diffusion modeling is a promising foundation for protein
design, leading many to develop guided diffusion models for structure with
inverse folding to recover sequences. In this work, we propose diffusioN
Optimized Sampling (NOS), a guidance method for discrete diffusion models that
follows gradients in the hidden states of the denoising network. NOS makes it
possible to perform design directly in sequence space, circumventing
significant limitations of structure-based methods, including scarce data and
challenging inverse design. Moreover, we use NOS to generalize LaMBO, a
Bayesian optimization procedure for sequence design that facilitates multiple
objectives and edit-based constraints. The resulting method, LaMBO-2, enables
discrete diffusions and stronger performance with limited edits through a novel
application of saliency maps. We apply LaMBO-2 to a real-world protein design
task, optimizing antibodies for higher expression yield and binding affinity to
a therapeutic target under locality and liability constraints, with 97%
expression rate and 25% binding rate in exploratory in vitro experiments.
A Geometric Perspective on Diffusion Models
May 31, 2023
Defang Chen, Zhenyu Zhou, Jian-Ping Mei, Chunhua Shen, Chun Chen, Can Wang
Recent years have witnessed significant progress in developing efficient
training and fast sampling approaches for diffusion models. A recent remarkable
advancement is the use of stochastic differential equations (SDEs) to describe
data perturbation and generative modeling in a unified mathematical framework.
In this paper, we reveal several intriguing geometric structures of diffusion
models and contribute a simple yet powerful interpretation to their sampling
dynamics. Through carefully inspecting a popular variance-exploding SDE and its
marginal-preserving ordinary differential equation (ODE) for sampling, we
discover that the data distribution and the noise distribution are smoothly
connected with an explicit, quasi-linear sampling trajectory, and another
implicit denoising trajectory, which even converges faster in terms of visual
quality. We also establish a theoretical relationship between the optimal
ODE-based sampling and the classic mean-shift (mode-seeking) algorithm, with
which we can characterize the asymptotic behavior of diffusion models and
identify the score deviation. These new geometric observations enable us to
improve previous sampling algorithms, re-examine latent interpolation, as well
as re-explain the working principles of distillation-based fast sampling
techniques.
Unsupervised Anomaly Detection in Medical Images Using Masked Diffusion Model
May 31, 2023
Hasan Iqbal, Umar Khalid, Jing Hua, Chen Chen
It can be challenging to identify brain MRI anomalies using supervised
deep-learning techniques due to anatomical heterogeneity and the requirement
for pixel-level labeling. Unsupervised anomaly detection approaches provide an
alternative solution by relying only on sample-level labels of healthy brains
to generate a desired representation to identify abnormalities at the pixel
level. Although generative models are crucial for generating such anatomically
consistent representations of healthy brains, accurately generating the
intricate anatomy of the human brain remains a challenge. In this study, we
present a method called masked-DDPM (mDDPM), which introduces masking-based
regularization to reframe the generation task of diffusion models.
Specifically, we introduce Masked Image Modeling (MIM) and Masked Frequency
Modeling (MFM) in our self-supervised approach that enables models to learn
visual representations from unlabeled data. To the best of our knowledge, this
is the first attempt to apply MFM to DDPM models for medical applications. We
evaluate our approach on datasets containing tumors and multiple sclerosis
lesions and exhibit the superior performance of our unsupervised method as
compared to the existing fully/weakly supervised baselines. Code is available
at https://github.com/hasan1292/mDDPM.
Direct Diffusion Bridge using Data Consistency for Inverse Problems
May 31, 2023
Hyungjin Chung, Jeongsol Kim, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion model-based inverse problem solvers have shown impressive
performance, but are limited in speed, mostly as they require reverse diffusion
sampling starting from noise. Several recent works have tried to alleviate this
problem by building a diffusion process, directly bridging the clean and the
corrupted for specific inverse problems. In this paper, we first unify these
existing works under the name Direct Diffusion Bridges (DDB), showing that
while motivated by different theories, the resulting algorithms only differ in
the choice of parameters. Then, we highlight a critical limitation of the
current DDB framework, namely that it does not ensure data consistency. To
address this problem, we propose a modified inference procedure that imposes
data consistency without the need for fine-tuning. We term the resulting method
data Consistent DDB (CDDB), which outperforms its inconsistent counterpart in
terms of both perception and distortion metrics, thereby effectively pushing
the Pareto-frontier toward the optimum. Our proposed method achieves
state-of-the-art results on both evaluation criteria, showcasing its
superiority over existing methods.
Spontaneous symmetry breaking in generative diffusion models
May 31, 2023
Gabriel Raya, Luca Ambrogioni
Generative diffusion models have recently emerged as a leading approach for
generating high-dimensional data. In this paper, we show that the dynamics of
these models exhibit a spontaneous symmetry breaking that divides the
generative dynamics into two distinct phases: 1) A linear steady-state dynamics
around a central fixed-point and 2) an attractor dynamics directed towards the
data manifold. These two “phases” are separated by the change in stability of
the central fixed-point, with the resulting window of instability being
responsible for the diversity of the generated samples. Using both theoretical
and empirical evidence, we show that an accurate simulation of the early
dynamics does not significantly contribute to the final generation, since early
fluctuations are reverted to the central fixed point. To leverage this insight,
we propose a Gaussian late initialization scheme, which significantly improves
model performance, achieving up to 3x FID improvements on fast samplers, while
also increasing sample diversity (e.g., racial composition of generated CelebA
images). Our work offers a new way to understand the generative dynamics of
diffusion models that has the potential to bring about higher performance and
less biased fast-samplers.
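The proposed speed-up can be pictured as starting the sampler partway through the trajectory from a Gaussian matched to the forward marginal at that time. The sketch below assumes per-pixel data statistics and a sampler that accepts a start timestep; the paper's exact initialization and choice of the start time are not reproduced.

```python
import torch

@torch.no_grad()
def gaussian_late_start(sampler, data_mean, data_std, t_star, alphas_cumprod,
                        shape, device="cpu"):
    a_bar = alphas_cumprod[t_star]
    # Marginal of the forward process at t_star when x0 ~ N(data_mean, data_std^2):
    #   mean = sqrt(a_bar) * data_mean,  var = a_bar * data_std^2 + (1 - a_bar).
    mean = a_bar.sqrt() * data_mean
    std = (a_bar * data_std ** 2 + (1.0 - a_bar)).sqrt()
    x_t = mean + std * torch.randn(shape, device=device)
    return sampler(x_t, start_t=t_star)                  # assumed few-step sampler
```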
Mask, Stitch, and Re-Sample: Enhancing Robustness and Generalizability in Anomaly Detection through Automatic Diffusion Models
May 31, 2023
Cosmin I. Bercea, Michael Neumayr, Daniel Rueckert, Julia A. Schnabel
The introduction of diffusion models in anomaly detection has paved the way
for more effective and accurate image reconstruction in pathologies. However,
the current limitations in controlling noise granularity hinder diffusion
models’ ability to generalize across diverse anomaly types and compromise the
restoration of healthy tissues. To overcome these challenges, we propose
AutoDDPM, a novel approach that enhances the robustness of diffusion models.
AutoDDPM utilizes diffusion models to generate initial likelihood maps of
potential anomalies and seamlessly integrates them with the original image.
Through joint noised distribution re-sampling, AutoDDPM achieves harmonization
and in-painting effects. Our study demonstrates the efficacy of AutoDDPM in
replacing anomalous regions while preserving healthy tissues, considerably
surpassing diffusion models’ limitations. It also contributes valuable insights
and analysis on the limitations of current diffusion models, promoting robust
and interpretable anomaly detection in medical imaging - an essential aspect of
building autonomous clinical decision systems with higher interpretability.
DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model
May 31, 2023
Zhixian Wang, Qingsong Wen, Chaoli Zhang, Liang Sun, Yi Wang
Electrical load forecasting is of great significance for decision making in
power systems, such as unit commitment and energy management. In recent
years, various self-supervised neural network-based methods have been applied
to electrical load forecasting to improve forecasting accuracy and capture
uncertainties. However, most current methods are based on Gaussian likelihood
methods, which aim to accurately estimate the distribution expectation under a
given covariate. This kind of approach is difficult to adapt to situations
where the temporal data exhibit distribution shifts and outliers. In this paper, we
propose a diffusion-based Seq2seq structure to estimate epistemic uncertainty
and use the robust additive Cauchy distribution to estimate aleatoric
uncertainty. Rather than accurately forecasting conditional expectations, we
demonstrate our method’s ability to separate the two types of uncertainty and
to cope with abruptly changing (mutant) scenarios.
Improving Handwritten OCR with Training Samples Generated by Glyph Conditional Denoising Diffusion Probabilistic Model
May 31, 2023
Haisong Ding, Bozhi Luan, Dongnan Gui, Kai Chen, Qiang Huo
Constructing a highly accurate handwritten OCR system requires large amounts
of representative training data, which is both time-consuming and expensive to
collect. To mitigate the issue, we propose a denoising diffusion probabilistic
model (DDPM) to generate training samples. This model conditions on a printed
glyph image and creates mappings between printed characters and handwritten
images, thus enabling the generation of photo-realistic handwritten samples
with diverse styles and unseen text contents. However, the text contents in
synthetic images are not always consistent with the glyph conditional images,
leading to unreliable labels of synthetic samples. To address this issue, we
further propose a progressive data filtering strategy to add those samples with
a high confidence of correctness to the training set. Experimental results on
the IAM benchmark task show that an OCR model trained with augmented DDPM-synthesized
training samples can achieve about 45% relative word error rate reduction
compared with the one trained on real data only.
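The filtering step can be sketched as a simple read-back check: a synthetic handwriting sample is admitted to the training set only if a recognizer decodes the intended text with high confidence. The `recognizer` interface and the fixed threshold are assumptions; the paper's strategy adds samples progressively rather than in a single pass.

```python
def filter_synthetic_samples(samples, recognizer, conf_threshold=0.9):
    """Keep only (image, intended_text) pairs whose label is trustworthy."""
    kept = []
    for image, intended_text in samples:
        predicted_text, confidence = recognizer(image)   # assumed callable
        if predicted_text == intended_text and confidence >= conf_threshold:
            kept.append((image, intended_text))
    return kept
```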
Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels
May 31, 2023
Jian Chen, Ruiyi Zhang, Tong Yu, Rohan Sharma, Zhiqiang Xu, Tong Sun, Changyou Chen
Learning from noisy labels is an important and long-standing problem in
machine learning for real applications. One of the main research lines focuses
on learning a label corrector to purify potential noisy labels. However, these
methods typically rely on strict assumptions and are limited to certain types
of label noise. In this paper, we reformulate the label-noise problem from a
generative-model perspective, $\textit{i.e.}$, labels are generated by
gradually refining an initial random guess. This new perspective immediately
enables existing powerful diffusion models to seamlessly learn the stochastic
generative process. Once the generative uncertainty is modeled, we can perform
classification inference using maximum likelihood estimation of labels. To
mitigate the impact of noisy labels, we propose the
$\textbf{L}$abel-$\textbf{R}$etrieval-$\textbf{A}$ugmented (LRA) diffusion
model, which leverages neighbor consistency to effectively construct
pseudo-clean labels for diffusion training. Our model is flexible and general,
allowing easy incorporation of different types of conditional information,
$\textit{e.g.}$, use of pre-trained models, to further boost model performance.
Extensive experiments are conducted for evaluation. Our model achieves new
state-of-the-art (SOTA) results on all the standard real-world benchmark
datasets. Remarkably, by incorporating conditional information from the
powerful CLIP model, our method can boost the current SOTA accuracy by 10-20
absolute points in many cases.
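The neighbor-consistency step can be sketched as a k-nearest-neighbor vote in a pre-trained feature space (e.g., CLIP embeddings): each training sample's pseudo-clean label is the majority label among its retrieved neighbors. This covers only the retrieval side; the diffusion-based label generator that consumes these pseudo-labels is not reproduced here.

```python
import numpy as np
from collections import Counter

def retrieve_pseudo_labels(features, noisy_labels, k=10):
    """Majority-vote pseudo-clean labels from k nearest neighbors."""
    X = np.asarray(features, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)     # cosine similarity
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                      # exclude each sample itself
    pseudo = []
    for i in range(len(X)):
        nn_idx = np.argpartition(-sims[i], k)[:k]
        votes = Counter(int(noisy_labels[j]) for j in nn_idx)
        pseudo.append(votes.most_common(1)[0][0])
    return np.array(pseudo)
```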
Fine-grained Text Style Transfer with Diffusion-Based Language Models
May 31, 2023
Yiwei Lyu, Tiange Luo, Jiacheng Shi, Todd C. Hollon, Honglak Lee
Diffusion probabilistic models have shown great success in generating
high-quality images controllably, and researchers have tried to bring this
controllability into the text generation domain. Previous works on diffusion-based
language models have shown that they can be trained without external knowledge
(such as pre-trained weights) and still achieve stable performance and
controllability. In this paper, we trained a diffusion-based model on StylePTB
dataset, the standard benchmark for fine-grained text style transfers. The
tasks in StylePTB requires much more refined control over the output text
compared to tasks evaluated in previous works, and our model was able to
achieve state-of-the-art performance on StylePTB on both individual and
compositional transfers. Moreover, our model, trained on limited data from
StylePTB without external knowledge, outperforms previous works that utilized
pretrained weights, embeddings, and external grammar parsers, and this may
indicate that diffusion-based language models have great potential under
low-resource settings.
May 31, 2023
Shaoyan Pan, Elham Abouei, Jacob Wynne, Tonghe Wang, Richard L. J. Qiu, Yuheng Li, Chih-Wei Chang, Junbo Peng, Justin Roper, Pretesh Patel, David S. Yu, Hui Mao, Xiaofeng Yang
Magnetic resonance imaging (MRI)-based synthetic computed tomography (sCT)
simplifies radiation therapy treatment planning by eliminating the need for CT
simulation and error-prone image registration, ultimately reducing patient
radiation dose and setup uncertainty. We propose an MRI-to-CT transformer-based
denoising diffusion probabilistic model (MC-DDPM) to transform MRI into
high-quality sCT to facilitate radiation treatment planning. MC-DDPM implements
diffusion processes with a shifted-window transformer network to generate sCT
from MRI. The proposed model consists of two processes: a forward process which
adds Gaussian noise to real CT scans, and a reverse process in which a
shifted-window transformer V-net (Swin-Vnet) denoises the noisy CT scans
conditioned on the MRI from the same patient to produce noise-free CT scans.
With an optimally trained Swin-Vnet, the reverse diffusion process was used to
generate sCT scans matching MRI anatomy. We evaluated the proposed method by
generating sCT from MRI on a brain dataset and a prostate dataset. Quantitative
evaluation was performed using the mean absolute error (MAE) of Hounsfield unit
(HU), peak signal to noise ratio (PSNR), multi-scale Structure Similarity index
(MS-SSIM) and normalized cross correlation (NCC) indexes between ground truth
CTs and sCTs. MC-DDPM generated brain sCTs with state-of-the-art quantitative
results with MAE 43.317 HU, PSNR 27.046 dB, SSIM 0.965, and NCC 0.983. For the
prostate dataset, MC-DDPM achieved MAE 59.953 HU, PSNR 26.920 dB, SSIM 0.849,
and NCC 0.948. In conclusion, we have developed and validated a novel approach
for generating CT images from routine MRIs using a transformer-based DDPM. This
model effectively captures the complex relationship between CT and MRI images,
allowing for robust and high-quality synthetic CT (sCT) images to be generated
in minutes.
Ambient Diffusion: Learning Clean Distributions from Corrupted Data
May 30, 2023
Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alexandros G. Dimakis, Adam Klivans
cs.LG, cs.AI, cs.CV, cs.IT, math.IT
We present the first diffusion-based framework that can learn an unknown
distribution using only highly-corrupted samples. This problem arises in
scientific applications where access to uncorrupted samples is impossible or
expensive to acquire. Another benefit of our approach is the ability to train
generative models that are less likely to memorize individual training samples
since they never observe clean training data. Our main idea is to introduce
additional measurement distortion during the diffusion process and require the
model to predict the original corrupted image from the further corrupted image.
We prove that our method leads to models that learn the conditional expectation
of the full uncorrupted image given this additional measurement corruption.
This holds for any corruption process that satisfies some technical conditions
(and in particular includes inpainting and compressed sensing). We train models
on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn
the distribution even when all the training samples have $90\%$ of their pixels
missing. We also show that we can finetune foundation models on small corrupted
datasets (e.g. MRI scans with block corruptions) and learn the clean
distribution without memorizing the training set.
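For an inpainting-type corruption, the training idea can be sketched as: hide extra pixels on top of the given corruption, noise the doubly-corrupted image, and ask the model (which also sees the surviving mask) to recover the originally observed pixels. The x0-prediction parameterization and loss weighting below are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def ambient_training_loss(model, x_corrupted, mask, t, alphas_cumprod, p_extra=0.1):
    # Further corruption: randomly hide an extra fraction of the observed pixels.
    extra_keep = (torch.rand_like(mask) > p_extra).float()
    further_mask = mask * extra_keep
    # Forward-diffuse the doubly-corrupted image to timestep t.
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(x_corrupted)
    x_t = a_bar.sqrt() * (x_corrupted * further_mask) + (1 - a_bar).sqrt() * noise
    # The model conditions on the further-corruption mask and predicts the image.
    x0_pred = model(x_t, t, further_mask)
    # Supervise only on pixels observed in the *original* corruption.
    return F.mse_loss(x0_pred * mask, x_corrupted * mask)
```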
Calliffusion: Chinese Calligraphy Generation and Style Transfer with Diffusion Modeling
May 30, 2023
Qisheng Liao, Gus Xia, Zhinuo Wang
In this paper, we propose Calliffusion, a system for generating high-quality
Chinese calligraphy using diffusion models. Our model architecture is based on
DDPM (Denoising Diffusion Probabilistic Models), and it is capable of
generating common characters in five different scripts and mimicking the styles
of famous calligraphers. Experiments demonstrate that our model can generate
calligraphy that is difficult to distinguish from real artworks and that our
controls for characters, scripts, and styles are effective. Moreover, we
demonstrate one-shot transfer learning, using LoRA (Low-Rank Adaptation) to
transfer Chinese calligraphy art styles to unseen characters and even
out-of-domain symbols such as English letters and digits.
Unsupervised Statistical Feature-Guided Diffusion Model for Sensor-based Human Activity Recognition
May 30, 2023
Si Zuo, Vitor Fortes Rey, Sungho Suh, Stephan Sigg, Paul Lukowicz
Recognizing human activities from sensor data is a vital task in various
domains, but obtaining diverse and labeled sensor data remains challenging and
costly. In this paper, we propose an unsupervised statistical feature-guided
diffusion model for sensor-based human activity recognition. The proposed
method aims to generate synthetic time-series sensor data without relying on
labeled data, addressing the scarcity and annotation difficulties associated
with real-world sensor data. By conditioning the diffusion model on statistical
information such as mean, standard deviation, Z-score, and skewness, we
generate diverse and representative synthetic sensor data. We conducted
experiments on public human activity recognition datasets and compared the
proposed method to conventional oversampling methods and state-of-the-art
generative adversarial network methods. The experimental results demonstrate
that the proposed method can improve the performance of human activity
recognition and outperform existing techniques.
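The conditioning signal itself is easy to picture: per-channel summary statistics of a sensor window concatenated into a vector fed to the diffusion model. The exact feature set and ordering below are assumptions (the paper conditions on mean, standard deviation, Z-score, and skewness).

```python
import numpy as np
from scipy.stats import skew

def statistical_condition(window: np.ndarray) -> np.ndarray:
    """Build a conditioning vector for a sensor window of shape (timesteps, channels)."""
    mean = window.mean(axis=0)
    std = window.std(axis=0) + 1e-8
    z_last = (window[-1] - mean) / std        # Z-score of the most recent sample (assumed choice)
    sk = skew(window, axis=0)
    return np.concatenate([mean, std, z_last, sk]).astype(np.float32)

# Example: a 100-step, 3-axis accelerometer window -> a 12-dimensional condition vector.
cond = statistical_condition(np.random.randn(100, 3))
```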
DiffMatch: Diffusion Model for Dense Matching
May 30, 2023
Jisu Nam, Gyuseong Lee, Sunwoo Kim, Hyeonsu Kim, Hyoungwon Cho, Seyeon Kim, Seungryong Kim
The objective for establishing dense correspondence between paired images
consists of two terms: a data term and a prior term. While conventional
techniques focused on defining hand-designed prior terms, which are difficult
to formulate, recent approaches have focused on learning the data term with
deep neural networks without explicitly modeling the prior, assuming that the
model itself has the capacity to learn an optimal prior from a large-scale
dataset. The performance improvement was obvious; however, these methods often
fail to address inherent ambiguities of matching, such as textureless regions,
repetitive patterns, and large displacements. To address this, we propose
DiffMatch, a novel conditional diffusion-based framework designed to explicitly
model both the data and prior terms. Unlike previous approaches, this is
accomplished by leveraging a conditional denoising diffusion model. DiffMatch
consists of two main components: conditional denoising diffusion module and
cost injection module. We stabilize the training process and reduce memory
usage with a stage-wise training strategy. Furthermore, to boost performance,
we introduce an inference technique that finds a better path to the accurate
matching field. Our experimental results demonstrate significant performance
improvements of our method over existing approaches, and the ablation studies
validate our design choices along with the effectiveness of each component.
Project page is available at https://ku-cvlab.github.io/DiffMatch/.
Nested Diffusion Processes for Anytime Image Generation
May 30, 2023
Noam Elata, Bahjat Kawar, Tomer Michaeli, Michael Elad
Diffusion models are the current state-of-the-art in image generation,
synthesizing high-quality images by breaking down the generation process into
many fine-grained denoising steps. Despite their good performance, diffusion
models are computationally expensive, requiring many neural function
evaluations (NFEs). In this work, we propose an anytime diffusion-based method
that can generate viable images when stopped at arbitrary times before
completion. Using existing pretrained diffusion models, we show that the
generation scheme can be recomposed as two nested diffusion processes, enabling
fast iterative refinement of a generated image. We use this Nested Diffusion
approach to peek into the generation process and enable flexible scheduling
based on the instantaneous preference of the user. In experiments on ImageNet
and Stable Diffusion-based text-to-image generation, we show, both
qualitatively and quantitatively, that our method’s intermediate generation
quality greatly exceeds that of the original diffusion model, while the final
slow generation result remains comparable.
DiffSketching: Sketch Control Image Synthesis with Diffusion Models
May 30, 2023
Qiang Wang, Di Kong, Fengyin Lin, Yonggang Qi
Creative sketching is a universal way of visual expression, but translating
images from an abstract sketch is very challenging. Traditionally, creating a
deep learning model for sketch-to-image synthesis needs to overcome distorted
input sketches that lack visual details, and requires collecting large-scale
sketch-image datasets. We first study this task by using diffusion
models. Our model matches sketches through the cross domain constraints, and
uses a classifier to guide the image synthesis more accurately. Extensive
experiments confirm that our method is not only faithful to users’ input
sketches, but also maintains the diversity and imagination of the synthesized
images. Our model beats GAN-based methods in terms of generation quality and
human evaluation, and does not rely on massive sketch-image datasets.
Additionally, we present applications of our method in image editing and
interpolation.
HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance
May 30, 2023
Junzhe Zhu, Peiye Zhuang
Automatic text-to-3D synthesis has achieved remarkable advancements through
the optimization of 3D models. Existing methods commonly rely on pre-trained
text-to-image generative models, such as diffusion models, providing scores for
2D renderings of Neural Radiance Fields (NeRFs) and being utilized for
optimizing NeRFs. However, these methods often encounter artifacts and
inconsistencies across multiple views due to their limited understanding of 3D
geometry. To address these limitations, we propose a reformulation of the
optimization loss using the diffusion prior. Furthermore, we introduce a novel
training approach that unlocks the potential of the diffusion prior. To improve
3D geometry representation, we apply auxiliary depth supervision for
NeRF-rendered images and regularize the density field of NeRFs. Extensive
experiments demonstrate the superiority of our method over prior works,
resulting in advanced photo-realism and improved multi-view consistency.
Generating Behaviorally Diverse Policies with Latent Diffusion Models
May 30, 2023
Shashank Hegde, Sumeet Batra, K. R. Zentner, Gaurav S. Sukhatme
Recent progress in Quality Diversity Reinforcement Learning (QD-RL) has
enabled learning a collection of behaviorally diverse, high performing
policies. However, these methods typically involve storing thousands of
policies, which results in high space-complexity and poor scaling to additional
behaviors. Condensing the archive into a single model while retaining the
performance and coverage of the original collection of policies has proved
challenging. In this work, we propose using diffusion models to distill the
archive into a single generative model over policy parameters. We show that our
method achieves a compression ratio of 13x while recovering 98% of the original
rewards and 89% of the original coverage. Further, the conditioning mechanism
of diffusion models allows for flexibly selecting and sequencing behaviors,
including using language. Project website:
https://sites.google.com/view/policydiffusion/home
Diffusion-Stego: Training-free Diffusion Generative Steganography via Message Projection
May 30, 2023
Daegyu Kim, Chaehun Shin, Jooyoung Choi, Dahuin Jung, Sungroh Yoon
Generative steganography is the process of hiding secret messages in
generated images instead of cover images. Existing studies on generative
steganography use GAN or Flow models to obtain a high message-hiding capacity and
anti-detection ability over cover images. However, they create relatively
unrealistic stego images because of the inherent limitations of generative
models. We propose Diffusion-Stego, a generative steganography approach based
on diffusion models which outperform other generative models in image
generation. Diffusion-Stego projects secret messages into latent noise of
diffusion models and generates stego images with an iterative denoising
process. Since the naive hiding of secret messages into noise boosts visual
degradation and decreases extracted message accuracy, we introduce message
projection, which hides messages into noise space while addressing these
issues. We suggest three options for message projection to adjust the trade-off
between extracted message accuracy, anti-detection ability, and image quality.
Diffusion-Stego is a training-free approach, so we can apply it to pre-trained
diffusion models which generate high-quality images, or even large-scale
text-to-image models, such as Stable Diffusion. Diffusion-Stego achieved a high
capacity of messages (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp
with 90% accuracy) as well as high quality (with a FID score of 2.77 for 1.0
bpp on the FFHQ 64$\times$64 dataset) that makes it challenging to distinguish
from real images in the PNG format.
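The simplest message-projection variant can be sketched as hiding one bit per latent coordinate in the sign of a Gaussian sample; with a deterministic sampler the stego image is generated from this noise, and extraction inverts the sampler and reads the signs back. This 1-bpp sign rule is an illustrative assumption; the paper offers several projections trading capacity against quality and detectability.

```python
import torch

def project_message(bits: torch.Tensor) -> torch.Tensor:
    # Bit 1 -> positive sample, bit 0 -> negative sample. With balanced bits the
    # result is approximately standard normal, so it can seed the sampler.
    magnitude = torch.randn(bits.shape).abs()
    return torch.where(bits.bool(), magnitude, -magnitude)

def extract_message(recovered_noise: torch.Tensor) -> torch.Tensor:
    # After (e.g.) DDIM inversion of the stego image, read the signs back.
    return (recovered_noise > 0).to(torch.int64)

# Round-trip check on the noise itself; the frozen deterministic sampler and its
# inversion would sit between these two calls in the full pipeline.
bits = torch.randint(0, 2, (4, 64, 64))
assert torch.equal(extract_message(project_message(bits)), bits)
```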
LayerDiffusion: Layered Controlled Image Editing with Diffusion Models
May 30, 2023
Pengzhi Li, Qinxuan Huang, Yikang Ding, Zhiheng Li
Text-guided image editing has recently experienced rapid development.
However, simultaneously performing multiple editing actions on a single image,
such as background replacement and specific subject attribute changes, while
maintaining consistency between the subject and the background remains
challenging. In this paper, we propose LayerDiffusion, a semantic-based layered
controlled image editing method. Our method enables non-rigid editing and
attribute modification of specific subjects while preserving their unique
characteristics and seamlessly integrating them into new backgrounds. We
leverage a large-scale text-to-image model and employ a layered controlled
optimization strategy combined with layered diffusion training. During the
diffusion process, an iterative guidance strategy is used to generate a final
image that aligns with the textual description. Experimental results
demonstrate the effectiveness of our method in generating highly coherent
images that closely align with the given textual description. The edited images
maintain a high similarity to the features of the input image and surpass the
performance of current leading image editing methods. LayerDiffusion opens up
new possibilities for controllable image editing.
On Diffusion Modeling for Anomaly Detection
May 29, 2023
Victor Livernoche, Vineet Jain, Yashar Hezaveh, Siamak Ravanbakhsh
Known for their impressive performance in generative modeling, diffusion
models are attractive candidates for density-based anomaly detection. This
paper investigates different variations of diffusion modeling for unsupervised
and semi-supervised anomaly detection. In particular, we find that Denoising
Diffusion Probabilistic Models (DDPM) are performant on anomaly detection
benchmarks yet computationally expensive. By simplifying DDPM in application to
anomaly detection, we are naturally led to an alternative approach called
Diffusion Time Probabilistic Model (DTPM). DTPM estimates the posterior
distribution over diffusion time for a given input, enabling the identification
of anomalies due to their higher posterior density at larger timesteps. We
derive an analytical form for this posterior density and leverage a deep neural
network to improve inference efficiency. Through empirical evaluations on the
ADBench benchmark, we demonstrate that all diffusion-based anomaly detection
methods perform competitively. Notably, DTPM achieves orders of magnitude
faster inference time than DDPM, while outperforming it on this benchmark.
These results establish diffusion-based anomaly detection as an interpretable
and scalable alternative to traditional methods and recent deep-learning
techniques.
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
May 29, 2023
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, Ping Luo
Text-to-image generation has recently witnessed remarkable achievements. We
introduce a text-conditional image diffusion model, termed RAPHAEL, to generate
highly artistic images, which accurately portray the text prompts, encompassing
multiple nouns, adjectives, and verbs. This is achieved by stacking tens of
mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling
billions of diffusion paths (routes) from the network input to the output. Each
path intuitively functions as a “painter” for depicting a particular textual
concept onto a specified image region at a diffusion timestep. Comprehensive
experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as
Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both
image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior
performance in switching images across diverse styles, such as Japanese comics,
realism, cyberpunk, and ink illustration. Secondly, a single model with three
billion parameters, trained on 1,000 A100 GPUs for two months, achieves a
state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore,
RAPHAEL significantly surpasses its counterparts in human evaluation on the
ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the
frontiers of image generation research in both academia and industry, paving
the way for future breakthroughs in this rapidly evolving field. More details
can be found on a project webpage: https://raphael-painter.github.io/.
Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors
May 29, 2023
Paul S. Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J. Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth A. Norman, Tanishq Mathew Abraham
We present MindEye, a novel fMRI-to-image approach to retrieve and
reconstruct viewed images from brain activity. Our model comprises two parallel
submodules that are specialized for retrieval (using contrastive learning) and
reconstruction (using a diffusion prior). MindEye can map fMRI brain activity
to any high dimensional multimodal latent space, like CLIP image space,
enabling image reconstruction using generative models that accept embeddings
from this latent space. We comprehensively compare our approach with other
existing methods, using both qualitative side-by-side comparisons and
quantitative evaluations, and show that MindEye achieves state-of-the-art
performance in both reconstruction and retrieval tasks. In particular, MindEye
can retrieve the exact original image even among highly similar candidates
indicating that its brain embeddings retain fine-grained image-specific
information. This allows us to accurately retrieve images even from large-scale
databases like LAION-5B. We demonstrate through ablations that MindEye’s
performance improvements over previous methods result from specialized
submodules for retrieval and reconstruction, improved training techniques, and
training models with orders of magnitude more parameters. Furthermore, we show
that MindEye can better preserve low-level image features in the
reconstructions by using img2img, with outputs from a separate autoencoder. All
code is available on GitHub.
CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models
May 29, 2023
Zhongxi Chen, Ke Sun, Xianming Lin, Rongrong Ji
Camouflaged Object Detection (COD) is a challenging task in computer vision
due to the high similarity between camouflaged objects and their surroundings.
Existing COD methods primarily employ semantic segmentation, which suffers from
overconfident incorrect predictions. In this paper, we propose a new paradigm
that treats COD as a conditional mask-generation task leveraging diffusion
models. Our method, dubbed CamoDiffusion, employs the denoising process of
diffusion models to iteratively reduce the noise of the mask. Due to the
stochastic sampling process of diffusion, our model is capable of sampling
multiple possible predictions from the mask distribution, avoiding the problem
of overconfident point estimation. Moreover, we develop specialized learning
strategies that include an innovative ensemble approach for generating robust
predictions and tailored forward diffusion methods for efficient training,
specifically for the COD task. Extensive experiments on three COD datasets
attest to the superior performance of our model compared to existing
state-of-the-art methods, particularly on the most challenging COD10K dataset,
where our approach achieves 0.019 in terms of MAE.
Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models
May 29, 2023
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, Zhihua Zhang
Due to the ease of training, ability to scale, and high sample quality,
diffusion models (DMs) have become the preferred option for generative
modeling, with numerous pre-trained models available for a wide variety of
datasets. Containing intricate information about data distributions,
pre-trained DMs are valuable assets for downstream applications. In this work,
we consider learning from pre-trained DMs and transferring their knowledge to
other generative models in a data-free fashion. Specifically, we propose a
general framework called Diff-Instruct to instruct the training of arbitrary
generative models as long as the generated samples are differentiable with
respect to the model parameters. Our proposed Diff-Instruct is built on a
rigorous mathematical foundation where the instruction process directly
corresponds to minimizing a novel divergence we call Integral Kullback-Leibler
(IKL) divergence. IKL is tailored for DMs by calculating the integral of the KL
divergence along a diffusion process, which we show to be more robust in
comparing distributions with misaligned supports. We also reveal non-trivial
connections of our method to existing works such as DreamFusion, and generative
adversarial training. To demonstrate the effectiveness and universality of
Diff-Instruct, we consider two scenarios: distilling pre-trained diffusion
models and refining existing GAN models. The experiments on distilling
pre-trained diffusion models show that Diff-Instruct results in
state-of-the-art single-step diffusion-based models. The experiments on
refining GAN models show that the Diff-Instruct can consistently improve the
pre-trained generators of GAN models across various settings.
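For reference, the Integral Kullback-Leibler divergence at the core of Diff-Instruct has the following schematic form; the notation here is assumed, with $q_t$ and $p_t$ denoting the marginals of the student and the pre-trained teacher after diffusing to time $t$, and $w(t)$ a weighting function.
```latex
% Schematic form of the Integral Kullback-Leibler (IKL) divergence (notation
% assumed): the KL divergence between diffused marginals is integrated along
% the forward diffusion, which makes it better behaved for misaligned supports.
\[
  \mathcal{D}_{\mathrm{IKL}}(q \,\|\, p)
    = \int_{0}^{T} w(t)\, \mathrm{KL}\!\left( q_t \,\|\, p_t \right) \mathrm{d}t .
\]
```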
Conditional Diffusion Models for Semantic 3D Medical Image Synthesis
May 29, 2023
Zolnamar Dorjsembe, Hsing-Kuo Pao, Sodtavilan Odonchimed, Furen Xiao
This paper introduces Med-DDPM, an innovative solution using diffusion models
for semantic 3D medical image synthesis, addressing the prevalent issues in
medical imaging such as data scarcity, inconsistent acquisition methods, and
privacy concerns. Experimental evidence illustrates that diffusion models
surpass Generative Adversarial Networks (GANs) in stability and performance,
generating high-quality, realistic 3D medical images. The distinct feature of
Med-DDPM is its use of semantic conditioning for the diffusion model in 3D
image synthesis. By controlling the generation process through pixel-level mask
labels, it facilitates the creation of realistic medical images. Empirical
evaluations underscore the superior performance of Med-DDPM over GAN techniques
in metrics such as accuracy, stability, and versatility. Furthermore, Med-DDPM
outperforms traditional augmentation techniques and synthetic GAN images in
enhancing the accuracy of segmentation models. It addresses challenges such as
insufficient datasets, lack of annotated data, and class imbalance. Noting the
limitations of the Fréchet inception distance (FID) metric, we introduce a
histogram-equalized FID metric for effective performance evaluation. In
summary, Med-DDPM, by utilizing diffusion models, signifies a crucial step
forward in the domain of high-resolution semantic 3D medical image synthesis,
transcending the limitations of GANs and data constraints. This method paves
the way for a promising solution in medical imaging, primarily for data
augmentation and anonymization, thus contributing significantly to the field.
Generating Driving Scenes with Diffusion
May 29, 2023
Ethan Pronovost, Kai Wang, Nick Roy
In this paper we describe a learned method of traffic scene generation
designed to simulate the output of the perception system of a self-driving car.
In our “Scene Diffusion” system, inspired by latent diffusion, we use a novel
combination of diffusion and object detection to directly create realistic and
physically plausible arrangements of discrete bounding boxes for agents. We
show that our scene generation model is able to adapt to different regions in
the US, producing scenarios that capture the intricacies of each region.
Cognitively Inspired Cross-Modal Data Generation Using Diffusion Models
May 28, 2023
Zizhao Hu, Mohammad Rostami
Most existing cross-modal generative methods based on diffusion models use
guidance to provide control over the latent space to enable conditional
generation across different modalities. Such methods focus on providing
guidance through separately-trained models, each for one modality. As a result,
these methods suffer from cross-modal information loss and are limited to
unidirectional conditional generation. Inspired by how humans synchronously
acquire multi-modal information and learn the correlation between modalities,
we explore a multi-modal diffusion model training and sampling scheme that uses
channel-wise image conditioning to learn cross-modality correlation during the
training phase to better mimic the learning process in the brain. Our empirical
results demonstrate that our approach can achieve data generation conditioned
on all correlated modalities.
Conditional score-based diffusion models for Bayesian inference in infinite dimensions
May 28, 2023
Lorenzo Baldassari, Ali Siahkoohi, Josselin Garnier, Knut Solna, Maarten V. de Hoop
stat.ML, cs.LG, math.AP, math.PR, 62F15, 65N21, 68Q32, 60Hxx, 60Jxx
Since their first introduction, score-based diffusion models (SDMs) have been
successfully applied to solve a variety of linear inverse problems in
finite-dimensional vector spaces due to their ability to efficiently
approximate the posterior distribution. However, using SDMs for inverse
problems in infinite-dimensional function spaces has only been addressed
recently and by learning the unconditional score. While this approach has some
advantages, depending on the specific inverse problem at hand, sampling from
the conditional distribution requires incorporating the information from the
observed data through a proximal optimization step, which solves an
optimization problem numerous times. This may not be feasible in inverse
problems with computationally costly forward operators. To address these
limitations, in this work we propose a method to learn the posterior
distribution in infinite-dimensional Bayesian linear inverse problems using
amortized conditional SDMs. In particular, we prove that the conditional
denoising estimator is a consistent estimator of the conditional score in
infinite dimensions. We show that the extension of SDMs to the conditional
setting requires some care because the conditional score typically blows up for
small times, in contrast to the unconditional score. We also discuss the
robustness of the learned distribution against perturbations of the
observations. We conclude by presenting numerical examples that validate our
approach and provide additional insights.
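The conditional denoising estimator mentioned above regresses onto the score of the forward transition kernel. Written in finite-dimensional notation for readability (the paper works in function space), the training objective takes the following standard form; the symbols $s_\theta$ and $\lambda(t)$ are assumed here, not quoted from the paper.
```latex
% Conditional denoising score matching (finite-dimensional sketch):
% s_theta approximates the conditional score \nabla_{x_t} \log p_t(x_t \mid y),
% and \lambda(t) is a time-dependent weighting.
\[
  \min_{\theta}\;
  \mathbb{E}_{t,\,(x_0, y),\; x_t \sim p(x_t \mid x_0)}
  \Big[ \lambda(t)\,
    \big\| s_{\theta}(x_t, y, t) - \nabla_{x_t} \log p(x_t \mid x_0) \big\|^2 \Big] .
\]
```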
Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models
May 27, 2023
Yusheng Tian, Guangyan Zhang, Tan Lee
This paper is about developing personalized speech synthesis systems with
recordings of mildly impaired speech. In particular, we consider consonant and
vowel alterations resulting from partial glossectomy, the surgical removal of
part of the tongue. The aim is to restore articulation in the synthesized
speech and maximally preserve the target speaker’s individuality. We propose to
tackle the problem with guided diffusion models. Specifically, a
diffusion-based speech synthesis model is trained on original recordings, to
capture and preserve the target speaker’s original articulation style. When
using the model for inference, a separately trained phone classifier will guide
the synthesis process towards proper articulation. Objective and subjective
evaluation results show that the proposed method substantially improves
articulation in the synthesized speech over original recordings, and preserves
more of the target speaker’s individuality than a voice conversion baseline.
MADiff: Offline Multi-agent Learning with Diffusion Models
May 27, 2023
Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, Weinan Zhang
Diffusion model (DM), as a powerful generative model, recently achieved huge
success in various scenarios including offline reinforcement learning, where
the policy learns to conduct planning by generating trajectory in the online
evaluation. However, despite the effectiveness shown for single-agent learning,
it remains unclear how DMs can operate in multi-agent problems, where agents
can hardly complete teamwork when each agent’s trajectory is modeled
independently, without good coordination. In this paper, we propose MADiff, a novel
generative multi-agent learning framework to tackle this problem. MADiff is
realized with an attention-based diffusion model to model the complex
coordination among behaviors of multiple diffusion agents. To the best of our
knowledge, MADiff is the first diffusion-based multi-agent offline RL
framework, which behaves as both a decentralized policy and a centralized
controller that includes opponent modeling and can be used for multi-agent
trajectory prediction. MADiff takes advantage of the powerful generative
ability of diffusion while being well suited to modeling complex multi-agent
interactions. Our experiments show the superior performance of MADiff compared
to baseline algorithms in a range of multi-agent learning tasks.
Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities
May 26, 2023
Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, Marie-Francine Moens
Decoding visual stimuli from neural responses recorded by functional Magnetic
Resonance Imaging (fMRI) presents an intriguing intersection between cognitive
neuroscience and machine learning, promising advancements in understanding
human visual perception and building non-invasive brain-machine interfaces.
However, the task is challenging due to the noisy nature of fMRI signals and
the intricate pattern of brain visual representations. To mitigate these
challenges, we introduce a two-phase fMRI representation learning framework.
The first phase pre-trains an fMRI feature learner with a proposed
Double-contrastive Mask Auto-encoder to learn denoised representations. The
second phase tunes the feature learner to attend to neural activation patterns
most informative for visual reconstruction with guidance from an image
auto-encoder. The optimized fMRI feature learner then conditions a latent
diffusion model to reconstruct image stimuli from brain activities.
Experimental results demonstrate our model’s superiority in generating
high-resolution and semantically accurate images, substantially exceeding
previous state-of-the-art methods by 39.34% in the 50-way-top-1 semantic
classification accuracy. Our research invites further exploration of the
decoding task’s potential and contributes to the development of non-invasive
brain-machine interfaces.
Functional Flow Matching
May 26, 2023
Gavin Kerrigan, Giosue Migliorini, Padhraic Smyth
In this work, we propose Functional Flow Matching (FFM), a function-space
generative model that generalizes the recently-introduced Flow Matching model
to operate directly in infinite-dimensional spaces. Our approach works by first
defining a path of probability measures that interpolates between a fixed
Gaussian measure and the data distribution, followed by learning a vector field
on the underlying space of functions that generates this path of measures. Our
method does not rely on likelihoods or simulations, making it well-suited to
the function space setting. We provide both a theoretical framework for
building such models and an empirical evaluation of our techniques. We
demonstrate through experiments on synthetic and real-world benchmarks that our
proposed FFM method outperforms several recently proposed function-space
generative models.
Flow Matching for Scalable Simulation-Based Inference
May 26, 2023
Maximilian Dax, Jonas Wildberger, Simon Buchholz, Stephen R. Green, Jakob H. Macke, Bernhard Schölkopf
Neural posterior estimation methods based on discrete normalizing flows have
become established tools for simulation-based inference (SBI), but scaling them
to high-dimensional problems can be challenging. Building on recent advances in
generative modeling, we here present flow matching posterior estimation (FMPE),
a technique for SBI using continuous normalizing flows. Like diffusion models,
and in contrast to discrete flows, flow matching allows for unconstrained
architectures, providing enhanced flexibility for complex data modalities. Flow
matching, therefore, enables exact density evaluation, fast training, and
seamless scalability to large architectures–making it ideal for SBI. We show
that FMPE achieves competitive performance on an established SBI benchmark, and
then demonstrate its improved scalability on a challenging scientific problem:
for gravitational-wave inference, FMPE outperforms methods based on comparable
discrete flows, reducing training time by 30% with substantially improved
accuracy. Our work underscores the potential of FMPE to enhance performance in
challenging inference scenarios, thereby paving the way for more advanced
applications to scientific problems.
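A minimal PyTorch sketch of flow-matching posterior estimation in the FMPE spirit: the toy simulator, the small MLP vector field, the linear (rectified-flow style) interpolation path, and all hyperparameters are illustrative assumptions, not the paper's setup.
```python
import torch
import torch.nn as nn

# Minimal flow-matching posterior estimation sketch (assumptions noted above).
dim_theta, dim_x = 2, 2
vfield = nn.Sequential(nn.Linear(dim_theta + dim_x + 1, 128), nn.ReLU(),
                       nn.Linear(128, dim_theta))
opt = torch.optim.Adam(vfield.parameters(), lr=1e-3)

for step in range(1000):
    theta1 = torch.randn(256, dim_theta)            # parameters from the prior
    x = theta1 + 0.1 * torch.randn_like(theta1)     # simulated observations
    theta0 = torch.randn_like(theta1)               # base (noise) samples
    t = torch.rand(256, 1)
    theta_t = (1 - t) * theta0 + t * theta1         # point on the path
    target = theta1 - theta0                        # conditional vector field
    pred = vfield(torch.cat([theta_t, x, t], dim=-1))
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# To sample the posterior for an observation x_obs, integrate the learned ODE
# d(theta)/dt = vfield([theta_t, x_obs, t]) from t = 0 (noise) to t = 1.
```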
High-Fidelity Image Compression with Score-based Generative Models
May 26, 2023
Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, Lucas Theis
eess.IV, cs.CV, cs.LG, stat.ML
Despite the tremendous success of diffusion generative models in
text-to-image generation, replicating this success in the domain of image
compression has proven difficult. In this paper, we demonstrate that diffusion
can significantly improve perceptual quality at a given bit-rate, outperforming
state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is
achieved using a simple but theoretically motivated two-stage approach
combining an autoencoder targeting MSE followed by a further score-based
decoder. However, as we will show, implementation details matter and the
optimal design decisions can differ greatly from typical text-to-image models.
An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization
May 26, 2023
Fei Kong, Jinhao Duan, RuiPeng Ma, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
cs.SD, cs.AI, cs.LG, eess.AS
Recently, diffusion models have achieved remarkable success in generating
tasks, including image and audio generation. However, like other generative
models, diffusion models are prone to privacy issues. In this paper, we propose
an efficient query-based membership inference attack (MIA), named Proximal
Initialization Attack (PIA), which uses the ground-truth trajectory obtained
from $\epsilon$ initialized at $t=0$, together with the predicted point, to
infer membership.
Experimental results indicate that the proposed method can achieve competitive
performance with only two queries on both discrete-time and continuous-time
diffusion models. Moreover, previous works on the privacy of diffusion models
have focused on vision tasks without considering audio tasks. Therefore, we
also explore the robustness of diffusion models to MIA in the text-to-speech
(TTS) task, which is an audio generation task. To the best of our knowledge,
this work is the first to study the robustness of diffusion models to MIA in
the TTS task. Experimental results indicate that models with mel-spectrogram
(image-like) output are vulnerable to MIA, while models with audio output are
relatively robust to MIA. Code is available at
https://github.com/kong13661/PIA.
Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling
May 26, 2023
Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, Yujiu Yang
Recently, diffusion models have demonstrated a remarkable ability to solve
inverse problems in an unsupervised manner. Existing methods mainly focus on
modifying the posterior sampling process while neglecting the potential of the
forward process. In this work, we propose Shortcut Sampling for Diffusion
(SSD), a novel pipeline for solving inverse problems. Instead of initiating
from random noise, the key concept of SSD is to find the “Embryo”, a
transitional state that bridges the measurement image y and the restored image
x. By utilizing the “shortcut” path of “input-Embryo-output”, SSD can achieve
precise and fast restoration. To obtain the Embryo in the forward process, we
propose Distortion Adaptive Inversion (DA Inversion). Moreover, we apply back
projection and attention injection as additional consistency constraints during
the generation process. Experimentally, we demonstrate the effectiveness of SSD
on several representative tasks, including super-resolution, deblurring, and
colorization. Compared to state-of-the-art zero-shot methods, our method
achieves competitive results with only 30 NFEs. Moreover, SSD with 100 NFEs can
outperform state-of-the-art zero-shot methods in certain tasks.
Error Bounds for Flow Matching Methods
May 26, 2023
Joe Benton, George Deligiannidis, Arnaud Doucet
Score-based generative models are a popular class of generative modelling
techniques relying on stochastic differential equations (SDE). From their
inception, it was realized that it was also possible to perform generation
using ordinary differential equations (ODE) rather than SDE. This led to the
introduction of the probability flow ODE approach and denoising diffusion
implicit models. Flow matching methods have recently further extended these
ODE-based approaches and approximate a flow between two arbitrary probability
distributions. Previous work derived bounds on the approximation error of
diffusion models under the stochastic sampling regime, given assumptions on the
$L^2$ loss. We present error bounds for the flow matching procedure using fully
deterministic sampling, assuming an $L^2$ bound on the approximation error and
a certain regularity condition on the data distributions.
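For context, the fully deterministic sampling regime analyzed here corresponds to integrating the probability flow ODE associated with a forward SDE $dx = f(x,t)\,dt + g(t)\,dw$; this is standard score-based SDE notation and is shown for orientation, not quoted from the abstract.
```latex
% Probability flow ODE associated with the forward SDE dx = f(x,t) dt + g(t) dw
% (standard score-based SDE notation); deterministic sampling integrates this
% ODE, which is the regime the stated error bounds apply to.
\[
  \frac{\mathrm{d}x}{\mathrm{d}t}
    = f(x, t) - \tfrac{1}{2}\, g(t)^{2}\, \nabla_x \log p_t(x) .
\]
```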
Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
May 26, 2023
Xiang Li, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu, Chao Weng, Helen Meng
Expressive human speech generally abounds with rich and flexible speech
prosody variations. The speech prosody predictors in existing expressive speech
synthesis methods mostly produce deterministic predictions, which are learned
by directly minimizing the norm of the prosody prediction error. This unimodal
nature leads to a mismatch with the ground-truth distribution and harms the
model’s ability to make diverse predictions. Thus, we propose a novel prosody
predictor based on the denoising diffusion probabilistic model to take
advantage of its high-quality generative modeling and training stability.
Experiment results confirm that the proposed prosody predictor outperforms the
deterministic baseline on both the expressiveness and diversity of prediction
results with even fewer network parameters.
Tree-Based Diffusion Schrödinger Bridge with Applications to Wasserstein Barycenters
May 26, 2023
Maxence Noble, Valentin De Bortoli, Arnaud Doucet, Alain Durmus
Multi-marginal Optimal Transport (mOT), a generalization of OT, aims at
minimizing the integral of a cost function with respect to a distribution with
some prescribed marginals. In this paper, we consider an entropic version of
mOT with a tree-structured quadratic cost, i.e., a function that can be written
as a sum of pairwise cost functions between the nodes of a tree. To address
this problem, we develop Tree-based Diffusion Schrödinger Bridge (TreeDSB),
an extension of the Diffusion Schrödinger Bridge (DSB) algorithm. TreeDSB
corresponds to a dynamic and continuous state-space counterpart of the
multimarginal Sinkhorn algorithm. A notable use case of our methodology is to
compute Wasserstein barycenters, which can be recast as the solution of an mOT
problem on a star-shaped tree. We demonstrate that our methodology can be
applied in high-dimensional settings such as image interpolation and Bayesian
fusion.
Score-based Diffusion Models for Bayesian Image Reconstruction
May 25, 2023
Michael T. McCann, Hyungjin Chung, Jong Chul Ye, Marc L. Klasky
This paper explores the use of score-based diffusion models for Bayesian
image reconstruction. Diffusion models are an efficient tool for generative
modeling and can also be used to solve image reconstruction problems. We
present a simple and flexible algorithm for training a diffusion
model and using it for maximum a posteriori reconstruction, minimum mean square
error reconstruction, and posterior sampling. We present experiments on both a
linear and a nonlinear reconstruction problem that highlight the strengths and
limitations of the approach.
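A toy NumPy sketch of the MAP-reconstruction idea, combining a data-fidelity gradient with a prior score; the analytic Gaussian score below stands in for a trained score/diffusion model, and the linear forward operator, noise level, and step size are assumptions made only to keep the example self-contained and runnable. The paper's actual training and reconstruction recipes differ.
```python
import numpy as np

# MAP reconstruction toy: ascend log p(x | y) = log p(y | x) + log p(x) + const.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 10))           # assumed linear forward operator
x_true = rng.normal(size=10)
sigma = 1.0                            # assumed observation noise level
y = A @ x_true + sigma * rng.normal(size=5)

def prior_score(x):
    return -x                          # stand-in for a trained score network

x = np.zeros(10)
step = 0.01
for _ in range(1000):
    data_grad = A.T @ (y - A @ x) / sigma**2   # gradient of log p(y | x)
    x += step * (data_grad + prior_score(x))   # gradient ascent on the log-posterior
print("residual:", np.linalg.norm(A @ x - y),
      "error:", np.linalg.norm(x - x_true))
```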
Anomaly Detection in Satellite Videos using Diffusion Models
May 25, 2023
Akash Awasthi, Son Ly, Jaer Nizam, Samira Zare, Videet Mehta, Safwan Ahmed, Keshav Shah, Ramakrishna Nemani, Saurabh Prasad, Hien Van Nguyen
Anomaly detection is the identification of an unexpected event. Real-time
detection of extreme events such as wildfires, cyclones, or
floods using satellite data has become crucial for disaster management.
Although several earth-observing satellites provide information about
disasters, satellites in the geostationary orbit provide data at intervals as
frequent as every minute, effectively creating a video from space. There are
many techniques that have been proposed to identify anomalies in surveillance
videos; however, the available datasets do not exhibit such dynamic behavior,
so we discuss an anomaly detection framework that can operate on very
high-frequency data to find fast-moving anomalies. In this work, we present a
diffusion model that does not need any motion component to capture fast-moving
anomalies and outperforms the baseline methods.
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
May 25, 2023
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong
Text-to-Image diffusion models have made tremendous progress over the past
two years, enabling the generation of highly realistic images based on
open-domain text descriptions. However, despite their success, text
descriptions often struggle to adequately convey detailed controls, even when
composed of long and complex texts. Moreover, recent studies have also shown
that these models face challenges in understanding such complex texts and
generating the corresponding images. Therefore, there is a growing need to
enable more control modes beyond text description. In this paper, we introduce
Uni-ControlNet, a novel approach that allows for the simultaneous utilization
of different local controls (e.g., edge maps, depth maps, segmentation masks)
and global controls (e.g., CLIP image embeddings) in a flexible and composable
manner within one model. Unlike existing methods, Uni-ControlNet only requires
the fine-tuning of two additional adapters upon frozen pre-trained
text-to-image diffusion models, eliminating the huge cost of training from
scratch. Moreover, thanks to some dedicated adapter designs, Uni-ControlNet
only necessitates a constant number (i.e., 2) of adapters, regardless of the
number of local or global controls used. This not only reduces the fine-tuning
costs and model size, making it more suitable for real-world deployment, but
also facilitates the composability of different conditions. Through both
quantitative and qualitative comparisons, Uni-ControlNet demonstrates its
superiority over existing methods in terms of controllability, generation
quality and composability. Code is available at
https://github.com/ShihaoZhaoZSH/Uni-ControlNet.
Parallel Sampling of Diffusion Models
May 25, 2023
Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari
Diffusion models are powerful generative models but suffer from slow
sampling, often taking 1000 sequential denoising steps for one sample. As a
result, considerable efforts have been directed toward reducing the number of
denoising steps, but these methods hurt sample quality. Instead of reducing the
number of denoising steps (trading quality for speed), in this paper we explore
an orthogonal approach: can we run the denoising steps in parallel (trading
compute for speed)? In spite of the sequential nature of the denoising steps,
we show that surprisingly it is possible to parallelize sampling via Picard
iterations, by guessing the solution of future denoising steps and iteratively
refining until convergence. With this insight, we present ParaDiGMS, a novel
method to accelerate the sampling of pretrained diffusion models by denoising
multiple steps in parallel. ParaDiGMS is the first diffusion sampling method
that enables trading compute for speed and is even compatible with existing
fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we
improve sampling speed by 2-4x across a range of robotics and image generation
models, giving state-of-the-art sampling speeds of 0.2s on 100-step
DiffusionPolicy and 16s on 1000-step StableDiffusion-v2 with no measurable
degradation of task reward, FID score, or CLIP score.
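A toy NumPy sketch of the Picard-iteration idea behind parallel sampling: guess an entire trajectory, then refine all timesteps simultaneously until the refinement converges. The linear drift below stands in for the learned probability-flow drift of a diffusion sampler, so this illustrates only the fixed-point structure, not ParaDiGMS itself.
```python
import numpy as np

def f(x, t):
    return -x  # stand-in for the learned drift / denoiser direction

steps = 100
ts = np.linspace(0.0, 1.0, steps + 1)
dt = ts[1] - ts[0]
x0 = 2.0

traj = np.full(steps + 1, x0)      # initial guess: a constant trajectory
for sweep in range(steps):         # each sweep updates all timesteps at once
    drifts = f(traj[:-1], ts[:-1]) * dt
    new = x0 + np.concatenate(([0.0], np.cumsum(drifts)))
    if np.max(np.abs(new - traj)) < 1e-8:
        break
    traj = new

# The fixed point equals the sequential Euler trajectory, but far fewer sweeps
# than `steps` are needed, which is where the compute-for-speed trade-off
# comes from.
print(sweep, traj[-1], 2.0 * np.exp(-1.0))
```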
UDPM: Upsampling Diffusion Probabilistic Models
May 25, 2023
Shady Abu-Hussein, Raja Giryes
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have caught
significant attention. By composing a Markovian process that starts in the data
domain and then gradually adds noise until reaching pure white noise, they
achieve superior performance in learning data distributions. Yet, these models
require a large number of diffusion steps to produce aesthetically pleasing
samples, which is inefficient. In addition, unlike common generative
adversarial networks, the latent space of diffusion models is not
interpretable. In this work, we propose to generalize the denoising diffusion
process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we
reduce the latent variable dimension in addition to the traditional addition
of noise. As a result, we are able to sample images of size $256\times 256$
with only 7 diffusion steps, roughly two orders of magnitude fewer than
standard DDPMs require. We formally develop the Markovian diffusion
processes of the UDPM, and demonstrate its generation capabilities on the
popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable
property of UDPM is that it is very easy to interpolate its latent space, which
is not the case with standard diffusion models. Our code is available online
at https://github.com/shadyabh/UDPM.
Trans-Dimensional Generative Modeling via Jump Diffusion Models
May 25, 2023
Andrew Campbell, William Harvey, Christian Weilbach, Valentin De Bortoli, Tom Rainforth, Arnaud Doucet
We propose a new class of generative models that naturally handle data of
varying dimensionality by jointly modeling the state and dimension of each
datapoint. The generative process is formulated as a jump diffusion process
that makes jumps between different dimensional spaces. We first define a
dimension destroying forward noising process, before deriving the dimension
creating time-reversed generative process along with a novel evidence lower
bound training objective for learning to approximate it. Simulating our learned
approximation to the time-reversed generative process then provides an
effective way of sampling data of varying dimensionality by jointly generating
state values and dimensions. We demonstrate our approach on molecular and video
datasets of varying dimensionality, reporting better compatibility with
test-time diffusion guidance imputation tasks and improved interpolation
capabilities versus fixed dimensional models that generate state values and
dimensions separately.
Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
May 25, 2023
Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi
Text-to-image (T2I) research has grown explosively in the past year, owing to
the large-scale pre-trained diffusion models and many emerging personalization
and editing approaches. Yet, one pain point persists: the text prompt
engineering, and searching high-quality text prompts for customized results is
more art than science. Moreover, as commonly argued: “an image is worth a
thousand words” - the attempt to describe a desired image with texts often ends
up being ambiguous and cannot comprehensively cover delicate visual details,
hence necessitating more additional controls from the visual domain. In this
paper, we take a bold step forward: taking “Text” out of a pre-trained T2I
diffusion model, to reduce the burdensome prompt engineering efforts for users.
Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to
generate new images: it takes a reference image as “context”, an optional image
structural conditioning, and an initial noise, with absolutely no text prompt.
The core architecture behind the scene is Semantic Context Encoder (SeeCoder),
substituting the commonly used CLIP-based or LLM-based text encoder. The
reusability of SeeCoder also makes it a convenient drop-in component: one can
also pre-train a SeeCoder in one T2I model and reuse it for another. Through
extensive experiments, Prompt-Free Diffusion is found to (i)
outperform prior exemplar-based image synthesis approaches; (ii) perform on par
with state-of-the-art T2I models using prompts following the best practice; and
(iii) be naturally extensible to other downstream applications such as anime
figure generation and virtual try-on, with promising quality. Our code and
models are open-sourced at https://github.com/SHI-Labs/Prompt-Free-Diffusion.
ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
May 25, 2023
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu
Score distillation sampling (SDS) has shown great promise in text-to-3D
generation by distilling pretrained large-scale text-to-image diffusion models,
but suffers from over-saturation, over-smoothing, and low-diversity problems.
In this work, we propose to model the 3D parameter as a random variable instead
of a constant as in SDS and present variational score distillation (VSD), a
principled particle-based variational framework to explain and address the
aforementioned issues in text-to-3D generation. We show that SDS is a special
case of VSD and leads to poor samples with both small and large CFG weights. In
comparison, VSD works well with various CFG weights as ancestral sampling from
diffusion models and simultaneously improves the diversity and sample quality
with a common CFG weight (i.e., $7.5$). We further present various improvements
in the design space for text-to-3D such as distillation time schedule and
density initialization, which are orthogonal to the distillation algorithm yet
not well explored. Our overall approach, dubbed ProlificDreamer, can generate
high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with
rich structure and complex effects (e.g., smoke and drops). Further,
initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and
photo-realistic. Project page: https://ml.cs.tsinghua.edu.cn/prolificdreamer/
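Schematically, the key change from SDS to VSD can be written as follows; the notation is assumed rather than quoted from the paper, with $x_t$ the noised rendering, $y$ the text prompt, $c$ the camera, $g(\theta, c)$ the differentiable renderer, and $w(t)$ a weighting function. SDS subtracts the injected noise $\epsilon$, whereas VSD subtracts an estimate of the score of the rendered-image distribution produced by an auxiliary network $\epsilon_\phi$.
```latex
% Schematic SDS vs. VSD gradients (notation assumed, sketch only).
\[
  \nabla_\theta \mathcal{L}_{\mathrm{SDS}}
    = \mathbb{E}_{t,\epsilon,c}\!\Big[ w(t)\,
      \big( \epsilon_{\mathrm{pretrain}}(x_t; y, t) - \epsilon \big)\,
      \tfrac{\partial g(\theta, c)}{\partial \theta} \Big],
  \qquad
  \nabla_\theta \mathcal{L}_{\mathrm{VSD}}
    = \mathbb{E}_{t,\epsilon,c}\!\Big[ w(t)\,
      \big( \epsilon_{\mathrm{pretrain}}(x_t; y, t)
            - \epsilon_{\phi}(x_t; y, t, c) \big)\,
      \tfrac{\partial g(\theta, c)}{\partial \theta} \Big].
\]
```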
Unifying GANs and Score-Based Diffusion as Generative Particle Models
May 25, 2023
Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, Alain Rakotomamonjy
cs.LG, cs.CV, cs.NE, stat.ML
Particle-based deep generative models, such as gradient flows and score-based
diffusion models, have recently gained traction thanks to their striking
performance. Their principle of displacing particle distributions by
differential equations is conventionally seen as opposed to the previously
widespread generative adversarial networks (GANs), which involve training a
pushforward generator network. In this paper, we challenge this interpretation
and propose a novel framework that unifies particle and adversarial generative
models by framing generator training as a generalization of particle models.
This suggests that a generator is an optional addition to any such generative
model. Consequently, integrating a generator into a score-based diffusion model
and training a GAN without a generator naturally emerge from our framework. We
empirically test the viability of these original models as proofs of concepts
of potential applications of our framework.
DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models
May 25, 2023
Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, Jiliang Tang
Recently, Generative Diffusion Models (GDMs) have showcased their remarkable
capabilities in learning and generating images. A large community of GDMs has
naturally emerged, further promoting the diversified applications of GDMs in
various fields. However, this unrestricted proliferation has raised serious
concerns about copyright protection. For example, artists including painters
and photographers are becoming increasingly concerned that GDMs could
effortlessly replicate their unique creative works without authorization. In
response to these challenges, we introduce a novel watermarking scheme,
DiffusionShield, tailored for GDMs. DiffusionShield protects images from
copyright infringement by GDMs through encoding the ownership information into
an imperceptible watermark and injecting it into the images. Its watermark can
be easily learned by GDMs and will be reproduced in their generated images. By
detecting the watermark from generated images, copyright infringement can be
exposed with evidence. Benefiting from the uniformity of the watermarks and the
joint optimization method, DiffusionShield ensures low distortion of the
original image, high watermark detection performance, and the ability to embed
lengthy messages. We conduct rigorous and comprehensive experiments to show the
effectiveness of DiffusionShield in defending against infringement by GDMs and
its superiority over traditional watermarking methods.
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
May 25, 2023
Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, Xinxiao Wu
Large pre-trained models have had a significant impact on computer vision by
enabling multi-modal learning, where the CLIP model has achieved impressive
results in image classification, object detection, and semantic segmentation.
However, the model’s performance on 3D point cloud processing tasks is limited
due to the domain gap between depth maps from 3D projection and training images
of CLIP. This paper proposes DiffCLIP, a new pre-training framework that
incorporates stable diffusion with ControlNet to minimize the domain gap in the
visual branch. Additionally, a style-prompt generation module is introduced for
few-shot tasks in the textual branch. Extensive experiments on the ModelNet10,
ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities
for 3D understanding. By using stable diffusion and style-prompt generation,
DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG
of ScanObjectNN, which is state-of-the-art performance, and an accuracy of
80.6% for zero-shot classification on ModelNet10, which is comparable to
state-of-the-art performance.
A Diffusion Probabilistic Prior for Low-Dose CT Image Denoising
May 25, 2023
Xuan Liu, Yaoqin Xie, Songhui Diao, Shan Tan, Xiaokun Liang
Low-dose computed tomography (CT) image denoising is crucial in medical image
computing. Recent years have seen remarkable improvements in deep
learning-based methods for this task. However, training deep denoising neural
networks requires low-dose and normal-dose CT image pairs, which are difficult
to obtain in clinical settings. To address this challenge, we propose a novel fully
unsupervised method for low-dose CT image denoising, which is based on
denoising diffusion probabilistic model – a powerful generative model. First,
we train an unconditional denoising diffusion probabilistic model capable of
generating high-quality normal-dose CT images from random noise. Subsequently,
the probabilistic priors of the pre-trained diffusion model are incorporated
into a Maximum A Posteriori (MAP) estimation framework for iteratively solving
the image denoising problem. Our method ensures the diffusion model produces
high-quality normal-dose CT images while keeping the image content consistent
with the input low-dose CT images. We evaluate our method on a widely used
low-dose CT image denoising benchmark, and it outperforms several supervised
low-dose CT image denoising methods in terms of both quantitative and visual
performance.
On Architectural Compression of Text-to-Image Diffusion Models
May 25, 2023
Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
Exceptional text-to-image (T2I) generation results of Stable Diffusion models
(SDMs) come with substantial computational demands. To resolve this issue,
recent research on efficient SDMs has prioritized reducing the number of
sampling steps and utilizing network quantization. Orthogonal to these
directions, this study highlights the power of classical architectural
compression for general-purpose T2I synthesis by introducing block-removed
knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention
blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of
parameters, MACs per sampling step, and latency. We conduct distillation-based
pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training
pairs) on a single A100 GPU. Despite being trained with limited resources, our
compact models can imitate the original SDM by benefiting from transferred
knowledge and achieve competitive results against larger multi-billion
parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate
the applicability of our lightweight pretrained models in personalized
generation with DreamBooth finetuning.
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
May 25, 2023
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
Text-to-image diffusion models can generate diverse, high-fidelity images
based on user-provided text prompts. Recent research has extended these models
to support text-guided image editing. While text guidance is an intuitive
editing interface for users, it often fails to ensure the precise concept
conveyed by users. To address this issue, we propose Custom-Edit, in which we
(i) customize a diffusion model with a few reference images and then (ii)
perform text-guided editing. Our key discovery is that customizing only
language-relevant parameters with augmented prompts improves reference
similarity significantly while maintaining source similarity. Moreover, we
provide our recipe for each customization and editing process. We compare
popular customization methods and validate our findings on two editing methods
using various datasets.
Differentially Private Latent Diffusion Models
May 25, 2023
Saiyue Lyu, Margarita Vinaroz, Michael F. Liu, Mijung Park
Diffusion models (DMs) are widely used for generating high-quality image
datasets. However, since they operate directly in the high-dimensional pixel
space, optimization of DMs is computationally expensive, requiring long
training times. This contributes to large amounts of noise being injected into
the differentially private learning process, due to the composability property
of differential privacy. To address this challenge, we propose training Latent
Diffusion Models (LDMs) with differential privacy. LDMs use powerful
pre-trained autoencoders to reduce the high-dimensional pixel space to a much
lower-dimensional latent space, making training DMs more efficient and fast.
Unlike [Ghalebikesabi et al., 2023], which pre-trains DMs with public data and then
fine-tunes them with private data, we fine-tune only the attention modules of
LDMs at varying layers with privacy-sensitive data, reducing the number of
trainable parameters by approximately 96% compared to fine-tuning the entire
DM. We test our algorithm on several public-private data pairs, such as
ImageNet as public data and CIFAR10 and CelebA as private data, and SVHN as
public data and MNIST as private data. Our approach provides a promising
direction for training more powerful, yet training-efficient differentially
private DMs that can produce high-quality synthetic images.
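A hedged PyTorch sketch of the "fine-tune only the attention modules" recipe: the toy blocks and the "attn" naming convention are assumptions made for illustration, not the paper's architecture or code, and the selected parameters would then be passed to a differentially private optimizer (e.g. DP-SGD).
```python
import torch.nn as nn

class ToyBlock(nn.Module):
    """Illustrative block with an attention submodule and an MLP submodule."""
    def __init__(self, dim=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mlp = nn.Linear(dim, dim)

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        return self.mlp(h)

def attention_only_parameters(model):
    # Freeze everything except parameters living under an 'attn' submodule.
    trainable = []
    for name, p in model.named_parameters():
        p.requires_grad = ".attn." in f".{name}."
        if p.requires_grad:
            trainable.append(p)
    return trainable

model = nn.Sequential(ToyBlock(), ToyBlock())
params = attention_only_parameters(model)
print(sum(p.numel() for p in params), "of",
      sum(p.numel() for p in model.parameters()), "parameters are trainable")
```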
Score-Based Multimodal Autoencoders
May 25, 2023
Daniel Wesego, Amirmohammad Rooshenas
Multimodal Variational Autoencoders (VAEs) represent a promising group of
generative models that facilitate the construction of a tractable posterior
within the latent space, given multiple modalities. Daunhawer et al. (2022)
demonstrate that as the number of modalities increases, the generative quality
of each modality declines. In this study, we explore an alternative approach to
enhance the generative performance of multimodal VAEs by jointly modeling the
latent space of unimodal VAEs using score-based models (SBMs). The role of the
SBM is to enforce multimodal coherence by learning the correlation among the
latent variables. Consequently, our model combines the superior generative
quality of unimodal VAEs with coherent integration across different modalities.
Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition
May 25, 2023
Dongnan Gui, Kai Chen, Haisong Ding, Qiang Huo
There are more than 80,000 character categories in Chinese, but most of them
are rarely used. To build a high-performance handwritten Chinese character
recognition (HCCR) system supporting the full character set with a traditional
approach, many training samples need to be collected for each character
category, which is both time-consuming and expensive. In this paper, we propose a novel
approach to transforming Chinese character glyph images generated from font
libraries to handwritten ones with a denoising diffusion probabilistic model
(DDPM). Trained on handwritten samples of a small character set, the DDPM is
capable of mapping printed strokes to handwritten ones, which makes it possible
to generate photo-realistic, diverse-style handwritten samples of unseen
character categories. Combining DDPM-synthesized samples of unseen categories
with real samples of other categories, we can build an HCCR system to support
the full character set. Experimental results on the CASIA-HWDB dataset with
3,755 character categories show that HCCR systems trained with synthetic
samples perform similarly to one trained with real samples in terms of
recognition accuracy. The proposed method has the potential to address HCCR
with a larger vocabulary.
Manifold Diffusion Fields
May 24, 2023
Ahmed A. Elhag, Joshua M. Susskind, Miguel Angel Bautista
We present Manifold Diffusion Fields (MDF), an approach to learn generative
models of continuous functions defined over Riemannian manifolds. Leveraging
insights from spectral geometry analysis, we define an intrinsic coordinate
system on the manifold via the eigenfunctions of the Laplace-Beltrami
operator. MDF represents functions using an explicit parametrization formed by
a set of multiple input-output pairs. Our approach allows sampling continuous
functions on manifolds and is invariant with respect to rigid and isometric
transformations of the manifold. Empirical results on several datasets and
manifolds show that MDF can capture distributions of such functions with better
diversity and fidelity than previous approaches.
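A small NumPy sketch of the intrinsic-coordinate idea: approximate Laplace-Beltrami eigenfunctions with graph-Laplacian eigenvectors computed on a point cloud (a circle here, standing in for a general manifold). The kernel bandwidth and the number of retained coordinates are arbitrary choices for illustration.
```python
import numpy as np

# Sample a point cloud from a circle (stand-in for a general manifold).
n = 200
theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
pts = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Gaussian-kernel affinity and unnormalized graph Laplacian.
d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / 0.05)
L = np.diag(W.sum(1)) - W

# Eigenvectors of the graph Laplacian approximate Laplace-Beltrami
# eigenfunctions; skip the constant eigenvector and keep the next few as
# intrinsic coordinates for every point.
evals, evecs = np.linalg.eigh(L)
coords = evecs[:, 1:5]
print(coords.shape)  # (200, 4) intrinsic coordinates
```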
Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution
May 24, 2023
Yiyang Ma, Huan Yang, Wenhan Yang, Jianlong Fu, Jiaying Liu
Diffusion models, as a kind of powerful generative model, have given
impressive results on image super-resolution (SR) tasks. However, due to the
randomness introduced in the reverse process of diffusion models, the
performance of diffusion-based SR models fluctuates from one sampling run to
another, especially for samplers with few resampled steps. This inherent
randomness of diffusion models results in ineffectiveness and instability,
making it challenging for users to guarantee the quality of SR results. Our
work, in contrast, takes this randomness as an opportunity: fully analyzing and
leveraging it leads to the construction of an effective plug-and-play sampling
method that has the potential to benefit a series of diffusion-based SR
methods. In more detail, we propose to steadily sample high-quality SR images
from pretrained diffusion-based SR models by solving diffusion ordinary
differential equations (diffusion ODEs) with optimal boundary conditions (BCs)
and analyze the characteristics between the choices of BCs and their
corresponding SR results. Our analysis shows the route to obtain an
approximately optimal BC via an efficient exploration in the whole space. The
quality of SR results sampled by the proposed method with fewer steps
outperforms the quality of results sampled by current methods with randomness
from the same pretrained diffusion-based SR model, which means that our
sampling method “boosts” current diffusion-based SR models without any
additional training.
Training Energy-Based Normalizing Flow with Score-Matching Objectives
May 24, 2023
Chen-Hao Chao, Wei-Fang Sun, Yen-Chang Hsu, Zsolt Kira, Chun-Yi Lee
In this paper, we establish a connection between the parameterization of
flow-based and energy-based generative models, and present a new flow-based
modeling approach called energy-based normalizing flow (EBFlow). We demonstrate
that by optimizing EBFlow with score-matching objectives, the computation of
Jacobian determinants for linear transformations can be entirely bypassed. This
feature enables the use of arbitrary linear layers in the construction of
flow-based models without increasing the computational time complexity of each
training iteration from $\mathcal{O}(D^2L)$ to $\mathcal{O}(D^3L)$ for an
$L$-layered model that accepts $D$-dimensional inputs. This makes the training
of EBFlow more efficient than the commonly-adopted maximum likelihood training
method. In addition to the reduction in runtime, we enhance the training
stability and empirical performance of EBFlow through a number of techniques
developed based on our analysis on the score-matching methods. The experimental
results demonstrate that our approach achieves a significant speedup compared
to maximum likelihood estimation, while outperforming prior efficient training
techniques with a noticeable margin in terms of negative log-likelihood (NLL).
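The reason score matching lets EBFlow skip Jacobian determinants of linear layers can be sketched as follows; the notation is assumed and this is only the core idea, not the paper's full derivation.
```latex
% For a flow z = f(x) with base density p_z,
%   log p(x) = log p_z(f(x)) + log |det J_f(x)| .
% Score matching only needs the input-gradient of log p(x); a linear layer W
% contributes the x-independent constant log |det W| to the log-determinant,
% so its input-gradient vanishes and the determinant never needs evaluating:
\[
  \nabla_x \log p(x)
    = J_f(x)^{\top} \nabla_z \log p_z\!\big(f(x)\big)
      + \nabla_x \log \lvert \det J_f(x) \rvert,
  \qquad
  \nabla_x \log \lvert \det W \rvert = 0 .
\]
```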
Diffusion-Based Audio Inpainting
May 24, 2023
Eloi Moliner, Vesa Välimäki
Audio inpainting aims to reconstruct missing segments in corrupted
recordings. Previous methods produce plausible reconstructions when the gap
length is shorter than about 100 ms, but the quality decreases for longer
gaps. This paper explores recent advancements in deep learning and,
particularly, diffusion models, for the task of audio inpainting. The proposed
method uses an unconditionally trained generative model, which can be
conditioned in a zero-shot fashion for audio inpainting, offering high
flexibility to regenerate gaps of arbitrary length. An improved deep neural
network architecture based on the constant-Q transform, which allows the model
to exploit pitch-equivariant symmetries in audio, is also presented. The
performance of the proposed algorithm is evaluated through objective and
subjective metrics for the task of reconstructing short to mid-sized gaps. The
results of a formal listening test show that the proposed method delivers a
comparable performance against state-of-the-art for short gaps, while retaining
a good audio quality and outperforming the baselines for the longest gap
lengths tested, 150 ms and 200 ms. This work helps improve the restoration of
sound recordings having fairly long local disturbances or dropouts, which must
be reconstructed.
Robust Classification via a Single Diffusion Model
May 24, 2023
Huanran Chen, Yinpeng Dong, Zhengyi Wang, Xiao Yang, Chengqi Duan, Hang Su, Jun Zhu
Recently, diffusion models have been successfully applied to improving
adversarial robustness of image classifiers by purifying the adversarial noises
or generating realistic data for adversarial training. However, the
diffusion-based purification can be evaded by stronger adaptive attacks while
adversarial training does not perform well under unseen threats, exhibiting
inevitable limitations of these methods. To better harness the expressive power
of diffusion models, in this paper we propose Robust Diffusion Classifier
(RDC), a generative classifier that is constructed from a pre-trained diffusion
model to be adversarially robust. Our method first maximizes the data
likelihood of a given input and then predicts the class probabilities of the
optimized input using the conditional likelihood of the diffusion model through
Bayes’ theorem. Since our method does not require training on particular
adversarial attacks, we demonstrate that it is more generalizable to defend
against multiple unseen threats. In particular, RDC achieves $73.24\%$ robust
accuracy against $\ell_\infty$ norm-bounded perturbations with
$\epsilon_\infty=8/255$ on CIFAR-10, surpassing the previous state-of-the-art
adversarial training models by $+2.34\%$. The findings highlight the potential
of generative classifiers by employing diffusion models for adversarial
robustness compared with the commonly studied discriminative classifiers.
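A minimal PyTorch sketch of the Bayes-rule classification step (the likelihood-maximization step of RDC is omitted): approximate $\log p(x \mid y)$ by the negative expected denoising error of a class-conditional noise predictor and take a softmax over classes. The tiny untrained network below is a stand-in, so the printed probabilities are meaningless beyond illustrating the computation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEpsModel(nn.Module):
    """Stand-in class-conditional noise predictor (not the paper's model)."""
    def __init__(self, dim=8, n_classes=3):
        super().__init__()
        self.emb = nn.Embedding(n_classes, dim)
        self.net = nn.Sequential(nn.Linear(2 * dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, dim))

    def forward(self, x_t, y, t):
        return self.net(torch.cat([x_t, self.emb(y), t], dim=-1))

@torch.no_grad()
def diffusion_classify(eps_model, x, n_classes, alpha_bar, n_samples=64):
    scores = []
    for y in range(n_classes):
        t_idx = torch.randint(0, len(alpha_bar), (n_samples,))
        a = alpha_bar[t_idx].unsqueeze(-1)
        eps = torch.randn(n_samples, x.shape[-1])
        x_t = a.sqrt() * x + (1 - a).sqrt() * eps       # noised copies of x
        y_vec = torch.full((n_samples,), y, dtype=torch.long)
        t_in = t_idx.float().unsqueeze(-1) / len(alpha_bar)
        pred = eps_model(x_t, y_vec, t_in)
        scores.append(-F.mse_loss(pred, eps))           # proxy for log p(x | y)
    return torch.softmax(torch.stack(scores), dim=0)    # Bayes with uniform prior

alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 1000), dim=0)
model = TinyEpsModel()
x = torch.randn(8)
print(diffusion_classify(model, x, n_classes=3, alpha_bar=alpha_bar))
```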
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
May 24, 2023
Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn
The recent progress in diffusion-based text-to-image generation models has
significantly expanded generative capabilities via conditioning the text
descriptions. However, since relying solely on text prompts is still
restrictive for fine-grained customization, we aim to extend the boundaries of
conditional generation to incorporate diverse types of modalities, e.g.,
sketch, box, and style embedding, simultaneously. We thus design a multimodal
text-to-image diffusion model, coined as DiffBlender, that achieves the
aforementioned goal in a single model by training only a few small
hypernetworks. DiffBlender facilitates a convenient scaling of input
modalities, without altering the parameters of an existing large-scale
generative model to retain its well-established knowledge. Furthermore, our
study sets new standards for multimodal generation by conducting quantitative
and qualitative comparisons with existing approaches. By diversifying the
channels of conditioning modalities, DiffBlender faithfully reflects the
provided information or, in its absence, creates imaginative generation.
Deceptive-NeRF: Enhancing NeRF Reconstruction using Pseudo-Observations from Diffusion Models
May 24, 2023
Xinhang Liu, Shiu-hong Kao, Jiaben Chen, Yu-Wing Tai, Chi-Keung Tang
This paper introduces Deceptive-NeRF, a new method for enhancing the quality
of reconstructed NeRF models using synthetically generated pseudo-observations,
capable of handling sparse input and removing floater artifacts. Our proposed
method involves three key steps: 1) reconstruct a coarse NeRF model from sparse
inputs; 2) generate pseudo-observations based on the coarse model; 3) refine
the NeRF model using pseudo-observations to produce a high-quality
reconstruction. To generate photo-realistic pseudo-observations that faithfully
preserve the identity of the reconstructed scene while remaining consistent
with the sparse inputs, we develop a rectification latent diffusion model that
generates images conditional on a coarse RGB image and depth map, which are
derived from the coarse NeRF and latent text embedding from input images.
Extensive experiments show that our method is effective and can generate
perceptually high-quality NeRF even with very sparse inputs.
Unpaired Image-to-Image Translation via Neural Schrödinger Bridge
May 24, 2023
Beomsu Kim, Gihyun Kwon, Kwanyoung Kim, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion models are a powerful class of generative models which simulate
stochastic differential equations (SDEs) to generate data from noise. Although
diffusion models have achieved remarkable progress in recent years, they have
limitations in the unpaired image-to-image translation tasks due to the
Gaussian prior assumption. The Schrödinger Bridge (SB), which learns an SDE to
translate between two arbitrary distributions, has risen as an attractive
solution to this problem. However, no SB model so far has been successful at
unpaired translation between high-resolution images. In this work, we propose
the Unpaired Neural Schrödinger Bridge (UNSB), which combines SB with
adversarial training and regularization to learn an SB between
unpaired data. We demonstrate that UNSB is scalable, and that it successfully
solves various unpaired image-to-image translation tasks. Code:
\url{https://github.com/cyclomon/UNSB}
DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion
May 24, 2023
Taesun Yeom, Minhyeok Lee
Class-conditional image generation using generative adversarial networks
(GANs) has been investigated through various techniques; however, it continues
to face challenges such as mode collapse, training instability, and low-quality
output in cases of datasets with high intra-class variation. Furthermore, most
GANs require many training iterations to converge, resulting in poor training
efficiency. While Diffusion-GAN has shown potential in generating
realistic samples, it has a critical limitation in generating class-conditional
samples. To overcome these limitations, we propose a novel approach for
class-conditional image generation using GANs called DuDGAN, which incorporates
a dual diffusion-based noise injection process. Our method consists of three
unique networks: a discriminator, a generator, and a classifier. During the
training process, Gaussian-mixture noises are injected into the two noise-aware
networks, the discriminator and the classifier, in distinct ways. This noisy
data helps to prevent overfitting by gradually introducing more challenging
tasks, leading to improved model performance. As a result, our method
outperforms state-of-the-art conditional GAN models for image generation. We
evaluated our method on the AFHQ, Food-101, and CIFAR-10 datasets and observed
superior results across metrics such as FID, KID, Precision, and Recall
compared with competing models, highlighting the effectiveness of our approach.
May 24, 2023
Jaemoo Choi, Jaewoong Choi, Myungjoo Kang
The Optimal Transport (OT) problem seeks a transport map that bridges two
distributions while minimizing a given cost function. In this regard, OT
between a tractable prior distribution and the data distribution has been
utilized for generative modeling tasks. However, OT-based methods are
susceptible to outliers and face
optimization challenges during training. In this paper, we propose a novel
generative model based on the semi-dual formulation of Unbalanced Optimal
Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution
matching. This approach provides better robustness against outliers, stability
during training, and faster convergence. We validate these properties
empirically through experiments. Moreover, we study the theoretical upper-bound
of divergence between distributions in UOT. Our model outperforms existing
OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 5.80
on CelebA-HQ-256.
On the Generalization of Diffusion Model
May 24, 2023
Mingyang Yi, Jiacheng Sun, Zhenguo Li
Diffusion probabilistic generative models are widely used to generate
high-quality data. Though they can synthesize data that do not exist in the
training set, the rationale behind such generalization is still unexplored. In
this paper, we formally define the generalization of the generative model,
which is measured by the mutual information between the generated data and the
training set. The definition originates from the intuition that the model which
generates data with less correlation to the training set exhibits better
generalization ability. Meanwhile, we show that for the empirically optimal
diffusion model, the data generated by a deterministic sampler are all highly
related to the training set, implying poor generalization. This result
contradicts the observed ability of trained diffusion models (which approximate
the empirical optimum) to extrapolate and generate unseen data. To understand
this contradiction, we empirically examine the difference between a
sufficiently trained diffusion model and the empirical optimum. We find that,
even after sufficient training, a slight difference remains between them, and
that this difference is critical to making the diffusion model generalizable.
Moreover,
we propose another training objective whose empirical optimal solution has no
potential generalization problem. We empirically show that the proposed
training objective returns a similar model to the original one, which further
verifies the generalization ability of the trained diffusion model.
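As a compact restatement of the definition above (notation ours, not the
paper's): with $D$ the training set and $X_g$ a sample drawn from the trained
model $p_\theta$,

    \mathrm{Gen}(p_\theta) \;=\; I\!\left(X_g;\, D\right),
    \qquad \text{smaller } I(X_g; D) \;\Longrightarrow\; \text{better generalization.}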
Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models
May 24, 2023
Zhongjie Duan, Chengyu Wang, Cen Chen, Jun Huang, Weining Qian
In recent years, diffusion models have become the most popular and powerful
methods in the field of image synthesis, even rivaling human artists in
artistic creativity. However, the key issue currently limiting the application
of diffusion models is their extremely slow generation process. Although
several methods have been proposed to speed up the generation process, there
still exists a trade-off between efficiency and quality. In this paper, we
first provide a detailed theoretical and empirical analysis of the
scheduler-based generation process of diffusion models. We recast scheduler
design as the determination of several parameters, and further transform
the accelerated generation process into an expansion process of the linear
subspace. Based on these analyses, we consequently propose a novel method
called Optimal Linear Subspace Search (OLSS), which accelerates the generation
process by searching for the optimal approximation process of the complete
generation process in the linear subspaces spanned by latent variables. OLSS is
able to generate high-quality images with a very small number of steps. To
demonstrate the effectiveness of our method, we conduct extensive comparative
experiments on open-source diffusion models. Experimental results show that
with a given number of steps, OLSS can significantly improve the quality of
generated images. Using an NVIDIA A100 GPU, we make it possible to generate a
high-quality image by Stable Diffusion within only one second without other
optimization techniques.
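As a loose illustration of the "linear subspace" viewpoint (a toy sketch, not
the OLSS search procedure itself), one can ask how well a latent from a full
reference trajectory is approximated by a least-squares combination of the few
latents that a shortened schedule actually computes; all variable names below
are illustrative:

    import numpy as np

    def fit_linear_combination(kept_latents, target_latent):
        # Least-squares coefficients expressing `target_latent` in the span of
        # `kept_latents` (each latent flattened to a vector).
        A = np.stack([z.ravel() for z in kept_latents], axis=1)   # (dim, k)
        b = target_latent.ravel()                                 # (dim,)
        coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
        return coeffs

    # Toy usage with random stand-ins for latents from a reference trajectory.
    rng = np.random.default_rng(0)
    kept = [rng.standard_normal((4, 8, 8)) for _ in range(3)]
    target = 0.5 * kept[0] - 0.2 * kept[2] + 0.01 * rng.standard_normal((4, 8, 8))
    w = fit_linear_combination(kept, target)
    approx = sum(c * z for c, z in zip(w, kept))
    print(np.linalg.norm(approx - target) / np.linalg.norm(target))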
T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities
May 24, 2023
Kangfu Mei, Mo Zhou, Vishal M. Patel
Diffusion Probabilistic Field (DPF) models the distribution of continuous
functions defined over metric spaces. While DPF shows great potential for
unifying data generation of various modalities including images, videos, and 3D
geometry, it does not scale to a higher data resolution. This can be attributed
to the "scaling property", where it is difficult for the model to capture
local structures through uniform sampling. To this end, we propose a new model
that employs a view-wise sampling algorithm to focus on local structure
learning and incorporates additional guidance, e.g., text descriptions, to
complement the global geometry. The model can be scaled to generate
high-resolution data while unifying multiple modalities. Experimental results
on data generation in various modalities demonstrate the effectiveness of our
model, as well as its potential as a foundation framework for scalable
modality-unified visual content generation.
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
May 23, 2023
Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, Trevor Darrell
Diffusion models have been shown to be capable of generating high-quality
images, suggesting that they could contain meaningful internal representations.
Unfortunately, the feature maps that encode a diffusion model’s internal
information are spread not only over layers of the network, but also over
diffusion timesteps, making it challenging to extract useful descriptors. We
propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and
multi-timestep feature maps into per-pixel feature descriptors that can be used
for downstream tasks. These descriptors can be extracted for both synthetic and
real images using the generation and inversion processes. We evaluate the
utility of our Diffusion Hyperfeatures on the task of semantic keypoint
correspondence: our method achieves superior performance on the SPair-71k real
image benchmark. We also demonstrate that our method is flexible and
transferable: our feature aggregation network trained on the inversion features
of real image pairs can be used on the generation features of synthetic image
pairs with unseen objects and compositions. Our code is available at
\url{https://diffusion-hyperfeatures.github.io}.
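A minimal sketch of the aggregation step described above, assuming the
multi-layer, multi-timestep feature maps have already been extracted; the
module layout, shapes, and hyperparameters are illustrative rather than the
released implementation:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class FeatureAggregator(nn.Module):
        def __init__(self, channel_dims, out_dim=384):
            super().__init__()
            # One lightweight projection per (layer, timestep) feature source,
            # plus learned mixing weights, loosely mirroring a hyperfeature head.
            self.projs = nn.ModuleList([nn.Conv2d(c, out_dim, 1) for c in channel_dims])
            self.mix = nn.Parameter(torch.zeros(len(channel_dims)))

        def forward(self, feature_maps, out_size):
            weights = self.mix.softmax(dim=0)
            out = 0.0
            for w, proj, f in zip(weights, self.projs, feature_maps):
                f = F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
                out = out + w * proj(f)
            return out  # (B, out_dim, H, W) per-pixel descriptors

    # Toy usage: three feature maps of different resolutions and channel counts.
    feats = [torch.randn(1, 320, 32, 32), torch.randn(1, 640, 16, 16), torch.randn(1, 1280, 8, 8)]
    agg = FeatureAggregator([320, 640, 1280])
    print(agg(feats, out_size=(64, 64)).shape)  # torch.Size([1, 384, 64, 64])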
SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from Diffusion Models
May 23, 2023
Martin Gonzalez, Nelson Fernandez, Thuy Tran, Elies Gherbi, Hatem Hajri, Nader Masmoudi
cs.LG, cs.CV, cs.NA, math.NA, I.2.6
Diffusion Probabilistic Models (DPMs) have become a prominent and potent class
of generative models. A forward diffusion process gradually adds noise to data,
while a model learns to gradually denoise. Sampling from pre-trained DPMs
amounts to solving differential equations (DEs) defined by the learnt model, a
process which has been shown to be prohibitively slow. Numerous efforts to
speed up this process have focused on crafting powerful ODE solvers. Despite
being quick, such solvers do not usually reach the optimal quality achieved by
the available, slower SDE solvers. Our goal is to propose SDE solvers that
reach optimal quality without requiring several hundreds or thousands of NFEs.
In this work, we propose Stochastic Exponential
Derivative-free Solvers (SEEDS), improving and generalizing Exponential
Integrator approaches to the stochastic case on several frameworks. After
carefully analyzing the formulation of exact solutions of diffusion SDEs, we
craft SEEDS to analytically compute the linear part of such solutions. Inspired
by the Exponential Time-Differencing method, SEEDS uses a novel treatment of
the stochastic components of solutions, enabling the analytical computation of
their variance, and contains high-order terms that allow it to reach
optimal-quality sampling $\sim3$-$5\times$ faster than previous SDE methods. We
validate our
approach on several image generation benchmarks, showing that SEEDS outperforms
or is competitive with previous SDE solvers. Contrary to the latter, SEEDS are
derivative and training free, and we fully prove strong convergence guarantees
for them.
Improved Convergence of Score-Based Diffusion Models via Prediction-Correction
May 23, 2023
Francesco Pedrotti, Jan Maas, Marco Mondelli
cs.LG, math.ST, stat.ML, stat.TH
Score-based generative models (SGMs) are powerful tools to sample from
complex data distributions. Their underlying idea is to (i) run a forward
process for time $T_1$ by adding noise to the data, (ii) estimate its score
function, and (iii) use such estimate to run a reverse process. As the reverse
process is initialized with the stationary distribution of the forward one, the
existing analysis paradigm requires $T_1\to\infty$. This is however
problematic: from a theoretical viewpoint, for a given precision of the score
approximation, the convergence guarantee fails as $T_1$ diverges; from a
practical viewpoint, a large $T_1$ increases computational costs and leads to
error propagation. This paper addresses the issue by considering a version of
the popular predictor-corrector scheme: after running the forward process, we
first estimate the final distribution via an inexact Langevin dynamics and then
revert the process. Our key technical contribution is to provide convergence
guarantees in Wasserstein distance which require running the forward process
only for a finite time $T_1$. Our bounds exhibit a mild logarithmic dependence
on the input dimension and the subgaussian norm of the target distribution,
have minimal assumptions on the data, and require only to control the $L^2$
loss on the score approximation, which is the quantity minimized in practice.
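For intuition only, the corrector can be pictured as a few unadjusted Langevin
steps driven by an estimated score, run before reverting the process; the
sketch below is generic (the `score_fn` interface and the step size are
assumptions, not the paper's exact scheme):

    import torch

    def langevin_correct(x, score_fn, t, n_steps=20, step_size=1e-3):
        # Unadjusted Langevin dynamics targeting the forward process's law at time t:
        # x <- x + eps * score(x, t) + sqrt(2 * eps) * z,  with z ~ N(0, I).
        for _ in range(n_steps):
            z = torch.randn_like(x)
            x = x + step_size * score_fn(x, t) + (2 * step_size) ** 0.5 * z
        return x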
Realistic Noise Synthesis with Diffusion Models
May 23, 2023
Qi Wu, Mingyan Han, Ting Jiang, Haoqiang Fan, Bing Zeng, Shuaicheng Liu
Deep learning-based approaches have achieved remarkable performance in
single-image denoising. However, training denoising models typically requires a
large amount of data, which can be difficult to obtain in real-world scenarios.
Furthermore, synthetic noise used in the past often differs significantly from
real-world noise, owing to the complexity of the latter and the limited ability
of Generative Adversarial Network (GAN) models to capture noise distributions,
which leaves residual noise and artifacts in the trained denoising models. To
address these challenges, we propose a novel method for synthesizing realistic
noise using diffusion models. This approach enables us to generate large
amounts of high-quality data for training denoising models by controlling
camera settings to simulate different environmental conditions and by employing
guided multi-scale content information, so that the generated noise exhibits
realistic multi-frequency spatial correlations. In particular, we design an
inversion mechanism for the camera settings, which extends our method to public
datasets that lack such setting information. Based on the synthesized noise
dataset, we conduct extensive experiments on multiple benchmarks; the results
show that our method outperforms state-of-the-art methods across benchmarks and
metrics, demonstrating its effectiveness in synthesizing realistic noise for
training denoising models.
Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models
May 23, 2023
Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin
Recent text-to-image (T2I) diffusion models show outstanding performance in
generating high-quality images conditioned on textual prompts. However, these
models fail to semantically align the generated images with the text
descriptions due to their limited compositional capabilities, leading to
attribute leakage, entity leakage, and missing entities. In this paper, we
propose a novel attention mask control strategy based on predicted object boxes
to address these three issues. In particular, we first train a BoxNet to
predict a box for each entity that possesses the attribute specified in the
prompt. Then, depending on the predicted boxes, unique mask control is applied
to the cross- and self-attention maps. Our approach produces a more
semantically accurate synthesis by constraining the attention regions of each
token in the prompt to the image. In addition, the proposed method is
straightforward and effective, and can be readily integrated into existing
cross-attention-diffusion-based T2I generators. We compare our approach to
competing methods and demonstrate that it not only faithfully conveys the
semantics of the original text to the generated content, but also works as a
readily usable, plug-and-play component.
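A minimal sketch of box-based cross-attention masking in the spirit of the
method above; BoxNet, the diffusion U-Net, and the exact renormalization are
omitted or simplified, and all tensor shapes are assumptions:

    import torch

    def box_mask(h, w, box):
        # Binary (h, w) mask for a box given as (x0, y0, x1, y1) in normalized [0, 1] coords.
        ys = torch.linspace(0, 1, h).view(h, 1)
        xs = torch.linspace(0, 1, w).view(1, w)
        x0, y0, x1, y1 = box
        return ((xs >= x0) & (xs <= x1) & (ys >= y0) & (ys <= y1)).float()

    def mask_cross_attention(attn, token_boxes, h, w):
        # attn: (heads, h*w, n_tokens) cross-attention probabilities
        # (queries = spatial positions, keys = prompt tokens).
        attn = attn.clone()
        for tok, box in token_boxes.items():
            m = box_mask(h, w, box).reshape(1, h * w)
            attn[:, :, tok] = attn[:, :, tok] * m  # keep this token's attention inside its box
        # Renormalize over tokens for each spatial query.
        return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

    # Toy usage: token 3 is restricted to the left half of a 16x16 latent grid.
    attn = torch.rand(8, 16 * 16, 77).softmax(dim=-1)
    masked = mask_cross_attention(attn, {3: (0.0, 0.0, 0.5, 1.0)}, 16, 16)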
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
May 23, 2023
Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang
cs.CV, cs.CR, cs.CY, cs.LG, cs.SI
State-of-the-art Text-to-Image models like Stable Diffusion and DALLE$\cdot$2
are revolutionizing how people generate visual content. At the same time,
society has serious concerns about how adversaries can exploit such models to
generate unsafe images. In this work, we focus on demystifying the generation
of unsafe images and hateful memes from Text-to-Image models. We first
construct a typology of unsafe images consisting of five categories (sexually
explicit, violent, disturbing, hateful, and political). Then, we assess the
proportion of unsafe images generated by four advanced Text-to-Image models
using four prompt datasets. We find that these models can generate a
substantial percentage of unsafe images; across four models and four prompt
datasets, 14.56% of all generated images are unsafe. When comparing the four
models, we find different risk levels, with Stable Diffusion being the most
prone to generating unsafe content (18.92% of all generated images are unsafe).
Given Stable Diffusion’s tendency to generate more unsafe content, we evaluate
its potential to generate hateful meme variants if exploited by an adversary to
attack a specific individual or community. We employ three image editing
methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by
Stable Diffusion. Our evaluation result shows that 24% of the generated images
using DreamBooth are hateful meme variants that present the features of the
original hateful meme and the target individual/community; these generated
images are comparable to hateful meme variants collected from the real world.
Overall, our results demonstrate that the danger of large-scale generation of
unsafe images is imminent. We discuss several mitigating measures, such as
curating training data, regulating prompts, and implementing safety filters,
and encourage better safeguard tools to be developed to prevent unsafe
generation.
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
May 23, 2023
Weifeng Chen, Jie Wu, Pan Xie, Hefeng Wu, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin
cs.CV, cs.AI, cs.LG, cs.MM
This paper presents a controllable text-to-video (T2V) diffusion model, named
Video-ControlNet, that generates videos conditioned on a sequence of control
signals, such as edge or depth maps. Video-ControlNet is built on a pre-trained
conditional text-to-image (T2I) diffusion model by incorporating a
spatial-temporal self-attention mechanism and trainable temporal layers for
efficient cross-frame modeling. A first-frame conditioning strategy is proposed
to enable the model to generate videos transferred from the image domain as
well as arbitrary-length videos in an auto-regressive manner. Moreover,
Video-ControlNet employs a novel residual-based noise initialization strategy
to introduce motion prior from an input video, producing more coherent videos.
With the proposed architecture and strategies, Video-ControlNet can achieve
resource-efficient convergence and generate superior quality and consistent
videos with fine-grained control. Extensive experiments demonstrate its success
in various video generative tasks such as video editing and video style
transfer, outperforming previous methods in terms of consistency and quality.
Project Page: https://controlavideo.github.io/
WaveDM: Wavelet-Based Diffusion Models for Image Restoration
May 23, 2023
Yi Huang, Jiancheng Huang, Jianzhuang Liu, Yu Dong, Jiaxi Lv, Shifeng Chen
Recent diffusion-based methods for many image restoration tasks outperform
traditional models, but they suffer from long inference times. To tackle this,
this paper proposes a Wavelet-Based Diffusion Model (WaveDM) with an Efficient
Conditional Sampling (ECS) strategy. WaveDM learns the distribution of clean
images in the wavelet domain, conditioned on the wavelet spectrum of the
degraded images, which makes each sampling step cheaper than modeling in the
spatial domain. In addition, ECS follows the
same procedure as the deterministic implicit sampling in the initial sampling
period and then stops to predict clean images directly, which reduces the
number of total sampling steps to around 5. Evaluations on four benchmark
datasets including image raindrop removal, defocus deblurring, demoiréing,
and denoising demonstrate that WaveDM achieves state-of-the-art performance
with the efficiency that is comparable to traditional one-pass methods and over
100 times faster than existing image restoration methods using vanilla
diffusion models.
DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models
May 23, 2023
Lijun Li, Li'an Zhuo, Bang Zhang, Liefeng Bo, Chen Chen
Hand mesh reconstruction from a monocular image is challenging due to depth
ambiguity and severe occlusion: there remains a non-unique mapping between the
monocular image and the hand mesh. To address this, we develop
DiffHand, the first diffusion-based framework that approaches hand mesh
reconstruction as a denoising diffusion process. Our one-stage pipeline
utilizes noise to model the uncertainty distribution of the intermediate hand
mesh in a forward process. We reformulate the denoising diffusion process to
gradually refine the noisy hand mesh and then select the mesh with the highest
probability of being correct based on the image itself, rather than relying on
2D joints extracted beforehand. To better model the connectivity of hand
vertices, we design a novel network module called the cross-modality decoder.
Extensive experiments on the popular benchmarks demonstrate that our method
outperforms the state-of-the-art hand mesh reconstruction approaches by
achieving 5.8mm PA-MPJPE on the Freihand test set and 4.98mm PA-MPJPE on the
DexYCB test set.
Training Diffusion Models with Reinforcement Learning
May 22, 2023
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine
Diffusion models are a class of flexible generative models trained with an
approximation to the log-likelihood objective. However, most use cases of
diffusion models are not concerned with likelihoods, but instead with
downstream objectives such as human-perceived image quality or drug
effectiveness. In this paper, we investigate reinforcement learning methods for
directly optimizing diffusion models for such objectives. We describe how
posing denoising as a multi-step decision-making problem enables a class of
policy gradient algorithms, which we refer to as denoising diffusion policy
optimization (DDPO), that are more effective than alternative reward-weighted
likelihood approaches. Empirically, DDPO is able to adapt text-to-image
diffusion models to objectives that are difficult to express via prompting,
such as image compressibility, and those derived from human feedback, such as
aesthetic quality. Finally, we show that DDPO can improve prompt-image
alignment using feedback from a vision-language model without the need for
additional data collection or human annotation.
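As a hedged sketch of the policy-gradient view (not the authors' DDPO
implementation), treating each stochastic denoising step as an action allows a
REINFORCE-style update in which the reward of the final image weights the
accumulated step log-probabilities; the interfaces below are hypothetical:

    import torch

    def ddpo_style_loss(log_probs, rewards):
        # log_probs: (B, T) log-probability of each stochastic denoising step;
        # rewards:   (B,)  downstream reward (e.g. an aesthetic score) per final image.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        return -(advantages.unsqueeze(1) * log_probs).sum(dim=1).mean()

    def step_log_prob(x_prev, mu, sigma):
        # Gaussian log-probability of one denoising step x_{t-1} ~ N(mu, sigma^2 I),
        # accumulated while sampling the trajectory.
        return torch.distributions.Normal(mu, sigma).log_prob(x_prev).flatten(1).sum(dim=1)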
GSURE-Based Diffusion Model Training with Corrupted Data
May 22, 2023
Bahjat Kawar, Noam Elata, Tomer Michaeli, Michael Elad
Diffusion models have demonstrated impressive results in both data generation
and downstream tasks such as inverse problems, text-based editing,
classification, and more. However, training such models usually requires large
amounts of clean signals which are often difficult or impossible to obtain. In
this work, we propose a novel training technique for generative diffusion
models based only on corrupted data. We introduce a loss function based on the
Generalized Stein’s Unbiased Risk Estimator (GSURE), and prove that under some
conditions, it is equivalent to the training objective used in fully supervised
diffusion models. We demonstrate our technique on face images as well as
Magnetic Resonance Imaging (MRI), where the use of undersampled data
significantly alleviates data collection costs. Our approach achieves
generative performance comparable to its fully supervised counterpart without
training on any clean signals. In addition, we deploy the resulting diffusion
model in various downstream tasks beyond the degradation present in the
training set, showcasing promising results.
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
May 22, 2023
Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz
cs.SD, cs.CV, cs.LG, eess.AS
In recent years, image generation has shown a great leap in performance,
where diffusion models play a central role. Although generating high-quality
images, such models are mainly conditioned on textual descriptions. This raises
the question: “how can we adapt such models to be conditioned on other
modalities?” In this paper, we propose a novel method utilizing latent
diffusion models trained for text-to-image-generation to generate images
conditioned on audio recordings. Using a pre-trained audio encoding model, the
proposed method encodes audio into a new token, which can be considered as an
adaptation layer between the audio and text representations. Such a modeling
paradigm requires a small number of trainable parameters, making the proposed
approach appealing for lightweight optimization. Results suggest the proposed
method is superior to the evaluated baseline methods, considering objective and
subjective metrics. Code and samples are available at:
https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
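A minimal sketch of the audio-token idea, assuming a pre-trained audio encoder
and a frozen CLIP-like text encoder; module names, dimensions, and the
placeholder position are illustrative, not the released code:

    import torch
    import torch.nn as nn

    class AudioTokenAdapter(nn.Module):
        def __init__(self, audio_dim=768, token_dim=768):
            super().__init__()
            # Small trainable head mapping pooled audio features to one text-space token.
            self.proj = nn.Sequential(
                nn.Linear(audio_dim, token_dim), nn.GELU(), nn.Linear(token_dim, token_dim)
            )

        def forward(self, audio_embedding):      # (B, audio_dim) pooled audio features
            return self.proj(audio_embedding)    # (B, token_dim) "audio token"

    def splice_token(text_embeddings, audio_token, position):
        # Insert the audio token at a placeholder position in the prompt embeddings.
        out = text_embeddings.clone()
        out[:, position, :] = audio_token
        return out

    # Toy usage with random stand-ins for encoder outputs.
    adapter = AudioTokenAdapter()
    txt = torch.randn(2, 77, 768)                # CLIP-like prompt embeddings
    aud = torch.randn(2, 768)
    conditioned = splice_token(txt, adapter(aud), position=5)
    print(conditioned.shape)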
Hierarchical Integration Diffusion Model for Realistic Image Deblurring
May 22, 2023
Zheng Chen, Yulun Zhang, Ding Liu, Bin Xia, Jinjin Gu, Linghe Kong, Xin Yuan
Diffusion models (DMs) have recently been introduced in image deblurring and
exhibited promising performance, particularly in terms of details
reconstruction. However, the diffusion model requires a large number of
inference iterations to recover the clean image from pure Gaussian noise, which
consumes massive computational resources. Moreover, the distribution
synthesized by the diffusion model is often misaligned with the target results,
leading to restrictions in distortion-based metrics. To address the above
issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for
realistic image deblurring. Specifically, we perform the DM in a highly
compact latent space to generate the prior feature for the deblurring
process. The deblurring process is implemented by a regression-based method to
obtain better distortion accuracy. Meanwhile, the highly compact latent space
ensures the efficiency of the DM. Furthermore, we design the hierarchical
integration module to fuse the prior into the regression-based model from
multiple scales, enabling better generalization in complex blurry scenarios.
Comprehensive experiments on synthetic and real-world blur datasets demonstrate
that our HI-Diff outperforms state-of-the-art methods. Code and trained models
are available at https://github.com/zhengchen1999/HI-Diff.
Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?
May 22, 2023
Zheng Li, Yuxuan Li, Penghai Zhao, Renjie Song, Xiang Li, Jian Yang
Diffusion models have recently achieved astonishing performance in generating
high-fidelity photo-realistic images. Given their huge success, it is still
unclear whether synthetic images are applicable for knowledge distillation when
real images are unavailable. In this paper, we extensively study whether and
how synthetic images produced from state-of-the-art diffusion models can be
used for knowledge distillation without access to real images, and obtain three
key conclusions: (1) synthetic data from diffusion models can easily lead to
state-of-the-art performance among existing synthesis-based distillation
methods, (2) low-fidelity synthetic images are better teaching materials, and
(3) relatively weak classifiers are better teachers. Code is available at
https://github.com/zhengli97/DM-KD.
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
May 22, 2023
Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao
Text-to-speech (TTS) has undergone remarkable improvements in performance,
particularly with the advent of Denoising Diffusion Probabilistic Models
(DDPMs). However, the perceived quality of audio depends not solely on its
content, pitch, rhythm, and energy, but also on the physical environment. In
this work, we propose ViT-TTS, the first visual TTS model with scalable
diffusion transformers. ViT-TTS complements the phoneme sequence with visual
information to generate audio with high perceived quality, opening up new
avenues for practical applications in AR and VR that allow a more immersive and
realistic
audio experience. To mitigate the data scarcity in learning visual acoustic
information, we 1) introduce a self-supervised learning framework to enhance
both the visual-text encoder and the denoiser decoder; 2) leverage the
diffusion transformer, which is scalable in terms of parameters and capacity,
to learn visual scene information. Experimental results demonstrate that
ViT-TTS achieves new
state-of-the-art results, outperforming cascaded systems and other baselines
regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h),
ViT-TTS achieves results comparable to rich-resource
baselines.~\footnote{Audio samples are available at
\url{https://ViT-TTS.github.io/.}}
Towards Globally Consistent Stochastic Human Motion Prediction via Motion Diffusion
May 21, 2023
Jiarui Sun, Girish Chowdhary
Stochastic human motion prediction aims to predict multiple possible upcoming
pose sequences based on past human motion trajectories. Prior works focused
heavily on generating diverse motion samples, leading to predictions that are
inconsistent and abnormal with respect to the immediate past observations. To
address this issue, in
this work, we propose DiffMotion, a diffusion-based stochastic human motion
prediction framework that considers both the kinematic structure of the human
body and the globally temporally consistent nature of motion. Specifically,
DiffMotion consists of two modules: 1) a transformer-based network for
generating an initial motion reconstruction from corrupted motion, and 2) a
multi-stage graph convolutional network to iteratively refine the generated
motion based on past observations. Facilitated by the proposed direct target
prediction objective and the variance scheduler, our method is capable of
predicting accurate, realistic and consistent motion with an appropriate level
of diversity. Our results on benchmark datasets demonstrate that DiffMotion
outperforms previous methods by large margins in terms of accuracy and fidelity
while demonstrating superior robustness.
DiffUCD:Unsupervised Hyperspectral Image Change Detection with Semantic Correlation Diffusion Model
May 21, 2023
Xiangrong Zhang, Shunli Tian, Guanchun Wang, Huiyu Zhou, Licheng Jiao
Hyperspectral image change detection (HSI-CD) has emerged as a crucial
research area in remote sensing due to its ability to detect subtle changes on
the earth’s surface. Recently, denoising diffusion probabilistic models (DDPM)
have demonstrated remarkable performance in the generative domain. Apart
from their image generation capability, the denoising process in diffusion
models can comprehensively account for the semantic correlation of
spectral-spatial features in HSI, resulting in the retrieval of semantically
relevant features in the original image. In this work, we extend the diffusion
model’s application to the HSI-CD field and propose a novel unsupervised HSI-CD
with semantic correlation diffusion model (DiffUCD). Specifically, the semantic
correlation diffusion model (SCDM) leverages abundant unlabeled samples and
fully accounts for the semantic correlation of spectral-spatial features, which
mitigates pseudo change between multi-temporal images arising from inconsistent
imaging conditions. Besides, objects with the same semantic concept at the same
spatial location may exhibit inconsistent spectral signatures at different
times, resulting in pseudo change. To address this problem, we propose a
cross-temporal contrastive learning (CTCL) mechanism that aligns the spectral
feature representations of unchanged samples. By doing so, the spectral
difference invariant features caused by environmental changes can be obtained.
Experiments conducted on three publicly available datasets demonstrate that the
proposed method outperforms the other state-of-the-art unsupervised methods in
terms of Overall Accuracy (OA), Kappa Coefficient (KC), and F1 scores,
achieving improvements of approximately 3.95%, 8.13%, and 4.45%, respectively.
Notably, our method can achieve comparable results to those fully supervised
methods requiring numerous annotated samples.
Dual-Diffusion: Dual Conditional Denoising Diffusion Probabilistic Models for Blind Super-Resolution Reconstruction in RSIs
May 20, 2023
Mengze Xu, Jie Ma, Yuanyuan Zhu
Previous super-resolution reconstruction (SR) works are usually designed on
the assumption that the degradation operation is fixed, such as bicubic
downsampling. However, for remote sensing images, unexpected factors such as
weather and orbit altitude can cause blurred visual quality. Blind SR methods
have been proposed to deal with such varied degradations. There are two main
challenges of blind SR in RSIs: 1) the accurate estimation of degradation
kernels; 2) realistic image generation for this ill-posed problem. To rise to
the challenge, we propose a novel blind SR framework based on dual conditional
denoising diffusion probabilistic models (DDSR). In our work, we introduce
conditional denoising diffusion probabilistic models (DDPM) in two stages,
kernel estimation and reconstruction, hence the name dual-diffusion. In the
kernel estimation stage, conditioned on low-resolution (LR) images, a new
DDPM-based kernel predictor is constructed by studying the invertible mapping
between the kernel distribution and the latent distribution. In the
reconstruction stage, taking the predicted degradation kernels and the LR
images as conditional information, we construct a DDPM-based reconstructor to
learn the mapping from LR images to HR images. Comprehensive experiments show
the superiority of our proposal compared with SOTA blind SR methods. Source
code is available at
https://github.com/Lincoln20030413/DDSR
Any-to-Any Generation via Composable Diffusion
May 19, 2023
Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
cs.CV, cs.CL, cs.LG, cs.SD, eess.AS
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image,
video, or audio, from any combination of input modalities. Unlike existing
generative AI systems, CoDi can generate multiple modalities in parallel and
its input is not limited to a subset of modalities like text or image. Despite
the absence of training datasets for many combinations of modalities, we
propose to align modalities in both the input and output space. This allows
CoDi to freely condition on any input combination and generate any group of
modalities, even if they are not present in the training data. CoDi employs a
novel composable generation strategy which involves building a shared
multimodal space by bridging alignment in the diffusion process, enabling the
synchronized generation of intertwined modalities, such as temporally aligned
video and audio. Highly customizable and flexible, CoDi achieves strong
joint-modality generation quality, and outperforms or is on par with the
unimodal state-of-the-art for single-modality synthesis. The project page with
demonstrations and code is at https://codi-gen.github.io
The probability flow ODE is provably fast
May 19, 2023
Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, Adil Salim
cs.LG, math.ST, stat.ML, stat.TH
We provide the first polynomial-time convergence guarantees for the
probability flow ODE implementation (together with a corrector step) of
score-based generative modeling. Our analysis is carried out in the wake of
recent results obtaining such guarantees for the SDE-based implementation
(i.e., denoising diffusion probabilistic modeling or DDPM), but requires the
development of novel techniques for studying deterministic dynamics without
contractivity. Through the use of a specially chosen corrector step based on
the underdamped Langevin diffusion, we obtain better dimension dependence than
prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data
distribution), highlighting potential advantages of the ODE framework.
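For reference, the probability flow ODE analyzed here is the standard one
associated with a forward SDE $\mathrm{d}x = f(x,t)\,\mathrm{d}t + g(t)\,\mathrm{d}W_t$
(textbook form; in practice the score $\nabla_x \log p_t(x)$ is replaced by a
learned estimate):

    \frac{\mathrm{d}x}{\mathrm{d}t} \;=\; f(x,t) \;-\; \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x).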
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
May 19, 2023
Jinyi Hu, Xu Han, Xiaoyuan Yi, Yutong Chen, Wenhao Li, Zhiyuan Liu, Maosong Sun
Diffusion models have made impressive progress in text-to-image synthesis.
However, training such large-scale models (e.g., Stable Diffusion) from scratch
requires high computational costs and massive high-quality text-image pairs,
which becomes unaffordable in other languages. To handle this challenge, we
propose IAP, a simple but effective method to transfer English Stable Diffusion
into Chinese. IAP optimizes only a separate Chinese text encoder, with all
other parameters fixed, to align the Chinese semantic space with the English
one in CLIP.
To achieve this, we innovatively treat images as pivots and minimize the
distance of attentive features produced from cross-attention between images and
each language respectively. In this way, IAP establishes connections of
Chinese, English and visual semantics in CLIP’s embedding space efficiently,
advancing the quality of the generated image with direct Chinese prompts.
Experimental results show that our method outperforms several strong Chinese
diffusion models with only 5%-10% of the training data.
A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model
May 19, 2023
Ibrahim Malik, Siddique Latif, Raja Jurdak, Björn Schuller
In this paper, we propose to utilise diffusion models for data augmentation
in speech emotion recognition (SER). In particular, we present an effective
approach to utilise improved denoising diffusion probabilistic models (IDDPM)
to generate synthetic emotional data. We condition the IDDPM with the textual
embedding from bidirectional encoder representations from transformers (BERT)
to generate high-quality synthetic emotional samples in different speakers’
voices\footnote{synthetic samples URL:
\url{https://emulationai.com/research/diffusion-ser.}}. We implement a series
of experiments and show that better quality synthetic data helps improve SER
performance. We compare results with generative adversarial networks (GANs) and
show that the proposed model generates better-quality synthetic samples that
can considerably improve the performance of SER when augmented with synthetic
data.
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
May 18, 2023
Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, Animesh Garg
Object-centric learning aims to represent visual data with a set of object
entities (a.k.a. slots), providing structured representations that enable
systematic generalization. Leveraging advanced architectures like Transformers,
recent approaches have made significant progress in unsupervised object
discovery. In addition, slot-based representations hold great potential for
generative modeling, such as controllable image generation and object
manipulation in image editing. However, current slot-based methods often
produce blurry images and distorted objects, exhibiting poor generative
modeling capabilities. In this paper, we focus on improving slot-to-image
decoding, a crucial aspect for high-quality visual generation. We introduce
SlotDiffusion – an object-centric Latent Diffusion Model (LDM) designed for
both image and video data. Thanks to the powerful modeling capacity of LDMs,
SlotDiffusion surpasses previous slot models in unsupervised object
segmentation and visual generation across six datasets. Furthermore, our
learned object features can be utilized by existing object-centric dynamics
models, improving video prediction quality and downstream temporal reasoning
tasks. Finally, we demonstrate the scalability of SlotDiffusion to
unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated
with self-supervised pre-trained image encoders.
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
May 18, 2023
Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu
Achieving machine autonomy and human control often represent divergent
objectives in the design of interactive AI systems. Visual generative
foundation models such as Stable Diffusion show promise in navigating these
goals, especially when prompted with arbitrary languages. However, they often
fall short in generating images with spatial, structural, or geometric
controls. The integration of such controls, which can accommodate various
visual conditions in a single unified model, remains an unaddressed challenge.
In response, we introduce UniControl, a new generative foundation model that
consolidates a wide array of controllable condition-to-image (C2I) tasks within
a singular framework, while still allowing for arbitrary language prompts.
UniControl enables pixel-level-precise image generation, where visual
conditions primarily influence the generated structures and language prompts
guide the style and context. To equip UniControl with the capacity to handle
diverse visual conditions, we augment pretrained text-to-image diffusion models
and introduce a task-aware HyperNet to modulate the diffusion models, enabling
the adaptation to different C2I tasks simultaneously. Trained on nine unique
C2I tasks, UniControl demonstrates impressive zero-shot generation abilities
with unseen visual conditions. Experimental results show that UniControl often
surpasses the performance of single-task-controlled methods of comparable model
sizes. This control versatility positions UniControl as a significant
advancement in the realm of controllable visual generation.
Blackout Diffusion: Generative Diffusion Models in Discrete-State Spaces
May 18, 2023
Javier E Santos, Zachary R. Fox, Nicholas Lubbers, Yen Ting Lin
Typical generative diffusion models rely on a Gaussian diffusion process for
training the backward transformations, which can then be used to generate
samples from Gaussian noise. However, real-world data often lies in
discrete-state spaces, including in many scientific applications. Here, we
develop
a theoretical formulation for arbitrary discrete-state Markov processes in the
forward diffusion process using exact (as opposed to variational) analysis. We
relate the theory to the existing continuous-state Gaussian diffusion as well
as other approaches to discrete diffusion, and identify the corresponding
reverse-time stochastic process and score function in the continuous-time
setting, and the reverse-time mapping in the discrete-time setting. As an
example of this framework, we introduce "Blackout Diffusion", which learns to
produce samples from an empty image instead of from noise. Numerical
experiments on the CIFAR-10, Binarized MNIST, and CelebA datasets confirm the
feasibility of our approach. Generalizing from specific (Gaussian) forward
processes to discrete-state processes without a variational approximation sheds
light on how to interpret diffusion models, which we discuss.
Unsupervised Pansharpening via Low-rank Diffusion Model
May 18, 2023
Xiangyu Rui, Xiangyong Cao, Zeyu Zhu, Zongsheng Yue, Deyu Meng
Pansharpening is the process of merging a high-resolution panchromatic (PAN)
image and a low-resolution multispectral (LRMS) image to create a single
high-resolution multispectral (HRMS) image. Most existing deep learning-based
pansharpening methods have poor generalization ability, and the
traditional model-based pansharpening methods need careful manual exploration
for the image structure prior. To alleviate these issues, this paper proposes
an unsupervised pansharpening method by combining the diffusion model with the
low-rank matrix factorization technique. Specifically, we assume that the HRMS
image is decomposed into the product of two low-rank tensors, i.e., the base
tensor and the coefficient matrix. The base tensor lies in the image domain
and has a low spectral dimension, so we can conveniently utilize a pre-trained
remote sensing diffusion model to capture its image structures. Additionally,
we derive a simple yet quite effective way to pre-estimate the coefficient
matrix from the observed LRMS image, which preserves the spectral information
of the HRMS. Extensive experimental results on some benchmark datasets
demonstrate that our proposed method performs better than traditional
model-based approaches and has better generalization ability than deep
learning-based techniques. The code is released in
https://github.com/xyrui/PLRDiff.
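A toy sketch of the low-rank factorization described above, with the diffusion
prior on the base tensor omitted; the rank, shapes, and the SVD-based
pre-estimation below are illustrative assumptions rather than the released
method:

    import numpy as np

    def estimate_coefficients(lrms, r):
        # lrms: (h, w, S) low-res multispectral image. Returns an (r, S) coefficient
        # matrix spanning the dominant spectral subspace (via SVD).
        h, w, S = lrms.shape
        M = lrms.reshape(h * w, S)
        _, _, Vt = np.linalg.svd(M, full_matrices=False)
        return Vt[:r]

    def compose_hrms(base, coeffs):
        # base: (H, W, r) base tensor, coeffs: (r, S) -> (H, W, S) HRMS estimate.
        H, W, r = base.shape
        return (base.reshape(H * W, r) @ coeffs).reshape(H, W, coeffs.shape[1])

    # Toy usage with random stand-ins.
    lrms = np.random.rand(16, 16, 8)
    A = estimate_coefficients(lrms, r=3)
    hrms = compose_hrms(np.random.rand(64, 64, 3), A)
    print(hrms.shape)  # (64, 64, 8)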
Structural Pruning for Diffusion Models
May 18, 2023
Gongfan Fang, Xinyin Ma, Xinchao Wang
Generative modeling has recently undergone remarkable advancements, primarily
propelled by the transformative implications of Diffusion Probabilistic Models
(DPMs). The impressive capability of these models, however, often entails
significant computational overhead during both training and inference. To
tackle this challenge, we present Diff-Pruning, an efficient compression method
tailored for learning lightweight diffusion models from pre-existing ones,
without the need for extensive re-training. The essence of Diff-Pruning is
encapsulated in a Taylor expansion over pruned timesteps, a process that
disregards non-contributory diffusion steps and ensembles informative gradients
to identify important weights. Our empirical assessment, undertaken across four
diverse datasets, highlights two primary benefits of our proposed method: 1)
Efficiency: it enables approximately a 50% reduction in FLOPs at a mere 10% to
20% of the original training expenditure; 2) Consistency: the pruned diffusion
models inherently preserve generative behavior congruent with their pre-trained
progenitors. Code is available at \url{https://github.com/VainF/Diff-Pruning}.
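As a rough sketch of a first-order Taylor importance score in the spirit of
Diff-Pruning (the model and loss interfaces are hypothetical, and the actual
method's timestep selection and gradient ensembling are simplified away):

    import torch

    def taylor_importance(model, loss_fn, batches, timesteps):
        # Accumulate |w * dL/dw| over a subset of (informative) timesteps; weights
        # with the lowest accumulated scores are candidates for pruning.
        scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
        for x in batches:
            for t in timesteps:
                model.zero_grad()
                loss_fn(model, x, t).backward()
                for n, p in model.named_parameters():
                    if p.grad is not None:
                        scores[n] += (p.detach() * p.grad.detach()).abs()
        return scores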
Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data
May 18, 2023
Yusheng Tian, Wei Liu, Tan Lee
Creating synthetic voices with found data is challenging, as real-world
recordings often contain various types of audio degradation. One way to address
this problem is to pre-enhance the speech with an enhancement model and then
use the enhanced data for text-to-speech (TTS) model training. This paper
investigates the use of conditional diffusion models for generalized speech
enhancement, which aims at addressing multiple types of audio degradation
simultaneously. The enhancement is performed on the log Mel-spectrogram domain
to align with the TTS training objective. Text information is introduced as an
additional condition to improve the model robustness. Experiments on real-world
recordings demonstrate that the synthetic voice built on data enhanced by the
proposed model produces higher-quality synthetic speech, compared to those
trained on data enhanced by strong baselines. Code and pre-trained parameters
of the proposed enhancement model are available at
\url{https://github.com/dmse4tts/DMSE4TTS}
TextDiffuser: Diffusion Models as Text Painters
May 18, 2023
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
Diffusion models have gained increasing attention for their impressive
generation abilities but currently struggle with rendering accurate and
coherent text. To address this issue, we introduce TextDiffuser, focusing on
generating images with visually appealing text that is coherent with
backgrounds. TextDiffuser consists of two stages: first, a Transformer model
generates the layout of keywords extracted from text prompts, and then
diffusion models generate images conditioned on the text prompt and the
generated layout. Additionally, we contribute the first large-scale text images
dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs
with text recognition, detection, and character-level segmentation annotations.
We further collect the MARIO-Eval benchmark to serve as a comprehensive tool
for evaluating text rendering quality. Through experiments and user studies, we
show that TextDiffuser is flexible and controllable to create high-quality text
images using text prompts alone or together with text template images, and
conduct text inpainting to reconstruct incomplete images with text. The code,
model, and dataset will be available at \url{https://aka.ms/textdiffuser}.
LDM3D: Latent Diffusion Model for 3D
May 18, 2023
Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal
This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that
generates both image and depth map data from a given text prompt, allowing
users to generate RGBD images from text prompts. The LDM3D model is fine-tuned
on a dataset of tuples containing an RGB image, depth map and caption, and
validated through extensive experiments. We also develop an application called
DepthFusion, which uses the generated RGB images and depth maps to create
immersive and interactive 360-degree-view experiences using TouchDesigner. This
technology has the potential to transform a wide range of industries, from
entertainment and gaming to architecture and design. Overall, this paper
presents a significant contribution to the field of generative AI and computer
vision, and showcases the potential of LDM3D and DepthFusion to revolutionize
content creation and digital experiences. A short video summarizing the
approach can be found at https://t.ly/tdi2.
DiffUTE: Universal Text Editing Diffusion Model
May 18, 2023
Haoxing Chen, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Xing Zheng, Yaohui Li, Changhua Meng, Huijia Zhu, Weiqiang Wang
Diffusion model based language-guided image editing has achieved great
success recently. However, existing state-of-the-art diffusion models struggle
with rendering correct text and text style during generation. To tackle this
problem, we propose a universal self-supervised text editing diffusion model
(DiffUTE), which aims to replace or modify words in the source image with
another one while maintaining its realistic appearance. Specifically, we build
our model on a diffusion model and carefully modify the network structure to
enable the model for drawing multilingual characters with the help of glyph and
position information. Moreover, we design a self-supervised learning framework
to leverage large amounts of web data to improve the representation ability of
the model. Experimental results show that our method achieves an impressive
performance and enables controllable editing on in-the-wild images with high
fidelity. Our code will be available at
\url{https://github.com/chenhaoxing/DiffUTE}.
Democratized Diffusion Language Model
May 18, 2023
Nikita Balagansky, Daniil Gavrilov
Despite the potential benefits of Diffusion Models for NLP applications,
publicly available implementations, trained models, and reproducible training
procedures are currently lacking. We present the Democratized Diffusion
Language Model (DDLM), based on the Continuous Diffusion for
Categorical Data (CDCD) framework, to address these challenges. We propose a
simplified training procedure for DDLM using the C4 dataset and perform an
in-depth analysis of the trained model’s behavior. Furthermore, we introduce a
novel early-exiting strategy for faster sampling with models trained with score
interpolation. Since no previous work has aimed at solving downstream tasks
(e.g., classification) with a pre-trained Diffusion LM, we experimented with
the GLUE Benchmark to study the ability of DDLM to transfer knowledge. With
this paper, we make training and evaluation pipelines, as well as pre-trained
DDLM models, available to other researchers for future research with Diffusion
LMs.
Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders
May 18, 2023
Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji
Diffusion-based speech enhancement (SE) has been investigated recently, but
its decoding is very time-consuming. One solution is to initialize the decoding
process with the enhanced feature estimated by a predictive SE system. However,
this two-stage method ignores the complementarity between predictive and
diffusion SE. In this paper, we propose a unified system that integrates these
two SE modules. The system encodes both generative and predictive information,
and then applies both generative and predictive decoders, whose outputs are
fused. Specifically, the two SE modules are fused in the first and final
diffusion steps: the first step fusion initializes the diffusion process with
the predictive SE for improving the convergence, and the final step fusion
combines the two complementary SE outputs to improve the SE performance.
Experiments on the Voice-Bank dataset show that the diffusion score estimation
can benefit from the predictive information and speed up the decoding.
Discriminative Diffusion Models as Few-shot Vision and Language Learners
May 18, 2023
Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
Diffusion models, such as Stable Diffusion, have shown incredible performance
on text-to-image generation. Since text-to-image generation often requires
models to generate visual concepts with fine-grained details and attributes
specified in text prompts, can we leverage the powerful representations learned
by pre-trained diffusion models for discriminative tasks such as image-text
matching? To answer this question, we propose a novel approach, Discriminative
Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models
into few-shot discriminative learners. Our approach uses the cross-attention
score of a Stable Diffusion model to capture the mutual influence between
visual and textual information and fine-tune the model via attention-based
prompt learning to perform image-text matching. By comparing DSD with
state-of-the-art methods on several benchmark datasets, we demonstrate the
potential of using pre-trained diffusion models for discriminative tasks with
superior results on few-shot image-text matching.
Sampling, Diffusions, and Stochastic Localization
May 18, 2023
Andrea Montanari
Diffusions are a successful technique to sample from high-dimensional
distributions that can be either explicitly given or learnt from a collection
of samples. They implement a diffusion process whose endpoint is a sample from
the
target distribution and whose drift is typically represented as a neural
network. Stochastic localization is a successful technique to prove mixing of
Markov Chains and other functional inequalities in high dimension. An
algorithmic version of stochastic localization was introduced in [EAMS2022], to
obtain an algorithm that samples from certain statistical mechanics models.
These notes have three objectives: (i) generalize the construction of
[EAMS2022] to other stochastic localization processes; (ii) clarify the
connection between diffusions and stochastic localization; in particular, we
show that standard denoising diffusions are stochastic localizations, and that
other examples are naturally suggested by the proposed viewpoint; (iii)
describe some insights that follow from this viewpoint.
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
May 17, 2023
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji
Despite tremendous progress in generating high-quality images using diffusion
models, synthesizing a sequence of animated frames that are both photorealistic
and temporally coherent is still in its infancy. While off-the-shelf
billion-scale datasets for image generation are available, collecting similar
video data of the same scale is still challenging. Also, training a video
diffusion model is computationally much more expensive than its image
counterpart. In this work, we explore finetuning a pretrained image diffusion
model with video data as a practical solution for the video synthesis task. We
find that naively extending the image noise prior to a video noise prior in
video diffusion leads to sub-optimal performance. Our carefully designed video
noise
prior leads to substantially better performance. Extensive experimental
validation shows that our model, Preserve Your Own Correlation (PYoCo), attains
SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It
also achieves SOTA video generation quality on the small-scale UCF-101
benchmark with a $10\times$ smaller model using significantly less computation
than the prior art.
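The abstract does not spell out the noise prior; as a hedged illustration, the sketch
below builds frame-correlated noise by mixing one shared Gaussian with per-frame
Gaussians. The mixing rule and the ratio `alpha` are assumptions, not PYoCo's published
prior.

```python
import torch

def correlated_video_noise(num_frames, shape, alpha=1.0, generator=None):
    """Sample per-frame noise that shares a common component across frames.

    Each frame's noise mixes one shared Gaussian with an independent Gaussian,
    normalized so every frame stays unit-variance. Generic correlated-noise
    sketch; `alpha` and the mixing rule are not PYoCo's exact formulation.
    """
    shared = torch.randn(shape, generator=generator)              # common across frames
    indep = torch.randn(num_frames, *shape, generator=generator)  # per-frame
    w_shared = alpha / (1.0 + alpha ** 2) ** 0.5
    w_indep = 1.0 / (1.0 + alpha ** 2) ** 0.5
    return w_shared * shared.unsqueeze(0) + w_indep * indep       # (num_frames, *shape)

noise = correlated_video_noise(num_frames=16, shape=(4, 32, 32), alpha=1.0)
print(noise.shape, noise.var().item())  # variance stays close to 1 per frame
```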
Raising the Bar for Certified Adversarial Robustness with Diffusion Models
May 17, 2023
Thomas Altstidl, David Dobre, Björn Eskofier, Gauthier Gidel, Leo Schwinn
Certified defenses against adversarial attacks offer formal guarantees on the
robustness of a model, making them more reliable than empirical methods such as
adversarial training, whose effectiveness is often later reduced by unseen
attacks. Still, the limited certified robustness that is currently achievable
has been a bottleneck for their practical adoption. Gowal et al. and Wang et
al. have shown that generating additional training data using state-of-the-art
diffusion models can considerably improve the robustness of adversarial
training. In this work, we demonstrate that a similar approach can
substantially improve deterministic certified defenses. In addition, we provide
a list of recommendations to scale the robustness of certified training
approaches. One of our main insights is that the generalization gap, i.e., the
difference between the training and test accuracy of the original model, is a
good predictor of the magnitude of the robustness improvement when using
additional generated data. Our approach achieves state-of-the-art deterministic
robustness certificates on CIFAR-10 for the $\ell_2$ ($\epsilon = 36/255$) and
$\ell_\infty$ ($\epsilon = 8/255$) threat models, outperforming the previous
best results by $+3.95\%$ and $+1.39\%$, respectively. Furthermore, we report
similar improvements for CIFAR-100.
Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important?
May 16, 2023
Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He
This study examines the impact of optimizing the Stable Diffusion (SD) guided
inference pipeline. We propose optimizing certain denoising steps by limiting
the noise computation to conditional noise and eliminating unconditional noise
computation, thereby reducing the complexity of the target iterations by 50%.
Additionally, we demonstrate that later iterations of the SD are less sensitive
to optimization, making them ideal candidates for applying the suggested
optimization. Our experiments show that optimizing the last 20% of the
denoising loop iterations results in an 8.2% reduction in inference time with
almost no perceivable changes to the human eye. Furthermore, we found that by
extending the optimization to 50% of the last iterations, we can reduce
inference time by approximately 20.3%, while still generating visually pleasing
images.
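A minimal sketch of the proposed optimization, assuming a generic noise-prediction model
and sampler update (both stand-ins, not a real library API): the unconditional pass is
skipped for the last fraction of denoising steps.

```python
import torch

def selective_cfg_sampling(denoise, step, x, timesteps, cond, uncond,
                           guidance_scale=7.5, skip_uncond_frac=0.2):
    """Classifier-free guidance loop that skips the unconditional pass near the end.

    `denoise(x, t, c)` is a placeholder noise predictor and `step(x, eps, t)` a
    placeholder sampler update; both are assumed interfaces. For the final
    `skip_uncond_frac` of the iterations only the conditional branch runs,
    roughly halving the cost of those steps.
    """
    cutoff = int(len(timesteps) * (1.0 - skip_uncond_frac))
    for i, t in enumerate(timesteps):
        eps_cond = denoise(x, t, cond)
        if i < cutoff:
            eps_uncond = denoise(x, t, uncond)
            eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
        else:
            eps = eps_cond  # conditional noise only in the late, less sensitive steps
        x = step(x, eps, t)
    return x

# Toy usage with stand-in model and sampler update (illustrative only).
denoise = lambda x, t, c: torch.zeros_like(x)
step = lambda x, eps, t: x - 0.1 * eps
x0 = selective_cfg_sampling(denoise, step, torch.randn(1, 4, 8, 8),
                            list(range(50, 0, -1)), cond="prompt", uncond="")
```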
A score-based operator Newton method for measure transport
May 16, 2023
Nisha Chandramoorthy, Florian Schaefer, Youssef Marzouk
math.ST, cs.LG, cs.NA, math.NA, stat.TH
Transportation of probability measures underlies many core tasks in
statistics and machine learning, from variational inference to generative
modeling. A typical goal is to represent a target probability measure of
interest as the push-forward of a tractable source measure through a learned
map. We present a new construction of such a transport map, given the ability
to evaluate the score of the target distribution. Specifically, we characterize
the map as a zero of an infinite-dimensional score-residual operator and derive
a Newton-type method for iteratively constructing such a zero. We prove
convergence of these iterations by invoking classical elliptic regularity
theory for partial differential equations (PDE) and show that this construction
enjoys rapid convergence, under smoothness assumptions on the target score. A
key element of our approach is a generalization of the elementary Newton method
to infinite-dimensional operators, other forms of which have appeared in
nonlinear PDE and in dynamical systems. Our Newton construction, while
developed in a functional setting, also suggests new iterative algorithms for
approximating transport maps.
Expressiveness Remarks for Denoising Diffusion Models and Samplers
May 16, 2023
Francisco Vargas, Teodora Reu, Anna Kerekes
Denoising diffusion models are a class of generative models which have
recently achieved state-of-the-art results across many domains. Gradual noise
is added to the data using a diffusion process, which transforms the data
distribution into a Gaussian. Samples from the generative model are then
obtained by simulating an approximation of the time reversal of this diffusion
initialized by Gaussian samples. Recent research has explored adapting
diffusion models for sampling and inference tasks. In this paper, we leverage
known connections to stochastic control akin to the Föllmer drift to extend
established neural network approximation results for the Föllmer drift to
denoising diffusion models and samplers.
AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
May 16, 2023
Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, Weizhu Chen
Diffusion models have gained significant attention in the realm of image
generation due to their exceptional performance. Their success has been
recently expanded to text generation via generating all tokens within a
sequence concurrently. However, natural language exhibits a far more pronounced
sequential dependency in comparison to images, and the majority of existing
language models are trained with a left-to-right auto-regressive approach. To
account for the inherent sequential characteristic of natural language, we
introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that
the generation of tokens on the right depends on the generated ones on the
left, a mechanism achieved through employing a dynamic number of denoising
steps that vary based on token position. This results in tokens on the left
undergoing fewer denoising steps than those on the right, thereby enabling them
to generate earlier and subsequently influence the generation of tokens on the
right. In a series of experiments on various text generation tasks, including
text summarization, machine translation, and common sense generation,
AR-Diffusion clearly demonstrated its superiority over existing diffusion
language models and that it can be $100\times\sim600\times$ faster when
achieving comparable results. Our code is available at
https://github.com/microsoft/ProphetNet/tree/master/AR-diffusion.
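To make the position-dependent denoising concrete, here is a toy schedule in which
tokens on the left get a head start and reach timestep zero earlier; the linear `lag`
rule is an illustrative assumption, not AR-Diffusion's exact movement function.

```python
import numpy as np

def token_timesteps(global_step, seq_len, max_t=1000, lag=150):
    """Toy position-dependent timestep schedule for text diffusion.

    Tokens further to the left receive a larger head start (`lag` per position),
    so they reach timestep 0 earlier and then stay clean while tokens to their
    right continue denoising. Illustrative only.
    """
    positions = np.arange(seq_len)
    head_start = (seq_len - 1 - positions) * lag
    return np.clip(global_step - head_start, 0, max_t).astype(int)

# At an intermediate decoding step the leftmost tokens are already nearly clean.
print(token_timesteps(global_step=900, seq_len=6))  # e.g. [150 300 450 600 750 900]
```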
Discrete Diffusion Probabilistic Models for Symbolic Music Generation
May 16, 2023
Matthias Plasser, Silvan Peter, Gerhard Widmer
Denoising Diffusion Probabilistic Models (DDPMs) have made great strides in
generating high-quality samples in both discrete and continuous domains.
However, Discrete DDPMs (D3PMs) have yet to be applied to the domain of
Symbolic Music. This work presents the direct generation of Polyphonic Symbolic
Music using D3PMs. Our model exhibits state-of-the-art sample quality,
according to current quantitative evaluation metrics, and allows for flexible
infilling at the note level. We further show that our models are amenable to
post-hoc classifier guidance, widening the scope of possible applications.
However, we also cast a critical view on quantitative evaluation of music
sample quality via statistical metrics, and present a simple algorithm that can
confound our metrics with completely spurious, non-musical samples.
A Conditional Denoising Diffusion Probabilistic Model for Radio Interferometric Image Reconstruction
May 16, 2023
Ruoqi Wang, Zhuoyang Chen, Qiong Luo, Feng Wang
astro-ph.IM, astro-ph.GA, cs.CV, cs.LG, eess.IV
In radio astronomy, signals from radio telescopes are transformed into images
of observed celestial objects, or sources. However, these images, called dirty
images, contain real sources as well as artifacts due to signal sparsity and
other factors. Therefore, radio interferometric image reconstruction is
performed on dirty images, aiming to produce clean images in which artifacts
are reduced and real sources are recovered. So far, existing methods have
limited success on recovering faint sources, preserving detailed structures,
and eliminating artifacts. In this paper, we present VIC-DDPM, a Visibility and
Image Conditioned Denoising Diffusion Probabilistic Model. Our main idea is to
use both the original visibility data in the spectral domain and dirty images
in the spatial domain to guide the image generation process with DDPM. This
way, we can leverage DDPM to generate fine details and eliminate noise, while
utilizing visibility data to separate signals from noise and retaining spatial
information in dirty images. We have conducted experiments in comparison with
both traditional methods and recent deep learning based approaches. Our results
show that our method significantly improves the resulting images by reducing
artifacts, preserving fine details, and recovering dim sources. This
advancement further facilitates radio astronomical data analysis tasks on
celestial phenomena.
Denoising Diffusion Models for Plug-and-Play Image Restoration
May 15, 2023
Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, Luc Van Gool
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible
and interpretable method for solving various inverse problems by utilizing any
off-the-shelf denoiser as the implicit image prior. However, most existing
methods focus on discriminative Gaussian denoisers. Although diffusion models
have shown impressive performance for high-quality image synthesis, their
potential to serve as a generative denoiser prior for plug-and-play IR
methods remains to be further explored. While several other attempts have been
made to adopt diffusion models for image restoration, they either fail to
achieve satisfactory results or typically require an unacceptable number of
Neural Function Evaluations (NFEs) during inference. This paper proposes
DiffPIR, which integrates the traditional plug-and-play method into the
diffusion sampling framework. Compared to plug-and-play IR methods that rely on
discriminative Gaussian denoisers, DiffPIR is expected to inherit the
generative ability of diffusion models. Experimental results on three
representative IR tasks, including super-resolution, image deblurring, and
inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on
both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and
perceptual quality with no more than 100 NFEs. The source code is available at
https://github.com/yuanzhi-zhu/DiffPIR.
Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models
May 15, 2023
Antoni Bigata Casademunt, Rodrigo Mira, Nikita Drobyshev, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Speech-driven animation has gained significant traction in recent years, with
current methods achieving near-photorealistic results. However, the field
remains underexplored regarding non-verbal communication despite evidence
demonstrating its importance in human interaction. In particular, generating
laughter sequences presents a unique challenge due to the intricacy and nuances
of this behaviour. This paper aims to bridge this gap by proposing a novel
model capable of generating realistic laughter sequences, given a still
portrait and an audio clip containing laughter. We highlight the failure cases
of traditional facial animation methods and leverage recent advances in
diffusion models to produce convincing laughter videos. We train our model on a
diverse set of laughter datasets and introduce an evaluation metric
specifically designed for laughter. When compared with previous speech-driven
approaches, our model achieves state-of-the-art performance across all metrics,
even when these are re-trained for laughter generation.
May 15, 2023
Ryan Webster
Recently, Carlini et al. demonstrated the widely used model Stable Diffusion
can regurgitate real training samples, which is troublesome from a copyright
perspective. In this work, we provide an efficient extraction attack on par
with the recent attack, requiring several orders of magnitude fewer network
evaluations. In the process, we expose a new phenomenon, which we dub template
verbatims, wherein a diffusion model will regurgitate a training sample largely
intact. Template verbatims are harder to detect as they require retrieval and
masking to correctly label. Furthermore, they are still generated by newer
systems, even those which de-duplicate their training set, and we give insight
into why they still appear during generation. We extract training images from
several state of the art systems, including Stable Diffusion 2.0, Deep Image
Floyd, and finally Midjourney v4. We release code to verify our extraction
attack, perform the attack, as well as all extracted prompts at
https://github.com/ryanwebster90/onestep-extraction.
Common Diffusion Noise Schedules and Sample Steps are Flawed
May 15, 2023
Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang
We discover that common diffusion noise schedules do not enforce the last
timestep to have zero signal-to-noise ratio (SNR), and some implementations of
diffusion samplers do not start from the last timestep. Such designs are flawed
and do not reflect the fact that the model is given pure Gaussian noise at
inference, creating a discrepancy between training and inference. We show that
the flawed design causes real problems in existing implementations. In Stable
Diffusion, it severely limits the model to only generate images with medium
brightness and prevents it from generating very bright and dark samples. We
propose a few simple fixes: (1) rescale the noise schedule to enforce zero
terminal SNR; (2) train the model with v prediction; (3) change the sampler to
always start from the last timestep; (4) rescale classifier-free guidance to
prevent over-exposure. These simple changes ensure the diffusion process is
congruent between training and inference and allow the model to generate
samples more faithful to the original data distribution.
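A short sketch of fix (1), rescaling an existing beta schedule so the terminal SNR is
exactly zero; it follows the commonly used open-source formulation and may differ in
minor details from the paper's algorithm.

```python
import torch

def rescale_betas_zero_terminal_snr(betas):
    """Rescale a beta schedule so the final timestep has exactly zero SNR.

    Shift sqrt(alpha_bar) so its last value is 0, rescale so its first value is
    unchanged, then convert back to betas.
    """
    alphas_bar_sqrt = (1.0 - betas).cumprod(dim=0).sqrt()
    first, last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt = (alphas_bar_sqrt - last) * first / (first - last)
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas

# Example with a linear schedule: the terminal alpha_bar becomes exactly zero,
# i.e. the model sees pure Gaussian noise at the last timestep.
betas = torch.linspace(0.00085, 0.012, 1000)
new_betas = rescale_betas_zero_terminal_snr(betas)
print((1.0 - new_betas).cumprod(dim=0)[-1])
```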
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
May 15, 2023
Rabeeh Karimi Mahabadi, Jaesung Tae, Hamish Ivison, James Henderson, Iz Beltagy, Matthew E. Peters, Arman Cohan
Diffusion models have emerged as a powerful paradigm for generation,
obtaining strong performance in various domains with continuous-valued inputs.
Despite the promises of fully non-autoregressive text generation, applying
diffusion models to natural language remains challenging due to its discrete
nature. In this work, we propose Text-to-text Self-conditioned Simplex
Diffusion (TESS), a text diffusion model that is fully non-autoregressive,
employs a new form of self-conditioning, and applies the diffusion process on
the logit simplex space rather than the typical learned embedding space.
Through extensive experiments on natural language understanding and generation
tasks including summarization, text simplification, paraphrase generation, and
question generation, we demonstrate that TESS outperforms state-of-the-art
non-autoregressive models and is competitive with pretrained autoregressive
sequence-to-sequence models.
May 14, 2023
Wentao Hu, Xiurong Jiang, Jiarun Liu, Yuqi Yang, Hui Tian
In the field of few-shot learning (FSL), extensive research has focused on
improving network structures and training strategies. However, the role of data
processing modules has not been fully explored. Therefore, in this paper, we
propose Meta-DM, a generalized data processing module for FSL problems based on
diffusion models. Meta-DM is a simple yet effective module that can be easily
integrated with existing FSL methods, leading to significant performance
improvements in both supervised and unsupervised settings. We provide a
theoretical analysis of Meta-DM and evaluate its performance on several
algorithms. Our experiments show that combining Meta-DM with certain methods
achieves state-of-the-art results.
May 14, 2023
Raza Imam, Muhammad Huzaifa, Mohammed El-Amine Azz
Privacy and confidentiality of medical data are of utmost importance in
healthcare settings. Vision Transformers (ViTs), the SOTA vision models, rely on
large amounts of patient data for training, which raises concerns about data
security and the potential for unauthorized access. Adversaries may exploit
vulnerabilities in ViTs to extract sensitive patient information, compromising
patient privacy. This work addresses these vulnerabilities to ensure the
trustworthiness and reliability of ViTs in medical applications. Specifically, we introduce a
defensive diffusion technique as an adversarial purifier to eliminate
adversarial noise introduced by attackers in the original image. By utilizing
the denoising capabilities of the diffusion model, we employ a reverse
diffusion process to effectively eliminate the adversarial noise from the
attack sample, resulting in a cleaner image that is then fed into the ViT
blocks. Our findings demonstrate the effectiveness of the diffusion model in
eliminating attack-agnostic adversarial noise from images. Additionally, we
propose combining knowledge distillation with our framework to obtain a
lightweight student model that is both computationally efficient and robust
against gray box attacks. Comparison of our method with a SOTA baseline method,
SEViT, shows that our work is able to outperform the baseline. Extensive
experiments conducted on a publicly available Tuberculosis X-ray dataset
validate the computational efficiency and improved robustness achieved by our
proposed architecture.
Beware of diffusion models for synthesizing medical images – A comparison with GANs in terms of memorizing brain tumor images
May 12, 2023
Muhammad Usman Akbar, Wuhao Wang, Anders Eklund
Diffusion models were initially developed for text-to-image generation and
are now being utilized to generate high quality synthetic images. Preceded by
GANs, diffusion models have shown impressive results using various evaluation
metrics. However, commonly used metrics such as FID and IS are not suitable for
determining whether diffusion models are simply reproducing the training
images. Here we train StyleGAN and diffusion models, using BRATS20 and BRATS21
datasets, to synthesize brain tumor images, and measure the correlation between
the synthetic images and all training images. Our results show that diffusion
models are much more likely to memorize the training images, especially for
small datasets. Researchers should be careful when using diffusion models for
medical imaging, if the final goal is to share the synthetic images.
Provably Convergent Schrödinger Bridge with Applications to Probabilistic Time Series Imputation
May 12, 2023
Yu Chen, Wei Deng, Shikai Fang, Fengpei Li, Nicole Tianjiao Yang, Yikai Zhang, Kashif Rasul, Shandian Zhe, Anderson Schneider, Yuriy Nevmyvaka
The Schrödinger bridge problem (SBP) is gaining increasing attention in
generative modeling and showing promising potential even in comparison with the
score-based generative models (SGMs). SBP can be interpreted as an
entropy-regularized optimal transport problem, which conducts projections onto
every other marginal alternatingly. However, in practice, only approximated
projections are accessible and their convergence is not well understood. To
fill this gap, we present a first convergence analysis of the Schrödinger
bridge algorithm based on approximated projections. As for its practical
applications, we apply SBP to probabilistic time series imputation by
generating missing values conditioned on observed data. We show that optimizing
the transport cost improves the performance and the proposed algorithm achieves
the state-of-the-art result in healthcare and environmental data while
exhibiting the advantage of exploring both temporal and feature patterns in
probabilistic time series imputation.
Exploiting Diffusion Prior for Real-World Image Super-Resolution
May 11, 2023
Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy
We present a novel approach to leverage prior knowledge encapsulated in
pre-trained text-to-image diffusion models for blind super-resolution (SR).
Specifically, by employing our time-aware encoder, we can achieve promising
restoration results without altering the pre-trained synthesis model, thereby
preserving the generative prior and minimizing training cost. To remedy the
loss of fidelity caused by the inherent stochasticity of diffusion models, we
introduce a controllable feature wrapping module that allows users to balance
quality and fidelity by simply adjusting a scalar value during the inference
process. Moreover, we develop a progressive aggregation sampling strategy to
overcome the fixed-size constraints of pre-trained diffusion models, enabling
adaptation to resolutions of any size. A comprehensive evaluation of our method
using both synthetic and real-world benchmarks demonstrates its superiority
over current state-of-the-art approaches.
MolDiff: Addressing the Atom-Bond Inconsistency Problem in 3D Molecule Diffusion Generation
May 11, 2023
Xingang Peng, Jiaqi Guan, Qiang Liu, Jianzhu Ma
q-bio.BM, cs.LG, q-bio.QM
Deep generative models have recently achieved superior performance in 3D
molecule generation. Most of them first generate atoms and then add chemical
bonds based on the generated atoms in a post-processing manner. However, there
might be no corresponding bond solution for the temporally generated atoms as
their locations are generated without considering potential bonds. We define
this problem as the atom-bond inconsistency problem and claim it is the main
reason for current approaches to generating unrealistic 3D molecules. To
overcome this problem, we propose a new diffusion model called MolDiff which
can generate atoms and bonds simultaneously while still maintaining their
consistency by explicitly modeling the dependence between their relationships.
We evaluated the generation ability of our proposed model and the quality of
the generated molecules using criteria related to both geometry and chemical
properties. The empirical studies showed that our model outperforms previous
approaches, achieving a three-fold improvement in success rate and generating
molecules with significantly better quality.
Analyzing Bias in Diffusion-based Face Generation Models
May 10, 2023
Malsha V. Perera, Vishal M. Patel
Diffusion models are becoming increasingly popular in synthetic data
generation and image editing applications. However, these models can amplify
existing biases and propagate them to downstream applications. Therefore, it is
crucial to understand the sources of bias in their outputs. In this paper, we
investigate the presence of bias in diffusion-based face generation models with
respect to attributes such as gender, race, and age. Moreover, we examine how
dataset size affects the attribute composition and perceptual quality of both
diffusion and Generative Adversarial Network (GAN) based face generation models
across various attribute classes. Our findings suggest that diffusion models
tend to worsen distribution bias in the training data for various attributes,
which is heavily influenced by the size of the dataset. Conversely, GAN models
trained on balanced datasets with a larger number of samples show less bias
across different attributes.
Relightify: Relightable 3D Faces from a Single Image via Diffusion Models
May 10, 2023
Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Stefanos Zafeiriou
Following the remarkable success of diffusion models on image generation,
recent works have also demonstrated their impressive ability to address a
number of inverse problems in an unsupervised way, by properly constraining the
sampling process based on a conditioning input. Motivated by this, in this
paper, we present the first approach to use diffusion models as a prior for
highly accurate 3D facial BRDF reconstruction from a single image. We start by
leveraging a high-quality UV dataset of facial reflectance (diffuse and
specular albedo and normals), which we render under varying illumination
settings to simulate natural RGB textures and, then, train an unconditional
diffusion model on concatenated pairs of rendered textures and reflectance
components. At test time, we fit a 3D morphable model to the given image and
unwrap the face in a partial UV texture. By sampling from the diffusion model,
while retaining the observed texture part intact, the model inpaints not only
the self-occluded areas but also the unknown reflectance components, in a
single sequence of denoising steps. In contrast to existing methods, we
directly acquire the observed texture from the input image, thus, resulting in
more faithful and consistent reflectance estimation. Through a series of
qualitative and quantitative comparisons, we demonstrate superior performance
in both texture completion as well as reflectance reconstruction tasks.
Diffusion-based Signal Refiner for Speech Enhancement
May 10, 2023
Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji
We have developed a diffusion-based speech refiner that improves the
reference-free perceptual quality of the audio predicted by preceding
single-channel speech separation models. Although modern deep neural
network-based speech separation models have shown high performance in
reference-based metrics, they often produce perceptually unnatural artifacts.
The recent advancements made to diffusion models motivated us to tackle this
problem by restoring the degraded parts of initial separations with a
generative approach. Utilizing the denoising diffusion restoration model (DDRM)
as a basis, we propose a shared DDRM-based refiner that generates samples
conditioned on the global information of preceding outputs from arbitrary
speech separation models. We experimentally show that our refiner can provide a
clearer harmonic structure of speech and improves the reference-free metric of
perceptual quality for arbitrary preceding model architectures. Furthermore, we
tune the variance of the measurement noise based on preceding outputs, which
results in higher scores in both reference-free and reference-based metrics.
The separation quality can also be further improved by blending the
discriminative and generative outputs.
Controllable Light Diffusion for Portraits
May 08, 2023
David Futschik, Kelvin Ritland, James Vecore, Sean Fanello, Sergio Orts-Escolano, Brian Curless, Daniel Sýkora, Rohit Pandey
We introduce light diffusion, a novel method to improve lighting in
portraits, softening harsh shadows and specular highlights while preserving
overall scene illumination. Inspired by professional photographers’ diffusers
and scrims, our method softens lighting given only a single portrait photo.
Previous portrait relighting approaches focus on changing the entire lighting
environment, removing shadows (ignoring strong specular highlights), or
removing shading entirely. In contrast, we propose a learning based method that
allows us to control the amount of light diffusion and apply it on in-the-wild
portraits. Additionally, we design a method to synthetically generate plausible
external shadows with sub-surface scattering effects while conforming to the
shape of the subject’s face. Finally, we show how our approach can increase the
robustness of higher level vision applications, such as albedo estimation,
geometry estimation and semantic segmentation.
ReGeneration Learning of Diffusion Models with Rich Prompts for Zero-Shot Image Translation
May 08, 2023
Yupei Lin, Sen Zhang, Xiaojun Yang, Xiao Wang, Yukai Shi
Large-scale text-to-image models have demonstrated amazing ability to
synthesize diverse and high-fidelity images. However, these models often suffer
from several limitations. Firstly, they require the user to provide
precise and contextually relevant descriptions for the desired image
modifications. Secondly, current models can impose significant changes to the
original image content during the editing process. In this paper, we explore
ReGeneration learning in an image-to-image Diffusion model (ReDiffuser), that
preserves the content of the original image without human prompting and the
requisite editing direction is automatically discovered within the text
embedding space. To ensure consistent preservation of the shape during image
editing, we propose cross-attention guidance based on regeneration learning.
This novel approach allows for enhanced expression of the target domain
features while preserving the original shape of the image. In addition, we
introduce a cooperative update strategy, which allows for efficient
preservation of the original shape of an image, thereby improving the quality
and consistency of shape preservation throughout the editing process. Our
proposed method leverages an existing pre-trained text-image diffusion model
without any additional training. Extensive experiments show that the proposed
method outperforms existing work in both real and synthetic image editing.
Real-World Denoising via Diffusion Model
May 08, 2023
Cheng Yang, Lijing Liang, Zhixun Su
Real-world image denoising is an extremely important image processing
problem, which aims to recover clean images from noisy images captured in
natural environments. In recent years, diffusion models have achieved very
promising results in the field of image generation, outperforming previous
generation models. However, it has not been widely used in the field of image
denoising because it is difficult to control the appropriate position of the
added noise. Inspired by diffusion models, this paper proposes a novel general
denoising diffusion model that can be used for real-world image denoising. We
introduce a diffusion process with linear interpolation, and the intermediate
noisy image is interpolated from the original clean image and the corresponding
real-world noisy image, so that this diffusion model can handle the level of
added noise. In particular, we also introduce two sampling algorithms for this
diffusion model. The first one is a simple sampling procedure defined according
to the diffusion process, and the second addresses the shortcomings of the first
and makes a number of improvements. Our experimental results show that our
proposed method with a simple CNN U-Net achieves results comparable to
the Transformer architecture. Both quantitative and qualitative evaluations on
real-world denoising benchmarks show that the proposed general diffusion model
performs almost as well as the state-of-the-art methods.
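A minimal sketch of the linear-interpolation forward process described above, assuming a
plain linear ramp between the clean image and its real noisy counterpart; the paper's
exact schedule and its two sampling algorithms are not reproduced here.

```python
import torch

def interpolate_forward(clean, noisy, t, num_steps):
    """Forward step of a linear-interpolation diffusion between a clean image and
    its real-world noisy counterpart.

    At step t the intermediate image is a convex combination of the pair, so the
    "added noise" is the real sensor noise rather than synthetic Gaussian noise.
    The plain linear ramp is an illustrative choice.
    """
    lam = t / num_steps                      # 0 -> clean image, 1 -> real noisy image
    return (1.0 - lam) * clean + lam * noisy

clean = torch.rand(3, 64, 64)
noisy = clean + 0.1 * torch.randn_like(clean)   # stand-in for a captured noisy photo
x_mid = interpolate_forward(clean, noisy, t=500, num_steps=1000)
```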
A Variational Perspective on Solving Inverse Problems with Diffusion Models
May 07, 2023
Morteza Mardani, Jiaming Song, Jan Kautz, Arash Vahdat
cs.LG, cs.CV, cs.NA, math.NA, stat.ML
Diffusion models have emerged as a key pillar of foundation models in visual
domains. One of their critical applications is to universally solve different
downstream inverse tasks via a single diffusion prior without re-training for
each task. Most inverse tasks can be formulated as inferring a posterior
distribution over data (e.g., a full image) given a measurement (e.g., a masked
image). This is however challenging in diffusion models since the nonlinear and
iterative nature of the diffusion process renders the posterior intractable. To
cope with this challenge, we propose a variational approach that by design
seeks to approximate the true posterior distribution. We show that our approach
naturally leads to regularization by denoising diffusion process (RED-Diff)
where denoisers at different timesteps concurrently impose different structural
constraints over the image. To gauge the contribution of denoisers from
different timesteps, we propose a weighting mechanism based on
signal-to-noise-ratio (SNR). Our approach provides a new variational
perspective for solving inverse problems with diffusion models, allowing us to
formulate sampling as stochastic optimization, where one can simply apply
off-the-shelf solvers with lightweight iterates. Our experiments for image
restoration tasks such as inpainting and superresolution demonstrate the
strengths of our method compared with state-of-the-art sampling-based diffusion
models.
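A schematic of sampling as stochastic optimization with a diffusion prior, in the spirit
of RED-Diff; the SNR-based weight, the schedule, and the optimizer settings are
assumptions rather than the paper's exact recipe.

```python
import torch

def red_diff_restore(y, x_init, forward_op, eps_model, alphas_bar,
                     num_iters=200, lr=0.05, data_weight=1.0):
    """Schematic inverse-problem solver treating sampling as stochastic optimization.

    At each iteration a random timestep is drawn, noise is injected into the
    current estimate, and the detached residual between predicted and injected
    noise acts as an SNR-weighted regularization gradient, combined with a
    measurement-fidelity term. Illustrative weighting and settings only.
    """
    mu = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([mu], lr=lr)
    T = len(alphas_bar)
    for _ in range(num_iters):
        t = torch.randint(0, T, (1,)).item()
        a = alphas_bar[t]
        eps = torch.randn_like(mu)
        x_t = a.sqrt() * mu + (1 - a).sqrt() * eps
        with torch.no_grad():
            eps_pred = eps_model(x_t, t)
        snr_weight = ((1 - a) / a).sqrt()                      # assumed weighting
        loss_data = data_weight * (forward_op(mu) - y).pow(2).sum()
        loss_reg = snr_weight * ((eps_pred - eps).detach() * mu).sum()
        opt.zero_grad()
        (loss_data + loss_reg).backward()
        opt.step()
    return mu.detach()

# Toy usage with stand-in components (identity measurement, zero noise predictor).
alphas_bar = torch.linspace(0.999, 1e-4, 1000)
x_hat = red_diff_restore(y=torch.randn(1, 3, 8, 8), x_init=torch.zeros(1, 3, 8, 8),
                         forward_op=lambda x: x,
                         eps_model=lambda x, t: torch.zeros_like(x),
                         alphas_bar=alphas_bar, num_iters=10)
```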
A Latent Diffusion Model for Protein Structure Generation
May 06, 2023
Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, Shuiwang Ji
Proteins are complex biomolecules that perform a variety of crucial functions
within living organisms. Designing and generating novel proteins can pave the
way for many future synthetic biology applications, including drug discovery.
However, it remains a challenging computational task due to the large modeling
space of protein structures. In this study, we propose a latent diffusion model
that can reduce the complexity of protein modeling while flexibly capturing the
distribution of natural protein structures in a condensed latent space.
Specifically, we propose an equivariant protein autoencoder that embeds
proteins into a latent space and then uses an equivariant diffusion model to
learn the distribution of the latent protein representations. Experimental
results demonstrate that our method can effectively generate novel protein
backbone structures with high designability and efficiency.
Efficient and Degree-Guided Graph Generation via Discrete Diffusion Modeling
May 06, 2023
Xiaohui Chen, Jiaxing He, Xu Han, Li-Ping Liu
Diffusion-based generative graph models have been proven effective in
generating high-quality small graphs. However, they still struggle to scale to
large graphs containing thousands of nodes while matching the desired graph
statistics. In this work, we propose EDGE, a new diffusion-based generative
graph model that addresses generative tasks with large graphs. To improve
computation efficiency, we encourage graph sparsity by using a discrete
diffusion process that randomly removes edges at each time step and finally
obtains an empty graph. EDGE only focuses on a portion of nodes in the graph at
each denoising step. It makes much fewer edge predictions than previous
diffusion-based models. Moreover, EDGE admits explicitly modeling the node
degrees of the graphs, further improving the model performance. The empirical
study shows that EDGE is much more efficient than competing methods and can
generate large graphs with thousands of nodes. It also outperforms baseline
models in generation quality: graphs generated by our approach have more
similar graph statistics to those of the training graphs.
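A toy version of the discrete forward process, removing each remaining edge independently
at every step; the constant removal probability and dense adjacency representation are
simplifications, and EDGE's degree guidance is not modeled.

```python
import torch

def edge_removal_forward(adj, remove_prob, num_steps, generator=None):
    """Toy discrete forward diffusion on a graph: each remaining edge is removed
    independently with probability `remove_prob` at every step, so the chain
    drifts toward the empty graph.

    `adj` is a dense symmetric 0/1 adjacency matrix; a real schedule would vary
    the removal probability over time.
    """
    adj = adj.clone()
    trajectory = [adj.clone()]
    for _ in range(num_steps):
        keep = (torch.rand(adj.shape, generator=generator) > remove_prob).float()
        keep = torch.triu(keep, diagonal=1)
        keep = keep + keep.T                  # symmetric mask, zero diagonal
        adj = adj * keep
        trajectory.append(adj.clone())
    return trajectory

adj0 = (torch.rand(8, 8) > 0.5).float()
adj0 = torch.triu(adj0, diagonal=1)
adj0 = adj0 + adj0.T
traj = edge_removal_forward(adj0, remove_prob=0.2, num_steps=10)
print([int(a.sum().item()) // 2 for a in traj])   # edge counts shrink toward 0
```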
Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
May 06, 2023
Kaiwen Zheng, Cheng Lu, Jianfei Chen, Jun Zhu
Diffusion models have exhibited excellent performance in various domains. The
probability flow ordinary differential equation (ODE) of diffusion models
(i.e., diffusion ODEs) is a particular case of continuous normalizing flows
(CNFs), which enables deterministic inference and exact likelihood evaluation.
However, the likelihood estimation results by diffusion ODEs are still far from
those of the state-of-the-art likelihood-based generative models. In this work,
we propose several improved techniques for maximum likelihood estimation for
diffusion ODEs, including both training and evaluation perspectives. For
training, we propose velocity parameterization and explore variance reduction
techniques for faster convergence. We also derive an error-bounded high-order
flow matching objective for finetuning, which improves the ODE likelihood and
smooths its trajectory. For evaluation, we propose a novel training-free
truncated-normal dequantization to fill the training-evaluation gap commonly
existing in diffusion ODEs. Building upon these techniques, we achieve
state-of-the-art likelihood estimation results on image datasets (2.56 on
CIFAR-10, 3.43/3.69 on ImageNet-32) without variational dequantization or data
augmentation.
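For reference, the velocity parameterization as commonly defined in the diffusion
literature is summarized below; the paper's precise variant may differ.

```latex
% With x_t = \alpha_t x_0 + \sigma_t \epsilon and \alpha_t^2 + \sigma_t^2 = 1:
\begin{align*}
  v_t &:= \alpha_t\,\epsilon - \sigma_t\,x_0, \\
  x_0 &= \alpha_t\,x_t - \sigma_t\,v_t, \qquad \epsilon = \sigma_t\,x_t + \alpha_t\,v_t,
\end{align*}
% so a network predicting v_t determines both the data and the noise estimates.
```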
DocDiff: Document Enhancement via Residual Diffusion Models
May 06, 2023
Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, Xing Zhang
Removing degradation from document images not only improves their visual
quality and readability, but also enhances the performance of numerous
automated document analysis and recognition tasks. However, existing
regression-based methods optimized for pixel-level distortion reduction tend to
suffer from significant loss of high-frequency information, leading to
distorted and blurred text edges. To compensate for this major deficiency, we
propose DocDiff, the first diffusion-based framework specifically designed for
diverse challenging document enhancement problems, including document
deblurring, denoising, and removal of watermarks and seals. DocDiff consists of
two modules: the Coarse Predictor (CP), which is responsible for recovering the
primary low-frequency content, and the High-Frequency Residual Refinement (HRR)
module, which adopts the diffusion models to predict the residual
(high-frequency information, including text edges), between the ground-truth
and the CP-predicted image. DocDiff is a compact and computationally efficient
model that benefits from a well-designed network architecture, an optimized
training loss objective, and a deterministic sampling process with short time
steps. Extensive experiments demonstrate that DocDiff achieves state-of-the-art
(SOTA) performance on multiple benchmark datasets, and can significantly
enhance the readability and recognizability of degraded document images.
Furthermore, our proposed HRR module in pre-trained DocDiff is plug-and-play
and ready-to-use, with only 4.17M parameters. It greatly sharpens the text
edges generated by SOTA deblurring methods without additional joint training.
Code is available at: https://github.com/Royalvice/DocDiff
Conditional Diffusion Feature Refinement for Continuous Sign Language Recognition
May 05, 2023
Leming Guo, Wanli Xue, Qing Guo, Yuxi Zhou, Tiantian Yuan, Shengyong Chen
In this work, we are dedicated to leveraging the denoising diffusion models’
success and formulating feature refinement as the autoencoder-formed diffusion
process, which is a mask-and-predict scheme. The state-of-the-art CSLR
framework consists of a spatial module, a visual module, a sequence module, and
a sequence learning function. However, this framework has faced sequence module
overfitting caused by the objective function and small-scale available
benchmarks, resulting in insufficient model training. To overcome the
overfitting problem, some CSLR studies enforce the sequence module to learn
more visual temporal information or be guided by more informative supervision
to refine its representations. In this work, we propose a novel
autoencoder-formed conditional diffusion feature refinement (ACDR) to refine
the sequence representations to equip them with desired properties by learning the
encoding-decoding optimization process in an end-to-end way. Specifically, for
the ACDR, a noising Encoder is proposed to progressively add noise equipped
with semantic conditions to the sequence representations. And a denoising
Decoder is proposed to progressively denoise the noisy sequence representations
with semantic conditions. Therefore, the sequence representations can be imbued
with the semantics of provided semantic conditions. Further, a semantic
constraint is employed to prevent the denoised sequence representations from
semantic corruption. Extensive experiments are conducted to validate the
effectiveness of our ACDR, benefiting state-of-the-art methods and achieving a
notable gain on three benchmarks.
Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
May 04, 2023
Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Duen Horng Chau
cs.CL, cs.AI, cs.HC, cs.LG
Diffusion-based generative models’ impressive ability to create convincing
images has captured global attention. However, their complex internal
structures and operations often make them difficult for non-experts to
understand. We present Diffusion Explainer, the first interactive visualization
tool that explains how Stable Diffusion transforms text prompts into images.
Diffusion Explainer tightly integrates a visual overview of Stable Diffusion’s
complex components with detailed explanations of their underlying operations,
enabling users to fluidly transition between multiple levels of abstraction
through animations and interactive elements. By comparing the evolutions of
image representations guided by two related text prompts over refinement
timesteps, users can discover the impact of prompts on image generation.
Diffusion Explainer runs locally in users’ web browsers without the need for
installation or specialized hardware, broadening the public’s education access
to modern AI techniques. Our open-sourced tool is available at:
https://poloclub.github.io/diffusion-explainer/. A video demo is available at
https://youtu.be/Zg4gxdIWDds.
Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model
May 04, 2023
Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang, Ying Tai, Yong Liu
Multimodal-driven talking face generation refers to animating a portrait with
the given pose, expression, and gaze transferred from the driving image and
video, or estimated from the text and audio. However, existing methods ignore
the potential of the text modality, and their generators mainly follow the
source-oriented feature rearrange paradigm coupled with unstable GAN
frameworks. In this work, we first represent the emotion in the text prompt,
which could inherit rich semantics from the CLIP, allowing flexible and
generalized emotion control. We further reorganize these tasks as the
target-oriented texture transfer and adopt the Diffusion Models. More
specifically, given a textured face as the source and the rendered face
projected from the desired 3DMM coefficients as the target, our proposed
Texture-Geometry-aware Diffusion Model decomposes the complex transfer problem
into a multi-conditional denoising process, where a Texture Attention-based
module accurately models the correspondences between appearance and geometry
cues contained in source and target conditions, and incorporates extra implicit
information for high-fidelity talking face generation. Additionally, TGDM can
be gracefully tailored for face swapping. We derive a novel paradigm free of
unstable seesaw-style optimization, resulting in simple, stable, and effective
training and inference schemes. Extensive experiments demonstrate the
superiority of our method.
May 04, 2023
Shang Chai, Liansheng Zhuang, Fengying Yan
Automatic layout generation that can synthesize high-quality layouts is an
important tool for graphic design in many applications. Though existing methods
based on generative models such as Generative Adversarial Networks (GANs) and
Variational Auto-Encoders (VAEs) have progressed, they still leave much room
for improving the quality and diversity of the results. Inspired by the recent
success of diffusion models in generating high-quality images, this paper
explores their potential for conditional layout generation and proposes
Transformer-based Layout Diffusion Model (LayoutDM) by instantiating the
conditional denoising diffusion probabilistic model (DDPM) with a purely
transformer-based architecture. Instead of using convolutional neural networks,
a transformer-based conditional Layout Denoiser is proposed to learn the
reverse diffusion process to generate samples from noised layout data.
Benefitting from both transformer and DDPM, our LayoutDM has desirable
properties such as high-quality generation, strong sample diversity, faithful
distribution coverage, and stationary training in comparison to GANs and VAEs.
Quantitative and qualitative experimental results show that our method
outperforms state-of-the-art generative models in terms of quality and
diversity.
Solving Inverse Problems with Score-Based Generative Priors learned from Noisy Data
May 02, 2023
Asad Aali, Marius Arvinte, Sidharth Kumar, Jonathan I. Tamir
We present SURE-Score: an approach for learning score-based generative models
using training samples corrupted by additive Gaussian noise. When a large
training set of clean samples is available, solving inverse problems via
score-based (diffusion) generative models trained on the underlying
fully-sampled data distribution has recently been shown to outperform
end-to-end supervised deep learning. In practice, such a large collection of
training data may be prohibitively expensive to acquire in the first place. In
this work, we present an approach for approximately learning a score-based
generative model of the clean distribution, from noisy training data. We
formulate and justify a novel loss function that leverages Stein’s unbiased
risk estimate to jointly denoise the data and learn the score function via
denoising score matching, while using only the noisy samples. We demonstrate
the generality of SURE-Score by learning priors and applying posterior sampling
to ill-posed inverse problems in two practical applications from different
domains: compressive wireless multiple-input multiple-output channel estimation
and accelerated 2D multi-coil magnetic resonance imaging reconstruction, where
we demonstrate competitive reconstruction performance when learning at
signal-to-noise ratio values of 0 and 10 dB, respectively.
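For background, the sketch below computes a single-probe Monte Carlo estimate of Stein's
unbiased risk estimate for a Gaussian-noise denoiser; how SURE-Score couples this with
denoising score matching is not reproduced here.

```python
import torch

def sure_loss(denoiser, y, sigma, eps=1e-3):
    """Monte Carlo estimate of SURE for a denoiser applied to y = x + sigma * n.

    SURE(f, y) = ||f(y) - y||^2 - N*sigma^2 + 2*sigma^2 * div f(y),
    with the divergence estimated by a single Hutchinson probe. Generic SURE
    building block only.
    """
    y = y.detach()
    out = denoiser(y)
    n = y.numel()
    b = torch.randn_like(y)
    div = (b * (denoiser(y + eps * b) - out)).sum() / eps   # Hutchinson estimate
    return (out - y).pow(2).sum() - n * sigma ** 2 + 2 * sigma ** 2 * div

# Toy check with a linear shrinkage "denoiser".
y = torch.randn(1, 1, 16, 16)
print(sure_loss(lambda z: 0.9 * z, y, sigma=0.1).item())
```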
In-Context Learning Unlocked for Diffusion Models
May 01, 2023
Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, Mingyuan Zhou
We present Prompt Diffusion, a framework for enabling in-context learning in
diffusion-based generative models. Given a pair of task-specific example
images, such as depth from/to image and scribble from/to image, and a text
guidance, our model automatically understands the underlying task and performs
the same task on a new query image following the text guidance. To achieve
this, we propose a vision-language prompt that can model a wide range of
vision-language tasks and a diffusion model that takes it as input. The
diffusion model is trained jointly over six different tasks using these
prompts. The resulting Prompt Diffusion model is the first diffusion-based
vision-language foundation model capable of in-context learning. It
demonstrates high-quality in-context generation on the trained tasks and
generalizes effectively to new, unseen vision tasks with their respective
prompts. Our model also shows compelling text-guided image editing results. Our
framework, with code publicly available at
https://github.com/Zhendong-Wang/Prompt-Diffusion, aims to facilitate research
into in-context learning for computer vision.
Class-Balancing Diffusion Models
April 30, 2023
Yiming Qin, Huangjie Zheng, Jiangchao Yao, Mingyuan Zhou, Ya Zhang
Diffusion-based models have shown the merits of generating high-quality
visual data while preserving better diversity in recent studies. However, such
observation is only justified with curated data distribution, where the data
samples are nicely pre-processed to be uniformly distributed in terms of their
labels. In practice, a long-tailed data distribution appears more common and
how diffusion models perform on such class-imbalanced data remains unknown. In
this work, we first investigate this problem and observe significant
degradation in both diversity and fidelity when the diffusion model is trained
on datasets with class-imbalanced distributions. Especially in tail classes,
the generations largely lose diversity and we observe severe mode-collapse
issues. To tackle this problem, we start from the hypothesis that the data
distribution is not class-balanced, and propose Class-Balancing Diffusion
Models (CBDM) that are trained with a distribution adjustment regularizer as a
solution. Experiments show that images generated by CBDM exhibit higher
diversity and quality in both quantitative and qualitative ways. Our method
is benchmarked on the CIFAR100/CIFAR100LT dataset and shows
outstanding performance on the downstream recognition task.
Cycle-guided Denoising Diffusion Probability Model for 3D Cross-modality MRI Synthesis
April 28, 2023
Shaoyan Pan, Chih-Wei Chang, Junbo Peng, Jiahan Zhang, Richard L. J. Qiu, Tonghe Wang, Justin Roper, Tian Liu, Hui Mao, Xiaofeng Yang
This study aims to develop a novel Cycle-guided Denoising Diffusion
Probability Model (CG-DDPM) for cross-modality MRI synthesis. The CG-DDPM
deploys two DDPMs that condition each other to generate synthetic images from
two different MRI pulse sequences. The two DDPMs exchange random latent noise
in the reverse processes, which helps to regularize both DDPMs and generate
matching images in two modalities. This improves image-to-image translation
accuracy. We evaluated the CG-DDPM quantitatively using mean absolute error
(MAE), multi-scale structural similarity index measure (MSSIM), and peak
signal-to-noise ratio (PSNR), as well as the network synthesis consistency, on
the BraTS2020 dataset. Our proposed method showed high accuracy and reliable
consistency for MRI synthesis. In addition, we compared the CG-DDPM with
several other state-of-the-art networks and demonstrated statistically
significant improvements in the image quality of synthetic MRIs. The proposed
method enhances the capability of current multimodal MRI synthesis approaches,
which could contribute to more accurate diagnosis and better treatment planning
for patients by synthesizing additional MRI modalities.
Motion-Conditioned Diffusion Model for Controllable Video Synthesis
April 27, 2023
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, Ming-Hsuan Yang
Recent advancements in diffusion models have greatly improved the quality and
diversity of synthesized content. To harness the expressive power of diffusion
models, researchers have explored various controllable mechanisms that allow
users to intuitively guide the content synthesis process. Although the latest
efforts have primarily focused on video synthesis, there has been a lack of
effective methods for controlling and describing desired content and motion. In
response to this gap, we introduce MCDiff, a conditional diffusion model that
generates a video from a starting image frame and a set of strokes, which allow
users to specify the intended content and dynamics for synthesis. To tackle the
ambiguity of sparse motion inputs and achieve better synthesis quality, MCDiff
first utilizes a flow completion model to predict the dense video motion based
on the semantic understanding of the video frame and the sparse motion control.
Then, the diffusion model synthesizes high-quality future frames to form the
output video. We qualitatively and quantitatively show that MCDiff achieves the
state-of-the-art visual quality in stroke-guided controllable video synthesis.
Additional experiments on MPII Human Pose further exhibit the capability of our
model on diverse content and motion synthesis.
DiffuseExpand: Expanding dataset for 2D medical image segmentation using diffusion models
April 26, 2023
Shitong Shao, Xiaohan Yuan, Zhen Huang, Ziming Qiu, Shuai Wang, Kevin Zhou
Dataset expansion can effectively alleviate the problem of data scarcity for
medical image segmentation, due to privacy concerns and labeling difficulties.
However, existing expansion algorithms still face great challenges due to their
inability to guarantee the diversity of synthesized images with paired
segmentation masks. In recent years, Diffusion Probabilistic Models (DPMs) have
shown powerful image synthesis performance, even better than Generative
Adversarial Networks. Based on this insight, we propose an approach called
DiffuseExpand for expanding datasets for 2D medical image segmentation using
DPM, which first samples a variety of masks from Gaussian noise to ensure the
diversity, and then synthesizes images to ensure the alignment of images and
masks. After that, DiffuseExpand chooses high-quality samples to further
enhance the effectiveness of data expansion. Our comparison and ablation
experiments on COVID-19 and CGMH Pelvis datasets demonstrate the effectiveness
of DiffuseExpand. Our code is released at
https://github.com/shaoshitong/DiffuseExpand.
Score-based Generative Modeling Through Backward Stochastic Differential Equations: Inversion and Generation
April 26, 2023
Zihao Wang
The proposed BSDE-based diffusion model represents a novel approach to
diffusion modeling, which extends the application of stochastic differential
equations (SDEs) in machine learning. Unlike traditional SDE-based diffusion
models, our model can determine the initial conditions necessary to reach a
desired terminal distribution by adapting an existing score function. We
demonstrate the theoretical guarantees of the model, the benefits of using
Lipschitz networks for score matching, and its potential applications in
various areas such as diffusion inversion, conditional diffusion, and
uncertainty quantification. Our work represents a contribution to the field of
score-based generative learning and offers a promising direction for solving
real-world problems.
Single-View Height Estimation with Conditional Diffusion Probabilistic Models
April 26, 2023
Isaac Corley, Peyman Najafirad
Digital Surface Models (DSM) offer a wealth of height information for
understanding the Earth’s surface as well as monitoring the existence or change
in natural and man-made structures. Classical height estimation requires
multi-view geospatial imagery or LiDAR point clouds which can be expensive to
acquire. Single-view height estimation using neural network based models shows
promise; however, it can struggle to reconstruct high resolution features.
The latest advancements in diffusion models for high resolution image synthesis
and editing have yet to be utilized for remote sensing imagery, particularly
height estimation. Our approach involves training a generative diffusion model
to learn the joint distribution of optical and DSM images across both domains
as a Markov chain. This is accomplished by minimizing a denoising score
matching objective while being conditioned on the source image to generate
realistic high resolution 3D surfaces. In this paper we experiment with
conditional denoising diffusion probabilistic models (DDPM) for height
estimation from a single remotely sensed image and show promising results on
the Vaihingen benchmark dataset.
Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification
April 25, 2023
Jussi Leinonen, Ulrich Hamann, Daniele Nerini, Urs Germann, Gabriele Franch
physics.ao-ph, cs.LG, eess.IV, I.2.10; J.2
Diffusion models have been widely adopted in image generation, producing
higher-quality and more diverse samples than generative adversarial networks
(GANs). We introduce a latent diffusion model (LDM) for precipitation
nowcasting - short-term forecasting based on the latest observational data. The
LDM is more stable and requires less computation to train than GANs, albeit
with more computationally expensive generation. We benchmark it against the
GAN-based Deep Generative Models of Rainfall (DGMR) and a statistical model,
PySTEPS. The LDM produces more accurate precipitation predictions, while the
comparisons are more mixed when predicting whether the precipitation exceeds
predefined thresholds. The clearest advantage of the LDM is that it generates
more diverse predictions than DGMR or PySTEPS. Rank distribution tests indicate
that the distribution of samples from the LDM accurately reflects the
uncertainty of the predictions. Thus, LDMs are promising for any applications
where uncertainty quantification is important, such as weather and climate.
April 25, 2023
Zezhou Zhang, Chuanchuan Yang, Yifeng Qin, Hao Feng, Jiqiang Feng, Hongbin Li
Conventional meta-atom designs rely heavily on researchers’ prior knowledge
and trial-and-error searches using full-wave simulations, resulting in
time-consuming and inefficient processes. Inverse design methods based on
optimization algorithms, such as evolutionary algorithms, and topological
optimizations, have been introduced to design metamaterials. However, none of
these algorithms are general enough to fulfill multi-objective tasks. Recently,
deep learning methods represented by Generative Adversarial Networks (GANs)
have been applied to inverse design of metamaterials, which can directly
generate high-degree-of-freedom meta-atoms based on S-parameter requirements.
However, the adversarial training process of GANs makes the network unstable
and results in high modeling costs. This paper proposes a novel metamaterial
inverse design method based on the diffusion probability theory. By learning
the Markov process that transforms the original structure into a Gaussian
distribution, the proposed method can gradually remove the noise starting from
the Gaussian distribution and generate new high-degree-of-freedom meta-atoms
that meet S-parameter conditions, which avoids the model instability introduced
by the adversarial training process of GANs and ensures more accurate and
high-quality generation results. Experiments have proven that our method is
superior to representative methods of GANs in terms of model convergence speed,
generation accuracy, and quality.
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
April 25, 2023
Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou
Diffusion models are powerful, but they require a lot of time and data to
train. We propose Patch Diffusion, a generic patch-wise training framework, to
significantly reduce the training time costs while improving data efficiency,
which thus helps democratize diffusion model training to broader users. At the
core of our innovations is a new conditional score function at the patch level,
where the patch location in the original image is included as additional
coordinate channels, while the patch size is randomized and diversified
throughout training to encode the cross-region dependency at multiple scales.
Sampling with our method is as easy as in the original diffusion model. Through
Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while
maintaining comparable or better generation quality. Patch Diffusion meanwhile
improves the performance of diffusion models trained on relatively small
datasets, e.g., as few as 5,000 images to train from scratch. We achieve
state-of-the-art FID scores 1.77 on CelebA-64$\times$64 and 1.93 on
AFHQv2-Wild-64$\times$64. We will share our code and pre-trained models soon.
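A minimal sketch of the patch-wise conditioning idea above (an illustration under our assumptions, not the Patch Diffusion code): a randomly sized patch is cropped and its normalized pixel coordinates are appended as two extra channels, so the score network knows where in the image the patch lies.

```python
import random
import torch

def sample_patch_with_coords(images, patch_sizes=(16, 32, 64)):
    """Crop a random-size patch and append normalized (y, x) coordinate channels.

    Assumes every image is at least as large as the largest patch size.
    """
    b, c, h, w = images.shape
    p = random.choice(patch_sizes)                      # randomized patch size
    top, left = random.randint(0, h - p), random.randint(0, w - p)
    patch = images[:, :, top:top + p, left:left + p]
    ys = torch.linspace(top / (h - 1), (top + p - 1) / (h - 1), p, device=images.device)
    xs = torch.linspace(left / (w - 1), (left + p - 1) / (w - 1), p, device=images.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([yy, xx]).expand(b, -1, -1, -1)  # (b, 2, p, p) coordinate channels
    return torch.cat([patch, coords], dim=1)
```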
GlyphDiffusion: Text Generation as Image Generation
April 25, 2023
Junyi Li, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
Diffusion models have become a new generative paradigm for text generation.
Considering the discrete categorical nature of text, in this paper, we propose
GlyphDiffusion, a novel diffusion approach for text generation via text-guided
image generation. Our key idea is to render the target text as a glyph image
containing visual language content. In this way, conditional text generation
can be cast as a glyph image generation task, and it is then natural to apply
continuous diffusion models to discrete texts. Specifically, we utilize a cascaded
architecture (i.e., a base and a super-resolution diffusion model) to generate
high-fidelity glyph images, conditioned on the input text. Furthermore, we
design a text grounding module to transform and refine the visual language
content from generated glyph images into the final texts. In experiments over
four conditional text generation tasks and two classes of metrics (i.e., quality
and diversity), GlyphDiffusion can achieve comparable or even better results
than several baselines, including pretrained language models. Our model also
makes significant improvements compared to the recent diffusion model.
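A toy illustration of the central idea above (rendering target text as a glyph image so that generation can be treated as image synthesis); the font handling and canvas size are arbitrary assumptions, and this is not the paper's rendering pipeline.

```python
from PIL import Image, ImageDraw, ImageFont

def render_glyph_image(text, size=(256, 64), font_path=None):
    """Render target text onto a white canvas; the resulting glyph image is the
    diffusion model's generation target."""
    img = Image.new("L", size, color=255)
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 24) if font_path else ImageFont.load_default()
    draw.text((4, 4), text, fill=0, font=font)
    return img
```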
Variational Diffusion Auto-encoder: Deep Latent Variable Model with Unconditional Diffusion Prior
April 24, 2023
Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb
As a widely recognized approach to deep generative modeling, Variational
Auto-Encoders (VAEs) still face challenges with the quality of generated
images, often presenting noticeable blurriness. This issue stems from the
unrealistic assumption that approximates the conditional data distribution,
$p(\textbf{x} | \textbf{z})$, as an isotropic Gaussian. In this paper, we
propose a novel solution to address these issues. We illustrate how one can
extract a latent space from a pre-existing diffusion model by optimizing an
encoder to maximize the marginal data log-likelihood. Furthermore, we
demonstrate that a decoder can be analytically derived post encoder-training,
employing the Bayes rule for scores. This leads to a VAE-esque deep latent
variable model, which discards the need for Gaussian assumptions on
$p(\textbf{x} | \textbf{z})$ or the training of a separate decoder network. Our
method, which capitalizes on the strengths of pre-trained diffusion models and
equips them with latent spaces, results in a significant enhancement to the
performance of VAEs.
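One way to unpack the "Bayes rule for scores" mentioned above (our paraphrase, not the paper's exact derivation): at noise level $t$,
$$\nabla_{\textbf{x}_t} \log p_t(\textbf{x}_t \mid \textbf{z}) = \nabla_{\textbf{x}_t} \log p_t(\textbf{x}_t) + \nabla_{\textbf{x}_t} \log p_t(\textbf{z} \mid \textbf{x}_t),$$
where the first term is supplied by the pre-trained unconditional diffusion model and the second by the learned encoder, so a separate decoder network never has to be trained.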
Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation
April 24, 2023
Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu
Diffusion models have attained impressive visual quality for image synthesis.
However, how to interpret and manipulate the latent space of diffusion models
has not been extensively explored. Prior work diffusion autoencoders encode the
semantic representations into a semantic latent code, which fails to reflect
the rich information of details and the intrinsic feature hierarchy. To
mitigate those limitations, we propose Hierarchical Diffusion Autoencoders
(HDAE) that exploit the fine-grained-to-abstract and low-level-to-high-level
feature hierarchy for the latent space of diffusion models. The hierarchical
latent space of HDAE inherently encodes different abstract levels of semantics
and provides more comprehensive semantic representations. In addition, we
propose a truncated-feature-based approach for disentangled image manipulation.
We demonstrate the effectiveness of our proposed approach with extensive
experiments and applications on image reconstruction, style mixing,
controllable interpolation, detail-preserving and disentangled image
manipulation, and multi-modal semantic image synthesis.
Score-Based Diffusion Models as Principled Priors for Inverse Imaging
April 23, 2023
Berthy T. Feng, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L. Bouman, William T. Freeman
It is important in computational imaging to understand the uncertainty of
images reconstructed from imperfect measurements. We propose turning
score-based diffusion models into principled priors (“score-based priors”)
for analyzing a posterior of images given measurements. Previously,
probabilistic priors were limited to handcrafted regularizers and simple
distributions. In this work, we empirically validate the theoretically-proven
probability function of a score-based diffusion model. We show how to sample
from resulting posteriors by using this probability function for variational
inference. Our results, including experiments on denoising, deblurring, and
interferometric imaging, suggest that score-based priors enable principled
inference with a sophisticated, data-driven image prior.
DiffVoice: Text-to-Speech with Latent Diffusion
April 23, 2023
Zhijun Liu, Yiwei Guo, Kai Yu
eess.AS, cs.AI, cs.HC, cs.LG, cs.SD
In this work, we present DiffVoice, a novel text-to-speech model based on
latent diffusion. We propose to first encode speech signals into a phoneme-rate
latent representation with a variational autoencoder enhanced by adversarial
training, and then jointly model the duration and the latent representation
with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS
datasets demonstrate that our method beats the best publicly available systems
in naturalness. By adopting recent generative inverse problem solving
algorithms for diffusion models, DiffVoice achieves state-of-the-art
performance in text-based speech editing and zero-shot adaptation.
On Accelerating Diffusion-Based Sampling Process via Improved Integration Approximation
April 22, 2023
Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn
One popular diffusion-based sampling strategy attempts to solve the reverse
ordinary differential equations (ODEs) effectively. The coefficients of the
obtained ODE solvers are pre-determined by the ODE formulation, the reverse
discrete timesteps, and the employed ODE methods. In this paper, we consider
accelerating several popular ODE-based sampling processes by optimizing certain
coefficients via improved integration approximation (IIA). At each reverse
timestep, we propose to minimize a mean squared error (MSE) function with
respect to certain selected coefficients. The MSE is constructed by applying
the original ODE solver for a set of fine-grained timesteps which in principle
provides a more accurate integration approximation in predicting the next
diffusion hidden state. Given a pre-trained diffusion model, the procedure for
IIA for a particular number of neural function evaluations (NFEs) only needs
to be conducted once over a batch of samples. The obtained optimal solutions
for those selected coefficients via minimum MSE (MMSE) can be restored and
reused later on to accelerate the sampling process. Extensive experiments on
EDM and DDIM show that the IIA technique leads to significant performance gains when
the number of NFEs is small.
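A hedged sketch of the coefficient search described above; `coarse_step` and `fine_steps` are hypothetical stand-ins for a single solver update with free coefficients and a fine-grained reference integration, respectively, not functions from the paper's codebase.

```python
import torch

def fit_iia_coefficients(coarse_step, fine_steps, states, timesteps, n_coef=2, iters=200):
    """Fit a few solver coefficients by MMSE against a fine-grained reference step."""
    coef = torch.nn.Parameter(torch.zeros(n_coef))
    opt = torch.optim.Adam([coef], lr=1e-2)
    for _ in range(iters):
        loss = 0.0
        for x, (t, t_next) in zip(states, timesteps):
            target = fine_steps(x, t, t_next).detach()   # accurate integration (many sub-steps)
            pred = coarse_step(x, t, t_next, coef)       # one step with learnable coefficients
            loss = loss + torch.mean((pred - target) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return coef.detach()  # restored and reused later to accelerate sampling at this NFE budget
```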
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
April 22, 2023
Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn
We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the
correlation in the outputs of the deep neural networks (DNNs) over subsequent
timesteps in diffusion probabilistic models (DPMs) to refine the mean
estimation of the conditional Gaussian distributions in the backward process. A
typical DPM first obtains an estimate of the original data sample
$\boldsymbol{x}$ by feeding the most recent state $\boldsymbol{z}_i$ and index
$i$ into the DNN model and then computes the mean vector of the conditional
Gaussian distribution for $\boldsymbol{z}_{i-1}$. We propose to calculate a
more accurate estimate for $\boldsymbol{x}$ by performing extrapolation on the
two estimates of $\boldsymbol{x}$ that are obtained by feeding
$(\boldsymbol{z}_{i+1}, i+1)$ and $(\boldsymbol{z}_i, i)$ into the DNN model.
The extrapolation can be easily integrated into the backward process of
existing DPMs by introducing an additional connection over two consecutive
timesteps, and fine-tuning is not required. Extensive experiments showed that
plugging in the additional connection into DDPM, DDIM, DEIS, S-PNDM, and
high-order DPM-Solvers leads to a significant performance gain in terms of FID
score.
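A hedged reading of the extrapolation step (the particular linear form below is our assumption, not the paper's exact weighting): with $\hat{\boldsymbol{x}}(\boldsymbol{z}_i, i)$ denoting the DNN estimate of the clean sample at step $i$, a lookahead of the form
$$\hat{\boldsymbol{x}}_{\mathrm{LA}} = (1+\gamma)\,\hat{\boldsymbol{x}}(\boldsymbol{z}_i, i) - \gamma\,\hat{\boldsymbol{x}}(\boldsymbol{z}_{i+1}, i+1), \qquad \gamma > 0,$$
replaces $\hat{\boldsymbol{x}}(\boldsymbol{z}_i, i)$ when computing the mean of the Gaussian for $\boldsymbol{z}_{i-1}$; only the extra connection across two consecutive timesteps is required, with no fine-tuning.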
Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations
April 21, 2023
Yu-Hui Chen, Raman Sarokin, Juhyun Lee, Jiuqiang Tang, Chuo-Ling Chang, Andrei Kulik, Matthias Grundmann
The rapid development and application of foundation models have
revolutionized the field of artificial intelligence. Large diffusion models
have gained significant attention for their ability to generate photorealistic
images and support various tasks. On-device deployment of these models provides
benefits such as lower server costs, offline functionality, and improved user
privacy. However, common large diffusion models have over 1 billion parameters
and pose challenges due to restricted computational and memory resources on
devices. We present a series of implementation optimizations for large
diffusion models that achieve the fastest reported inference latency to date
(under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung
S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile
devices. These enhancements broaden the applicability of generative AI and
improve the overall user experience across a wide range of devices.
BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis
April 21, 2023
Angela Castillo, Maria Escobar, Guillaume Jeanneret, Albert Pumarola, Pablo Arbeláez, Ali Thabet, Artsiom Sanakoyeu
Mixed reality applications require tracking the user’s full-body motion to
enable an immersive experience. However, typical head-mounted devices can only
track head and hand movements, leading to a limited reconstruction of full-body
motion due to variability in lower body configurations. We propose BoDiffusion
– a generative diffusion model for motion synthesis to tackle this
under-constrained reconstruction problem. We present a time and space
conditioning scheme that allows BoDiffusion to leverage sparse tracking inputs
while generating smooth and realistic full-body motion sequences. To the best
of our knowledge, this is the first approach that uses the reverse diffusion
process to model full-body tracking as a conditional sequence generation task.
We conduct experiments on the large-scale motion-capture dataset AMASS and show
that our approach outperforms the state-of-the-art approaches by a significant
margin in terms of full-body motion realism and joint reconstruction error.
Improved Diffusion-based Image Colorization via Piggybacked Models
April 21, 2023
Hanyuan Liu, Jinbo Xing, Minshan Xie, Chengze Li, Tien-Tsin Wong
Image colorization has been attracting the research interests of the
community for decades. However, existing methods still struggle to provide
satisfactory colorized results given grayscale images due to a lack of
human-like global understanding of colors. Recently, large-scale Text-to-Image
(T2I) models have been exploited to transfer the semantic information from the
text prompts to the image domain, where text provides a global control for
semantic objects in the image. In this work, we introduce a colorization model
piggybacking on the existing powerful T2I diffusion model. Our key idea is to
exploit the color prior knowledge in the pre-trained T2I diffusion model for
realistic and diverse colorization. A diffusion guider is designed to
incorporate the pre-trained weights of the latent diffusion model to output a
latent color prior that conforms to the visual semantics of the grayscale
input. A lightness-aware VQVAE will then generate the colorized result with
pixel-perfect alignment to the given grayscale image. Our model can also
achieve conditional colorization with additional inputs (e.g. user hints and
texts). Extensive experiments show that our method achieves state-of-the-art
performance in terms of perceptual quality.
SILVR: Guided Diffusion for Molecule Generation
April 21, 2023
Nicholas T. Runcie, Antonia S. J. S. Mey
Computationally generating novel synthetically accessible compounds with high
affinity and low toxicity is a great challenge in drug design. Machine-learning
models beyond conventional pharmacophoric methods have shown promise in
generating novel small molecule compounds, but require significant tuning for a
specific protein target. Here, we introduce a method called selective iterative
latent variable refinement (SILVR) for conditioning an existing diffusion-based
equivariant generative model without retraining. The model allows the
generation of new molecules that fit into a binding site of a protein based on
fragment hits. We use the SARS-CoV-2 Main protease fragments from Diamond
X-Chem that form part of the COVID Moonshot project as a reference dataset for
conditioning the molecule generation. The SILVR rate controls the extent of
conditioning and we show that moderate SILVR rates make it possible to generate
new molecules of similar shape to the original fragments, meaning that the new
molecules fit the binding site without knowledge of the protein. We can also
merge up to 3 fragments into a new molecule without affecting the quality of
molecules generated by the underlying generative model. Our method is
generalizable to any protein target with known fragments and any
diffusion-based model for molecule generation.
Persistently Trained, Diffusion-assisted Energy-based Models
April 21, 2023
Xinwei Zhang, Zhiqiang Tan, Zhijian Ou
Maximum likelihood (ML) learning for energy-based models (EBMs) is
challenging, partly due to the non-convergence of Markov chain Monte Carlo. Several
variations of ML learning have been proposed, but existing methods all fail to
achieve both post-training image generation and proper density estimation. We
propose to introduce diffusion data and learn a joint EBM, called
diffusion-assisted EBMs, through persistent training (i.e., using persistent contrastive
divergence) with an enhanced sampling algorithm to properly sample from
complex, multimodal distributions. We present results from a 2D illustrative
experiment and image experiments and demonstrate that, for the first time for
image data, persistently trained EBMs can {\it simultaneously} achieve long-run
stability, post-training image generation, and superior out-of-distribution
detection.
Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models
April 21, 2023
Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, Marcus A. Brubaker
Novel view synthesis from a single input image is a challenging task, where
the goal is to generate a new view of a scene from a desired camera pose that
may be separated by a large motion. The highly uncertain nature of this
synthesis task due to unobserved elements within the scene (i.e., occlusion)
and outside the field-of-view makes the use of generative models appealing to
capture the variety of possible outputs. In this paper, we propose a novel
generative model which is capable of producing a sequence of photorealistic
images consistent with a specified camera trajectory, and a single starting
image. Our approach is centred on an autoregressive conditional diffusion-based
model capable of interpolating visible scene elements, and extrapolating
unobserved regions in a view, in a geometrically consistent manner.
Conditioning is limited to an image capturing a single camera view and the
(relative) pose of the new camera view. To measure the consistency over a
sequence of generated views, we introduce a new metric, the thresholded
symmetric epipolar distance (TSED), which counts the number of consistent frame
pairs in a sequence. While previous methods have been shown to produce high
quality images and consistent semantics across pairs of views, we show
empirically with our metric that they are often inconsistent with the desired
camera poses. In contrast, we demonstrate that our method produces both
photorealistic and view-consistent imagery.
Collaborative Diffusion for Multi-Modal Face Generation and Editing
April 20, 2023
Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, Ziwei Liu
Diffusion models have recently emerged as a powerful generative tool. Despite the
great progress, existing diffusion models mainly focus on uni-modal control,
i.e., the diffusion process is driven by only one modality of condition. To
further unleash the users’ creativity, it is desirable for the model to be
controllable by multiple modalities simultaneously, e.g., generating and
editing faces by describing the age (text-driven) while drawing the face shape
(mask-driven). In this work, we present Collaborative Diffusion, where
pre-trained uni-modal diffusion models collaborate to achieve multi-modal face
generation and editing without re-training. Our key insight is that diffusion
models driven by different modalities are inherently complementary regarding
the latent denoising steps, where bilateral connections can be established
upon. Specifically, we propose dynamic diffuser, a meta-network that adaptively
hallucinates multi-modal denoising steps by predicting the spatial-temporal
influence functions for each pre-trained uni-modal model. Collaborative
Diffusion not only collaborates generation capabilities from uni-modal
diffusion models, but also integrates multiple uni-modal manipulations to
perform multi-modal editing. Extensive qualitative and quantitative experiments
demonstrate the superiority of our framework in both image quality and
condition consistency.
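An illustrative sketch of the collaboration mechanism (our reading, not the released implementation): at every denoising step each uni-modal model predicts noise, and a dynamic diffuser predicts spatial influence maps used to blend those predictions; all callables and their signatures here are assumptions.

```python
import torch

def collaborative_denoise_step(x_t, t, unimodal_models, conditions, dynamic_diffuser):
    """Blend per-modality noise predictions with predicted spatial influence maps."""
    preds, weights = [], []
    for model, cond in zip(unimodal_models, conditions):
        preds.append(model(x_t, t, cond))               # per-modality noise estimate
        weights.append(dynamic_diffuser(x_t, t, cond))  # spatial-temporal influence map
    weights = torch.softmax(torch.stack(weights), dim=0)  # normalize across modalities
    return (torch.stack(preds) * weights).sum(dim=0)      # blended noise prediction
```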
A data augmentation perspective on diffusion models and retrieval
April 20, 2023
Max F. Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, Chris Russell
Diffusion models excel at generating photorealistic images from text-queries.
Naturally, many approaches have been proposed to use these generative abilities
to augment training datasets for downstream tasks, such as classification.
However, diffusion models are themselves trained on large, noisily supervised,
but nonetheless annotated, datasets. It is an open question whether the
generalization capabilities of diffusion models beyond using the additional
data of the pre-training process for augmentation lead to improved downstream
performance. We perform a systematic evaluation of existing methods to generate
images from diffusion models and study new extensions to assess their benefit
for data augmentation. While we find that personalizing diffusion models
towards the target data outperforms simpler prompting strategies, we also show
that using the training data of the diffusion model alone, via a simple nearest
neighbor retrieval procedure, leads to even stronger downstream performance.
Overall, our study probes the limitations of diffusion models for data
augmentation but also highlights their potential for generating new training data
to improve performance on simple downstream vision tasks.
NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models
April 19, 2023
Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler
Automatically generating high-quality real world 3D scenes is of enormous
interest for applications such as virtual reality and robotics simulation.
Towards this goal, we introduce NeuralField-LDM, a generative model capable of
synthesizing complex 3D environments. We leverage Latent Diffusion Models that
have been successfully utilized for efficient high-quality 2D content creation.
We first train a scene auto-encoder to express a set of image and pose pairs as
a neural field, represented as density and feature voxel grids that can be
projected to produce novel views of the scene. To further compress this
representation, we train a latent-autoencoder that maps the voxel grids to a
set of latent representations. A hierarchical diffusion model is then fit to
the latents to complete the scene generation pipeline. We achieve a substantial
improvement over existing state-of-the-art scene generation models.
Additionally, we show how NeuralField-LDM can be used for a variety of 3D
content creation applications, including conditional scene generation, scene
inpainting and scene style manipulation.
DiFaReli : Diffusion Face Relighting
April 19, 2023
Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn
We present a novel approach to single-view face relighting in the wild.
Handling non-diffuse effects, such as global illumination or cast shadows, has
long been a challenge in face relighting. Prior work often assumes Lambertian
surfaces, simplified lighting models or involves estimating 3D shape, albedo,
or a shadow map. This estimation, however, is error-prone and requires many
training examples with lighting ground truth to generalize well. Our work
bypasses the need for accurate estimation of intrinsic components and can be
trained solely on 2D images without any light stage data, multi-view images, or
lighting ground truth. Our key idea is to leverage a conditional diffusion
implicit model (DDIM) for decoding a disentangled light encoding along with
other encodings related to 3D shape and facial identity inferred from
off-the-shelf estimators. We also propose a novel conditioning technique that
eases the modeling of the complex interaction between light and geometry by
using a rendered shading reference to spatially modulate the DDIM. We achieve
state-of-the-art performance on standard benchmark Multi-PIE and can
photorealistically relight in-the-wild images. Please visit our page:
https://diffusion-face-relighting.github.io
Denoising Diffusion Medical Models
April 19, 2023
Pham Ngoc Huy, Tran Minh Quan
In this study, we introduce a generative model that can synthesize a large
number of radiographical image/label pairs, and thus is asymptotically
favorable to downstream activities such as segmentation in bio-medical image
analysis. Denoising Diffusion Medical Model (DDMM), the proposed technique, can
create realistic X-ray images and associated segmentations on a small number of
annotated datasets as well as other massive unlabeled datasets with no
supervision. Radiograph/segmentation pairs are generated jointly by the DDMM
sampling process in probabilistic mode. As a result, a vanilla UNet that uses
this data augmentation for the segmentation task outperforms other similarly
data-centric approaches.
UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
April 18, 2023
Soon Yau Cheong, Armin Mustafa, Andrew Gilbert
Existing person image generative models can do either image generation or
pose transfer but not both. We propose a unified diffusion model, UPGPT to
provide a universal solution to perform all the person image tasks:
generation, pose transfer, and editing. With fine-grained multimodality and
disentanglement capabilities, our approach offers fine-grained control over the
generation and the editing process of images using a combination of pose, text,
and image, all without needing a semantic segmentation mask which can be
challenging to obtain or edit. We also pioneer the parameterized body SMPL
model in pose-guided person image generation to demonstrate a new capability:
simultaneous pose and camera view interpolation while maintaining a person’s
appearance. Results on the benchmark DeepFashion dataset show that UPGPT is the
new state-of-the-art while simultaneously pioneering new capabilities of edit
and pose transfer in human image generation.
Two-stage Denoising Diffusion Model for Source Localization in Graph Inverse Problems
April 18, 2023
Bosong Huang, Weihao Yu, Ruzhong Xie, Jing Xiao, Jin Huang
Source localization is the inverse problem of graph information dissemination
and has broad practical applications.
However, the inherent intricacy and uncertainty in information dissemination
pose significant challenges, and the ill-posed nature of the source
localization problem further exacerbates these challenges. Recently, deep
generative models, particularly diffusion models inspired by classical
non-equilibrium thermodynamics, have made significant progress. While diffusion
models have proven to be powerful in solving inverse problems and producing
high-quality reconstructions, applying them directly to the source localization
is infeasible for two reasons. Firstly, it is impossible to calculate the
posterior disseminated results on a large-scale network for iterative denoising
sampling, which would incur enormous computational costs. Secondly, in the
existing methods for this field, the training data itself are ill-posed
(many-to-one); thus simply transferring the diffusion model would only lead to
local optima.
To address these challenges, we propose a two-stage optimization framework,
the source localization denoising diffusion model (SL-Diff). In the coarse
stage, we devise the source proximity degrees as the supervised signals to
generate coarse-grained source predictions. This aims to efficiently initialize
the next stage, significantly reducing its convergence time and calibrating the
convergence process. Furthermore, the introduction of cascade temporal
information in this training method transforms the many-to-one mapping
relationship into a one-to-one relationship, perfectly addressing the ill-posed
problem. In the fine stage, we design a diffusion model for the graph inverse
problem that can quantify the uncertainty in the dissemination. The proposed
SL-Diff yields excellent prediction results within a reasonable sampling time
in extensive experiments.
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
April 18, 2023
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis
Latent Diffusion Models (LDMs) enable high-quality image synthesis while
avoiding excessive compute demands by training a diffusion model in a
compressed lower-dimensional latent space. Here, we apply the LDM paradigm to
high-resolution video generation, a particularly resource-intensive task. We
first pre-train an LDM on images only; then, we turn the image generator into a
video generator by introducing a temporal dimension to the latent space
diffusion model and fine-tuning on encoded image sequences, i.e., videos.
Similarly, we temporally align diffusion model upsamplers, turning them into
temporally consistent video super resolution models. We focus on two relevant
real-world applications: Simulation of in-the-wild driving data and creative
content creation with text-to-video modeling. In particular, we validate our
Video LDM on real driving videos of resolution 512 x 1024, achieving
state-of-the-art performance. Furthermore, our approach can easily leverage
off-the-shelf pre-trained image LDMs, as we only need to train a temporal
alignment model in that case. Doing so, we turn the publicly available,
state-of-the-art text-to-image LDM Stable Diffusion into an efficient and
expressive text-to-video model with resolution up to 1280 x 2048. We show that
the temporal layers trained in this way generalize to different fine-tuned
text-to-image LDMs. Utilizing this property, we show the first results for
personalized text-to-video generation, opening exciting directions for future
content creation. Project page:
https://research.nvidia.com/labs/toronto-ai/VideoLDM/
Synthetic Data from Diffusion Models Improves ImageNet Classification
April 17, 2023
Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, David J. Fleet
cs.CV, cs.AI, cs.CL, cs.LG
Deep generative models are becoming increasingly powerful, now generating
diverse high fidelity photo-realistic samples given text prompts. Have they
reached the point where models of natural images can be used for generative
data augmentation, helping to improve challenging discriminative tasks? We show
that large-scale text-to-image diffusion models can be fine-tuned to produce
class conditional models with SOTA FID (1.76 at 256x256 resolution) and
Inception Score (239 at 256x256). The model also yields a new SOTA in
Classification Accuracy Scores (64.96 for 256x256 generative samples, improving
to 69.24 for 1024x1024 samples). Augmenting the ImageNet training set with
samples from the resulting models yields significant improvements in ImageNet
classification accuracy over strong ResNet and Vision Transformer baselines.
Refusion: Enabling Large-Size Realistic Image Restoration with Latent-Space Diffusion Models
April 17, 2023
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
This work aims to improve the applicability of diffusion models in realistic
image restoration. Specifically, we enhance the diffusion model in several
aspects such as network architecture, noise level, denoising steps, training
image size, and optimizer/scheduler. We show that tuning these hyperparameters
allows us to achieve better performance on both distortion and perceptual
scores. We also propose a U-Net based latent diffusion model which performs
diffusion in a low-resolution latent space while preserving high-resolution
information from the original input for the decoding process. Compared to the
previous latent-diffusion model which trains a VAE-GAN to compress the image,
our proposed U-Net compression strategy is significantly more stable and can
recover highly accurate images without relying on adversarial optimization.
Importantly, these modifications allow us to apply diffusion models to various
image restoration tasks, including real-world shadow removal, HR
non-homogeneous dehazing, stereo super-resolution, and bokeh effect
transformation. By simply replacing the datasets and slightly changing the
noise network, our model, named Refusion, is able to deal with large-size
images (e.g., 6000 x 4000 x 3 in HR dehazing) and produces good results on all
the above restoration problems. Our Refusion achieves the best perceptual
performance in the NTIRE 2023 Image Shadow Removal Challenge and wins 2nd place
overall.
Towards Controllable Diffusion Models via Reward-Guided Exploration
April 14, 2023
Hengtong Zhang, Tingyang Xu
By formulating data samples’ formation as a Markov denoising process,
diffusion models achieve state-of-the-art performances in a collection of
tasks. Recently, many variants of diffusion models have been proposed to enable
controlled sample generation. Most of these existing methods either formulate
the controlling information as an input (i.e., a conditional representation) for
the noise approximator, or introduce a pre-trained classifier in the test-phase
to guide the Langevin dynamic towards the conditional goal. However, the former
line of methods only work when the controlling information can be formulated as
conditional representations, while the latter requires the pre-trained guidance
classifier to be differentiable. In this paper, we propose a novel framework
named RGDM (Reward-Guided Diffusion Model) that guides the training-phase of
diffusion models via reinforcement learning (RL). The proposed training
framework bridges the objective of weighted log-likelihood and maximum entropy
RL, which enables calculating policy gradients via samples from a pay-off
distribution proportional to exponential scaled rewards, rather than from
policies themselves. Such a framework alleviates the high gradient variances
and enables diffusion models to explore for highly rewarded samples in the
reverse process. Experiments on 3D shape and molecule generation tasks show
significant improvements over existing conditional diffusion models.
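A condensed, hedged view of the reward-guided objective above (our notation, not the paper's exact formulation): policy gradients are estimated with samples drawn from the pay-off distribution rather than from the current policy,
$$\nabla_\theta \mathcal{J}(\theta) \approx \mathbb{E}_{\textbf{x} \sim p_r}\big[\nabla_\theta \log p_\theta(\textbf{x})\big], \qquad p_r(\textbf{x}) \propto \exp\big(r(\textbf{x})/\tau\big),$$
which sidesteps the high gradient variance of on-policy sampling and pushes the reverse process toward highly rewarded samples.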
Delta Denoising Score
April 14, 2023
Amir Hertz, Kfir Aberman, Daniel Cohen-Or
We introduce Delta Denoising Score (DDS), a novel scoring function for
text-based image editing that guides minimal modifications of an input image
towards the content described in a target prompt. DDS leverages the rich
generative prior of text-to-image diffusion models and can be used as a loss
term in an optimization problem to steer an image towards a desired direction
dictated by a text. DDS utilizes the Score Distillation Sampling (SDS)
mechanism for the purpose of image editing. We show that using only SDS often
produces non-detailed and blurry outputs due to noisy gradients. To address
this issue, DDS uses a prompt that matches the input image to identify and
remove undesired erroneous directions of SDS. Our key premise is that SDS
should be zero when calculated on pairs of matched prompts and images, meaning
that if the score is non-zero, its gradients can be attributed to the erroneous
component of SDS. Our analysis demonstrates the competence of DDS for text
based image-to-image translation. We further show that DDS can be used to train
an effective zero-shot image translation model. Experimental results indicate
that DDS outperforms existing methods in terms of stability and quality,
highlighting its potential for real-world applications in text-based image
editing.
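A minimal sketch of the delta scoring idea, assuming some standard SDS gradient helper is available; the function names and signatures here are illustrative assumptions, not the paper's code.

```python
def dds_gradient(sds_grad, image_latent, source_prompt_emb, target_prompt_emb):
    """DDS = SDS toward the target prompt minus SDS of the matched (source) pair.

    The matched term should be (approximately) zero in expectation, so subtracting
    it removes the erroneous, blur-inducing component of the raw SDS gradient.
    """
    g_target = sds_grad(image_latent, target_prompt_emb)  # pulls the image toward the edit
    g_source = sds_grad(image_latent, source_prompt_emb)  # erroneous component on matched pair
    return g_target - g_source
```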
Memory Efficient Diffusion Probabilistic Models via Patch-based Generation
April 14, 2023
Shinei Arakawa, Hideki Tsunashima, Daichi Horita, Keitaro Tanaka, Shigeo Morishima
Diffusion probabilistic models have been successful in generating
high-quality and diverse images. However, traditional models, whose input and
output are high-resolution images, suffer from excessive memory requirements,
making them less practical for edge devices. Previous approaches for generative
adversarial networks proposed a patch-based method that uses positional
encoding and global content information. Nevertheless, designing a patch-based
approach for diffusion probabilistic models is non-trivial. In this paper, we
present a diffusion probabilistic model that generates images on a
patch-by-patch basis. We propose two conditioning methods for a patch-based
generation. First, we propose position-wise conditioning using one-hot
representation to ensure patches are in proper positions. Second, we propose
Global Content Conditioning (GCC) to ensure patches have coherent content when
concatenated together. We evaluate our model qualitatively and quantitatively
on CelebA and LSUN bedroom datasets and demonstrate a moderate trade-off
between maximum memory consumption and generated image quality. Specifically,
when an entire image is divided into 2 x 2 patches, our proposed approach can
reduce the maximum memory consumption by half while maintaining comparable
image quality.
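A small sketch of the position-wise one-hot conditioning (our reading of the abstract; the Global Content Conditioning branch is omitted): for a 2 x 2 split, each patch receives a 4-dimensional one-hot label broadcast over its spatial extent and concatenated as extra channels.

```python
import torch

def add_position_onehot(patch, row, col, grid=2):
    """Append a one-hot position map so the model knows which grid cell it generates."""
    b, _, h, w = patch.shape
    onehot = torch.zeros(b, grid * grid, h, w, device=patch.device)
    onehot[:, row * grid + col] = 1.0
    return torch.cat([patch, onehot], dim=1)
```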
Soundini: Sound-Guided Diffusion for Natural Video Editing
April 13, 2023
Seung Hyun Lee, Sieun Kim, Innfarn Yoo, Feng Yang, Donghyeon Cho, Youngseo Kim, Huiwen Chang, Jinkyu Kim, Sangpil Kim
We propose a method for adding sound-guided visual effects to specific
regions of videos with a zero-shot setting. Animating the appearance of the
visual effect is challenging because each frame of the edited video should have
visual changes while maintaining temporal consistency. Moreover, existing video
editing solutions focus on temporal consistency across frames, ignoring the
visual style variations over time, e.g., thunderstorm, wave, fire crackling. To
overcome this limitation, we utilize temporal sound features for the dynamic
style. Specifically, we guide denoising diffusion probabilistic models with an
audio latent representation in the audio-visual latent space. To the best of
our knowledge, our work is the first to explore sound-guided natural video
editing from various sound sources with sound-specialized properties, such as
intensity, timbre, and volume. Additionally, we design optical flow-based
guidance to generate temporally consistent video frames, capturing the
pixel-wise relationship between adjacent frames. Experimental results show that
our method outperforms existing video editing techniques, producing more
realistic visual effects that reflect the properties of sound. Please visit our
page: https://kuai-lab.github.io/soundini-gallery/.
Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction
April 13, 2023
Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, Hao Su
3D-aware image synthesis encompasses a variety of tasks, such as scene
generation and novel view synthesis from images. Despite numerous task-specific
methods, developing a comprehensive model remains challenging. In this paper,
we present SSDNeRF, a unified approach that employs an expressive diffusion
model to learn a generalizable prior of neural radiance fields (NeRF) from
multi-view images of diverse objects. Previous studies have used two-stage
approaches that rely on pretrained NeRFs as real data to train diffusion
models. In contrast, we propose a new single-stage training paradigm with an
end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent
diffusion model, enabling simultaneous 3D reconstruction and prior learning,
even from sparsely available views. At test time, we can directly sample the
diffusion prior for unconditional generation, or combine it with arbitrary
observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates
robust results comparable to or better than leading task-specific methods in
unconditional generation and single/sparse-view 3D reconstruction.
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing
April 13, 2023
Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, Xiuming Zhang
We address the problem of learning person-specific facial priors from a small
number (e.g., 20) of portrait photos of the same person. This enables us to
edit this specific person’s facial appearance, such as expression and lighting,
while preserving their identity and high-frequency facial details. Key to our
approach, which we dub DiffusionRig, is a diffusion model conditioned on, or
“rigged by,” crude 3D face models estimated from single in-the-wild images by
an off-the-shelf estimator. On a high level, DiffusionRig learns to map
simplistic renderings of 3D face models to realistic photos of a given person.
Specifically, DiffusionRig is trained in two stages: It first learns generic
facial priors from a large-scale face dataset and then person-specific priors
from a small portrait photo collection of the person of interest. By learning
the CGI-to-photo mapping with such personalized priors, DiffusionRig can “rig”
the lighting, facial expression, head pose, etc. of a portrait photo,
conditioned only on coarse 3D models while preserving this person’s identity
and other high-frequency characteristics. Qualitative and quantitative
experiments show that DiffusionRig outperforms existing approaches in both
identity preservation and photorealism. Please see the project website:
https://diffusionrig.github.io for the supplemental material, video, code, and
data.
Learning Controllable 3D Diffusion Models from Single-view Images
April 13, 2023
Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, Josh Susskind
Diffusion models have recently become the de-facto approach for generative
modeling in the 2D domain. However, extending diffusion models to 3D is
challenging due to the difficulties in acquiring 3D ground truth data for
training. On the other hand, 3D GANs that integrate implicit 3D representations
into GANs have shown remarkable 3D-aware generation when trained only on
single-view image datasets. However, 3D GANs do not provide straightforward
ways to precisely control image synthesis. To address these challenges, we
present Control3Diff, a 3D diffusion model that combines the strengths of
diffusion models and 3D GANs for versatile, controllable 3D-aware image
synthesis for single-view datasets. Control3Diff explicitly models the
underlying latent distribution (optionally conditioned on external inputs),
thus enabling direct control during the diffusion process. Moreover, our
approach is general and applicable to any type of controlling input, allowing
us to train it with the same diffusion objective without any auxiliary
supervision. We validate the efficacy of Control3Diff on standard image
generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various
conditioning inputs such as images, sketches, and text prompts. Please see the
project website (\url{https://jiataogu.me/control3diff}) for video comparisons.
DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
April 13, 2023
Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li
Diffusion models have proven to be highly effective in generating
high-quality images. However, adapting large pre-trained diffusion models to
new domains remains an open challenge, which is critical for real-world
applications. This paper proposes DiffFit, a parameter-efficient strategy to
fine-tune large pre-trained diffusion models that enable fast adaptation to new
domains. DiffFit is embarrassingly simple: it only fine-tunes the bias terms
and newly added scaling factors in specific layers, yet results in
significant training speed-up and reduced model storage costs. Compared with
full fine-tuning, DiffFit achieves 2$\times$ training speed-up and only needs
to store approximately 0.12\% of the total model parameters. An intuitive
theoretical analysis is provided to justify the efficacy of the scaling
factors for fast adaptation. On 8 downstream datasets, DiffFit achieves superior
or competitive performances compared to the full fine-tuning while being more
efficient. Remarkably, we show that DiffFit can adapt a pre-trained
low-resolution generative model to a high-resolution one by adding minimal
cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of
3.02 on ImageNet 512$\times$512 benchmark by fine-tuning only 25 epochs from a
public pre-trained ImageNet 256$\times$256 checkpoint while being 30$\times$
more training efficient than the closest competitor.
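A hedged sketch of the parameter-efficient recipe described above: freeze the pre-trained model and train only bias terms plus newly added per-block scale factors; the block count and attribute names are illustrative assumptions, not the released DiffFit code.

```python
import torch
import torch.nn as nn

def prepare_difffit(model: nn.Module, num_scaled_blocks: int = 12):
    """Return the trainable parameters: existing biases plus new gamma scale factors."""
    for p in model.parameters():
        p.requires_grad = False                          # freeze the pre-trained weights
    biases = [p for n, p in model.named_parameters() if n.endswith(".bias")]
    for p in biases:
        p.requires_grad = True                           # ...except the bias terms
    # Newly added learnable scaling factors, initialized to 1 so training starts
    # from the pre-trained behaviour; they would be inserted into selected layers.
    gammas = nn.ParameterList([nn.Parameter(torch.ones(1)) for _ in range(num_scaled_blocks)])
    return biases + list(gammas), gammas
```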
An Edit Friendly DDPM Noise Space: Inversion and Manipulations
April 12, 2023
Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli
Denoising diffusion probabilistic models (DDPMs) employ a sequence of white
Gaussian noise samples to generate an image. In analogy with GANs, those noise
maps could be considered as the latent code associated with the generated
image. However, this native noise space does not possess a convenient
structure, and is thus challenging to work with in editing tasks. Here, we
propose an alternative latent noise space for DDPM that enables a wide range of
editing operations via simple means, and present an inversion method for
extracting these edit-friendly noise maps for any given image (real or
synthetically generated). As opposed to the native DDPM noise space, the
edit-friendly noise maps do not have a standard normal distribution and are not
statistically independent across timesteps. However, they allow perfect
reconstruction of any desired image, and simple transformations on them
translate into meaningful manipulations of the output image (e.g., shifting,
color edits). Moreover, in text-conditional models, fixing those noise maps
while changing the text prompt, modifies semantics while retaining structure.
We illustrate how this property enables text-based editing of real images via
the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM
inversion). We also show how it can be used within existing diffusion-based
editing methods to improve their quality and diversity.
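A condensed reading of the inversion (our paraphrase of the procedure, with details elided): the states $x_1, \dots, x_T$ are first sampled independently from the forward marginals $q(x_t \mid x_0)$ of the given image, and the edit-friendly noise map at each step is then isolated from the DDPM update rule,
$$z_t = \frac{x_{t-1} - \hat{\mu}_t(x_t)}{\sigma_t},$$
so that re-running the standard DDPM sampler with these fixed $z_t$ reconstructs $x_0$ exactly, while edits (a changed prompt, shifts, color changes) are applied on top of them.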
E(3)xSO(3)-Equivariant Networks for Spherical Deconvolution in Diffusion MRI
April 12, 2023
Axel Elaldi, Guido Gerig, Neel Dey
We present Roto-Translation Equivariant Spherical Deconvolution (RT-ESD), an
$E(3)\times SO(3)$ equivariant framework for sparse deconvolution of volumes
where each voxel contains a spherical signal. Such 6D data naturally arises in
diffusion MRI (dMRI), a medical imaging modality widely used to measure
microstructure and structural connectivity. As each dMRI voxel is typically a
mixture of various overlapping structures, there is a need for blind
deconvolution to recover crossing anatomical structures such as white matter
tracts. Existing dMRI work takes either an iterative or deep learning approach
to sparse spherical deconvolution, yet it typically does not account for
relationships between neighboring measurements. This work constructs
equivariant deep learning layers which respect the symmetries of spatial
rotations, reflections, and translations, alongside the symmetries of voxelwise
spherical rotations. As a result, RT-ESD improves on previous work across
several tasks including fiber recovery on the DiSCo dataset,
deconvolution-derived partial volume estimation on real-world \textit{in vivo}
human brain dMRI, and improved downstream reconstruction of fiber tractograms
on the Tractometer dataset. Our implementation is available at
https://github.com/AxelElaldi/e3so3_conv
Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA
April 12, 2023
James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, Hongxia Jin
Recent works demonstrate a remarkable ability to customize text-to-image
diffusion models while only providing a few example images. What happens if you
try to customize such models using multiple, fine-grained concepts in a
sequential (i.e., continual) manner? In our work, we show that recent
state-of-the-art customization of text-to-image models suffer from catastrophic
forgetting when new concepts arrive sequentially. Specifically, when adding a
new concept, the ability to generate high-quality images of past, similar
concepts degrades. To circumvent this forgetting, we propose a new method,
C-LoRA, composed of a continually self-regularized low-rank adaptation in cross
attention layers of the popular Stable Diffusion model. Furthermore, we use
customization prompts which do not include the word of the customized object
(i.e., “person” for a human face dataset) and are initialized as completely
random embeddings. Importantly, our method induces only marginal additional
parameter costs and requires no storage of user data for replay. We show that
C-LoRA not only outperforms several baselines for our proposed setting of
text-to-image continual customization, which we refer to as Continual
Diffusion, but that we achieve a new state-of-the-art in the well-established
rehearsal-free continual learning setting for image classification. The high
achieving performance of C-LoRA in two separate domains positions it as a
compelling solution for a wide range of applications, and we believe it has
significant potential for practical impact.
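An illustrative LoRA-with-forgetting-penalty sketch of the idea above (not the authors' code): a low-rank update on a cross-attention projection plus a self-regularization term that discourages new updates where earlier concepts already wrote changes; the buffer and penalty form are simplifying assumptions.

```python
import torch
import torch.nn as nn

class CLoRALinear(nn.Module):
    """A frozen linear layer with a trainable low-rank update and a forgetting penalty."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)                 # frozen pre-trained projection
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        # Accumulated magnitude of low-rank updates from previously learned concepts.
        self.register_buffer("past_delta", torch.zeros(base.out_features, base.in_features))

    def forward(self, x):
        return self.base(x) + x @ (self.B @ self.A).t()

    def regularization(self):
        # Penalize overlap between the new update and regions already used by past
        # concepts (a simplified reading of the continual self-regularization).
        return (self.past_delta.abs() * (self.B @ self.A) ** 2).sum()
```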
DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion
April 12, 2023
Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman
We present DreamPose, a diffusion-based method for generating animated
fashion videos from still images. Given an image and a sequence of human body
poses, our method synthesizes a video containing both human and fabric motion.
To achieve this, we transform a pretrained text-to-image model (Stable
Diffusion) into a pose-and-image guided video synthesis model, using a novel
finetuning strategy, a set of architectural changes to support the added
conditioning signals, and techniques to encourage temporal consistency. We
fine-tune on a collection of fashion videos from the UBC Fashion dataset. We
evaluate our method on a variety of clothing styles and poses, and demonstrate
that our method produces state-of-the-art results on fashion video animation.
Video results are available on our project page.
SpectralDiff: Hyperspectral Image Classification with Spectral-Spatial Diffusion Models
April 12, 2023
Ning Chen, Jun Yue, Leyuan Fang, Shaobo Xia
Hyperspectral image (HSI) classification is an important topic in the field
of remote sensing, and has a wide range of applications in Earth science. HSIs
contain hundreds of continuous bands, which are characterized by high dimension
and high correlation between adjacent bands. The high dimension and redundancy
of HSI data bring great difficulties to HSI classification. In recent years, a
large number of HSI feature extraction and classification methods based on deep
learning have been proposed. However, their ability to model the global
relationships among samples in both spatial and spectral domains is still
limited. In order to solve this problem, an HSI classification method with
spectral-spatial diffusion models is proposed. The proposed method realizes the
reconstruction of spectral-spatial distribution of the training samples with
the forward and reverse spectral-spatial diffusion process, thus modeling the
global spatial-spectral relationship between samples. Then, we use the
spectral-spatial denoising network of the reverse process to extract the
unsupervised diffusion features. Features extracted by the spectral-spatial
diffusion models can achieve cross-sample perception from the reconstructed
distribution of the training samples, thus obtaining better classification
performance. Experiments on three public HSI datasets show that the proposed
method can achieve better performance than the state-of-the-art methods. The
source code and the pre-trained spectral-spatial diffusion model will be
publicly available at https://github.com/chenning0115/SpectralDiff.
Diffusion models with location-scale noise
April 12, 2023
Alexia Jolicoeur-Martineau, Kilian Fatras, Ke Li, Tal Kachman
cs.LG, cs.AI, cs.NA, math.NA
Diffusion Models (DMs) are powerful generative models that add Gaussian noise
to the data and learn to remove it. We wanted to determine which noise
distribution (Gaussian or non-Gaussian) led to better generated data in DMs.
Since DMs are not designed to work with non-Gaussian noise, we built a framework
that allows reversing a diffusion process with non-Gaussian location-scale
noise. We use that framework to show that the Gaussian distribution performs
the best over a wide range of other distributions (Laplace, Uniform, t,
Generalized-Gaussian).
InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions
April 12, 2023
Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, Lan Xu
We have recently seen tremendous progress in diffusion advances for
generating realistic human motions. Yet, they largely disregard the rich
multi-human interactions. In this paper, we present InterGen, an effective
diffusion-based approach that incorporates human-to-human interactions into the
motion diffusion process, which enables layman users to customize high-quality
two-person interaction motions, with only text guidance. We first contribute a
multimodal dataset, named InterHuman. It consists of about 107M frames for
diverse two-person interactions, with accurate skeletal motions and 16,756
natural language descriptions. For the algorithm side, we carefully tailor the
motion diffusion model to our two-person interaction setting. To handle the
symmetry of human identities during interactions, we propose two cooperative
transformer-based denoisers that explicitly share weights, with a mutual
attention mechanism to further connect the two denoising processes. Then, we
propose a novel representation for motion input in our interaction diffusion
model, which explicitly formulates the global relations between the two
performers in the world frame. We further introduce two novel regularization
terms to encode spatial relations, equipped with a corresponding damping scheme
during the training of our interaction diffusion model. Extensive experiments
validate the effectiveness and generalizability of InterGen. Notably, it can
generate more diverse and compelling two-person motions than previous methods
and enables various downstream applications for human interactions.
CamDiff: Camouflage Image Augmentation via Diffusion Model
April 11, 2023
Xue-Jing Luo, Shuo Wang, Zongwei Wu, Christos Sakaridis, Yun Cheng, Deng-Ping Fan, Luc Van Gool
The burgeoning field of camouflaged object detection (COD) seeks to identify
objects that blend into their surroundings. Despite the impressive performance
of recent models, we have identified a limitation in their robustness, where
existing methods may misclassify salient objects as camouflaged ones, despite
these two characteristics being contradictory. This limitation may stem from
a lack of multi-pattern training images, leading to reduced saliency robustness. To
address this issue, we introduce CamDiff, a novel approach inspired by
AI-Generated Content (AIGC) that overcomes the scarcity of multi-pattern
training images. Specifically, we leverage the latent diffusion model to
synthesize salient objects in camouflaged scenes, while using the zero-shot
image classification ability of the Contrastive Language-Image Pre-training
(CLIP) model to prevent synthesis failures and ensure the synthesized object
aligns with the input prompt. Consequently, the synthesized image retains its
original camouflage label while incorporating salient objects, yielding
camouflage samples with richer characteristics. The results of user studies
show that the salient objects in the scenes synthesized by our framework
attract the user’s attention more; thus, such samples pose a greater challenge
to the existing COD models. Our approach enables flexible editing and efficient
large-scale dataset generation at a low cost. It significantly enhances COD
baselines’ training and testing phases, emphasizing robustness across diverse
domains. Our newly-generated datasets and source code are available at
https://github.com/drlxj/CamDiff.
Diffusion Models for Constrained Domains
April 11, 2023
Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, Michael Hutchinson
Denoising diffusion models are a recent class of generative models which
achieve state-of-the-art results in many domains such as unconditional image
generation and text-to-speech tasks. They consist of a noising process
destroying the data and a backward stage defined as the time-reversal of the
noising diffusion. Building on their success, diffusion models have recently
been extended to the Riemannian manifold setting. Yet, these Riemannian
diffusion models require geodesics to be defined for all times. While this
setting encompasses many important applications, it does not include manifolds
defined via a set of inequality constraints, which are ubiquitous in many
scientific domains such as robotics and protein design. In this work, we
introduce two methods to bridge this gap. First, we design a noising process
based on the logarithmic barrier metric induced by the inequality constraints.
Second, we introduce a noising process based on the reflected Brownian motion.
As existing diffusion model techniques cannot be applied in this setting, we
derive new tools to define such models in our framework. We empirically
demonstrate the applicability of our methods to a number of synthetic and
real-world tasks, including the constrained conformational modelling of protein
backbones and robotic arms.
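For intuition, the reflected-noising idea can be pictured on a simple box
constraint. The sketch below is a minimal illustration in Python, assuming a
coordinatewise interval constraint and a simple Euler step; it is not the
authors' implementation, and the bounds, step count, and step size are
arbitrary.

    import numpy as np

    def reflect_into_box(x, lo=0.0, hi=1.0):
        """Fold points back into [lo, hi] by repeated reflection at the boundaries."""
        span = hi - lo
        y = np.mod(x - lo, 2.0 * span)
        return lo + np.where(y > span, 2.0 * span - y, y)

    def reflected_noising(x0, n_steps=500, dt=1e-3, sigma=1.0, lo=0.0, hi=1.0):
        """Simulate reflected Brownian motion started at x0; samples never leave [lo, hi]."""
        x = np.asarray(x0, dtype=float).copy()
        for _ in range(n_steps):
            x = x + sigma * np.sqrt(dt) * np.random.randn(*x.shape)
            x = reflect_into_box(x, lo, hi)  # reflection enforces the inequality constraints
        return x

    # toy usage: noise a batch of points constrained to the unit interval
    noised = reflected_noising(np.random.rand(8))

Reflection keeps every iterate inside the feasible set, which is the property
the barrier-based and reflected constructions above are designed to guarantee.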
Mask-conditioned latent diffusion for generating gastrointestinal polyp images
April 11, 2023
Roman Macháček, Leila Mozaffari, Zahra Sepasdar, Sravanthi Parasa, Pål Halvorsen, Michael A. Riegler, Vajira Thambawita
In order to take advantage of AI solutions in endoscopy diagnostics, we must
overcome the issue of limited annotations. These limitations are caused by the
high privacy concerns in the medical field and the need for expert assistance
in the time-consuming and costly medical data annotation process.
In computer vision, image synthesis has made a significant contribution in
recent years as a result of the progress of generative adversarial networks
(GANs) and diffusion probabilistic models (DPM). Novel DPMs have outperformed
GANs in text, image, and video generation tasks. Therefore, this study proposes
a conditional DPM framework to generate synthetic GI polyp images conditioned
on given generated segmentation masks. Our experimental results show that our
system can generate an unlimited number of high-fidelity synthetic polyp images
with the corresponding ground truth masks of polyps. To test the usefulness of
the generated data, we trained binary image segmentation models to study the
effect of using synthetic data. Results show that the best micro-imagewise IOU
of 0.7751 was achieved from DeepLabv3+ when the training data consists of both
real data and synthetic data. However, the results reflect that achieving good
segmentation performance with synthetic data heavily depends on model
architectures.
Generative modeling for time series via Schrödinger bridge
April 11, 2023
Mohamed Hamdouche, Pierre Henry-Labordere, Huyên Pham
math.OC, math.PR, q-fin.CP, stat.ML
We propose a novel generative model for time series based on the Schrödinger
bridge (SB) approach. It consists of the entropic interpolation via optimal
transport between a reference probability measure on path space and a target
measure consistent with the joint data distribution of the time series. The
solution is characterized by a stochastic differential equation on finite
horizon with a path-dependent drift function, hence respecting the temporal
dynamics of the time series distribution. We can estimate the drift function
from data samples either by kernel regression methods or with LSTM neural
networks, and the simulation of the SB diffusion yields new synthetic data
samples of the time series. The performance of our generative model is
evaluated through a series of numerical experiments. First, we test with a toy
autoregressive model, a GARCH Model, and the example of fractional Brownian
motion, and measure the accuracy of our algorithm with marginal and temporal
dependencies metrics. Next, we use our SB generated synthetic samples for the
application to deep hedging on real data sets. Finally, we illustrate the SB
approach for generating sequences of images.
Binary Latent Diffusion
April 10, 2023
Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu
In this paper, we show that a binary latent space can be explored for compact
yet expressive image representations. We model the bi-directional mappings
between an image and the corresponding latent binary representation by training
an auto-encoder with a Bernoulli encoding distribution. On the one hand, the
binary latent space provides a compact discrete image representation of which
the distribution can be modeled more efficiently than pixels or continuous
latent representations. On the other hand, we now represent each image patch as
a binary vector instead of an index of a learned codebook as in discrete image
representations with vector quantization. In this way, we obtain binary latent
representations that allow for better image quality and high-resolution image
representations without any multi-stage hierarchy in the latent space. In this
binary latent space, images can now be generated effectively using a binary
latent diffusion model tailored specifically for modeling the prior over the
binary image representations. We present both conditional and unconditional
image generation experiments with multiple datasets, and show that the proposed
method performs comparably to state-of-the-art methods while dramatically
improving the sampling efficiency to as few as 16 steps without using any
test-time acceleration. The proposed framework can also be seamlessly scaled to
$1024 \times 1024$ high-resolution image generation without resorting to latent
hierarchy or multi-stage refinements.
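As a rough illustration of the binary bottleneck (a minimal sketch, assuming a
straight-through estimator, which is one common way to train Bernoulli
latents; layer sizes are arbitrary and this is not the paper's code):

    import torch
    import torch.nn as nn

    class BernoulliLatent(nn.Module):
        """Toy Bernoulli bottleneck: maps features to binary codes and passes
        gradients with a straight-through estimator."""
        def __init__(self, in_dim, code_dim):
            super().__init__()
            self.to_logits = nn.Linear(in_dim, code_dim)

        def forward(self, h):
            p = torch.sigmoid(self.to_logits(h))  # Bernoulli parameters in (0, 1)
            b = torch.bernoulli(p)                # hard binary sample
            return p + (b - p).detach()           # forward uses b, gradient flows through p

    codes = BernoulliLatent(64, 32)(torch.randn(8, 64))  # eight 32-bit binary latents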
Ambiguous Medical Image Segmentation using Diffusion Models
April 10, 2023
Aimon Rahman, Jeya Maria Jose Valanarasu, Ilker Hacihaliloglu, Vishal M Patel
Collective insights from a group of experts have always proven to outperform
an individual’s best diagnostic for clinical tasks. For the task of medical
image segmentation, existing research on AI-based alternatives focuses more on
developing models that can imitate the best individual rather than harnessing
the power of expert groups. In this paper, we introduce a single diffusion
model-based approach that produces multiple plausible outputs by learning a
distribution over group insights. Our proposed model generates a distribution
of segmentation masks by leveraging the inherent stochastic sampling process of
diffusion using only minimal additional learning. We demonstrate on three
different medical image modalities (CT, ultrasound, and MRI) that our model is
capable of producing several possible variants while capturing the frequencies
of their occurrences. Comprehensive results show that our proposed approach
outperforms existing state-of-the-art ambiguous segmentation networks in terms
of accuracy while preserving naturally occurring variation. We also propose a
new metric to evaluate the diversity as well as the accuracy of segmentation
predictions that aligns with the interest of clinical practice of collective
insights.
Reflected Diffusion Models
April 10, 2023
Aaron Lou, Stefano Ermon
Score-based diffusion models learn to reverse a stochastic differential
equation that maps data to noise. However, for complex tasks, numerical error
can compound and result in highly unnatural samples. Previous work mitigates
this drift with thresholding, which projects to the natural data domain (such
as pixel space for images) after each diffusion step, but this leads to a
mismatch between the training and generative processes. To incorporate data
constraints in a principled manner, we present Reflected Diffusion Models,
which instead reverse a reflected stochastic differential equation evolving on
the support of the data. Our approach learns the perturbed score function
through a generalized score matching loss and extends key components of
standard diffusion models including diffusion guidance, likelihood-based
training, and ODE sampling. We also bridge the theoretical gap with
thresholding: such schemes are just discretizations of reflected SDEs. On
standard image benchmarks, our method is competitive with or surpasses the
state of the art and, for classifier-free guidance, our approach enables fast
exact sampling with ODEs and produces more faithful samples under high guidance
weight.
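For reference, a reflected SDE on a domain $\Omega$ is commonly written with an
extra boundary term (standard notation, not necessarily the paper's exact
formulation):

    dX_t = f(X_t, t)\,dt + g(t)\,dW_t + \nu(X_t)\,dL_t,

where $L_t$ is a local-time process that increases only when $X_t$ touches the
boundary $\partial\Omega$ and $\nu$ is the inward normal; this reflection term
is what keeps the trajectory on the support of the data, in contrast to
post-hoc thresholding.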
DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion
April 10, 2023
ZiHan Cao, ShiQi Cao, Xiao Wu, JunMing Hou, Ran Ran, Liang-Jian Deng
The denoising diffusion model, as a generative model, has received a lot of
attention in the field of image generation recently, thanks to its powerful
generation capability. However, diffusion models have not yet been sufficiently
studied in the field of image fusion. In this article, we introduce the
diffusion model to the image fusion field, treating the image fusion task as
image-to-image translation and designing two different conditional injection
modulation modules (i.e., style transfer modulation and wavelet modulation) to
inject coarse-grained style information and fine-grained high-frequency and
low-frequency information into the diffusion UNet, thereby generating fused
images. In addition, we also discuss residual learning and the selection of
training objectives for the diffusion model in the image fusion task.
Extensive quantitative and qualitative comparisons with benchmark methods
demonstrate state-of-the-art results and good generalization performance on
image fusion tasks. Finally, we hope that our method can inspire other works
and provide insight into applying the diffusion model to image fusion tasks
more effectively. Code will be released for better reproducibility.
BerDiff: Conditional Bernoulli Diffusion Model for Medical Image Segmentation
April 10, 2023
Tao Chen, Chenhui Wang, Hongming Shan
Medical image segmentation is a challenging task with inherent ambiguity and
high uncertainty, attributed to factors such as unclear tumor boundaries and
multiple plausible annotations. The accuracy and diversity of segmentation
masks are both crucial for providing valuable references to radiologists in
clinical practice. While existing diffusion models have shown strong capacities
in various visual generation tasks, it is still challenging to deal with
discrete masks in segmentation. To achieve accurate and diverse medical image
segmentation masks, we propose a novel conditional Bernoulli Diffusion model
for medical image segmentation (BerDiff). Instead of using Gaussian noise,
we first propose to use Bernoulli noise as the diffusion kernel to enhance
the capacity of the diffusion model for binary segmentation tasks, resulting in
more accurate segmentation masks. Second, by leveraging the stochastic nature
of the diffusion model, our BerDiff randomly samples the initial Bernoulli
noise and intermediate latent variables multiple times to produce a range of
diverse segmentation masks, which can highlight salient regions of interest
that can serve as valuable references for radiologists. In addition, our
BerDiff can efficiently sample sub-sequences from the overall trajectory of the
reverse diffusion, thereby speeding up the segmentation process. Extensive
experimental results on two medical image segmentation datasets with different
modalities demonstrate that our BerDiff outperforms other recently published
state-of-the-art methods. Our results suggest diffusion models could serve as a
strong backbone for medical image segmentation.
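A minimal sketch of a Bernoulli forward-noising step for binary masks is given
below. The kernel shown is the standard binomial-diffusion form (keep a bit
with probability 1 - beta_t, otherwise resample it uniformly); whether BerDiff
uses exactly this parameterization is an assumption.

    import torch

    def bernoulli_forward_step(x_prev, beta_t):
        """One forward step for binary data x in {0, 1}: the Bernoulli mean mixes
        the previous bit with an unbiased coin flip."""
        p = x_prev * (1.0 - beta_t) + 0.5 * beta_t
        return torch.bernoulli(p)

    mask = (torch.rand(1, 1, 64, 64) > 0.5).float()   # toy binary segmentation mask
    noised = bernoulli_forward_step(mask, beta_t=0.1)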
Zero-shot CT Field-of-view Completion with Unconditional Generative Diffusion Prior
April 07, 2023
Kaiwen Xu, Aravind R. Krishnan, Thomas Z. Li, Yuankai Huo, Kim L. Sandler, Fabien Maldonado, Bennett A. Landman
Anatomically consistent field-of-view (FOV) completion to recover truncated
body sections has important applications in quantitative analyses of computed
tomography (CT) with limited FOV. Existing solutions based on conditional
generative models rely on the fidelity of synthetic truncation patterns at the
training phase, which limits the generalizability of the method to potentially
unknown types of truncation. In this study, we evaluate a zero-shot method
based on a pretrained unconditional generative diffusion prior, where
truncation patterns of arbitrary form can be specified at the inference phase.
In an evaluation on simulated chest CT slices with synthetic FOV truncation,
the method is capable of recovering anatomically consistent body sections and
correcting the subcutaneous adipose tissue measurement error caused by FOV
truncation. However, the correction accuracy is inferior to that of the
conditionally trained counterpart.
ChiroDiff: Modelling chirographic data with Diffusion Models
April 07, 2023
Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song
Generative modelling over continuous-time geometric constructs, a.k.a.
chirographic data such as handwriting, sketches, and drawings, has been
accomplished through autoregressive distributions. Such strictly ordered
discrete factorization
however falls short of capturing key properties of chirographic data – it
fails to build holistic understanding of the temporal concept due to one-way
visibility (causality). Consequently, temporal data has been modelled as
discrete token sequences of fixed sampling rate instead of capturing the true
underlying concept. In this paper, we introduce a powerful model-class namely
“Denoising Diffusion Probabilistic Models” or DDPMs for chirographic data that
specifically addresses these flaws. Our model named “ChiroDiff”, being
non-autoregressive, learns to capture holistic concepts and therefore remains
resilient to higher temporal sampling rates to a good extent. Moreover, we
show that many important downstream utilities (e.g. conditional sampling,
creative mixing) can be flexibly implemented using ChiroDiff. We further show
that unique use-cases such as stochastic vectorization, de-noising/healing, and
abstraction are also possible with this model class. We perform quantitative
and qualitative evaluations of our framework on relevant datasets and find it
to be better than or on par with competing approaches.
Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models
April 06, 2023
Guanhua Zhang, Jiabao Ji, Yang Zhang, Mo Yu, Tommi Jaakkola, Shiyu Chang
Image inpainting refers to the task of generating a complete, natural image
based on a partially revealed reference image. Recently, much research
interest has focused on addressing this problem using fixed diffusion
models. These approaches typically directly replace the revealed region of the
intermediate or final generated images with that of the reference image or its
variants. However, since the unrevealed regions are not directly modified to
match the context, it results in incoherence between revealed and unrevealed
regions. To address the incoherence problem, a small number of methods
introduce a rigorous Bayesian framework, but they tend to introduce mismatches
between the generated and the reference images due to the approximation errors
in computing the posterior distributions. In this paper, we propose COPAINT,
which can coherently inpaint the whole image without introducing mismatches.
COPAINT also uses the Bayesian framework to jointly modify both revealed and
unrevealed regions, but approximates the posterior distribution in a way that
allows the errors to gradually drop to zero throughout the denoising steps,
thus strongly penalizing any mismatches with the reference image. Our
experiments verify that COPAINT can outperform the existing diffusion-based
methods under both objective and subjective metrics. The codes are available at
https://github.com/UCSB-NLP-Chang/CoPaint/.
Diffusion Models as Masked Autoencoders
April 06, 2023
Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
There has been a longstanding belief that generation can facilitate a true
understanding of visual data. In line with this, we revisit generatively
pre-training visual representations in light of recent interest in denoising
diffusion models. While directly pre-training with diffusion models does not
produce strong representations, we condition diffusion models on masked input
and formulate diffusion models as masked autoencoders (DiffMAE). Our approach
is capable of (i) serving as a strong initialization for downstream recognition
tasks, (ii) conducting high-quality image inpainting, and (iii) being
effortlessly extended to video where it produces state-of-the-art
classification accuracy. We further perform a comprehensive study on the pros
and cons of design choices and build connections between diffusion models and
masked autoencoders.
Anomaly Detection via Gumbel Noise Score Matching
April 06, 2023
Ahsan Mahmood, Junier Oliva, Martin Styner
We propose Gumbel Noise Score Matching (GNSM), a novel unsupervised method to
detect anomalies in categorical data. GNSM accomplishes this by estimating the
scores, i.e. the gradients of log likelihoods w.r.t. inputs, of continuously
relaxed categorical distributions. We test our method on a suite of anomaly
detection tabular datasets. GNSM achieves a consistently high performance
across all experiments. We further demonstrate the flexibility of GNSM by
applying it to image data where the model is tasked to detect poor segmentation
predictions. Images ranked anomalous by GNSM show clear segmentation failures,
with the outputs of GNSM strongly correlating with segmentation metrics
computed on ground-truth. We outline the score matching training objective
utilized by GNSM and provide an open-source implementation of our work.
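As a loose sketch of the ingredients, one can combine a Gumbel-based continuous
relaxation of categorical data with a denoising score-matching loss; the exact
construction in GNSM may differ, and the score_net interface below is
hypothetical.

    import torch
    import torch.nn.functional as F

    def gumbel_relax(one_hot, tau=0.5):
        """Continuously relax one-hot categorical samples with Gumbel noise."""
        g = -torch.log(-torch.log(torch.rand_like(one_hot) + 1e-20) + 1e-20)
        return F.softmax((torch.log(one_hot + 1e-20) + g) / tau, dim=-1)

    def denoising_score_matching_loss(score_net, x_relaxed, sigma=0.1):
        """Train score_net to predict the score of the Gaussian-perturbed relaxation."""
        noise = torch.randn_like(x_relaxed)
        x_noisy = x_relaxed + sigma * noise
        target = -noise / sigma   # score of N(x_relaxed, sigma^2 I) evaluated at x_noisy
        return F.mse_loss(score_net(x_noisy), target)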
DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model
April 06, 2023
Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun
The increasing demand for high-quality 3D content creation has motivated the
development of automated methods for creating 3D object models from a single
image and/or from a text prompt. However, the reconstructed 3D objects using
state-of-the-art image-to-3D methods still exhibit low correspondence to the
given image and low multi-view consistency. Recent state-of-the-art text-to-3D
methods are also limited, yielding 3D samples with low diversity per prompt
and long synthesis times. To address these challenges, we propose DITTO-NeRF, a
novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a
single image. Our DITTO-NeRF consists of constructing a high-quality partial 3D
object for limited in-boundary (IB) angles using the given or text-generated 2D
image from the frontal view and then iteratively reconstructing the remaining
3D NeRF using an inpainting latent diffusion model. We propose progressive 3D
object reconstruction schemes in terms of scales (low to high resolution),
angles (IB angles initially to outer-boundary (OB) later), and masks (object to
background boundary) in our DITTO-NeRF so that high-quality information on IB
can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods
in terms of fidelity and diversity qualitatively and quantitatively with much
faster training times than prior image/text-to-3D methods such as DreamFusion
and NeuralLift-360.
Zero-shot Medical Image Translation via Frequency-Guided Diffusion Models
April 05, 2023
Yunxiang Li, Hua-Chieh Shao, Xiao Liang, Liyuan Chen, Ruiqi Li, Steve Jiang, Jing Wang, You Zhang
Recently, the diffusion model has emerged as a superior generative model that
can produce high-quality images with excellent realism. There is a growing
interest in applying diffusion models to image translation tasks. However, for
medical image translation, the existing diffusion models are deficient in
accurately retaining structural information since the structure details of
source domain images are lost during the forward diffusion process and cannot
be fully recovered through learned reverse diffusion, while the integrity of
anatomical structures is extremely important in medical images. Training and
conditioning diffusion models using paired source and target images with
matching anatomy can help. However, such paired data are very difficult and
costly to obtain, and may also reduce the robustness of the developed model to
out-of-distribution testing data. We propose a frequency-guided diffusion model
(FGDM) that employs frequency-domain filters to guide the diffusion model for
structure-preserving image translation. Based on its design, FGDM allows
zero-shot learning, as it can be trained solely on the data from the target
domain, and used directly for source-to-target domain translation without any
exposure to the source-domain data during training. We trained FGDM solely on
the head-and-neck CT data, and evaluated it on both head-and-neck and lung
cone-beam CT (CBCT)-to-CT translation tasks. FGDM outperformed the
state-of-the-art methods (GAN-based, VAE-based, and diffusion-based) in all
metrics, showing its significant advantages in zero-shot medical image
translation.
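The frequency-domain guidance can be pictured with a simple FFT low-pass filter
that isolates the structural (low-frequency) content of an image. The mask
shape and how the filtered signal is injected during sampling are assumptions
for illustration, not FGDM's exact design.

    import torch

    def low_frequency_component(x, keep_ratio=0.1):
        """Keep only the central (low-frequency) band of the 2D spectrum of an
        image batch of shape (B, C, H, W)."""
        f = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
        h, w = x.shape[-2:]
        cy, cx = h // 2, w // 2
        ry, rx = max(1, int(h * keep_ratio / 2)), max(1, int(w * keep_ratio / 2))
        mask = torch.zeros_like(f.real)
        mask[..., cy - ry:cy + ry, cx - rx:cx + rx] = 1.0
        return torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real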
GenPhys: From Physical Processes to Generative Models
April 05, 2023
Ziming Liu, Di Luo, Yilun Xu, Tommi Jaakkola, Max Tegmark
cs.LG, cs.AI, physics.comp-ph, physics.data-an, quant-ph
Since diffusion models (DM) and the more recent Poisson flow generative
models (PFGM) are inspired by physical processes, it is reasonable to ask: Can
physical processes offer additional new generative models? We show that the
answer is yes. We introduce a general family, Generative Models from Physical
Processes (GenPhys), where we translate partial differential equations (PDEs)
describing physical processes to generative models. We show that generative
models can be constructed from s-generative PDEs (s for smooth). GenPhys
subsumes the two existing generative models (DM and PFGM) and even gives rise
to new families of generative models, e.g., “Yukawa Generative Models” inspired
by weak interactions. On the other hand, some physical processes by default
do not belong to the GenPhys family, e.g., the wave equation and the
Schrödinger equation, but could be made into the GenPhys family with some
modifications. Our goal with GenPhys is to explore and expand the design space
of generative models.
Generative Novel View Synthesis with 3D-Aware Diffusion Models
April 05, 2023
Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, Gordon Wetzstein
We present a diffusion-based model for 3D-aware generative novel view
synthesis from as few as a single input image. Our model samples from the
distribution of possible renderings consistent with the input and, even in the
presence of ambiguity, is capable of rendering diverse and plausible novel
views. To achieve this, our method makes use of existing 2D diffusion backbones
but, crucially, incorporates geometry priors in the form of a 3D feature
volume. This latent feature field captures the distribution over possible scene
representations and improves our method’s ability to generate view-consistent
novel renderings. In addition to generating novel views, our method has the
ability to autoregressively synthesize 3D-consistent sequences. We demonstrate
state-of-the-art results on synthetic renderings and room-scale scenes; we also
show compelling results for challenging, real-world objects.
EigenFold: Generative Protein Structure Prediction with Diffusion Models
April 05, 2023
Bowen Jing, Ezra Erives, Peter Pao-Huang, Gabriele Corso, Bonnie Berger, Tommi Jaakkola
q-bio.BM, cs.LG, physics.bio-ph
Protein structure prediction has reached revolutionary levels of accuracy on
single structures, yet distributional modeling paradigms are needed to capture
the conformational ensembles and flexibility that underlie biological function.
Towards this goal, we develop EigenFold, a diffusion generative modeling
framework for sampling a distribution of structures from a given protein
sequence. We define a diffusion process that models the structure as a system
of harmonic oscillators and which naturally induces a cascading-resolution
generative process along the eigenmodes of the system. On recent CAMEO targets,
EigenFold achieves a median TMScore of 0.84, while providing a more
comprehensive picture of model uncertainty via the ensemble of sampled
structures relative to existing methods. We then assess EigenFold’s ability to
model and predict conformational heterogeneity for fold-switching proteins and
ligand-induced conformational change. Code is available at
https://github.com/bjing2016/EigenFold.
A Diffusion-based Method for Multi-turn Compositional Image Generation
April 05, 2023
Chao Wang, Xiaoyu Yang, Jinmiao Huang, Kevin Ferreira
Multi-turn compositional image generation (M-CIG) is a challenging task that
aims to iteratively manipulate a reference image given a modification text.
While most of the existing methods for M-CIG are based on generative
adversarial networks (GANs), recent advances in image generation have
demonstrated the superiority of diffusion models over GANs. In this paper, we
propose a diffusion-based method for M-CIG named conditional denoising
diffusion with image compositional matching (CDD-ICM). We leverage CLIP as the
backbone of image and text encoders, and incorporate a gated fusion mechanism,
originally proposed for question answering, to compositionally fuse the
reference image and the modification text at each turn of M-CIG. We introduce a
conditioning scheme to generate the target image based on the fusion results.
To prioritize the semantic quality of the generated target image, we learn an
auxiliary image compositional match (ICM) objective, along with the conditional
denoising diffusion (CDD) objective in a multi-task learning framework.
Additionally, we also perform ICM guidance and classifier-free guidance to
improve performance. Experimental results show that CDD-ICM achieves
state-of-the-art results on two benchmark datasets for M-CIG, i.e., CoDraw and
i-CLEVR.
Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion
April 04, 2023
Davis Rempe, Zhengyi Luo, Xue Bin Peng, Ye Yuan, Kris Kitani, Karsten Kreis, Sanja Fidler, Or Litany
We introduce a method for generating realistic pedestrian trajectories and
full-body animations that can be controlled to meet user-defined goals. We draw
on recent advances in guided diffusion modeling to achieve test-time
controllability of trajectories, which is normally only associated with
rule-based systems. Our guided diffusion model allows users to constrain
trajectories through target waypoints, speed, and specified social groups while
accounting for the surrounding environment context. This trajectory diffusion
model is integrated with a novel physics-based humanoid controller to form a
closed-loop, full-body pedestrian animation system capable of placing large
crowds in a simulated environment with varying terrains. We further propose
utilizing the value function learned during RL training of the animation
controller to guide diffusion to produce trajectories better suited for
particular scenarios such as collision avoidance and traversing uneven terrain.
Video results are available on the project page at
https://nv-tlabs.github.io/trace-pace .
Denoising Diffusion Probabilistic Models to Predict the Density of Molecular Clouds
April 04, 2023
Duo Xu, Jonathan C. Tan, Chia-Jung Hsu, Ye Zhu
astro-ph.GA, astro-ph.IM, cs.LG
We introduce the state-of-the-art deep learning Denoising Diffusion
Probabilistic Model (DDPM) as a method to infer the volume or number density of
giant molecular clouds (GMCs) from projected mass surface density maps. We
adopt magnetohydrodynamic simulations with different global magnetic field
strengths and large-scale dynamics, i.e., noncolliding and colliding GMCs. We
train a diffusion model on both mass surface density maps and their
corresponding mass-weighted number density maps from different viewing angles
for all the simulations. We compare the diffusion model performance with a more
traditional empirical two-component and three-component power-law fitting
method and with a more traditional neural network machine learning approach
(CASI-2D). We conclude that the diffusion model achieves an order of magnitude
improvement in the accuracy of predicting number density compared to the
other methods. We apply the diffusion method to some example astronomical
column density maps of Taurus and the Infrared Dark Clouds (IRDCs) G28.37+0.07
and G35.39-0.33 to produce maps of their mean volume densities.
A Survey on Graph Diffusion Models: Generative AI in Science for Molecule, Protein and Material
April 04, 2023
Mengchun Zhang, Maryam Qamar, Taegoo Kang, Yuna Jung, Chenshuang Zhang, Sung-Ho Bae, Chaoning Zhang
Diffusion models have become a new SOTA generative modeling method in various
fields, for which multiple survey works already provide an overall picture.
With the number of articles on diffusion models increasing
exponentially in the past few years, there is an increasing need for surveys of
diffusion models in specific fields. In this work, we conduct a survey of
graph diffusion models. Even though our focus is to
cover the progress of diffusion models in graphs, we first briefly summarize
how other generative modeling methods are used for graphs. After that, we
introduce the mechanism of diffusion models in various forms, which facilitates
the discussion on the graph diffusion models. The applications of graph
diffusion models mainly fall into the category of AI-generated content (AIGC)
in science, for which we mainly focus on how graph diffusion models are
utilized for generating molecules and proteins but also cover other cases,
including materials design. Moreover, we discuss the issue of evaluating
diffusion models in the graph domain and the existing challenges.
ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model
April 03, 2023
Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu
3D human motion generation is crucial for the creative industry. Recent advances
rely on generative models with domain knowledge for text-driven motion
generation, leading to substantial progress in capturing common motions.
However, the performance on more diverse motions remains unsatisfactory. In
this work, we propose ReMoDiffuse, a diffusion-model-based motion generation
framework that integrates a retrieval mechanism to refine the denoising
process. ReMoDiffuse enhances the generalizability and diversity of text-driven
motion generation with three key designs: 1) Hybrid Retrieval finds appropriate
references from the database in terms of both semantic and kinematic
similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval
knowledge, adapting to the difference between retrieved samples and the target
motion sequence. 3) Condition Mixture better utilizes the retrieval database
during inference, overcoming the scale sensitivity in classifier-free guidance.
Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art
methods by balancing both text-motion consistency and motion quality,
especially for more diverse motion generation.
Diffusion Bridge Mixture Transports, Schrödinger Bridge Problems and Generative Modeling
April 03, 2023
Stefano Peluchetti
The dynamic Schrödinger bridge problem seeks a stochastic process that
defines a transport between two target probability measures, while optimally
satisfying the criteria of being closest, in terms of Kullback-Leibler
divergence, to a reference process.
We propose a novel sampling-based iterative algorithm, the iterated diffusion
bridge mixture transport (IDBM), aimed at solving the dynamic Schrödinger
bridge problem. The IDBM procedure exhibits the attractive property of
realizing a valid coupling between the target measures at each step. We perform
an initial theoretical investigation of the IDBM procedure, establishing its
convergence properties. The theoretical findings are complemented by numerous
numerical experiments illustrating the competitive performance of the IDBM
procedure across various applications.
Recent advancements in generative modeling employ the time-reversal of a
diffusion process to define a generative process that approximately transports
a simple distribution to the data distribution. As an alternative, we propose
using the first iteration of the IDBM procedure as an approximation-free method
for realizing this transport. This approach offers greater flexibility in
selecting the generative process dynamics and exhibits faster training and
superior sample quality over longer discretization intervals. In terms of
implementation, the necessary modifications are minimally intrusive, being
limited to the training loss computation, with no changes necessary for
generative sampling.
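For readers unfamiliar with the objective, the dynamic Schrödinger bridge
problem referenced above can be stated (in standard notation, independent of
this paper's algorithm) as

    \min_{P \,:\, P_0 = \mu,\; P_1 = \nu} \mathrm{KL}(P \,\|\, Q),

where $Q$ is the reference path measure and the minimization runs over path
measures $P$ whose initial and terminal marginals match the target
distributions $\mu$ and $\nu$. IDBM approaches this problem through iterates
that, by construction, realize a valid coupling between the two targets at
every step.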
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
April 03, 2023
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong
We present DreamAvatar, a text-and-shape guided framework for generating
high-quality 3D human avatars with controllable poses. While encouraging
results have been produced by recent methods on text-guided 3D common object
generation, generating high-quality human avatars remains an open challenge due
to the complexity of the human body’s shape, pose, and appearance. We propose
DreamAvatar to tackle this challenge, which utilizes a trainable NeRF for
predicting density and color features for 3D points and a pre-trained
text-to-image diffusion model for providing 2D self-supervision. Specifically,
we leverage SMPL models to provide rough pose and shape guidance for the
generation. We introduce a dual space design that comprises a canonical space
and an observation space, which are related by a learnable deformation field
through the NeRF, allowing for the transfer of well-optimized texture and
geometry from the canonical space to the target posed avatar. Additionally, we
exploit a normal-consistency regularization to allow for more vivid generation
with detailed geometry and texture. Through extensive evaluations, we
demonstrate that DreamAvatar significantly outperforms existing methods,
establishing a new state-of-the-art for text-and-shape guided 3D human
generation.
Textile Pattern Generation Using Diffusion Models
April 02, 2023
Halil Faruk Karagoz, Gulcin Baykal, Irem Arikan Eksi, Gozde Unal
The problem of text-guided image generation is a complex task in Computer
Vision, with various applications, including creating visually appealing
artwork and realistic product images. One popular solution widely used for this
task is the diffusion model, a generative model that generates images through
an iterative process. Although diffusion models have demonstrated promising
results for various image generation tasks, they may not always produce
satisfactory results when applied to more specific domains, such as the
generation of textile patterns based on text guidance. This study presents a
fine-tuned diffusion model specifically trained for textile pattern generation
by text guidance to address this issue. The study involves the collection of
various textile pattern images and their captioning with the help of another AI
model. The fine-tuned diffusion model is trained with this newly created
dataset, and its results are compared with the baseline models visually and
numerically. The results demonstrate that the proposed fine-tuned diffusion
model outperforms the baseline models in terms of pattern quality and
efficiency in textile pattern generation by text guidance. This study presents
a promising solution to the problem of text-guided textile pattern generation
and has the potential to simplify the design process within the textile
industry.
$\infty$-Diff: Infinite Resolution Diffusion with Subsampled Mollified States
March 31, 2023
Sam Bond-Taylor, Chris G. Willcocks
We introduce $\infty$-Diff, a generative diffusion model which directly
operates on infinite resolution data. By randomly sampling subsets of
coordinates during training and learning to denoise the content at those
coordinates, a continuous function is learned that allows sampling at arbitrary
resolutions. In contrast to other recent infinite resolution generative models,
our approach operates directly on the raw data, neither requiring latent vector
compression for context, nor using hypernetworks, nor relying on discrete
components. As such, our approach achieves significantly higher sample quality,
as evidenced by lower FID scores, as well as being able to effectively scale to
higher resolutions than the training data while retaining detail.
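A minimal sketch of the coordinate-subsampling idea follows; the
denoiser(coords, noisy_vals) interface is hypothetical, and the mollified
states of the actual model are not reproduced.

    import torch
    import torch.nn.functional as F

    def subsampled_denoising_loss(denoiser, image, sigma=0.5, n_coords=2048):
        """Denoise only a random subset of pixel coordinates, so that the learned
        function can later be queried at arbitrary resolution."""
        b, c, h, w = image.shape
        ys = torch.randint(0, h, (b, n_coords))
        xs = torch.randint(0, w, (b, n_coords))
        flat = image.permute(0, 2, 3, 1).reshape(b, h * w, c)
        idx = (ys * w + xs).unsqueeze(-1).expand(-1, -1, c)
        vals = torch.gather(flat, 1, idx)                           # (b, n_coords, c)
        coords = torch.stack([ys.float() / h, xs.float() / w], -1)  # normalized (y, x)
        noise = torch.randn_like(vals)
        pred = denoiser(coords, vals + sigma * noise)               # predict the added noise
        return F.mse_loss(pred, noise)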
A Closer Look at Parameter-Efficient Tuning in Diffusion Models
March 31, 2023
Chendong Xiang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu
Large-scale diffusion models like Stable Diffusion are powerful and find
various real-world applications, while customizing such models by fine-tuning
is both memory- and time-inefficient. Motivated by the recent progress in natural
language processing, we investigate parameter-efficient tuning in large
diffusion models by inserting small learnable modules (termed adapters). In
particular, we decompose the design space of adapters into orthogonal factors
(the input position, the output position, and the function form) and perform
Analysis of Variance (ANOVA), a classical statistical approach for
analyzing the correlation between discrete (design options) and continuous
variables (evaluation metrics). Our analysis suggests that the input position
of adapters is the critical factor influencing the performance of downstream
tasks. Then, we carefully study the choice of the input position, and we find
that putting the input position after the cross-attention block can lead to the
best performance, validated by additional visualization analyses. Finally, we
provide a recipe for parameter-efficient tuning in diffusion models, which is
comparable if not superior to the fully fine-tuned baseline (e.g., DreamBooth)
with only 0.75% extra parameters, across various customized tasks.
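A minimal bottleneck-adapter sketch of the kind studied here is shown below;
the layer sizes are illustrative, and the paper's main finding concerns where
such a module is inserted (after the cross-attention block), not its internal
design.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Small residual bottleneck inserted into a frozen diffusion backbone;
        only these parameters are trained."""
        def __init__(self, dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()

        def forward(self, x):
            return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact

    # usage (hypothetical placement): h = cross_attention(h, text_emb); h = adapter(h)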
Diffusion Action Segmentation
March 31, 2023
Daochang Liu, Qiyue Li, AnhDung Dinh, Tingting Jiang, Mubarak Shah, Chang Xu
Temporal action segmentation is crucial for understanding long-form videos.
Previous works on this task commonly adopt an iterative refinement paradigm by
using multi-stage models. Our paper proposes an essentially different framework
via denoising diffusion models, which nonetheless shares the same inherent
spirit of such iterative refinement. In this framework, action predictions are
progressively generated from random noise with input video features as
conditions. To enhance the modeling of three striking characteristics of human
actions, including the position prior, the boundary ambiguity, and the
relational dependency, we devise a unified masking strategy for the
conditioning inputs in our framework. Extensive experiments on three benchmark
datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed
method achieves superior or comparable results to state-of-the-art methods,
showing the effectiveness of a generative approach for action segmentation. Our
codes will be made available.
Reference-based Image Composition with Sketch via Structure-aware Diffusion Model
March 31, 2023
Kangyeol Kim, Sunghyun Park, Junsoo Lee, Jaegul Choo
Recent remarkable improvements in large-scale text-to-image generative models
have shown promising results in generating high-fidelity images. To further
enhance editability and enable fine-grained generation, we introduce a
multi-input-conditioned image composition model that incorporates a sketch as a
novel modal, alongside a reference image. Thanks to the edge-level
controllability using sketches, our method enables a user to edit or complete
an image sub-part with a desired structure (i.e., sketch) and content (i.e.,
reference image). Our framework fine-tunes a pre-trained diffusion model to
complete missing regions using the reference image while maintaining sketch
guidance. Albeit simple, this opens up wide opportunities to fulfill user
needs for obtaining the desired images. Through extensive experiments, we
demonstrate that our proposed method offers unique use cases for image
manipulation, enabling user-driven modifications of arbitrary scenes.
Token Merging for Fast Stable Diffusion
March 30, 2023
Daniel Bolya, Judy Hoffman
The landscape of image generation has been forever changed by open vocabulary
diffusion models. However, at their core these models use transformers, which
makes generation slow. Better implementations to increase the throughput of
these transformers have emerged, but they still evaluate the entire model. In
this paper, we instead speed up diffusion models by exploiting natural
redundancy in generated images and merging redundant tokens. After making some
diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable
Diffusion can reduce the number of tokens in an existing Stable Diffusion model
by up to 60% while still producing high quality images without any extra
training. In the process, we speed up image generation by up to 2x and reduce
memory consumption by up to 5.6x. Furthermore, this speed-up stacks with
efficient implementations such as xFormers, minimally impacting quality while
being up to 5.4x faster for large images. Code is available at
https://github.com/dbolya/tomesd.
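As a simplified, single-sequence illustration of token merging (ToMe itself
uses bipartite soft matching and proportional attention; this greedy version
only conveys the idea of averaging the most similar tokens):

    import torch
    import torch.nn.functional as F

    def greedy_token_merge(x, r):
        """Merge r token pairs in a (N, D) sequence by averaging the most similar pair."""
        for _ in range(r):
            sim = F.normalize(x, dim=-1) @ F.normalize(x, dim=-1).T
            sim.fill_diagonal_(float("-inf"))
            flat = int(sim.argmax())
            i, j = flat // sim.shape[1], flat % sim.shape[1]
            if i > j:
                i, j = j, i
            merged = 0.5 * (x[i] + x[j])
            x = torch.cat([x[:j], x[j + 1:]])   # drop token j (i < j, so index i is unchanged)
            x[i] = merged
        return x

    tokens = greedy_token_merge(torch.randn(196, 64), r=50)  # 196 -> 146 tokens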
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
March 30, 2023
Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi
The unlearning problem of deep learning models, once primarily an academic
concern, has become a prevalent issue in the industry. The significant advances
in text-to-image generation techniques have prompted global discussions on
privacy, copyright, and safety, as numerous unauthorized personal IDs, content,
artistic creations, and potentially harmful materials have been learned by
these models and later utilized to generate and distribute uncontrolled
content. To address this challenge, we propose Forget-Me-Not, an
efficient and low-cost solution designed to safely remove specified IDs,
objects, or styles from a well-configured text-to-image model in as little as
30 seconds, without impairing its ability to generate other content. Alongside
our method, we introduce the Memorization Score (M-Score) and ConceptBench to
measure the models’ capacity to generate general
concepts, grouped into three primary categories: ID, object, and style. Using
M-Score and ConceptBench, we demonstrate that Forget-Me-Not can effectively
eliminate targeted concepts while maintaining the model’s performance on other
concepts. Furthermore, Forget-Me-Not offers two practical extensions: a)
removal of potentially harmful or NSFW content, and b) enhancement of model
accuracy, inclusion and diversity through concept correction and
disentanglement. It can also be adapted as a lightweight model patch for
Stable Diffusion, allowing for concept manipulation and convenient
distribution. To encourage future research in this critical area and promote
the development of safe and inclusive generative models, we will open-source
our code and ConceptBench at
https://github.com/SHI-Labs/Forget-Me-Not.
DDP: Diffusion Model for Dense Visual Prediction
March 30, 2023
Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
We propose a simple, efficient, yet powerful framework for dense visual
predictions based on the conditional diffusion pipeline. Our approach follows a
“noise-to-map” generative paradigm for prediction by progressively removing
noise from a random Gaussian distribution, guided by the image. The method,
called DDP, efficiently extends the denoising diffusion process into the modern
perception pipeline. Without task-specific design and architecture
customization, DDP is easy to generalize to most dense prediction tasks, e.g.,
semantic segmentation and depth estimation. In addition, DDP shows attractive
properties such as dynamic inference and uncertainty awareness, in contrast to
previous single-step discriminative methods. We show top results on three
representative tasks with six diverse benchmarks; without tricks, DDP achieves
state-of-the-art or competitive performance on each task compared to the
specialist counterparts. For example, semantic segmentation (83.9 mIoU on
Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation
(0.05 REL on KITTI). We hope that our approach will serve as a solid baseline
and facilitate future research.
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
March 30, 2023
Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi
Image editing using diffusion models has witnessed extremely fast-paced
growth recently. There are various ways in which previous works enable
controlling and editing images. Some works use high-level conditioning such as
text, while others use low-level conditioning. Nevertheless, most of them lack
fine-grained control over the properties of the different objects present in
the image, i.e. object-level image editing. In this work, we consider an image
as a composition of multiple objects, each defined by various properties. Out
of these properties, we identify structure and appearance as the most intuitive
to understand and useful for editing purposes. We propose
Structure-and-Appearance Paired Diffusion model (PAIR-Diffusion), which is
trained using structure and appearance information explicitly extracted from
the images. The proposed model enables users to inject a reference image’s
appearance into the input image at both the object and global levels.
Additionally, PAIR-Diffusion allows editing the structure while maintaining the
style of individual components of the image unchanged. We extensively evaluate
our method on LSUN datasets and the CelebA-HQ face dataset, and we demonstrate
fine-grained control over both structure and appearance at the object level. We
also apply the method to Stable Diffusion to edit any real image at the
object level.
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
March 30, 2023
Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li
Recently, diffusion models have achieved great success in image synthesis.
However, when it comes to the layout-to-image generation where an image often
has a complex scene of multiple objects, how to make strong control over both
the global layout map and each detailed object remains a challenging task. In
this paper, we propose a diffusion model named LayoutDiffusion that can obtain
higher generation quality and greater controllability than the previous works.
To overcome the difficult multimodal fusion of image and layout, we propose to
construct a structural image patch with region information and transform the
patched image into a special layout to fuse with the normal layout in a unified
form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention
(OaCA) are proposed to model the relationship among multiple objects and
designed to be object-aware and position-sensitive, allowing for precisely
controlling the spatial related information. Extensive experiments show that
our LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by a
relative 46.35% and 26.70% on COCO-stuff, and 44.29% and 41.82% on VG. Code is
available at https://github.com/ZGCTroy/LayoutDiffusion.
Discriminative Class Tokens for Text-to-Image Diffusion Models
March 30, 2023
Idan Schwartz, Vésteinn Snæbjarnarson, Sagie Benaim, Hila Chefer, Ryan Cotterell, Lior Wolf, Serge Belongie
Recent advances in text-to-image diffusion models have enabled the generation
of diverse and high-quality images. However, generated images often fall short
of depicting subtle details and are susceptible to errors due to ambiguity in
the input text. One way of alleviating these issues is to train diffusion
models on class-labeled datasets. This comes with a downside: doing so limits
their expressive power: (i) supervised datasets are generally small compared to
large-scale scraped text-image datasets on which text-to-image models are
trained, and so the quality and diversity of generated images are severely
affected, or (ii) the input is a hard-coded label, as opposed to free-form
text, which limits the control over the generated images.
In this work, we propose a non-invasive fine-tuning technique that
capitalizes on the expressive potential of free-form text while achieving high
accuracy through discriminative signals from a pretrained classifier, which
guides the generation. This is done by iteratively modifying the embedding of a
single input token of a text-to-image diffusion model, using the classifier, by
steering generated images toward a given target class. Our method is fast
compared to prior fine-tuning methods and does not require a collection of
in-class images or retraining of a noise-tolerant classifier. We evaluate our
method extensively, showing that the generated images are: (i) more accurate
and of higher quality than standard diffusion models, (ii) can be used to
augment training data in a low-resource setting, and (iii) reveal information
about the data used to train the guiding classifier. The code is available at
https://github.com/idansc/discriminative_class_tokens.
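A heavily simplified sketch of the token-steering loop is given below; a single
denoising call stands in for the full sampling trajectory, and
denoise_with_token and classifier are hypothetical callables rather than the
authors' API.

    import torch
    import torch.nn.functional as F

    def steer_class_token(token_emb, denoise_with_token, classifier, target_class,
                          steps=10, lr=1e-2):
        """Iteratively nudge a single token embedding so generated images score
        higher under a frozen classifier for the target class."""
        token_emb = token_emb.clone().requires_grad_(True)
        opt = torch.optim.Adam([token_emb], lr=lr)
        for _ in range(steps):
            x = denoise_with_token(torch.randn(1, 3, 64, 64), token_emb)  # image-like output
            loss = F.cross_entropy(classifier(x), torch.tensor([target_class]))
            opt.zero_grad()
            loss.backward()
            opt.step()
        return token_emb.detach()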
DiffCollage: Parallel Generation of Large Content with Diffusion Models
March 30, 2023
Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, Ming-Yu Liu
We present DiffCollage, a compositional diffusion model that can generate
large content by leveraging diffusion models trained on generating pieces of
the large content. Our approach is based on a factor graph representation where
each factor node represents a portion of the content and a variable node
represents their overlap. This representation allows us to aggregate
intermediate outputs from diffusion models defined on individual nodes to
generate content of arbitrary size and shape in parallel without resorting to
an autoregressive generation procedure. We apply DiffCollage to various tasks,
including infinite image generation, panorama image generation, and
long-duration text-guided motion generation. Extensive experimental results
with a comparison to strong autoregressive baselines verify the effectiveness
of our approach.
HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion
March 29, 2023
Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, Angela Dai
Implicit neural fields, typically encoded by a multilayer perceptron (MLP)
that maps from coordinates (e.g., xyz) to signals (e.g., signed distances),
have shown remarkable promise as a high-fidelity and compact representation.
However, the lack of a regular and explicit grid structure also makes it
challenging to apply generative modeling directly on implicit neural fields in
order to synthesize new data. To this end, we propose HyperDiffusion, a novel
approach for unconditional generative modeling of implicit neural fields.
HyperDiffusion operates directly on MLP weights and generates new neural
implicit fields encoded by synthesized MLP parameters. Specifically, a
collection of MLPs is first optimized to faithfully represent individual data
samples. Subsequently, a diffusion process is trained in this MLP weight space
to model the underlying distribution of neural implicit fields. HyperDiffusion
enables diffusion modeling over an implicit, compact, and yet high-fidelity
representation of complex signals across 3D shapes and 4D mesh animations
within one single unified framework.
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
March 29, 2023
Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan
cs.CV, cs.LG, cs.SD, eess.AS
Modeling sounds emitted from physical object interactions is critical for
immersive perceptual experiences in real and virtual worlds. Traditional
methods of impact sound synthesis use physics simulation to obtain a set of
physics parameters that could represent and synthesize the sound. However, they
require fine details of both the object geometries and impact locations, which
are rarely available in the real world and can not be applied to synthesize
impact sounds from common videos. On the other hand, existing video-driven deep
learning-based approaches could only capture the weak correspondence between
visual content and impact sounds since they lack physics knowledge. In this
work, we propose a physics-driven diffusion model that can synthesize
high-fidelity impact sound for a silent video clip. In addition to the video
content, we propose to use additional physics priors to guide the impact sound
synthesis procedure. The physics priors include both physics parameters that
are directly estimated from noisy real-world impact sound examples without
sophisticated setup and learned residual parameters that interpret the sound
environment via neural networks. We further implement a novel diffusion model
with specific training and inference strategies to combine physics priors and
visual information for impact sound synthesis. Experimental results show that
our model outperforms several existing systems in generating realistic impact
sounds. More importantly, the physics-based representations are fully
interpretable and transparent, thus enabling us to perform sound editing
flexibly.
Diffusion Schrödinger Bridge Matching
March 29, 2023
Yuyang Shi, Valentin De Bortoli, Andrew Campbell, Arnaud Doucet
Solving transport problems, i.e. finding a map transporting one given
distribution to another, has numerous applications in machine learning. Novel
mass transport methods motivated by generative modeling have recently been
proposed, e.g. Denoising Diffusion Models (DDMs) and Flow Matching Models
(FMMs) implement such a transport through a Stochastic Differential Equation
(SDE) or an Ordinary Differential Equation (ODE). However, while it is
desirable in many applications to approximate the deterministic dynamic Optimal
Transport (OT) map which admits attractive properties, DDMs and FMMs are not
guaranteed to provide transports close to the OT map. In contrast,
Schrödinger bridges (SBs) compute stochastic dynamic mappings which recover
entropy-regularized versions of OT. Unfortunately, existing numerical methods
approximating SBs either scale poorly with dimension or accumulate errors
across iterations. In this work, we introduce Iterative Markovian Fitting
(IMF), a new methodology for solving SB problems, and Diffusion Schrödinger
Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates.
DSBM significantly improves over previous SB numerics and recovers as
special/limiting cases various recent transport methods. We demonstrate the
performance of DSBM on a variety of problems.
4D Facial Expression Diffusion Model
March 29, 2023
Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo
Facial expression generation is one of the most challenging and long-sought
aspects of character animation, with many interesting applications. The
challenging task, which has traditionally relied heavily on digital
craftspersons, remains largely unexplored. In this paper, we introduce a
generative framework
for generating 3D facial expression sequences (i.e. 4D faces) that can be
conditioned on different inputs to animate an arbitrary 3D face mesh. It is
composed of two tasks: (1) Learning the generative model that is trained over a
set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input
facial mesh driven by the generated landmark sequences. The generative model is
based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved
remarkable success in generative tasks of other domains. While it can be
trained unconditionally, its reverse process can still be conditioned by
various condition signals. This allows us to efficiently develop several
downstream tasks involving various forms of conditional generation, using expression
labels, text, partial sequences, or simply a facial geometry. To obtain the
full mesh deformation, we then develop a landmark-guided encoder-decoder to
apply the geometrical deformation embedded in landmarks on a given facial mesh.
Experiments show that our model has learned to generate realistic, high-quality
expressions solely from a dataset of relatively small size, improving over
the state-of-the-art methods. Videos and qualitative comparisons with other
methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models
will be made available upon acceptance.
WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models
March 29, 2023
Konstantina Nikolaidou, George Retsinas, Vincent Christlein, Mathias Seuret, Giorgos Sfikas, Elisa Barney Smith, Hamam Mokayed, Marcus Liwicki
Text-to-Image synthesis is the task of generating an image according to a
specific text description. Generative Adversarial Networks have been considered
the standard method for image synthesis virtually since their introduction.
Denoising Diffusion Probabilistic Models have recently set a new baseline,
with remarkable results in Text-to-Image synthesis, among other fields. Aside
from its usefulness per se, it can also be particularly relevant as a tool for data
augmentation to aid training models for other document image processing tasks.
In this work, we present a latent diffusion-based method for styled
text-to-text-content-image generation at the word level. Our proposed method is
able to generate realistic word image samples in different writer styles, by
using class-index styles and text-content prompts, without the need for
adversarial training, writer recognition, or text recognition. We gauge system
performance with the Fréchet Inception Distance, writer recognition accuracy,
and writer retrieval. We show that the proposed model produces samples that are
aesthetically pleasing, help boost text recognition performance, and achieve a
writer retrieval score similar to that of real data. Code is available at:
https://github.com/koninik/WordStylist.
Your Diffusion Model is Secretly a Zero-Shot Classifier
March 28, 2023
Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, Deepak Pathak
cs.LG, cs.AI, cs.CV, cs.NE, cs.RO
The recent wave of large-scale text-to-image diffusion models has
dramatically increased our text-based image generation abilities. These models
can generate realistic images for a staggering variety of prompts and exhibit
impressive compositional generalization abilities. Almost all use cases thus
far have solely focused on sampling; however, diffusion models can also provide
conditional density estimates, which are useful for tasks beyond image
generation. In this paper, we show that the density estimates from large-scale
text-to-image diffusion models like Stable Diffusion can be leveraged to
perform zero-shot classification without any additional training. Our
generative approach to classification, which we call Diffusion Classifier,
attains strong results on a variety of benchmarks and outperforms alternative
methods of extracting knowledge from diffusion models. Although a gap remains
between generative and discriminative approaches on zero-shot recognition
tasks, we find that our diffusion-based approach has stronger multimodal
relational reasoning abilities than competing discriminative approaches.
Finally, we use Diffusion Classifier to extract standard classifiers from
class-conditional diffusion models trained on ImageNet. Even though these
models are trained with weak augmentations and no regularization, they approach
the performance of SOTA discriminative classifiers. Overall, our results are a
step toward using generative over discriminative models for downstream tasks.
Results and visualizations at https://diffusion-classifier.github.io/
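As a rough illustration of how conditional denoising error can be turned into a zero-shot classifier, here is a minimal sketch; `eps_model`, its call signature, and `alphas_cumprod` are assumed placeholders, and the variance-reduction tricks from the paper are omitted:

```python
import torch

@torch.no_grad()
def diffusion_classify(x0, class_prompts, eps_model, alphas_cumprod, n_trials=32):
    """Minimal sketch of generative zero-shot classification (hypothetical interfaces).

    For each candidate class prompt, estimate the expected noise-prediction error
    of a conditional diffusion model; the class with the lowest error is returned.
    `eps_model(x_t, t, prompt)` and `alphas_cumprod` are assumptions, not a real API.
    """
    errors = []
    for prompt in class_prompts:
        err = 0.0
        for _ in range(n_trials):
            t = torch.randint(0, len(alphas_cumprod), (1,), device=x0.device)
            a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
            noise = torch.randn_like(x0)
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise      # forward diffusion
            err += torch.mean((eps_model(x_t, t, prompt) - noise) ** 2).item()
        errors.append(err / n_trials)
    return int(torch.tensor(errors).argmin())                          # predicted class index
```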
Visual Chain-of-Thought Diffusion Models
March 28, 2023
William Harvey, Frank Wood
Recent progress with conditional image diffusion models has been stunning,
and this holds true whether we are speaking about models conditioned on a text
description, a scene layout, or a sketch. Unconditional image diffusion models
are also improving but lag behind, as do diffusion models which are conditioned
on lower-dimensional features like class labels. We propose to close the gap
between conditional and unconditional models using a two-stage sampling
procedure. In the first stage we sample an embedding describing the semantic
content of the image. In the second stage we sample the image conditioned on
this embedding and then discard the embedding. Doing so lets us leverage the
power of conditional diffusion models on the unconditional generation task,
which we show improves FID by 25-50% compared to standard unconditional
generation.
DDMM-Synth: A Denoising Diffusion Model for Cross-modal Medical Image Synthesis with Sparse-view Measurement Embedding
March 28, 2023
Xiaoyue Li, Kai Shang, Gaoang Wang, Mark D. Butala
eess.IV, cs.CV, physics.med-ph
Reducing the radiation dose in computed tomography (CT) is important to
mitigate radiation-induced risks. One option is to employ a well-trained model
to compensate for incomplete information and map sparse-view measurements to
the CT reconstruction. However, reconstruction from sparsely sampled
measurements is insufficient to uniquely characterize an object in CT, and a
learned prior model may be inadequate for unencountered cases. Medical modality
translation from magnetic resonance imaging (MRI) to CT is an alternative, but it
may introduce incorrect information into the synthesized CT images, and no
explicit transformation describes the relationship between the two modalities.
To address these issues, we propose a novel framework called the
denoising diffusion model for medical image synthesis (DDMM-Synth) to close the
performance gaps described above. This framework combines an MRI-guided
diffusion model with a new CT measurement embedding reverse sampling scheme.
Specifically, the null-space content of the one-step denoising result is
refined by the MRI-guided data distribution prior, and its range-space
component derived from an explicit operator matrix and the sparse-view CT
measurements is directly integrated into the inference stage. DDMM-Synth can
adjust the projection number of CT a posteriori for a particular clinical
application and its modified version can even improve the results significantly
for noisy cases. Our results show that DDMM-Synth outperforms other
state-of-the-art supervised-learning-based baselines under fair experimental
conditions.
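The range/null-space split mentioned in the abstract is commonly written as follows (this is our reading, in DDNM-style notation, not necessarily the paper's exact update), where A is the sparse-view projection operator, A† its pseudo-inverse, y the CT measurements, and x_{0|t} the one-step denoising estimate refined by the MRI-guided prior:

```latex
% Assumed form of the range/null-space data-consistency step referenced in the abstract.
\hat{x}_{0|t} \;=\; \underbrace{A^{\dagger} y}_{\text{range space, fixed by the measurements}}
\;+\; \underbrace{\left(I - A^{\dagger} A\right) x_{0|t}}_{\text{null space, refined by the prior}}
```

By construction A\hat{x}_{0|t} = y (when y lies in the range of A), so measurement consistency is enforced while the learned prior only shapes the unobserved null-space component.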
DiffULD: Diffusive Universal Lesion Detection
March 28, 2023
Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou
Universal Lesion Detection (ULD) in computed tomography (CT) plays an
essential role in computer-aided diagnosis. Promising ULD results have been
reported by anchor-based detection designs, but they have inherent drawbacks
due to the use of anchors: i) Insufficient training targets and ii)
Difficulties in anchor design. Diffusion probability models (DPM) have
demonstrated outstanding capabilities in many vision tasks. Many DPM-based
approaches achieve great success in natural-image object detection without
using anchors. However, they remain ineffective for ULD due to insufficient
training targets. In this paper, we propose a novel ULD method, DiffULD, which
utilizes DPM for lesion detection. To tackle the negative effect triggered by
insufficient targets, we introduce a novel center-aligned bounding box padding
strategy that provides additional high-quality training targets yet avoids
significant performance deterioration. DiffULD is inherently advanced in
locating lesions with diverse sizes and shapes since it can predict with
arbitrary boxes. Experiments on the benchmark dataset DeepLesion show the
superiority of DiffULD when compared to state-of-the-art ULD approaches.
Anti-DreamBooth: Protecting users from personalized text-to-image synthesis
March 27, 2023
Thanh Van Le, Hao Phung, Thuan Hoang Nguyen, Quan Dao, Ngoc Tran, Anh Tran
Text-to-image diffusion models are nothing short of a revolution, allowing anyone,
even without design skills, to create realistic images from simple text inputs.
With powerful personalization tools like DreamBooth, they can generate images
of a specific person just by learning from his/her few reference images.
However, when misused, such a powerful and convenient tool can produce fake
news or disturbing content targeting any individual victim, posing a severe
negative social impact. In this paper, we explore a defense system called
Anti-DreamBooth against such malicious use of DreamBooth. The system aims to
add subtle noise perturbation to each user’s image before publishing in order
to disrupt the generation quality of any DreamBooth model trained on these
perturbed images. We investigate a wide range of algorithms for perturbation
optimization and extensively evaluate them on two facial datasets over various
text-to-image model versions. Despite the complicated formulation of DreamBooth
and Diffusion-based text-to-image models, our methods effectively defend users
from the malicious use of those models. Their effectiveness withstands even
adverse conditions, such as model or prompt/term mismatching between training
and testing. Our code will be available at
https://github.com/VinAIResearch/Anti-DreamBooth.git.
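A minimal PGD-style sketch of the general idea of protective perturbations, namely maximizing a surrogate diffusion model's denoising loss on the protected images; `surrogate_loss` is an assumed callable, and this is not the paper's exact perturbation objectives:

```python
import torch

def adversarial_perturb(x, surrogate_loss, eps=8 / 255, step=1 / 255, n_steps=50):
    """PGD-style sketch of protective perturbation (illustrative, not the paper's algorithm).

    `surrogate_loss(x)` is an assumed callable returning the denoising loss of a
    surrogate diffusion model on image batch `x`; we *maximize* it so that models
    later personalized on the perturbed images fit them poorly.
    """
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(n_steps):
        loss = surrogate_loss(x + delta)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()          # gradient ascent on the loss
            delta.clamp_(-eps, eps)                    # keep the perturbation imperceptible
            delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()
```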
Debiasing Scores and Prompts of 2D Diffusion for Robust Text-to-3D Generation
March 27, 2023
Susung Hong, Donghoon Ahn, Seungryong Kim
cs.CV, cs.CL, cs.GR, cs.LG
The view inconsistency problem in score-distilling text-to-3D generation,
also known as the Janus problem, arises from the intrinsic bias of 2D diffusion
models, which leads to the unrealistic generation of 3D objects. In this work,
we explore score-distilling text-to-3D generation and identify the main causes
of the Janus problem. Based on these findings, we propose two approaches to
debias the score-distillation frameworks for robust text-to-3D generation. Our
first approach, called score debiasing, involves gradually increasing the
truncation value for the score estimated by 2D diffusion models throughout the
optimization process. Our second approach, called prompt debiasing, identifies
conflicting words between user prompts and view prompts utilizing a language
model and adjusts the discrepancy between view prompts and object-space camera
poses. Our experimental results show that our methods improve realism by
significantly reducing artifacts and achieve a good trade-off between
faithfulness to the 2D diffusion models and 3D consistency with little
overhead.
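A sketch of how the score-debiasing idea (gradually increasing the truncation of the 2D score) might look in code; the linear schedule and the bounds here are illustrative assumptions, not values from the paper:

```python
import torch

def debiased_score(eps_pred, step, max_steps, c_min=1.0, c_max=4.0):
    """Sketch of score debiasing by truncation (our reading of the abstract, not official code).

    Element-wise clamp the predicted noise/score; the truncation bound grows from
    `c_min` to `c_max` over the text-to-3D optimization, so early iterations use a
    conservative (debiased) score and later ones the full score.
    """
    c = c_min + (c_max - c_min) * (step / max(1, max_steps))   # gradually increasing bound
    return eps_pred.clamp(-c, c)
```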
Training-free Style Transfer Emerges from h-space in Diffusion models
March 27, 2023
Jaeseok Jeong, Mingi Kwon, Youngjung Uh
Diffusion models (DMs) synthesize high-quality images in various domains.
However, controlling their generative process is still hazy because the
intermediate variables in the process are not rigorously studied. Recently,
StyleCLIP-like editing of DMs was found to be possible in the bottleneck of the
U-Net, named $h$-space. In this paper, we discover that DMs inherently have disentangled
representations for content and style of the resulting images: $h$-space
contains the content and the skip connections convey the style. Furthermore, we
introduce a principled way to inject the content of one image into another,
taking into account the progressive nature of the generative process. Briefly, given the
original generative process, 1) the feature of the source content should be
gradually blended, 2) the blended feature should be normalized to preserve the
distribution, 3) the change of skip connections due to content injection should
be calibrated. Then, the resulting image has the source content with the style
of the original image, just like image-to-image translation. Interestingly,
injecting content into the styles of unseen domains produces harmonization-like
style transfer. To the best of our knowledge, our method introduces the first
training-free, feed-forward style transfer that relies only on an unconditional
pretrained frozen generative network. The code is available at
https://curryjung.github.io/DiffStyle/.
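A simplified sketch of the content-injection recipe (blend the bottleneck features, then re-normalize); the skip-connection calibration step from the abstract is omitted and all tensor shapes are assumptions:

```python
import torch

def inject_content(h_target, h_source, gamma):
    """Sketch of $h$-space content injection (our simplification of the paper's recipe).

    `h_target` is the U-Net bottleneck feature of the image being styled and
    `h_source` that of the content image at the same timestep; `gamma` in [0, 1]
    grows over the reverse process so the content is blended in gradually. The
    blend is re-normalized to match the target feature's channel statistics.
    """
    h = (1 - gamma) * h_target + gamma * h_source
    mu_t = h_target.mean(dim=(2, 3), keepdim=True)
    std_t = h_target.std(dim=(2, 3), keepdim=True)
    mu_h = h.mean(dim=(2, 3), keepdim=True)
    std_h = h.std(dim=(2, 3), keepdim=True)
    return (h - mu_h) / (std_h + 1e-6) * std_t + mu_t    # preserve the feature distribution
```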
Exploring Continual Learning of Diffusion Models
March 27, 2023
Michał Zając, Kamil Deja, Anna Kuzina, Jakub M. Tomczak, Tomasz Trzciński, Florian Shkurti, Piotr Miłoś
cs.LG, cs.AI, cs.CV, stat.ML
Diffusion models have achieved remarkable success in generating high-quality
images thanks to their novel training procedures applied to unprecedented
amounts of data. However, training a diffusion model from scratch is
computationally expensive. This highlights the need to investigate the
possibility of training these models iteratively, reusing computation while the
data distribution changes. In this study, we take the first step in this
direction and evaluate the continual learning (CL) properties of diffusion
models. We begin by benchmarking the most common CL methods applied to
Denoising Diffusion Probabilistic Models (DDPMs), where we note the strong
performance of the experience replay with the reduced rehearsal coefficient.
Furthermore, we provide insights into the dynamics of forgetting, which exhibit
diverse behavior across diffusion timesteps. We also uncover certain pitfalls
of using the bits-per-dimension metric for evaluating CL.
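A minimal sketch of experience replay with a rehearsal coefficient for a DDPM training loop; `diffusion_loss`, the optimizer setup, and the replay buffer are assumed placeholders:

```python
import torch

def replay_training_step(model, batch_new, batch_replay, diffusion_loss, optimizer, rho=0.5):
    """Sketch of experience replay for continually trained DDPMs (hypothetical interfaces).

    `diffusion_loss(model, x)` is an assumed standard denoising loss; `rho` is the
    rehearsal coefficient weighting the loss on samples replayed from a buffer of
    earlier tasks. The abstract reports that a *reduced* rehearsal coefficient works well.
    """
    loss = diffusion_loss(model, batch_new) + rho * diffusion_loss(model, batch_replay)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```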
Text-to-Image Diffusion Models are Zero-Shot Classifiers
March 27, 2023
Kevin Clark, Priyank Jaini
The excellent generative capabilities of text-to-image diffusion models
suggest they learn informative representations of image-text data. However,
what knowledge their representations capture is not fully understood, and they
have not been thoroughly explored on downstream tasks. We investigate diffusion
models by proposing a method for evaluating them as zero-shot classifiers. The
key idea is using a diffusion model’s ability to denoise a noised image given a
text description of a label as a proxy for that label’s likelihood. We apply
our method to Imagen, using it to probe fine-grained aspects of Imagen’s
knowledge and comparing it with CLIP’s zero-shot abilities. Imagen performs
competitively with CLIP on a wide range of zero-shot image classification
datasets. Additionally, it achieves state-of-the-art results on shape/texture
bias tests and can successfully perform attribute binding while CLIP cannot.
Although generative pre-training is prevalent in NLP, visual foundation models
often use other methods such as contrastive learning. Based on our findings, we
argue that generative pre-training should be explored as a compelling
alternative for vision and vision-language problems.
Diffusion Denoised Smoothing for Certified and Adversarial Robust Out-Of-Distribution Detection
March 27, 2023
Nicola Franco, Daniel Korth, Jeanette Miriam Lorenz, Karsten Roscher, Stephan Guennemann
As the use of machine learning continues to expand, the importance of
ensuring its safety cannot be overstated. A key concern in this regard is the
ability to identify whether a given sample is from the training distribution,
or is an “Out-Of-Distribution” (OOD) sample. In addition, adversaries can
manipulate OOD samples in ways that lead a classifier to make a confident
prediction. In this study, we present a novel approach for certifying the
robustness of OOD detection within a $\ell_2$-norm around the input, regardless
of network architecture and without the need for specific components or
additional training. Further, we improve current techniques for detecting
adversarial attacks on OOD samples, while providing high levels of certified
and adversarial robustness on in-distribution samples. The average of all OOD
detection metrics on CIFAR10/100 shows an increase of $\sim 13 \% / 5\%$
relative to previous approaches.
Seer: Language Instructed Video Prediction with Latent Diffusion Models
March 27, 2023
Xianfan Gu, Chuan Wen, Jiaming Song, Yang Gao
Imagining the future trajectory is the key for robots to make sound planning
and successfully reach their goals. Therefore, text-conditioned video
prediction (TVP) is an essential task to facilitate general robot policy
learning, i.e., predicting future video frames with a given language
instruction and reference frames. It is a highly challenging task to ground
task-level goals specified by instructions and high-fidelity frames together,
requiring large-scale data and computation. To tackle this task and empower
robots with the ability to foresee the future, we propose a sample and
computation-efficient model, named Seer, by inflating the pretrained
text-to-image (T2I) stable diffusion models along the temporal axis. We inflate
the denoising U-Net and language conditioning model with two novel techniques,
Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer,
to propagate the rich prior knowledge in the pretrained T2I models across the
frames. With the well-designed architecture, Seer makes it possible to generate
high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a
few layers on a small amount of data. The experimental results on Something
Something V2 (SSv2) and Bridgedata datasets demonstrate our superior video
prediction performance with around 210-hour training on 4 RTX 3090 GPUs:
decreasing the FVD of the current SOTA model from 290 to 200 on SSv2 and
achieving at least 70% preference in the human evaluation.
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
March 27, 2023
Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang
cs.CV, cs.AI, cs.LG, cs.MM
We propose a new formulation of temporal action detection (TAD) with
denoising diffusion, DiffTAD in short. Taking as input random temporal
proposals, it can accurately yield action proposals given an untrimmed long
video. This adopts a generative modeling perspective, in contrast to previous
discriminative learning approaches. This capability is achieved by first diffusing
the ground-truth proposals to random ones (i.e., the forward/noising process)
and then learning to reverse the noising process (i.e., the backward/denoising
process). Concretely, we establish the denoising process in the Transformer
decoder (e.g., DETR) by introducing a temporal location query design with
faster convergence in training. We further propose a cross-step selective
conditioning algorithm for inference acceleration. Extensive evaluations on
ActivityNet and THUMOS show that our DiffTAD achieves top performance compared
to previous art alternatives. The code will be made available at
https://github.com/sauradip/DiffusionTAD.
Conditional Score-Based Reconstructions for Multi-contrast MRI
March 26, 2023
Brett Levac, Ajil Jalal, Kannan Ramchandran, Jonathan I. Tamir
Magnetic resonance imaging (MRI) exam protocols consist of multiple
contrast-weighted images of the same anatomy to emphasize different tissue
properties. Due to the long acquisition times required to collect fully sampled
k-space measurements, it is common to only collect a fraction of k-space for
each scan and subsequently solve independent inverse problems for each image
contrast. Recently, there has been a push to further accelerate MRI exams using
data-driven priors, and generative models in particular, to regularize the
ill-posed inverse problem of image reconstruction. These methods have shown
promising improvements over classical methods. However, many of the approaches
neglect the additional information present in a clinical MRI exam like the
multi-contrast nature of the data and treat each scan as an independent
reconstruction. In this work we show that by learning a joint Bayesian prior
over multi-contrast data with a score-based generative model we are able to
leverage the underlying structure between random variables related to a given
imaging problem. This leads to an improvement in image reconstruction fidelity
over generative models that rely only on a marginal prior over the image
contrast of interest.
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
March 26, 2023
Tenglong Ao, Zeyi Zhang, Libin Liu
The automatic generation of stylized co-speech gestures has recently received
increasing attention. Previous systems typically allow style control via
predefined text labels or example motion clips, which are often not flexible
enough to convey user intent accurately. In this work, we present
GestureDiffuCLIP, a neural network framework for synthesizing realistic,
stylized co-speech gestures with flexible style control. We leverage the power
of the large-scale Contrastive-Language-Image-Pre-training (CLIP) model and
present a novel CLIP-guided mechanism that extracts efficient style
representations from multiple input modalities, such as a piece of text, an
example motion clip, or a video. Our system learns a latent diffusion model to
generate high-quality gestures and infuses the CLIP representations of style
into the generator via an adaptive instance normalization (AdaIN) layer. We
further devise a gesture-transcript alignment mechanism that ensures a
semantically correct gesture generation based on contrastive learning. Our
system can also be extended to allow fine-grained style control of individual
body parts. We demonstrate an extensive set of examples showing the flexibility
and generalizability of our model to a variety of style descriptions. In a user
study, we show that our system outperforms the state-of-the-art approaches
regarding human likeness, appropriateness, and style correctness.
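A sketch of how a CLIP style code can be injected through an AdaIN layer, as the abstract describes; the dimensions and the linear modulation head are illustrative assumptions rather than the released architecture:

```python
import torch
import torch.nn as nn

class ClipStyleAdaIN(nn.Module):
    """Sketch of injecting a CLIP style embedding via AdaIN (illustration, not the released model).

    A linear head maps the CLIP style code to per-channel scale and shift, which
    re-modulate the instance-normalized gesture features inside the generator.
    """
    def __init__(self, feat_dim, clip_dim=512):
        super().__init__()
        self.norm = nn.InstanceNorm1d(feat_dim, affine=False)
        self.to_scale_shift = nn.Linear(clip_dim, 2 * feat_dim)

    def forward(self, feat, clip_style):        # feat: (B, C, T), clip_style: (B, clip_dim)
        scale, shift = self.to_scale_shift(clip_style).chunk(2, dim=-1)
        return self.norm(feat) * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```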
DiracDiffusion: Denoising and Incremental Reconstruction with Assured Data-Consistency
March 25, 2023
Zalan Fabian, Berk Tinaz, Mahdi Soltanolkotabi
eess.IV, cs.CV, cs.LG, I.2.6; I.4.4; I.4.5
Diffusion models have established new state of the art in a multitude of
computer vision tasks, including image restoration. Diffusion-based inverse
problem solvers generate reconstructions of exceptional visual quality from
heavily corrupted measurements. However, in what is widely known as the
perception-distortion trade-off, the price of perceptually appealing
reconstructions is often paid in declined distortion metrics, such as PSNR.
Distortion metrics measure faithfulness to the observation, a crucial
requirement in inverse problems. In this work, we propose a novel framework for
inverse problem solving, namely we assume that the observation comes from a
stochastic degradation process that gradually degrades and noises the original
clean image. We learn to reverse the degradation process in order to recover
the clean image. Our technique maintains consistency with the original
measurement throughout the reverse process, and allows for great flexibility in
trading off perceptual quality for improved distortion metrics and sampling
speedup via early-stopping. We demonstrate the efficiency of our method on
different high-resolution datasets and inverse problems, achieving great
improvements over other state-of-the-art diffusion-based methods with respect
to both perceptual and distortion metrics. Source code and pre-trained models
will be released soon.
DiffuScene: Scene Graph Denoising Diffusion Probabilistic Model for Generative Indoor Scene Synthesis
March 24, 2023
Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, Matthias Nießner
We present DiffuScene for indoor 3D scene synthesis based on a novel scene
graph denoising diffusion probabilistic model, which generates 3D instance
properties stored in a fully-connected scene graph and then retrieves the most
similar object geometry for each graph node, i.e., object instance, which is
characterized as a concatenation of different attributes, including location,
size, orientation, semantic, and geometry features. Based on this scene graph,
we design a diffusion model to determine the placements and types of 3D
instances. Our method can facilitate many downstream applications, including
scene completion, scene arrangement, and text-conditioned scene synthesis.
Experiments on the 3D-FRONT dataset show that our method can synthesize more
physically plausible and diverse indoor scenes than state-of-the-art methods.
Extensive ablation studies verify the effectiveness of our design choice in
scene diffusion models.
MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion
March 24, 2023
Yizhuo Lu, Changde Du, Dianpeng Wang, Huiguang He
Reconstructing visual stimuli from measured functional magnetic resonance
imaging (fMRI) has been a meaningful and challenging task. Previous studies
have successfully achieved reconstructions with structures similar to the
original images, such as the outlines and size of some natural images. However,
these reconstructions lack explicit semantic information and are difficult to
discern. In recent years, many studies have utilized multi-modal pre-trained
models with stronger generative capabilities to reconstruct images that are
semantically similar to the original ones. However, these images have
uncontrollable structural information such as position and orientation. To
address both of the aforementioned issues simultaneously, we propose a
two-stage image reconstruction model called MindDiffuser, utilizing Stable
Diffusion. In Stage 1, the VQ-VAE latent representations and the CLIP text
embeddings decoded from fMRI are put into the image-to-image process of Stable
Diffusion, which yields a preliminary image that contains semantic and
structural information. In Stage 2, we utilize the low-level CLIP visual
features decoded from fMRI as supervisory information, and continually adjust
the two features in Stage 1 through backpropagation to align the structural
information. The results of both qualitative and quantitative analyses
demonstrate that our proposed model has surpassed the current state-of-the-art
models in terms of reconstruction results on Natural Scenes Dataset (NSD).
Furthermore, the results of ablation experiments indicate that each component
of our model is effective for image reconstruction.
CoLa-Diff: Conditional Latent Diffusion Model for Multi-Modal MRI Synthesis
March 24, 2023
Lan Jiang, Ye Mao, Xi Chen, Xiangfeng Wang, Chao Li
eess.IV, cs.CV, I.3.3; I.4.10
MRI synthesis promises to mitigate the challenge of missing MRI modality in
clinical practice. Diffusion model has emerged as an effective technique for
image synthesis by modelling complex and variable data distributions. However,
most diffusion-based MRI synthesis models use a single modality. As they
operate in the original image domain, they are memory-intensive and less
feasible for multi-modal synthesis. Moreover, they often fail to preserve the
anatomical structure in MRI. Further, balancing the multiple conditions from
multi-modal MRI inputs is crucial for multi-modal synthesis. Here, we propose
the first diffusion-based multi-modality MRI synthesis model, namely
Conditioned Latent Diffusion Model (CoLa-Diff). To reduce memory consumption,
we design CoLa-Diff to operate in the latent space. We propose a novel network
architecture, e.g., similar cooperative filtering, to solve the possible
compression and noise in latent space. To better maintain the anatomical
structure, brain region masks are introduced as the priors of density
distributions to guide the diffusion process. We further present auto-weight
adaptation to employ multi-modal information effectively. Our experiments
demonstrate that CoLa-Diff outperforms other state-of-the-art MRI synthesis
methods, promising to serve as an effective tool for multi-modal MRI synthesis.
DisC-Diff: Disentangled Conditional Diffusion Model for Multi-Contrast MRI Super-Resolution
March 24, 2023
Ye Mao, Lan Jiang, Xi Chen, Chao Li
Multi-contrast magnetic resonance imaging (MRI) is the most common management
tool used to characterize neurological disorders based on brain tissue
contrasts. However, acquiring high-resolution MRI scans is time-consuming and
infeasible under specific conditions. Hence, multi-contrast super-resolution
methods have been developed to improve the quality of low-resolution contrasts
by leveraging complementary information from multi-contrast MRI. Current deep
learning-based super-resolution methods have limitations in estimating
restoration uncertainty and avoiding mode collapse. Although the diffusion
model has emerged as a promising approach for image enhancement, capturing
complex interactions between multiple conditions introduced by multi-contrast
MRI super-resolution remains a challenge for clinical applications. In this
paper, we propose a disentangled conditional diffusion model, DisC-Diff, for
multi-contrast brain MRI super-resolution. It utilizes the sampling-based
generation and simple objective function of diffusion models to estimate
uncertainty in restorations effectively and ensure a stable optimization
process. Moreover, DisC-Diff leverages a disentangled multi-stream network to
fully exploit complementary information from multi-contrast MRI, improving
model interpretation under multiple conditions of multi-contrast inputs. We
validated the effectiveness of DisC-Diff on two datasets: the IXI dataset,
which contains 578 normal brains, and a clinical dataset with 316 pathological
brains. Our experimental results demonstrate that DisC-Diff outperforms other
state-of-the-art methods both quantitatively and visually.
Conditional Image-to-Video Generation with Latent Flow Diffusion Models
March 24, 2023
Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, Martin Renqiang Min
Conditional image-to-video (cI2V) generation aims to synthesize a new
plausible video starting from an image (e.g., a person’s face) and a condition
(e.g., an action class label like smile). The key challenge of the cI2V task
lies in the simultaneous generation of realistic spatial appearance and
temporal dynamics corresponding to the given image and condition. In this
paper, we propose an approach for cI2V using novel latent flow diffusion models
(LFDM) that synthesize an optical flow sequence in the latent space based on
the given condition to warp the given image. Compared to previous
direct-synthesis-based works, our proposed LFDM can better synthesize spatial
details and temporal motion by fully utilizing the spatial content of the given
image and warping it in the latent space according to the generated
temporally-coherent flow. The training of LFDM consists of two separate stages:
(1) an unsupervised learning stage to train a latent flow auto-encoder for
spatial content generation, including a flow predictor to estimate latent flow
between pairs of video frames, and (2) a conditional learning stage to train a
3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike
previous DMs operating in pixel space or latent feature space that couples
spatial and temporal information, the DM in our LFDM only needs to learn a
low-dimensional latent flow space for motion generation, thus being more
computationally efficient. We conduct comprehensive experiments on multiple
datasets, where LFDM consistently outperforms prior arts. Furthermore, we show
that LFDM can be easily adapted to new domains by simply finetuning the image
decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM.
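A minimal sketch of the latent warping step that flow-based cI2V methods rely on: converting a predicted flow field into a sampling grid and warping the latent image with bilinear interpolation. The shapes and the pixel-displacement convention are assumptions:

```python
import torch
import torch.nn.functional as F

def warp_with_flow(latent, flow):
    """Sketch of warping a latent image with a predicted flow field (illustrative only).

    `latent`: (B, C, H, W) spatial content from the flow auto-encoder;
    `flow`: (B, 2, H, W) displacement in pixels. The flow is converted to a
    normalized sampling grid and applied with bilinear interpolation.
    """
    b, _, h, w = latent.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=latent.device),
                            torch.arange(w, device=latent.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0) + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1          # normalize x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1          # normalize y to [-1, 1]
    return F.grid_sample(latent, grid, align_corners=True)
```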
End-to-End Diffusion Latent Optimization Improves Classifier Guidance
March 23, 2023
Bram Wallace, Akash Gokul, Stefano Ermon, Nikhil Naik
Classifier guidance – using the gradients of an image classifier to steer
the generations of a diffusion model – has the potential to dramatically
expand the creative control over image generation and editing. However,
currently classifier guidance requires either training new noise-aware models
to obtain accurate gradients or using a one-step denoising approximation of the
final generation, which leads to misaligned gradients and sub-optimal control.
We highlight this approximation’s shortcomings and propose a novel guidance
method: Direct Optimization of Diffusion Latents (DOODL), which enables
plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of
a pre-trained classifier on the true generated pixels, using an invertible
diffusion process to achieve memory-efficient backpropagation. Showcasing the
potential of more precise guidance, DOODL outperforms one-step classifier
guidance on computational and human evaluation metrics across different forms
of guidance: using CLIP guidance to improve generations of complex prompts from
DrawBench, using fine-grained visual classifiers to expand the vocabulary of
Stable Diffusion, enabling image-conditioned generation with a CLIP visual
encoder, and improving image aesthetics using an aesthetic scoring network.
Code at https://github.com/salesforce/DOODL.
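A heavily simplified sketch of direct latent optimization against a classifier objective; it treats the whole sampling chain as a differentiable black box (`generate`, an assumption) and does not reproduce DOODL's invertible-diffusion, memory-efficient backpropagation:

```python
import torch

def optimize_latent(z_init, generate, classifier_loss, n_steps=20, lr=0.05):
    """Simplified sketch of optimizing a diffusion latent against a classifier objective.

    `generate(z)` is an assumed differentiable map from the initial latent to the
    final image; `classifier_loss` scores the generated pixels (e.g. a CLIP or
    aesthetic objective). Not a faithful reproduction of DOODL.
    """
    z = z_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        loss = classifier_loss(generate(z))
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():                      # keep the latent near the Gaussian shell;
            z.mul_(z_init.norm() / z.norm())       # a common heuristic, not claimed from the paper
    return z.detach()
```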
Ablating Concepts in Text-to-Image Diffusion Models
March 23, 2023
Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu
Large-scale text-to-image diffusion models can generate high-fidelity images
with powerful compositional ability. However, these models are typically
trained on an enormous amount of Internet data, often containing copyrighted
material, licensed images, and personal photos. Furthermore, they have been
found to replicate the style of various living artists or memorize exact
training samples. How can we remove such copyrighted concepts or images without
retraining the model from scratch? To achieve this goal, we propose an
efficient method of ablating concepts in the pretrained model, i.e., preventing
the generation of a target concept. Our algorithm learns to match the image
distribution for a target style, instance, or text prompt we wish to ablate to
the distribution corresponding to an anchor concept. This prevents the model
from generating target concepts given its text condition. Extensive experiments
show that our method can successfully prevent the generation of the ablated
concept while preserving closely related concepts in the model.
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
March 23, 2023
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
Recent text-to-video generation approaches rely on computationally heavy
training and require large-scale video datasets. In this paper, we introduce a
new task of zero-shot text-to-video generation and propose a low-cost approach
(without any training or optimization) by leveraging the power of existing
text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable
for the video domain.
Our key modifications include (i) enriching the latent codes of the generated
frames with motion dynamics to keep the global scene and the background time
consistent; and (ii) reprogramming frame-level self-attention using a new
cross-frame attention of each frame on the first frame, to preserve the
context, appearance, and identity of the foreground object.
Experiments show that this leads to low overhead, yet high-quality and
remarkably consistent video generation. Moreover, our approach is not limited
to text-to-video synthesis but is also applicable to other tasks such as
conditional and content-specialized video generation, and Video
Instruct-Pix2Pix, i.e., instruction-guided video editing.
As experiments show, our method performs comparably or sometimes better than
recent approaches, despite not being trained on additional video data. Our code
will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .
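A sketch of the cross-frame attention reprogramming described above, where every frame's queries attend to keys and values computed from the first frame; the projection modules and tensor shapes are illustrative assumptions:

```python
import torch

def cross_frame_attention(q_proj, k_proj, v_proj, frame_feats):
    """Sketch of cross-frame attention keyed on the first frame (illustrative, not released code).

    `frame_feats`: (B, T, N, C) per-frame token features from a frozen text-to-image
    U-Net; each frame's queries attend to keys/values computed from frame 0 only,
    which helps keep the appearance and identity of the foreground consistent.
    """
    b, t, n, c = frame_feats.shape
    q = q_proj(frame_feats)                         # (B, T, N, C)
    k = k_proj(frame_feats[:, :1])                  # (B, 1, N, C) -- first frame only
    v = v_proj(frame_feats[:, :1])
    attn = torch.softmax(q @ k.transpose(-1, -2) / c ** 0.5, dim=-1)   # (B, T, N, N)
    return attn @ v                                 # broadcasts over the time dimension
```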
Medical diffusion on a budget: textual inversion for medical image generation
March 23, 2023
Bram de Wilde, Anindo Saha, Richard P. G. ten Broek, Henkjan Huisman
Diffusion-based models for text-to-image generation have gained immense
popularity due to recent advancements in efficiency, accessibility, and
quality. Although it is becoming increasingly feasible to perform inference
with these systems using consumer-grade GPUs, training them from scratch still
requires access to large datasets and significant computational resources. In
the case of medical image generation, the availability of large, publicly
accessible datasets that include text reports is limited due to legal and
ethical concerns. While training a diffusion model on a private dataset may
address this issue, it is not always feasible for institutions lacking the
necessary computational resources. This work demonstrates that pre-trained
Stable Diffusion models, originally trained on natural images, can be adapted
to various medical imaging modalities by training text embeddings with textual
inversion. In this study, we conducted experiments using medical datasets
comprising only 100 samples from three medical modalities. Embeddings were
trained in a matter of hours, while still retaining diagnostic relevance in
image generation. Experiments were designed to achieve several objectives.
Firstly, we fine-tuned the training and inference processes of textual
inversion, revealing that larger embeddings and more examples are required.
Secondly, we validated our approach by demonstrating a 2% increase in the
diagnostic accuracy (AUC) for detecting prostate cancer on MRI, which is a
challenging multi-modal imaging modality, from 0.78 to 0.80. Thirdly, we
performed simulations by interpolating between healthy and diseased states,
combining multiple pathologies, and inpainting to show embedding flexibility
and control of disease appearance. Finally, the embeddings trained in this
study are small (less than 1 MB), which facilitates easy sharing of medical
data with reduced privacy concerns.
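A generic sketch of one textual-inversion update, where only the new token embedding is optimized while the diffusion model stays frozen; `frozen_pipeline_loss` is an assumed wrapper, not a specific library API:

```python
import torch

def textual_inversion_step(embed, frozen_pipeline_loss, optimizer):
    """Sketch of a single textual-inversion update (generic, library-agnostic).

    Only `embed`, the embedding vector(s) of a new placeholder token, is trainable;
    the diffusion model, text encoder, and VAE stay frozen. `frozen_pipeline_loss(embed)`
    is an assumed callable that inserts the embedding into the text encoder and returns
    the usual denoising loss on a batch of (medical) training images.
    """
    loss = frozen_pipeline_loss(embed)       # standard eps-prediction MSE under the hood
    optimizer.zero_grad()
    loss.backward()                          # gradients flow only into the token embedding
    optimizer.step()
    return loss.item()
```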
March 23, 2023
Ce Zheng, Guo-Jun Qi, Chen Chen
cs.CV, cs.AI, cs.HC, cs.MM
Human mesh recovery (HMR) provides rich human body information for various
real-world applications such as gaming, human-computer interaction, and virtual
reality. Compared to single image-based methods, video-based methods can
utilize temporal information to further improve performance by incorporating
human body motion priors. However, many-to-many approaches such as VIBE suffer
from a lack of motion smoothness and temporal inconsistency, while many-to-one
approaches such as TCMR and MPS-Net rely on future frames, which is non-causal
and time-inefficient during inference. To address these challenges, a novel
Diffusion-Driven Transformer-based framework (DDT) for video-based HMR is
presented. DDT is designed to decode specific motion patterns from the input
sequence, enhancing motion smoothness and temporal consistency. As a
many-to-many approach, the decoder of our DDT outputs the human mesh of all the
frames, making DDT more viable for real-world applications where time
efficiency is crucial and a causal model is desired. Extensive experiments are
conducted on the widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW),
which demonstrate the effectiveness and efficiency of our DDT.