Adversarial Examples are Misaligned in Diffusion Model Manifolds
January 12, 2024
Peter Lorenz, Ricard Durall, Janis Keuper
In recent years, diffusion models (DMs) have drawn significant attention for
their success in approximating data distributions, yielding state-of-the-art
generative results. Nevertheless, the versatility of these models extends
beyond their generative capabilities to encompass various vision applications,
such as image inpainting, segmentation, and adversarial robustness.
This study is dedicated to the investigation of adversarial attacks through the
lens of diffusion models. However, our objective does not involve enhancing the
adversarial robustness of image classifiers. Instead, our focus lies in
utilizing the diffusion model to detect and analyze the anomalies introduced by
these attacks on images. To that end, we systematically examine the alignment
of the distributions of adversarial examples when subjected to the process of
transformation using diffusion models. The efficacy of this approach is
assessed across CIFAR-10 and ImageNet datasets, including varying image sizes
in the latter. The results demonstrate a notable capacity to discriminate
effectively between benign and attacked images, providing compelling evidence
that adversarial instances do not align with the learned manifold of the DMs.
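One simple way to operationalize this misalignment, sketched below under stated assumptions (a placeholder `denoise_fn` standing in for a pretrained DM's reverse process, a single noise level, and a hand-picked threshold; this is not the authors' exact pipeline), is to noise an image part-way along the forward process, denoise it back, and flag images whose reconstruction error is unusually large.

```python
import torch

def diffusion_reconstruction_score(x, denoise_fn, alpha_bar_t, n_trials=4):
    """Hypothetical detector statistic: distance between an image and its
    diffuse-then-denoise reconstruction. `denoise_fn(x_t)` is assumed to
    return an estimate of the clean image x_0; it is a placeholder here."""
    scores = []
    for _ in range(n_trials):
        eps = torch.randn_like(x)
        # Forward-noise the image to an intermediate time step t.
        x_t = alpha_bar_t.sqrt() * x + (1 - alpha_bar_t).sqrt() * eps
        x_rec = denoise_fn(x_t)                   # reverse process back to t = 0
        scores.append(((x - x_rec) ** 2).flatten(1).mean(dim=1))
    return torch.stack(scores).mean(dim=0)        # higher => more likely attacked

# Toy usage with an identity "denoiser" just to show the interface.
x = torch.rand(8, 3, 32, 32)
score = diffusion_reconstruction_score(x, lambda x_t: x_t, torch.tensor(0.5))
is_adversarial = score > 0.1                      # threshold calibrated on benign data
```

In practice the threshold would be calibrated on held-out benign images before being applied to suspect inputs.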
Faster Sampling without Isoperimetry via Diffusion-based Monte Carlo
January 12, 2024
Xunpeng Huang, Difan Zou, Hanze Dong, Yian Ma, Tong Zhang
stat.ML, cs.LG, math.OC, stat.CO
To sample from a general target distribution $p_*\propto e^{-f_*}$ beyond the
isoperimetric condition, Huang et al. (2023) proposed to perform sampling
through reverse diffusion, giving rise to Diffusion-based Monte Carlo (DMC).
Specifically, DMC follows the reverse SDE of a diffusion process that
transforms the target distribution to the standard Gaussian, utilizing a
non-parametric score estimation. However, the original DMC algorithm
encountered high gradient complexity, resulting in an exponential dependency on
the error tolerance $\epsilon$ of the obtained samples. In this paper, we
demonstrate that the high complexity of DMC originates from its redundant
design of score estimation, and propose a more efficient algorithm, called
RS-DMC, based on a novel recursive score estimation method. In particular, we
first divide the entire diffusion process into multiple segments and then
formulate the score estimation step (at any time step) as a series of
interconnected mean estimation and sampling subproblems accordingly, which are
correlated in a recursive manner. Importantly, we show that with a proper
design of the segment decomposition, all sampling subproblems will only need to
tackle a strongly log-concave distribution, which can be solved very efficiently using Langevin-based samplers with a provably rapid convergence rate.
As a result, we prove that the gradient complexity of RS-DMC only has a
quasi-polynomial dependency on $\epsilon$, significantly improving on the exponential gradient complexity of Huang et al. (2023). Furthermore, under
commonly used dissipative conditions, our algorithm is provably much faster
than the popular Langevin-based algorithms. Our algorithm design and
theoretical framework illuminate a novel direction for addressing sampling
problems, which could be of broader applicability in the community.
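For intuition, the inner subproblems produced by the segment decomposition are strongly log-concave sampling tasks, for which a plain unadjusted Langevin sampler converges rapidly. The sketch below is a generic ULA on a toy Gaussian target, not the RS-DMC recursion itself.

```python
import numpy as np

def ula_sample(grad_log_p, x0, step=0.05, n_steps=500, rng=None):
    """Unadjusted Langevin algorithm: x_{k+1} = x_k + h * grad log p(x_k) + sqrt(2h) * xi."""
    rng = rng or np.random.default_rng(0)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * grad_log_p(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
    return x

# Strongly log-concave example: p(x) proportional to exp(-0.5 * ||x - mu||^2).
mu = np.array([1.0, -2.0])
grad_log_p = lambda x: -(x - mu)          # gradient of the log-density
samples = np.stack([ula_sample(grad_log_p, np.zeros(2), rng=np.random.default_rng(i))
                    for i in range(200)])
print(samples.mean(axis=0))               # approaches mu after burn-in
```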
Demystifying Variational Diffusion Models
January 11, 2024
Fabio De Sousa Ribeiro, Ben Glocker
Despite the growing popularity of diffusion models, gaining a deep
understanding of the model class remains somewhat elusive for the uninitiated
in non-equilibrium statistical physics. With that in mind, we present what we
believe is a more straightforward introduction to diffusion models using
directed graphical modelling and variational Bayesian principles, which imposes
relatively fewer prerequisites on the average reader. Our exposition
constitutes a comprehensive technical review spanning from foundational
concepts like deep latent variable models to recent advances in continuous-time
diffusion-based modelling, highlighting theoretical connections between model
classes along the way. We provide additional mathematical insights that were
omitted in the seminal works whenever possible to aid in understanding, while
avoiding the introduction of new notation. We envision this article serving as
a useful educational supplement for both researchers and practitioners in the
area, and we welcome feedback and contributions from the community at
https://github.com/biomedia-mira/demystifying-diffusion.
FedTabDiff: Federated Learning of Diffusion Probabilistic Models for Synthetic Mixed-Type Tabular Data Generation
January 11, 2024
Timur Sattarov, Marco Schreyer, Damian Borth
Realistic synthetic tabular data generation encounters significant challenges
in preserving privacy, especially when dealing with sensitive information in
domains like finance and healthcare. In this paper, we introduce
\textit{Federated Tabular Diffusion} (FedTabDiff) for generating high-fidelity
mixed-type tabular data without centralized access to the original tabular
datasets. Leveraging the strengths of \textit{Denoising Diffusion Probabilistic
Models} (DDPMs), our approach addresses the inherent complexities in tabular
data, such as mixed attribute types and implicit relationships. More
critically, FedTabDiff realizes a decentralized learning scheme that permits
multiple entities to collaboratively train a generative model while respecting
data privacy and locality. We extend DDPMs into the federated setting for
tabular data generation, which includes a synchronous update scheme and
weighted averaging for effective model aggregation. Experimental evaluations on
real-world financial and medical datasets attest to the framework’s capability
to produce synthetic data that maintains high fidelity, utility, privacy, and
coverage.
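A minimal sketch of the kind of synchronous, size-weighted aggregation such a federated scheme typically uses (FedAvg-style averaging of client diffusion-model weights; the exact FedTabDiff update rule may differ):

```python
import torch

def weighted_average(state_dicts, client_sizes):
    """FedAvg-style aggregation: average parameters weighted by local dataset size."""
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    agg = {}
    for key in state_dicts[0]:
        agg[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return agg

# Toy usage with two tiny "client" models sharing the same architecture.
clients = [torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)]
global_state = weighted_average([m.state_dict() for m in clients], client_sizes=[1200, 300])
server_model = torch.nn.Linear(4, 4)
server_model.load_state_dict(global_state)    # broadcast back to clients next round
```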
Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation
January 09, 2024
Xiyi Chen, Marko Mihajlovic, Shaofei Wang, Sergey Prokudin, Siyu Tang
Recent advances in generative diffusion models have enabled the previously
unfeasible capability of generating 3D assets from a single input image or a
text prompt. In this work, we aim to enhance the quality and functionality of
these models for the task of creating controllable, photorealistic human
avatars. We achieve this by integrating a 3D morphable model into the
state-of-the-art multiview-consistent diffusion approach. We demonstrate that
accurate conditioning of a generative pipeline on the articulated 3D model
enhances the baseline model performance on the task of novel view synthesis
from a single image. More importantly, this integration facilitates a seamless
and accurate incorporation of facial expression and body pose control into the
generation process. To the best of our knowledge, our proposed framework is the
first diffusion model to enable the creation of fully 3D-consistent,
animatable, and photorealistic human avatars from a single image of an unseen
subject; extensive quantitative and qualitative evaluations demonstrate the
advantages of our approach over existing state-of-the-art avatar creation
models on both novel view and novel expression synthesis tasks.
EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
January 09, 2024
Jingyuan Yang, Jiawei Feng, Hui Huang
Recent years have witnessed remarkable progress in image generation, where users can create visually striking, high-quality images. However, existing text-to-image diffusion models are proficient at generating concrete concepts (e.g., dogs) but struggle with more abstract ones (e.g., emotions). Several efforts have been made to modify image emotions through color and style adjustments, but they face limitations in effectively conveying emotions when the image content is fixed. In this work, we introduce Emotional Image Content Generation (EICG), a new task to generate semantically clear and emotion-faithful images given
emotion categories. Specifically, we propose an emotion space and construct a
mapping network to align it with the powerful Contrastive Language-Image
Pre-training (CLIP) space, providing a concrete interpretation of abstract
emotions. Attribute loss and emotion confidence are further proposed to ensure
the semantic diversity and emotion fidelity of the generated images. Our method
outperforms the state-of-the-art text-to-image approaches both quantitatively
and qualitatively, where we derive three custom metrics, i.e., emotion
accuracy, semantic clarity and semantic diversity. In addition to generation,
our method can help emotion understanding and inspire emotional art design.
Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models
January 09, 2024
Xuewen Liu, Zhikai Li, Junrui Xiao, Qingyi Gu
Diffusion models have achieved great success in image generation tasks
through iterative noise estimation. However, the heavy denoising process and
complex neural networks hinder their low-latency applications in real-world
scenarios. Quantization can effectively reduce model complexity, and
post-training quantization (PTQ), which does not require fine-tuning, is highly
promising in accelerating the denoising process. Unfortunately, we find that
due to the highly dynamic distribution of activations in different denoising
steps, existing PTQ methods for diffusion models suffer from distribution
mismatch issues at both calibration sample level and reconstruction output
level, which makes the performance far from satisfactory, especially in low-bit
cases. In this paper, we propose Enhanced Distribution Alignment for
Post-Training Quantization of Diffusion Models (EDA-DM) to address the above
issues. Specifically, at the calibration sample level, we select calibration
samples based on the density and diversity in the latent space, thus
facilitating the alignment of their distribution with the overall samples; and
at the reconstruction output level, we propose Fine-grained Block
Reconstruction, which can align the outputs of the quantized model and the
full-precision model at different network granularity. Extensive experiments
demonstrate that EDA-DM outperforms the existing post-training quantization
frameworks in both unconditional and conditional generation scenarios. At
low-bit precision, the quantized models with our method even outperform the
full-precision models on most datasets.
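As a rough illustration of density- and diversity-aware calibration selection, one could cluster latent representations and keep one representative per cluster; the k-means heuristic below is an assumption for illustration, not EDA-DM's actual criterion.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_calibration_samples(latents, n_calib=64, seed=0):
    """Pick one representative latent per k-means cluster: clusters reflect density,
    one pick per cluster enforces diversity. A heuristic stand-in, not EDA-DM itself."""
    km = KMeans(n_clusters=n_calib, n_init=10, random_state=seed).fit(latents)
    chosen = []
    for c in range(n_calib):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(latents[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])   # member closest to the centroid
    return np.array(chosen)

latents = np.random.randn(4096, 128).astype(np.float32)   # e.g. pooled denoiser activations
calib_idx = select_calibration_samples(latents, n_calib=64)
```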
Stable generative modeling using diffusion maps
January 09, 2024
Georg Gottwald, Fengyi Li, Youssef Marzouk, Sebastian Reich
stat.ML, cs.LG, cs.NA, math.NA, stat.CO
We consider the problem of sampling from an unknown distribution for which
only a sufficiently large number of training samples are available. Such
settings have recently drawn considerable interest in the context of generative
modelling. In this paper, we propose a generative model combining diffusion
maps and Langevin dynamics. Diffusion maps are used to approximate the drift
term from the available training samples, which is then implemented in a
discrete-time Langevin sampler to generate new samples. By setting the kernel
bandwidth to match the time step size used in the unadjusted Langevin
algorithm, our method effectively circumvents any stability issues typically
associated with time-stepping stiff stochastic differential equations. More
precisely, we introduce a novel split-step scheme, ensuring that the generated
samples remain within the convex hull of the training samples. Our framework
can be naturally extended to generate conditional samples. We demonstrate the
performance of our proposed scheme through experiments on synthetic datasets
with increasing dimensions and on a stochastic subgrid-scale parametrization
conditional sampling problem.
D3PRefiner: A Diffusion-based Denoise Method for 3D Human Pose Refinement
January 08, 2024
Danqi Yan, Qing Gao, Yuepeng Qian, Xinxing Chen, Chenglong Fu, Yuquan Leng
Three-dimensional (3D) human pose estimation using a monocular camera has
gained increasing attention due to its ease of implementation and the abundance
of data available from daily life. However, owing to the inherent depth
ambiguity in images, the accuracy of existing monocular camera-based 3D pose
estimation methods remains unsatisfactory, and the estimated 3D poses usually
include much noise. By observing the histogram of this noise, we find each
dimension of the noise follows a certain distribution, which indicates the
possibility for a neural network to learn the mapping between noisy poses and
ground truth poses. In this work, in order to obtain more accurate 3D poses, a
Diffusion-based 3D Pose Refiner (D3PRefiner) is proposed to refine the output
of any existing 3D pose estimator. We first introduce a conditional
multivariate Gaussian distribution to model the distribution of noisy 3D poses,
using paired 2D poses and noisy 3D poses as conditions to achieve greater
accuracy. Additionally, we leverage the architecture of current diffusion
models to convert the distribution of noisy 3D poses into ground truth 3D
poses. To evaluate the effectiveness of the proposed method, two
state-of-the-art sequence-to-sequence 3D pose estimators are used as basic 3D
pose estimation models, and the proposed method is evaluated on different types
of 2D poses and different lengths of the input sequence. Experimental results
demonstrate the proposed architecture can significantly improve the performance
of current sequence-to-sequence 3D pose estimators, with a reduction of at
least 10.3% in the mean per joint position error (MPJPE) and at least 11.0% in
the Procrustes MPJPE (P-MPJPE).
Reflected Schrödinger Bridge for Constrained Generative Modeling
January 06, 2024
Wei Deng, Yu Chen, Nicole Tianjiao Yang, Hengrong Du, Qi Feng, Ricky T. Q. Chen
Diffusion models have become the go-to method for large-scale generative
models in real-world applications. These applications often involve data
distributions confined within bounded domains, typically requiring ad-hoc
thresholding techniques for boundary enforcement. Reflected diffusion models
(Lou et al., 2023) aim to enhance generalizability by generating the data distribution
through a backward process governed by reflected Brownian motion. However,
reflected diffusion models may not easily adapt to diverse domains without the
derivation of proper diffeomorphic mappings and do not guarantee optimal
transport properties. To overcome these limitations, we introduce the Reflected
Schrödinger Bridge algorithm: an entropy-regularized optimal transport approach
tailored for generating data within diverse bounded domains. We derive elegant
reflected forward-backward stochastic differential equations with Neumann and
Robin boundary conditions, extend divergence-based likelihood training to
bounded domains, and explore natural connections to entropic optimal transport
for the study of approximate linear convergence - a valuable insight for
practical training. Our algorithm yields robust generative modeling in diverse
domains, and its scalability is demonstrated in real-world constrained
generative modeling through standard image benchmarks.
MirrorDiffusion: Stabilizing Diffusion Process in Zero-shot Image Translation by Prompts Redescription and Beyond
January 06, 2024
Yupei Lin, Xiaoyu Xian, Yukai Shi, Liang Lin
Recently, text-to-image diffusion models have become a new paradigm in image processing, including content generation, image restoration and image-to-image translation. Given a target prompt, Denoising Diffusion Probabilistic Models (DDPMs) are able to generate realistic images that conform to the prompt.
With this appealing property, the image translation task has the potential to
be free from target image samples for supervision. By using a target text
prompt for domain adaptation, the diffusion model is able to perform zero-shot image-to-image translation advantageously. However, the sampling and inversion processes of DDPM are stochastic, and thus the inversion process often fails to reconstruct the input content. Specifically, a displacement effect gradually accumulates during the diffusion and inversion processes, which leads to the reconstructed results deviating from the source domain. To make
reconstruction explicit, we propose a prompt redescription strategy to realize
a mirror effect between the source and reconstructed image in the diffusion
model (MirrorDiffusion). More specifically, a prompt redescription mechanism is
investigated to align the text prompts with latent code at each time step of
the Denoising Diffusion Implicit Models (DDIM) inversion to pursue a
structure-preserving reconstruction. With the revised DDIM inversion,
MirrorDiffusion is able to realize accurate zero-shot image translation by
editing optimized text prompts and latent code. Extensive experiments
demonstrate that MirrorDiffusion achieves superior performance over the
state-of-the-art methods on zero-shot image translation benchmarks by clear
margins and practical model stability.
An Event-Oriented Diffusion-Refinement Method for Sparse Events Completion
January 06, 2024
Bo Zhang, Yuqi Han, Jinli Suo, Qionghai Dai
Event cameras or dynamic vision sensors (DVS) record asynchronous response to
brightness changes instead of conventional intensity frames, and feature
ultra-high sensitivity at low bandwidth. The new mechanism demonstrates great
advantages in challenging scenarios with fast motion and large dynamic range.
However, the recorded events might be highly sparse due to either limited
hardware bandwidth or extreme photon starvation in harsh environments. To
unlock the full potential of event cameras, we propose an inventive event
sequence completion approach conforming to the unique characteristics of event
data in both the processing stage and the output form. Specifically, we treat
event streams as 3D event clouds in the spatiotemporal domain, develop a
diffusion-based generative model to generate dense clouds in a coarse-to-fine
manner, and recover exact timestamps to maintain the temporal resolution of raw
data successfully. To validate the effectiveness of our method comprehensively,
we perform extensive experiments on three widely used public datasets with
different spatial resolutions, and additionally collect a novel event dataset
covering diverse scenarios with highly dynamic motions and under harsh
illumination. Besides generating high-quality dense events, our method can
benefit downstream applications such as object classification and intensity
frame reconstruction.
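To make the "events as 3D clouds" view concrete, a generic conversion (not the paper's full pipeline) maps each event (x, y, t, polarity) to a point in a normalized spatiotemporal cube:

```python
import numpy as np

def events_to_cloud(events, width, height):
    """events: array of rows (x, y, t_us, polarity). Returns points in [0, 1]^3 plus
    polarity, the kind of spatiotemporal cloud a point-cloud diffusion model can densify."""
    x, y, t, p = events[:, 0], events[:, 1], events[:, 2], events[:, 3]
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)
    cloud = np.stack([x / (width - 1), y / (height - 1), t_norm], axis=1)
    return cloud, p

events = np.array([[10, 20, 1_000, 1],
                   [11, 20, 1_250, -1],
                   [300, 150, 9_800, 1]], dtype=float)
cloud, polarity = events_to_cloud(events, width=346, height=260)
```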
Fair Sampling in Diffusion Models through Switching Mechanism
January 06, 2024
Yujin Choi, Jinseong Park, Hoki Kim, Jaewook Lee, Saeroom Park
Diffusion models have shown their effectiveness in generation tasks by
well-approximating the underlying probability distribution. However, diffusion models are known to amplify the inherent bias of the training data with respect to fairness. While the sampling process of diffusion models can
be controlled by conditional guidance, previous works have attempted to find
empirical guidance to achieve quantitative fairness. To address this
limitation, we propose a fairness-aware sampling method called
\textit{attribute switching} mechanism for diffusion models. Without additional
training, the proposed sampling can obfuscate sensitive attributes in generated
data without relying on classifiers. We mathematically prove and experimentally
demonstrate the effectiveness of the proposed method on two key aspects: (i)
the generation of fair data and (ii) the preservation of the utility of the
generated data.
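A schematic sketch of what a switching-style sampler can look like is given below; the `denoise_step` callable, the schedule, and the choice of switching point are placeholders, since the paper specifies its own criteria for these.

```python
import torch

@torch.no_grad()
def sample_with_attribute_switch(denoise_step, x_T, attr_a, attr_b, t_switch, T=1000):
    """Run the usual reverse loop, but condition on attr_a during the early (high-noise)
    steps and switch to attr_b afterwards, decoupling the final sensitive attribute from
    the semantics fixed early on. `denoise_step(x, t, attr)` is a stub here."""
    x = x_T
    for t in reversed(range(T)):
        attr = attr_a if t >= t_switch else attr_b
        x = denoise_step(x, t, attr)
    return x

# Toy usage with a no-op denoiser, just to show the control flow.
x_T = torch.randn(4, 3, 32, 32)
dummy_step = lambda x, t, attr: x * 0.999
x_0 = sample_with_attribute_switch(dummy_step, x_T, attr_a=0, attr_b=1, t_switch=600)
```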
SAR Despeckling via Regional Denoising Diffusion Probabilistic Model
January 06, 2024
Xuran Hu, Ziqiang Xu, Zhihan Chen, Zhengpeng Feng, Mingzhe Zhu, Ljubisa Stankovic
Speckle noise poses a significant challenge in maintaining the quality of
synthetic aperture radar (SAR) images, so SAR despeckling techniques have drawn
increasing attention. Despite the tremendous advancements of deep learning in
fixed-scale SAR image despeckling, these methods still struggle to deal with
large-scale SAR images. To address this problem, this paper introduces a novel
despeckling approach termed Region Denoising Diffusion Probabilistic Model
(R-DDPM) based on generative models. R-DDPM enables versatile despeckling of
SAR images across various scales, accomplished within a single training
session. Moreover, artifacts in the fused SAR images can be effectively avoided with the utilization of region-guided inverse sampling. Experiments with our proposed R-DDPM on Sentinel-1 data demonstrate superior performance over existing methods.
Latte: Latent Diffusion Transformer for Video Generation
January 05, 2024
Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao
We propose a novel Latent Diffusion Transformer, namely Latte, for video
generation. Latte first extracts spatio-temporal tokens from input videos and
then adopts a series of Transformer blocks to model video distribution in the
latent space. In order to model a substantial number of tokens extracted from
videos, four efficient variants are introduced from the perspective of
decomposing the spatial and temporal dimensions of input videos. To improve the
quality of generated videos, we determine the best practices of Latte through
rigorous experimental analysis, including video clip patch embedding, model
variants, timestep-class information injection, temporal positional embedding,
and learning strategies. Our comprehensive evaluation demonstrates that Latte
achieves state-of-the-art performance across four standard video generation
datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In
addition, we extend Latte to text-to-video generation (T2V) task, where Latte
achieves results comparable to those of recent T2V models. We strongly believe
that Latte provides valuable insights for future research on incorporating
Transformers into diffusion models for video generation.
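One of the factorized variants described above amounts to alternating attention over spatial tokens within each frame and temporal tokens across frames. The torch sketch below illustrates only that reshaping pattern with hypothetical shapes; it is not Latte's actual block design.

```python
import torch
import torch.nn as nn

B, T, N, D = 2, 16, 256, 384        # batch, frames, spatial tokens per frame, embed dim
tokens = torch.randn(B, T, N, D)    # latent video patches after patch embedding

spatial_attn = nn.MultiheadAttention(D, num_heads=6, batch_first=True)
temporal_attn = nn.MultiheadAttention(D, num_heads=6, batch_first=True)

# Spatial attention: fold time into the batch, attend over the N tokens of each frame.
x = tokens.reshape(B * T, N, D)
x, _ = spatial_attn(x, x, x)
x = x.reshape(B, T, N, D)

# Temporal attention: fold space into the batch, attend over the T steps of each token.
x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
x, _ = temporal_attn(x, x, x)
x = x.reshape(B, N, T, D).permute(0, 2, 1, 3)   # back to (B, T, N, D)
```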
The Rise of Diffusion Models in Time-Series Forecasting
January 05, 2024
Caspar Meijer, Lydia Y. Chen
This survey delves into the application of diffusion models in time-series
forecasting. Diffusion models are demonstrating state-of-the-art results in
various fields of generative AI. The paper includes comprehensive background
information on diffusion models, detailing their conditioning methods and
reviewing their use in time-series forecasting. The analysis covers 11 specific
time-series implementations, the intuition and theory behind them, the
effectiveness on different datasets, and a comparison among each other. Key
contributions of this work are the thorough exploration of diffusion models’
applications in time-series forecasting and a chronologically ordered overview
of these models. Additionally, the paper offers an insightful discussion on the
current state-of-the-art in this domain and outlines potential future research
directions. This serves as a valuable resource for researchers in AI and
time-series analysis, offering a clear view of the latest advancements and
future potential of diffusion models.
Diffusion Variational Inference: Diffusion Models as Expressive Variational Posteriors
January 05, 2024
Top Piriyakulkij, Yingheng Wang, Volodymyr Kuleshov
We propose denoising diffusion variational inference (DDVI), an approximate
inference algorithm for latent variable models which relies on diffusion models
as expressive variational posteriors. Our method augments variational
posteriors with auxiliary latents, which yields an expressive class of models
that perform diffusion in latent space by reversing a user-specified noising
process. We fit these models by optimizing a novel lower bound on the marginal
likelihood inspired by the wake-sleep algorithm. Our method is easy to
implement (it fits a regularized extension of the ELBO), is compatible with
black-box variational inference, and outperforms alternative classes of
approximate posteriors based on normalizing flows or adversarial networks. When
applied to deep latent variable models, our method yields the denoising
diffusion VAE (DD-VAE) algorithm. We use this algorithm on a motivating task in
biology – inferring latent ancestry from human genomes – outperforming strong
baselines on the Thousand Genomes dataset.
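For orientation, the standard lower bound for variational posteriors augmented with auxiliary latents takes the form (DDVI's wake-sleep-inspired bound may differ in its details)

$$\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z, y \mid x)}\big[\log p_\theta(x, z) + \log r_\xi(y \mid x, z) - \log q_\phi(z, y \mid x)\big],$$

where $y$ denotes the auxiliary latents (here, the intermediate states of the latent-space diffusion) and $r_\xi$ is an auxiliary reverse model; the bound reduces to the usual ELBO when $y$ is marginalized out exactly.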
Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss
January 05, 2024
Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, Patrick Von Platen
Stable Diffusion XL (SDXL) has become the best open-source text-to-image (T2I) model owing to its versatility and top-notch image quality. Efficiently
addressing the computational demands of SDXL models is crucial for wider reach
and applicability. In this work, we introduce two scaled-down variants, Segmind
Stable Diffusion (SSD-1B) and Segmind-Vega, with 1.3B and 0.74B parameter
UNets, respectively, achieved through progressive removal using layer-level
losses focusing on reducing the model size while preserving generative quality.
We release the weights of these models at https://hf.co/Segmind. Our methodology involves the elimination of residual networks and transformer blocks from the U-Net structure of SDXL, resulting in significant reductions in parameters and latency. Our compact models effectively emulate the original SDXL by
capitalizing on transferred knowledge, achieving competitive results against
larger multi-billion parameter SDXL. Our work underscores the efficacy of
knowledge distillation coupled with layer-level losses in reducing model size
while preserving the high-quality generative capabilities of SDXL, thus
facilitating more accessible deployment in resource-constrained environments.
Bring Metric Functions into Diffusion Models
January 04, 2024
Jie An, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Zicheng Liu, Lijuan Wang, Jiebo Luo
We introduce a Cascaded Diffusion Model (Cas-DM) that improves a Denoising
Diffusion Probabilistic Model (DDPM) by effectively incorporating additional
metric functions in training. Metric functions such as the LPIPS loss have proven highly effective in consistency models derived from score matching.
However, for the diffusion counterparts, the methodology and efficacy of adding
extra metric functions remain unclear. One major challenge is the mismatch
between the noise predicted by a DDPM at each step and the desired clean image
that the metric function works well on. To address this problem, we propose
Cas-DM, a network architecture that cascades two network modules to effectively
apply metric functions to the diffusion model training. The first module,
similar to a standard DDPM, learns to predict the added noise and is unaffected
by the metric function. The second cascaded module learns to predict the clean
image, thereby facilitating the metric function computation. Experiment results
show that the proposed diffusion model backbone enables the effective use of
the LPIPS loss, leading to state-of-the-art image quality (FID, sFID, IS) on
various established benchmarks.
Energy based diffusion generator for efficient sampling of Boltzmann distributions
January 04, 2024
Yan Wang, Ling Guo, Hao Wu, Tao Zhou
We introduce a novel sampler called the energy based diffusion generator for
generating samples from arbitrary target distributions. The sampling model
employs a structure similar to a variational autoencoder, utilizing a decoder
to transform latent variables from a simple distribution into random variables
approximating the target distribution, and we design an encoder based on the
diffusion model. Leveraging the powerful modeling capacity of the diffusion
model for complex distributions, we can obtain an accurate variational estimate
of the Kullback-Leibler divergence between the distributions of the generated
samples and the target. Moreover, we propose a decoder based on generalized
Hamiltonian dynamics to further enhance sampling performance. Through empirical
evaluation, we demonstrate the effectiveness of our method across various
complex distribution functions, showcasing its superiority compared to existing
methods.
Improving Diffusion-Based Image Synthesis with Context Prediction
January 04, 2024
Ling Yang, Jingwei Liu, Shenda Hong, Zhilong Zhang, Zhilin Huang, Zheming Cai, Wentao Zhang, Bin Cui
Diffusion models are a new class of generative models that have dramatically advanced image generation, delivering unprecedented quality and diversity. Existing diffusion models mainly try to reconstruct the input image from a corrupted one
with a pixel-wise or feature-wise constraint along spatial axes. However, such
point-based reconstruction may fail to make each predicted pixel/feature fully
preserve its neighborhood context, impairing diffusion-based image synthesis.
As a powerful source of automatic supervisory signal, context has been well
studied for learning representations. Inspired by this, we for the first time
propose ConPreDiff to improve diffusion-based image synthesis with context
prediction. We explicitly reinforce each point to predict its neighborhood
context (i.e., multi-stride features/tokens/pixels) with a context decoder at the end of the diffusion denoising blocks during the training stage, and remove the decoder
for inference. In this way, each point can better reconstruct itself by
preserving its semantic connections with neighborhood context. This new
paradigm of ConPreDiff can generalize to arbitrary discrete and continuous
diffusion backbones without introducing extra parameters in sampling procedure.
Extensive experiments are conducted on unconditional image generation,
text-to-image generation and image inpainting tasks. Our ConPreDiff
consistently outperforms previous methods and achieves new SOTA text-to-image generation results on MS-COCO, with a zero-shot FID score of 6.21.
CoMoSVC: Consistency Model-based Singing Voice Conversion
January 03, 2024
Yiwen Lu, Zhen Ye, Wei Xue, Xu Tan, Qifeng Liu, Yike Guo
eess.AS, cs.AI, cs.LG, cs.SD
Diffusion-based Singing Voice Conversion (SVC) methods have achieved remarkable performance, producing natural audio with high similarity to the target timbre. However, the iterative sampling process results in slow
inference speed, and acceleration thus becomes crucial. In this paper, we
propose CoMoSVC, a consistency model-based SVC method, which aims to achieve
both high-quality generation and high-speed sampling. A diffusion-based teacher
model is first specially designed for SVC, and a student model is further
distilled under self-consistency properties to achieve one-step sampling.
Experiments on a single NVIDIA RTX 4090 GPU reveal that CoMoSVC achieves significantly faster inference than the state-of-the-art (SOTA) diffusion-based SVC system while delivering comparable or superior conversion performance on both subjective and objective metrics. Audio samples and
codes are available at https://comosvc.github.io/.
DiffYOLO: Object Detection for Anti-Noise via YOLO and Diffusion Models
January 03, 2024
Yichen Liu, Huajian Zhang, Daqing Gao
Object detection models such as the YOLO series have been widely used and achieve great results on high-quality datasets, but not all working conditions are ideal. To address the problem of locating targets in low-quality data, existing methods either train a new object detection network or require a large collection of low-quality datasets for training. In this paper, we instead propose a framework, called DiffYOLO, that can be applied to YOLO models. Specifically, we extract feature maps from denoising diffusion probabilistic models to enhance well-trained detectors, which allows us to fine-tune YOLO on high-quality datasets and test it on low-quality datasets. The results show that this framework not only improves performance on noisy datasets but also improves detection results on high-quality test datasets. We will supplement more experiments later (with various datasets and network architectures).
DDPM based X-ray Image Synthesizer
January 03, 2024
Praveen Mahaulpatha, Thulana Abeywardane, Tomson George
Access to high-quality datasets in the medical industry limits machine
learning model performance. To address this issue, we propose a Denoising
Diffusion Probabilistic Model (DDPM) combined with a UNet architecture for
X-ray image synthesis. Focusing on the medical condition of pneumonia, our methodology
employs over 3000 pneumonia X-ray images obtained from Kaggle for training.
Results demonstrate the effectiveness of our approach, as the model
successfully generated realistic images with low Mean Squared Error (MSE). The
synthesized images showed distinct differences from non-pneumonia images,
highlighting the model’s ability to capture key features of positive cases.
Beyond pneumonia, the applications of this synthesizer extend to various
medical conditions, provided an ample dataset is available. The capability to
produce high-quality images can potentially enhance machine learning models’
performance, aiding in more accurate and efficient medical diagnoses. This
innovative DDPM-based X-ray photo synthesizer presents a promising avenue for
addressing the scarcity of positive medical image datasets, paving the way for
improved medical image analysis and diagnosis in the healthcare industry.
S2-DMs: Skip-Step Diffusion Models
January 03, 2024
Yixuan Wang, Shuangyin Li
Diffusion models have emerged as powerful generative tools, rivaling GANs in
sample quality and mirroring the likelihood scores of autoregressive models. A
subset of these models, exemplified by DDIMs, exhibit an inherent asymmetry:
they are trained over $T$ steps but only sample from a subset of $T$ during
generation. This selective sampling approach, though optimized for speed,
inadvertently misses out on vital information from the unsampled steps, leading
to potential compromises in sample quality. To address this issue, we present S$^{2}$-DMs, a new training method built around an innovative $L_{skip}$ loss, meticulously designed to reintegrate the information omitted during
the selective sampling phase. The benefits of this approach are manifold: it
notably enhances sample quality, is exceptionally simple to implement, requires
minimal code modifications, and is flexible enough to be compatible with
various sampling algorithms. On the CIFAR10 dataset, models trained using our
algorithm showed an improvement of 3.27% to 14.06% over models trained with
traditional methods across various sampling algorithms (DDIMs, PNDMs, DEIS) and
different numbers of sampling steps (10, 20, …, 1000). On the CELEBA dataset,
the improvement ranged from 8.97% to 27.08%. Access to the code and additional resources is provided on GitHub.
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
January 02, 2024
Jinlong Xue, Yayue Deng, Yingming Gao, Ya Li
cs.SD, cs.AI, cs.CL, eess.AS
Recent advancements in diffusion models and large language models (LLMs) have
significantly propelled the field of AIGC. Text-to-Audio (TTA), a burgeoning
AIGC application designed to generate audio from natural language prompts, is
attracting increasing attention. However, existing TTA studies often struggle
with generation quality and text-audio alignment, especially for complex
textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I)
diffusion models, we introduce Auffusion, a TTA system adapting T2I model
frameworks to TTA task, by effectively leveraging their inherent generative
strengths and precise cross-modal alignment. Our objective and subjective
evaluations demonstrate that Auffusion surpasses previous TTA approaches while using limited data and computational resources. Furthermore, previous studies in T2I recognize the significant impact of encoder choice on cross-modal alignment, such as fine-grained details and object binding, whereas similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments of text-audio alignment in TTA. Our findings reveal Auffusion’s superior capability in generating audio that accurately matches textual descriptions, which is further demonstrated in several related tasks, such as
audio style transfer, inpainting and other manipulations. Our implementation
and demos are available at https://auffusion.github.io.
DiffAugment: Diffusion based Long-Tailed Visual Relationship Recognition
January 01, 2024
Parul Gupta, Tuan Nguyen, Abhinav Dhall, Munawar Hayat, Trung Le, Thanh-Toan Do
The task of Visual Relationship Recognition (VRR) aims to identify
relationships between two interacting objects in an image and is particularly
challenging due to the widely-spread and highly imbalanced distribution of
<subject, relation, object> triplets. To overcome the resultant performance
bias in existing VRR approaches, we introduce DiffAugment – a method which
first augments the tail classes in the linguistic space by making use of
WordNet and then utilizes the generative prowess of Diffusion Models to expand
the visual space for minority classes. We propose a novel hardness-aware
component in diffusion which is based upon the hardness of each <S,R,O> triplet
and demonstrate the effectiveness of hardness-aware diffusion in generating
visual embeddings for the tail classes. We also propose a novel subject and
object based seeding strategy for diffusion sampling which improves the
discriminative capability of the generated visual embeddings. Extensive
experimentation on the GQA-LT dataset shows favorable gains in the
subject/object and relation average per-class accuracy using Diffusion
augmented samples.
DiffMorph: Text-less Image Morphing with Diffusion Models
January 01, 2024
Shounak Chatterjee
Text-conditioned image generation models are a prevalent form of AI image synthesis, yet intuitively controlling the output under an artist's guidance remains
challenging. Current methods require multiple images and textual prompts for
each object to specify them as concepts to generate a single customized image.
On the other hand, our work, \verb|DiffMorph|, introduces a novel approach
that synthesizes images that mix concepts without the use of textual prompts.
Our work integrates a sketch-to-image module to incorporate user sketches as
input. \verb|DiffMorph| takes an initial image with conditioning artist-drawn
sketches to generate a morphed image.
We employ a pre-trained text-to-image diffusion model and fine-tune it to
reconstruct each image faithfully. We seamlessly merge images and concepts from
sketches into a cohesive composition. The image generation capability of our
work is demonstrated through our results and a comparison of these with
prompt-based image generation.
Diffusion Models, Image Super-Resolution And Everything: A Survey
January 01, 2024
Brian B. Moser, Arundhati S. Shanbhag, Federico Raue, Stanislav Frolov, Sebastian Palacio, Andreas Dengel
cs.CV, cs.AI, cs.LG, cs.MM
Diffusion Models (DMs) represent a significant advancement in image
Super-Resolution (SR), aligning technical image quality more closely with human
preferences and expanding SR applications. DMs address critical limitations of
previous methods, enhancing overall realism and details in SR images. However,
DMs suffer from color-shifting issues, and their high computational costs call
for efficient sampling alternatives, underscoring the challenge of balancing
computational efficiency and image quality. This survey gives an overview of
DMs applied to image SR and offers a detailed analysis that underscores the
unique characteristics and methodologies within this domain, distinct from
broader existing reviews in the field. It presents a unified view of DM
fundamentals and explores research directions, including alternative input
domains, conditioning strategies, guidance, corruption spaces, and zero-shot
methods. This survey provides insights into the evolution of image SR with DMs,
addressing current trends, challenges, and future directions in this rapidly
evolving field.
SteinDreamer: Variance Reduction for Text-to-3D Score Distillation via Stein Identity
December 31, 2023
Peihao Wang, Zhiwen Fan, Dejia Xu, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra
Score distillation has emerged as one of the most prevalent approaches for
text-to-3D asset synthesis. Essentially, score distillation updates 3D
parameters by lifting and back-propagating scores averaged over different
views. In this paper, we reveal that the gradient estimation in score distillation is inherently subject to high variance. Through the lens of variance
reduction, the effectiveness of SDS and VSD can be interpreted as applications
of various control variates to the Monte Carlo estimator of the distilled
score. Motivated by this rethinking and based on Stein’s identity, we propose a
more general solution to reduce variance for score distillation, termed Stein
Score Distillation (SSD). SSD incorporates control variates constructed by
Stein identity, allowing for arbitrary baseline functions. This enables us to
include flexible guidance priors and network architectures to explicitly
optimize for variance reduction. In our experiments, the overall pipeline,
dubbed SteinDreamer, is implemented by instantiating the control variate with a
monocular depth estimator. The results suggest that SSD can effectively reduce
the distillation variance and consistently improve visual quality for both
object- and scene-level generation. Moreover, we demonstrate that SteinDreamer
achieves faster convergence than existing methods due to more stable gradient
updates.
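As background for the control-variate view (a textbook identity, not SSD's specific construction): for an unbiased estimator $g$ and a control variate $h$ with known mean,

$$\tilde g = g - c\big(h - \mathbb{E}[h]\big), \qquad \operatorname{Var}(\tilde g)\big|_{c^{*}} = \operatorname{Var}(g) - \frac{\operatorname{Cov}(g,h)^2}{\operatorname{Var}(h)}, \quad c^{*} = \frac{\operatorname{Cov}(g,h)}{\operatorname{Var}(h)},$$

so subtracting any baseline with known (or, via Stein's identity, provably zero) expectation leaves the estimator unbiased while reducing its variance.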
Taming Mode Collapse in Score Distillation for Text-to-3D Generation
December 31, 2023
Peihao Wang, Dejia Xu, Zhiwen Fan, Dilin Wang, Sreyas Mohan, Forrest Iandola, Rakesh Ranjan, Yilei Li, Qiang Liu, Zhangyang Wang, Vikas Chandra
Despite the remarkable performance of score distillation in text-to-3D
generation, such techniques notoriously suffer from view inconsistency issues,
also known as the “Janus” artifact, where the generated object presents a front face in every view. Although empirically effective methods have approached
this problem via score debiasing or prompt engineering, a more rigorous
perspective to explain and tackle this problem remains elusive. In this paper,
we reveal that the existing score distillation-based text-to-3D generation
frameworks degenerate to maximum likelihood seeking on each view independently and thus suffer from the mode collapse problem, manifesting as the Janus artifact in practice. To tame mode collapse, we improve score distillation by re-establishing the entropy term in the corresponding variational objective,
which is applied to the distribution of rendered images. Maximizing the entropy
encourages diversity among different views in generated 3D assets, thereby
mitigating the Janus problem. Based on this new objective, we derive a new
update rule for 3D score distillation, dubbed Entropic Score Distillation
(ESD). We theoretically reveal that ESD can be simplified and implemented by
just adopting the classifier-free guidance trick upon variational score
distillation. Although embarrassingly straightforward, our extensive
experiments successfully demonstrate that ESD can be an effective treatment for
Janus artifacts in score distillation.
Probing the Limits and Capabilities of Diffusion Models for the Anatomic Editing of Digital Twins
December 30, 2023
Karim Kadry, Shreya Gupta, Farhad R. Nezami, Elazer R. Edelman
Numerical simulations can model the physical processes that govern
cardiovascular device deployment. When such simulations incorporate digital twins, computational models of patient-specific anatomy, they can expedite and
de-risk the device design process. Nonetheless, the exclusive use of
patient-specific data constrains the anatomic variability which can be
precisely or fully explored. In this study, we investigate the capacity of
Latent Diffusion Models (LDMs) to edit digital twins to create anatomic
variants, which we term digital siblings. Digital twins and their corresponding
siblings can serve as the basis for comparative simulations, enabling the study
of how subtle anatomic variations impact the simulated deployment of
cardiovascular devices, as well as the augmentation of virtual cohorts for
device assessment. However, while diffusion models have been characterized in
their ability to edit natural images, their capacity to anatomically edit
digital twins has yet to be studied. Using a case example centered on 3D
digital twins of cardiac anatomy, we implement various methods for generating
digital siblings and characterize them through morphological and topological
analyses. We specifically edit digital twins to introduce anatomic variation at
different spatial scales and within localized regions, demonstrating the
existence of bias towards common anatomic features. We further show that such
anatomic bias can be leveraged for virtual cohort augmentation through
selective editing, partially alleviating issues related to dataset imbalance
and lack of diversity. Our experimental framework thus delineates the limits
and capabilities of using latent diffusion models in synthesizing anatomic
variation for in silico trials.
Diffusion Model with Perceptual Loss
December 30, 2023
Shanchuan Lin, Xiao Yang
Diffusion models trained with mean squared error loss tend to generate
unrealistic samples. Current state-of-the-art models rely on classifier-free
guidance to improve sample quality, yet its surprising effectiveness is not
fully understood. In this paper, we show that the effectiveness of
classifier-free guidance partly originates from it being a form of implicit
perceptual guidance. As a result, we can directly incorporate perceptual loss
in diffusion training to improve sample quality. Since the score matching
objective used in diffusion training strongly resembles the denoising
autoencoder objective used in unsupervised training of perceptual networks, the
diffusion model itself is a perceptual network and can be used to generate
meaningful perceptual loss. We propose a novel self-perceptual objective that
results in diffusion models capable of generating more realistic samples. For
conditional generation, our method only improves sample quality without
entanglement with the conditional input and therefore does not sacrifice sample
diversity. Our method can also improve sample quality for unconditional
generation, which was not possible with classifier-free guidance before.
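The sketch below shows the general flavour of using a frozen copy of the denoising network itself as a perceptual feature extractor; the tiny stand-in network, the choice of feature layer, and the plain feature MSE are assumptions for illustration, not the paper's exact self-perceptual objective.

```python
import copy
import torch
import torch.nn as nn

denoiser = nn.Sequential(                 # stand-in for a diffusion U-Net
    nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
feature_net = copy.deepcopy(denoiser[:2]).eval()   # frozen early layers as "perceptual" features
for p in feature_net.parameters():
    p.requires_grad_(False)

def self_perceptual_loss(x0_pred, x0_true):
    """Compare predictions and targets in the frozen denoiser's feature space
    instead of (or in addition to) raw pixel MSE."""
    return torch.mean((feature_net(x0_pred) - feature_net(x0_true)) ** 2)

x0_true = torch.rand(4, 3, 32, 32)
x0_pred = denoiser(x0_true + 0.1 * torch.randn_like(x0_true))
loss = self_perceptual_loss(x0_pred, x0_true)
loss.backward()                            # gradients reach only the trainable denoiser
```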
iFusion: Inverting Diffusion for Pose-Free Reconstruction from Sparse Views
December 28, 2023
Chin-Hsuan Wu, Yen-Chun Chen, Bolivar Solarte, Lu Yuan, Min Sun
We present iFusion, a novel 3D object reconstruction framework that requires
only two views with unknown camera poses. While single-view reconstruction
yields visually appealing results, it can deviate significantly from the actual
object, especially on unseen sides. Additional views improve reconstruction
fidelity but necessitate known camera poses. However, assuming the availability
of pose may be unrealistic, and existing pose estimators fail in sparse view
scenarios. To address this, we harness a pre-trained novel view synthesis
diffusion model, which embeds implicit knowledge about the geometry and
appearance of diverse objects. Our strategy unfolds in three steps: (1) We
invert the diffusion model for camera pose estimation instead of synthesizing
novel views. (2) The diffusion model is fine-tuned using provided views and
estimated poses, turned into a novel view synthesizer tailored for the target
object. (3) Leveraging registered views and the fine-tuned diffusion model, we
reconstruct the 3D object. Experiments demonstrate strong performance in both
pose estimation and novel view synthesis. Moreover, iFusion seamlessly
integrates with various reconstruction methods and enhances them.
DreamGaussian4D: Generative 4D Gaussian Splatting
December 28, 2023
Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, Ziwei Liu
Remarkable progress has been made in 4D content generation recently. However,
existing methods suffer from long optimization time, lack of motion
controllability, and a low level of detail. In this paper, we introduce
DreamGaussian4D, an efficient 4D generation framework that builds on 4D
Gaussian Splatting representation. Our key insight is that the explicit
modeling of spatial transformations in Gaussian Splatting makes it more
suitable for the 4D generation setting compared with implicit representations.
DreamGaussian4D reduces the optimization time from several hours to just a few
minutes, allows flexible control of the generated 3D motion, and produces
animated meshes that can be efficiently rendered in 3D engines.
PolyDiff: Generating 3D Polygonal Meshes with Diffusion Models
December 18, 2023
Antonio Alliegro, Yawar Siddiqui, Tatiana Tommasi, Matthias Nießner
We introduce PolyDiff, the first diffusion-based approach capable of directly
generating realistic and diverse 3D polygonal meshes. In contrast to methods
that use alternate 3D shape representations (e.g. implicit representations),
our approach is a discrete denoising diffusion probabilistic model that
operates natively on the polygonal mesh data structure. This enables learning
of both the geometric properties of vertices and the topological
characteristics of faces. Specifically, we treat meshes as quantized triangle
soups, progressively corrupted with categorical noise in the forward diffusion
phase. In the reverse diffusion phase, a transformer-based denoising network is
trained to revert the noising process, restoring the original mesh structure.
At inference, new meshes can be generated by applying this denoising network
iteratively, starting with a completely noisy triangle soup. Consequently, our
model is capable of producing high-quality 3D polygonal meshes, ready for
integration into downstream 3D workflows. Our extensive experimental analysis
shows that PolyDiff achieves a significant advantage (avg. FID and JSD
improvement of 18.2 and 5.8 respectively) over current state-of-the-art
methods.
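A small numpy sketch of the "quantized triangle soup" idea, with an illustrative bin count and token layout (the paper's exact tokenization may differ):

```python
import numpy as np

def mesh_to_quantized_soup(vertices, faces, n_bins=128):
    """Turn a triangle mesh into a flat sequence of discrete tokens: each face
    contributes its three vertices, each coordinate quantized to one of n_bins."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    norm = (vertices - lo) / np.maximum(hi - lo, 1e-8)          # into [0, 1]^3
    quant = np.clip((norm * n_bins).astype(np.int64), 0, n_bins - 1)
    soup = quant[faces]                                         # (n_faces, 3 vertices, 3 coords)
    return soup.reshape(len(faces), -1)                         # (n_faces, 9) token rows

# Toy mesh: two triangles covering a unit square.
verts = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])
tokens = mesh_to_quantized_soup(verts, faces)    # categorical data the diffusion corrupts
```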
Adv-Diffusion: Imperceptible Adversarial Face Identity Attack via Latent Diffusion Model
December 18, 2023
Decheng Liu, Xijun Wang, Chunlei Peng, Nannan Wang, Ruiming Hu, Xinbo Gao
Adversarial attacks involve adding perturbations to the source image to cause
misclassification by the target model, which demonstrates the potential of
attacking face recognition models. Existing adversarial face image generation
methods still can’t achieve satisfactory performance because of low
transferability and high detectability. In this paper, we propose a unified
framework Adv-Diffusion that can generate imperceptible adversarial identity
perturbations in the latent space but not the raw pixel space, which utilizes
strong inpainting capabilities of the latent diffusion model to generate
realistic adversarial images. Specifically, we propose the identity-sensitive
conditioned diffusion generative model to generate semantic perturbations in
the surroundings. The designed adaptive strength-based adversarial perturbation
algorithm can ensure both attack transferability and stealthiness. Extensive
qualitative and quantitative experiments on the public FFHQ and CelebA-HQ
datasets prove the proposed method achieves superior performance compared with
the state-of-the-art methods without an extra generative model training
process. The source code is available at
https://github.com/kopper-xdu/Adv-Diffusion.
DataElixir: Purifying Poisoned Dataset to Mitigate Backdoor Attacks via Diffusion Models
December 18, 2023
Jiachen Zhou, Peizhuo Lv, Yibing Lan, Guozhu Meng, Kai Chen, Hualong Ma
Dataset sanitization is a widely adopted proactive defense against
poisoning-based backdoor attacks, aimed at filtering out and removing poisoned
samples from training datasets. However, existing methods have shown limited
efficacy in countering ever-evolving trigger functions and often lead to considerable degradation of benign accuracy. In this paper, we propose
DataElixir, a novel sanitization approach tailored to purify poisoned datasets.
We leverage diffusion models to eliminate trigger features and restore benign
features, thereby turning the poisoned samples into benign ones. Specifically,
with multiple iterations of the forward and reverse process, we extract
intermediary images and their predicted labels for each sample in the original
dataset. Then, we identify anomalous samples in terms of the presence of label
transition of the intermediary images, detect the target label by quantifying
distribution discrepancy, select their purified images considering pixel and
feature distance, and determine their ground-truth labels by training a benign
model. Experiments conducted on 9 popular attacks demonstrate that DataElixir
effectively mitigates various complex attacks while exerting minimal impact on
benign accuracy, surpassing the performance of baseline defense methods.
Realistic Human Motion Generation with Cross-Diffusion Models
December 18, 2023
Zeping Ren, Shaoli Huang, Xiu Li
We introduce the Cross Human Motion Diffusion Model (CrossDiff), a novel
approach for generating high-quality human motion based on textual
descriptions. Our method integrates 3D and 2D information using a shared
transformer network within the training of the diffusion model, unifying motion
noise into a single feature space. This enables cross-decoding of features into
both 3D and 2D motion representations, regardless of their original dimension.
The primary advantage of CrossDiff is its cross-diffusion mechanism, which
allows the model to reverse either 2D or 3D noise into clean motion during
training. This capability leverages the complementary information in both
motion representations, capturing intricate human movement details often missed
by models relying solely on 3D information. Consequently, CrossDiff effectively
combines the strengths of both representations to generate more realistic
motion sequences. In our experiments, our model demonstrates competitive
state-of-the-art performance on text-to-motion benchmarks. Moreover, our method
consistently provides enhanced motion generation quality, capturing complex
full-body movement intricacies. Additionally, with a pretrained model, our approach can use in-the-wild 2D motion data without 3D motion ground truth during training to generate 3D motion, highlighting its potential for
broader applications and efficient use of available data resources. Project
page: https://wonderno.github.io/CrossDiff-webpage/.
Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models
December 17, 2023
Nikita Starodubcev, Artem Fedorov, Artem Babenko, Dmitry Baranchuk
Knowledge distillation methods have recently shown to be a promising
direction to speedup the synthesis of large-scale diffusion models by requiring
only a few inference steps. While several powerful distillation methods were
recently proposed, the overall quality of student samples is typically lower
compared to the teacher ones, which hinders their practical usage. In this
work, we investigate the relative quality of samples produced by the teacher
text-to-image diffusion model and its distilled student version. As our main
empirical finding, we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones, despite the "approximate" nature of the student. Based on this finding, we propose an
adaptive collaboration between student and teacher diffusion models for
effective text-to-image synthesis. Specifically, the distilled model produces
the initial sample, and then an oracle decides whether it needs further
improvements with a slow teacher model. Extensive experiments demonstrate that
the designed pipeline surpasses state-of-the-art text-to-image alternatives for
various inference budgets in terms of human preference. Furthermore, the
proposed approach can be naturally used in popular applications such as
text-guided image editing and controllable generation.
VecFusion: Vector Font Generation with Diffusion
December 16, 2023
Vikas Thamizharasan, Difan Liu, Shantanu Agarwal, Matthew Fisher, Michael Gharbi, Oliver Wang, Alec Jacobson, Evangelos Kalogerakis
We present VecFusion, a new neural architecture that can generate vector
fonts with varying topological structures and precise control point positions.
Our approach is a cascaded diffusion model which consists of a raster diffusion
model followed by a vector diffusion model. The raster model generates
low-resolution, rasterized fonts with auxiliary control point information,
capturing the global style and shape of the font, while the vector model
synthesizes vector fonts conditioned on the low-resolution raster fonts from
the first stage. To synthesize long and complex curves, our vector diffusion
model uses a transformer architecture and a novel vector representation that
enables the modeling of diverse vector geometry and the precise prediction of
control points. Our experiments show that, in contrast to previous generative
models for vector graphics, our new cascaded vector diffusion model generates
higher quality vector fonts, with complex structures and diverse styles.
Continuous Diffusion for Mixed-Type Tabular Data
December 16, 2023
Markus Mueller, Kathrin Gruber, Dennis Fok
Score-based generative models (or diffusion models for short) have proven
successful across many domains in generating text and image data. However, the
consideration of mixed-type tabular data with this model family has fallen
short so far. Existing research mainly combines different diffusion processes
without explicitly accounting for the feature heterogeneity inherent to tabular
data. In this paper, we combine score matching and score interpolation to
ensure a common type of continuous noise distribution that affects both
continuous and categorical features alike. Further, we investigate the impact
of distinct noise schedules per feature or per data type. We allow for
adaptive, learnable noise schedules to ensure optimally allocated model
capacity and balanced generative capability. Results show that our model
consistently outperforms state-of-the-art benchmark models and that accounting
for heterogeneity within the noise schedule design boosts the sample quality.
Lecture Notes in Probabilistic Diffusion Models
December 16, 2023
Inga Strümke, Helge Langseth
Diffusion models are loosely inspired by non-equilibrium
thermodynamics, where \textit{diffusion} refers to particles flowing from
high-concentration regions towards low-concentration regions. In statistics,
the meaning is quite similar, namely the process of transforming a complex
distribution $p_{\text{complex}}$ on $\mathbb{R}^d$ to a simple distribution
$p_{\text{prior}}$ on the same domain. This constitutes a Markov chain of
diffusion steps that slowly add random noise to the data, followed by a reverse
diffusion process in which the data is reconstructed from the noise. The
diffusion model learns the data manifold to which the original and thus the
reconstructed data samples belong, by training on a large number of data
points. While the diffusion process pushes a data sample off the data manifold,
the reverse process finds a trajectory back to the data manifold. Diffusion
models have – unlike variational autoencoders and flow models – latent
variables with the same dimensionality as the original data, and they are
currently\footnote{At the time of writing, 2023.} outperforming other
approaches – including Generative Adversarial Networks (GANs) – to modelling
the distribution of, e.g., natural images.
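As a small, standard illustration of the forward noising process these notes describe (textbook DDPM notation, not taken from the notes themselves):

```python
import numpy as np

def forward_diffuse(x0, t, betas):
    """Sample x_t ~ q(x_t | x_0) for the standard DDPM forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I)."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Example: 1000-step linear schedule on a toy 2D data point.
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.array([1.5, -0.3])
x_T = forward_diffuse(x0, t=999, betas=betas)   # nearly standard Gaussian noise
```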
Image Restoration Through Generalized Ornstein-Uhlenbeck Bridge
December 16, 2023
Conghan Yue, Zhengwei Peng, Junlong Ma, Shiyan Du, Pengxu Wei, Dongyu Zhang
Diffusion models possess powerful generative capabilities enabling the
mapping of noise to data using reverse stochastic differential equations.
However, in image restoration tasks, the focus is on the mapping relationship
from low-quality images to high-quality images. To address this, we introduce
the Generalized Ornstein-Uhlenbeck Bridge (GOUB) model. By leveraging the
natural mean-reverting property of the generalized OU process and further
adjusting the variance of its steady-state distribution through Doob’s
h-transform, we achieve point-to-point diffusion mappings with minimal
cost. This allows for end-to-end training, enabling the recovery of
high-quality images from low-quality ones. Additionally, we uncover the
mathematical essence of several bridge models, showing that they are special
cases of the GOUB, and empirically demonstrate the optimality of our proposed
models. Furthermore, benefiting from our distinctive parameterization
mechanism, we propose the Mean-ODE model, which better captures pixel-level
information and structural perception. Experimental results show that both models achieve
state-of-the-art results in various tasks, including inpainting, deraining, and
super-resolution. Code is available at https://github.com/Hammour-steak/GOUB.
PhenDiff: Revealing Invisible Phenotypes with Conditional Diffusion Models
December 13, 2023
Anis Bourou, Thomas Boyer, Kévin Daupin, Véronique Dubreuil, Aurélie De Thonel, Valérie Mezger, Auguste Genovesio
Over the last five years, deep generative models have gradually been adopted
for various tasks in biological research. Notably, image-to-image translation
methods have proven effective in revealing subtle phenotypic cell variations
otherwise invisible to the human eye. Current methods to achieve this goal
mainly rely on Generative Adversarial Networks (GANs). However, these models
are known to suffer from some shortcomings such as training instability and
mode collapse. Furthermore, the difficulty of robustly inverting a real image
into the latent space of a trained GAN prevents flexible editing of real images. In this
work, we propose PhenDiff, an image-to-image translation method based on
conditional diffusion models to identify subtle phenotypes in microscopy
images. We evaluate this approach on biological datasets against previous work
such as CycleGAN. We show that PhenDiff outperforms this baseline in terms of
quality and diversity of the generated images. We then apply this method to
display invisible phenotypic changes triggered by a rare neurodevelopmental
disorder on microscopy images of organoids. Altogether, we demonstrate that
PhenDiff is able to perform high-quality biological image-to-image translation,
allowing subtle phenotype variations to be spotted on real images.
SPD-DDPM: Denoising Diffusion Probabilistic Models in the Symmetric Positive Definite Space
December 13, 2023
Yunchen Li, Zhou Yu, Gaoqi He, Yunhang Shen, Ke Li, Xing Sun, Shaohui Lin
Symmetric positive definite~(SPD) matrices have shown important value and
applications in statistics and machine learning, such as fMRI analysis and
traffic prediction. Previous works on SPD matrices mostly focus on
discriminative models, where predictions are made directly on $E(X|y)$, with
$y$ a vector and $X$ an SPD matrix. However, these methods struggle with
large-scale data, as they need to access and process the entire dataset. In
this paper, inspired by denoising diffusion probabilistic
model~(DDPM), we propose a novel generative model, termed SPD-DDPM, by
introducing Gaussian distribution in the SPD space to estimate $E(X|y)$.
Moreover, our model is able to estimate $p(X)$ unconditionally and flexibly
without giving $y$. On the one hand, the model conditionally learns $p(X|y)$
and utilizes the mean of samples to obtain $E(X|y)$ as a prediction. On the
other hand, the model unconditionally learns the probability distribution of
the data $p(X)$ and generates samples that conform to this distribution.
Furthermore, we propose a new SPD net which is much deeper than the previous
networks and allows for the inclusion of conditional factors. Experiment
results on toy data and real taxi data demonstrate that our models effectively
fit the data distribution both unconditionally and conditionally and provide
accurate predictions.
Concept-centric Personalization with Large-scale Diffusion Priors
December 13, 2023
Pu Cao, Lu Yang, Feng Zhou, Tianrui Huang, Qing Song
Despite large-scale diffusion models being highly capable of generating
diverse open-world content, they still struggle to match the photorealism and
fidelity of concept-specific generators. In this work, we present the task of
customizing large-scale diffusion priors for specific concepts as
concept-centric personalization. Our goal is to generate high-quality
concept-centric images while maintaining the versatile controllability inherent
to open-world models, enabling applications in diverse tasks such as
concept-centric stylization and image translation. To tackle these challenges,
we identify catastrophic forgetting of guidance prediction from diffusion
priors as the fundamental issue. Consequently, we develop a guidance-decoupled
personalization framework specifically designed to address this task. We
propose Generalized Classifier-free Guidance (GCFG) as the foundational theory
for our framework. This approach extends Classifier-free Guidance (CFG) to
accommodate an arbitrary number of guidances, sourced from a variety of
conditions and models. Employing GCFG enables us to separate conditional
guidance into two distinct components: concept guidance for fidelity and
control guidance for controllability. This division makes it feasible to train
a specialized model for concept guidance, while ensuring both control and
unconditional guidance remain intact. We then present a null-text
Concept-centric Diffusion Model as a concept-specific generator to learn
concept guidance without the need for text annotations. Code will be available
at https://github.com/PRIV-Creation/Concept-centric-Personalization.
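One plausible reading of how Generalized Classifier-free Guidance combines an arbitrary number of guidance signals (the weighting scheme below is an assumption; the concept/control decoupling from the abstract maps onto the entries of the list):

```python
import torch

def generalized_cfg(eps_uncond, cond_eps_list, weights):
    """Combine several conditional noise predictions (possibly from different
    conditions or models) around one unconditional prediction.

    eps_uncond: (B, C, H, W) unconditional prediction.
    cond_eps_list: list of (B, C, H, W) predictions, e.g. [concept, control].
    weights: per-guidance scalar weights (assumed hyperparameters)."""
    eps = eps_uncond.clone()
    for eps_c, w in zip(cond_eps_list, weights):
        eps = eps + w * (eps_c - eps_uncond)   # standard CFG recovered when len == 1
    return eps
```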
$\rho$-Diffusion: A diffusion-based density estimation framework for computational physics
December 13, 2023
Maxwell X. Cai, Kin Long Kelvin Lee
In physics, density $\rho(\cdot)$ is a fundamentally important scalar
function to model, since it describes a scalar field or a probability density
function that governs a physical process. Modeling $\rho(\cdot)$ typically
scales poorly with parameter space, however, and quickly becomes prohibitively
difficult and computationally expensive. One promising avenue to bypass this is
to leverage the capabilities of denoising diffusion models often used in
high-fidelity image generation to parameterize $\rho(\cdot)$ from existing
scientific data, from which new samples can be trivially drawn. In this
paper, we propose $\rho$-Diffusion, an implementation of denoising diffusion
probabilistic models for multidimensional density estimation in physics, which
is currently in active development and, from our results, performs well on
physically motivated 2D and 3D density functions. Moreover, we propose a novel
hashing technique that allows $\rho$-Diffusion to be conditioned on an
arbitrary number of physical parameters of interest.
Clockwork Diffusion: Efficient Generation With Model-Step Distillation
December 13, 2023
Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen
This work aims to improve the efficiency of text-to-image diffusion models.
While diffusion models use computationally expensive UNet-based denoising
operations in every generation step, we identify that not all operations are
equally relevant for the final output quality. In particular, we observe that
UNet layers operating on high-res feature maps are relatively sensitive to
small perturbations. In contrast, low-res feature maps influence the semantic
layout of the final image and can often be perturbed with no noticeable change
in the output. Based on this observation, we propose Clockwork Diffusion, a
method that periodically reuses computation from preceding denoising steps to
approximate low-res feature maps at one or more subsequent steps. For multiple
baselines, and for both text-to-image generation and image editing, we
demonstrate that Clockwork leads to comparable or improved perceptual scores
with drastically reduced computational complexity. As an example, for Stable
Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and
CLIP change.
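A minimal sketch of the caching schedule implied by the description above, with `unet_lo`/`unet_hi` as placeholder stand-ins for the split low-res/high-res UNet paths:

```python
def clockwork_sample(unet_hi, unet_lo, x, timesteps, step_fn, k=2):
    """Recompute the expensive low-res UNet path only every k steps and reuse
    the cached features in between, while the perturbation-sensitive high-res
    path runs every step. `step_fn` applies one solver update (e.g. DPM++)."""
    lo_feats = None
    for i, t in enumerate(timesteps):
        if i % k == 0 or lo_feats is None:
            lo_feats = unet_lo(x, t)          # full low-res computation
        eps = unet_hi(x, t, lo_feats)         # high-res path always recomputed
        x = step_fn(x, eps, t)
    return x
```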
Compositional Inversion for Stable Diffusion Models
December 13, 2023
Xulu Zhang, Xiao-Yong Wei, Jinlin Wu, Tianyi Zhang, Zhaoxiang Zhang, Zhen Lei, Qing Li
Inversion methods, such as Textual Inversion, generate personalized images by
incorporating concepts of interest provided by user images. However, existing
methods often suffer from overfitting issues, where the dominant presence of
inverted concepts leads to the absence of other desired concepts. It stems from
the fact that during inversion, the irrelevant semantics in the user images are
also encoded, forcing the inverted concepts to occupy locations far from the
core distribution in the embedding space. To address this issue, we propose a
method that guides the inversion process towards the core distribution for
compositional embeddings. Additionally, we introduce a spatial regularization
approach to balance the attention on the concepts being composed. Our method is
designed as a post-training approach and can be seamlessly integrated with
other inversion methods. Experimental results demonstrate the effectiveness of
our proposed approach in mitigating the overfitting problem and generating more
diverse and balanced compositions of concepts in the synthesized images. The
source code is available at
https://github.com/zhangxulu1996/Compositional-Inversion.
ClusterDDPM: An EM clustering framework with Denoising Diffusion Probabilistic Models
December 13, 2023
Jie Yan, Jing Liu, Zhong-yuan Zhang
Variational autoencoders (VAEs) and generative adversarial networks (GANs) have
found widespread applications in clustering and have achieved significant
success. However, the potential of these approaches may be limited due to VAE’s
mediocre generation capability or GAN’s well-known instability during
adversarial training. In contrast, denoising diffusion probabilistic models
(DDPMs) represent a new and promising class of generative models that may
unlock fresh dimensions in clustering. In this study, we introduce an
innovative expectation-maximization (EM) framework for clustering using DDPMs.
In the E-step, we aim to derive a mixture of Gaussian priors for the subsequent
M-step. In the M-step, our focus lies in learning clustering-friendly latent
representations for the data by employing the conditional DDPM and matching the
distribution of latent representations to the mixture of Gaussian priors. We
present a rigorous theoretical analysis of the optimization process in the
M-step, proving that the optimizations are equivalent to maximizing the lower
bound of the Q function within the vanilla EM framework under certain
constraints. Comprehensive experiments validate the advantages of the proposed
framework, showcasing superior performance in clustering, unsupervised
conditional generation and latent representation learning.
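A very rough sketch of the EM loop described above, with `encode` and `train_ddpm_mstep` standing in for the conditional-DDPM machinery (placeholders, not the authors' code):

```python
from sklearn.mixture import GaussianMixture

def em_cluster_ddpm(encode, train_ddpm_mstep, data, n_clusters=10, n_rounds=5):
    """E-step: fit a mixture-of-Gaussians prior on current latents.
    M-step: update the conditional DDPM/encoder so latents match that prior."""
    gmm = None
    for _ in range(n_rounds):
        z = encode(data)                                        # current latents
        gmm = GaussianMixture(n_components=n_clusters).fit(z)   # E-step
        train_ddpm_mstep(data, gmm)                             # M-step
    labels = gmm.predict(encode(data))                          # final clusters
    return gmm, labels
```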
Time Series Diffusion Method: A Denoising Diffusion Probabilistic Model for Vibration Signal Generation
December 13, 2023
Haiming Yi, Lei Hou, Yuhong Jin, Nasser A. Saeed
Diffusion models have demonstrated robust data generation capabilities in
various research fields. In this paper, a Time Series Diffusion Method (TSDM)
is proposed for vibration signal generation, leveraging the foundational
principles of diffusion models. The TSDM uses an improved U-net architecture
with attention blocks to effectively segment and extract features from
one-dimensional time series data. It operates based on forward diffusion and
reverse denoising processes for time-series generation. Experimental validation
is conducted using single-frequency, multi-frequency datasets, and bearing
fault datasets. The results show that TSDM can accurately generate the
single-frequency and multi-frequency features in the time series and retain the
basic frequency features for the diffusion generation results of the bearing
fault series. Finally, TSDM is applied to the small sample fault diagnosis of
three public bearing fault datasets, and the results show that the accuracy of
small-sample fault diagnosis on the three datasets is improved by up to 32.380%,
18.355%, and 9.298%, respectively.
Diffusion Models Enable Zero-Shot Pose Estimation for Lower-Limb Prosthetic Users
December 13, 2023
Tianxun Zhou, Muhammad Nur Shahril Iskandar, Keng-Hwee Chiam
The application of 2D markerless gait analysis has garnered increasing
interest and application within clinical settings. However, its effectiveness
in the realm of lower-limb amputees has remained less than optimal. In
response, this study introduces an innovative zero-shot method employing image
generation diffusion models to achieve markerless pose estimation for
lower-limb prosthetics, presenting a promising solution to gait analysis for
this specific population. Our approach demonstrates an enhancement in detecting
key points on prosthetic limbs over existing methods, and enables clinicians to
gain invaluable insights into the kinematics of lower-limb amputees across the
gait cycle. The outcomes obtained not only serve as a proof-of-concept for the
feasibility of this zero-shot approach but also underscore its potential in
advancing rehabilitation through gait analysis for this unique population.
December 12, 2023
Shentong Mo, Enze Xie, Yue Wu, Junsong Chen, Matthias Nießner, Zhenguo Li
Diffusion Transformers have recently shown remarkable effectiveness in
generating high-quality 3D point clouds. However, training voxel-based
diffusion models for high-resolution 3D voxels remains prohibitively expensive
due to the cubic complexity of attention operators, which arises from the
additional dimension of voxels. Motivated by the inherent redundancy of 3D
compared to 2D, we propose FastDiT-3D, a novel masked diffusion transformer
tailored for efficient 3D point cloud generation, which greatly reduces
training costs. Specifically, we draw inspiration from masked autoencoders to
dynamically operate the denoising process on masked voxelized point clouds. We
also propose a novel voxel-aware masking strategy to adaptively aggregate
background/foreground information from voxelized point clouds. Our method
achieves state-of-the-art performance with an extreme masking ratio of nearly
99%. Moreover, to improve multi-category 3D generation, we introduce
Mixture-of-Experts (MoE) into the 3D diffusion model. Each category can learn a
distinct diffusion path with different experts, relieving gradient conflict.
Experimental results on the ShapeNet dataset demonstrate that our method
achieves state-of-the-art high-fidelity and diverse 3D point cloud generation
performance. Our FastDiT-3D improves 1-Nearest Neighbor Accuracy and Coverage
metrics when generating 128-resolution voxel point clouds, using only 6.5% of
the original training cost.
Equivariant Flow Matching with Hybrid Probability Transport
December 12, 2023
Yuxuan Song, Jingjing Gong, Minkai Xu, Ziyao Cao, Yanyan Lan, Stefano Ermon, Hao Zhou, Wei-Ying Ma
The generation of 3D molecules requires simultaneously deciding the
categorical features~(atom types) and continuous features~(atom coordinates).
Deep generative models, especially Diffusion Models (DMs), have demonstrated
effectiveness in generating feature-rich geometries. However, existing DMs
typically suffer from unstable probability dynamics with inefficient sampling
speed. In this paper, we introduce geometric flow matching, which enjoys the
advantages of both equivariant modeling and stabilized probability dynamics.
More specifically, we propose a hybrid probability path where the coordinates
probability path is regularized by an equivariant optimal transport, and the
information between different modalities is aligned. Experimentally, the
proposed method could consistently achieve better performance on multiple
molecule generation benchmarks with a 4.75$\times$ sampling speedup on
average.
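For reference, the basic (non-equivariant) flow-matching objective underlying this kind of method, with a plain linear probability path and without the paper's equivariant optimal-transport coupling of pairs:

```python
import torch

def flow_matching_loss(v_theta, x0, x1):
    """x0: reference/noise samples, x1: data samples, v_theta: velocity network.
    Uses the linear path x_t = (1 - t) x0 + t x1 with target velocity x1 - x0."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)), device=x0.device)
    x_t = (1 - t) * x0 + t * x1                 # linear interpolation path
    target = x1 - x0                            # conditional velocity field
    return ((v_theta(x_t, t.flatten()) - target) ** 2).mean()
```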
Generating High-Resolution Regional Precipitation Using Conditional Diffusion Model
December 12, 2023
Naufal Shidqi, Chaeyoon Jeong, Sungwon Park, Elke Zeller, Arjun Babu Nellikkattil, Karandeep Singh
cs.LG, cs.AI, physics.ao-ph
Climate downscaling is a crucial technique within climate research, serving
to project low-resolution (LR) climate data to higher resolutions (HR).
Previous research has demonstrated the effectiveness of deep learning for
downscaling tasks. However, most deep learning models for climate downscaling
may not perform optimally for high scaling factors (i.e., 4x, 8x) due to their
limited ability to capture the intricate details required for generating HR
climate data. Furthermore, climate data behaves differently from image data,
necessitating a nuanced approach when employing deep generative models. In
response to these challenges, this paper presents a deep generative model for
downscaling climate data, specifically precipitation on a regional scale. We
employ a denoising diffusion probabilistic model (DDPM) conditioned on multiple
LR climate variables. The proposed model is evaluated using precipitation data
from the Community Earth System Model (CESM) v1.2.2 simulation. Our results
demonstrate significant improvements over existing baselines, underscoring the
effectiveness of the conditional diffusion model in downscaling climate data.
LoRA-Enhanced Distillation on Guided Diffusion Models
December 12, 2023
Pareesa Ameneh Golnari
Diffusion models, such as Stable Diffusion (SD), offer the ability to
generate high-resolution images with diverse features, but they come at a
significant computational and memory cost. In classifier-free guided diffusion
models, prolonged inference times are attributed to the necessity of computing
two separate diffusion models at each denoising step. Recent work has shown
promise in improving inference time through distillation techniques, teaching
the model to perform similar denoising steps with reduced computations.
However, the application of distillation introduces additional memory overhead
to these already resource-intensive diffusion models, making it less practical.
To address these challenges, our research explores a novel approach that
combines Low-Rank Adaptation (LoRA) with model distillation to efficiently
compress diffusion models. This approach not only reduces inference time but
also mitigates memory overhead, and notably decreases memory consumption even
before applying distillation. The results are remarkable, featuring a
significant reduction in inference time due to the distillation process and a
substantial 50% reduction in memory consumption. Our examination of the
generated images underscores that the incorporation of LoRA-enhanced
distillation maintains image quality and alignment with the provided prompts.
In summary, while conventional distillation tends to increase memory
consumption, LoRA-enhanced distillation offers optimization without any
trade-offs or compromises in quality.
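A standard LoRA adapter layer of the kind this description implies (rank and scaling are illustrative; how it is attached to the distilled UNet is not shown):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter on a frozen linear layer: y = W x + (alpha / r) * B(A(x)).
    Only A and B are trained, which keeps the memory footprint small."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # freeze the original weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)          # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```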
Photorealistic Video Generation with Diffusion Models
December 11, 2023
Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
We present W.A.L.T, a transformer-based approach for photorealistic video
generation via diffusion modeling. Our approach has two key design decisions.
First, we use a causal encoder to jointly compress images and videos within a
unified latent space, enabling training and generation across modalities.
Second, for memory and training efficiency, we use a window attention
architecture tailored for joint spatial and spatiotemporal generative modeling.
Taken together these design decisions enable us to achieve state-of-the-art
performance on established video (UCF-101 and Kinetics-600) and image
(ImageNet) generation benchmarks without using classifier-free guidance.
Finally, we also train a cascade of three models for the task of text-to-video
generation consisting of a base latent video diffusion model, and two video
super-resolution diffusion models to generate videos of $512 \times 896$
resolution at $8$ frames per second.
UpFusion: Novel View Diffusion from Unposed Sparse View Observations
December 11, 2023
Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov, Shubham Tulsiani
We propose UpFusion, a system that can perform novel view synthesis and infer
3D representations for an object given a sparse set of reference images without
corresponding pose information. Current sparse-view 3D inference methods
typically rely on camera poses to geometrically aggregate information from
input views, but are not robust in-the-wild when such information is
unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by
learning to implicitly leverage the available images as context in a
conditional generative model for synthesizing novel views. We incorporate two
complementary forms of conditioning into diffusion models for leveraging the
input views: a) via inferring query-view aligned features using a scene-level
transformer, b) via intermediate attentional layers that can directly observe
the input image tokens. We show that this mechanism allows generating
high-fidelity novel views while improving the synthesis quality given
additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google
Scanned Objects datasets and demonstrate the benefits of our method over
pose-reliant sparse-view methods as well as single-view methods that cannot
leverage additional views. Finally, we also show that our learned model can
generalize beyond the training categories and even allow reconstruction from
self-captured images of generic objects in-the-wild.
DiAD: A Diffusion-based Framework for Multi-class Anomaly Detection
December 11, 2023
Haoyang He, Jiangning Zhang, Hongxu Chen, Xuhai Chen, Zhishan Li, Xu Chen, Yabiao Wang, Chengjie Wang, Lei Xie
Reconstruction-based approaches have achieved remarkable outcomes in anomaly
detection. The exceptional image reconstruction capabilities of recently
popular diffusion models have sparked research efforts to utilize them for
enhanced reconstruction of anomalous images. Nonetheless, these methods might
face challenges related to the preservation of image categories and pixel-wise
structural integrity in the more practical multi-class setting. To solve the
above problems, we propose a Diffusion-based Anomaly Detection (DiAD) framework
for multi-class anomaly detection, which consists of a pixel-space autoencoder,
a latent-space Semantic-Guided (SG) network with a connection to Stable
Diffusion’s denoising network, and a feature-space pre-trained feature
extractor. First, the SG network is proposed for reconstructing anomalous
regions while preserving the original image’s semantic information. Second,
we introduce a Spatial-aware Feature Fusion (SFF) block to maximize
reconstruction accuracy when dealing with extensively reconstructed areas.
Thirdly, the input and reconstructed images are processed by a pre-trained
feature extractor to generate anomaly maps based on features extracted at
different scales. Experiments on MVTec-AD and VisA datasets demonstrate the
effectiveness of our approach which surpasses the state-of-the-art methods,
e.g., achieving 96.8/52.6 and 97.2/99.0 (AUROC/AP) for localization and
detection respectively on multi-class MVTec-AD dataset. Code will be available
at https://lewandofskee.github.io/projects/diad.
HOI-Diff: Text-Driven Synthesis of 3D Human-Object Interactions using Diffusion Models
December 11, 2023
Xiaogang Peng, Yiming Xie, Zizhao Wu, Varun Jampani, Deqing Sun, Huaizu Jiang
We address the problem of generating realistic 3D human-object interactions
(HOIs) driven by textual prompts. Instead of a single model, our key insight is
to take a modular design and decompose the complex task into simpler sub-tasks.
We first develop a dual-branch diffusion model (HOI-DM) to generate both human
and object motions conditioning on the input text, and encourage coherent
motions by a cross-attention communication module between the human and object
motion generation branches. We also develop an affordance prediction diffusion
model (APDM) to predict the contacting area between the human and object during
the interactions driven by the textual prompt. The APDM is independent of the
results of the HOI-DM and thus can correct potential errors made by the latter.
Moreover, it stochastically generates the contacting points to diversify the
generated motions. Finally, we incorporate the estimated contacting points into
the classifier guidance to achieve accurate and close contact between humans
and objects. To train and evaluate our approach, we annotate the BEHAVE dataset
with text descriptions. Experimental results demonstrate that our approach is
able to produce realistic HOIs with various interactions and different types of
objects.
DiffAIL: Diffusion Adversarial Imitation Learning
December 11, 2023
Bingzheng Wang, Guoqiang Wu, Teng Pang, Yan Zhang, Yilong Yin
Imitation learning aims to solve the problem of defining reward functions in
real-world decision-making tasks. The current popular approach is the
Adversarial Imitation Learning (AIL) framework, which matches expert
state-action occupancy measures to obtain a surrogate reward for forward
reinforcement learning. However, the traditional discriminator is a simple
binary classifier and doesn’t learn an accurate distribution, which may result
in failing to identify expert-level state-action pairs induced by the policy
interacting with the environment. To address this issue, we propose a method
named diffusion adversarial imitation learning (DiffAIL), which introduces the
diffusion model into the AIL framework. Specifically, DiffAIL models the
state-action pairs as unconditional diffusion models and uses diffusion loss as
part of the discriminator’s learning objective, which enables the discriminator
to better capture expert demonstrations and improve generalization.
Experimentally, the results show that our method achieves state-of-the-art
performance and significantly surpasses expert demonstration on two benchmark
tasks, including the standard state-action setting and state-only settings. Our
code is available at https://github.com/ML-Group-SDU/DiffAIL.
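A hedged sketch of using a per-sample diffusion loss over state-action pairs as a discriminator-style score (the exact surrogate-reward mapping in DiffAIL may differ):

```python
import torch

def diffusion_reward(eps_model, s, a, alpha_bars, n_t=4):
    """Score a state-action pair by how well an unconditional diffusion model
    over concatenated (s, a) vectors denoises it; low error ~ expert-like."""
    x0 = torch.cat([s, a], dim=-1)
    losses = []
    for _ in range(n_t):
        t = torch.randint(0, len(alpha_bars), (x0.shape[0],), device=x0.device)
        ab = alpha_bars[t].unsqueeze(-1)
        eps = torch.randn_like(x0)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps      # forward-noised pair
        losses.append(((eps_model(x_t, t) - eps) ** 2).mean(dim=-1))
    loss = torch.stack(losses).mean(dim=0)
    return torch.exp(-loss)        # assumed monotone mapping from loss to reward
```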
The Journey, Not the Destination: How Data Guides Diffusion Models
December 11, 2023
Kristian Georgiev, Joshua Vendrow, Hadi Salman, Sung Min Park, Aleksander Madry
Diffusion models trained on large datasets can synthesize photo-realistic
images of remarkable quality and diversity. However, attributing these images
back to the training data, that is, identifying the specific training examples
that caused an image to be generated, remains a challenge. In this paper, we propose
a framework that: (i) provides a formal notion of data attribution in the
context of diffusion models, and (ii) allows us to counterfactually validate
such attributions. Then, we provide a method for computing these attributions
efficiently. Finally, we apply our method to find (and evaluate) such
attributions for denoising diffusion probabilistic models trained on CIFAR-10
and latent diffusion models trained on MS COCO. We provide code at
https://github.com/MadryLab/journey-TRAK.
December 11, 2023
Linjie Fu, Xia Li, Xiuding Cai, Yingkai Wang, Xueyao Wang, Yu Yao, Yali Shen
Radiation therapy serves as an effective and standard method for cancer
treatment. Excellent radiation therapy plans always rely on high-quality dose
distribution maps obtained through repeated trial and error by experienced
experts. However, due to individual differences and complex clinical
situations, even seasoned expert teams may struggle to quickly arrive at the
best treatment plan every time. Many automatic dose distribution prediction
methods have been proposed recently to accelerate the radiation therapy
planning process and have achieved good results. However, these results suffer
from over-smoothing issues, with the obtained dose distribution maps lacking
high-frequency details, limiting their clinical application. To address
these limitations, we propose a dose prediction diffusion model based on
SwinTransformer and a projector, SP-DiffDose. To capture the direct correlation
between anatomical structure and dose distribution maps, SP-DiffDose uses a
structural encoder to extract features from anatomical images, then employs a
conditional diffusion process to blend noise and anatomical images at multiple
scales and gradually map them to dose distribution maps. To enhance the dose
prediction distribution for organs at risk, SP-DiffDose utilizes
SwinTransformer in the deeper layers of the network to capture features at
different scales in the image. To learn good representations from the fused
features, SP-DiffDose passes the fused features through a designed projector,
improving dose prediction accuracy. Finally, we evaluate SP-DiffDose on an
internal dataset. The results show that SP-DiffDose outperforms existing
methods on multiple evaluation metrics, demonstrating the superiority and
generalizability of our method.
PCRDiffusion: Diffusion Probabilistic Models for Point Cloud Registration
December 11, 2023
Yue Wu, Yongzhe Yuan, Xiaolong Fan, Xiaoshui Huang, Maoguo Gong, Qiguang Miao
We propose a new framework that formulates point cloud registration as a
denoising diffusion process from noisy transformation to object transformation.
During training stage, object transformation diffuses from ground-truth
transformation to random distribution, and the model learns to reverse this
noising process. In sampling stage, the model refines randomly generated
transformation to the output result in a progressive way. We derive the
variational bound in closed form for training and provide implementations of
the model. Our work provides the following crucial findings: (i) In contrast to
most existing methods, our framework, Diffusion Probabilistic Models for Point
Cloud Registration (PCRDiffusion) does not require repeatedly update source
point cloud to refine the predicted transformation. (ii) Point cloud
registration, one of the representative discriminative tasks, can be solved by
a generative way and the unified probabilistic formulation. Finally, we discuss
and provide an outlook on the application of diffusion model in different
scenarios for point cloud registration. Experimental results demonstrate that
our model achieves competitive performance in point cloud registration. In
correspondence-free and correspondence-based scenarios, PCRDifussion can both
achieve exceeding 50\% performance improvements.
CONFORM: Contrast is All You Need For High-Fidelity Text-to-Image Diffusion Models
December 11, 2023
Tuna Han Salih Meral, Enis Simsar, Federico Tombari, Pinar Yanardag
Images produced by text-to-image diffusion models might not always faithfully
represent the semantic intent of the provided text prompt, where the model
might overlook or entirely fail to produce certain objects. Existing solutions
often require custom-tailored functions for each of these problems, leading
to sub-optimal results, especially for complex prompts. Our work introduces a
novel perspective by tackling this challenge in a contrastive context. Our
approach intuitively promotes the segregation of objects in attention maps
while also ensuring that pairs of related attributes are kept close to each
other. We conduct extensive experiments across a wide variety of scenarios,
each involving unique combinations of objects, attributes, and scenes. These
experiments effectively showcase the versatility, efficiency, and flexibility
of our method in working with both latent and pixel-based diffusion models,
including Stable Diffusion and Imagen. Moreover, we publicly share our source
code to facilitate further research.
A Note on the Convergence of Denoising Diffusion Probabilistic Models
December 10, 2023
Sokhna Diarra Mbacke, Omar Rivasplata
Diffusion models are one of the most important families of deep generative
models. In this note, we derive a quantitative upper bound on the Wasserstein
distance between the data-generating distribution and the distribution learned
by a diffusion model. Unlike previous works in this field, our result does not
make assumptions on the learned score function. Moreover, our bound holds for
arbitrary data-generating distributions on bounded instance spaces, even those
without a density w.r.t. the Lebesgue measure, and the upper bound does not
suffer from exponential dependencies. Our main result builds upon the recent
work of Mbacke et al. (2023) and our proofs are elementary.
Diffusion for Natural Image Matting
December 10, 2023
Yihan Hu, Yiheng Lin, Wei Wang, Yao Zhao, Yunchao Wei, Humphrey Shi
We aim to leverage diffusion to address the challenging image matting task.
However, the presence of high computational overhead and the inconsistency of
noise sampling between the training and inference processes pose significant
obstacles to achieving this goal. In this paper, we present DiffMatte, a
solution designed to effectively overcome these challenges. First, DiffMatte
decouples the decoder from the intricately coupled matting network design,
involving only one lightweight decoder in the iterations of the diffusion
process. With such a strategy, DiffMatte mitigates the growth of computational
overhead as the number of samples increases. Second, we employ a self-aligned
training strategy with uniform time intervals, ensuring a consistent noise
sampling between training and inference across the entire time domain. Our
DiffMatte is designed with flexibility in mind and can seamlessly integrate
into various modern matting architectures. Extensive experimental results
demonstrate that DiffMatte not only reaches the state-of-the-art level on the
Composition-1k test set, surpassing the best methods in the past by 5% and 15%
in the SAD and MSE metrics respectively, but also shows stronger
generalization ability on other benchmarks.
InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models
December 10, 2023
Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap-Peng Tan, Weipeng Hu
Large-scale text-to-image (T2I) diffusion models have showcased incredible
capabilities in generating coherent images based on textual descriptions,
enabling vast applications in content generation. While recent advancements
have introduced control over factors such as object localization, posture, and
image contours, a crucial gap remains in our ability to control the
interactions between objects in the generated content. Well-controlling
interactions in generated images could yield meaningful applications, such as
creating realistic scenes with interacting characters. In this work, we study
the problems of conditioning T2I diffusion models with Human-Object Interaction
(HOI) information, consisting of a triplet label (person, action, object) and
corresponding bounding boxes. We propose a pluggable interaction control model,
called InteractDiffusion that extends existing pre-trained T2I diffusion models
to be better conditioned on interactions. Specifically, we
tokenize the HOI information and learn their relationships via interaction
embeddings. A conditioning self-attention layer is trained to map HOI tokens to
visual tokens, thereby conditioning the visual tokens better in existing T2I
diffusion models. Our model adds the ability to control interaction and
location to existing T2I diffusion models, and it outperforms existing baselines
by a large margin in HOI detection score, as well as fidelity in FID and KID.
Project page: https://jiuntian.github.io/interactdiffusion.
AnomalyDiffusion: Few-Shot Anomaly Image Generation with Diffusion Model
December 10, 2023
Teng Hu, Jiangning Zhang, Ran Yi, Yuzhen Du, Xu Chen, Liang Liu, Yabiao Wang, Chengjie Wang
Anomaly inspection plays an important role in industrial manufacturing.
Existing anomaly inspection methods are limited in their performance due to
insufficient anomaly data. Although anomaly generation methods have been
proposed to augment the anomaly data, they either suffer from poor generation
authenticity or inaccurate alignment between the generated anomalies and masks.
To address the above problems, we propose AnomalyDiffusion, a novel
diffusion-based few-shot anomaly generation model, which utilizes the strong
prior information of a latent diffusion model learned from large-scale datasets
to enhance the generation authenticity under few-shot training data. First, we
propose Spatial Anomaly Embedding, which consists of a learnable anomaly
embedding and a spatial embedding encoded from an anomaly mask, disentangling
the anomaly information into anomaly appearance and location information.
Moreover, to improve the alignment between the generated anomalies and the
anomaly masks, we introduce a novel Adaptive Attention Re-weighting Mechanism.
Based on the disparities between the generated anomaly image and normal sample,
it dynamically guides the model to focus more on the areas with less noticeable
generated anomalies, enabling generation of accurately-matched anomalous
image-mask pairs. Extensive experiments demonstrate that our model
significantly outperforms the state-of-the-art methods in generation
authenticity and diversity, and effectively improves the performance of
downstream anomaly inspection tasks. The code and data are available at
https://github.com/sjtuplayer/anomalydiffusion.
Conditional Stochastic Interpolation for Generative Learning
December 09, 2023
Ding Huang, Jian Huang, Ting Li, Guohao Shen
We propose a conditional stochastic interpolation (CSI) approach to learning
conditional distributions. CSI learns probability flow equations or stochastic
differential equations that transport a reference distribution to the target
conditional distribution. This is achieved by first learning the drift function
and the conditional score function based on conditional stochastic
interpolation, which are then used to construct a deterministic process
governed by an ordinary differential equation or a diffusion process for
conditional sampling. In our proposed CSI model, we incorporate an adaptive
diffusion term to address the instability issues arising during the training
process. We provide explicit forms of the conditional score function and the
drift function in terms of conditional expectations under mild conditions,
which naturally lead to a nonparametric regression approach to estimating
these functions. Furthermore, we establish non-asymptotic error bounds for
learning the target conditional distribution via conditional stochastic
interpolation in terms of KL divergence, taking into account the neural network
approximation error. We illustrate the application of CSI on image generation
using a benchmark image dataset.
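For concreteness, a generic stochastic-interpolation construction in this spirit (the coefficients, the latent noise term, and the conditional drift below are a textbook form, not necessarily the paper's exact definitions) is
\[
X_t = \alpha(t)\,X_0 + \beta(t)\,X_1 + \gamma(t)\,Z, \qquad Z \sim \mathcal{N}(0, I_d),
\]
with boundary conditions $\alpha(0)=\beta(1)=1$ and $\alpha(1)=\beta(0)=\gamma(0)=\gamma(1)=0$, and the drift used for conditional sampling given by the conditional expectation
\[
v(t, x \mid y) = \mathbb{E}\!\left[\dot{\alpha}(t)X_0 + \dot{\beta}(t)X_1 + \dot{\gamma}(t)Z \;\middle|\; X_t = x,\, Y = y\right].
\]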
DPoser: Diffusion Model as Robust 3D Human Pose Prior
December 09, 2023
Junzhe Lu, Jing Lin, Hongkun Dou, Yulun Zhang, Yue Deng, Haoqian Wang
Modeling human pose is a cornerstone in applications from human-robot
interaction to augmented reality, yet crafting a robust human pose prior
remains a challenge due to biomechanical constraints and diverse human
movements. Traditional priors like VAEs and NDFs often fall short in realism
and generalization, especially in extreme conditions such as unseen noisy
poses. To address these issues, we introduce DPoser, a robust and versatile
human pose prior built upon diffusion models. Designed with optimization
frameworks, DPoser seamlessly integrates into various pose-centric
applications, including human mesh recovery, pose completion, and motion
denoising. Specifically, by formulating these tasks as inverse problems, we
employ variational diffusion sampling for efficient solving. Furthermore,
acknowledging the disparity between the articulated poses we focus on and
structured images in previous research, we propose a truncated timestep
scheduling to boost performance on downstream tasks. Our exhaustive experiments
demonstrate DPoser’s superiority over existing state-of-the-art pose priors
across multiple tasks.
Consistency Models for Scalable and Fast Simulation-Based Inference
December 09, 2023
Marvin Schmitt, Valentin Pratz, Ullrich Köthe, Paul-Christian Bürkner, Stefan T Radev
Simulation-based inference (SBI) is constantly in search of more expressive
algorithms for accurately inferring the parameters of complex models from noisy
data. We present consistency models for neural posterior estimation (CMPE), a
new free-form conditional sampler for scalable, fast, and amortized SBI with
generative neural networks. CMPE combines the advantages of normalizing flows
and flow matching methods into a single generative architecture: It essentially
distills a continuous probability flow and enables rapid few-shot inference
with an unconstrained architecture that can be tailored to the structure of the
estimation problem. Our empirical evaluation demonstrates that CMPE not only
outperforms current state-of-the-art algorithms on three hard low-dimensional
problems, but also achieves competitive performance in a high-dimensional
Bayesian denoising experiment and in estimating a computationally demanding
multi-scale model of tumor spheroid growth.
Efficient Quantization Strategies for Latent Diffusion Models
December 09, 2023
Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, Hongbo Zhang
Latent Diffusion Models (LDMs) capture the dynamic evolution of latent
variables over time, blending patterns and multimodality in a generative
system. Despite the proficiency of LDM in various applications, such as
text-to-image generation, facilitated by robust text encoders and a variational
autoencoder, the critical need to deploy large generative models on edge
devices compels a search for more compact yet effective alternatives. Post
Training Quantization (PTQ), a method to compress the operational size of deep
learning models, encounters challenges when applied to LDM due to temporal and
structural complexities. This study proposes a quantization strategy that
efficiently quantizes LDMs, leveraging Signal-to-Quantization-Noise Ratio (SQNR)
as a pivotal metric for evaluation. By treating the quantization discrepancy as
relative noise and identifying sensitive part(s) of a model, we propose an
efficient quantization approach encompassing both global and local strategies.
The global quantization process mitigates relative quantization noise by
initiating higher-precision quantization on sensitive blocks, while local
treatments address specific challenges in quantization-sensitive and
time-sensitive modules. The outcomes of our experiments reveal that the
implementation of both global and local treatments yields a highly efficient
and effective Post Training Quantization (PTQ) of LDMs.
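A minimal sketch of the SQNR metric used to rank quantization-sensitive modules (the block-ranking and mixed-precision assignment logic is omitted):

```python
import torch

def sqnr_db(fp_out: torch.Tensor, q_out: torch.Tensor) -> float:
    """Signal-to-Quantization-Noise Ratio in dB for one module's output,
    treating the quantized-vs-float discrepancy as relative noise."""
    noise = fp_out - q_out
    signal_power = fp_out.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-12)
    return (10.0 * torch.log10(signal_power / noise_power)).item()

# Blocks with the lowest SQNR would then be kept at higher precision.
```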
Cross Domain Generative Augmentation: Domain Generalization with Latent Diffusion Models
December 08, 2023
Sobhan Hemati, Mahdi Beitollahi, Amir Hossein Estiri, Bassel Al Omari, Xi Chen, Guojun Zhang
Despite the huge effort in developing novel regularizers for Domain
Generalization (DG), adding simple data augmentation to the vanilla ERM which
is a practical implementation of the Vicinal Risk Minimization principle (VRM)
\citep{chapelle2000vicinal} outperforms or stays competitive with many of the
proposed regularizers. The VRM reduces the estimation error in ERM by replacing
the point-wise kernel estimates with a more precise estimation of true data
distribution that reduces the gap between data points \textbf{within each
domain}. However, in the DG setting, the estimation error of true data
distribution by ERM is mainly caused by the distribution shift \textbf{between
domains} which cannot be fully addressed by simple data augmentation techniques
within each domain. Inspired by this limitation of VRM, we propose a novel data
augmentation named Cross Domain Generative Augmentation (CDGA) that replaces
the pointwise kernel estimates in ERM with new density estimates in the
\textbf{vicinity of domain pairs} so that the gap between domains is further
reduced. To this end, CDGA, which is built upon latent diffusion models (LDM),
generates synthetic images to fill the gap between all domains and as a result,
reduces the non-iidness. We show that CDGA outperforms SOTA DG methods under
the Domainbed benchmark. To explain the effectiveness of CDGA, we generate more
than 5 million synthetic images and perform extensive ablation studies
including data scaling laws, distribution visualization, domain shift
quantification, adversarial robustness, and loss landscape analysis.
Membership Inference Attacks on Diffusion Models via Quantile Regression
December 08, 2023
Shuai Tang, Zhiwei Steven Wu, Sergul Aydore, Michael Kearns, Aaron Roth
Recently, diffusion models have become popular tools for image synthesis
because of their high-quality outputs. However, like other large-scale models,
they may leak private information about their training data. Here, we
demonstrate a privacy vulnerability of diffusion models through a
\emph{membership inference (MI) attack}, which aims to identify whether a
target example belongs to the training set when given the trained diffusion
model. Our proposed MI attack learns quantile regression models that predict (a
quantile of) the distribution of reconstruction loss on examples not used in
training. This allows us to define a granular hypothesis test for determining
the membership of a point in the training set, based on thresholding the
reconstruction loss of that point using a custom threshold tailored to the
example. We also provide a simple bootstrap technique that takes a majority
membership prediction over a bag of ``weak attackers'', which improves the
accuracy over individual quantile regression models. We show that our attack
outperforms the prior state-of-the-art attack while being substantially less
computationally expensive -- prior attacks required training multiple ``shadow
models'' with the same architecture as the model under attack, whereas our
attack requires training only much smaller models.
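An illustrative sketch of the quantile-regression attack idea (feature construction, quantile level, and the bagging of weak attackers are assumptions, not the paper's exact setup):

```python
from sklearn.ensemble import GradientBoostingRegressor

def fit_quantile_attack(feats_nonmember, loss_nonmember, q=0.05):
    """Regress a low quantile of the reconstruction loss on non-member examples,
    then flag a target point as a member if its observed loss falls below its
    predicted per-example threshold."""
    model = GradientBoostingRegressor(loss="quantile", alpha=q)
    model.fit(feats_nonmember, loss_nonmember)

    def is_member(feats_target, loss_target):
        threshold = model.predict(feats_target)
        return loss_target < threshold          # unusually low loss => member
    return is_member
```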
UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models
December 08, 2023
Yiming Zhao, Zhouhui Lian
Text-to-Image (T2I) generation methods based on diffusion models have garnered
significant attention in the last few years. Although these image synthesis
methods produce visually appealing results, they frequently exhibit spelling
errors when rendering text within the generated images. Such errors manifest as
missing, incorrect or extraneous characters, thereby severely constraining the
performance of text image generation based on diffusion models. To address the
aforementioned issue, this paper proposes a novel approach for text image
generation, utilizing a pre-trained diffusion model (i.e., Stable Diffusion
[27]). Our approach involves the design and training of a light-weight
character-level text encoder, which replaces the original CLIP encoder and
provides more robust text embeddings as conditional guidance. Then, we
fine-tune the diffusion model using a large-scale dataset, incorporating local
attention control under the supervision of character-level segmentation maps.
Finally, by employing an inference stage refinement process, we achieve a
notably high sequence accuracy when synthesizing text in arbitrarily given
images. Both qualitative and quantitative results demonstrate the superiority
of our method to the state of the art. Furthermore, we showcase several
potential applications of the proposed UDiffText, including text-centric image
synthesis, scene text editing, etc. Code and model will be available at
https://github.com/ZYM-PKU/UDiffText.
MVDD: Multi-View Depth Diffusion Models
December 08, 2023
Zhen Wang, Qiangeng Xu, Feitong Tan, Menglei Chai, Shichen Liu, Rohit Pandey, Sean Fanello, Achuta Kadambi, Yinda Zhang
Denoising diffusion models have demonstrated outstanding results in 2D image
generation, yet it remains a challenge to replicate their success in 3D shape
generation. In this paper, we propose leveraging multi-view depth, which
represents complex 3D shapes in a 2D data format that is easy to denoise. We
pair this representation with a diffusion model, MVDD, that is capable of
generating high-quality dense point clouds with 20K+ points with fine-grained
details. To enforce 3D consistency in multi-view depth, we introduce an
epipolar line segment attention that conditions the denoising step for a view
on its neighboring views. Additionally, a depth fusion module is incorporated
into diffusion steps to further ensure the alignment of depth maps. When
augmented with surface reconstruction, MVDD can also produce high-quality 3D
meshes. Furthermore, MVDD stands out in other tasks such as depth completion,
and can serve as a 3D prior, significantly boosting many downstream tasks, such
as GAN inversion. State-of-the-art results from extensive experiments
demonstrate MVDD’s excellent ability in 3D shape generation, depth completion,
and its potential as a 3D prior for downstream tasks.
HandDiffuse: Generative Controllers for Two-Hand Interactions via Diffusion Models
December 08, 2023
Pei Lin, Sihang Xu, Hongdi Yang, Yiran Liu, Xin Chen, Jingya Wang, Jingyi Yu, Lan Xu
Existing hand datasets are largely short-range, and their interactions are weak
due to the self-occlusion and self-similarity of hands, so they cannot yet meet
the need for interacting-hand motion generation. To address this data scarcity,
we propose HandDiffuse12.5M, a novel dataset that consists of temporal
sequences with strong two-hand interactions. HandDiffuse12.5M has the largest
scale and richest interactions among the existing two-hand datasets. We further
present a strong baseline method HandDiffuse for the controllable motion
generation of interacting hands using various controllers. Specifically, we
apply the diffusion model as the backbone and design two motion representations
for different controllers. To reduce artifacts, we also propose Interaction
Loss which explicitly quantifies the dynamic interaction process. Our
HandDiffuse enables various applications with vivid two-hand interactions,
i.e., motion in-betweening and trajectory control. Experiments show that our
method outperforms the state-of-the-art techniques in motion generation and can
also contribute to data augmentation for other datasets. Our dataset,
corresponding codes, and pre-trained models will be disseminated to the
community for future research towards two-hand interaction modeling.
DiffCMR: Fast Cardiac MRI Reconstruction with Diffusion Probabilistic Models
December 08, 2023
Tianqi Xiang, Wenjun Yue, Yiqun Lin, Jiewen Yang, Zhenkun Wang, Xiaomeng Li
Performing magnetic resonance imaging (MRI) reconstruction from under-sampled
k-space data can accelerate the procedure to acquire MRI scans and reduce
patients’ discomfort. The reconstruction problem is usually formulated as a
denoising task that removes the noise in under-sampled MRI image slices.
Although previous GAN-based methods have achieved good performance in image
denoising, they are difficult to train and require careful tuning of
hyperparameters. In this paper, we propose a novel MRI denoising framework
DiffCMR by leveraging conditional denoising diffusion probabilistic models.
Specifically, DiffCMR perceives conditioning signals from the under-sampled MRI
image slice and generates its corresponding fully-sampled MRI image slice.
During inference, we adopt a multi-round ensembling strategy to stabilize the
performance. We validate DiffCMR with cine reconstruction and T1/T2 mapping
tasks on MICCAI 2023 Cardiac MRI Reconstruction Challenge (CMRxRecon) dataset.
Results show that our method achieves state-of-the-art performance, exceeding
previous methods by a significant margin. Code is available at
https://github.com/xmed-lab/DiffCMR.
MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model
December 08, 2023
Kaiyu Song, Hanjiang Lai
Deep neural networks (DNNs) are vulnerable to adversarial perturbation, where
an imperceptible perturbation is added to the image that can fool the DNNs.
Diffusion-based adversarial purification focuses on using the diffusion model
to generate a clean image against such adversarial attacks. Unfortunately, the
generative process of the diffusion model is also inevitably affected by
adversarial perturbation, since the diffusion model is itself a deep network
whose input contains the adversarial perturbation. In this work, we propose
MimicDiffusion, a new diffusion-based adversarial purification technique, that
directly approximates the generative process of the diffusion model with the
clean image as input. Concretely, we analyze the differences between the guided
terms using the clean image and the adversarial sample. After that, we first
implement MimicDiffusion based on Manhattan distance. Then, we propose two
guidance terms to purify the adversarial perturbation and approximate the clean
diffusion model. Extensive experiments on three image datasets including
CIFAR-10, CIFAR-100, and ImageNet with three classifier backbones including
WideResNet-70-16, WideResNet-28-10, and ResNet50 demonstrate that
MimicDiffusion performs significantly better than the state-of-the-art
baselines. On CIFAR-10, CIFAR-100, and ImageNet, it achieves 92.67\%, 61.35\%,
and 61.53\% average robust accuracy, which are 18.49\%, 13.23\%, and 17.64\%
higher, respectively. The code is available in the supplementary material.
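To make the Manhattan-distance guidance idea concrete, here is a minimal sketch of one guided reverse step in which the denoised estimate is nudged toward the (adversarial) input via the gradient of an L1 distance; the noise-predictor interface, the DDIM-style update, and the guidance scale are illustrative assumptions rather than the paper's exact formulation.
    import torch

    def guided_reverse_step(model, x_t, t, x_adv, a_bar_t, a_bar_prev, scale=1.0):
        # One DDIM-style reverse step with Manhattan (L1) distance guidance:
        # keep the denoised estimate close to the input image while the diffusion
        # prior suppresses the adversarial perturbation.
        x_t = x_t.detach().requires_grad_(True)
        eps = model(x_t, t)                                   # assumed noise predictor
        x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
        dist = (x0_hat - x_adv).abs().sum()                   # Manhattan distance
        grad = torch.autograd.grad(dist, x_t)[0]
        x_prev = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps
        return (x_prev - scale * grad).detach()

    if __name__ == "__main__":
        toy_model = lambda x, t: 0.1 * x                      # stand-in noise predictor
        x_adv = torch.randn(1, 3, 32, 32)                     # adversarial input
        x_t = torch.randn_like(x_adv)
        step = guided_reverse_step(toy_model, x_t, 500, x_adv,
                                   torch.tensor(0.5), torch.tensor(0.7))
        print(step.shape)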
Diffence: Fencing Membership Privacy With Diffusion Models
December 07, 2023
Yuefeng Peng, Ali Naseh, Amir Houmansadr
Deep learning models, while achieving remarkable performance across various
tasks, are vulnerable to membership inference attacks, wherein adversaries identify
if a specific data point was part of a model’s training set. This
susceptibility raises substantial privacy concerns, especially when models are
trained on sensitive datasets. Current defense methods often struggle to
provide robust protection without hurting model utility, and they often require
retraining the model or using extra data. In this work, we introduce a novel
defense framework against membership attacks by leveraging generative models.
The key intuition of our defense is to remove the differences between member
and non-member inputs which can be used to perform membership attacks, by
re-generating input samples before feeding them to the target model. Therefore,
our defense works \emph{pre-inference}, which is unlike prior defenses that are
either training-time (modify the model) or post-inference time (modify the
model’s output).
A unique feature of our defense is that it works on input samples only,
without modifying the training or inference phase of the target model.
Therefore, it can be cascaded with other defense mechanisms as we demonstrate
through experiments. Through extensive experimentation, we show that our
approach can serve as a robust plug-n-play defense mechanism, enhancing
membership privacy without compromising model utility in both baseline and
defended settings. For example, our method enhanced the effectiveness of recent
state-of-the-art defenses, reducing attack accuracy by an average of 5.7\% to
12.4\% across three datasets, without any impact on the model’s accuracy. By
integrating our method with prior defenses, we achieve new state-of-the-art
performance in the privacy-utility trade-off.
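Because the defense operates purely pre-inference, it can be sketched as a thin wrapper that re-generates each input before the unmodified classifier sees it; regenerate is a hypothetical callable (e.g., a few noise-and-denoise diffusion steps), not the authors' implementation.
    import torch

    class DiffenceWrapper(torch.nn.Module):
        # Pre-inference defense sketch: re-generate each input with a generative
        # model before passing it to the (unmodified) target classifier.
        def __init__(self, classifier, regenerate):
            super().__init__()
            self.classifier = classifier
            self.regenerate = regenerate

        @torch.no_grad()
        def forward(self, x):
            return self.classifier(self.regenerate(x))

    if __name__ == "__main__":
        clf = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
        regen = lambda x: (x + 0.1 * torch.randn_like(x)).clamp(-1, 1)  # placeholder
        defended = DiffenceWrapper(clf, regen)
        print(defended(torch.randn(4, 3, 32, 32)).shape)                # (4, 10)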
NeuSD: Surface Completion with Multi-View Text-to-Image Diffusion
December 07, 2023
Savva Ignatyev, Daniil Selikhanovych, Oleg Voynov, Yiqun Wang, Peter Wonka, Stamatios Lefkimmiatis, Evgeny Burnaev
We present a novel method for 3D surface reconstruction from multiple images
where only a part of the object of interest is captured. Our approach builds on
two recent developments: surface reconstruction using neural radiance fields
for the reconstruction of the visible parts of the surface, and guidance of
pre-trained 2D diffusion models in the form of Score Distillation Sampling
(SDS) to complete the shape in unobserved regions in a plausible manner. We
introduce three components. First, we suggest employing normal maps as a pure
geometric representation for SDS instead of color renderings which are
entangled with the appearance information. Second, we introduce the freezing of
the SDS noise during training which results in more coherent gradients and
better convergence. Third, we propose Multi-View SDS as a way to condition the
generation of the non-observable part of the surface without fine-tuning or
making changes to the underlying 2D Stable Diffusion model. We evaluate our
approach on the BlendedMVS dataset demonstrating significant qualitative and
quantitative improvements over competing methods.
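The "freeze the SDS noise" idea can be illustrated with a small sketch in which the noise sample used inside the SDS gradient is drawn once and reused across iterations; the noise predictor eps_pred_fn, the schedule value, and the direct update on the rendered normals are stand-in assumptions, not the paper's pipeline.
    import torch

    def frozen_noise_sds_grad(eps_pred_fn, rendering, t, alpha_bar_t, frozen_eps, w=1.0):
        # SDS-style gradient with the noise sample frozen across iterations.
        noisy = alpha_bar_t.sqrt() * rendering + (1 - alpha_bar_t).sqrt() * frozen_eps
        eps_hat = eps_pred_fn(noisy, t)                       # assumed 2D diffusion prior
        return w * (eps_hat - frozen_eps)                     # applied as a gradient

    if __name__ == "__main__":
        eps_fn = lambda x, t: 0.05 * x                        # stand-in noise predictor
        normals = torch.rand(1, 3, 64, 64)                    # rendered normal map
        frozen_eps = torch.randn_like(normals)                # drawn once, then reused
        for _ in range(3):
            g = frozen_noise_sds_grad(eps_fn, normals, 500, torch.tensor(0.5), frozen_eps)
            normals = normals - 0.1 * g                       # gradient-style update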
Memory Triggers: Unveiling Memorization in Text-To-Image Generative Models through Word-Level Duplication
December 06, 2023
Ali Naseh, Jaechul Roh, Amir Houmansadr
Diffusion-based models, such as the Stable Diffusion model, have
revolutionized text-to-image synthesis with their ability to produce
high-quality, high-resolution images. These advancements have prompted
significant progress in image generation and editing tasks. However, these
models also raise concerns due to their tendency to memorize and potentially
replicate exact training samples, posing privacy risks and enabling adversarial
attacks. Duplication in training datasets is recognized as a major factor
contributing to memorization, and various forms of memorization have been
studied so far. This paper focuses on two distinct and underexplored types of
duplication that lead to replication during inference in diffusion-based
models, particularly in the Stable Diffusion model. We delve into these
lesser-studied duplication phenomena and their implications through two case
studies, aiming to contribute to the safer and more responsible use of
generative models in various applications.
WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on
December 06, 2023
Xujie Zhang, Xiu Li, Michael Kampffmeyer, Xin Dong, Zhenyu Xie, Feida Zhu, Haoye Dong, Xiaodan Liang
Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image
onto a target person. While existing methods focus on warping the garment to
fit the body pose, they often overlook the synthesis quality around the
garment-skin boundary and realistic effects like wrinkles and shadows on the
warped garments. These limitations greatly reduce the realism of the generated
results and hinder the practical application of VITON techniques. Leveraging
the notable success of diffusion-based models in cross-modal image synthesis,
some recent diffusion-based methods have ventured to tackle this issue.
However, they tend to either consume a significant amount of training resources
or struggle to achieve realistic try-on effects and retain garment details. For
efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the
warping-based and diffusion-based paradigms via a novel informative and local
garment feature attention mechanism. Specifically, WarpDiffusion incorporates
local texture attention to reduce resource consumption and uses a novel
auto-mask module that effectively retains only the critical areas of the warped
garment while disregarding unrealistic or erroneous portions. Notably,
WarpDiffusion can be integrated as a plug-and-play component into existing
VITON methodologies, elevating their synthesis quality. Extensive experiments
on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the
superiority of WarpDiffusion, surpassing state-of-the-art methods both
qualitatively and quantitatively.
TokenCompose: Grounding Diffusion with Token-level Supervision
December 06, 2023
Zirui Wang, Zhizhou Sha, Zheng Ding, Yilin Wang, Zhuowen Tu
We present TokenCompose, a Latent Diffusion Model for text-to-image
generation that achieves enhanced consistency between user-specified text
prompts and model-generated images. Despite its tremendous success, the
standard denoising process in the Latent Diffusion Model takes text prompts as
conditions only, absent explicit constraint for the consistency between the
text prompts and the image contents, leading to unsatisfactory results for
composing multiple object categories. TokenCompose aims to improve
multi-category instance composition by introducing the token-wise consistency
terms between the image content and object segmentation maps in the finetuning
stage. TokenCompose can be applied directly to the existing training pipeline
of text-conditioned diffusion models without extra human labeling information.
By finetuning Stable Diffusion, the model exhibits significant improvements in
multi-category instance composition and enhanced photorealism for its generated
images.
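A minimal sketch of a token-wise consistency term, assuming cross-attention maps for selected object tokens and matching segmentation masks are available; the exact losses used for finetuning in the paper may differ.
    import torch

    def token_consistency_loss(attn_maps, masks, eps=1e-6):
        # attn_maps: (B, K, H, W) cross-attention maps for K object tokens;
        # masks:     (B, K, H, W) binary segmentation maps for the same objects.
        # Push each token's attention mass inside its object region.
        inside = (attn_maps * masks).flatten(2).sum(-1)
        total = attn_maps.flatten(2).sum(-1) + eps
        return (1.0 - inside / total).mean()

    if __name__ == "__main__":
        attn = torch.rand(2, 3, 16, 16)                       # 3 noun tokens
        masks = (torch.rand(2, 3, 16, 16) > 0.5).float()
        print(token_consistency_loss(attn, masks).item())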
DiffusionSat: A Generative Foundation Model for Satellite Imagery
December 06, 2023
Samar Khanna, Patrick Liu, Linqi Zhou, Chenlin Meng, Robin Rombach, Marshall Burke, David Lobell, Stefano Ermon
Diffusion models have achieved state-of-the-art results on many modalities
including images, speech, and video. However, existing models are not tailored
to support remote sensing data, which is widely used in important applications
including environmental monitoring and crop-yield prediction. Satellite images
are significantly different from natural images – they can be multi-spectral,
irregularly sampled across time – and existing diffusion models trained on
images from the Web do not support them. Furthermore, remote sensing data is
inherently spatio-temporal, requiring conditional generation tasks not
supported by traditional methods based on captions or images. In this paper, we
present DiffusionSat, to date the largest generative foundation model trained
on a collection of publicly available large, high-resolution remote sensing
datasets. As text-based captions are sparsely available for satellite images,
we incorporate the associated metadata such as geolocation as conditioning
information. Our method produces realistic samples and can be used to solve
multiple generative tasks including temporal generation, superresolution given
multi-spectral inputs and in-painting. Our method outperforms previous
state-of-the-art methods for satellite image generation and is the first
large-scale $\textit{generative}$ foundation model for satellite imagery.
Context Diffusion: In-Context Aware Image Generation
December 06, 2023
Ivona Najdenkoska, Animesh Sinha, Abhimanyu Dubey, Dhruv Mahajan, Vignesh Ramanathan, Filip Radenovic
We propose Context Diffusion, a diffusion-based framework that enables image
generation models to learn from visual examples presented in context. Recent
work tackles such in-context learning for image generation, where a query image
is provided alongside context examples and text prompts. However, the quality
and fidelity of the generated images deteriorate when the prompt is not
present, demonstrating that these models are unable to truly learn from the
visual context. To address this, we propose a novel framework that separates
the encoding of the visual context from the preservation of the query image
structure. This enables the model to learn from the visual context and text
prompts together, as well as from either one alone. Furthermore, we enable our model to
handle few-shot settings, to effectively address diverse in-context learning
scenarios. Our experiments and user study demonstrate that Context Diffusion
excels in both in-domain and out-of-domain tasks, resulting in an overall
enhancement in image quality and fidelity compared to counterpart models.
FoodFusion: A Latent Diffusion Model for Realistic Food Image Generation
December 06, 2023
Olivia Markham, Yuhao Chen, Chi-en Amy Tai, Alexander Wong
Current state-of-the-art image generation models such as Latent Diffusion
Models (LDMs) have demonstrated the capacity to produce visually striking
food-related images. However, these generated images often exhibit an artistic
or surreal quality that diverges from the authenticity of real-world food
representations. This inadequacy renders them impractical for applications
requiring realistic food imagery, such as training models for image-based
dietary assessment. To address these limitations, we introduce FoodFusion, a
Latent Diffusion model engineered specifically for the faithful synthesis of
realistic food images from textual descriptions. The development of the
FoodFusion model involves harnessing an extensive array of open-source food
datasets, resulting in over 300,000 curated image-caption pairs. Additionally,
we propose and employ two distinct data cleaning methodologies to ensure that
the resulting image-text pairs maintain both realism and accuracy. The
FoodFusion model, thus trained, demonstrates a remarkable ability to generate
food images that exhibit a significant improvement in terms of both realism and
diversity over the publicly available image generation models. We openly share
the dataset and fine-tuned models to support advancements in this critical
field of food image synthesis at https://bit.ly/genai4good.
Schrodinger Bridges Beat Diffusion Models on Text-to-Speech Synthesis
December 06, 2023
Zehua Chen, Guande He, Kaiwen Zheng, Xu Tan, Jun Zhu
In text-to-speech (TTS) synthesis, diffusion models have achieved promising
generation quality. However, because of the pre-defined data-to-noise diffusion
process, their prior distribution is restricted to a noisy representation,
which provides little information about the generation target. In this work, we
present a novel TTS system, Bridge-TTS, making the first attempt to substitute
the noisy Gaussian prior in established diffusion-based TTS methods with a
clean and deterministic one, which provides strong structural information of
the target. Specifically, we leverage the latent representation obtained from
text input as our prior, and build a fully tractable Schrodinger bridge between
it and the ground-truth mel-spectrogram, leading to a data-to-data process.
Moreover, the tractability and flexibility of our formulation allow us to
empirically study the design spaces such as noise schedules, as well as to
develop stochastic and deterministic samplers. Experimental results on the
LJ-Speech dataset illustrate the effectiveness of our method in terms of both
synthesis quality and sampling efficiency, significantly outperforming our
diffusion counterpart Grad-TTS in 50-step/1000-step synthesis and strong fast
TTS models in few-step scenarios. Project page: https://bridge-tts.github.io/
Diffused Task-Agnostic Milestone Planner
December 06, 2023
Mineui Hong, Minjae Kang, Songhwai Oh
Addressing decision-making problems using sequence modeling to predict future
trajectories shows promising results in recent years. In this paper, we take a
step further to leverage the sequence predictive method in wider areas such as
long-term planning, vision-based control, and multi-task decision-making. To
this end, we propose a method to utilize a diffusion-based generative sequence
model to plan a series of milestones in a latent space and to have an agent
follow the milestones to accomplish a given task. The proposed method can learn
control-relevant, low-dimensional latent representations of milestones, which
makes it possible to efficiently perform long-term planning and vision-based
control. Furthermore, our approach exploits generation flexibility of the
diffusion model, which makes it possible to plan diverse trajectories for
multi-task decision-making. We demonstrate the proposed method across offline
reinforcement learning (RL) benchmarks and a visual manipulation environment.
The results show that our approach outperforms offline RL methods in solving
long-horizon, sparse-reward tasks and multi-task problems, while also achieving
the state-of-the-art performance on the most challenging vision-based
manipulation benchmark.
DiffPMAE: Diffusion Masked Autoencoders for Point Cloud Reconstruction
December 06, 2023
Yanlong Li, Chamara Madarasingha, Kanchana Thilakarathna
Point cloud streaming is increasingly getting popular, evolving into the norm
for interactive service delivery and the future Metaverse. However, the
substantial volume of data associated with point clouds presents numerous
challenges, particularly in terms of high bandwidth consumption and large
storage capacity. Despite various solutions proposed thus far, with a focus on
point cloud compression, upsampling, and completion, these
reconstruction-related methods continue to fall short in delivering high
fidelity point cloud output. As a solution, we propose DiffPMAE, an effective
point cloud reconstruction architecture. Inspired by self-supervised
learning concepts, we combine Masked Auto-Encoding and Diffusion Model
mechanism to remotely reconstruct point cloud data. By the nature of this
reconstruction process, DiffPMAE can be extended to many related downstream
tasks including point cloud compression, upsampling and completion. Leveraging
ShapeNet-55 and ModelNet datasets with over 60000 objects, we validate that
DiffPMAE exceeds many state-of-the-art methods in terms of auto-encoding and
the downstream tasks considered.
ReconFusion: 3D Reconstruction with Diffusion Priors
December 05, 2023
Rundi Wu, Ben Mildenhall, Philipp Henzler, Keunhong Park, Ruiqi Gao, Daniel Watson, Pratul P. Srinivasan, Dor Verbin, Jonathan T. Barron, Ben Poole, Aleksander Holynski
3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at
rendering photorealistic novel views of complex scenes. However, recovering a
high-quality NeRF typically requires tens to hundreds of input images,
resulting in a time-consuming capture process. We present ReconFusion to
reconstruct real-world scenes using only a few photos. Our approach leverages a
diffusion prior for novel view synthesis, trained on synthetic and multiview
datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel
camera poses beyond those captured by the set of input images. Our method
synthesizes realistic geometry and texture in underconstrained regions while
preserving the appearance of observed regions. We perform an extensive
evaluation across various real-world datasets, including forward-facing and
360-degree scenes, demonstrating significant performance improvements over
previous few-view NeRF reconstruction approaches.
DiffusionPCR: Diffusion Models for Robust Multi-Step Point Cloud Registration
December 05, 2023
Zhi Chen, Yufan Ren, Tong Zhang, Zheng Dang, Wenbing Tao, Sabine Süsstrunk, Mathieu Salzmann
Point Cloud Registration (PCR) estimates the relative rigid transformation
between two point clouds. We propose formulating PCR as a denoising diffusion
probabilistic process, mapping noisy transformations to the ground truth.
However, using diffusion models for PCR has nontrivial challenges, such as
adapting a generative model to a discriminative task and leveraging the
estimated nonlinear transformation from the previous step. Instead of training
a diffusion model to directly map pure noise to ground truth, we map the
predictions of an off-the-shelf PCR model to ground truth. The predictions of
off-the-shelf models are often imperfect, especially in challenging cases where
the two point clouds have low overlap, and thus could be seen as noisy
versions of the real rigid transformation. In addition, we transform the
rotation matrix into a spherical linear space for interpolation between samples
in the forward process, and convert rigid transformations into auxiliary
information to implicitly exploit last-step estimations in the reverse process.
As a result, conditioned on time step, the denoising model adapts to the
increasing accuracy across steps and refines registrations. Our extensive
experiments showcase the effectiveness of our DiffusionPCR, yielding
state-of-the-art registration recall rates (95.3%/81.6%) on 3DMatch and
3DLoMatch. The code will be made public upon publication.
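The rotation interpolation step can be sketched with SciPy's Slerp: intermediate forward-process rotations are obtained by spherical-linear interpolation between the ground truth and a noisy prediction. The tau parameterization below is an assumption for illustration.
    import numpy as np
    from scipy.spatial.transform import Rotation, Slerp

    def interpolate_rotation(R_pred, R_gt, tau):
        # Spherical-linear interpolation between the ground-truth rotation and a
        # noisy predicted rotation: tau=0 returns R_gt, tau=1 returns R_pred.
        rots = Rotation.from_matrix(np.stack([R_gt, R_pred]))
        return Slerp([0.0, 1.0], rots)([tau]).as_matrix()[0]

    if __name__ == "__main__":
        R_gt = Rotation.from_euler("z", 30, degrees=True).as_matrix()
        R_pred = Rotation.from_euler("z", 90, degrees=True).as_matrix()
        print(interpolate_rotation(R_pred, R_gt, 0.5))        # roughly 60 deg about z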
Diffusion-SS3D: Diffusion Model for Semi-supervised 3D Object Detection
December 05, 2023
Cheng-Ju Ho, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
Semi-supervised object detection is crucial for 3D scene understanding,
efficiently addressing the limitation of acquiring large-scale 3D bounding box
annotations. Existing methods typically employ a teacher-student framework with
pseudo-labeling to leverage unlabeled point clouds. However, producing reliable
pseudo-labels in a diverse 3D space still remains challenging. In this work, we
propose Diffusion-SS3D, a new perspective of enhancing the quality of
pseudo-labels via the diffusion model for semi-supervised 3D object detection.
Specifically, we inject noise to produce corrupted 3D object size and class
label distributions, and then utilize the diffusion model as a denoising
process to obtain bounding box outputs. Moreover, we integrate the diffusion
model into the teacher-student framework, so that the denoised bounding boxes
can be used to improve pseudo-label generation, as well as the entire
semi-supervised learning process. We conduct experiments on the ScanNet and SUN
RGB-D benchmark datasets to demonstrate that our approach achieves
state-of-the-art performance against existing methods. We also present
extensive analysis to understand how our diffusion model design affects
performance in semi-supervised learning.
Deterministic Guidance Diffusion Model for Probabilistic Weather Forecasting
December 05, 2023
Donggeun Yoon, Minseok Seo, Doyi Kim, Yeji Choi, Donghyeon Cho
Weather forecasting requires not only accuracy but also the ability to
perform probabilistic prediction. However, deterministic weather forecasting
methods do not support probabilistic predictions, and conversely, probabilistic
models tend to be less accurate. To address these challenges, in this paper, we
introduce the \textbf{\textit{D}}eterministic \textbf{\textit{G}}uidance
\textbf{\textit{D}}iffusion \textbf{\textit{M}}odel (DGDM) for probabilistic
weather forecasting, integrating benefits of both deterministic and
probabilistic approaches. During the forward process, both the deterministic
and probabilistic models are trained end-to-end. In the reverse process,
weather forecasting leverages the predicted result from the deterministic
model, using it as an intermediate starting point for the probabilistic model. By
fusing deterministic models with probabilistic models in this manner, DGDM is
capable of providing accurate forecasts while also offering probabilistic
predictions. To evaluate DGDM, we assess it on the global weather forecasting
dataset (WeatherBench) and the common video frame prediction benchmark (Moving
MNIST). We also introduce and evaluate the Pacific Northwest Windstorm
(PNW)-Typhoon weather satellite dataset to verify the effectiveness of DGDM in
high-resolution regional forecasting. As a result of our experiments, DGDM
achieves state-of-the-art results not only in global forecasting but also in
regional forecasting. The code is available at:
\url{https://github.com/DongGeun-Yoon/DGDM}.
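A hedged sketch of using the deterministic forecast as an intermediate starting point: the forecast is diffused to a chosen timestep and the probabilistic reverse process is run from there. The reverse_step_fn interface, the schedule, and the t_start value are illustrative assumptions.
    import torch

    def reverse_from_deterministic(det_forecast, reverse_step_fn, alpha_bars, t_start):
        # Diffuse the deterministic forecast to an intermediate timestep, then run
        # the probabilistic reverse process from there instead of from pure noise.
        a_bar = alpha_bars[t_start]
        x_t = a_bar.sqrt() * det_forecast + (1 - a_bar).sqrt() * torch.randn_like(det_forecast)
        for t in range(t_start, 0, -1):
            x_t = reverse_step_fn(x_t, t)                     # assumed reverse step
        return x_t

    if __name__ == "__main__":
        alpha_bars = torch.cumprod(torch.linspace(0.9999, 0.98, 1000), dim=0)
        det = torch.randn(1, 1, 32, 32)                       # deterministic forecast
        step = lambda x, t: 0.999 * x                         # stand-in reverse step
        print(reverse_from_deterministic(det, step, alpha_bars, t_start=250).shape)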
BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models
December 05, 2023
Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang
Diffusion models have made tremendous progress in text-driven image and video
generation. Now text-to-image foundation models are widely applied to various
downstream image synthesis tasks, such as controllable image generation and
image editing, while downstream video synthesis tasks are less explored for
several reasons. First, it requires huge memory and compute overhead to train a
video generation foundation model. Even with video foundation models,
additional costly training is still required for downstream video synthesis
tasks. Second, although some works extend image diffusion models into videos in
a training-free manner, temporal consistency cannot be well kept. Finally,
these adaption methods are specifically designed for one task and fail to
generalize to different downstream video synthesis tasks. To mitigate these
issues, we propose a training-free general-purpose video synthesis framework,
coined as BIVDiff, via bridging specific image diffusion models and general
text-to-video foundation diffusion models. Specifically, we first use an image
diffusion model (like ControlNet, Instruct Pix2Pix) for frame-wise video
generation, then perform Mixed Inversion on the generated video, and finally
input the inverted latents into the video diffusion model for temporal
smoothing. Decoupling image and video models enables flexible image model
selection for different purposes, which endows the framework with strong task
generalization and high efficiency. To validate the effectiveness and general
use of BIVDiff, we perform a wide range of video generation tasks, including
controllable video generation, video editing, video inpainting, and outpainting.
Our project page is available at https://bivdiff.github.io.
Analyzing and Improving the Training Dynamics of Diffusion Models
December 05, 2023
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, Samuli Laine
cs.CV, cs.AI, cs.LG, cs.NE, stat.ML
Diffusion models currently dominate the field of data-driven image synthesis
with their unparalleled scaling to large datasets. In this paper, we identify
and rectify several causes for uneven and ineffective training in the popular
ADM diffusion model architecture, without altering its high-level structure.
Observing uncontrolled magnitude changes and imbalances in both the network
activations and weights over the course of training, we redesign the network
layers to preserve activation, weight, and update magnitudes on expectation. We
find that systematic application of this philosophy eliminates the observed
drifts and imbalances, resulting in considerably better networks at equal
computational complexity. Our modifications improve the previous record FID of
2.41 in ImageNet-512 synthesis to 1.81, achieved using fast deterministic
sampling.
As an independent contribution, we present a method for setting the
exponential moving average (EMA) parameters post-hoc, i.e., after completing
the training run. This allows precise tuning of EMA length without the cost of
performing several training runs, and reveals its surprising interactions with
network architecture, training time, and guidance.
Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler
December 05, 2023
Philippe Gonzalez, Zheng-Hua Tan, Jan Østergaard, Jesper Jensen, Tommy Sonne Alstrøm, Tobias May
Diffusion models are a new class of generative models that have recently been
applied to speech enhancement successfully. Previous works have demonstrated
their superior performance in mismatched conditions compared to
state-of-the-art discriminative models. However, this was investigated with a single
database for training and another one for testing, which makes the results
highly dependent on the particular databases. Moreover, recent developments
from the image generation literature remain largely unexplored for speech
enhancement. These include several design aspects of diffusion models, such as
the noise schedule or the reverse sampler. In this work, we systematically
assess the generalization performance of a diffusion-based speech enhancement
model by using multiple speech, noise and binaural room impulse response (BRIR)
databases to simulate mismatched acoustic conditions. We also experiment with a
noise schedule and a sampler that have not been applied to speech enhancement
before. We show that the proposed system substantially benefits from using
multiple databases for training, and achieves superior performance compared to
state-of-the-art discriminative models in both matched and mismatched
conditions. We also show that a Heun-based sampler achieves superior
performance at a smaller computational cost compared to a sampler commonly used
for speech enhancement.
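For reference, a generic Heun (second-order) sampler for the probability-flow ODE, as popularized by the EDM family of samplers, looks as follows; the denoiser interface and sigma schedule are assumptions, and the paper's preconditioning and conditioning details are omitted.
    import torch

    @torch.no_grad()
    def heun_sampler(denoise, x, sigmas):
        # denoise(x, sigma) is assumed to return an estimate of the clean signal at
        # noise level sigma; sigmas is a decreasing schedule ending at 0.
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            d = (x - denoise(x, sigma)) / sigma                # ODE derivative dx/dsigma
            x_euler = x + (sigma_next - sigma) * d             # Euler predictor
            if sigma_next > 0:                                 # Heun corrector
                d_next = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
                x = x + (sigma_next - sigma) * 0.5 * (d + d_next)
            else:
                x = x_euler
        return x

    if __name__ == "__main__":
        denoise = lambda x, s: x / (1 + s ** 2)                # stand-in denoiser
        sigmas = torch.tensor([10.0, 5.0, 2.0, 1.0, 0.5, 0.0])
        x0 = sigmas[0] * torch.randn(1, 1, 128)                # e.g. a noisy feature map
        print(heun_sampler(denoise, x0, sigmas).shape)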
Stable Diffusion Exposed: Gender Bias from Prompt to Image
December 05, 2023
Yankun Wu, Yuta Nakashima, Noa Garcia
Recent studies have highlighted biases in generative models, shedding light
on their predisposition towards gender-based stereotypes and imbalances. This
paper contributes to this growing body of research by introducing an evaluation
protocol designed to automatically analyze the impact of gender indicators on
Stable Diffusion images. Leveraging insights from prior work, we explore how
gender indicators not only affect gender presentation but also the
representation of objects and layouts within the generated images. Our findings
include the existence of differences in the depiction of objects, such as
instruments tailored for specific genders, and shifts in overall layouts. We
also reveal that neutral prompts tend to produce images more aligned with
masculine prompts than their feminine counterparts, providing valuable insights
into the nuanced gender biases inherent in Stable Diffusion.
Diffusion Noise Feature: Accurate and Fast Generated Image Detection
December 05, 2023
Yichi Zhang, Xiaogang Xu
Generative models have reached an advanced stage where they can produce
remarkably realistic images. However, this remarkable generative capability
also introduces the risk of disseminating false or misleading information.
Notably, existing image detectors for generated images encounter challenges
such as low accuracy and limited generalization. This paper aims to address
this issue by seeking a representation with strong generalization capabilities
to enhance the detection of generated images. Our investigation has revealed
that real and generated images display distinct latent Gaussian representations
when subjected to an inverse diffusion process within a pre-trained diffusion
model. Exploiting this disparity, we can amplify subtle artifacts in generated
images. Building upon this insight, we introduce a novel image representation
known as Diffusion Noise Feature (DNF). DNF is an ensemble representation that
estimates the noise generated during the inverse diffusion process. A simple
classifier, e.g., ResNet, trained on DNF achieves high accuracy, robustness,
and generalization capabilities for detecting generated images, even from
previously unseen classes or models. We conducted experiments using a widely
recognized and standard dataset, achieving state-of-the-art detection
performance.
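A simplified sketch of building a DNF-style feature: a short deterministic (DDIM-like) inversion is run with a pre-trained noise predictor and the predicted noise maps are stacked as the representation fed to a small classifier. The predictor, schedule, and step choice below are stand-ins, not the paper's configuration.
    import torch

    @torch.no_grad()
    def diffusion_noise_feature(eps_model, x, alpha_bars, steps):
        # Run a few deterministic inversion steps and collect the predicted noise.
        feats, x_t = [], x
        for t, t_next in zip(steps[:-1], steps[1:]):           # increasing timesteps
            eps = eps_model(x_t, t)
            feats.append(eps)
            x0_hat = (x_t - (1 - alpha_bars[t]).sqrt() * eps) / alpha_bars[t].sqrt()
            x_t = alpha_bars[t_next].sqrt() * x0_hat + (1 - alpha_bars[t_next]).sqrt() * eps
        return torch.cat(feats, dim=1)                         # channel-stacked noise maps

    if __name__ == "__main__":
        eps_model = lambda x, t: 0.1 * x                       # stand-in noise predictor
        alpha_bars = torch.linspace(0.9999, 0.01, 1000)
        image = torch.randn(2, 3, 64, 64)
        dnf = diffusion_noise_feature(eps_model, image, alpha_bars, steps=[0, 100, 200, 300])
        print(dnf.shape)                                       # (2, 9, 64, 64)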
GeNIe: Generative Hard Negative Images Through Diffusion
December 05, 2023
Soroush Abbasi Koohpayegani, Anuj Singh, K L Navaneet, Hadi Jamali-Rad, Hamed Pirsiavash
Data augmentation is crucial in training deep models, preventing them from
overfitting to limited data. Common data augmentation methods are effective,
but recent advancements in generative AI, such as diffusion models for image
generation, enable more sophisticated augmentation techniques that produce data
resembling natural images. We recognize that augmented samples closer to the
ideal decision boundary of a classifier are particularly effective and
efficient in guiding the learning process. We introduce GeNIe which leverages a
diffusion model conditioned on a text prompt to merge contrasting data points
(an image from the source category and a text prompt from the target category)
to generate challenging samples for the target category. Inspired by recent
image editing methods, we limit the number of diffusion iterations and the
amount of noise. This ensures that the generated image retains low-level and
contextual features from the source image, potentially conflicting with the
target category. Our extensive experiments, in both few-shot and long-tail
distribution settings, demonstrate the effectiveness of our novel augmentation
method, especially benefiting categories with a limited number of examples.
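In spirit, the augmentation resembles an image-to-image diffusion call with a source image, a target-category prompt, and limited noise. A hedged sketch using the Hugging Face diffusers img2img pipeline is shown below; the model id, file paths, and strength value are assumptions, not the paper's setup.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    # Load a text-conditioned diffusion model (the model id is an assumption).
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Source image from the source category, prompt from the target category.
    source = Image.open("source_cat.jpg").convert("RGB").resize((512, 512))
    hard_negative = pipe(
        prompt="a photo of a dog",      # target category
        image=source,                   # retains low-level/contextual source features
        strength=0.4,                   # limited noise / few diffusion iterations
        guidance_scale=7.5,
    ).images[0]
    hard_negative.save("hard_negative_dog.png")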
Kernel Diffusion: An Alternate Approach to Blind Deconvolution
December 04, 2023
Yash Sanghvi, Yiheng Chi, Stanley H. Chan
Blind deconvolution problems are severely ill-posed because neither the
underlying signal nor the forward operator is known exactly.
Conventionally, these problems are solved by alternating between estimation of
the image and kernel while keeping the other fixed. In this paper, we show that
this framework is flawed because of its tendency to get trapped in local minima
and, instead, suggest the use of a kernel estimation strategy with a non-blind
solver. This framework is employed by a diffusion method which is trained to
sample the blur kernel from the conditional distribution with guidance from a
pre-trained non-blind solver. The proposed diffusion method leads to
state-of-the-art results on both synthetic and real blur datasets.
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
December 04, 2023
Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler
Monocular depth estimation is a fundamental computer vision task. Recovering
3D depth from a single image is geometrically ill-posed and requires scene
understanding, so it is not surprising that the rise of deep learning has led
to a breakthrough. The impressive progress of monocular depth estimators has
mirrored the growth in model capacity, from relatively modest CNNs to large
Transformer architectures. Still, monocular depth estimators tend to struggle
when presented with images with unfamiliar content and layout, since their
knowledge of the visual world is restricted by the data seen during training,
and challenged by zero-shot generalization to new domains. This motivates us to
explore whether the extensive priors captured in recent generative diffusion
models can enable better, more generalizable depth estimation. We introduce
Marigold, a method for affine-invariant monocular depth estimation that is
derived from Stable Diffusion and retains its rich prior knowledge. The
estimator can be fine-tuned in a couple of days on a single GPU using only
synthetic training data. It delivers state-of-the-art performance across a wide
range of datasets, including over 20% performance gains in specific cases.
Project page: https://marigoldmonodepth.github.io.
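Affine-invariant depth is typically compared to ground truth only after solving for a global scale and shift; the least-squares alignment below is a standard evaluation-protocol sketch, not code from the paper.
    import numpy as np

    def align_affine_invariant(pred, gt, mask=None):
        # Solve for scale s and shift t minimizing ||s * pred + t - gt||^2 over
        # valid pixels, then return the aligned prediction.
        p, g = pred.ravel(), gt.ravel()
        if mask is not None:
            m = mask.ravel().astype(bool)
            p, g = p[m], g[m]
        A = np.stack([p, np.ones_like(p)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
        return s * pred + t

    if __name__ == "__main__":
        gt = np.random.rand(64, 64) * 10
        pred = 0.5 * gt + 2.0 + 0.01 * np.random.randn(64, 64)   # affine-shifted estimate
        aligned = align_affine_invariant(pred, gt)
        print(float(np.abs(aligned - gt).mean()))                # small after alignment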
DiffiT: Diffusion Vision Transformers for Image Generation
December 04, 2023
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, Arash Vahdat
Diffusion models with their powerful expressivity and high sample quality
have enabled many new applications and use-cases in various domains. For sample
generation, these models rely on a denoising neural network that generates
images by iterative denoising. Yet, the role of denoising network architecture
is not well-studied with most efforts relying on convolutional residual U-Nets.
In this paper, we study the effectiveness of vision transformers in
diffusion-based generative learning. Specifically, we propose a new model,
denoted as Diffusion Vision Transformers (DiffiT), which consists of a hybrid
hierarchical architecture with a U-shaped encoder and decoder. We introduce a
novel time-dependent self-attention module that allows attention layers to
adapt their behavior at different stages of the denoising process in an
efficient manner. We also introduce latent DiffiT which consists of transformer
model with the proposed self-attention layers, for high-resolution image
generation. Our results show that DiffiT is surprisingly effective in
generating high-fidelity images, and it achieves state-of-the-art (SOTA)
benchmarks on a variety of class-conditional and unconditional synthesis tasks.
In the latent space, DiffiT achieves a new SOTA FID score of 1.73 on
the ImageNet-256 dataset. Repository: https://github.com/NVlabs/DiffiT
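A simplified reading of time-dependent self-attention: the time embedding gets its own projections that are added when forming queries, keys, and values, so the attention pattern can adapt across denoising steps. The module below is an illustrative sketch, not the paper's exact block.
    import torch
    import torch.nn as nn

    class TimeDependentSelfAttention(nn.Module):
        # Time embedding contributes additively to the query/key/value projections.
        def __init__(self, dim, heads=4):
            super().__init__()
            self.heads, self.scale = heads, (dim // heads) ** -0.5
            self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
            self.qt, self.kt, self.vt = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, t_emb):
            # x: (B, N, D) spatial tokens, t_emb: (B, D) time embedding
            B, N, D = x.shape
            t = t_emb.unsqueeze(1)
            q = (self.q(x) + self.qt(t)).view(B, N, self.heads, -1).transpose(1, 2)
            k = (self.k(x) + self.kt(t)).view(B, N, self.heads, -1).transpose(1, 2)
            v = (self.v(x) + self.vt(t)).view(B, N, self.heads, -1).transpose(1, 2)
            attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
            return self.proj((attn @ v).transpose(1, 2).reshape(B, N, D))

    if __name__ == "__main__":
        block = TimeDependentSelfAttention(dim=64)
        print(block(torch.randn(2, 16, 64), torch.randn(2, 64)).shape)   # (2, 16, 64)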
Stochastic Optimal Control Matching
December 04, 2023
Carles Domingo-Enrich, Jiequn Han, Brandon Amos, Joan Bruna, Ricky T. Q. Chen
math.OC, cs.LG, cs.NA, math.NA, math.PR, stat.ML
Stochastic optimal control, which has the goal of driving the behavior of
noisy systems, is broadly applicable in science, engineering and artificial
intelligence. Our work introduces Stochastic Optimal Control Matching (SOCM), a
novel Iterative Diffusion Optimization (IDO) technique for stochastic optimal
control that stems from the same philosophy as the conditional score matching
loss for diffusion models. That is, the control is learned via a least squares
problem by trying to fit a matching vector field. The training loss, which is
closely connected to the cross-entropy loss, is optimized with respect to both
the control function and a family of reparameterization matrices which appear
in the matching vector field. The optimization with respect to the
reparameterization matrices aims at minimizing the variance of the matching
vector field. Experimentally, our algorithm achieves lower error than all the
existing IDO techniques for stochastic optimal control for three out of four
control problems, in some cases by an order of magnitude. The key idea
underlying SOCM is the path-wise reparameterization trick, a novel technique
that is of independent interest, e.g., for generative modeling. Code at
https://github.com/facebookresearch/SOC-matching
Conditional Variational Diffusion Models
December 04, 2023
Gabriel della Maggiora, Luis Alberto Croquevielle, Nikita Desphande, Harry Horsley, Thomas Heinis, Artur Yakimovich
cs.CV, cs.AI, cs.LG, stat.ML, I.2.6
Inverse problems aim to determine parameters from observations, a crucial
task in engineering and science. Lately, generative models, especially
diffusion models, have gained popularity in this area for their ability to
produce realistic solutions and their good mathematical properties. Despite
their success, an important drawback of diffusion models is their sensitivity
to the choice of variance schedule, which controls the dynamics of the
diffusion process. Fine-tuning this schedule for specific applications is
crucial but time-costly and does not guarantee an optimal result. We propose a
novel approach for learning the schedule as part of the training process. Our
method supports probabilistic conditioning on data, provides high-quality
solutions, and is flexible, proving able to adapt to different applications
with minimum overhead. This approach is tested in two unrelated inverse
problems: super-resolution microscopy and quantitative phase imaging, yielding
comparable or superior results to previous methods and fine-tuned diffusion
models. We conclude that fine-tuning the schedule by experimentation should be
avoided because it can be learned during training in a stable way that yields
better results.
Generalization by Adaptation: Diffusion-Based Domain Extension for Domain-Generalized Semantic Segmentation
December 04, 2023
Joshua Niemeijer, Manuel Schwonberg, Jan-Aike Termöhlen, Nico M. Schmidt, Tim Fingscheidt
When models, e.g., for semantic segmentation, are applied to images that are
vastly different from training data, the performance will drop significantly.
Domain adaptation methods try to overcome this issue, but need samples from the
target domain. However, this might not always be feasible for various reasons
and therefore domain generalization methods are useful as they do not require
any target data. We present a new diffusion-based domain extension (DIDEX)
method and employ a diffusion model to generate a pseudo-target domain with
diverse text prompts. In contrast to existing methods, this allows controlling
the style and content of the generated images and introducing a high
diversity. In a second step, we train a generalizing model by adapting towards
this pseudo-target domain. We outperform previous approaches by a large margin
across various datasets and architectures without using any real data. For the
generalization from GTA5, we improve state-of-the-art mIoU performance by 3.8%
absolute on average and for SYNTHIA by 11.8% absolute, marking a big step for
the generalization performance on these benchmarks. Code is available at
https://github.com/JNiemeijer/DIDEX
Fully Spiking Denoising Diffusion Implicit Models
December 04, 2023
Ryo Watanabe, Yusuke Mukuta, Tatsuya Harada
Spiking neural networks (SNNs) have garnered considerable attention owing to
their ability to run on neuromorphic devices with super-high speeds and
remarkable energy efficiencies. SNNs can replace conventional neural networks
in time- and energy-consuming applications. However, research on
generative models within SNNs remains limited, despite their advantages. In
particular, diffusion models are a powerful class of generative models, whose
image generation quality surpasses that of other generative models, such as
GANs. However, diffusion models are characterized by high computational costs
and long inference times owing to their iterative denoising feature. Therefore,
we propose a novel approach fully spiking denoising diffusion implicit model
(FSDDIM) to construct a diffusion model within SNNs and leverage the high speed
and low energy consumption features of SNNs via synaptic current learning
(SCL). SCL fills the gap in that diffusion models use a neural network to
estimate real-valued parameters of a predefined probabilistic distribution,
whereas SNNs output binary spike trains. The SCL enables us to complete the
entire generative process of diffusion models exclusively using SNNs. We
demonstrate that the proposed method outperforms the state-of-the-art fully
spiking generative model.
ResEnsemble-DDPM: Residual Denoising Diffusion Probabilistic Models for Ensemble Learning
December 04, 2023
Shi Zhenning, Dong Changsheng, Xie Xueshuo, Pan Bin, He Along, Li Tao
Nowadays, denoising diffusion probabilistic models have been adapted for many
image segmentation tasks. However, existing end-to-end models have already
demonstrated remarkable capabilities. Rather than using denoising diffusion
probabilistic models alone, integrating the abilities of both denoising
diffusion probabilistic models and existing end-to-end models can better
improve the performance of image segmentation. Based on this, we implicitly
introduce a residual term into the diffusion process and propose
ResEnsemble-DDPM, which seamlessly integrates the diffusion model and the
end-to-end model through ensemble learning. The output distributions of these
two models are strictly symmetric with respect to the ground truth
distribution, allowing us to integrate the two models by reducing the residual
term. Experimental results demonstrate that our ResEnsemble-DDPM can further
improve the capabilities of existing models. Furthermore, its ensemble learning
strategy can be generalized to other downstream tasks in image generation and
remains strongly competitive.
Diffusion Posterior Sampling for Nonlinear CT Reconstruction
December 03, 2023
Shudong Li, Matthew Tivnan, Yuan Shen, J. Webster Stayman
physics.med-ph, cs.CV, eess.IV, physics.comp-ph, J.3; I.4.4; I.4.5
Diffusion models have been demonstrated as powerful deep learning tools for
image generation in CT reconstruction and restoration. Recently, diffusion
posterior sampling, where a score-based diffusion prior is combined with a
likelihood model, has been used to produce high quality CT images given
low-quality measurements. This technique is attractive since it permits a
one-time, unsupervised training of a CT prior; which can then be incorporated
with an arbitrary data model. However, current methods only rely on a linear
model of x-ray CT physics to reconstruct or restore images. While it is common
to linearize the transmission tomography reconstruction problem, this is an
approximation to the true and inherently nonlinear forward model. We propose a
new method that solves the inverse problem of nonlinear CT image reconstruction
via diffusion posterior sampling. We implement a traditional unconditional
diffusion model by training a prior score function estimator, and apply Bayes'
rule to combine this prior with a measurement likelihood score function derived
from the nonlinear physical model to arrive at a posterior score function that
can be used to sample the reverse-time diffusion process. This plug-and-play
method allows incorporation of a diffusion-based prior with generalized
nonlinear CT image reconstruction into multiple CT system designs with
different forward models, without the need for any additional training. We
develop the algorithm that performs this reconstruction, including an
ordered-subsets variant for accelerated processing and demonstrate the
technique in both fully sampled low dose data and sparse-view geometries using
a single unsupervised training of the prior.
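The key ingredient is a likelihood score computed under the nonlinear transmission model rather than a linearized one. The sketch below differentiates a Gaussian log-likelihood through a Beer-Lambert forward model y = I0 * exp(-A x); the noise model, weighting, and toy system matrix are assumptions rather than the paper's exact formulation.
    import torch

    def nonlinear_ct_likelihood_grad(x0_hat, y, system_matrix, I0=1e5, sigma=1.0):
        # Gradient of a Gaussian measurement log-likelihood under the nonlinear
        # Beer-Lambert model, evaluated at the current denoised estimate. This is
        # the "likelihood score" added to the prior score in posterior sampling.
        x = x0_hat.detach().requires_grad_(True)
        y_bar = I0 * torch.exp(-(system_matrix @ x))           # expected measurements
        log_lik = -0.5 * ((y - y_bar) ** 2).sum() / sigma ** 2
        return torch.autograd.grad(log_lik, x)[0]

    if __name__ == "__main__":
        A = torch.rand(128, 64) * 0.01                          # toy system matrix
        x_true = torch.rand(64)
        y = 1e5 * torch.exp(-(A @ x_true))
        grad = nonlinear_ct_likelihood_grad(torch.zeros(64), y, A)
        print(grad.shape)                                        # (64,)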
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
December 03, 2023
Shengqu Cai, Duygu Ceylan, Matheus Gadelha, Chun-Hao Paul Huang, Tuanfeng Yang Wang, Gordon Wetzstein
Traditional 3D content creation tools empower users to bring their
imagination to life by giving them direct control over a scene’s geometry,
appearance, motion, and camera path. Creating computer-generated videos,
however, is a tedious manual process, which can be automated by emerging
text-to-video diffusion models. Despite great promise, video diffusion models
are difficult to control, hindering a user to apply their own creativity rather
than amplifying it. To address this challenge, we present a novel approach that
combines the controllability of dynamic 3D meshes with the expressivity and
editability of emerging diffusion models. For this purpose, our approach takes
an animated, low-fidelity rendered mesh as input and injects the ground truth
correspondence information obtained from the dynamic mesh into various stages
of a pre-trained text-to-image generation model to output high-quality and
temporally consistent frames. We demonstrate our approach on various examples
where motion can be obtained by animating rigged assets or changing the camera
path.
A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling
December 03, 2023
Wentao Qu, Yuantian Shao, Lingwu Meng, Xiaoshui Huang, Liang Xiao
Point cloud upsampling (PCU) enriches the representation of raw point clouds,
significantly improving the performance in downstream tasks such as
classification and reconstruction. Most of the existing point cloud upsampling
methods focus on sparse point cloud feature extraction and upsampling module
design. In a different way, we dive deeper into directly modelling the gradient
of data distribution from dense point clouds. In this paper, we propose a
conditional denoising diffusion probabilistic model (DDPM) for point cloud
upsampling, called PUDM. Specifically, PUDM treats the sparse point cloud as a
condition, and iteratively learns the transformation relationship between the
dense point cloud and the noise. Simultaneously, PUDM aligns with a dual
mapping paradigm to further improve the discernment of point features. In this
context, PUDM enables learning complex geometry details in the ground truth
through the dominant features, while avoiding an additional upsampling module
design. Furthermore, to generate high-quality arbitrary-scale point clouds
during inference, PUDM exploits the prior knowledge of the scale between sparse
point clouds and dense point clouds during training by parameterizing a rate
factor. Moreover, PUDM exhibits strong noise robustness in experimental
results. In the quantitative and qualitative evaluations on PU1K and PUGAN,
PUDM significantly outperformed existing methods in terms of Chamfer Distance
(CD) and Hausdorff Distance (HD), achieving state of the art (SOTA)
performance.
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
December 03, 2023
Jeong-gi Kwak, Erqun Dong, Yuhe Jin, Hanseok Ko, Shweta Mahajan, Kwang Moo Yi
Generating novel views of an object from a single image is a challenging
task. It requires an understanding of the underlying 3D structure of the object
from an image and rendering high-quality, spatially consistent new views. While
recent methods for view synthesis based on diffusion have shown great progress,
achieving consistency among various view estimates and at the same time abiding
by the desired camera pose remains a critical problem yet to be solved. In this
work, we demonstrate a strikingly simple method, where we utilize a pre-trained
video diffusion model to solve this problem. Our key idea is that synthesizing
a novel view could be reformulated as synthesizing a video of a camera going
around the object of interest – a scanning video – which then allows us to
leverage the powerful priors that a video diffusion model would have learned.
Thus, to perform novel-view synthesis, we create a smooth camera trajectory to
the target view that we wish to render, and denoise using both a
view-conditioned diffusion model and a video diffusion model. By doing so, we
obtain a highly consistent novel view synthesis, outperforming the state of the
art.
Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting
December 03, 2023
Jin Liu, Huaibo Huang, Chao Jin, Ran He
Face stylization refers to the transformation of a face into a specific
portrait style. However, current methods require the use of example-based
adaptation approaches to fine-tune pre-trained generative models, which demands
substantial time and storage space and fails to achieve detailed style
transformation. This paper proposes a training-free face stylization framework,
named Portrait Diffusion. This framework leverages off-the-shelf text-to-image
diffusion models, eliminating the need for fine-tuning specific examples.
Specifically, the content and style images are first inverted into latent
codes. Then, during image reconstruction using the corresponding latent code,
the content and style features in the attention space are delicately blended
through a modified self-attention operation called Style Attention Control.
Additionally, a Chain-of-Painting method is proposed for the gradual redrawing
of unsatisfactory areas from rough adjustments to fine-tuning. Extensive
experiments validate the effectiveness of our Portrait Diffusion method and
demonstrate the superiority of Chain-of-Painting in achieving precise face
stylization. Code will be released at
\url{https://github.com/liujin112/PortraitDiffusion}.
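The attention-space blending can be sketched as content queries attending to content and style keys/values separately, with the two outputs mixed by a blend weight; this is a simplified stand-in for the paper's Style Attention Control, and the shapes and blend parameter are assumptions.
    import torch

    def style_attention_control(q_content, kv_content, kv_style, blend=0.5):
        # q_content, kv_content, kv_style: (B, N, D) latent features.
        def attend(q, kv):
            scale = q.shape[-1] ** -0.5
            attn = torch.softmax(q @ kv.transpose(-2, -1) * scale, dim=-1)
            return attn @ kv
        out_content = attend(q_content, kv_content)
        out_style = attend(q_content, kv_style)
        return (1 - blend) * out_content + blend * out_style

    if __name__ == "__main__":
        q = torch.randn(1, 64, 32)
        kv_c, kv_s = torch.randn(1, 64, 32), torch.randn(1, 64, 32)
        print(style_attention_control(q, kv_c, kv_s, blend=0.6).shape)   # (1, 64, 32)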
AAMDM: Accelerated Auto-regressive Motion Diffusion Model
December 02, 2023
Tianyu Li, Calvin Qiao, Guanqiao Ren, KangKang Yin, Sehoon Ha
Interactive motion synthesis is essential in creating immersive experiences
in entertainment applications, such as video games and virtual reality.
However, generating animations that are both high-quality and contextually
responsive remains a challenge. Traditional techniques in the game industry can
produce high-fidelity animations but suffer from high computational costs and
poor scalability. Trained neural network models alleviate the memory and speed
issues, yet fall short on generating diverse motions. Diffusion models offer
diverse motion synthesis with low memory usage, but require expensive reverse
diffusion processes. This paper introduces the Accelerated Auto-regressive
Motion Diffusion Model (AAMDM), a novel motion synthesis framework designed to
achieve quality, diversity, and efficiency all together. AAMDM integrates
Denoising Diffusion GANs as a fast Generation Module, and an Auto-regressive
Diffusion Model as a Polishing Module. Furthermore, AAMDM operates in a
lower-dimensional embedded space rather than the full-dimensional pose space,
which reduces the training complexity as well as further improves the
performance. We show that AAMDM outperforms existing methods in motion quality,
diversity, and runtime efficiency, through comprehensive quantitative analyses
and visual comparisons. We also demonstrate the effectiveness of each
algorithmic component through ablation studies.
PAC Privacy Preserving Diffusion Models
December 02, 2023
Qipan Xu, Youlong Ding, Jie Gao, Hao Wang
Data privacy protection is garnering increased attention among researchers.
Diffusion models (DMs), particularly with strict differential privacy, can
potentially produce images with both high privacy and visual quality. However,
challenges arise in ensuring robust protection in privatizing specific data
attributes, areas where current models often fall short. To address these
challenges, we introduce the PAC Privacy Preserving Diffusion Model, a model
that leverages diffusion principles and ensures Probably Approximately Correct (PAC)
privacy. We enhance privacy protection by integrating a private classifier
guidance into the Langevin Sampling Process. Additionally, recognizing the gap
in measuring the privacy of models, we have developed a novel metric to gauge
privacy levels. Our model, assessed with this new metric and supported by
Gaussian matrix computations for the PAC bound, has shown superior performance
in privacy protection over existing leading private generative models according
to benchmark tests.
Exploiting Diffusion Priors for All-in-One Image Restoration
December 02, 2023
Yuanbiao Gou, Haiyu Zhao, Boyun Li, Xinyan Xiao, Xi Peng
All-in-one aims to solve various tasks of image restoration in a single
model. To this end, we present a feasible way of exploiting the image priors
captured by the pretrained diffusion model, through addressing the two
challenges, i.e., degradation modeling and diffusion guidance. The former aims
to simulate the process of the clean image degenerated by the unknown
degradations, and the latter aims at guiding the diffusion model to generate
the desired clean image. With these motivations, we propose a zero-shot framework
for all-in-one image restoration, termed ZeroAIR, which alternatively performs
the test-time degradation modeling (TDM) and the three-stage diffusion guidance
(TDG) at each timestep of the reverse sampling. To be specific, TDM exploits
the diffusion priors to learn a degradation model from a given degraded image,
and TDG divides the timesteps into three stages to take full advantage of
the varying diffusion priors. Thanks to their degradation-agnostic property,
all-in-one restoration could be achieved in a zero-shot way. Through extensive
experiments, we show that our ZeroAIR achieves comparable or even better
performance than task-specific methods. The code will be available on
Github.
Planning as In-Painting: A Diffusion-Based Embodied Task Planning Framework for Environments under Uncertainty
December 02, 2023
Cheng-Fu Yang, Haoyang Xu, Te-Lin Wu, Xiaofeng Gao, Kai-Wei Chang, Feng Gao
Task planning for embodied AI has been one of the most challenging problems
where the community does not meet a consensus in terms of formulation. In this
paper, we aim to tackle this problem with a unified framework consisting of an
end-to-end trainable method and a planning algorithm. Particularly, we propose
a task-agnostic method named ‘planning as in-painting’. In this method, we use
a Denoising Diffusion Model (DDM) for plan generation, conditioned on both
language instructions and perceptual inputs under partially observable
environments. Partial observability often leads the model to hallucinate during
planning. Therefore, our diffusion-based method jointly models both state
trajectory and goal estimation to improve the reliability of the generated
plan, given the limited available information at each step. To better leverage
newly discovered information along the plan execution for a higher success
rate, we propose an on-the-fly planning algorithm to collaborate with the
diffusion-based planner. The proposed framework achieves promising performances
in various embodied AI tasks, including vision-language navigation, object
manipulation, and task planning in a photorealistic virtual environment. The
code is available at: https://github.com/joeyy5588/planning-as-inpainting.
DPHMs: Diffusion Parametric Head Models for Depth-based Tracking
December 02, 2023
Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner
We introduce Diffusion Parametric Head Models (DPHMs), a generative model
that enables robust volumetric head reconstruction and tracking from monocular
depth sequences. While recent volumetric head models, such as NPHMs, can now
excel in representing high-fidelity head geometries, tracking and
reconstructing heads from real-world single-view depth sequences remains very
challenging, as the fitting to partial and noisy observations is
underconstrained. To tackle these challenges, we propose a latent
diffusion-based prior to regularize volumetric head reconstruction and
tracking. This prior-based regularizer effectively constrains the identity and
expression codes to lie on the underlying latent manifold which represents
plausible head shapes. To evaluate the effectiveness of the diffusion-based
prior, we collect a dataset of monocular Kinect sequences consisting of various
complex facial expression motions and rapid transitions. We compare our method
to state-of-the-art tracking methods, and demonstrate improved head identity
reconstruction as well as robust expression tracking.
Taming Latent Diffusion Models to See in the Dark
December 02, 2023
Qiang Wen, Yazhou Xing, Qifeng Chen
Enhancing a low-light noisy RAW image into a well-exposed and clean sRGB
image is a significant challenge in computational photography. Due to the
limitation of large-scale paired data, prior approaches have difficulty in
recovering fine details and true colors in extremely low-light regions.
Meanwhile, recent advancements in generative diffusion models have shown
promising generating capabilities, which inspires this work to explore
generative priors from a diffusion model trained on a large-scale open-domain
dataset to benefit the low-light image enhancement (LLIE) task. Based on this
intention, we propose a novel diffusion-model-based LLIE method, dubbed
LDM-SID. LDM-SID aims at inserting a set of proposed taming modules into a
frozen pre-trained diffusion model to steer its generating process.
Specifically, the taming module fed with low-light information serves to output
a pair of affine transformation parameters to modulate the intermediate feature
in the diffusion model. Additionally, based on the observation of dedicated
generative priors across different portions of the diffusion model, we propose
to apply 2D discrete wavelet transforms on the input RAW image, resulting in
dividing the LLIE task into two essential parts: low-frequency content
generation and high-frequency detail maintenance. This enables us to skillfully
tame the diffusion model for optimized structural generation and detail
enhancement. Extensive experiments demonstrate the proposed method not only
achieves state-of-the-art performance in quantitative evaluations but also
shows significant superiority in visual comparisons. These findings highlight
the effectiveness of leveraging a pre-trained diffusion model as a generative
prior to the LLIE task. The project page is available at
https://csqiangwen.github.io/projects/ldm-sid/
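As an illustration of the affine modulation described above, the following is a minimal, hypothetical sketch of a FiLM-style taming module that maps low-light conditioning features to per-channel scale and shift parameters applied to an intermediate feature map of a frozen diffusion U-Net; module names, shapes, and placement are assumptions, not the paper's code.

```python
# Hypothetical sketch of a taming module: predicts affine (scale, shift)
# parameters from low-light features and modulates a frozen U-Net feature map.
import torch
import torch.nn as nn

class TamingModule(nn.Module):
    def __init__(self, cond_channels: int, feat_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, 2 * feat_channels, 3, padding=1),
        )

    def forward(self, feat: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # `cond` is assumed to be resized to the spatial size of `feat`.
        scale, shift = self.net(cond).chunk(2, dim=1)  # per-channel affine parameters
        return feat * (1 + scale) + shift              # modulate the frozen feature map
```

The low-/high-frequency split mentioned in the abstract would be obtained separately, e.g. with a 2D discrete wavelet transform of the input RAW image, before feeding the two parts to differently tamed portions of the model.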
Consistent Mesh Diffusion
December 01, 2023
Julian Knodt, Xifeng Gao
Given a 3D mesh with a UV parameterization, we introduce a novel approach to
generating textures from text prompts. While prior work uses optimization from
Text-to-Image Diffusion models to generate textures and geometry, this is slow
and requires significant compute resources. Alternatively, projection-based
approaches use the same Text-to-Image models to paint images onto a mesh, but
lack consistency across viewing angles. We instead propose a method that uses a
single Depth-to-Image diffusion network and generates a single consistent
texture when rendered on the 3D surface, by first unifying multiple 2D images'
diffusion paths and hoisting the result to 3D with
MultiDiffusion~\cite{multidiffusion}. We demonstrate our approach on a dataset
containing 30 meshes, taking approximately 5 minutes per mesh. To evaluate
rendering quality, we use CLIP-score~\cite{clipscore} and Frechet Inception
Distance (FID)~\cite{frechet}, and show our improvement over prior work.
3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing
December 01, 2023
Balamurugan Thambiraja, Sadegh Aliakbarian, Darren Cosker, Justus Thies
cs.CV, cs.AI, cs.GR, cs.LG
We present 3DiFACE, a novel method for personalized speech-driven 3D facial
animation and editing. While existing methods deterministically predict facial
animations from speech, they overlook the inherent one-to-many relationship
between speech and facial expressions, i.e., there are multiple reasonable
facial expression animations matching an audio input. It is especially
important in content creation to be able to modify generated motion or to
specify keyframes. To enable stochasticity as well as motion editing, we
propose a lightweight audio-conditioned diffusion model for 3D facial motion.
This diffusion model can be trained on a small 3D motion dataset, maintaining
expressive lip motion output. In addition, it can be finetuned for specific
subjects, requiring only a short video of the person. Through quantitative and
qualitative evaluations, we show that our method outperforms existing
state-of-the-art techniques and yields speech-driven animations with greater
fidelity and diversity.
Adversarial Score Distillation: When score distillation meets GAN
December 01, 2023
Min Wei, Jingkai Zhou, Junyao Sun, Xuesong Zhang
Existing score distillation methods are sensitive to the classifier-free
guidance (CFG) scale: this manifests as over-smoothness or instability at small
CFG scales and over-saturation at large ones. To explain and analyze these issues, we
revisit the derivation of Score Distillation Sampling (SDS) and decipher
existing score distillation with the Wasserstein Generative Adversarial Network
(WGAN) paradigm. With the WGAN paradigm, we find that existing score
distillation either employs a fixed sub-optimal discriminator or conducts
incomplete discriminator optimization, resulting in the scale-sensitive issue.
We propose the Adversarial Score Distillation (ASD), which maintains an
optimizable discriminator and updates it using the complete optimization
objective. Experiments show that the proposed ASD performs favorably in 2D
distillation and text-to-3D tasks against existing methods. Furthermore, to
explore the generalization ability of our WGAN paradigm, we extend ASD to the
image editing task, which achieves competitive results. The project page and
code are at https://github.com/2y7c3/ASD.
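For context, the Score Distillation Sampling gradient that the paper revisits is commonly written (in DreamFusion-style notation, which may differ from the paper's) as

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\epsilon_\phi(x_t;\,y,\,t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right], \qquad x_t = \sqrt{\bar\alpha_t}\,x(\theta) + \sqrt{1-\bar\alpha_t}\,\epsilon,$$

where $x(\theta)$ is the rendered image, $y$ the text condition, and $w(t)$ a weighting. The paper's WGAN reading attributes the CFG-scale sensitivity of this objective to a fixed or incompletely optimized discriminator.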
DeepCache: Accelerating Diffusion Models for Free
December 01, 2023
Xinyin Ma, Gongfan Fang, Xinchao Wang
Diffusion models have recently gained unprecedented attention in the field of
image synthesis due to their remarkable generative capabilities.
Notwithstanding their prowess, these models often incur substantial
computational costs, primarily attributed to the sequential denoising process
and cumbersome model size. Traditional methods for compressing diffusion models
typically involve extensive retraining, presenting cost and feasibility
challenges. In this paper, we introduce DeepCache, a novel training-free
paradigm that accelerates diffusion models from the perspective of model
architecture. DeepCache capitalizes on the inherent temporal redundancy
observed in the sequential denoising steps of diffusion models, which caches
and retrieves features across adjacent denoising stages, thereby curtailing
redundant computations. Utilizing the property of the U-Net, we reuse the
high-level features while updating the low-level features in a very cheap way.
This innovative strategy, in turn, enables a speedup factor of 2.3$\times$ for
Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score, and 4.1$\times$
for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments
also demonstrate DeepCache’s superiority over existing pruning and distillation
methods that necessitate retraining and its compatibility with current sampling
techniques. Furthermore, we find that under the same throughput, DeepCache
effectively achieves comparable or even marginally improved results with DDIM
or PLMS. The code is available at https://github.com/horseee/DeepCache
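The caching idea can be illustrated with a toy sketch (an assumption-laden illustration, not the released code): split the network into a cheap shallow part and an expensive deep part, run the deep part only every few steps, and reuse its cached output in between.

```python
# Toy illustration of feature caching across adjacent denoising steps.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, ch: int = 8):
        super().__init__()
        self.shallow = nn.Conv2d(ch, ch, 3, padding=1)      # stands in for outer U-Net blocks
        self.deep = nn.Sequential(                           # stands in for the costly inner blocks
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(), nn.Conv2d(ch, ch, 3, padding=1)
        )
        self.out = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, x, cached_deep=None):
        h = self.shallow(x)
        deep = self.deep(h) if cached_deep is None else cached_deep
        return self.out(torch.cat([h, deep], dim=1)), deep

@torch.no_grad()
def sample(model, x, n_steps: int = 50, cache_interval: int = 3):
    cached = None
    for i in range(n_steps):
        recompute = (i % cache_interval == 0)                # full deep pass only occasionally
        eps, deep = model(x, None if recompute else cached)
        cached = deep
        x = x - 0.02 * eps                                   # placeholder update rule, illustration only
    return x
```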
TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models
December 01, 2023
Pengxiang Li, Zhili Liu, Kai Chen, Lanqing Hong, Yunzhi Zhuge, Dit-Yan Yeung, Huchuan Lu, Xu Jia
Diffusion models have gained prominence in generating data for perception
tasks such as image classification and object detection. However, the potential
in generating high-quality tracking sequences, a crucial aspect in the field of
video perception, has not been fully investigated. To address this gap, we
propose TrackDiffusion, a novel architecture designed to generate continuous
video sequences from tracklets. TrackDiffusion departs significantly from
traditional layout-to-image (L2I) generation and copy-paste synthesis, which
focus on static image elements like bounding boxes, by empowering image
diffusion models to encompass dynamic and continuous tracking trajectories,
thereby capturing complex motion nuances and ensuring instance consistency
among video frames. For the first time, we demonstrate that the
generated video sequences can be utilized for training multi-object tracking
(MOT) systems, leading to significant improvement in tracker performance.
Experimental results show that our model significantly enhances instance
consistency in generated video sequences, leading to improved perceptual
metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in
TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the
standards of video data generation for MOT tasks and beyond.
Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion
December 01, 2023
Litu Rout, Yujia Chen, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, Wen-Sheng Chu
Sampling from the posterior distribution poses a major computational
challenge in solving inverse problems using latent diffusion models. Common
methods rely on Tweedie’s first-order moments, which are known to induce a
quality-limiting bias. Existing second-order approximations are impractical due
to prohibitive computational costs, making standard reverse diffusion processes
intractable for posterior sampling. This paper introduces Second-order Tweedie
sampler from Surrogate Loss (STSL), a novel sampler that offers efficiency
comparable to first-order Tweedie with a tractable reverse process using
second-order approximation. Our theoretical results reveal that the
second-order approximation is lower bounded by our surrogate loss that only
requires $O(1)$ compute using the trace of the Hessian, and by the lower bound
we derive a new drift term to make the reverse process tractable. Our method
surpasses SoTA solvers PSLD and P2L, achieving 4X and 8X reduction in neural
function evaluations, respectively, while notably enhancing sampling quality on
FFHQ, ImageNet, and COCO benchmarks. In addition, we show STSL extends to
text-guided image editing and addresses residual distortions present from
corrupted images in leading text-guided image editing methods. To the best of our
knowledge, this is the first work to offer an efficient second-order
approximation in solving inverse problems using latent diffusion and editing
real-world images with corruptions.
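A common way to estimate a Hessian trace at $O(1)$ cost is Hutchinson-style probing via Hessian-vector products; the generic sketch below illustrates the trick and is not necessarily the paper's construction (`log_p` and its inputs are placeholders).

```python
# Generic Hutchinson trace estimator: E[v^T H v] = tr(H) for Rademacher probes v.
import torch

def hutchinson_trace(log_p, x, n_probes: int = 1) -> torch.Tensor:
    """Estimate tr(H) with H the Hessian of log_p at x, using Hessian-vector products."""
    x = x.detach().requires_grad_(True)
    grad = torch.autograd.grad(log_p(x).sum(), x, create_graph=True)[0]
    estimate = torch.zeros((), device=x.device)
    for _ in range(n_probes):
        v = torch.randint_like(x, 0, 2) * 2.0 - 1.0                       # Rademacher +/-1 probe
        hvp = torch.autograd.grad((grad * v).sum(), x, retain_graph=True)[0]
        estimate = estimate + (v * hvp).sum()
    return estimate / n_probes
```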
DFU: scale-robust diffusion model for zero-shot super-resolution image generation
November 30, 2023
Alex Havrilla, Kevin Rojas, Wenjing Liao, Molei Tao
Diffusion generative models have achieved remarkable success in generating
images with a fixed resolution. However, existing models have limited ability
to generalize to different resolutions when training data at those resolutions
are not available. Leveraging techniques from operator learning, we present a
novel deep-learning architecture, Dual-FNO UNet (DFU), which approximates the
score operator by combining both spatial and spectral information at multiple
resolutions. Comparisons of DFU to baselines demonstrate its scalability: 1)
simultaneously training on multiple resolutions improves FID over training at
any single fixed resolution; 2) DFU generalizes beyond its training
resolutions, allowing for coherent, high-fidelity generation at
higher resolutions with the same model, i.e., zero-shot super-resolution
image generation; 3) we propose a fine-tuning strategy to further enhance the
zero-shot super-resolution image generation capability of our model, leading to
an FID of 11.3 at 1.66 times the maximum training resolution on FFHQ, which no
other method comes close to achieving.
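To make the "spectral information" ingredient concrete, here is a generic sketch of a 2D spectral convolution of the kind used in Fourier neural operators; it is a simplified illustration with assumed shapes, while DFU's actual Dual-FNO blocks combine such spectral paths with spatial convolutions inside a U-Net.

```python
# Simplified 2D spectral (Fourier) layer: mix channels on the lowest Fourier modes.
import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes  # number of retained low-frequency modes (<= H and <= W//2 + 1)
        scale = 1.0 / channels
        self.weight = nn.Parameter(
            scale * torch.randn(channels, channels, modes, modes, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_ft = torch.fft.rfft2(x)                       # to frequency domain
        out_ft = torch.zeros_like(x_ft)
        m = self.modes
        out_ft[:, :, :m, :m] = torch.einsum(            # channel mixing on retained modes
            "bixy,ioxy->boxy", x_ft[:, :, :m, :m], self.weight
        )
        return torch.fft.irfft2(out_ft, s=x.shape[-2:]) # back to the spatial domain
```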
DREAM: Diffusion Rectification and Estimation-Adaptive Models
November 30, 2023
Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, Luming Liang
We present DREAM, a novel training framework representing Diffusion
Rectification and Estimation-Adaptive Models, requiring minimal code changes
(just three lines) yet significantly enhancing the alignment of training with
sampling in diffusion models. DREAM features two components: diffusion
rectification, which adjusts training to reflect the sampling process, and
estimation adaptation, which balances perception against distortion. When
applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff
between minimizing distortion and preserving high image quality. Experiments
demonstrate DREAM’s superiority over standard diffusion-based SR methods,
showing a $2$ to $3\times $ faster training convergence and a $10$ to
$20\times$ reduction in necessary sampling steps to achieve comparable or
superior results. We hope DREAM will inspire a rethinking of diffusion model
training paradigms.
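One plausible reading of "diffusion rectification" (an assumption on our part; the paper specifies the exact three-line change) is to blend the model's own stop-gradient noise estimate into the training noise so that training inputs resemble what the sampler actually sees.

```python
# Hypothetical sketch of a rectified training step; lambda_t and the exact
# blending/targets are assumptions, not DREAM's published recipe.
import torch

def rectified_loss(model, x0, t, alpha_bar_t, lambda_t: float = 1.0):
    eps = torch.randn_like(x0)
    x_t = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
    with torch.no_grad():                              # model's own estimate, no gradient
        eps_hat = model(x_t, t)
    eps_mix = eps + lambda_t * (eps_hat - eps)         # nudge the noise toward sampling-time behavior
    x_t_mix = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps_mix
    return ((model(x_t_mix, t) - eps_mix) ** 2).mean() # standard epsilon-prediction loss on the mix
```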
S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion
November 30, 2023
Or Greenberg, Eran Kishon, Dani Lischinski
Image-to-image translation (I2IT) refers to the process of transforming
images from a source domain to a target domain while maintaining a fundamental
connection in terms of image content. In the past few years, remarkable
advancements in I2IT were achieved by Generative Adversarial Networks (GANs),
which nevertheless struggle with translations requiring high precision.
Recently, Diffusion Models have established themselves as the engine of choice
for image generation. In this paper we introduce S2ST, a novel framework
designed to accomplish global I2IT in complex photorealistic images, such as
day-to-night or clear-to-rain translations of automotive scenes. S2ST operates
within the seed space of a Latent Diffusion Model, thereby leveraging the
powerful image priors learned by the latter. We show that S2ST surpasses
state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches,
for complex automotive scenes, improving fidelity while respecting the target
domain’s appearance across a variety of domains. Notably, S2ST obviates the
necessity for training domain-specific translation networks.
Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction
November 30, 2023
Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
Contents generated by recent advanced Text-to-Image (T2I) diffusion models
are sometimes too imaginative for existing off-the-shelf semantic property
predictors to estimate due to the immitigable domain gap. We introduce DMP, a
pipeline utilizing pre-trained T2I models as a prior for pixel-level semantic
prediction tasks. To address the misalignment between deterministic prediction
tasks and stochastic T2I models, we reformulate the diffusion process through a
sequence of interpolations, establishing a deterministic mapping between input
RGB images and output prediction distributions. To preserve generalizability,
we use low-rank adaptation to fine-tune pre-trained models. Extensive
experiments across five tasks, including 3D property estimation, semantic
segmentation, and intrinsic image decomposition, showcase the efficacy of the
proposed method. Despite limited-domain training data, the approach yields
faithful estimations for arbitrary images, surpassing existing state-of-the-art
algorithms.
One-step Diffusion with Distribution Matching Distillation
November 30, 2023
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, Taesung Park
Diffusion models generate high-quality images but require dozens of forward
passes. We introduce Distribution Matching Distillation (DMD), a procedure to
transform a diffusion model into a one-step image generator with minimal impact
on image quality. We enforce the one-step image generator to match the diffusion
model at the distribution level by minimizing an approximate KL divergence whose
gradient can be expressed as the difference between two score functions, one of
the target distribution and the other of the synthetic distribution being
produced by our one-step generator. The score functions are parameterized as
two diffusion models trained separately on each distribution. Combined with a
simple regression loss matching the large-scale structure of the multi-step
diffusion outputs, our method outperforms all published few-step diffusion
approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot
COCO-30k, comparable to Stable Diffusion but orders of magnitude faster.
Utilizing FP16 inference, our model generates images at 20 FPS on modern
hardware.
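The distribution-matching gradient described above can be sketched generically as follows; the function names, the simple noising convention, and the sign/order of the score difference are assumptions for illustration, while the actual method parameterizes the two scores as separately trained diffusion models.

```python
# Illustrative sketch: the generator gradient is built from the difference
# between the score of the generator's output distribution ("fake") and the
# score of the target distribution ("real"), evaluated on noised samples.
import torch

def dmd_grad(x_fake, t, sigma_t, score_real, score_fake):
    x_t = x_fake + sigma_t * torch.randn_like(x_fake)   # simple noising for illustration
    with torch.no_grad():
        g = score_fake(x_t, t) - score_real(x_t, t)      # difference of the two scores
    return g                                             # backpropagated into the generator, e.g. x_fake.backward(g)
```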
Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters
November 30, 2023
James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin
Recent work has demonstrated a remarkable ability to customize text-to-image
diffusion models to multiple, fine-grained concepts in a sequential (i.e.,
continual) manner while only providing a few example images for each concept.
This setting is known as continual diffusion. Here, we ask the question: Can we
scale these methods to longer concept sequences without forgetting? Although
prior work mitigates the forgetting of previously learned concepts, we show
that its capacity to learn new tasks reaches saturation over longer sequences.
We address this challenge by introducing a novel method, STack-And-Mask
INcremental Adapters (STAMINA), which is composed of low-rank
attention-masked adapters and customized MLP tokens. STAMINA is designed to
enhance the robust fine-tuning properties of LoRA for sequential concept
learning via learnable hard-attention masks parameterized with low rank MLPs,
enabling precise, scalable learning via sparse adaptation. Notably, all
introduced trainable parameters can be folded back into the model after
training, inducing no additional inference parameter costs. We show that
STAMINA outperforms the prior SOTA for the setting of text-to-image continual
customization on a 50-concept benchmark composed of landmarks and human faces,
with no stored replay data. Additionally, we extend our method to the setting
of continual learning for image classification, demonstrating that our gains
also translate to state-of-the-art performance on this standard benchmark.
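The building blocks named above (a LoRA update gated by a learnable mask produced by a low-rank MLP) can be sketched generically as follows; the soft gate, its placement, and the hardening of the mask are simplifications and assumptions, not the paper's exact design.

```python
# Generic sketch: a frozen linear layer plus a LoRA update whose contribution
# is gated by a small low-rank MLP acting as a (soft) attention mask.
import torch
import torch.nn as nn

class MaskedLoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                      # frozen pretrained weights
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))
        self.mask_mlp = nn.Sequential(                   # low-rank MLP producing a gate
            nn.Linear(d_in, rank), nn.SiLU(), nn.Linear(rank, 1)
        )

    def forward(self, x):
        gate = torch.sigmoid(self.mask_mlp(x))           # soft stand-in for a hard mask
        return self.base(x) + gate * (x @ self.A.T @ self.B.T)
```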
DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
November 30, 2023
Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person,
offering intuitive control over both pose and expression. We propose a
diffusion-based neural renderer that leverages generic 2D priors to produce
compelling images of faces. For coarse guidance of the expression and head
pose, we render a neural parametric head model (NPHM) from the target
viewpoint, which acts as a proxy geometry of the person. Additionally, to
enhance the modeling of intricate facial expressions, we condition
DiffusionAvatars directly on the expression codes obtained from NPHM via
cross-attention. Finally, to synthesize consistent surface details across
different viewpoints and expressions, we rig learnable spatial features to the
head’s surface via TriPlane lookup in NPHM’s canonical space. We train
DiffusionAvatars on RGB videos and corresponding tracked NPHM meshes of a
person and test the obtained avatars in both self-reenactment and animation
scenarios. Our experiments demonstrate that DiffusionAvatars generates
temporally consistent and visually appealing videos for novel poses and
expressions of a person, outperforming existing approaches.
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
November 30, 2023
Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen
Sampling from diffusion models can be treated as solving the corresponding
ordinary differential equations (ODEs), with the aim of obtaining an accurate
solution with as few function evaluations (NFE) as possible.
Recently, various fast samplers utilizing higher-order ODE solvers have emerged
and achieved better performance than the initial first-order one. However,
these numerical methods inherently result in certain approximation errors,
which significantly degrade sample quality with extremely small NFE (e.g.,
around 5). In contrast, based on the geometric observation that each sampling
trajectory almost lies in a two-dimensional subspace embedded in the ambient
space, we propose Approximate MEan-Direction Solver (AMED-Solver) that
eliminates truncation errors by directly learning the mean direction for fast
diffusion sampling. Besides, our method can be easily used as a plugin to
further improve existing ODE-based samplers. Extensive experiments on image
synthesis with the resolution ranging from 32 to 256 demonstrate the
effectiveness of our method. With only 5 NFE, we achieve 7.14 FID on CIFAR-10,
13.75 FID on ImageNet 64$\times$64, and 12.79 FID on LSUN Bedroom. Our code is
available at https://github.com/zhyzhouu/amed-solver.
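For reference, in the common EDM-style parameterization the sampling ODE and its first-order (Euler) step read

$$\frac{\mathrm{d}x}{\mathrm{d}\sigma} = \frac{x - D_\theta(x;\sigma)}{\sigma}, \qquad x_{i+1} = x_i + (\sigma_{i+1} - \sigma_i)\,\frac{x_i - D_\theta(x_i;\sigma_i)}{\sigma_i},$$

where $D_\theta$ is the learned denoiser. AMED-Solver, as we read the abstract, replaces the local slope over $[\sigma_i, \sigma_{i+1}]$ with a learned mean direction for the whole interval.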
On Exact Inversion of DPM-Solvers
November 30, 2023
Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun
Diffusion probabilistic models (DPMs) are a key component in modern
generative models. DPM-solvers have achieved reduced latency and enhanced
quality significantly, but have posed challenges to find the exact inverse
(i.e., finding the initial noise from the given image). Here we investigate the
exact inversions for DPM-solvers and propose algorithms to perform them when
samples are generated by the first-order as well as higher-order DPM-solvers.
For each explicit denoising step in DPM-solvers, we formulate the inversion
using implicit methods such as gradient descent or the forward step method to
ensure robustness to large classifier-free guidance, unlike the prior approach
based on fixed-point iteration. Experimental results demonstrate that our
proposed exact inversion methods significantly reduce the error of both image
and noise reconstructions, greatly enhance the ability to distinguish invisible
watermarks, and reliably prevent unintended background changes during image
editing. Project page:
\url{https://smhongok.github.io/inv-dpm.html}.
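To illustrate what an "implicit method such as gradient descent" can look like for inverting one deterministic denoising step, here is a generic sketch with placeholder names; it is not the authors' algorithm.

```python
# Generic sketch: treat one denoising step f(x_t) = x_{t-1} as an implicit
# equation in x_t and solve it by gradient descent on the residual.
import torch

def invert_step(x_prev, step_fn, x_init, n_iters: int = 100, lr: float = 0.1):
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        residual = ((step_fn(x) - x_prev) ** 2).mean()   # ||f(x_t) - x_{t-1}||^2
        residual.backward()
        opt.step()
    return x.detach()
```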
Non-Cross Diffusion for Semantic Consistency
November 30, 2023
Ziyang Zheng, Ruiyuan Gao, Qiang Xu
In diffusion models, deviations from a straight generative flow are a common
issue, resulting in semantic inconsistencies and suboptimal generations. To
address this challenge, we introduce `Non-Cross Diffusion’, an innovative
approach in generative modeling for learning ordinary differential equation
(ODE) models. Our methodology strategically incorporates an ascending dimension
of input to effectively connect points sampled from two distributions with
uncrossed paths. This design is pivotal in ensuring enhanced semantic
consistency throughout the inference process, which is especially critical for
applications reliant on consistent generative flows, including various
distillation methods and deterministic sampling, which are fundamental in image
editing and interpolation tasks. Our empirical results demonstrate the
effectiveness of Non-Cross Diffusion, showing a substantial reduction in
semantic inconsistencies at different inference steps and a notable enhancement
in the overall performance of diffusion models.
Diffusion Models Without Attention
November 30, 2023
Jing Nathan Yan, Jiatao Gu, Alexander M. Rush
In recent advancements in high-fidelity image generation, Denoising Diffusion
Probabilistic Models (DDPMs) have emerged as a key player. However, their
application at high resolutions presents significant computational challenges.
Current methods, such as patchifying, expedite processes in UNet and
Transformer architectures but at the expense of representational capacity.
Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an
architecture that supplants attention mechanisms with a more scalable state
space model backbone. This approach effectively handles higher resolutions
without resorting to global compression, thus preserving detailed image
representation throughout the diffusion process. Our focus on FLOP-efficient
architectures in diffusion training marks a significant step forward.
Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions
demonstrate that DiffuSSMs are on par or even outperform existing diffusion
models with attention modules in FID and Inception Score metrics while
significantly reducing total FLOP usage.
SMaRt: Improving GANs with Score Matching Regularity
November 30, 2023
Mengfei Xia, Yujun Shen, Ceyuan Yang, Ran Yi, Wenping Wang, Yong-jin Liu
Generative adversarial networks (GANs) usually struggle in learning from
highly diverse data, whose underlying manifold is complex. In this work, we
revisit the mathematical foundations of GANs, and theoretically reveal that the
native adversarial loss for GAN training is insufficient to fix the problem that
subsets of the generated data manifold with positive Lebesgue measure lie outside
the real data manifold. Instead, we find that score matching serves as a
valid solution to this issue thanks to its capability of persistently pushing
the generated data points towards the real data manifold. We thereby propose to
improve the optimization of GANs with score matching regularity (SMaRt).
On the empirical side, we first design a toy example to show that
training GANs with the aid of a ground-truth score function can help reproduce
the real data distribution more accurately, and then confirm that our approach
can consistently boost the synthesis performance of various state-of-the-art
GANs on real-world datasets with pre-trained diffusion models acting as the
approximate score function. For instance, when training Aurora on the ImageNet
64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par with the
performance of the one-step consistency model. The source code will be made public.
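The recipe described above, a standard GAN loss plus a score-matching-style pull toward the real data manifold realized with a pretrained diffusion denoiser, can be sketched generically; the loss form, noise level, and weighting below are assumptions, not the paper's implementation.

```python
# Hedged sketch: regularize the generator by pulling its samples toward the
# denoised version produced by a pretrained diffusion model (approximate score).
import torch

def generator_loss(fake, disc, denoiser, sigma: float = 0.5, lam: float = 0.1):
    adv = -disc(fake).mean()                          # WGAN-style adversarial term (illustrative)
    noised = fake + sigma * torch.randn_like(fake)
    with torch.no_grad():
        target = denoiser(noised, sigma)              # approximate projection onto the data manifold
    reg = ((fake - target) ** 2).mean()               # score-matching-style regularity term
    return adv + lam * reg
```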
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
November 30, 2023
Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou
cs.CV, cs.AI, cs.CL, cs.LG
This paper explores advancements in high-fidelity personalized image
generation through the utilization of pre-trained text-to-image diffusion
models. While previous approaches have made significant strides in generating
versatile scenes based on text descriptions and a few input images, challenges
persist in maintaining the subject fidelity within the generated images. In
this work, we introduce an innovative algorithm named HiFi Tuner to enhance the
appearance preservation of objects during personalized image generation. Our
proposed method employs a parameter-efficient fine-tuning framework, comprising
a denoising process and a pivotal inversion process. Key enhancements include
the utilization of mask guidance, a novel parameter regularization technique,
and the incorporation of step-wise subject representations to elevate the
sample fidelity. Additionally, we propose a reference-guided generation
approach that leverages the pivotal inversion of a reference image to mitigate
unwanted subject variations and artifacts. We further extend our method to a
novel image editing task: substituting the subject in an image through textual
manipulations. Experimental evaluations conducted on the DreamBooth dataset
using the Stable Diffusion model showcase promising results. Fine-tuning solely
on textual embeddings improves CLIP-T score by 3.6 points and improves DINO
score by 9.6 points over Textual Inversion. When fine-tuning all parameters,
HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2
points over DreamBooth, establishing a new state of the art.
DiffGEPCI: 3D MRI Synthesis from mGRE Signals using 2.5D Diffusion Model
November 29, 2023
Yuyang Hu, Satya V. V. N. Kothapalli, Weijie Gan, Alexander L. Sukstanskii, Gregory F. Wu, Manu Goyal, Dmitriy A. Yablonskiy, Ulugbek S. Kamilov
We introduce a new framework called DiffGEPCI for cross-modality generation
in magnetic resonance imaging (MRI) using a 2.5D conditional diffusion model.
DiffGEPCI can synthesize high-quality Fluid Attenuated Inversion Recovery
(FLAIR) and Magnetization Prepared-Rapid Gradient Echo (MPRAGE) images, without
acquiring corresponding measurements, by leveraging multi-Gradient-Recalled
Echo (mGRE) MRI signals as conditional inputs. DiffGEPCI operates in a two-step
fashion: it initially estimates a 3D volume slice-by-slice using the axial
plane and subsequently applies a refinement algorithm (referred to as 2.5D) to
enhance the quality of the coronal and sagittal planes. Experimental validation
on real mGRE data shows that DiffGEPCI achieves excellent performance,
surpassing generative adversarial networks (GANs) and traditional diffusion
models.
Do text-free diffusion models learn discriminative visual representations?
November 29, 2023
Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava
While many unsupervised learning models focus on one family of tasks, either
generative or discriminative, we explore the possibility of a unified
representation learner: a model which addresses both families of tasks
simultaneously. We identify diffusion models, a state-of-the-art method for
generative tasks, as a prime candidate. Such models involve training a U-Net to
iteratively predict and remove noise, and the resulting model can synthesize
high-fidelity, diverse, novel images. We find that the intermediate feature
maps of the U-Net are diverse, discriminative feature representations. We
propose a novel attention mechanism for pooling feature maps and further
leverage this mechanism in DifFormer, a transformer-based fusion of features
from different diffusion U-Net blocks and noise steps. We also develop DifFeed,
a novel feedback mechanism tailored to diffusion. We find that diffusion models
are better than GANs, and, with our fusion and feedback mechanisms, can compete
with state-of-the-art unsupervised image representation learning methods for
discriminative tasks: image classification with full and semi-supervision,
transfer for fine-grained classification, object detection and segmentation,
and semantic segmentation. Our project website
(https://mgwillia.github.io/diffssl/) and code
(https://github.com/soumik-kanad/diffssl) are available publicly.
SODA: Bottleneck Diffusion Models for Representation Learning
November 29, 2023
Drew A. Hudson, Daniel Zoran, Mateusz Malinowski, Andrew K. Lampinen, Andrew Jaegle, James L. McClelland, Loic Matthey, Felix Hill, Alexander Lerchner
We introduce SODA, a self-supervised diffusion model, designed for
representation learning. The model incorporates an image encoder, which
distills a source view into a compact representation, that, in turn, guides the
generation of related novel views. We show that by imposing a tight bottleneck
between the encoder and a denoising decoder, and leveraging novel view
synthesis as a self-supervised objective, we can turn diffusion models into
strong representation learners, capable of capturing visual semantics in an
unsupervised manner. To the best of our knowledge, SODA is the first diffusion
model to succeed at ImageNet linear-probe classification, and, at the same
time, it accomplishes reconstruction, editing and synthesis tasks across a wide
range of datasets. Further investigation reveals the disentangled nature of its
emergent latent space, which serves as an effective interface to control and
manipulate the model’s produced images. All in all, we aim to shed light on the
exciting and promising potential of diffusion models, not only for image
generation, but also for learning rich and robust representations.
Leveraging Graph Diffusion Models for Network Refinement Tasks
November 29, 2023
Puja Trivedi, Ryan Rossi, David Arbour, Tong Yu, Franck Dernoncourt, Sungchul Kim, Nedim Lipka, Namyong Park, Nesreen K. Ahmed, Danai Koutra
Most real-world networks are noisy and incomplete samples from an unknown
target distribution. Refining them by correcting corruptions or inferring
unobserved regions typically improves downstream performance. Inspired by the
impressive generative capabilities that have been used to correct corruptions
in images, and the similarities between “in-painting” and filling in missing
nodes and edges conditioned on the observed graph, we propose a novel graph
generative framework, SGDM, which is based on subgraph diffusion. Our framework
not only improves the scalability and fidelity of graph diffusion models, but
also leverages the reverse process to perform novel, conditional generation
tasks. In particular, through extensive empirical analysis and a set of novel
metrics, we demonstrate that our proposed model effectively supports the
following refinement tasks for partially observable networks: T1: denoising
extraneous subgraphs, T2: expanding existing subgraphs and T3: performing
“style” transfer by regenerating a particular subgraph to match the
characteristics of a different node or subgraph.
Fair Text-to-Image Diffusion via Fair Mapping
November 29, 2023
Jia Li, Lijie Hu, Jingfeng Zhang, Tianhang Zheng, Hua Zhang, Di Wang
cs.CV, cs.AI, cs.CY, cs.LG
In this paper, we address the limitations of existing text-to-image diffusion
models in generating demographically fair results when given human-related
descriptions. These models often struggle to disentangle the target language
context from sociocultural biases, resulting in biased image generation. To
overcome this challenge, we propose Fair Mapping, a general, model-agnostic,
and lightweight approach that modifies a pre-trained text-to-image model by
controlling the prompt to achieve fair image generation. One key advantage of
our approach is its high efficiency. The training process only requires
updating a small number of parameters in an additional linear mapping network.
This not only reduces the computational cost but also accelerates the
optimization process. We first demonstrate the issue of bias in generated
results caused by language biases in text-guided diffusion models. By
developing a mapping network that projects language embeddings into an unbiased
space, we enable the generation of relatively balanced demographic results
based on a keyword specified in the prompt. With comprehensive experiments on
face image generation, we show that our method significantly improves image
generation performance when prompted with descriptions related to human faces.
By effectively addressing the issue of bias, we produce more fair and diverse
image outputs. This work contributes to the field of text-to-image generation
by enhancing the ability to generate images that accurately reflect the
intended demographic characteristics specified in the text.
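Since the abstract emphasizes that only a small linear mapping over the text embedding is trained, a minimal sketch of such a module might look as follows; the dimension, residual form, and initialization are assumptions, not the authors' code.

```python
# Minimal sketch of a trainable linear mapping applied to the prompt embedding
# of a frozen text-to-image model; only this mapping is optimized.
import torch.nn as nn

class FairMapping(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)   # start as the identity (zero residual correction)
        nn.init.zeros_(self.proj.bias)

    def forward(self, text_emb):
        # Residual correction of the (frozen) text encoder output.
        return text_emb + self.proj(text_emb)
```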
Using Ornstein-Uhlenbeck Process to understand Denoising Diffusion Probabilistic Model and its Noise Schedules
November 29, 2023
Javier E. Santos, Yen Ting Lin
stat.ML, cond-mat.stat-mech, cs.AI, cs.LG, math-ph, math.MP
The aim of this short note is to show that the Denoising Diffusion Probabilistic
Model (DDPM), a non-homogeneous discrete-time Markov process, can be represented
by a time-homogeneous continuous-time Markov process observed at non-uniformly
sampled discrete times. Surprisingly, this continuous-time Markov process is
the well-known and well-studied Ornstein-Uhlenbeck (OU) process, which was
developed in the 1930s for studying Brownian particles in harmonic potentials. We
establish the formal equivalence between DDPM and the OU process using its
analytical solution. We further demonstrate that the design problem of the
noise scheduler for non-homogeneous DDPM is equivalent to designing observation
times for the OU process. We present several heuristic designs for observation
times based on principled quantities such as auto-variance and Fisher
Information and connect them to ad hoc noise schedules for DDPM. Interestingly,
we show that the Fisher-Information-motivated schedule corresponds exactly to the
cosine schedule, which was developed without any theoretical foundation but is
the current state-of-the-art noise schedule.
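As a worked illustration of the stated equivalence (using the standard unit-stationary OU parameterization, which may differ from the note's conventions): the OU process and its transition kernel are

$$\mathrm{d}X_t = -X_t\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t, \qquad X_t \mid X_0 \sim \mathcal{N}\!\big(e^{-t}X_0,\,(1 - e^{-2t})\,I\big),$$

which matches the DDPM marginal $x_k \mid x_0 \sim \mathcal{N}(\sqrt{\bar\alpha_k}\,x_0,\,(1-\bar\alpha_k)\,I)$ when the process is observed at times $t_k = -\tfrac{1}{2}\log\bar\alpha_k$; choosing a noise schedule $\{\bar\alpha_k\}$ is therefore equivalent to choosing observation times $\{t_k\}$.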
Probabilistic Copyright Protection Can Fail for Text-to-Image Generative Models
November 29, 2023
Xiang Li, Qianli Shen, Kenji Kawaguchi
cs.CR, cs.AI, cs.CV, cs.MM
The booming use of text-to-image generative models has raised concerns about
their high risk of producing copyright-infringing content. While probabilistic
copyright protection methods provide a probabilistic guarantee against such
infringement, in this paper, we introduce Virtually Assured Amplification
Attack (VA3), a novel online attack framework that exposes the vulnerabilities
of these protection mechanisms. The proposed framework significantly amplifies
the probability of generating infringing content through sustained interactions
with generative models, with a lower-bounded success probability for each
engagement. Our theoretical and experimental results demonstrate the
effectiveness of our approach and highlight the potential risk of implementing
probabilistic copyright protection in practical applications of text-to-image
generative models. Code is available at https://github.com/South7X/VA3.
HiDiffusion: Unlocking High-Resolution Creativity and Efficiency in Low-Resolution Trained Diffusion Models
November 29, 2023
Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Zhenyuan Chen, Yao Tang, Yuhao Chen, Wengang Cao, Jiajun Liang
We introduce HiDiffusion, a tuning-free framework comprised of
Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head
Self-Attention (MSW-MSA) to enable pretrained large text-to-image diffusion
models to efficiently generate high-resolution images (e.g. 1024$\times$1024)
that surpass the training image resolution. Pretrained diffusion models
suffer from unreasonable object duplication when generating images beyond the
training image resolution. We attribute this to the mismatch between the feature
map size of high-resolution images and the receptive field of U-Net’s
convolution. To address this issue, we propose a simple yet scalable method
named RAU-Net. RAU-Net dynamically adjusts the feature map size to match the
convolution’s receptive field in the deep block of U-Net. Another obstacle in
high-resolution synthesis is the slow inference speed of U-Net. Our
observations reveal that the global self-attention in the top block, despite
exhibiting locality, consumes the majority of computational resources.
To tackle this issue, we propose MSW-MSA. Unlike previous window attention
mechanisms, our method uses a much larger window size and dynamically shifts
windows to better accommodate diffusion models. Extensive experiments
demonstrate that our HiDiffusion can scale diffusion models to generate
1024$\times$1024, 2048$\times$2048, or even 4096$\times$4096 resolution images,
while simultaneously reducing inference time by 40\%-60\%, achieving
state-of-the-art performance on high-resolution image synthesis. The most
significant revelation of our work is that a pretrained diffusion model on
low-resolution images is scalable for high-resolution generation without
further tuning. We hope this revelation can provide insights for future
research on the scalability of diffusion models.
MMA-Diffusion: MultiModal Attack on Diffusion Models
November 29, 2023
Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu
In recent years, Text-to-Image (T2I) models have seen remarkable
advancements, gaining widespread adoption. However, this progress has
inadvertently opened avenues for potential misuse, particularly in generating
inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces
MMA-Diffusion, a framework that presents a significant and realistic threat to
the security of T2I models by effectively circumventing current defensive
measures in both open-source models and commercial online services. Unlike
previous approaches, MMA-Diffusion leverages both textual and visual modalities
to bypass safeguards like prompt filters and post-hoc safety checkers, thus
exposing and highlighting the vulnerabilities in existing defense mechanisms.
When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
November 29, 2023
Xiaoming Li, Xinyu Hou, Chen Change Loy
Text-to-image diffusion models have remarkably excelled in producing diverse,
high-quality, and photo-realistic images. This advancement has spurred a
growing interest in incorporating specific identities into generated content.
Most current methods employ an inversion approach to embed a target visual
concept into the text embedding space using a single reference image. However,
the newly synthesized faces either closely resemble the reference image in
terms of facial attributes, such as expression, or exhibit a reduced capacity
for identity preservation. Text descriptions intended to guide the facial
attributes of the synthesized face may fall short, owing to the intricate
entanglement of identity information with identity-irrelevant facial attributes
derived from the reference image. To address these issues, we present the novel
use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve
enhanced identity preservation and disentanglement for diffusion models. By
aligning this semantically meaningful human face latent space with
text-to-image diffusion models, we succeed in maintaining high fidelity in
identity preservation, coupled with the capacity for semantic editing.
Additionally, we propose new training objectives to balance the influences of
both prompt and identity conditions, ensuring that the identity-irrelevant
background remains unaffected during facial attribute modifications. Extensive
experiments reveal that our method adeptly generates personalized text-to-image
outputs that are not only compatible with prompt descriptions but also amenable
to common StyleGAN editing directions in diverse settings. Our source code will
be available at \url{https://github.com/csxmli2016/w-plus-adapter}.
DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model
November 29, 2023
Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, Hesheng Wang
Scene flow estimation, which aims to predict per-point 3D displacements of
dynamic scenes, is a fundamental task in the computer vision field. However,
previous works commonly suffer from unreliable correlation caused by locally
constrained searching ranges, and struggle with accumulated inaccuracy arising
from the coarse-to-fine structure. To alleviate these problems, we propose a
novel uncertainty-aware scene flow estimation network (DifFlow3D) with the
diffusion probabilistic model. Iterative diffusion-based refinement is designed
to enhance the correlation robustness and resilience to challenging cases,
e.g., dynamics, noisy inputs, repetitive patterns, etc. To restrain the
generation diversity, three key flow-related features are leveraged as
conditions in our diffusion model. Furthermore, we also develop an uncertainty
estimation module within diffusion to evaluate the reliability of estimated
scene flow. Our DifFlow3D achieves state-of-the-art performance, with 6.7\% and
19.1\% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets.
Notably, our method achieves an unprecedented millimeter-level accuracy
(0.0089m in EPE3D) on the KITTI dataset. Additionally, our diffusion-based
refinement paradigm can be readily integrated as a plug-and-play module into
existing scene flow networks, significantly increasing their estimation
accuracy. Codes will be released later.
Image Inpainting via Tractable Steering of Diffusion Models
November 28, 2023
Anji Liu, Mathias Niepert, Guy Van den Broeck
Diffusion models are the current state of the art for generating
photorealistic images. Controlling the sampling process for constrained image
generation tasks such as inpainting, however, remains challenging since exact
conditioning on such constraints is intractable. While existing methods use
various techniques to approximate the constrained posterior, this paper
proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to
exactly and efficiently compute the constrained posterior, and to leverage this
signal to steer the denoising process of diffusion models. Specifically, this
paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs).
Building upon prior advances, we further scale up PCs and make them capable of
guiding the image generation process of diffusion models. Empirical results
suggest that our approach can consistently improve the overall quality and
semantic coherence of inpainted images across three natural image datasets
(i.e., CelebA-HQ, ImageNet, and LSUN) with only ~10% additional computational
overhead brought by the TPM. Further, with the help of an image encoder and
decoder, our method can readily accept semantic constraints on specific regions
of the image, which opens up the potential for more controlled image generation
tasks. In addition to proposing a new framework for constrained image
generation, this paper highlights the benefit of more tractable models and
motivates the development of expressive TPMs.
Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
November 28, 2023
Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle
We propose CompFuser, an image generation pipeline that enhances spatial
comprehension and attribute assignment in text-to-image generative models. Our
pipeline enables the interpretation of instructions defining spatial
relationships between objects in a scene, such as `An image of a gray cat on
the left of an orange dog’, and generates corresponding images. This is
especially important in order to provide more control to the user. CompFuser
overcomes the limitation of existing text-to-image diffusion models by decoding
the generation of multiple objects into iterative steps: first generating a
single object and then editing the image by placing additional objects in their
designated positions. To create training data for spatial comprehension and
attribute assignment, we introduce a synthetic data generation process that
leverages a frozen large language model and a frozen layout-based diffusion
model for object placement. We compare our approach to strong baselines and
show that our model outperforms state-of-the-art image generation models in
spatial comprehension and attribute assignment, despite being 3x to 5x smaller
in parameters.
Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry…for now
November 28, 2023
Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad
cs.CV, cs.AI, cs.GR, cs.LG
Generative models can produce impressively realistic images. This paper
demonstrates that generated images have geometric features different from those
of real images. We build a set of collections of generated images, prequalified
to fool simple, signal-based classifiers into believing they are real. We then
show that prequalified generated images can be identified reliably by
classifiers that only look at geometric properties. We use three such
classifiers. All three classifiers are denied access to image pixels, and look
only at derived geometric features. The first classifier looks at the
perspective field of the image, the second looks at lines detected in the
image, and the third looks at relations between detected objects and shadows.
Our procedure detects generated images more reliably than SOTA local signal
based detectors, for images from a number of distinct generators. Saliency maps
suggest that the classifiers can identify geometric problems reliably. We
conclude that current generators cannot reliably reproduce geometric properties
of real images.
DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models
November 28, 2023
Tsun-Hsuan Wang, Juntian Zheng, Pingchuan Ma, Yilun Du, Byungchul Kim, Andrew Spielberg, Joshua Tenenbaum, Chuang Gan, Daniela Rus
cs.RO, cs.AI, cs.CV, cs.LG
Nature evolves creatures with a high complexity of morphological and
behavioral intelligence, while computational methods lag in approaching
that diversity and efficacy. Co-optimization of artificial creatures’
morphology and control in silico shows promise for applications in physical
soft robotics and virtual character creation; such approaches, however, require
developing new learning algorithms that can reason about function atop pure
structure. In this paper, we present DiffuseBot, a physics-augmented diffusion
model that generates soft robot morphologies capable of excelling in a wide
spectrum of tasks. DiffuseBot bridges the gap between virtually generated
content and physical utility by (i) augmenting the diffusion process with a
physical dynamical simulation which provides a certificate of performance, and
(ii) introducing a co-design procedure that jointly optimizes physical design
and control by leveraging information about physical sensitivities from
differentiable simulation. We showcase a range of simulated and fabricated
robots along with their capabilities. Check our website at
https://diffusebot.github.io/
Adversarial Diffusion Distillation
November 28, 2023
Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach
We introduce Adversarial Diffusion Distillation (ADD), a novel training
approach that efficiently samples large-scale foundational image diffusion
models in just 1-4 steps while maintaining high image quality. We use score
distillation to leverage large-scale off-the-shelf image diffusion models as a
teacher signal in combination with an adversarial loss to ensure high image
fidelity even in the low-step regime of one or two sampling steps. Our analyses
show that our model clearly outperforms existing few-step methods (GANs, Latent
Consistency Models) in a single step and reaches the performance of
state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first
method to unlock single-step, real-time image synthesis with foundation models.
Code and weights available under
https://github.com/Stability-AI/generative-models and
https://huggingface.co/stabilityai/.
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
November 28, 2023
Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel
We present a new method for text-driven motion transfer - synthesizing a
video that complies with an input text prompt describing the target objects and
scene while maintaining an input video’s motion and scene layout. Prior methods
are confined to transferring motion across two subjects within the same or
closely related object categories and are applicable for limited domains (e.g.,
humans). In this work, we consider a significantly more challenging setting in
which the target and source objects differ drastically in shape and
fine-grained motion characteristics (e.g., translating a jumping dog into a
dolphin). To this end, we leverage a pre-trained and fixed text-to-video
diffusion model, which provides us with generative and motion priors. The
pillar of our method is a new space-time feature loss derived directly from the
model. This loss guides the generation process to preserve the overall motion
of the input video while complying with the target object in terms of shape and
fine-grained motion traits.
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
November 28, 2023
Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou
Existing text-to-image (T2I) diffusion models usually struggle to interpret
complex prompts, especially those with quantity, object-attribute
binding, and multi-subject descriptions. In this work, we introduce a semantic
panel as the middleware in decoding texts to images, supporting the generator
to better follow instructions. The panel is obtained by arranging the visual
concepts parsed from the input text with the aid of large language models, and
is then injected into the denoising network as a detailed control signal to
complement the text condition. To facilitate text-to-panel learning, we come up
with a carefully designed semantic formatting protocol, accompanied by a
fully-automatic data preparation pipeline. Thanks to such a design, our
approach, which we call Ranni, manages to enhance a pre-trained T2I generator
regarding its textual controllability. More importantly, the introduction of
the generative middleware brings a more convenient form of interaction (i.e.,
directly adjusting the elements in the panel or using language instructions)
and further allows users to finely customize their generation, based on which
we develop a practical system and showcase its potential in continuous
generation and chatting-based editing. Our project page is at
https://ranni-t2i.github.io/Ranni.
November 28, 2023
Chen Zhao, Weiling Cai, Chenyu Dong, Chengwei Hu
Underwater images are subject to intricate and diverse degradation,
inevitably affecting the effectiveness of underwater visual tasks. However,
most approaches primarily operate in the raw pixel space of images, which
limits the exploration of the frequency characteristics of underwater images,
leading to an inadequate utilization of deep models’ representational
capabilities in producing high-quality images. In this paper, we introduce a
novel Underwater Image Enhancement (UIE) framework, named WF-Diff, designed to
fully leverage the characteristics of frequency domain information and
diffusion models. WF-Diff consists of two detachable networks: Wavelet-based
Fourier information interaction network (WFI2-net) and Frequency Residual
Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency
domain information, WFI2-net aims to achieve preliminary enhancement of
frequency information in the wavelet space. Our proposed FRDAM can further
refine the high- and low-frequency information of the initial enhanced images,
which can be viewed as a plug-and-play universal module to adjust the detail of
the underwater images. With the above techniques, our algorithm achieves SOTA
performance on real-world underwater image datasets and competitive
performance in visual quality.
Denoising Diffusion Probabilistic Models for Image Inpainting of Cell Distributions in the Human Brain
November 28, 2023
Jan-Oliver Kropp, Christian Schiffer, Katrin Amunts, Timo Dickscheid
Recent advances in imaging and high-performance computing have made it
possible to image the entire human brain at the cellular level. This is the
basis to study the multi-scale architecture of the brain regarding its
subdivision into brain areas and nuclei, cortical layers, columns, and cell
clusters down to single-cell morphology. Methods for brain mapping and cell
segmentation exploit such images to enable rapid and automated analysis of
cytoarchitecture and cell distribution in complete series of histological
sections. However, the presence of inevitable processing artifacts in the image
data caused by missing sections, tears in the tissue, or staining variations
remains the primary reason for gaps in the resulting image data. To this end we
aim to provide a model that can fill in missing information in a reliable way,
following the true cell distribution at different scales. Inspired by the
recent success in image generation, we propose a denoising diffusion
probabilistic model (DDPM), trained on light-microscopic scans of cell-body
stained sections. We extend this model with the RePaint method to impute
missing or replace corrupted image data. We show that our trained DDPM is able
to generate highly realistic image information for this purpose, generating
plausible cell statistics and cytoarchitectonic patterns. We validate its
outputs using two established downstream task models trained on the same data.
Robust Diffusion GAN using Semi-Unbalanced Optimal Transport
November 28, 2023
Quan Dao, Binh Ta, Tung Pham, Anh Tran
Diffusion models, a type of generative model, have demonstrated great
potential for synthesizing highly detailed images. By integrating with GAN,
advanced diffusion models like DDGAN \citep{xiao2022DDGAN} could approach
real-time performance for expansive practical applications. While DDGAN has
effectively addressed the challenges of generative modeling, namely producing
high-quality samples, covering different data modes, and achieving faster
sampling, it remains susceptible to performance drops caused by datasets that
are corrupted with outlier samples. This work introduces a robust training
technique based on semi-unbalanced optimal transport to mitigate the impact of
outliers effectively. Through comprehensive evaluations, we demonstrate that
our robust diffusion GAN (RDGAN) outperforms vanilla DDGAN in terms of the
aforementioned generative modeling criteria, i.e., image quality, mode coverage
of distribution, and inference speed, and exhibits improved robustness when
dealing with both clean and corrupted datasets.
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
November 28, 2023
Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou
The deployment of large-scale text-to-image diffusion models on mobile
devices is impeded by their substantial model size and slow inference speed. In
this paper, we propose \textbf{MobileDiffusion}, a highly efficient
text-to-image diffusion model obtained through extensive optimizations in both
architecture and sampling techniques. We conduct a comprehensive examination of
model architecture design to reduce redundancy, enhance computational
efficiency, and minimize the model’s parameter count, while preserving image
generation quality. Additionally, we employ distillation and diffusion-GAN
finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference
respectively. Empirical studies, conducted both quantitatively and
qualitatively, demonstrate the effectiveness of our proposed techniques.
MobileDiffusion achieves a remarkable \textbf{sub-second} inference speed for
generating a $512\times512$ image on mobile devices, establishing a new state
of the art.
DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser
November 28, 2023
Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen
Speech-driven 3D facial animation has been an attractive task in both
academia and industry. Traditional methods mostly focus on learning a
deterministic mapping from speech to animation. Recent approaches start to
consider the non-deterministic fact of speech-driven 3D face animation and
employ the diffusion model for the task. However, personalizing facial
animation and accelerating animation generation are still two major limitations
of existing diffusion-based methods. To address the above limitations, we
propose DiffusionTalker, a diffusion-based method that utilizes contrastive
learning to personalize 3D facial animation and knowledge distillation to
accelerate 3D animation generation. Specifically, to enable personalization, we
introduce a learnable talking identity to aggregate knowledge in audio
sequences. The proposed identity embeddings extract customized facial cues
across different people in a contrastive learning manner. During inference,
users can obtain personalized facial animation based on input audio, reflecting
a specific talking style. Given a trained diffusion model with hundreds of
steps, we distill it into a lightweight model with 8 steps for acceleration.
Extensive experiments are conducted to demonstrate that our method outperforms
state-of-the-art methods. The code will be released.
Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance
November 28, 2023
Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He
Flow matching as a paradigm of generative model achieves notable success
across various domains. However, existing methods use either multi-round
training or knowledge within minibatches, posing challenges in finding a
favorable coupling strategy for straight trajectories. To address this issue,
we propose a novel approach, Straighter trajectories of Flow Matching
(StraightFM). It straightens trajectories with a coupling strategy guided by a
diffusion model at the level of the entire distribution. First, we propose a coupling
strategy to straighten trajectories, creating couplings between image and noise
samples under diffusion model guidance. Second, StraightFM also integrates real
data to enhance training, employing a neural network to parameterize another
coupling process from images to noise samples. StraightFM is jointly optimized
with couplings from above two mutually complementary directions, resulting in
straighter trajectories and enabling both one-step and few-step generation.
Extensive experiments demonstrate that StraightFM yields high quality samples
with fewer steps. StraightFM generates visually appealing images with a lower
FID than diffusion and traditional flow matching methods within 5 sampling
steps when trained in pixel space. In the latent space (i.e., Latent
Diffusion), StraightFM achieves a lower KID value compared to existing methods
on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
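For context, the coupling strategy the abstract refers to plugs into the standard conditional flow-matching objective on straight-line paths; a minimal PyTorch sketch is given below, where the diffusion-guided pairing of noise with data (StraightFM's actual contribution) is abstracted away into how x0 and x1 are paired before the call.

    import torch

    def flow_matching_loss(v_net, x1, x0):
        """Conditional flow matching on straight-line paths x_t = (1-t) x0 + t x1.
        v_net(x_t, t) predicts a velocity field; on a straight path the regression
        target is the constant displacement x1 - x0. How x0 (noise) is paired with
        x1 (data) is exactly the coupling strategy the paper studies."""
        t = torch.rand(x1.shape[0], device=x1.device).view(-1, *[1] * (x1.dim() - 1))
        x_t = (1 - t) * x0 + t * x1
        target = x1 - x0
        return ((v_net(x_t, t.flatten()) - target) ** 2).mean()
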
Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks
November 28, 2023
Ye Lin Tun, Chu Myaet Thwal, Ji Su Yoon, Sun Moo Kang, Chaoning Zhang, Choong Seon Hong
Diffusion models have shown great potential for vision-related tasks,
particularly for image generation. However, their training is typically
conducted in a centralized manner, relying on data collected from publicly
available sources. This approach may not be feasible or practical in many
domains, such as the medical field, which involves privacy concerns over data
collection. Despite the challenges associated with privacy-sensitive data, such
domains could still benefit from valuable vision services provided by diffusion
models. Federated learning (FL) plays a crucial role in enabling decentralized
model training without compromising data privacy. Instead of collecting data,
an FL system gathers model parameters, effectively safeguarding the private
data of different parties involved. This makes FL systems vital for managing
decentralized learning tasks, especially in scenarios where privacy-sensitive
data is distributed across a network of clients. Nonetheless, FL presents its
own set of challenges due to its distributed nature and privacy-preserving
properties. Therefore, in this study, we explore the FL strategy to train
diffusion models, paving the way for the development of federated diffusion
models. We conduct experiments on various FL scenarios, and our findings
demonstrate that federated diffusion models have great potential to deliver
vision services to privacy-sensitive domains.
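A minimal sketch of the kind of federated round such a study builds on, assuming each client locally minimizes the usual epsilon-prediction loss and only parameters are exchanged; diffusion_loss is a hypothetical helper, and the exact FL strategy in the paper may differ.

    import copy
    import torch

    def fedavg_round(global_model, client_loaders, local_steps=100, lr=1e-4):
        """One FedAvg round for a noise-prediction diffusion model (sketch).
        Clients share only parameters, never images. diffusion_loss stands for
        the standard epsilon-prediction objective computed on a batch."""
        client_states, weights = [], []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.Adam(local.parameters(), lr=lr)
            for _, batch in zip(range(local_steps), loader):
                loss = diffusion_loss(local, batch)  # hypothetical epsilon-MSE helper
                opt.zero_grad()
                loss.backward()
                opt.step()
            client_states.append(local.state_dict())
            weights.append(len(loader.dataset))
        # weighted parameter average, proportional to client dataset sizes
        total = sum(weights)
        avg = {}
        for k in client_states[0]:
            if client_states[0][k].is_floating_point():
                avg[k] = sum(w / total * s[k] for w, s in zip(weights, client_states))
            else:
                avg[k] = client_states[0][k]
        global_model.load_state_dict(avg)
        return global_model
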
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
November 28, 2023
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
The diffusion model has been proven a powerful generative model in recent
years, yet it still struggles to generate visual text. Several methods
alleviated this issue by incorporating explicit text position and content as
guidance on where and what text to render. However, these methods still suffer
from several drawbacks, such as limited flexibility and automation, constrained
capability of layout prediction, and restricted style diversity. In this paper,
we present TextDiffuser-2, aiming to unleash the power of language models for
text rendering. Firstly, we fine-tune a large language model for layout
planning. The large language model is capable of automatically generating
keywords for text rendering and also supports layout modification through
chatting. Secondly, we utilize the language model within the diffusion model to
encode the position and texts at the line level. Unlike previous methods that
employed tight character-level guidance, this approach generates more diverse
text images. We conduct extensive experiments and incorporate user studies
involving human participants as well as GPT-4V, validating TextDiffuser-2’s
capacity to achieve a more rational text layout and generation with enhanced
diversity. The code and model will be available at
\url{https://aka.ms/textdiffuser-2}.
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
November 28, 2023
Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve the generation
problem related to language culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we propose a
simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of the UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
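A hedged sketch of the adapter idea as described in the abstract: a small MLP maps non-English text features into the conditioning space of a frozen English UNet and is trained by matching the frozen UNet's noise prediction under English teacher prompts on a parallel corpus. All module and argument names are placeholders, not the released API.

    import torch
    import torch.nn as nn

    class PEAdapter(nn.Module):
        """Lightweight MLP adapter (sketch) mapping non-English text features
        into the conditioning space expected by a frozen English UNet."""
        def __init__(self, dim_in, dim_out, hidden=1024):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim_in, hidden), nn.GELU(),
                                     nn.Linear(hidden, dim_out))

        def forward(self, text_feats):
            return self.net(text_feats)

    def distill_step(adapter, unet, en_prompt_feats, xx_prompt_feats, x_t, t):
        """One hedged distillation step: match the frozen UNet's noise prediction
        under adapted non-English conditioning to its prediction under the
        English teacher conditioning (parallel prompts)."""
        with torch.no_grad():
            teacher = unet(x_t, t, en_prompt_feats)       # English teacher output
        student = unet(x_t, t, adapter(xx_prompt_feats))  # frozen UNet, adapted cond.
        return ((student - teacher) ** 2).mean()
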
Manifold Preserving Guided Diffusion
November 28, 2023
Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon
Despite the recent advancements, conditional image generation still faces
challenges of cost, generalizability, and the need for task-specific training.
In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a
training-free conditional generation framework that leverages pretrained
diffusion models and off-the-shelf neural networks with minimal additional
inference cost for a broad range of tasks. Specifically, we leverage the
manifold hypothesis to refine the guided diffusion steps and introduce a
shortcut algorithm in the process. We then propose two methods for on-manifold
training-free guidance using pre-trained autoencoders and demonstrate that our
shortcut inherently preserves the manifolds when applied to latent diffusion
models. Our experiments show that MPGD is efficient and effective for solving a
variety of conditional generation applications in low-compute settings, and can
consistently offer up to 3.8x speed-ups with the same number of diffusion steps
while maintaining high sample quality compared to the baselines.
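Reading the abstract, one plausible form of on-manifold, training-free guidance is sketched below: the guidance gradient is applied to the clean estimate and the result is re-projected through a pretrained autoencoder. This is an interpretation for illustration only; the paper's two concrete methods and its shortcut algorithm are not reproduced here.

    import torch

    def manifold_guided_step(x0_hat, loss_fn, encoder, decoder, step_size=0.1):
        """Hedged sketch of on-manifold guidance: nudge the clean estimate x0_hat
        with a task loss gradient, then re-project it through a pretrained
        autoencoder so the update stays near the data manifold."""
        x = x0_hat.detach().requires_grad_(True)
        loss = loss_fn(x)                       # scalar task loss, e.g. measurement error
        grad, = torch.autograd.grad(loss, x)
        x_guided = x - step_size * grad         # guidance applied to the clean estimate
        with torch.no_grad():
            return decoder(encoder(x_guided))   # autoencoder re-projection
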
Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback
November 27, 2023
Mihir Prabhudesai, Tsung-Wei Ke, Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki
cs.CV, cs.AI, cs.LG, cs.RO
The advancements in generative modeling, particularly the advent of diffusion
models, have sparked a fundamental question: how can these models be
effectively used for discriminative tasks? In this work, we find that
generative models can be great test-time adapters for discriminative models.
Our method, Diffusion-TTA, adapts pre-trained discriminative models such as
image classifiers, segmenters and depth predictors, to each unlabelled example
in the test set using generative feedback from a diffusion model. We achieve
this by modulating the conditioning of the diffusion model using the output of
the discriminative model. We then maximize the image likelihood objective by
backpropagating the gradients to the discriminative model’s parameters. We show
Diffusion-TTA significantly enhances the accuracy of various large-scale
pre-trained discriminative models, such as, ImageNet classifiers, CLIP models,
image pixel labellers and image depth predictors. Diffusion-TTA outperforms
existing test-time adaptation methods, including TTT-MAE and TENT, and
particularly shines in online adaptation setups, where the discriminative model
is continually adapted to each example in the test set. We provide access to
code, results, and visualizations on our website:
https://diffusion-tta.github.io/.
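A minimal sketch of the adaptation loop as the abstract describes it: the classifier's softmax output modulates the diffusion model's class conditioning, and the denoising loss is backpropagated into the classifier for each test image. Shapes, the soft-conditioning construction, and all names are assumptions.

    import torch
    import torch.nn.functional as F

    def diffusion_tta_step(classifier, diffusion, class_embeds, image, optimizer,
                           alphas_cumprod, n_samples=4):
        """One test-time adaptation step (sketch): condition the diffusion model on
        the classifier's predicted class distribution and minimize the denoising
        loss with respect to the classifier parameters."""
        probs = F.softmax(classifier(image), dim=-1)   # (B, C)
        cond = probs @ class_embeds                    # soft class conditioning (B, D)
        loss = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, len(alphas_cumprod), (image.shape[0],), device=image.device)
            a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
            noise = torch.randn_like(image)
            x_t = a_bar.sqrt() * image + (1 - a_bar).sqrt() * noise
            loss = loss + ((diffusion(x_t, t, cond) - noise) ** 2).mean()
        optimizer.zero_grad()
        (loss / n_samples).backward()
        optimizer.step()
        return probs.detach()
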
Self-correcting LLM-controlled Diffusion Models
November 27, 2023
Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell
Text-to-image generation has witnessed significant progress with the advent
of diffusion models. Despite the ability to generate photorealistic images,
current text-to-image diffusion models still often struggle to accurately
interpret and follow complex input text prompts. In contrast to existing models
that aim to generate images only with their best effort, we introduce
Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that
generates an image from the input prompt, assesses its alignment with the
prompt, and performs self-corrections on the inaccuracies in the generated
image. Steered by an LLM controller, SLD turns text-to-image generation into an
iterative closed-loop process, ensuring correctness in the resulting image. SLD
is not only training-free but can also be seamlessly integrated with diffusion
models behind API access, such as DALL-E 3, to further boost the performance of
state-of-the-art diffusion models. Experimental results show that our approach
can rectify a majority of incorrect generations, particularly in generative
numeracy, attribute binding, and spatial relationships. Furthermore, by simply
adjusting the instructions to the LLM, SLD can perform image editing tasks,
bridging the gap between text-to-image generation and image editing pipelines.
We will make our code available for future research and applications.
DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization
November 27, 2023
Zhaoyang Xia, Carol Neidle, Dimitris N. Metaxas
Since American Sign Language (ASL) has no standard written form, Deaf signers
frequently share videos in order to communicate in their native language.
However, since both hands and face convey critical linguistic information in
signed languages, sign language videos cannot preserve signer privacy. While
signers have expressed interest, for a variety of applications, in sign
language video anonymization that would effectively preserve linguistic
content, attempts to develop such technology have had limited success, given
the complexity of hand movements and facial expressions. Existing approaches
rely predominantly on precise pose estimations of the signer in video footage
and often require sign language video datasets for training. These requirements
prevent them from processing videos ‘in the wild,’ in part because of the
limited diversity present in current sign language video datasets. To address
these limitations, our research introduces DiffSLVA, a novel methodology that
utilizes pre-trained large-scale diffusion models for zero-shot text-guided
sign language video anonymization. We incorporate ControlNet, which leverages
low-level image features such as HED (Holistically-Nested Edge Detection)
edges, to circumvent the need for pose estimation. Additionally, we develop a
specialized module dedicated to capturing facial expressions, which are
critical for conveying essential linguistic information in signed languages. We
then combine the above methods to achieve anonymization that better preserves
the essential linguistic content of the original signer. This innovative
methodology makes possible, for the first time, sign language video
anonymization that could be used for real-world applications, which would offer
significant benefits to the Deaf and Hard-of-Hearing communities. We
demonstrate the effectiveness of our approach with a series of signer
anonymization experiments.
Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation
November 27, 2023
Teo Deveney, Jan Stanczuk, Lisa Maria Kreusser, Chris Budd, Carola-Bibiane Schönlieb
cs.LG, cs.NA, math.NA, stat.ML
Score-based diffusion models have emerged as one of the most promising
frameworks for deep generative modelling, due to their state-of-the art
performance in many generation tasks while relying on mathematical foundations
such as stochastic differential equations (SDEs) and ordinary differential
equations (ODEs). Empirically, it has been reported that ODE based samples are
inferior to SDE based samples. In this paper we rigorously describe the range
of dynamics and approximations that arise when training score-based diffusion
models, including the true SDE dynamics, the neural approximations, the various
approximate particle dynamics that result, as well as their associated
Fokker–Planck equations and the neural network approximations of these
Fokker–Planck equations. We systematically analyse the difference between the
ODE and SDE dynamics of score-based diffusion models, and link it to an
associated Fokker–Planck equation. We derive a theoretical upper bound on the
Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms
of a Fokker–Planck residual. We also show numerically that conventional
score-based diffusion models can exhibit significant differences between ODE-
and SDE-induced distributions which we demonstrate using explicit comparisons.
Moreover, we show numerically that reducing the Fokker–Planck residual by
adding it as an additional regularisation term leads to closing the gap between
ODE- and SDE-induced distributions. Our experiments suggest that this
regularisation can improve the distribution generated by the ODE, although this
can come at the cost of degraded SDE sample quality.
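For reference, in a standard variance-preserving form (which may differ in detail from the paper's setup), the forward SDE $\mathrm{d}x = -\tfrac{1}{2}\beta(t)\,x\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}W_t$ has marginal densities satisfying the Fokker-Planck equation

    $$\partial_t p_t(x) \;=\; \nabla\cdot\!\Big(\tfrac{1}{2}\beta(t)\,x\,p_t(x)\Big) \;+\; \tfrac{1}{2}\beta(t)\,\Delta p_t(x).$$

Substituting the density implied by the learned score $s_\theta(x,t) \approx \nabla_x \log p_t(x)$ into this identity and measuring the mismatch gives the Fokker-Planck residual that the abstract proposes to penalise.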
DiffAnt: Diffusion Models for Action Anticipation
November 27, 2023
Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer
Anticipating future actions is inherently uncertain. Given an observed video
segment containing ongoing actions, multiple subsequent actions can plausibly
follow. This uncertainty becomes even larger when predicting far into the
future. However, the majority of existing action anticipation models adhere to
a deterministic approach, neglecting to account for future uncertainties. In
this work, we rethink action anticipation from a generative view, employing
diffusion models to capture different possible future actions. In this
framework, future actions are iteratively generated from standard Gaussian
noise in the latent space, conditioned on the observed video, and subsequently
transitioned into the action space. Extensive experiments on four benchmark
datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are
performed and the proposed method achieves superior or comparable results to
state-of-the-art methods, showing the effectiveness of a generative approach
for action anticipation. Our code and trained models will be published on
GitHub.
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
November 27, 2023
Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, Xianglong Liu
The Diffusion model, a prevalent framework for image generation, encounters
significant challenges in terms of broad applicability due to its extended
inference times and substantial memory requirements. Efficient Post-training
Quantization (PTQ) is pivotal for addressing these issues in traditional
models. Different from traditional models, diffusion models heavily depend on
the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$
from the finite set $\{1, \ldots, T\}$ is encoded into a temporal feature by a
few modules that are entirely independent of the sampled data. However, existing PTQ
methods do not optimize these modules separately. They adopt inappropriate
reconstruction targets and complex calibration methods, resulting in a severe
disturbance of the temporal feature and denoising trajectory, as well as a low
compression efficiency. To solve these issues, we propose a Temporal Feature
Maintenance Quantization (TFMQ) framework building upon a Temporal Information
Block that depends only on the time-step $t$ and not on the sampling
data. Powered by the pioneering block design, we devise temporal information
aware reconstruction (TIAR) and finite set calibration (FSC) to align the
full-precision temporal features in a limited time. Equipped with the
framework, we can maintain most of the temporal information and ensure the
end-to-end generation quality. Extensive experiments on various datasets and
diffusion models prove our state-of-the-art results. Remarkably, our
quantization approach, for the first time, achieves model performance nearly on
par with the full-precision model under 4-bit weight quantization.
Additionally, our method incurs almost no extra computational cost and
accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$
compared to previous works.
Regularization by Texts for Latent Diffusion Inverse Solvers
November 27, 2023
Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, Jong Chul Ye
The recent advent of diffusion models has led to significant progress in
solving inverse problems, leveraging these models as effective generative
priors. Nonetheless, challenges related to the ill-posed nature of such
problems remain, often due to inherent ambiguities in measurements. Drawing
inspiration from the human ability to resolve visual ambiguities through
perceptual biases, here we introduce a novel latent diffusion inverse solver by
incorporating regularization by texts (TReg). Specifically, TReg applies the
textual description of the preconception of the solution during the reverse
sampling phase, and this description is dynamically reinforced through
null-text optimization for adaptive negation. Our comprehensive experimental
results demonstrate that TReg successfully mitigates ambiguity in latent
diffusion inverse solvers, enhancing their effectiveness and accuracy.
LFSRDiff: Light Field Image Super-Resolution via Diffusion Models
November 27, 2023
Wentao Chao, Fuqing Duan, Xuechun Wang, Yingqian Wang, Guanghui Wang
Light field (LF) image super-resolution (SR) is a challenging problem due to
its inherent ill-posed nature, where a single low-resolution (LR) input LF
image can correspond to multiple potential super-resolved outcomes. Despite
this complexity, mainstream LF image SR methods typically adopt a deterministic
approach, generating only a single output supervised by pixel-wise loss
functions. This tendency often leads to blurry and unrealistic results.
Although diffusion models can capture the distribution of potential SR results
by iteratively predicting Gaussian noise during the denoising process, they are
primarily designed for general images and struggle to effectively handle the
unique characteristics and information present in LF images. To address these
limitations, we introduce LFSRDiff, the first diffusion-based LF image SR
model, by incorporating the LF disentanglement mechanism. Our novel
contribution includes the introduction of a disentangled U-Net for diffusion
models, enabling more effective extraction and fusion of both spatial and
angular information within LF images. Through comprehensive experimental
evaluations and comparisons with the state-of-the-art LF image SR methods, the
proposed approach consistently produces diverse and realistic SR results. It
achieves the best perceptual score in terms of LPIPS. It also demonstrates
the ability to effectively control the trade-off between perception and
distortion. The code is available at
\url{https://github.com/chaowentao/LFSRDiff}.
Functional Diffusion
November 26, 2023
Biao Zhang, Peter Wonka
We propose a new class of generative diffusion models, called functional
diffusion. In contrast to previous work, functional diffusion works on samples
that are represented by functions with a continuous domain. Functional
diffusion can be seen as an extension of classical diffusion models to an
infinite-dimensional domain. Functional diffusion is very versatile as images,
videos, audio, 3D shapes, deformations, etc., can be handled by the same
framework with minimal changes. In addition, functional diffusion is especially
suited for irregular data or data defined in non-standard domains. In our work,
we derive the necessary foundations for functional diffusion and propose a
first implementation based on the transformer architecture. We show generative
results on complicated signed distance functions and deformation functions
defined on 3D surfaces.
ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model
November 24, 2023
Eslam Mohamed Bakr, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny
Diffusion-based generative models excel in perceptually impressive synthesis
but face challenges in interpretability. This paper introduces
ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework
inspired by the human generation system. Unlike traditional diffusion models
with opaque denoising steps, our approach decomposes the generation process
into simpler, interpretable stages: generating contours, a palette, and a
detailed colored image. This not only enhances overall performance but also
enables robust editing and interaction capabilities. Each stage is meticulously
formulated for efficiency and accuracy, surpassing Stable-Diffusion (LDM).
Extensive experiments on datasets like LSUN-Churches and COCO validate our
approach, consistently outperforming existing methods. ToddlerDiffusion
achieves notable efficiency, matching LDM performance on LSUN-Churches while
operating three times faster with a 3.76 times smaller architecture. Our source
code is provided in the supplementary material and will be publicly accessible.
DemoFusion: Democratising High-Resolution Image Generation With No $$$
November 24, 2023
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma
High-resolution image generation with Generative Artificial Intelligence
(GenAI) has immense potential but, due to the enormous capital investment
required for training, it is increasingly centralised to a few large
corporations, and hidden behind paywalls. This paper aims to democratise
high-resolution GenAI by advancing the frontier of high-resolution generation
while remaining accessible to a broad audience. We demonstrate that existing
Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution
image generation. Our novel DemoFusion framework seamlessly extends open-source
GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated
Sampling mechanisms to achieve higher-resolution image generation. The
progressive nature of DemoFusion requires more passes, but the intermediate
results can serve as “previews”, facilitating rapid prompt iteration.
On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates
November 22, 2023
Stefano Bruno, Ying Zhang, Dong-Young Lim, Ömer Deniz Akyildiz, Sotirios Sabanis
cs.LG, math.OC, math.PR, stat.ML
We provide full theoretical guarantees for the convergence behaviour of
diffusion-based generative models under the assumption of strongly logconcave
data distributions while our approximating class of functions used for score
estimation is made of Lipschitz continuous functions. We demonstrate via a
motivating example, sampling from a Gaussian distribution with unknown mean,
the power of our approach. In this case, explicit estimates are provided
for the associated optimization problem, i.e. score approximation, while these
are combined with the corresponding sampling estimates. As a result, we obtain
the best known upper bound estimates in terms of key quantities of interest,
such as the dimension and rates of convergence, for the Wasserstein-2 distance
between the data distribution (Gaussian with unknown mean) and our sampling
algorithm.
Beyond the motivating example and in order to allow for the use of a diverse
range of stochastic optimizers, we present our results using an $L^2$-accurate
score estimation assumption, which crucially is formed under an expectation
with respect to the stochastic optimizer and our novel auxiliary process that
uses only known information. This approach yields the best known convergence
rate for our sampling algorithm.
DiffusionMat: Alpha Matting as Sequential Refinement Learning
November 22, 2023
Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo
In this paper, we introduce DiffusionMat, a novel image matting framework
that employs a diffusion model for the transition from coarse to refined alpha
mattes. Diverging from conventional methods that utilize trimaps merely as
loose guidance for alpha matte prediction, our approach treats image matting as
a sequential refinement learning process. This process begins with the addition
of noise to trimaps and iteratively denoises them using a pre-trained diffusion
model, which incrementally guides the prediction towards a clean alpha matte.
The key innovation of our framework is a correction module that adjusts the
output at each denoising step, ensuring that the final result is consistent
with the input image’s structures. We also introduce the Alpha Reliability
Propagation, a novel technique designed to maximize the utility of available
guidance by selectively enhancing the trimap regions with confident alpha
information, thus simplifying the correction task. To train the correction
module, we devise specialized loss functions that target the accuracy of the
alpha matte’s edges and the consistency of its opaque and transparent regions.
We evaluate our model across several image matting benchmarks, and the results
indicate that DiffusionMat consistently outperforms existing methods. Project
page at~\url{https://cnnlstm.github.io/DiffusionMat}.
Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure
November 22, 2023
Ian Dunn, David Ryan Koes
Diffusion generative models have emerged as a powerful framework for
addressing problems in structural biology and structure-based drug design.
These models operate directly on 3D molecular structures. Due to the
unfavorable scaling of graph neural networks (GNNs) with graph size as well as
the relatively slow inference speeds inherent to diffusion models, many
existing molecular diffusion models rely on coarse-grained representations of
protein structure to make training and inference feasible. However, such
coarse-grained representations discard essential information for modeling
molecular interactions and impair the quality of generated structures. In this
work, we present a novel GNN-based architecture for learning latent
representations of molecular structure. When trained end-to-end with a
diffusion model for de novo ligand design, our model achieves comparable
performance to one with an all-atom protein representation while exhibiting a
3-fold reduction in inference time.
Guided Flows for Generative Modeling and Decision Making
November 22, 2023
Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, Ricky T. Q. Chen
cs.LG, cs.AI, cs.CV, cs.RO, stat.ML
Classifier-free guidance is a key component for enhancing the performance of
conditional generative models across diverse tasks. While it has previously
demonstrated remarkable improvements in sample quality, it has so far been
employed exclusively for diffusion models. In this paper, we integrate
classifier-free guidance into Flow Matching (FM) models, an alternative
simulation-free approach that trains Continuous Normalizing Flows (CNFs) based
on regressing vector fields. We explore the usage of \emph{Guided Flows} for a
variety of downstream applications. We show that Guided Flows significantly
improves the sample quality in conditional image generation and zero-shot
text-to-speech synthesis, boasting state-of-the-art performance. Notably, we
are the first to apply flow models for plan generation in the offline
reinforcement learning setting, showcasing a 10x speedup in computation
compared to diffusion models while maintaining comparable performance.
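In its usual form, classifier-free guidance carries over to a flow-matching vector field as a linear combination of conditional and unconditional velocity predictions; a short sketch is given below, with an Euler ODE integrator and the convention that t=0 is noise and t=1 is data (both assumptions on my part, not the paper's exact recipe).

    import torch

    def guided_velocity(v_net, x_t, t, cond, null_cond, w=2.0):
        """Classifier-free guidance for a flow-matching vector field (sketch):
        v = v_uncond + w * (v_cond - v_uncond)."""
        v_cond = v_net(x_t, t, cond)
        v_uncond = v_net(x_t, t, null_cond)
        return v_uncond + w * (v_cond - v_uncond)

    @torch.no_grad()
    def sample_with_guided_flow(v_net, x0, cond, null_cond, n_steps=50, w=2.0):
        """Euler integration of the guided ODE dx/dt = v(x, t) from t=0 (noise) to t=1."""
        x, dt = x0, 1.0 / n_steps
        for i in range(n_steps):
            t = torch.full((x.shape[0],), i * dt, device=x.device)
            x = x + dt * guided_velocity(v_net, x, t, cond, null_cond, w)
        return x
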
Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution
November 22, 2023
Yuxuan Zhou, Liangcai Gao, Zhi Tang, Baole Wei
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and
legibility of text within low-resolution (LR) images, consequently elevating
recognition accuracy in Scene Text Recognition (STR). Previous methods
predominantly employ discriminative Convolutional Neural Networks (CNNs)
augmented with diverse forms of text guidance to address this issue.
Nevertheless, they remain deficient when confronted with severely blurred
images, due to their insufficient generation capability when little structural
or semantic information can be extracted from original images. Therefore, we
introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image
Super-Resolution, which exhibits great generative diversity and fidelity even
in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising
Network to guide the diffusion model toward LR-consistent results through
succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the
superiority of RGDiffSR over prior state-of-the-art methods in both text
recognition accuracy and image fidelity.
Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models
November 22, 2023
Mengyang Feng, Jinlin Liu, Miaomiao Cui, Xuansong Xie
This is a technical report on the 360-degree panoramic image generation task
based on diffusion models. Unlike ordinary 2D images, 360-degree panoramic
images capture the entire $360^\circ\times 180^\circ$ field of view. So the
rightmost and leftmost sides of the 360 panoramic image should be
continuous, which is the main challenge in this field. However, the current
diffusion pipeline is not appropriate for generating such a seamless 360-degree
panoramic image. To this end, we propose a circular blending strategy on both
the denoising and VAE decoding stages to maintain the geometry continuity.
Based on this, we present two models for \textbf{Text-to-360-panoramas} and
\textbf{Single-Image-to-360-panoramas} tasks. The code has been released as an
open-source project at
\href{https://github.com/ArcherFMY/SD-T2I-360PanoImage}{https://github.com/ArcherFMY/SD-T2I-360PanoImage}
and
\href{https://www.modelscope.cn/models/damo/cv_diffusion_text-to-360panorama-image_generation/summary}{ModelScope}
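One simple way to realise the circular blending described above (a hedged reading, not the released implementation): circularly pad the latent so the left edge reappears after the right edge, denoise with that wrapped context, and then blend the overlapping columns back into the left edge.

    import torch

    def circular_blend_step(denoise, latent, t, overlap=16):
        """One denoising step with circular blending (sketch): the latent is
        circularly padded so the left edge follows the right edge, the denoiser
        sees that wrapped context, and the overlapping columns are linearly
        blended back into the left edge to tie the two seams together."""
        padded = torch.cat([latent, latent[..., :overlap]], dim=-1)   # wrap left edge
        padded = denoise(padded, t)                                   # any denoiser step
        width = latent.shape[-1]
        main, wrapped = padded[..., :width], padded[..., width:]
        w = torch.linspace(0.0, 1.0, overlap, device=latent.device).view(1, 1, 1, -1)
        # near the seam (w -> 0) trust the wrapped copy, further in (w -> 1) the main one
        main[..., :overlap] = (1 - w) * wrapped + w * main[..., :overlap]
        return main
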
On the Limitation of Diffusion Models for Synthesizing Training Datasets
November 22, 2023
Shin'ya Yamaguchi, Takuma Fukuda
Synthetic samples from diffusion models are promising for leveraging in
training discriminative models as replications of real training datasets.
However, we found that the synthetic datasets degrade classification
performance over real datasets even when using state-of-the-art diffusion
models. This means that modern diffusion models do not perfectly represent the
data distribution for the purpose of replicating datasets for training
discriminative tasks. This paper investigates the gap between synthetic and
real samples by analyzing the synthetic samples reconstructed from real samples
through the diffusion and reverse process. By varying the time step at which
the reverse process starts in the reconstruction, we can control the trade-off between
the information in the original real data and the information added by
diffusion models. Through assessing the reconstructed samples and trained
models, we found that the synthetic data are concentrated in modes of the
training data distribution as the reverse step increases, and thus, they
struggle to cover the outer edges of the distribution. Our findings imply that
modern diffusion models are insufficient to replicate training data
distribution perfectly, and there is room for the improvement of generative
modeling in the replication of training datasets.
Diffusion Model Alignment Using Direct Preference Optimization
November 21, 2023
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
cs.CV, cs.AI, cs.GR, cs.LG
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better
aligned with users’ preferences. In contrast to LLMs, human preference learning
has not been widely explored in text-to-image diffusion models; the best
existing approach is to fine-tune a pretrained model using carefully curated
high quality images and captions to improve visual appeal and text alignment.
We propose Diffusion-DPO, a method to align diffusion models to human
preferences by directly optimizing on human comparison data. Diffusion-DPO is
adapted from the recently developed Direct Preference Optimization (DPO), a
simpler alternative to RLHF which directly optimizes a policy that best
satisfies human preferences under a classification objective. We re-formulate
DPO to account for a diffusion model notion of likelihood, utilizing the
evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic
dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model
of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with
Diffusion-DPO. Our fine-tuned base model significantly outperforms both base
SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement
model in human evaluation, improving visual appeal and prompt alignment. We
also develop a variant that uses AI feedback and has comparable performance to
training on human preferences, opening the door for scaling of diffusion model
alignment methods.
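For illustration, a DPO-style objective for an epsilon-prediction diffusion model commonly takes the shape below, where the implicit reward of each image is its reduction in denoising error relative to a frozen reference model. This is a hedged sketch, not the released training code; all names and the value of beta are assumptions.

    import torch
    import torch.nn.functional as F

    def diffusion_dpo_loss(model, ref_model, x_w, x_l, cond, t, noise, a_bar, beta=500.0):
        """DPO-style objective for diffusion (sketch): the preferred image x_w
        should obtain a larger denoising-error reduction (relative to a frozen
        reference model) than the dispreferred image x_l.
        a_bar is the cumulative schedule term, broadcastable as (B, 1, 1, 1)."""
        def eps_err(net, x0):
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
            return ((net(x_t, t, cond) - noise) ** 2).mean(dim=(1, 2, 3))
        with torch.no_grad():
            ref_w, ref_l = eps_err(ref_model, x_w), eps_err(ref_model, x_l)
        r_w = ref_w - eps_err(model, x_w)   # implicit reward of the preferred image
        r_l = ref_l - eps_err(model, x_l)   # implicit reward of the dispreferred image
        return -F.logsigmoid(beta * (r_w - r_l)).mean()
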
Stable Diffusion For Aerial Object Detection
November 21, 2023
Yanan Jian, Fuxun Yu, Simranjit Singh, Dimitrios Stamoulis
Aerial object detection is a challenging task, in which one major obstacle
lies in the limitations of large-scale data collection and the long-tail
distribution of certain classes. Synthetic data offers a promising solution,
especially with recent advances in diffusion-based methods like stable
diffusion (SD). However, the direct application of diffusion methods to aerial
domains poses unique challenges: stable diffusion’s optimization for rich
ground-level semantics doesn’t align with the sparse nature of aerial objects,
and the extraction of post-synthesis object coordinates remains problematic. To
address these challenges, we introduce a synthetic data augmentation framework
tailored for aerial images. It encompasses sparse-to-dense region of interest
(ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model
with low-rank adaptation (LORA) to circumvent exhaustive retraining, and
finally, a Copy-Paste method to compose synthesized objects with backgrounds,
providing a nuanced approach to aerial object detection through synthetic data.
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
November 20, 2023
Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau
We present a method to create interpretable concept sliders that enable
precise control over attributes in image generations from diffusion models. Our
approach identifies a low-rank parameter direction corresponding to one concept
while minimizing interference with other attributes. A slider is created using
a small set of prompts or sample images; thus slider directions can be created
for either textual or visual concepts. Concept Sliders are plug-and-play: they
can be composed efficiently and continuously modulated, enabling precise
control over image generation. In quantitative experiments comparing to
previous editing techniques, our sliders exhibit stronger targeted edits with
lower interference. We showcase sliders for weather, age, styles, and
expressions, as well as slider compositions. We show how sliders can transfer
latents from StyleGAN for intuitive editing of visual concepts for which
textual description is difficult. We also find that our method can help address
persistent quality issues in Stable Diffusion XL including repair of object
deformations and fixing distorted hands. Our code, data, and trained sliders
are available at https://sliders.baulab.info/
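At its core, a slider is a low-rank weight offset whose strength can be modulated continuously at inference time; a minimal linear-layer version (names and initialisation are mine, not the authors') looks like the following.

    import torch
    import torch.nn as nn

    class SliderLinear(nn.Module):
        """Linear layer with a low-rank 'slider' direction (sketch): the effective
        weight is W + scale * B @ A, where `scale` is set continuously at inference."""
        def __init__(self, base: nn.Linear, rank: int = 4):
            super().__init__()
            self.base = base
            out_f, in_f = base.weight.shape
            self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
            self.B = nn.Parameter(torch.zeros(out_f, rank))
            self.scale = 0.0                    # the slider value, e.g. in [-4, 4]

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
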
FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation
November 20, 2023
Chenliang Zhou, Fangcheng Zhong, Param Hanji, Zhilin Guo, Kyle Fogarty, Alejandro Sztrajman, Hongyun Gao, Cengiz Oztireli
We propose FrePolad: frequency-rectified point latent diffusion, a point
cloud generation pipeline integrating a variational autoencoder (VAE) with a
denoising diffusion probabilistic model (DDPM) for the latent distribution.
FrePolad simultaneously achieves high quality, diversity, and flexibility in
point cloud cardinality for generation tasks while maintaining high
computational efficiency. The improvement in generation quality and diversity
is achieved through (1) a novel frequency rectification module via spherical
harmonics designed to retain high-frequency content while learning the point
cloud distribution; and (2) a latent DDPM to learn the regularized yet complex
latent distribution. In addition, FrePolad supports variable point cloud
cardinality by formulating the sampling of points as conditional distributions
over a latent shape distribution. Finally, the low-dimensional latent space
encoded by the VAE contributes to FrePolad’s fast and scalable sampling. Our
quantitative and qualitative results demonstrate the state-of-the-art
performance of FrePolad in terms of quality, diversity, and computational
efficiency.
Pyramid Diffusion for Fine 3D Large Scene Generation
November 20, 2023
Yuheng Liu, Xinke Li, Xueting Li, Lu Qi, Chongshou Li, Ming-Hsuan Yang
Directly transferring the 2D techniques to 3D scene generation is challenging
due to significant resolution reduction and the scarcity of comprehensive
real-world 3D scene datasets. To address these issues, our work introduces the
Pyramid Discrete Diffusion model (PDD) for 3D scene generation. This novel
approach employs a multi-scale model capable of progressively generating
high-quality 3D scenes from coarse to fine. In this way, the PDD can generate
high-quality scenes within limited resource constraints and does not require
additional data sources. To the best of our knowledge, we are the first to
adopt the simple but effective coarse-to-fine strategy for 3D large scene
generation. Our experiments, covering both unconditional and conditional
generation, have yielded impressive results, showcasing the model’s
effectiveness and robustness in generating realistic and detailed 3D scenes.
Our code will be available to the public.
Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model
November 20, 2023
Chunming He, Chengyu Fang, Yulun Zhang, Kai Li, Longxiang Tang, Chenyu You, Fengyang Xiao, Zhenhua Guo, Xiu Li
Illumination degradation image restoration (IDIR) techniques aim to improve
the visibility of degraded images and mitigate the adverse effects of
deteriorated illumination. Among these algorithms, diffusion model (DM)-based
methods have shown promising performance but are often burdened by heavy
computational demands and pixel misalignment issues when predicting the
image-level distribution. To tackle these problems, we propose to leverage DM
within a compact latent space to generate concise guidance priors and introduce
a novel solution called Reti-Diff for the IDIR task. Reti-Diff comprises two
key components: the Retinex-based latent DM (RLDM) and the Retinex-guided
transformer (RGformer). To ensure detailed reconstruction and illumination
correction, RLDM is empowered to acquire Retinex knowledge and extract
reflectance and illumination priors. These priors are subsequently utilized by
RGformer to guide the decomposition of image features into their respective
reflectance and illumination components. Following this, RGformer further
enhances and consolidates the decomposed features, resulting in the production
of refined images with consistent content and robustness to handle complex
degradation scenarios. Extensive experiments show that Reti-Diff outperforms
existing methods on three IDIR tasks, as well as downstream applications. Code
will be available at \url{https://github.com/ChunmingHe/Reti-Diff}.
Deep Equilibrium Diffusion Restoration with Parallel Sampling
November 20, 2023
Jiezhang Cao, Yue Shi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool
Diffusion-based image restoration (IR) methods aim to use diffusion models to
recover high-quality (HQ) images from degraded images and achieve promising
performance. Due to the inherent property of diffusion models, most of these
methods need long serial sampling chains to restore HQ images step-by-step. As
a result, they incur long sampling times and high computation costs.
Moreover, such long sampling chains hinder understanding the relationship
between the restoration results and the inputs, since it is hard to compute
gradients through the whole chain. In this work, we aim to rethink the
diffusion-based IR models through a different perspective, i.e., a deep
equilibrium (DEQ) fixed point system. Specifically, we derive an analytical
solution by modeling the entire sampling chain in diffusion-based IR models as
a joint multivariate fixed point system. With the help of the analytical
solution, we are able to conduct single-image sampling in a parallel way and
restore HQ images without training. Furthermore, we compute fast gradients in
DEQ and find that initialization optimization can boost performance and
control the generation direction. Extensive experiments on benchmarks
demonstrate the effectiveness of our proposed method on typical IR tasks and
real-world settings. The code and models will be made publicly available.
Advancing Urban Renewal: An Automated Approach to Generating Historical Arcade Facades with Stable Diffusion Models
November 20, 2023
Zheyuan Kuang, Jiaxin Zhang, Yiying Huang, Yunqin Li
Urban renewal and transformation processes necessitate the preservation of
the historical urban fabric, particularly in districts known for their
architectural and historical significance. These regions, with their diverse
architectural styles, have traditionally required extensive preliminary
research, often leading to subjective results. However, the advent of machine
learning models has opened up new avenues for generating building facade
images. Despite this, creating high-quality images for historical district
renovations remains challenging, due to the complexity and diversity inherent
in such districts. In response to these challenges, our study introduces a new
methodology for automatically generating images of historical arcade facades,
utilizing Stable Diffusion models conditioned on textual descriptions. By
classifying and tagging a variety of arcade styles, we have constructed several
realistic arcade facade image datasets. We trained multiple low-rank adaptation
(LoRA) models to control the stylistic aspects of the generated images,
supplemented by ControlNet models for improved precision and authenticity. Our
approach has demonstrated high levels of precision, authenticity, and diversity
in the generated images, showing promising potential for real-world urban
renewal projects. This new methodology offers a more efficient and accurate
alternative to conventional design processes in urban renewal, bypassing issues
of unconvincing image details, lack of precision, and limited stylistic
variety. Future research could focus on integrating this two-dimensional image
generation with three-dimensional modeling techniques, providing a more
comprehensive solution for renovating architectural facades in historical
districts.
Fast Controllable Diffusion Models for Undersampled MRI Reconstruction
November 20, 2023
Wei Jiang, Zhuang Xiong, Feng Liu, Nan Ye, Hongfu Sun
Supervised deep learning methods have shown promise in undersampled Magnetic
Resonance Imaging (MRI) reconstruction, but their requirement for paired data
limits their generalizability to the diverse MRI acquisition parameters.
Recently, unsupervised controllable generative diffusion models have been
applied to undersampled MRI reconstruction, without paired data or model
retraining for different MRI acquisitions. However, diffusion models are
generally slow in sampling and state-of-the-art acceleration techniques can
lead to sub-optimal results when directly applied to the controllable
generation process. This study introduces a new algorithm called
Predictor-Projector-Noisor (PPN), which enhances and accelerates controllable
generation of diffusion models for undersampled MRI reconstruction. Our results
demonstrate that PPN produces high-fidelity MR images that conform to
undersampled k-space measurements with significantly shorter reconstruction
time than other controllable sampling methods. In addition, the unsupervised
PPN accelerated diffusion models are adaptable to different MRI acquisition
parameters, making them more practical for clinical use than supervised
learning techniques.
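Most controllable diffusion samplers for undersampled MRI interleave denoising with a k-space data-consistency projection; the sketch below shows that projection for a Cartesian sampling mask. It is background illustration only and does not reproduce PPN's predictor/projector/noisor decomposition.

    import torch

    def kspace_data_consistency(x, y, mask):
        """Project the current image estimate x onto the set of images consistent
        with the acquired k-space samples y (mask == 1 where k-space was measured)."""
        k = torch.fft.fft2(x)              # predicted k-space
        k = mask * y + (1 - mask) * k      # keep measured samples, fill the rest
        return torch.fft.ifft2(k).real     # back to image space
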
Gaussian Interpolation Flows
November 20, 2023
Yuan Gao, Jian Huang, Yuling Jiao
Gaussian denoising has emerged as a powerful principle for constructing
simulation-free continuous normalizing flows for generative modeling. Despite
their empirical successes, theoretical properties of these flows and the
regularizing effect of Gaussian denoising have remained largely unexplored. In
this work, we aim to address this gap by investigating the well-posedness of
simulation-free continuous normalizing flows built on Gaussian denoising.
Through a unified framework termed Gaussian interpolation flow (GIF), we establish
the Lipschitz regularity of the flow velocity field, the existence and
uniqueness of the flow, and the Lipschitz continuity of the flow map and the
time-reversed flow map for several rich classes of target distributions. This
analysis also sheds light on the auto-encoding and cycle-consistency properties
of Gaussian interpolation flows. Additionally, we delve into the stability of
these flows in source distributions and perturbations of the velocity field,
using the quadratic Wasserstein distance as a metric. Our findings offer
valuable insights into the learning techniques employed in Gaussian
interpolation flows for generative modeling, providing a solid theoretical
foundation for end-to-end error analyses of learning GIFs with empirical
observations.
FDDM: Unsupervised Medical Image Translation with a Frequency-Decoupled Diffusion Model
November 19, 2023
Yunxiang Li, Hua-Chieh Shao, Xiaoxue Qian, You Zhang
Diffusion models have demonstrated significant potential in producing
high-quality images for medical image translation to aid disease diagnosis,
localization, and treatment. Nevertheless, current diffusion models have
limited success in achieving faithful image translations that can accurately
preserve the anatomical structures of medical images, especially for unpaired
datasets. The preservation of structural and anatomical details is essential to
reliable medical diagnosis and treatment planning, as structural mismatches can
lead to disease misidentification and treatment errors. In this study, we
introduced a frequency-decoupled diffusion model (FDDM), a novel framework that
decouples the frequency components of medical images in the Fourier domain
during the translation process, to allow structure-preserved high-quality image
conversion. FDDM applies an unsupervised frequency conversion module to
translate the source medical images into frequency-specific outputs and then
uses the frequency-specific information to guide a following diffusion model
for final source-to-target image translation. We conducted extensive
evaluations of FDDM using a public brain MR-to-CT translation dataset, showing
its superior performance against other GAN-, VAE-, and diffusion-based models.
Metrics including the Frechet inception distance (FID), the peak
signal-to-noise ratio (PSNR), and the structural similarity index measure
(SSIM) were assessed. FDDM achieves an FID of 29.88, less than half of the
second best. These results demonstrated FDDM’s prowess in generating
highly-realistic target-domain images while maintaining the faithfulness of
translated anatomical structures.
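A hedged sketch of the frequency decoupling the abstract refers to: split an image into low- and high-frequency components with a radial mask in the Fourier domain, so that translated low-frequency appearance can guide the diffusion model while high-frequency anatomy is handled separately. The mask shape and cutoff are assumptions, not the paper's exact module.

    import torch

    def frequency_decouple(img, cutoff=0.1):
        """Split img (B, C, H, W) into low- and high-frequency parts using a
        radial mask in the Fourier domain (sketch)."""
        H, W = img.shape[-2:]
        fy = torch.fft.fftfreq(H, device=img.device).view(-1, 1)
        fx = torch.fft.fftfreq(W, device=img.device).view(1, -1)
        low_mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff).to(img.dtype)
        spec = torch.fft.fft2(img)
        low = torch.fft.ifft2(spec * low_mask).real
        return low, img - low              # low- and high-frequency components
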
A Survey of Emerging Applications of Diffusion Probabilistic Models in MRI
November 19, 2023
Yuheng Fan, Hanxi Liao, Shiqi Huang, Yimin Luo, Huazhu Fu, Haikun Qi
Diffusion probabilistic models (DPMs) which employ explicit likelihood
characterization and a gradual sampling process to synthesize data, have gained
increasing research interest. Despite their huge computational burdens due to
the large number of steps involved during sampling, DPMs are widely appreciated
in various medical imaging tasks for the high quality and diversity of their
generation. Magnetic resonance imaging (MRI) is an important medical imaging
modality with excellent soft tissue contrast and superb spatial resolution,
which possesses unique opportunities for DPMs. Although there is a recent surge
of studies exploring DPMs in MRI, a survey paper of DPMs specifically designed
for MRI applications is still lacking. This review article aims to help
researchers in the MRI community to grasp the advances of DPMs in different
applications. We first introduce the theory of two dominant kinds of DPMs,
categorized according to whether the diffusion time step is discrete or
continuous, and then provide a comprehensive review of emerging DPMs in MRI,
including reconstruction, image generation, image translation, segmentation,
anomaly detection, and further research topics. Finally, we discuss the general
limitations as well as limitations specific to the MRI tasks of DPMs and point
out potential areas that are worth further exploration.
GaussianDiffusion: 3D Gaussian Splatting for Denoising Diffusion Probabilistic Models with Structured Noise
November 19, 2023
Xinhai Li, Huaibin Wang, Kuo-Kun Tseng
Text-to-3D, known for its efficient generation methods and expansive creative
potential, has garnered significant attention in the AIGC domain. However, the
amalgamation of NeRF and 2D diffusion models frequently yields oversaturated
images, posing severe limitations on downstream industrial applications due to
the constraints of pixelwise rendering method. Gaussian splatting has recently
superseded the traditional pointwise sampling technique prevalent in NeRF-based
methodologies, revolutionizing various aspects of 3D reconstruction. This paper
introduces a novel text to 3D content generation framework based on Gaussian
splatting, enabling fine control over image saturation through individual
Gaussian sphere transparencies, thereby producing more realistic images. The
challenge of achieving multi-view consistency in 3D generation significantly
impedes modeling complexity and accuracy. Taking inspiration from SJC, we
explore employing multi-view noise distributions to perturb images generated by
3D Gaussian splatting, aiming to rectify inconsistencies in multi-view
geometry. We ingeniously devise an efficient method that produces Gaussian
noise for diverse viewpoints, all originating from a shared
noise source. Furthermore, vanilla 3D Gaussian-based generation tends to trap
models in local minima, causing artifacts like floaters, burrs, or
proliferative elements. To mitigate these issues, we propose the variational
Gaussian splatting technique to enhance the quality and stability of 3D
appearance. To our knowledge, our approach represents the first comprehensive
utilization of Gaussian splatting across the entire spectrum of 3D content
generation processes.
Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models
November 18, 2023
Xuefeng Gao, Hoang M. Nguyen, Lingjiong Zhu
Score-based generative models (SGMs) are a recent class of deep generative
models with state-of-the-art performance in many applications. In this paper,
we establish convergence guarantees for a general class of SGMs in
2-Wasserstein distance, assuming accurate score estimates and smooth
log-concave data distribution. We specialize our result to several concrete
SGMs with specific choices of forward processes modelled by stochastic
differential equations, and obtain an upper bound on the iteration complexity
for each model, which demonstrates the impacts of different choices of the
forward processes. We also provide a lower bound when the data distribution is
Gaussian. Numerically, we experiment with SGMs using different forward processes,
some of which are newly proposed in this paper, for unconditional image
generation on CIFAR-10. We find that the experimental results are in good
agreement with our theoretical predictions on the iteration complexity, and the
models with our newly proposed forward processes can outperform existing
models.
SDDPM: Speckle Denoising Diffusion Probabilistic Models
November 17, 2023
Soumee Guha, Scott T. Acton
Coherent imaging systems, such as medical ultrasound and synthetic aperture
radar (SAR), are subject to corruption from speckle due to sub-resolution
scatterers. Since speckle is multiplicative in nature, the constituent image
regions become corrupted to different extents. The task of denoising such
images requires algorithms specifically designed for removing signal-dependent
noise. This paper proposes a novel image denoising algorithm for removing
signal-dependent multiplicative noise with diffusion models, called Speckle
Denoising Diffusion Probabilistic Models (SDDPM). We derive the mathematical
formulations for the forward process, the reverse process, and the training
objective. In the forward process, we apply multiplicative noise to a given
image and prove that the forward process is Gaussian. We show that the reverse
process is also Gaussian and the final training objective can be expressed as
the Kullback-Leibler (KL) divergence between the forward and reverse processes.
As derived in the paper, the final denoising task is a single-step process,
thereby reducing the denoising time significantly. We have trained our model
with natural land-use images and ultrasound images for different noise levels.
Extensive experiments centered around two different applications show that
SDDPM is robust and performs significantly better than the comparative models
even when the images are severely corrupted.
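As a minimal sketch of what signal-dependent multiplicative corruption means (an illustration under my own assumptions, not the paper's exact forward process), consider y = x * (1 + sigma * eps) with Gaussian eps: conditioned on x, the corrupted pixel is Gaussian with a standard deviation proportional to the signal, so brighter regions are corrupted more strongly.

```python
import numpy as np

def speckle_corrupt(x, sigma, rng=None):
    """Signal-dependent multiplicative noise: y = x * (1 + sigma * eps).

    Conditioned on x, y is Gaussian with mean x and std sigma * |x|,
    i.e. the corruption scales with the signal (speckle-like). Illustrative
    only; the SDDPM forward process is defined in the paper itself.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal(x.shape)
    return x * (1.0 + sigma * eps)

# Corrupt a toy "image" at increasing noise levels and inspect the damage.
x = np.linspace(0.1, 1.0, 8).reshape(2, 4)
for sigma in (0.05, 0.2, 0.5):
    y = speckle_corrupt(x, sigma)
    print(f"sigma={sigma}: mean abs error {np.abs(y - x).mean():.3f}")
```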
K-space Cold Diffusion: Learning to Reconstruct Accelerated MRI without Noise
November 16, 2023
Guoyao Shen, Mengyu Li, Chad W. Farris, Stephan Anderson, Xin Zhang
eess.IV, cs.CV, cs.LG, physics.med-ph
Deep learning-based MRI reconstruction models have achieved superior
performance in recent years. Most recently, diffusion models have shown remarkable
performance in image generation, in-painting, super-resolution, image editing
and more. As a generalized diffusion model, cold diffusion further broadens the
scope and considers models built around arbitrary image transformations such as
blurring, down-sampling, etc. In this paper, we propose a k-space cold
diffusion model that performs image degradation and restoration in k-space
without the need for Gaussian noise. We provide comparisons with multiple deep
learning-based MRI reconstruction models and perform tests on a well-known
large open-source MRI dataset. Our results show that this novel way of
performing degradation can generate high-quality reconstruction images for
accelerated MRI.
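A hedged sketch of the general idea of degrading an image in k-space without Gaussian noise (the mask shape and schedule below are my assumptions, not the paper's): transform to k-space, discard progressively more phase-encode lines, and transform back.

```python
import numpy as np

def kspace_degrade(img, keep_fraction):
    """Keep only the central `keep_fraction` of k-space rows, then invert.

    A cold-diffusion-style degradation operator: deterministic, noise-free,
    and increasingly severe as keep_fraction shrinks over the "time" axis.
    """
    k = np.fft.fftshift(np.fft.fft2(img))
    h = k.shape[0]
    keep = max(1, int(h * keep_fraction))
    lo, hi = (h - keep) // 2, (h + keep) // 2
    mask = np.zeros_like(k)
    mask[lo:hi, :] = 1.0                       # central (low-frequency) rows only
    return np.real(np.fft.ifft2(np.fft.ifftshift(k * mask)))

img = np.random.rand(64, 64)                   # stand-in for an MR image slice
for frac in (1.0, 0.5, 0.25, 0.125):           # increasing degradation level
    out = kspace_degrade(img, frac)
    print(f"keep={frac}: mean abs error {np.abs(out - img).mean():.3f}")
```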
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
November 16, 2023
Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski
Recent advances in text-to-image generation models have unlocked vast
potential for visual creativity. However, these models struggle with generation
of consistent characters, a crucial aspect for numerous real-world applications
such as story visualization, game development asset design, advertising, and
more. Current methods typically rely on multiple pre-existing images of the
target character or involve labor-intensive manual processes. In this work, we
propose a fully automated solution for consistent character generation, with
the sole input being a text prompt. We introduce an iterative procedure that,
at each stage, identifies a coherent set of images sharing a similar identity
and extracts a more consistent identity from this set. Our quantitative
analysis demonstrates that our method strikes a better balance between prompt
alignment and identity consistency compared to the baseline methods, and these
findings are reinforced by a user study. To conclude, we showcase several
practical applications of our approach. Project page is available at
https://omriavrahami.com/the-chosen-one
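One hedged reading of "identifies a coherent set of images sharing a similar identity" is to embed the generated images with any off-the-shelf feature extractor, cluster the embeddings, and keep the tightest cluster for the next iteration; the clustering algorithm and cohesion measure below are illustrative assumptions, not the paper's procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def most_cohesive_cluster(embeddings, n_clusters=3, seed=0):
    """Return indices of the cluster with the smallest mean distance to its centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    best_idx, best_spread = None, np.inf
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        if len(members) < 2:
            continue
        spread = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1).mean()
        if spread < best_spread:
            best_idx, best_spread = members, spread
    return best_idx

emb = np.random.rand(24, 128)                  # stand-in for image-identity embeddings
print(most_cohesive_cluster(emb))
```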
Score-based generative models learn manifold-like structures with constrained mixing
November 16, 2023
Li Kevin Wenliang, Ben Moran
How do score-based generative models (SBMs) learn the data distribution
supported on a low-dimensional manifold? We investigate the score model of a
trained SBM through its linear approximations and subspaces spanned by local
feature vectors. During diffusion as the noise decreases, the local
dimensionality increases and becomes more varied between different sample
sequences. Importantly, we find that the learned vector field mixes samples by
a non-conservative field within the manifold, although it denoises with normal
projections as if there is an energy function in off-manifold directions. At
each noise level, the subspace spanned by the local features overlap with an
effective density function. These observations suggest that SBMs can flexibly
mix samples with the learned score field while carefully maintaining a
manifold-like structure of the data distribution.
DSR-Diff: Depth Map Super-Resolution with Diffusion Model
November 16, 2023
Yuan Shi, Bin Xia, Rui Zhu, Qingmin Liao, Wenming Yang
Color-guided depth map super-resolution (CDSR) improves the spatial resolution
of a low-quality depth map with the corresponding high-quality color map,
benefiting various applications such as 3D reconstruction, virtual reality, and
augmented reality. While conventional CDSR methods typically rely on
convolutional neural networks or transformers, diffusion models (DMs) have
demonstrated notable effectiveness in high-level vision tasks. In this work, we
present a novel CDSR paradigm that utilizes a diffusion model within the latent
space to generate guidance for depth map super-resolution. The proposed method
comprises a guidance generation network (GGN), a depth map super-resolution
network (DSRN), and a guidance recovery network (GRN). The GGN is specifically
designed to generate the guidance while managing its compactness. Additionally,
we integrate a simple but effective feature fusion module and a
transformer-style feature extraction module into the DSRN, enabling it to
leverage guided priors in the extraction, fusion, and reconstruction of
multi-modal images. Taking into account both accuracy and efficiency, our
proposed method has shown superior performance in extensive experiments when
compared to state-of-the-art methods. Our codes will be made available at
https://github.com/shiyuan7/DSR-Diff.
Diffusion-Augmented Neural Processes
November 16, 2023
Lorenzo Bonito, James Requeima, Aliaksandra Shysheya, Richard E. Turner
Over the last few years, Neural Processes have become a useful modelling tool
in many application areas, such as healthcare and climate sciences, in which
data are scarce and prediction uncertainty estimates are indispensable.
However, the current state of the art in the field (AR CNPs; Bruinsma et al.,
2023) presents a few issues that prevent its widespread deployment. This work
proposes an alternative, diffusion-based approach to NPs which, through
conditioning on noised datasets, addresses many of these limitations, whilst
also exceeding SOTA performance.
DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics
November 16, 2023
Aniket Roy, Maiterya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa
Diffusion models have advanced generative AI significantly in terms of
editing and creating naturalistic images. However, efficiently improving
generated image quality is still of paramount interest. In this context, we
propose a generic “naturalness” preserving loss function, viz., kurtosis
concentration (KC) loss, which can be readily applied to any standard diffusion
model pipeline to elevate the image quality. Our motivation stems from the
projected kurtosis concentration property of natural images, which states that
natural images have nearly constant kurtosis values across different band-pass
versions of the image. To retain the “naturalness” of the generated images, we
enforce reducing the gap between the highest and lowest kurtosis values across
the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note
that our approach does not require any additional guidance like classifier or
classifier-free guidance to improve the image quality. We validate the proposed
approach for three diverse tasks, viz., (1) personalized few-shot finetuning
using text guidance, (2) unconditional image generation, and (3) image
super-resolution. Integrating the proposed KC loss has improved the perceptual
quality across all these tasks in terms of FID, MUSIQ score, and user
evaluation.
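A minimal sketch of the quantity the KC loss penalizes, the gap between the largest and smallest kurtosis across band-pass versions of an image; the single-level Haar DWT and excess kurtosis below are my assumptions, and a training implementation would need a differentiable DWT and kurtosis inside the autodiff framework.

```python
import numpy as np
import pywt
from scipy.stats import kurtosis

def kc_gap(img, wavelet="haar"):
    """Kurtosis-concentration gap across DWT band-pass subbands.

    Natural images tend to have nearly constant kurtosis across band-pass
    versions, so a small max-min gap serves as a "naturalness" proxy.
    """
    _, (cH, cV, cD) = pywt.dwt2(img, wavelet)     # horizontal/vertical/diagonal subbands
    ks = [kurtosis(band.ravel(), fisher=True) for band in (cH, cV, cD)]
    return max(ks) - min(ks)

img = np.random.rand(64, 64)                      # stand-in for a generated image
print(f"KC gap: {kc_gap(img):.3f}")
```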
Privacy Threats in Stable Diffusion Models
November 15, 2023
Thomas Cilloni, Charles Fleming, Charles Walter
This paper introduces a novel approach to membership inference attacks (MIA)
targeting stable diffusion computer vision models, specifically focusing on the
highly sophisticated Stable Diffusion V2 by StabilityAI. MIAs aim to extract
sensitive information about a model’s training data, posing significant privacy
concerns. Despite its advancements in image synthesis, our research reveals
privacy vulnerabilities in the stable diffusion models’ outputs. Exploiting
this information, we devise a black-box MIA that only needs to query the victim
model repeatedly. Our methodology involves observing the output of a stable
diffusion model at different generative epochs and training a classification
model to distinguish whether a series of intermediates originated from a training
sample or not. We propose numerous ways to measure the membership features and
discuss what works best. The attack’s efficacy is assessed using the ROC AUC
method, demonstrating a 60\% success rate in inferring membership information.
This paper contributes to the growing body of research on privacy and security
in machine learning, highlighting the need for robust defenses against MIAs.
Our findings prompt a reevaluation of the privacy implications of stable
diffusion models, urging practitioners and developers to implement enhanced
security measures to safeguard against such attacks.
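A hedged sketch of the attack recipe as described: query the victim model repeatedly, record one feature per generative step for each candidate image, and train a binary classifier to separate members from non-members, scoring it with ROC AUC. The per-step feature, its distribution, and the classifier below are hypothetical stand-ins.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical features: one scalar per generative step for each candidate
# image (e.g. a per-step reconstruction error). Labels: 1 = member, 0 = not.
rng = np.random.default_rng(0)
n, steps = 200, 20
members = rng.normal(0.40, 0.05, size=(n, steps))       # stand-in member features
non_members = rng.normal(0.45, 0.05, size=(n, steps))   # stand-in non-member features
X = np.vstack([members, non_members])
y = np.concatenate([np.ones(n), np.zeros(n)])

clf = LogisticRegression(max_iter=1000).fit(X, y)        # membership classifier
print("ROC AUC:", roc_auc_score(y, clf.predict_proba(X)[:, 1]))
```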
DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model
November 15, 2023
Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, Kai Zhang
We propose \textbf{DMV3D}, a novel 3D generation approach that uses a
transformer-based 3D large reconstruction model to denoise multi-view
diffusion. Our reconstruction model incorporates a triplane NeRF representation
and can denoise noisy multi-view images via NeRF reconstruction and rendering,
achieving single-stage 3D generation in $\sim$30s on a single A100 GPU. We train
\textbf{DMV3D} on large-scale multi-view image datasets of highly diverse
objects using only image reconstruction losses, without accessing 3D assets. We
demonstrate state-of-the-art results for the single-image reconstruction
problem where probabilistic modeling of unseen object parts is required for
generating diverse reconstructions with sharp textures. We also show
high-quality text-to-3D generation results outperforming previous 3D diffusion
models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ .
A Spectral Diffusion Prior for Hyperspectral Image Super-Resolution
November 15, 2023
Jianjun Liu, Zebin Wu, Liang Xiao
Fusion-based hyperspectral image (HSI) super-resolution aims to produce a
high-spatial-resolution HSI by fusing a low-spatial-resolution HSI and a
high-spatial-resolution multispectral image. Such an HSI super-resolution
process can be modeled as an inverse problem, where the prior knowledge is
essential for obtaining the desired solution. Motivated by the success of
diffusion models, we propose a novel spectral diffusion prior for fusion-based
HSI super-resolution. Specifically, we first investigate the spectrum
generation problem and design a spectral diffusion model to model the spectral
data distribution. Then, in the framework of maximum a posteriori, we keep the
transition information between every two neighboring states during the reverse
generative process, and thereby embed the knowledge of the trained spectral
diffusion model into the fusion problem in the form of a regularization term.
Finally, we treat each generation step of the final optimization problem as a
subproblem and employ Adam to solve these subproblems in reverse
sequence. Experimental results conducted on both synthetic and real datasets
demonstrate the effectiveness of the proposed approach. The code of the
proposed approach will be available on https://github.com/liuofficial/SDP.
A Diffusion Model Based Quality Enhancement Method for HEVC Compressed Video
November 15, 2023
Zheng Liu, Honggang Qi
Video post-processing methods can improve the quality of compressed videos at
the decoder side. Most of the existing methods need to train corresponding
models for compressed videos with different quantization parameters to improve
the quality of compressed videos. However, in most cases, the quantization
parameters of the decoded video are unknown. This limits the ability of existing
methods to improve video quality. To tackle this problem, this work
proposes a diffusion model based post-processing method for compressed videos.
The proposed method first estimates the feature vectors of the compressed video
and then uses the estimated feature vectors as the prior information for the
quality enhancement model to adaptively enhance the quality of compressed video
with different quantization parameters. Experimental results show that the
quality enhancement results of our proposed method on mixed datasets are
superior to existing methods.
Towards Graph-Aware Diffusion Modeling for Collaborative Filtering
November 15, 2023
Yunqin Zhu, Chao Wang, Hui Xiong
Recovering masked feedback with neural models is a popular paradigm in
recommender systems. Seeing the success of diffusion models in solving
ill-posed inverse problems, we introduce a conditional diffusion framework for
collaborative filtering that iteratively reconstructs a user’s hidden
preferences guided by its historical interactions. To better align with the
intrinsic characteristics of implicit feedback data, we implement forward
diffusion by applying synthetic smoothing filters to interaction signals on an
item-item graph. The resulting reverse diffusion can be interpreted as a
personalized process that gradually refines preference scores. Through graph
Fourier transform, we equivalently characterize this model as an anisotropic
Gaussian diffusion in the graph spectral domain, establishing both forward and
reverse formulations. Our model outperforms state-of-the-art methods by a large
margin on one dataset and yields competitive results on the others.
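For intuition, a hedged sketch of what smoothing an interaction signal on an item-item graph in the spectral domain can look like; the heat-kernel filter exp(-t * lambda) below is an illustrative choice, not the paper's forward corruption.

```python
import numpy as np

def graph_smooth(x, adj, t):
    """Smooth a preference/interaction vector x on an item-item graph at "time" t.

    Uses the symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2} and
    attenuates each graph frequency lambda by exp(-t * lambda), i.e. an
    anisotropic Gaussian-like filter in the graph spectral domain.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    lam, U = np.linalg.eigh(lap)                  # graph Fourier basis
    return U @ (np.exp(-t * lam) * (U.T @ x))     # filter in the spectral domain

A = np.array([[0, 1, 1, 0],                       # toy symmetric item-item graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 1.0])                # a user's implicit feedback
for t in (0.0, 0.5, 2.0):
    print(t, np.round(graph_smooth(x, A, t), 3))
```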
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
November 15, 2023
Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan
Audio diffusion models can synthesize a wide variety of sounds. Existing
models often operate on the latent domain with cascaded phase recovery modules
to reconstruct waveform. This poses challenges when generating high-fidelity
audio. In this paper, we propose EDMSound, a diffusion-based generative model
in spectrogram domain under the framework of elucidated diffusion models (EDM).
Combined with an efficient deterministic sampler, we achieve a Fréchet audio
distance (FAD) score similar to that of the top-ranked baseline with only 10 steps and
reach state-of-the-art performance with 50 steps on the DCASE2023 foley sound
generation benchmark. We also reveal a potential concern regarding
diffusion-based audio generation models: they tend to generate samples with high
perceptual similarity to their training data. Project page:
https://agentcooper2002.github.io/EDMSound/
Diff-GO: Diffusion Goal-Oriented Communications to Achieve Ultra-High Spectrum Efficiency
November 13, 2023
Achintha Wijesinghe, Songyang Zhang, Suchinthaka Wanninayaka, Weiwei Wang, Zhi Ding
cs.LG, cs.AI, cs.CV, cs.MM, eess.SP
The latest advances in artificial intelligence (AI) present many
unprecedented opportunities to achieve much improved bandwidth saving in
communications. Unlike conventional communication systems focusing on packet
transport, rich datasets and AI make it possible to efficiently transfer only
the information most critical to the goals of message recipients. One of the
most exciting advances in generative AI, the diffusion model, presents a
unique opportunity for designing ultra-fast communication systems well beyond
language-based messages. This work presents an ultra-efficient communication
design by utilizing generative AI based on diffusion models as a specific
example of the general goal-oriented communication framework. To better control
the regenerated message at the receiver output, our diffusion system design
includes a local regeneration module with finite dimensional noise latent. The
critical significance of noise latent control and sharing in our Diff-GO is
the introduction of “local generative feedback” (Local-GF), which enables the
transmitter to monitor and gauge the quality or accuracy of the message
recovery at the semantic system receiver. To
this end, we propose a new low-dimensional noise space for the training of
diffusion models, which significantly reduces the communication overhead and
achieves satisfactory message recovery performance. Our experimental results
demonstrate that the proposed noise space and the diffusion-based generative
model achieve ultra-high spectrum efficiency and accurate recovery of
transmitted image signals. By trading off computation for bandwidth efficiency
(C4BE), this new framework provides an important avenue to achieve exceptional
computation-bandwidth tradeoff.
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
November 07, 2023
Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou
Video synthesis has recently made remarkable strides benefiting from the
rapid development of diffusion models. However, it still encounters challenges
in terms of semantic accuracy, clarity and spatio-temporal continuity. They
primarily arise from the scarcity of well-aligned text-video data and the
complex inherent structure of videos, making it difficult for the model to
simultaneously ensure semantic and qualitative excellence. In this report, we
propose a cascaded I2VGen-XL approach that enhances model performance by
decoupling these two factors and ensures the alignment of the input data by
utilizing static images as a form of crucial guidance. I2VGen-XL consists of
two stages: i) the base stage guarantees coherent semantics and preserves
content from input images by using two hierarchical encoders, and ii) the
refinement stage enhances the video’s details by incorporating an additional
brief text and improves the resolution to 1280$\times$720. To improve the
diversity, we collect around 35 million single-shot text-video pairs and 6
billion text-image pairs to optimize the model. By this means, I2VGen-XL can
simultaneously enhance the semantic accuracy, continuity of details and clarity
of generated videos. Through extensive experiments, we have investigated the
underlying principles of I2VGen-XL and compared it with current top methods,
which can demonstrate its effectiveness on diverse data. The source code and
models will be publicly available at \url{https://i2vgen-xl.github.io}.
Generative learning for nonlinear dynamics
November 07, 2023
William Gilpin
cs.LG, nlin.CD, physics.comp-ph
Modern generative machine learning models demonstrate surprising ability to
create realistic outputs far beyond their training data, such as photorealistic
artwork, accurate protein structures, or conversational text. These successes
suggest that generative models learn to effectively parametrize and sample
arbitrarily complex distributions. Beginning half a century ago, foundational
works in nonlinear dynamics used tools from information theory to infer
properties of chaotic attractors from time series, motivating the development
of algorithms for parametrizing chaos in real datasets. In this perspective, we
aim to connect these classical works to emerging themes in large-scale
generative statistical learning. We first consider classical attractor
reconstruction, which mirrors constraints on latent representations learned by
state space models of time series. We next revisit early efforts to use
symbolic approximations to compare minimal discrete generators underlying
complex processes, a problem relevant to modern efforts to distill and
interpret black-box statistical models. Emerging interdisciplinary works bridge
nonlinear dynamics and learning theory, such as operator-theoretic methods for
complex fluid flows, or detection of broken detailed balance in biological
datasets. We anticipate that future machine learning techniques may revisit
other classical concepts from nonlinear dynamics, such as transinformation
decay and complexity-entropy tradeoffs.
Generative Structural Design Integrating BIM and Diffusion Model
November 07, 2023
Zhili He, Yu-Hsing Wang, Jian Zhang
Intelligent structural design using AI can effectively reduce time overhead
and increase efficiency. It has the potential to become the new design paradigm in
the future to assist and even replace engineers, and so it has become a
research hotspot in the academic community. However, current methods have some
limitations to be addressed, whether in terms of application scope, visual
quality of generated results, or evaluation metrics of results. This study
proposes a comprehensive solution. First, we introduce building information
modeling (BIM) into intelligent structural design and establish a structural
design pipeline integrating BIM and generative AI, which is a powerful
supplement to the previous frameworks that only considered CAD drawings. In
order to improve the perceptual quality and details of generations, this study
makes 3 contributions. Firstly, in terms of generation framework, inspired by
the process of human drawing, a novel 2-stage generation framework is proposed
to replace the traditional end-to-end framework to reduce the generation
difficulty for AI models. Secondly, in terms of generative AI tools adopted,
diffusion models (DMs) are introduced to replace widely used generative
adversarial network (GAN)-based models, and a novel physics-based conditional
diffusion model (PCDM) is proposed to consider different design prerequisites.
Thirdly, in terms of neural networks, an attention block (AB) consisting of a
self-attention block (SAB) and a parallel cross-attention block (PCAB) is
designed to facilitate cross-domain data fusion. The quantitative and
qualitative results demonstrate the powerful generation and representation
capabilities of PCDM. Necessary ablation studies are conducted to examine the
validity of the methods. This study also shows that DMs have the potential to
replace GANs and become the new benchmark for generative problems in civil
engineering.
November 07, 2023
Pengze Zhang, Hubery Yin, Chen Li, Xiaohua Xie
Continuous diffusion models are commonly acknowledged to display a
deterministic probability flow, whereas discrete diffusion models do not. In
this paper, we aim to establish the fundamental theory for the probability flow
of discrete diffusion models. Specifically, we first prove that the continuous
probability flow is the Monge optimal transport map under certain conditions,
and also present an equivalent evidence for discrete cases. In view of these
findings, we are then able to define the discrete probability flow in line with
the principles of optimal transport. Finally, drawing upon our newly
established definitions, we propose a novel sampling method that surpasses
previous discrete diffusion models in its ability to generate more certain
outcomes. Extensive experiments on the synthetic toy dataset and the CIFAR-10
dataset have validated the effectiveness of our proposed discrete probability
flow. Code is released at:
https://github.com/PangzeCheung/Discrete-Probability-Flow.
Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models
November 07, 2023
Shengzhe Zhou, Zejian Lee, Shengyuan Zhang, Lefan Hou, Changyuan Yang, Guang Yang, Zhiyuan Yang, Lingyun Sun
Denoising Diffusion models have exhibited remarkable capabilities in image
generation. However, generating high-quality samples requires a large number of
iterations. Knowledge distillation for diffusion models is an effective method
to address this limitation with a shortened sampling process but causes
degraded generative quality. Based on our analysis with bias-variance
decomposition and experimental observations, we attribute the degradation to
the spatial fitting error occurring in the training of both the teacher and
student model. Accordingly, we propose $\textbf{S}$patial
$\textbf{F}$itting-$\textbf{E}$rror $\textbf{R}$eduction
$\textbf{D}$istillation model ($\textbf{SFERD}$). SFERD utilizes attention
guidance from the teacher model and a designed semantic gradient predictor to
reduce the student’s fitting error. Empirically, our proposed model facilitates
high-quality sample generation in a few function evaluations. We achieve an FID
of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64 with only one step,
outperforming existing diffusion methods. Our study provides a new perspective
on diffusion distillation by highlighting the intrinsic denoising ability of
models. Project link: \url{https://github.com/Sainzerjj/SFERD}.
Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems
November 06, 2023
Derek Lilienthal, Paul Mello, Magdalini Eirinaki, Stas Tiomkin
cs.IR, cs.AI, cs.CR, cs.LG
While recommender systems have become an integral component of the Web
experience, their heavy reliance on user data raises privacy and security
concerns. Substituting user data with synthetic data can address these
concerns, but accurately replicating these real-world datasets has been a
notoriously challenging problem. Recent advancements in generative AI have
demonstrated the impressive capabilities of diffusion models in generating
realistic data across various domains. In this work we introduce a Score-based
Diffusion Recommendation Module (SDRM), which captures the intricate patterns
of real-world datasets required for training highly accurate recommender
systems. SDRM allows for the generation of synthetic data that can replace
existing datasets to preserve user privacy, or augment existing datasets to
address excessive data sparsity. Our method outperforms competing baselines
such as generative adversarial networks, variational autoencoders, and recently
proposed diffusion models in synthesizing various datasets to replace or
augment the original data by an average improvement of 4.30% in Recall@$k$ and
4.65% in NDCG@$k$.
TS-Diffusion: Generating Highly Complex Time Series with Diffusion Models
November 06, 2023
Yangming Li
While current generative models have achieved promising performances in
time-series synthesis, they either make strong assumptions on the data format
(e.g., regularities) or rely on pre-processing approaches (e.g.,
interpolations) to simplify the raw data. In this work, we consider a class of
time series with three common bad properties, including sampling
irregularities, missingness, and large feature-temporal dimensions, and
introduce a general model, TS-Diffusion, to process such complex time series.
Our model consists of three parts under the framework of point processes. The
first part is an encoder of the neural ordinary differential equation (ODE)
that converts time series into dense representations, with the jump technique
to capture sampling irregularities and self-attention mechanism to handle
missing values. The second component of TS-Diffusion is a diffusion model that
learns from the representation of time series. These time-series
representations can have a complex distribution because of their high
dimensions. The third part is a decoder of another ODE that generates time
series with irregularities and missing values given their representations. We
have conducted extensive experiments on multiple time-series datasets,
demonstrating that TS-Diffusion achieves excellent results on both conventional
and complex time series and significantly outperforms previous baselines.
LDM3D-VR: Latent Diffusion Model for 3D VR
November 06, 2023
Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal
Latent diffusion models have proven to be state-of-the-art in the creation
and manipulation of visual outputs. However, as far as we know, the generation
of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite
of diffusion models targeting virtual reality development that includes
LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD
based on textual prompts and the upscaling of low-resolution inputs to
high-resolution RGBD, respectively. Our models are fine-tuned from existing
pretrained models on datasets containing panoramic/high-resolution RGB images,
depth maps and captions. Both models are evaluated in comparison to existing
related methods.
A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection
November 06, 2023
Wenxin Wang, Zhuo-Xu Cui, Guanxun Cheng, Chentao Cao, Xi Xu, Ziwei Liu, Haifeng Wang, Yulong Qi, Dong Liang, Yanjie Zhu
Accurate detection and segmentation of brain tumors is critical for medical
diagnosis. However, current supervised learning methods require extensively
annotated images and the state-of-the-art generative models used in
unsupervised methods often have limitations in covering the whole data
distribution. In this paper, we propose a novel framework Two-Stage Generative
Model (TSGM) that combines Cycle Generative Adversarial Network (CycleGAN) and
Variance Exploding stochastic differential equation using joint probability
(VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is
trained on unpaired data to generate abnormal images from healthy images as
data prior. Then VE-JP is implemented to reconstruct healthy images using
synthetic paired abnormal images as a guide, which alters only the pathological
regions while leaving healthy regions unchanged. Notably, our method directly learns the
joint probability distribution for conditional generation. The residual between
input and reconstructed images suggests the abnormalities and a thresholding
method is subsequently applied to obtain segmentation results. Furthermore, the
multimodal results are weighted with different weights to improve the
segmentation accuracy further. We validated our method on three datasets, and
compared with other unsupervised methods for anomaly detection and
segmentation. The DSC scores of 0.8590 on the BraTS2020 dataset, 0.6226 on the ITCS
dataset, and 0.7403 on the in-house dataset show that our method achieves better
segmentation performance and generalization.
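A hedged sketch of the final detection step described above: the residual between the input and its "healthy" reconstruction highlights abnormalities, and thresholding it gives a segmentation mask. The mean-plus-k-sigma threshold is an illustrative choice, and per-modality residuals could be weighted before thresholding as in the multimodal weighting mentioned above.

```python
import numpy as np

def residual_segmentation(image, reconstruction, k=3.0):
    """Threshold |input - healthy reconstruction| to obtain an anomaly mask."""
    residual = np.abs(image - reconstruction)
    thresh = residual.mean() + k * residual.std()
    return residual > thresh                      # boolean tumor mask

img = np.random.rand(128, 128)                    # stand-in input slice
recon = img.copy()
recon[40:60, 40:60] -= 0.8                        # simulate a "repaired" lesion region
mask = residual_segmentation(img, recon)
print("segmented pixels:", int(mask.sum()))
```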
Diffusion-based Radiotherapy Dose Prediction Guided by Inter-slice Aware Structure Encoding
November 06, 2023
Zhenghao Feng, Lu Wen, Jianghong Xiao, Yuanyuan Xu, Xi Wu, Jiliu Zhou, Xingchen Peng, Yan Wang
Deep learning (DL) has successfully automated dose distribution prediction in
radiotherapy planning, enhancing both efficiency and quality. However, existing
methods suffer from the over-smoothing problem due to their commonly used L1 or L2
loss with posterior average calculations. To alleviate this limitation, we
propose a diffusion model-based method (DiffDose) for predicting the
radiotherapy dose distribution of cancer patients. Specifically, the DiffDose
model contains a forward process and a reverse process. In the forward process,
DiffDose transforms dose distribution maps into pure Gaussian noise by
gradually adding small noise and a noise predictor is simultaneously trained to
estimate the noise added at each timestep. In the reverse process, it removes
the noise from the pure Gaussian noise in multiple steps with the well-trained
noise predictor and finally outputs the predicted dose distribution maps…
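A minimal PyTorch sketch of the DDPM-style training step the abstract describes: gradually noise a dose map according to a schedule and train a noise predictor with a mean-squared objective. The tiny network, the linear beta schedule, and the lack of timestep conditioning are placeholders, not the DiffDose architecture.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # placeholder noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

noise_predictor = nn.Sequential(                       # stand-in network; a real model
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),         # would also condition on t
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(noise_predictor.parameters(), lr=1e-4)

def training_step(dose_maps):
    """One step: noise the dose maps, predict the added noise, regress on it."""
    b = dose_maps.shape[0]
    t = torch.randint(0, T, (b,))
    a = alphas_bar[t].view(b, 1, 1, 1)
    eps = torch.randn_like(dose_maps)
    x_t = a.sqrt() * dose_maps + (1 - a).sqrt() * eps  # forward (noising) process
    loss = ((noise_predictor(x_t) - eps) ** 2).mean()  # noise-prediction objective
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(training_step(torch.rand(4, 1, 64, 64)))          # toy dose distribution maps
```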
Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion
November 05, 2023
Ethan Pronovost, Meghana Reddy Ganesina, Noureldin Hendy, Zeyu Wang, Andres Morales, Kai Wang, Nicholas Roy
Automated creation of synthetic traffic scenarios is a key part of validating
the safety of autonomous vehicles (AVs). In this paper, we propose Scenario
Diffusion, a novel diffusion-based architecture for generating traffic
scenarios that enables controllable scenario generation. We combine latent
diffusion, object detection and trajectory regression to generate distributions
of synthetic agent poses, orientations and trajectories simultaneously. To
provide additional control over the generated scenario, this distribution is
conditioned on a map and sets of tokens describing the desired scenario. We
show that our approach has sufficient expressive capacity to model diverse
traffic patterns and generalizes to different geographical regions.
Domain Transfer in Latent Space (DTLS) Wins on Image Super-Resolution – a Non-Denoising Model
November 04, 2023
Chun-Chuen Hui, Wan-Chi Siu, Ngai-Fong Law
Large scale image super-resolution is a challenging computer vision task,
since vast information is missing in a highly degraded image, for example
for scale x16 super-resolution. Diffusion models have been used successfully in recent
years in extreme super-resolution applications, in which Gaussian noise is used
as a means to form a latent photo-realistic space, and acts as a link between
the space of latent vectors and the latent photo-realistic space. There are
quite a few sophisticated mathematical derivations on mapping the statistics of
Gaussian noise that make diffusion models successful. In this paper we propose a
simple approach which gets away from using Gaussian noise but adopts some basic
structures of diffusion models for efficient image super-resolution.
Essentially, we propose a DNN to perform domain transfer between neighboring
domains, which can learn the differences in statistical properties to
facilitate gradual interpolation with results of reasonable quality. Further
quality improvement is achieved by conditioning the domain transfer with
reference to the input LR image. Experimental results show that our method
outperforms not only state-of-the-art large-scale super-resolution models, but
also the current diffusion models for image super-resolution. The approach can
readily be extended to other image-to-image tasks, such as image enlightening,
inpainting, denoising, etc.
Stable Diffusion Reference Only: Image Prompt and Blueprint Jointly Guided Multi-Condition Diffusion Model for Secondary Painting
November 04, 2023
Hao Ai, Lu Sheng
Stable Diffusion and ControlNet have achieved excellent results in the field
of image generation and synthesis. However, due to the granularity and method
of its control, the efficiency improvement is limited for professional artistic
creations such as comics and animation production whose main work is secondary
painting. In the current workflow, fixing characters and image styles often
requires lengthy text prompts, and may even require further training through
Textual Inversion, DreamBooth, or other methods, which is very complicated and
expensive for painters. Therefore, we present a new method in this paper,
Stable Diffusion Reference Only, an image-to-image self-supervised model that
uses only two types of conditional images for precise control generation to
accelerate secondary painting. The first type of conditional image serves as an
image prompt, supplying the necessary conceptual and color information for
generation. The second type is a blueprint image, which controls the visual
structure of the generated image. It is natively embedded into the original
UNet, eliminating the need for ControlNet. We released all the code for the
module and pipeline, and trained a controllable character line art coloring
model at https://github.com/aihao2000/stable-diffusion-reference-only, that
achieved state-of-the-art results in this field. This verifies the
effectiveness of the structure and greatly improves the production efficiency
of animations, comics, and fanworks.
Sparse Training of Discrete Diffusion Models for Graph Generation
November 03, 2023
Yiming Qin, Clement Vignac, Pascal Frossard
Generative models for graphs often encounter scalability challenges due to
the inherent need to predict interactions for every node pair. Despite the
sparsity often exhibited by real-world graphs, the unpredictable sparsity
patterns of their adjacency matrices, stemming from their unordered nature,
lead to quadratic computational complexity. In this work, we introduce
SparseDiff, a denoising diffusion model for graph generation that is able to
exploit sparsity during its training phase. At the core of SparseDiff is a
message-passing neural network tailored to predict only a subset of edges
during each forward pass. When combined with a sparsity-preserving noise model,
this model can efficiently work with edge-list representations of graphs,
paving the way for scalability to much larger structures. During the sampling
phase, SparseDiff iteratively populates the adjacency matrix from its prior
state, ensuring prediction of the full graph while controlling memory
utilization. Experimental results show that SparseDiff simultaneously matches
state-of-the-art in generation performance on both small and large graphs,
highlighting the versatility of our method.
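One hedged way to picture "predict only a subset of edges during each forward pass": compute the loss on all observed edges plus an equal number of randomly drawn candidate non-edges, so each pass touches O(|E|) pairs instead of O(N^2). The sampling rule below is an illustrative assumption, not SparseDiff's exact subset or noise design.

```python
import numpy as np

def sample_query_pairs(edge_list, num_nodes, rng=None):
    """Return the node pairs whose edge state is predicted this forward pass."""
    rng = np.random.default_rng() if rng is None else rng
    existing = {tuple(sorted(e)) for e in edge_list}
    negatives = set()
    while len(negatives) < len(existing):
        i, j = (int(v) for v in rng.integers(0, num_nodes, size=2))
        if i != j and tuple(sorted((i, j))) not in existing:
            negatives.add(tuple(sorted((i, j))))
    return list(existing) + list(negatives)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]        # toy graph as an edge list
print(sample_query_pairs(edges, num_nodes=10))
```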
Score Models for Offline Goal-Conditioned Reinforcement Learning
November 03, 2023
Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, Scott Niekum
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with
learning to achieve multiple goals in an environment purely from offline
datasets using sparse reward functions. Offline GCRL is pivotal for developing
generalist agents capable of leveraging pre-existing datasets to learn diverse
and reusable skills without hand-engineering reward functions. However,
contemporary approaches to GCRL based on supervised learning and contrastive
learning are often suboptimal in the offline setting. An alternative
perspective on GCRL optimizes for occupancy matching, but necessitates learning
a discriminator, which subsequently serves as a pseudo-reward for downstream
RL. Inaccuracies in the learned discriminator can cascade, negatively
influencing the resulting policy. We present a novel approach to GCRL under a
new lens of mixture-distribution matching, leading to our discriminator-free
method: SMORe. The key insight is combining the occupancy matching perspective
of GCRL with a convex dual formulation to derive a learning objective that can
better leverage suboptimal offline data. SMORe learns scores or unnormalized
densities representing the importance of taking an action at a state for
reaching a particular goal. SMORe is principled and our extensive experiments
on the fully offline GCRL benchmark composed of robot manipulation and
locomotion tasks, including high-dimensional observations, show that SMORe can
outperform state-of-the-art baselines by a significant margin.
Latent Diffusion Model for Conditional Reservoir Facies Generation
November 03, 2023
Daesoo Lee, Oscar Ovanger, Jo Eidsvik, Erlend Aune, Jacob Skauvold, Ragnar Hauge
physics.geo-ph, cs.LG, stat.ML
Creating accurate and geologically realistic reservoir facies based on
limited measurements is crucial for field development and reservoir management,
especially in the oil and gas sector. Traditional two-point geostatistics,
while foundational, often struggle to capture complex geological patterns.
Multi-point statistics offers more flexibility, but comes with its own
challenges. With the rise of Generative Adversarial Networks (GANs) and their
success in various fields, there has been a shift towards using them for facies
generation. However, recent advances in the computer vision domain have shown
the superiority of diffusion models over GANs. Motivated by this, a novel
Latent Diffusion Model is proposed, which is specifically designed for
conditional generation of reservoir facies. The proposed model produces
high-fidelity facies realizations that rigorously preserve conditioning data.
It significantly outperforms a GAN-based alternative.
On the Generalization Properties of Diffusion Models
November 03, 2023
Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian
Diffusion models are a class of generative models that serve to establish a
stochastic transport map between an empirically observed, yet unknown, target
distribution and a known prior. Despite their remarkable success in real-world
applications, a theoretical understanding of their generalization capabilities
remains underdeveloped. This work embarks on a comprehensive theoretical
exploration of the generalization attributes of diffusion models. We establish
theoretical estimates of the generalization gap that evolves in tandem with the
training dynamics of score-based diffusion models, suggesting a polynomially
small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$
and the model capacity $m$, evading the curse of dimensionality (i.e., not
exponentially large in the data dimension) when early-stopped. Furthermore, we
extend our quantitative analysis to a data-dependent scenario, wherein target
distributions are portrayed as a succession of densities with progressively
increasing distances between modes. This precisely elucidates the adverse
effect of “mode shift” in the ground truth on model generalization. Moreover,
these estimates are not solely theoretical constructs but have also been
confirmed through numerical simulations. Our findings contribute to the
rigorous understanding of diffusion models’ generalization properties and
provide insights that may guide practical applications.
PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation
November 03, 2023
Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang YU, Tao Chen
Recent advances in implicit neural representations have achieved impressive
results by sampling and fusing individual points along sampling rays in the
sampling space. However, due to the explosively growing sampling space, finely
representing and synthesizing detailed textures remains a challenge for
unbounded large-scale outdoor scenes. To alleviate the dilemma of using
individual points to perceive the entire colossal space, we explore learning
the surface distribution of the scene to provide structural priors and reduce
the samplable space and propose a Point Diffusion implicit Function, PDF, for
large-scale scene neural representation. The core of our method is a
large-scale point cloud super-resolution diffusion module that enhances the
sparse point cloud reconstructed from several training images into a dense
point cloud as an explicit prior. Then in the rendering stage, only sampling
points with prior points within the sampling radius are retained. That is, the
sampling space is reduced from the unbounded space to the scene surface.
Meanwhile, to fill in the background of the scene that cannot be provided by
point clouds, the region sampling based on Mip-NeRF 360 is employed to model
the background representation. Extensive experiments have demonstrated the
effectiveness of our method for large-scale scene novel view synthesis, which
outperforms relevant state-of-the-art baselines.
Investigating the Behavior of Diffusion Models for Accelerating Electronic Structure Calculations
November 02, 2023
Daniel Rothchild, Andrew S. Rosen, Eric Taw, Connie Robinson, Joseph E. Gonzalez, Aditi S. Krishnapriyan
physics.chem-ph, cond-mat.mtrl-sci, cs.LG, physics.comp-ph
We present an investigation into diffusion models for molecular generation,
with the aim of better understanding how their predictions compare to the
results of physics-based calculations. The investigation into these models is
driven by their potential to significantly accelerate electronic structure
calculations using machine learning, without requiring expensive
first-principles datasets for training interatomic potentials. We find that the
inference process of a popular diffusion model for de novo molecular generation
is divided into an exploration phase, where the model chooses the atomic
species, and a relaxation phase, where it adjusts the atomic coordinates to
find a low-energy geometry. As training proceeds, we show that the model
initially learns about the first-order structure of the potential energy
surface, and then later learns about higher-order structure. We also find that
the relaxation phase of the diffusion model can be re-purposed to sample the
Boltzmann distribution over conformations and to carry out structure
relaxations. For structure relaxations, the model finds geometries with ~10x
lower energy than those produced by a classical force field for small organic
molecules. Initializing a density functional theory (DFT) relaxation at the
diffusion-produced structures yields a >2x speedup to the DFT relaxation when
compared to initializing at structures relaxed with a classical force field.
De-Diffusion Makes Text a Strong Cross-Modal Interface
November 01, 2023
Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
We demonstrate text as a strong cross-modal interface. Rather than relying on
deep embeddings to connect image and language as the interface representation,
our approach represents an image as text, from which we enjoy the
interpretability and flexibility inherent to natural language. We employ an
autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
The encoder is trained to transform an input image into text, which is then fed
into the fixed text-to-image diffusion decoder to reconstruct the original
input – a process we term De-Diffusion. Experiments validate both the
precision and comprehensiveness of De-Diffusion text representing images, such
that it can be readily ingested by off-the-shelf text-to-image tools and LLMs
for diverse multi-modal tasks. For example, a single De-Diffusion model can
generalize to provide transferable prompts for different text-to-image tools,
and also achieves a new state of the art on open-ended vision-language tasks by
simply prompting large language models with few-shot examples.
Intriguing Properties of Data Attribution on Diffusion Models
November 01, 2023
Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, Min Lin
Data attribution seeks to trace model outputs back to training data. With the
recent development of diffusion models, data attribution has become a desired
module to properly assign valuations for high-quality or copyrighted training
samples, ensuring that data contributors are fairly compensated or credited.
Several theoretically motivated methods have been proposed to implement data
attribution, in an effort to improve the trade-off between computational
scalability and effectiveness. In this work, we conduct extensive experiments
and ablation studies on attributing diffusion models, specifically focusing on
DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model
LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive
observations that theoretically unjustified design choices for attribution
empirically outperform previous baselines by a large margin, in terms of both
linear datamodeling score and counterfactual evaluation. Our work presents a
significantly more efficient approach for attributing diffusion models, while
the unexpected findings suggest that at least in non-convex settings,
constructions guided by theoretical assumptions may lead to inferior
attribution performance. The code is available at
https://github.com/sail-sg/D-TRAK.
Diffusion models for probabilistic programming
November 01, 2023
Simon Dirmeier, Fernando Perez-Cruz
We propose Diffusion Model Variational Inference (DMVI), a novel method for
automated approximate inference in probabilistic programming languages (PPLs).
DMVI utilizes diffusion models as variational approximations to the true
posterior distribution by deriving a novel bound to the marginal likelihood
objective used in Bayesian modelling. DMVI is easy to implement, allows
hassle-free inference in PPLs without the drawbacks of, e.g., variational
inference using normalizing flows, and does not impose any constraints on the
underlying neural network model. We evaluate DMVI on a set of common Bayesian
models and show that its posterior inferences are in general more accurate than
those of contemporary methods used in PPLs while having a similar computational
cost and requiring less manual tuning.
Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study
November 01, 2023
Jonghun Kim, Hyunjin Park
Multi-modal images play a crucial role in comprehensive evaluations in
medical image analysis providing complementary information for identifying
clinically important biomarkers. However, in clinical practice, acquiring
multiple modalities can be challenging due to reasons such as scan cost,
limited scan time, and safety considerations. In this paper, we propose a model
based on the latent diffusion model (LDM) that leverages switchable blocks for
image-to-image translation in 3D medical images without patch cropping. The 3D
LDM combined with conditioning using the target modality allows generating
high-quality target modality in 3D overcoming the shortcoming of the missing
out-of-slice information in 2D generation methods. The switchable block, noted
as multiple switchable spatially adaptive normalization (MS-SPADE), dynamically
transforms source latents to the desired style of the target latents to help
with the diffusion process. The MS-SPADE block allows us to have one single
model to tackle many translation tasks of one source modality to various
targets, removing the need for many translation models for different scenarios.
Our model exhibited successful image synthesis across different source-target
modality scenarios and surpassed other models in quantitative evaluations
tested on multi-modal brain magnetic resonance imaging datasets of four
different modalities and an independent IXI dataset. Our model demonstrated
successful image synthesis across various modalities even allowing for
one-to-many modality translations. Furthermore, it outperformed other
one-to-one translation models in quantitative evaluations.
Score Normalization for a Faster Diffusion Exponential Integrator Sampler
October 31, 2023
Guoxuan Xia, Duolikun Danier, Ayan Das, Stathi Fotiadis, Farhang Nabiei, Ushnish Sengupta, Alberto Bernacchia
Recently, Zhang et al. have proposed the Diffusion Exponential Integrator
Sampler (DEIS) for fast generation of samples from Diffusion Models. It
leverages the semi-linear nature of the probability flow ordinary differential
equation (ODE) in order to greatly reduce integration error and improve
generation quality at low numbers of function evaluations (NFEs). Key to this
approach is the score function reparameterisation, which reduces the
integration error incurred from using a fixed score function estimate over each
integration step. The original authors use the default parameterisation used by
models trained for noise prediction – multiplying the score by the standard
deviation of the conditional forward noising distribution. We find that
although the mean absolute value of this score parameterisation is close to
constant for a large portion of the reverse sampling process, it changes
rapidly at the end of sampling. As a simple fix, we propose to instead
reparameterise the score (at inference) by dividing it by the average absolute
value of previous score estimates at that time step collected from offline high
NFE generations. We find that our score normalisation (DEIS-SN) consistently
improves FID compared to vanilla DEIS, showing an improvement at 10 NFEs from
6.44 to 5.57 on CIFAR-10 and from 5.9 to 4.95 on LSUN-Church 64x64. Our code is
available at https://github.com/mtkresearch/Diffusion-DEIS-SN
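A hedged sketch of the normalisation itself: average the absolute value of the noise prediction at each timestep over a few offline high-NFE generations, then divide the model's prediction by that per-step statistic during fast sampling. How the rescaled score enters the DEIS update is simplified away here.

```python
import numpy as np

def collect_offline_stats(eps_history):
    """Average |noise prediction| per timestep from offline high-NFE runs.

    `eps_history[t]` holds the noise predictions produced at timestep t
    across several offline generations.
    """
    return {t: float(np.mean(np.abs(np.stack(preds))))
            for t, preds in eps_history.items()}

def normalized_score(eps_pred, t, stats):
    """Rescale the score estimate at inference by the offline per-step statistic."""
    return eps_pred / max(stats[t], 1e-8)

# Toy usage with hypothetical offline predictions at two timesteps.
history = {0: [np.random.randn(8) for _ in range(4)],
           1: [0.1 * np.random.randn(8) for _ in range(4)]}
stats = collect_offline_stats(history)
print(stats)
print(np.round(normalized_score(np.random.randn(8), 1, stats)[:3], 3))
```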
Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion
October 31, 2023
David Marwood, Shumeet Baluja, Yair Alon
Recent progress in text-to-image (TTI) systems, such as StableDiffusion,
Imagen, and DALL-E 2, has made it possible to create realistic images with
simple text prompts. It is tempting to use these systems to eliminate the
manual task of obtaining natural images for training a new machine learning
classifier. However, in all of the experiments performed to date, classifiers
trained solely with synthetic images perform poorly at inference, despite the
images used for training appearing realistic. Examining this apparent
incongruity in detail gives insight into the limitations of the underlying
image generation processes. Through the lens of diversity in image creation
vs. accuracy of what is created, we dissect the differences in semantic
mismatches in what is modeled in synthetic vs. natural images. This will
elucidate the roles of the image-language model, CLIP, and the image generation
model, diffusion. We find four issues that limit the usefulness of TTI systems
for this task: ambiguity, adherence to prompt, lack of diversity, and inability
to represent the underlying concept. We further present surprising insights
into the geometry of CLIP embeddings.
October 31, 2023
Yuxin Zhang, Clément Huneau, Jérôme Idier, Diana Mateus
Despite its wide use in medicine, ultrasound imaging faces several challenges
related to its poor signal-to-noise ratio and several sources of noise and
artefacts. Enhancing ultrasound image quality involves balancing concurrent
factors like contrast, resolution, and speckle preservation. In recent years,
there has been progress both in model-based and learning-based approaches to
improve ultrasound image reconstruction. Bringing the best of both worlds, we
propose a hybrid approach leveraging advances in diffusion models. To this end,
we adapt Denoising Diffusion Restoration Models (DDRM) to incorporate
ultrasound physics through a linear direct model and an unsupervised
fine-tuning of the prior diffusion model. We conduct comprehensive experiments
on simulated, in-vitro, and in-vivo data, demonstrating the efficacy of our
approach in achieving high-quality image reconstructions from a single plane
wave input and in comparison to state-of-the-art methods. Finally, given the
stochastic nature of the method, we analyse in depth the statistical properties
of single and multiple-sample reconstructions, experimentally show the
informativeness of their variance, and provide an empirical model relating this
behaviour to speckle noise. The code and data are available at: (upon
acceptance).
October 31, 2023
Reza Basiri, Karim Manji, Francois Harton, Alisha Poonja, Milos R. Popovic, Shehroz S. Khan
Diabetic Foot Ulcer (DFU) is a serious skin wound requiring specialized care.
However, real DFU datasets are limited, hindering clinical training and
research activities. In recent years, generative adversarial networks and
diffusion models have emerged as powerful tools for generating synthetic images
with remarkable realism and diversity in many applications. This paper explores
the potential of diffusion models for synthesizing DFU images and evaluates
their authenticity through expert clinician assessments. Additionally,
evaluation metrics such as Frechet Inception Distance (FID) and Kernel
Inception Distance (KID) are examined to assess the quality of the synthetic
DFU images. A dataset of 2,000 DFU images is used for training the diffusion
model, and the synthetic images are generated by applying diffusion processes.
The results indicate that the diffusion model successfully synthesizes visually
indistinguishable DFU images. 70% of the time, clinicians marked synthetic DFU
images as real DFUs. However, clinicians demonstrated higher unanimous
confidence in rating real images than synthetic ones. The study also reveals
that FID and KID metrics do not significantly align with clinicians’
assessments, suggesting alternative evaluation approaches are needed. The
findings highlight the potential of diffusion models for generating synthetic
DFU images and their impact on medical training programs and research in wound
detection and classification.
Beyond U: Making Diffusion Models Faster & Lighter
October 31, 2023
Sergio Calvo-Ordonez, Chun-Wun Cheng, Jiahao Huang, Lipei Zhang, Guang Yang, Carola-Bibiane Schonlieb, Angelica I Aviles-Rivero
Diffusion Probabilistic Models stand as a critical tool in generative
modelling, enabling the generation of complex data distributions. This family
of generative models yields record-breaking performance in tasks such as image
synthesis, video generation, and molecule design. Despite their capabilities,
their efficiency, especially in the reverse process, remains a challenge due to
slow convergence rates and high computational costs. In this paper, we
introduce an approach that leverages continuous dynamical systems to design a
novel denoising network for diffusion models that is more parameter-efficient,
exhibits faster convergence, and demonstrates increased noise robustness.
Experimenting with Denoising Diffusion Probabilistic Models (DDPMs), our
framework operates with approximately a quarter of the parameters, and $\sim$
30\% of the Floating Point Operations (FLOPs) compared to standard U-Nets in
DDPMs. Furthermore, our model is notably faster in inference than the baseline
when measured in fair and equal conditions. We also provide a mathematical
intuition as to why our proposed reverse process is faster as well as a
mathematical discussion of the empirical tradeoffs in the denoising downstream
task. Finally, we argue that our method is compatible with existing performance
enhancement techniques, enabling further improvements in efficiency, quality,
and speed.
Scaling Riemannian Diffusion Models
October 30, 2023
Aaron Lou, Minkai Xu, Stefano Ermon
Riemannian diffusion models draw inspiration from standard Euclidean space
diffusion models to learn distributions on general manifolds. Unfortunately,
the additional geometric complexity renders the diffusion transition term
inexpressible in closed form, so prior methods resort to imprecise
approximations of the score matching training objective that degrade
performance and preclude applications in high dimensions. In this work, we
reexamine these approximations and propose several practical improvements. Our
key observation is that most relevant manifolds are symmetric spaces, which are
much more amenable to computation. By leveraging and combining various
ans"{a}tze, we can quickly compute relevant quantities to high precision. On
low dimensional datasets, our correction produces a noticeable improvement,
allowing diffusion to compete with other methods. Additionally, we show that
our method enables us to scale to high dimensional tasks on nontrivial
manifolds. In particular, we model QCD densities on $SU(n)$ lattices and
contrastively learned embeddings on high dimensional hyperspheres.
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
October 30, 2023
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan
Video generation has increasingly gained interest in both academia and
industry. Although commercial tools can generate plausible videos, there is a
limited number of open-source models available for researchers and engineers.
In this work, we introduce two diffusion models for high-quality video
generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V
models synthesize a video based on a given text input, while I2V models
incorporate an additional image input. Our proposed T2V model can generate
realistic and cinematic-quality videos with a resolution of $1024 \times 576$,
outperforming other open-source T2V models in terms of quality. The I2V model
is designed to produce videos that strictly adhere to the content of the
provided reference image, preserving its content, structure, and style. This
model is the first open-source I2V foundation model capable of transforming a
given image into a video clip while maintaining content preservation
constraints. We believe that these open-source video generation models will
contribute significantly to the technological advancements within the
community.
Text-to-3D with Classifier Score Distillation
October 30, 2023
Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi
Text-to-3D generation has made remarkable progress recently, particularly
with methods based on Score Distillation Sampling (SDS) that leverage
pre-trained 2D diffusion models. While the usage of classifier-free guidance is
well acknowledged to be crucial for successful optimization, it is considered
an auxiliary trick rather than the most essential component. In this paper, we
re-evaluate the role of classifier-free guidance in score distillation and
discover a surprising finding: the guidance alone is enough for effective
text-to-3D generation tasks. We name this method Classifier Score Distillation
(CSD), which can be interpreted as using an implicit classification model for
generation. This new perspective reveals new insights for understanding
existing techniques. We validate the effectiveness of CSD across a variety of
text-to-3D tasks including shape generation, texture synthesis, and shape
editing, achieving results superior to those of state-of-the-art methods. Our
project page is https://xinyu-andy.github.io/Classifier-Score-Distillation
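As a point of reference, the classifier-free-guidance decomposition that this finding builds on can be sketched as follows (notation is ours, not quoted from the paper): the guided noise prediction
$$ \hat{\epsilon}_\phi(x_t; y, t) = \epsilon_\phi(x_t; \varnothing, t) + \omega\big[\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\big] $$
separates into an unconditional term and an implicit-classifier term, and CSD amounts to distilling only the bracketed difference, roughly
$$ \nabla_\theta \mathcal{L}_{\mathrm{CSD}} \approx \mathbb{E}_{t,\epsilon}\Big[w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; \varnothing, t)\big)\,\tfrac{\partial x}{\partial \theta}\Big]. $$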
Noise-Free Score Distillation
October 26, 2023
Oren Katzir, Or Patashnik, Daniel Cohen-Or, Dani Lischinski
Score Distillation Sampling (SDS) has emerged as the de facto approach for
text-to-content generation in non-image domains. In this paper, we reexamine
the SDS process and introduce a straightforward interpretation that demystifies
the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the
distillation of an undesired noise term. Building upon our interpretation, we
propose a novel Noise-Free Score Distillation (NFSD) process, which requires
minimal modifications to the original SDS framework. Through this streamlined
design, we achieve more effective distillation of pre-trained text-to-image
diffusion models while using a nominal CFG scale. This strategic choice allows
us to prevent the over-smoothing of results, ensuring that the generated data
is both realistic and complies with the desired prompt. To demonstrate the
efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as
well as several other methods.
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
October 26, 2023
Xinghui Li, Jingyi Lu, Kai Han, Victor Prisacariu
In this paper, we address the challenge of matching semantically similar
keypoints across image pairs. Existing research indicates that the intermediate
output of the UNet within Stable Diffusion (SD) can serve as robust image
feature maps for such a matching task. We demonstrate that by employing a basic
prompt tuning technique, the inherent potential of Stable Diffusion can be
harnessed, resulting in a significant enhancement in accuracy over previous
approaches. We further introduce a novel conditional prompting module that
conditions the prompt on the local details of the input image pairs, leading to
a further improvement in performance. We designate our approach as SD4Match,
short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of
SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets
new benchmarks in accuracy across all these datasets. Particularly, SD4Match
outperforms the previous state-of-the-art by a margin of 12 percentage points
on the challenging SPair-71k dataset.
Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution
October 25, 2023
Aaron Lou, Chenlin Meng, Stefano Ermon
Despite their groundbreaking performance for many generative modeling tasks,
diffusion models have fallen short on discrete data domains such as natural
language. Crucially, standard diffusion models rely on the well-established
theory of score matching, but efforts to generalize this to discrete structures
have not yielded the same empirical gains. In this work, we bridge this gap by
proposing score entropy, a novel discrete score matching loss that is more
stable than existing methods, forms an ELBO for maximum likelihood training,
and can be efficiently optimized with a denoising variant. We scale our Score
Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2,
achieving highly competitive likelihoods while also introducing distinct
algorithmic advantages. In particular, when comparing similarly sized SEDD and
GPT-2 models, SEDD attains comparable perplexities (typically within $+10\%$ of,
and sometimes outperforming, the baseline). Furthermore, SEDD models learn a
more faithful sequence distribution (around $4\times$ better compared to GPT-2
models with ancestral sampling as measured by large models), can trade off
compute for generation quality (needing only $16\times$ fewer network
evaluations to match GPT-2), and enables arbitrary infilling beyond the
standard left to right prompting.
Using Diffusion Models to Generate Synthetic Labelled Data for Medical Image Segmentation
October 25, 2023
Daniel Saragih, Pascal Tyrrell
In this paper, we proposed and evaluated a pipeline for generating synthetic
labeled polyp images with the aim of augmenting automatic medical image
segmentation models. In doing so, we explored the use of diffusion models to
generate and style synthetic labeled data. The HyperKvasir dataset consisting
of 1000 images of polyps in the human GI tract obtained from 2008 to 2016
during clinical endoscopies was used for training and testing. Furthermore, we
did a qualitative expert review, and computed the Fréchet Inception Distance
(FID) and Multi-Scale Structural Similarity (MS-SSIM) between the output images
and the source images to evaluate our samples. To evaluate its augmentation
potential, a segmentation model was trained with the synthetic data to compare
their performance with the real data and previous Generative Adversarial
Networks (GAN) methods. These models were evaluated using the Dice loss (DL)
and Intersection over Union (IoU) score. Our pipeline generated images that
more closely resembled real images according to the FID scores (GAN: $118.37
\pm 1.06 \text{ vs SD: } 65.99 \pm 0.37$). Improvements over GAN methods were
seen on average when the segmenter was entirely trained (DL difference:
$-0.0880 \pm 0.0170$, IoU difference: $0.0993 \pm 0.01493$) or augmented (DL
difference: GAN $-0.1140 \pm 0.0900 \text{ vs SD }-0.1053 \pm 0.0981$, IoU
difference: GAN $0.01533 \pm 0.03831 \text{ vs SD }0.0255 \pm 0.0454$) with
synthetic data. Overall, we obtained more realistic synthetic images and
improved segmentation model performance when fully or partially trained on
synthetic data.
Multi-scale Diffusion Denoised Smoothing
October 25, 2023
Jongheon Jeong, Jinwoo Shin
Along with recent diffusion models, randomized smoothing has become one of the
few tangible approaches that offer adversarial robustness to models at scale,
e.g., those of large pre-trained models. Specifically, one can perform
randomized smoothing on any classifier via a simple “denoise-and-classify”
pipeline, so-called denoised smoothing, given that an accurate denoiser is
available, such as a diffusion model. In this paper, we present scalable methods
to address the current trade-off between certified robustness and accuracy in
denoised smoothing. Our key idea is to “selectively” apply smoothing among
multiple noise scales, coined multi-scale smoothing, which can be efficiently
implemented with a single diffusion model. This approach also suggests a new
objective to compare the collective robustness of multi-scale smoothed
classifiers, and raises the question of which representation of the diffusion
model would maximize the objective. To address this, we propose to further
fine-tune the diffusion model (a) to perform consistent denoising whenever the
original image
is recoverable, but (b) to generate rather diverse outputs otherwise. Our
experiments show that the proposed multi-scale smoothing scheme combined with
diffusion fine-tuning enables strong certified robustness at high noise levels
while maintaining accuracy close to that of non-smoothed classifiers.
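To make the baseline pipeline concrete, here is a minimal denoise-and-classify sketch in Python; the denoiser, classifier, and their call signatures are hypothetical stand-ins, and the paper's multi-scale selection across noise levels is not shown.

    import torch

    def denoised_smoothing_predict(x, denoiser, classifier, sigma, num_classes, n_samples=100):
        # Randomized smoothing via "denoise-and-classify": perturb the input with
        # Gaussian noise of scale sigma, denoise each copy, classify it, and return
        # the majority-vote class. `denoiser` and `classifier` are hypothetical callables.
        votes = torch.zeros(num_classes)
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            clean = denoiser(noisy, sigma)            # one-shot denoising at noise level sigma
            logits = classifier(clean.unsqueeze(0))   # classify the denoised image
            votes[logits.argmax().item()] += 1
        return int(votes.argmax().item())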
Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models
October 25, 2023
Tianyi Lu, Xing Zhang, Jiaxi Gu, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu
Latent Diffusion Models (LDMs) are renowned for their powerful capabilities
in image and video synthesis. Yet, video editing methods suffer from
insufficient pre-training data or video-by-video re-training cost. In
addressing this gap, we propose FLDM (Fused Latent Diffusion Model), a
training-free framework to achieve text-guided video editing by applying
off-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses
latents from an image LDM and a video LDM during the denoising process. In
this way, the temporal consistency of the video LDM is preserved while the high
fidelity of the image LDM is also exploited. Meanwhile, FLDM is highly flexible,
since both the image LDM and the video LDM can be replaced, so that advanced
image editing methods such as InstructPix2Pix and ControlNet can be plugged in.
To the best of our knowledge, FLDM is the first method to adapt off-the-shelf
image editing methods into video LDMs for video editing. Extensive quantitative
and qualitative experiments demonstrate that FLDM can improve the textual
alignment and temporal consistency of edited videos.
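A minimal sketch of the fusion idea, assuming hypothetical `denoise_step` interfaces for the two models (the actual FLDM schedule and fusion weights are not specified here):

    import torch

    def fused_denoising_step(z, image_ldm, video_ldm, t, alpha=0.5):
        # z: video latents of shape (batch, frames, channels, h, w).
        # Run a frame-wise update with the image LDM and a temporally-aware update
        # with the video LDM on the same latents, then blend the two results.
        z_img = image_ldm.denoise_step(z.flatten(0, 1), t)  # treat each frame independently
        z_img = z_img.view_as(z)
        z_vid = video_ldm.denoise_step(z, t)                # preserves temporal consistency
        return alpha * z_img + (1.0 - alpha) * z_vid        # fuse the latents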
DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object Detection
October 25, 2023
Se-Ho Kim, Inyong Koo, Inyoung Lee, Byeongjun Park, Changick Kim
Denoising diffusion models show remarkable performances in generative tasks,
and their potential applications in perception tasks are gaining interest. In
this paper, we introduce a novel framework named DiffRef3D which adopts the
diffusion process on 3D object detection with point clouds for the first time.
Specifically, we formulate the proposal refinement stage of two-stage 3D object
detectors as a conditional diffusion process. During training, DiffRef3D
gradually adds noise to the residuals between proposals and target objects,
then applies the noisy residuals to proposals to generate hypotheses. The
refinement module utilizes these hypotheses to denoise the noisy residuals and
generate accurate box predictions. In the inference phase, DiffRef3D generates
initial hypotheses by sampling noise from a Gaussian distribution as residuals
and refines the hypotheses through iterative steps. DiffRef3D is a versatile
proposal refinement framework that consistently improves the performance of
existing 3D object detection models. We demonstrate the significance of
DiffRef3D through extensive experiments on the KITTI benchmark. Code will be
available.
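A rough sketch of the training signal described above, with a hypothetical `refine_module` and an illustrative cosine noise schedule (not the paper's exact parameterization):

    import torch

    def diffref3d_train_step(proposals, gt_boxes, refine_module, T=1000):
        # proposals, gt_boxes: (N, box_dim) tensors. Diffuse the proposal-to-target
        # residuals, apply the noisy residuals to the proposals to form hypotheses,
        # and train the refinement head to recover the clean residuals.
        t = torch.randint(0, T, (proposals.shape[0],))
        alpha_bar = torch.cos(0.5 * torch.pi * t.float() / T) ** 2      # illustrative schedule
        residual = gt_boxes - proposals
        noise = torch.randn_like(residual)
        noisy_residual = (alpha_bar.sqrt()[:, None] * residual
                          + (1 - alpha_bar).sqrt()[:, None] * noise)
        hypotheses = proposals + noisy_residual                         # perturbed boxes
        pred_residual = refine_module(hypotheses, t)                    # hypothetical head
        return torch.nn.functional.mse_loss(pred_residual, residual)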
Generative Pre-training for Speech with Flow Matching
October 25, 2023
Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
eess.AS, cs.CL, cs.LG, cs.SD
Generative models have gained more and more attention in recent years for
their remarkable success in tasks that require estimating and sampling a data
distribution to generate high-fidelity synthetic data. In speech,
text-to-speech synthesis and neural vocoder are good examples where generative
models have shined. While generative models have been applied to different
applications in speech, there exists no general-purpose generative model that
models speech directly. In this work, we take a step in this direction by
showing a single pre-trained generative model can be adapted to different
downstream tasks with strong performance. Specifically, we pre-trained a
generative model, named SpeechFlow, on 60k hours of untranscribed speech with
Flow Matching and masked conditions. Experiment results show the pre-trained
generative model can be fine-tuned with task-specific data to match or surpass
existing expert models on speech enhancement, separation, and synthesis. Our
work suggests that a foundational model for generation tasks in speech can be built
with generative pre-training.
Score Matching-based Pseudolikelihood Estimation of Neural Marked Spatio-Temporal Point Process with Uncertainty Quantification
October 25, 2023
Zichong Li, Qunzhi Xu, Zhenghao Xu, Yajun Mei, Tuo Zhao, Hongyuan Zha
Spatio-temporal point processes (STPPs) are potent mathematical tools for
modeling and predicting events with both temporal and spatial features. Despite
their versatility, most existing methods for learning STPPs either assume a
restricted form of the spatio-temporal distribution, or suffer from inaccurate
approximations of the intractable integral in the likelihood training
objective. These issues typically arise from the normalization term of the
probability density function. Moreover, current techniques fail to provide
uncertainty quantification for model predictions, such as confidence intervals
for the predicted event’s arrival time and confidence regions for the event’s
location, which is crucial given the considerable randomness of the data. To
tackle these challenges, we introduce SMASH: a Score MAtching-based
pSeudolikeliHood estimator for learning marked STPPs with uncertainty
quantification. Specifically, our framework adopts a normalization-free
objective by estimating the pseudolikelihood of marked STPPs through
score-matching and offers uncertainty quantification for the predicted event
time, location and mark by computing confidence regions over the generated
samples. The superior performance of our proposed framework is demonstrated
through extensive experiments in both event prediction and uncertainty
quantification.
RAEDiff: Denoising Diffusion Probabilistic Models Based Reversible Adversarial Examples Self-Generation and Self-Recovery
October 25, 2023
Fan Xing, Xiaoyi Zhou, Xuefeng Fan, Zhuo Tian, Yan Zhao
cs.CR, cs.AI, cs.GR, cs.LG
Collected and annotated datasets, which are obtained through extensive
efforts, are effective for training Deep Neural Network (DNN) models. However,
these datasets are susceptible to misuse by unauthorized users, resulting
in infringement of Intellectual Property (IP) rights owned by the dataset
creators. Reversible Adversarial Examples (RAEs) can help to solve the issues
of IP protection for datasets. RAEs are adversarially perturbed images that can
be restored to the original. As a cutting-edge approach, the RAE scheme can serve
the purposes of preventing unauthorized users from engaging in malicious model
training, as well as ensuring the legitimate usage of authorized users.
Nevertheless, in the existing work, RAEs still rely on the embedded auxiliary
information for restoration, which may compromise their adversarial abilities.
In this paper, a novel self-generation and self-recovery method, named
RAEDiff, is introduced for generating RAEs based on a Denoising Diffusion
Probabilistic Model (DDPM). It diffuses datasets into a Biased Gaussian
Distribution (BGD) and utilizes the prior knowledge of the DDPM for generating
and recovering RAEs. The experimental results demonstrate that RAEDiff
effectively self-generates adversarial perturbations for DNN models, including
Artificial Intelligence Generated Content (AIGC) models, while also exhibiting
significant self-recovery capabilities.
A Diffusion Weighted Graph Framework for New Intent Discovery
October 24, 2023
Wenkai Shi, Wenbin An, Feng Tian, Qinghua Zheng, QianYing Wang, Ping Chen
New Intent Discovery (NID) aims to recognize both new and known intents from
unlabeled data with the aid of limited labeled data containing only known
intents. Without considering structure relationships between samples, previous
methods generate noisy supervisory signals which cannot strike a balance
between quantity and quality, hindering the formation of new intent clusters
and effective transfer of the pre-training knowledge. To mitigate this
limitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to
capture both semantic similarities and structure relationships inherent in
data, enabling more sufficient and reliable supervisory signals. Specifically,
for each sample, we diffuse neighborhood relationships along semantic paths
guided by the nearest neighbors for multiple hops to characterize its local
structure discriminately. Then, we sample its positive keys and weigh them
based on semantic similarities and local structures for contrastive learning.
During inference, we further propose Graph Smoothing Filter (GSF) to explicitly
utilize the structure relationships to filter high-frequency noise embodied in
semantically ambiguous samples on the cluster boundary. Extensive experiments
show that our method outperforms state-of-the-art models on all evaluation
metrics across multiple benchmark datasets. Code and data are available at
https://github.com/yibai-shi/DWGF.
A Comparative Study of Variational Autoencoders, Normalizing Flows, and Score-based Diffusion Models for Electrical Impedance Tomography
October 24, 2023
Huihui Wang, Guixian Xu, Qingping Zhou
Electrical Impedance Tomography (EIT) is a widely employed imaging technique
in industrial inspection, geophysical prospecting, and medical imaging.
However, the inherent nonlinearity and ill-posedness of EIT image
reconstruction present challenges for classical regularization techniques, such
as the critical selection of regularization terms and the lack of prior
knowledge. Deep generative models (DGMs) have been shown to play a crucial role
in learning implicit regularizers and prior knowledge. This study aims to
investigate the potential of three DGMs (variational autoencoder networks,
normalizing flows, and score-based diffusion models) to learn implicit
regularizers in learning-based EIT imaging. We first introduce background
information on EIT imaging and its inverse problem formulation. Next, we
propose three algorithms for performing EIT inverse problems based on
corresponding DGMs. Finally, we present numerical and visual experiments, which
reveal that (1) no single method consistently outperforms the others across all
settings, and (2) when reconstructing an object with 2 anomalies using a
well-trained model based on a training dataset containing 4 anomalies, the
conditional normalizing flow model (CNF) exhibits the best generalization in
low-level noise, while the conditional score-based diffusion model (CSD*)
demonstrates the best generalization in high-level noise settings. We hope our
preliminary efforts will encourage other researchers to assess their DGMs in
EIT and other nonlinear inverse problems.
Discriminator Guidance for Autoregressive Diffusion Models
October 24, 2023
Filip Ekström Kelvinius, Fredrik Lindsten
We introduce discriminator guidance in the setting of Autoregressive
Diffusion Models. The use of a discriminator to guide a diffusion process has
previously been used for continuous diffusion models, and in this work we
derive ways of using a discriminator together with a pretrained generative
model in the discrete case. First, we show that using an optimal discriminator
will correct the pretrained model and enable exact sampling from the underlying
data distribution. Second, to account for the realistic scenario of using a
sub-optimal discriminator, we derive a sequential Monte Carlo algorithm which
iteratively takes the predictions from the discriminator into account during the
generation process. We test these approaches on the task of generating
molecular graphs and show how the discriminator improves the generative
performance over using only the pretrained model.
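The density-ratio identity behind the optimal-discriminator claim can be written as follows (standard reasoning, not a formula quoted from the paper): if $d^*$ is the Bayes-optimal discriminator between data samples and samples from the pretrained model $p_\theta$, then
$$ \frac{d^*(x)}{1-d^*(x)} = \frac{p_{\mathrm{data}}(x)}{p_\theta(x)} \quad\Longrightarrow\quad p_{\mathrm{data}}(x) = p_\theta(x)\,\frac{d^*(x)}{1-d^*(x)}, $$
so guiding (or reweighting) the pretrained model by this ratio recovers exact sampling; the paper's sequential Monte Carlo scheme handles the sub-optimal-discriminator case.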
Improving Diffusion Models for ECG Imputation with an Augmented Template Prior
October 24, 2023
Alexander Jenkins, Zehua Chen, Fu Siong Ng, Danilo Mandic
Pulsative signals such as the electrocardiogram (ECG) are extensively
collected as part of routine clinical care. However, noisy and poor-quality
recordings are a major issue for signals collected using mobile health systems,
decreasing the signal quality, leading to missing values, and affecting
automated downstream tasks. Recent studies have explored the imputation of
missing values in ECG with probabilistic time-series models. Nevertheless, in
comparison with the deterministic models, their performance is still limited,
as the variations across subjects and heart-beat relationships are not
explicitly considered in the training objective. In this work, to improve the
imputation and forecasting accuracy for ECG with probabilistic models, we
present a template-guided denoising diffusion probabilistic model (DDPM),
PulseDiff, which is conditioned on an informative prior for a range of health
conditions. Specifically, 1) we first extract a subject-level pulsative
template from the observed values to use as an informative prior of the missing
values, which personalises the prior; 2) we then add beat-level stochastic
shift terms to augment the prior, which considers variations in the position
and amplitude of the prior at each beat; 3) we finally design a confidence
score to consider the health condition of the subject, which ensures our prior
is provided safely. Experiments with the PTBXL dataset reveal that PulseDiff
improves the performance of two strong DDPM baseline models, CSDI and
SSSD$^{S4}$, verifying that our method guides the generation of DDPMs while
managing the uncertainty. When combined with SSSD$^{S4}$, PulseDiff outperforms
the leading deterministic model for short-interval missing data and is
comparable for long-interval data loss.
AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing
October 24, 2023
Namjoon Suh, Xiaofeng Lin, Din-Yin Hsieh, Merhdad Honarkhah, Guang Cheng
The diffusion model has become a main paradigm for synthetic data generation in
many subfields of modern machine learning, including computer vision, language
modeling, and speech synthesis. In this paper, we leverage the power of diffusion
models to generate synthetic tabular data. The heterogeneous features in
tabular data have been a main obstacle in tabular data synthesis, and we tackle
this problem by employing the auto-encoder architecture. When compared with the
state-of-the-art tabular synthesizers, the resulting synthetic tables from our
model show nice statistical fidelities to the real data, and perform well in
downstream tasks for machine learning utilities. We conducted the experiments
over $15$ publicly available datasets. Notably, our model adeptly captures the
correlations among features, which has been a long-standing challenge in
tabular data synthesis. Our code is available at
https://github.com/UCLA-Trustworthy-AI-Lab/AutoDiffusion.
Matryoshka Diffusion Models
October 23, 2023
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly
Diffusion models are the de facto approach for generating high-quality images
and videos, but learning high-dimensional models remains a formidable task due
to computational and optimization challenges. Existing methods often resort to
training cascaded models in pixel space or using a downsampled latent space of
a separately trained auto-encoder. In this paper, we introduce Matryoshka
Diffusion Models (MDM), an end-to-end framework for high-resolution image and
video synthesis. We propose a diffusion process that denoises inputs at
multiple resolutions jointly and uses a NestedUNet architecture where features
and parameters for small-scale inputs are nested within those of large scales.
In addition, MDM enables a progressive training schedule from lower to higher
resolutions, which leads to significant improvements in optimization for
high-resolution generation. We demonstrate the effectiveness of our approach on
various benchmarks, including class-conditioned image generation,
high-resolution text-to-image, and text-to-video applications. Remarkably, we
can train a single pixel-space model at resolutions of up to $1024 \times 1024$ pixels,
demonstrating strong zero-shot generalization using the CC12M dataset, which
contains only 12 million images.
Wonder3D: Single Image to 3D using Cross-Domain Diffusion
October 23, 2023
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang
In this work, we introduce Wonder3D, a novel method for efficiently
generating high-fidelity textured meshes from single-view images. Recent methods
based on Score Distillation Sampling (SDS) have shown the potential to recover
3D geometry from 2D diffusion priors, but they typically suffer from
time-consuming per-shape optimization and inconsistent geometry. In contrast,
certain works directly produce 3D information via fast network inferences, but
their results are often of low quality and lack geometric details. To
holistically improve the quality, consistency, and efficiency of image-to-3D
tasks, we propose a cross-domain diffusion model that generates multi-view
normal maps and the corresponding color images. To ensure consistency, we
employ a multi-view cross-domain attention mechanism that facilitates
information exchange across views and modalities. Lastly, we introduce a
geometry-aware normal fusion algorithm that extracts high-quality surfaces from
the multi-view 2D representations. Our extensive evaluations demonstrate that
our method achieves high-quality reconstruction results, robust generalization,
and reasonably good efficiency compared to prior works.
DICE: Diverse Diffusion Model with Scoring for Trajectory Prediction
October 23, 2023
Younwoo Choi, Ray Coden Mercurius, Soheil Mohamad Alizadeh Shabestary, Amir Rasouli
Road user trajectory prediction in dynamic environments is a challenging but
crucial task for various applications, such as autonomous driving. One of the
main challenges in this domain is the multimodal nature of future trajectories
stemming from the unknown yet diverse intentions of the agents. Diffusion
models have proven to be very effective in capturing such stochasticity in
prediction tasks. However, these models involve many computationally expensive
denoising steps and sampling operations that make them a less desirable option
for real-time safety-critical applications. To this end, we present a novel
framework that leverages diffusion models for predicting future trajectories in
a computationally efficient manner. To minimize the computational bottlenecks
in iterative sampling, we employ an efficient sampling mechanism that allows us
to maximize the number of sampled trajectories for improved accuracy while
keeping inference real-time. Moreover, we propose a scoring
mechanism to select the most plausible trajectories by assigning relative
ranks. We show the effectiveness of our approach by conducting empirical
evaluations on common pedestrian (UCY/ETH) and autonomous driving (nuScenes)
benchmark datasets on which our model achieves state-of-the-art performance on
several subsets and metrics.
Diffusion-Based Adversarial Purification for Speaker Verification
October 22, 2023
Yibo Bai, Xiao-Lei Zhang
Recently, automatic speaker verification (ASV) based on deep learning has
proven easily contaminated by adversarial attacks, a new type of attack that
injects imperceptible perturbations into audio signals so as to make ASV produce
wrong decisions. This poses a significant threat to the security and
reliability of ASV systems. To address this issue, we propose a Diffusion-Based
Adversarial Purification (DAP) method that enhances the robustness of ASV
systems against such adversarial attacks. Our method leverages a conditional
denoising diffusion probabilistic model to effectively purify the adversarial
examples and mitigate the impact of perturbations. DAP first introduces
controlled noise into adversarial examples, and then performs a reverse
denoising process to reconstruct clean audio. Experimental results demonstrate
the efficacy of the proposed DAP in enhancing the security of ASV and meanwhile
minimizing the distortion of the purified audio signals.
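The purification loop can be sketched as below; the `diffusion.q_sample` / `diffusion.p_sample` interface and the choice of `t_star` are assumptions, not the paper's API.

    def purify(x_adv, diffusion, t_star=100):
        # Diffusion-based purification: inject a controlled amount of forward noise,
        # then run the reverse chain back to t=0 so that adversarial perturbations
        # are absorbed into (and removed with) the added noise.
        x_t = diffusion.q_sample(x_adv, t_star)      # forward process up to t_star
        for t in range(t_star, 0, -1):
            x_t = diffusion.p_sample(x_t, t)         # one reverse denoising step
        return x_t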
Diffusion-based Data Augmentation for Nuclei Image Segmentation
October 22, 2023
Xinyi Yu, Guanbin Li, Wei Lou, Siqi Liu, Xiang Wan, Yan Chen, Haofeng Li
Nuclei segmentation is a fundamental but challenging task in the quantitative
analysis of histopathology images. Although fully-supervised deep
learning-based methods have made significant progress, a large number of
labeled images are required to achieve great segmentation performance.
Considering that manually labeling all nuclei instances for a dataset is
inefficient, obtaining a large-scale human-annotated dataset is time-consuming
and labor-intensive. Therefore, augmenting a dataset with only a few labeled
images to improve the segmentation performance is of significant research and
application value. In this paper, we introduce the first diffusion-based
augmentation method for nuclei segmentation. The idea is to synthesize a large
number of labeled images to facilitate training the segmentation model. To
achieve this, we propose a two-step strategy. In the first step, we train an
unconditional diffusion model to synthesize the Nuclei Structure that is
defined as the representation of pixel-level semantics and the distance transform.
Each synthetic nuclei structure will serve as a constraint on histopathology
image synthesis and is further post-processed to be an instance map. In the
second step, we train a conditioned diffusion model to synthesize
histopathology images based on nuclei structures. The synthetic histopathology
images paired with synthetic instance maps will be added to the real dataset
for training the segmentation model. The experimental results show that by
augmenting 10% labeled real dataset with synthetic samples, one can achieve
comparable segmentation results with the fully-supervised baseline.
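A compact sketch of the two-step pipeline; all three callables (`structure_model`, `image_model`, `to_instance_map`) are hypothetical stand-ins for the trained models and the post-processing described above.

    def synthesize_labeled_pair(structure_model, image_model, to_instance_map):
        # Step 1: sample a nuclei structure (pixel-level semantics + distance transform)
        # from the unconditional diffusion model.
        structure = structure_model.sample()
        # Step 2: post-process the structure into an instance map (the synthetic label).
        instance_map = to_instance_map(structure)
        # Step 3: sample a histopathology image conditioned on the nuclei structure.
        image = image_model.sample(condition=structure)
        return image, instance_map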
Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions
October 21, 2023
Jincheng Zhang, György Fazekas, Charalampos Saitis
Diffusion models have shown promising results for a wide range of generative
tasks with continuous data, such as image and audio synthesis. However, little
progress has been made on using diffusion models to generate discrete symbolic
music because this new class of generative models is not well suited for
discrete data and its iterative sampling process is computationally
expensive. In this work, we propose a diffusion model combined with a
Generative Adversarial Network, aiming to (i) alleviate one of the remaining
challenges in algorithmic music generation which is the control of generation
towards a target emotion, and (ii) mitigate the slow sampling drawback of
diffusion models applied to symbolic music generation. We first used a trained
Variational Autoencoder to obtain embeddings of a symbolic music dataset with
emotion labels and then used those to train a diffusion model. Our results
demonstrate the successful control of our diffusion model to generate symbolic
music with a desired emotion. Our model achieves several orders of magnitude
improvement in computational cost, requiring merely four time steps to denoise,
whereas current state-of-the-art diffusion models for symbolic music generation
require on the order of thousands of steps.
GraphMaker: Can Diffusion Models Generate Large Attributed Graphs?
October 20, 2023
Mufei Li, Eleonora Kreačić, Vamsi K. Potluru, Pan Li
Large-scale graphs with node attributes are fundamental in real-world
scenarios, such as social and financial networks. The generation of synthetic
graphs that emulate real-world ones is pivotal in graph machine learning,
aiding network evolution understanding and data utility preservation when
original data cannot be shared. Traditional models for graph generation suffer
from limited model capacity. Recent developments in diffusion models have shown
promise in merely graph structure generation or the generation of small
molecular graphs with attributes. However, their applicability to large
attributed graphs remains unaddressed due to challenges in capturing intricate
patterns and scalability. This paper introduces GraphMaker, a novel diffusion
model tailored for generating large attributed graphs. We study the diffusion
models that either couple or decouple graph structure and node attribute
generation to address their complex correlation. We also employ node-level
conditioning and adopt a minibatch strategy for scalability. We further propose
a new evaluation pipeline using models trained on generated synthetic graphs
and tested on original graphs to evaluate the quality of synthetic data.
Empirical evaluations on real-world datasets showcase GraphMaker’s superiority
in generating realistic and diverse large-attributed graphs beneficial for
downstream tasks.
DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
October 20, 2023
Kaiwen Zheng, Cheng Lu, Jianfei Chen, Jun Zhu
Diffusion probabilistic models (DPMs) have exhibited excellent performance
for high-fidelity image generation while suffering from inefficient sampling.
Recent works accelerate the sampling procedure by proposing fast ODE solvers
that leverage the specific ODE form of DPMs. However, they highly rely on
specific parameterization during inference (such as noise/data prediction),
which might not be the optimal choice. In this work, we propose a novel
formulation towards the optimal parameterization during sampling that minimizes
the first-order discretization error of the ODE solution. Based on such
formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by
introducing several coefficients efficiently computed on the pretrained model,
which we call empirical model statistics. We further incorporate multistep
methods and a predictor-corrector framework, and propose some techniques for
improving sample quality at small numbers of function evaluations (NFE) or
large guidance scales. Experiments show that DPM-Solver-v3 achieves
consistently better or comparable performance in both unconditional and
conditional sampling with both pixel-space and latent-space DPMs, especially in
5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on
unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable
Diffusion, bringing a speed-up of 15%$\sim$30% compared to previous
state-of-the-art training-free methods. Code is available at
https://github.com/thu-ml/DPM-Solver-v3.
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
October 19, 2023
Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, Joyce Chai
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks
but lack an intuitive interface for consistent image-to-image (I2I)
translation. Various methods have been explored to address this issue,
including mask-based methods, attention-based methods, and image-conditioning.
However, it remains a critical challenge to enable unpaired I2I translation
with pre-trained DMs while maintaining satisfying consistency. This paper
introduces Cyclenet, a novel but simple method that incorporates cycle
consistency into DMs to regularize image manipulation. We validate Cyclenet on
unpaired I2I tasks of different granularities. Besides the scene and object
level translation, we additionally contribute a multi-domain I2I translation
dataset to study the physical state changes of objects. Our empirical studies
show that Cyclenet is superior in translation consistency and quality, and can
generate high-quality images for out-of-domain distributions with a simple
change of the textual prompt. Cyclenet is a practical framework, which is
robust even with very limited training data (around 2k) and requires minimal
computational resources (1 GPU) to train. Project homepage:
https://cyclenetweb.github.io/
EMIT-Diff: Enhancing Medical Image Segmentation via Text-Guided Diffusion Model
October 19, 2023
Zheyuan Zhang, Lanhong Yao, Bin Wang, Debesh Jha, Elif Keles, Alpay Medetalibeyoglu, Ulas Bagci
Large-scale, big-variant, and high-quality data are crucial for developing
robust and successful deep-learning models for medical applications since they
potentially enable better generalization performance and avoid overfitting.
However, the scarcity of high-quality labeled data always presents significant
challenges. This paper proposes a novel approach to address this challenge by
developing controllable diffusion models for medical image synthesis, called
EMIT-Diff. We leverage recent diffusion probabilistic models to generate
realistic and diverse synthetic medical image data that preserve the essential
characteristics of the original medical images by incorporating edge
information of objects to guide the synthesis process. In our approach, we
ensure that the synthesized samples adhere to medically relevant constraints
and preserve the underlying structure of imaging data. Due to the random
sampling process by the diffusion model, we can generate an arbitrary number of
synthetic images with diverse appearances. To validate the effectiveness of our
proposed method, we conduct an extensive set of medical image segmentation
experiments on multiple datasets, including Ultrasound breast (+13.87%), CT
spleen (+0.38%), and MRI prostate (+7.78%), achieving significant improvements
over the baseline segmentation methods. To the best of our knowledge, these
promising results are the first to demonstrate the effectiveness of EMIT-Diff
for medical image segmentation tasks and to show the feasibility of a
text-guided diffusion model for general medical image segmentation
tasks. With carefully designed ablation experiments, we investigate the
influence of various data augmentation ratios, hyper-parameter settings, patch
size for generating random merging mask settings, and combined influence with
different network architectures.
Product of Gaussian Mixture Diffusion Models
October 19, 2023
Martin Zach, Erich Kobler, Antonin Chambolle, Thomas Pock
In this work we tackle the problem of estimating the density $ f_X $ of a
random variable $ X $ by successive smoothing, such that the smoothed random
variable $ Y $ fulfills the diffusion partial differential equation $
(\partial_t - \Delta_1)f_Y(\,\cdot\,, t) = 0 $ with initial condition $
f_Y(\,\cdot\,, 0) = f_X $. We propose a product-of-experts-type model utilizing
Gaussian mixture experts and study configurations that admit an analytic
expression for $ f_Y (\,\cdot\,, t) $. In particular, with a focus on image
processing, we derive conditions for models acting on filter-, wavelet-, and
shearlet responses. Our construction naturally allows the model to be trained
simultaneously over the entire diffusion horizon using empirical Bayes. We show
numerical results for image denoising where our models are competitive while
being tractable, interpretable, and having only a small number of learnable
parameters. As a byproduct, our models can be used for reliable noise level
estimation, allowing blind denoising of images corrupted by heteroscedastic
noise.
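For intuition, if $\Delta_1$ denotes the Laplacian in the first argument, the heat equation is solved by Gaussian convolution, so a Gaussian-mixture density stays a Gaussian mixture with widened covariances (our sketch of why the smoothed density is analytic; the paper's exact normalization may differ):
$$ f_X(x)=\sum_k w_k\,\mathcal{N}(x;\mu_k,\Sigma_k) \;\Longrightarrow\; f_Y(x,t)=\big(f_X * \mathcal{N}(0,2tI)\big)(x)=\sum_k w_k\,\mathcal{N}(x;\mu_k,\Sigma_k+2tI). $$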
Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation
October 19, 2023
Mariia Zameshina, Olivier Teytaud, Laurent Najman
Latent diffusion models excel at producing high-quality images from text.
Yet, concerns have emerged about the lack of diversity in the generated imagery. To
tackle this, we introduce Diverse Diffusion, a method for boosting image
diversity beyond gender and ethnicity, spanning into richer realms, including
color diversity. Diverse Diffusion is a general unsupervised technique that can
be applied to existing text-to-image models. Our approach focuses on finding
vectors in the Stable Diffusion latent space that are distant from each other.
We generate multiple vectors in the latent space until we find a set of vectors
that meets the desired distance requirements and the required batch size. To
evaluate the effectiveness of our diversity methods, we conduct experiments
examining various characteristics, including color diversity, LPIPS metric, and
ethnicity/gender representation in images featuring humans. The results of our
experiments emphasize the significance of diversity in generating realistic and
varied images, offering valuable insights for improving text-to-image models.
Through the enhancement of image diversity, our approach contributes to the
creation of more inclusive and representative AI-generated art.
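One way to read the latent-selection step is as a simple distance-based acceptance loop; this is our interpretation of the mechanics, with the threshold `min_dist` as an assumed hyperparameter.

    import torch

    def sample_diverse_latents(batch_size, latent_shape, min_dist, max_tries=10000):
        # Draw Gaussian latents and keep one only if it is at least `min_dist` away
        # (L2) from every latent already kept, until the requested batch size is reached.
        kept = []
        for _ in range(max_tries):
            z = torch.randn(latent_shape)
            if all(torch.dist(z, k) >= min_dist for k in kept):
                kept.append(z)
            if len(kept) == batch_size:
                return torch.stack(kept)
        raise RuntimeError("could not assemble a diverse batch; try a smaller min_dist")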
October 19, 2023
Christopher Scarvelis, Haitz Sáez de Ocáriz Borde, Justin Solomon
Score-based generative models (SGMs) sample from a target distribution by
iteratively transforming noise using the score function of the perturbed
target. For any finite training set, this score function can be evaluated in
closed form, but the resulting SGM memorizes its training data and does not
generate novel samples. In practice, one approximates the score by training a
neural network via score-matching. The error in this approximation promotes
generalization, but neural SGMs are costly to train and sample, and the
effective regularization this error provides is not well-understood
theoretically. In this work, we instead explicitly smooth the closed-form score
to obtain an SGM that generates novel samples without training. We analyze our
model and propose an efficient nearest-neighbor-based estimator of its score
function. Using this estimator, our method achieves sampling times competitive
with neural SGMs while running on consumer-grade CPUs.
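The closed-form score referred to above is the standard score of a Gaussian-smoothed empirical distribution; a minimal sketch follows (the paper's additional smoothing and nearest-neighbor estimator are not shown):

    import torch

    def closed_form_score(x, data, sigma):
        # Exact score of p_sigma(x) = (1/N) * sum_i N(x; y_i, sigma^2 I):
        #   grad_x log p_sigma(x) = sum_i w_i(x) * (y_i - x) / sigma^2,
        # where w_i(x) = softmax_i( -||x - y_i||^2 / (2 sigma^2) ).
        d2 = ((x[None, :] - data) ** 2).sum(dim=-1)        # squared distance to each y_i
        w = torch.softmax(-d2 / (2 * sigma ** 2), dim=0)   # posterior weights over data points
        return (w[:, None] * (data - x[None, :])).sum(dim=0) / sigma ** 2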
Bayesian Flow Networks in Continual Learning
October 18, 2023
Mateusz Pyla, Kamil Deja, Bartłomiej Twardowski, Tomasz Trzciński
Bayesian Flow Networks (BFNs) have recently been proposed as one of the most
promising directions toward universal generative modelling, with the ability to
learn any data type. Their power comes from the expressiveness of neural
networks and Bayesian inference which make them suitable in the context of
continual learning. We delve into the mechanics behind BFNs and conduct the
experiments to empirically verify the generative capabilities on non-stationary
data.
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now
October 18, 2023
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
The recent advances in diffusion models (DMs) have revolutionized the
generation of complex and diverse images. However, these models also introduce
potential safety hazards, such as the production of harmful content and
infringement of data copyrights. Although there have been efforts to create
safety-driven unlearning methods to counteract these challenges, doubts remain
about their capabilities. To bridge this uncertainty, we propose an evaluation
framework built upon adversarial attacks (also referred to as adversarial
prompts), in order to discern the trustworthiness of these safety-driven
unlearned DMs. Specifically, our research explores the (worst-case) robustness
of unlearned DMs in eradicating unwanted concepts, styles, and objects,
assessed by the generation of adversarial prompts. We develop a novel
adversarial learning approach called UnlearnDiff that leverages the inherent
classification capabilities of DMs to streamline the generation of adversarial
prompts, making the process as simple for DMs as it is for image classification
attacks. Through comprehensive benchmarking, we assess the unlearning
robustness of five prevalent unlearned DMs across multiple tasks. Our results
underscore the effectiveness and efficiency of UnlearnDiff when compared to
state-of-the-art adversarial prompting methods. Codes are available at
https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper
contains model outputs that may be offensive in nature.
A Survey on Video Diffusion Models
October 16, 2023
Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
The recent wave of AI-generated content (AIGC) has witnessed substantial
success in computer vision, with the diffusion model playing a crucial role in
this achievement. Due to their impressive generative capabilities, diffusion
models are gradually superseding methods based on GANs and auto-regressive
Transformers, demonstrating exceptional performance not only in image
generation and editing, but also in the realm of video-related research.
However, existing surveys mainly focus on diffusion models in the context of
image generation, with few up-to-date reviews on their application in the video
domain. To address this gap, this paper presents a comprehensive review of
video diffusion models in the AIGC era. Specifically, we begin with a concise
introduction to the fundamentals and evolution of diffusion models.
Subsequently, we present an overview of research on diffusion models in the
video domain, categorizing the work into three key areas: video generation,
video editing, and other video understanding tasks. We conduct a thorough
review of the literature in these three key areas, including further
categorization and practical contributions in the field. Finally, we discuss
the challenges faced by research in this domain and outline potential future
developmental trends. A comprehensive list of video diffusion models studied in
this survey is available at
https://github.com/ChenHsing/Awesome-Video-Diffusion-Models.
Self-supervised Fetal MRI 3D Reconstruction Based on Radiation Diffusion Generation Model
October 16, 2023
Junpeng Tan, Xin Zhang, Yao Lv, Xiangmin Xu, Gang Li
Although the use of multiple stacks can handle slice-to-volume motion
correction and artifact removal, several problems remain: 1)
The slice-to-volume method usually uses slices as input, which cannot solve the
problem of uniform intensity distribution and complementarity in regions of
different fetal MRI stacks; 2) The integrity of 3D space is not considered,
which adversely affects the discrimination and generation of globally
consistent information in fetal MRI; 3) Fetal MRI with severe motion artifacts
in the real-world cannot achieve high-quality super-resolution reconstruction.
To address these issues, we propose a novel fetal brain MRI high-quality volume
reconstruction method, called the Radiation Diffusion Generation Model (RDGM).
It is a self-supervised generation method, which incorporates the idea of
Neural Radiance Fields (NeRF) for coordinate-based generation together with a
diffusion model for super-resolution generation. To solve regional intensity
heterogeneity in different directions, we use a pre-trained transformer model
for slice registration, and then, a new regionally Consistent Implicit Neural
Representation (CINR) network sub-module is proposed. CINR can generate the
initial volume by combining a coordinate association map of two different
coordinate mapping spaces. To enhance volume global consistency and
discrimination, we introduce the Volume Diffusion Super-resolution Generation
(VDSG) mechanism. The global intensity discriminant generation from
volume-to-volume is carried out using the idea of diffusion generation, and
CINR becomes the deviation intensity generation network of the volume-to-volume
diffusion model. Finally, the experimental results on real-world fetal brain
MRI stacks demonstrate the state-of-the-art performance of our method.
Generative Entropic Neural Optimal Transport To Map Within and Across Spaces
October 13, 2023
Dominik Klein, Théo Uscidda, Fabian Theis, Marco Cuturi
Learning measure-to-measure mappings is a crucial task in machine learning,
featured prominently in generative modeling. Recent years have witnessed a
surge of techniques that draw inspiration from optimal transport (OT) theory.
Combined with neural network models, these methods, collectively known as
\textit{Neural OT}, use optimal transport as an inductive bias: such mappings
should be optimal w.r.t. a given cost function, in the sense that they are able
to move points in a thrifty way, within (by minimizing displacements) or across
spaces (by being isometric). This principle, while intuitive, is often
confronted with several practical challenges that require adapting the OT
toolbox: cost functions other than the squared-Euclidean cost can be
challenging to handle, the deterministic formulation of Monge maps leaves
little flexibility, mapping across incomparable spaces raises multiple
challenges, while the mass conservation constraint inherent to OT can provide
too much credit to outliers. While each of these mismatches between practice
and theory has been addressed independently in various works, we propose in
this work an elegant framework to unify them, called \textit{generative
entropic neural optimal transport} (GENOT). GENOT can accommodate any cost
function; handles randomness using conditional generative models; can map
points across incomparable spaces, and can be used as an \textit{unbalanced}
solver. We evaluate our approach through experiments conducted on various
synthetic datasets and demonstrate its practicality in single-cell biology. In
this domain, GENOT proves to be valuable for tasks such as modeling cell
development, predicting cellular responses to drugs, and translating between
different data modalities of cells.
Unseen Image Synthesis with Diffusion Models
October 13, 2023
Ye Zhu, Yu Wu, Zhiwei Deng, Olga Russakovsky, Yan Yan
While the current trend in the generative field is scaling up towards larger
models and more training data for generalized domain representations, we go the
opposite direction in this work by synthesizing unseen domain images without
additional training. We do so via latent sampling and geometric optimization
using pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs)
on single-domain datasets. Our key observation is that DDPMs pre-trained even
just on single-domain images are already equipped with sufficient
representation abilities to reconstruct arbitrary images from the inverted
latent encoding following bi-directional deterministic diffusion and denoising
trajectories. This motivates us to investigate the statistical and geometric
behaviors of the Out-Of-Distribution (OOD) samples from unseen image domains in
the latent spaces along the denoising chain. Notably, we theoretically and
empirically show that the inverted OOD samples also establish Gaussians that
are distinguishable from the original In-Domain (ID) samples in the
intermediate latent spaces, which allows us to sample from them directly.
Geometrical domain-specific and model-dependent information of the unseen
subspace (e.g., sample-wise distance and angles) is used to further optimize
the sampled OOD latent encodings from the estimated Gaussian prior. We conduct
extensive analysis and experiments using pre-trained diffusion models (DDPM,
iDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom),
proving the effectiveness of this novel perspective to explore and re-think the
diffusion models’ data synthesis generalization ability.
Sampling from Mean-Field Gibbs Measures via Diffusion Processes
October 13, 2023
Ahmed El Alaoui, Andrea Montanari, Mark Sellke
We consider Ising mixed $p$-spin glasses at high-temperature and without
external field, and study the problem of sampling from the Gibbs distribution
$\mu$ in polynomial time. We develop a new sampling algorithm with complexity
of the same order as evaluating the gradient of the Hamiltonian and, in
particular, at most linear in the input size. We prove that, at sufficiently
high-temperature, it produces samples from a distribution $\mu^{alg}$ which is
close in normalized Wasserstein distance to $\mu$. Namely, there exists a
coupling of $\mu$ and $\mu^{alg}$ such that if $({\boldsymbol x},{\boldsymbol
x}^{alg})\in\{-1,+1\}^n\times \{-1,+1\}^n$ is a pair drawn from this coupling,
then $n^{-1}\,\mathbb{E}\big[\|{\boldsymbol x}-{\boldsymbol
x}^{alg}\|_2^2\big]=o_n(1)$. For the case of the Sherrington-Kirkpatrick model,
our algorithm succeeds in the full replica-symmetric phase.
We complement this result with a negative one for sampling algorithms
satisfying a certain 'stability' property, which is verified by many standard
techniques.
No stable algorithm can approximately sample at temperatures below the onset
of shattering, even under the normalized Wasserstein metric. Further, no
algorithm can sample at temperatures below the onset of replica symmetry
breaking.
Our sampling method implements a discretized version of a diffusion process
that has recently become popular in machine learning under the name of
'denoising diffusion.' We derive the same process from the general construction
of stochastic localization. Implementing the diffusion process requires
efficiently approximating the mean of the tilted measure. To this end, we use an
approximate message passing algorithm that, as we prove, achieves sufficiently
accurate mean estimation.
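In the language of stochastic localization, one standard way to write the sampler sketched above is (our notation, not quoted from the paper):
$$ \mu_{y,t}(\mathrm{d}{\boldsymbol x}) \propto e^{\langle y,{\boldsymbol x}\rangle - \frac{t}{2}\|{\boldsymbol x}\|_2^2}\,\mu(\mathrm{d}{\boldsymbol x}), \qquad y_{t+\delta} = y_t + \delta\, m(y_t,t) + \sqrt{\delta}\, g_t, \quad g_t\sim\mathcal{N}(0,I_n), $$
where $m(y,t)=\mathbb{E}_{\mu_{y,t}}[{\boldsymbol x}]$ is the tilted-measure mean approximated by the AMP routine; as $t$ grows, $y_t/t$ concentrates on an (approximate) sample from $\mu$.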
October 13, 2023
Chaocheng Yang, Tingyin Wang, Xuanhui Yan
Anomaly detection in multivariate time series has emerged as a crucial
challenge in time series research, with significant research implications in
various fields such as fraud detection, fault diagnosis, and system state
estimation. Reconstruction-based models have shown promising potential in
recent years for detecting anomalies in time series data. However, due to the
rapid increase in data scale and dimensionality, the issues of noise and Weak
Identity Mapping (WIM) during time series reconstruction have become
increasingly pronounced. To address this, we introduce a novel Adaptive Dynamic
Neighbor Mask (ADNM) mechanism and integrate it with the Transformer and
Denoising Diffusion Model, creating a new framework for multivariate time
series anomaly detection, named Denoising Diffusion Mask Transformer (DDMT).
The ADNM module is introduced to mitigate information leakage between input and
output features during data reconstruction, thereby alleviating the problem of
WIM during reconstruction. The Denoising Diffusion Transformer (DDT) employs
the Transformer as the internal neural network structure of the Denoising
Diffusion Model. It learns the stepwise generation process of time series data
to model
the probability distribution of the data, capturing normal data patterns and
progressively restoring time series data by removing noise, resulting in a
clear recovery of anomalies. To the best of our knowledge, this is the first
model that combines Denoising Diffusion Model and the Transformer for
multivariate time series anomaly detection. Experimental evaluations were
conducted on five publicly available multivariate time series anomaly detection
datasets. The results demonstrate that the model effectively identifies
anomalies in time series data, achieving state-of-the-art performance in
anomaly detection.
Debias the Training of Diffusion Models
October 12, 2023
Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, Feng Zhao
Diffusion models have demonstrated compelling generation quality by
optimizing the variational lower bound through a simple denoising score
matching loss. In this paper, we provide theoretical evidence that the
prevailing practice of using a constant loss weight strategy in diffusion
models leads to biased estimation during the training phase. Simply optimizing
the denoising network to predict Gaussian noise with constant weighting may
hinder precise estimations of original images. To address the issue, we propose
an elegant and effective weighting strategy grounded in the theoretically
unbiased principle. Moreover, we conduct a comprehensive and systematic
exploration to dissect the inherent bias problem deriving from constant
weighting loss from the perspectives of its existence, impact and reasons.
These analyses are expected to advance our understanding and demystify the
inner workings of diffusion models. Through empirical evaluation, we
demonstrate that our proposed debiased estimation method significantly enhances
sample quality without the reliance on complex techniques, and exhibits
improved efficiency compared to the baseline method both in training and
sampling processes.
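To make the role of the loss weight concrete, here is a hedged sketch of the epsilon-prediction training loss with a pluggable per-timestep weight. The constant branch mirrors the prevailing practice the paper criticizes; the SNR-based branch is only a generic illustration of where a debiasing weight would enter, not necessarily the exact weighting the authors derive.

# Sketch: denoising score-matching loss with a configurable per-timestep weight.
import torch

def weighted_denoising_loss(model, x0, alphas_cumprod, weighting="constant"):
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward diffusion
    eps_pred = model(x_t, t)
    per_sample = ((eps_pred - noise) ** 2).flatten(1).mean(1)
    if weighting == "constant":
        w = torch.ones_like(per_sample)            # the prevailing (biased) choice
    else:                                          # illustrative SNR-based reweighting
        snr = a_bar.flatten() / (1 - a_bar.flatten())
        w = snr / (1 + snr)                        # placeholder, not the paper's exact weight
    return (w * per_sample).mean()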
Neural Diffusion Models
October 12, 2023
Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth
Diffusion models have shown remarkable performance on many generative tasks.
Despite recent success, most diffusion models are restricted in that they only
allow linear transformation of the data distribution. In contrast, a broader
family of transformations can potentially help train generative distributions
more efficiently, simplifying the reverse process and closing the gap between
the true negative log-likelihood and the variational approximation. In this
paper, we present Neural Diffusion Models (NDMs), a generalization of
conventional diffusion models that enables defining and learning time-dependent
non-linear transformations of data. We show how to optimise NDMs using a
variational bound in a simulation-free setting. Moreover, we derive a
time-continuous formulation of NDMs, which allows fast and reliable inference
using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the
utility of NDMs with learnable transformations through experiments on standard
image generation benchmarks, including CIFAR-10, downsampled versions of
ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms
of likelihood and produce high-quality samples.
October 12, 2023
Xianghao Kong, Ollie Liu, Han Li, Dani Yogatama, Greg Ver Steeg
cs.LG, cs.AI, cs.IT, math.IT
Denoising diffusion models enable conditional generation and density modeling
of complex relationships like images and text. However, the nature of the
learned relationships is opaque making it difficult to understand precisely
what relationships between words and parts of an image are captured, or to
predict the effect of an intervention. We illuminate the fine-grained
relationships learned by diffusion models by noticing a precise relationship
between diffusion and information decomposition. Exact expressions for mutual
information and conditional mutual information can be written in terms of the
denoising model. Furthermore, pointwise estimates can also be computed easily,
allowing us to ask questions about the relationships between specific
images and captions. Decomposing information even further to understand which
variables in a high-dimensional space carry information is a long-standing
problem. For diffusion models, we show that a natural non-negative
decomposition of mutual information emerges, allowing us to quantify
informative relationships between words and pixels in an image. We exploit
these new relations to measure the compositional understanding of diffusion
models, to do unsupervised localization of objects in images, and to measure
effects when selectively editing images through prompt interventions.
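The following sketch illustrates the core idea of reading (conditional) mutual information off the denoiser: average the squared gap between conditional and unconditional noise predictions across noise levels. The uniform average over levels and the factor 1/2 are placeholders rather than the paper's exact integration weights, and denoiser is an assumed interface.

# Schematic MI-like estimator from conditional vs. unconditional denoising errors.
import torch

@torch.no_grad()
def pointwise_mi_estimate(denoiser, x0, cond, alphas_cumprod, n_levels=32):
    """Average squared gap between conditional and unconditional noise predictions."""
    idx = torch.linspace(0, len(alphas_cumprod) - 1, n_levels).long()
    total = 0.0
    for t in idx:
        a_bar = alphas_cumprod[t]
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        t_batch = torch.full((x0.shape[0],), int(t), device=x0.device)
        eps_c = denoiser(x_t, t_batch, cond)      # conditional prediction
        eps_u = denoiser(x_t, t_batch, None)      # unconditional prediction
        total = total + ((eps_c - eps_u) ** 2).flatten(1).sum(1)
    return 0.5 * total / n_levels                 # per-example MI-like score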
Efficient Integrators for Diffusion Generative Models
October 11, 2023
Kushagra Pandey, Maja Rudolph, Stephan Mandt
cs.LG, cs.AI, cs.CV, stat.ML
Diffusion models suffer from slow sample generation at inference time.
Therefore, developing a principled framework for fast deterministic/stochastic
sampling for a broader class of diffusion models is a promising direction. We
propose two complementary frameworks for accelerating sample generation in
pre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate
integrators generalize DDIM, mapping the reverse diffusion dynamics to a more
amenable space for sampling. In contrast, splitting-based integrators, commonly
used in molecular dynamics, reduce the numerical simulation error by cleverly
alternating between numerical updates involving the data and auxiliary
variables. After extensively studying these methods empirically and
theoretically, we present a hybrid method that leads to the best-reported
performance for diffusion models in augmented spaces. Applied to Phase Space
Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and
stochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network
function evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing
baselines, respectively. Our code and model checkpoints will be made publicly
available at \url{https://github.com/mandt-lab/PSLD}.
Generative Modeling with Phase Stochastic Bridges
October 11, 2023
Tianrong Chen, Jiatao Gu, Laurent Dinh, Evangelos A. Theodorou, Josh Susskind, Shuangfei Zhai
Diffusion models (DMs) represent state-of-the-art generative models for
continuous inputs. DMs work by constructing a Stochastic Differential Equation
(SDE) in the input space (ie, position space), and using a neural network to
reverse it. In this work, we introduce a novel generative modeling framework
grounded in \textbf{phase space dynamics}, where a phase space is defined as
an augmented space encompassing both position and velocity. Leveraging
insights from Stochastic Optimal Control, we construct a path measure in the
phase space that enables efficient sampling. In contrast to DMs, our framework
demonstrates the capability to generate realistic data points at an early stage
of dynamics propagation. This early prediction sets the stage for efficient
data generation by leveraging additional velocity information along the
trajectory. On standard image generation benchmarks, our model yields favorable
performance over baselines in the regime of small Number of Function
Evaluations (NFEs). Furthermore, our approach rivals the performance of
diffusion models equipped with efficient sampling techniques, underscoring its
potential as a new tool for generative modeling.
Score Regularized Policy Optimization through Diffusion Behavior
October 11, 2023
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu
Recent developments in offline reinforcement learning have uncovered the
immense potential of diffusion modeling, which excels at representing
heterogeneous behavior policies. However, sampling from diffusion policies is
considerably slow because it necessitates tens to hundreds of iterative
inference steps for one action. To address this issue, we propose to extract an
efficient deterministic inference policy from critic models and pretrained
diffusion behavior models, leveraging the latter to directly regularize the
policy gradient with the behavior distribution’s score function during
optimization. Our method enjoys powerful generative capabilities of diffusion
modeling while completely circumventing the computationally intensive and
time-consuming diffusion sampling scheme, both during training and evaluation.
Extensive results on D4RL tasks show that our method boosts action sampling
speed by more than 25 times compared with various leading diffusion-based
methods in locomotion tasks, while still maintaining state-of-the-art
performance.
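A minimal sketch of the score-regularization idea, assuming a critic and a hypothetical interface behavior_score(states, actions) that returns the behavior distribution's score at a chosen noise level: the deterministic policy maximizes the critic while a score term pulls its actions toward high-density regions of the behavior distribution.

# Sketch: critic maximization regularized by the behavior distribution's score.
import torch

def score_regularized_policy_loss(policy, critic, behavior_score, states, alpha=1.0):
    actions = policy(states)
    q_term = -critic(states, actions).mean()               # maximize Q
    score = behavior_score(states, actions).detach()       # grad of log behavior density
    reg_term = -(actions * score).sum(dim=-1).mean()       # gradient pulls actions along the score
    return q_term + alpha * reg_term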
Generative Modeling on Manifolds Through Mixture of Riemannian Diffusion Processes
October 11, 2023
Jaehyeong Jo, Sung Ju Hwang
Learning the distribution of data on Riemannian manifolds is crucial for
modeling data from non-Euclidean space, which is required by many applications
from diverse scientific fields. Yet, existing generative models on manifolds
suffer from expensive divergence computation or rely on approximations of heat
kernel. These limitations restrict their applicability to simple geometries and
hinder scalability to high dimensions. In this work, we introduce the
Riemannian Diffusion Mixture, a principled framework for building a generative
process on manifolds as a mixture of endpoint-conditioned diffusion processes,
instead of relying on the denoising approach of previous diffusion models. The
generative process is characterized by a drift guiding toward the most probable
endpoint with respect to the geometry of the manifold. We further
propose a simple yet efficient training objective for learning the mixture
process that is readily applicable to general manifolds. Our method
outperforms previous generative models on various manifolds while scaling to
high dimensions and requires a dramatically reduced number of in-training
simulation steps for general manifolds.
State of the Art on Diffusion Models for Visual Computing
October 11, 2023
Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Björn Ommer, Christian Theobalt, Peter Wonka, Gordon Wetzstein
cs.AI, cs.CV, cs.GR, cs.LG
The field of visual computing is rapidly advancing due to the emergence of
generative artificial intelligence (AI), which unlocks unprecedented
capabilities for the generation, editing, and reconstruction of images, videos,
and 3D scenes. In these domains, diffusion models are the generative AI
architecture of choice. Within the last year alone, the literature on
diffusion-based tools and applications has seen exponential growth and relevant
papers are published across the computer graphics, computer vision, and AI
communities with new works appearing daily on arXiv. This rapid growth of the
field makes it difficult to keep up with all recent developments. The goal of
this state-of-the-art report (STAR) is to introduce the basic mathematical
concepts of diffusion models, implementation details and design choices of the
popular Stable Diffusion model, as well as overview important aspects of these
generative AI tools, including personalization, conditioning, inversion, among
others. Moreover, we give a comprehensive overview of the rapidly growing
literature on diffusion-based generation and editing, categorized by the type
of generated medium, including 2D images, videos, 3D objects, locomotion, and
4D scenes. Finally, we discuss available datasets, metrics, open challenges,
and social implications. This STAR provides an intuitive starting point to
explore this exciting topic for researchers, artists, and practitioners alike.
Diffusion Prior Regularized Iterative Reconstruction for Low-dose CT
October 10, 2023
Wenjun Xia, Yongyi Shi, Chuang Niu, Wenxiang Cong, Ge Wang
eess.IV, cs.LG, physics.med-ph
Computed tomography (CT) involves a patient’s exposure to ionizing radiation.
To reduce the radiation dose, we can either lower the X-ray photon count or
down-sample projection views. However, either approach often compromises
image quality. To address this challenge, here we introduce an iterative
reconstruction algorithm regularized by a diffusion prior. Drawing on the
exceptional imaging prowess of the denoising diffusion probabilistic model
(DDPM), we merge it with a reconstruction procedure that prioritizes data
fidelity. This fusion capitalizes on the merits of both techniques, delivering
exceptional reconstruction results in an unsupervised framework. To further
enhance the efficiency of the reconstruction process, we incorporate the
Nesterov momentum acceleration technique. This enhancement facilitates superior
diffusion sampling in fewer steps. As demonstrated in our experiments, our
method offers a potential pathway to high-definition CT image reconstruction
with minimized radiation.
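A hedged sketch of the alternation described above: a Nesterov-accelerated data-fidelity gradient step followed by a diffusion-prior refinement. A, A_T, and ddpm_denoise_step are hypothetical stand-ins for the CT forward/adjoint operators and the pre-trained DDPM update; the actual coupling and schedules follow the paper.

# Sketch: momentum-accelerated iterative reconstruction with a diffusion-prior step.
import torch

def reconstruct(y, A, A_T, ddpm_denoise_step, n_iters=100, step=1e-3, momentum=0.9):
    x = A_T(y)                        # crude back-projection as initialization
    v = torch.zeros_like(x)           # Nesterov momentum buffer
    for k in range(n_iters):
        x_look = x + momentum * v                       # look-ahead point
        grad = A_T(A(x_look) - y)                       # data-fidelity gradient
        x_new = x_look - step * grad
        x_new = ddpm_denoise_step(x_new, k, n_iters)    # diffusion-prior refinement
        v = x_new - x
        x = x_new
    return x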
Stochastic Super-resolution of Cosmological Simulations with Denoising Diffusion Models
October 10, 2023
Andreas Schanz, Florian List, Oliver Hahn
astro-ph.CO, astro-ph.IM, cs.LG
In recent years, deep learning models have been successfully employed for
augmenting low-resolution cosmological simulations with small-scale
information, a task known as “super-resolution”. So far, these cosmological
super-resolution models have relied on generative adversarial networks (GANs),
which can achieve highly realistic results, but suffer from various
shortcomings (e.g. low sample diversity). We introduce denoising diffusion
models as a powerful generative model for super-resolving cosmic large-scale
structure predictions (as a first proof-of-concept in two dimensions). To
obtain accurate results down to small scales, we develop a new “filter-boosted”
training approach that redistributes the importance of different scales in the
pixel-wise training objective. We demonstrate that our model not only produces
convincing super-resolution images and power spectra consistent at the percent
level, but is also able to reproduce the diversity of small-scale features
consistent with a given low-resolution simulation. This enables uncertainty
quantification for the generated small-scale features, which is critical for
the usefulness of such super-resolution models as a viable surrogate model for
cosmic structure formation.
What Does Stable Diffusion Know about the 3D Scene?
October 10, 2023
Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
Recent advances in generative models like Stable Diffusion enable the
generation of highly photo-realistic images. Our objective in this paper is to
probe the diffusion network to determine to what extent it ‘understands’
different properties of the 3D scene depicted in an image. To this end, we make
the following contributions: (i) We introduce a protocol to evaluate whether a
network models a number of physical ‘properties’ of the 3D scene by probing for
explicit features that represent these properties. The probes are applied on
datasets of real images with annotations for the property. (ii) We apply this
protocol to properties covering scene geometry, scene material, support
relations, lighting, and view dependent measures. (iii) We find that Stable
Diffusion is good at a number of properties including scene geometry, support
relations, shadows and depth, but less performant for occlusion. (iv) We also
apply the probes to other models trained at large-scale, including DINO and
CLIP, and find their performance inferior to that of Stable Diffusion.
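A minimal sketch of the probing protocol under stated assumptions: features are extracted from the frozen diffusion network for annotated real images (extract_features is a hypothetical hook, not a released API), and a light-weight linear probe is trained to predict a 3D property such as shadow or support relations.

# Sketch: linear probe on frozen diffusion features for a binary 3D property.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, feat_dim, n_classes=2):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, feats):
        return self.head(feats)

def train_probe(extract_features, loader, feat_dim, epochs=5, lr=1e-3):
    probe = LinearProbe(feat_dim)
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                feats = extract_features(images)      # frozen backbone features
            loss = nn.functional.cross_entropy(probe(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe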
Latent Diffusion Counterfactual Explanations
October 10, 2023
Karim Farid, Simon Schrodi, Max Argus, Thomas Brox
Counterfactual explanations have emerged as a promising method for
elucidating the behavior of opaque black-box models. Recently, several works
leveraged pixel-space diffusion models for counterfactual generation. To handle
noisy, adversarial gradients during counterfactual generation – causing
unrealistic artifacts or mere adversarial perturbations – they required either
auxiliary adversarially robust models or computationally intensive guidance
schemes. However, such requirements limit their applicability, e.g., in
scenarios with restricted access to the model’s training data. To address these
limitations, we introduce Latent Diffusion Counterfactual Explanations (LDCE).
LDCE harnesses the capabilities of recent class- or text-conditional foundation
latent diffusion models to expedite counterfactual generation and focus on the
important, semantic parts of the data. Furthermore, we propose a novel
consensus guidance mechanism to filter out noisy, adversarial gradients that
are misaligned with the diffusion model’s implicit classifier. We demonstrate
the versatility of LDCE across a wide spectrum of models trained on diverse
datasets with different learning paradigms. Finally, we showcase how LDCE can
provide insights into model errors, enhancing our understanding of black-box
model behavior.
Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data
October 10, 2023
Lukas Struppek, Martin B. Hentschel, Clifton Poth, Dominik Hintersdorf, Kristian Kersting
Backdoor attacks pose a serious security threat for training neural networks
as they surreptitiously introduce hidden functionalities into a model. Such
backdoors remain silent during inference on clean inputs, evading detection due
to inconspicuous behavior. However, once a specific trigger pattern appears in
the input data, the backdoor activates, causing the model to execute its
concealed function. Detecting such poisoned samples within vast datasets is
virtually impossible through manual inspection. To address this challenge, we
propose a novel approach that enables model training on potentially poisoned
datasets by utilizing the power of recent diffusion models. Specifically, we
create synthetic variations of all training samples, leveraging the inherent
resilience of diffusion models to potential trigger patterns in the data. By
combining this generative approach with knowledge distillation, we produce
student models that maintain their general performance on the task while
exhibiting robust resistance to backdoor triggers.
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
October 10, 2023
Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Yao Yao
We introduce JointNet, a novel neural network architecture for modeling the
joint distribution of images and an additional dense modality (e.g., depth
maps). JointNet is extended from a pre-trained text-to-image diffusion model,
where a copy of the original network is created for the new dense modality
branch and is densely connected with the RGB branch. The RGB branch is locked
during network fine-tuning, which enables efficient learning of the new
modality distribution while maintaining the strong generalization ability of
the large-scale pre-trained diffusion model. We demonstrate the effectiveness
of JointNet by using RGBD diffusion as an example and through extensive
experiments, showcasing its applicability in a variety of applications,
including joint RGBD generation, dense depth prediction, depth-conditioned
image generation, and coherent tile-based 3D panorama generation.
Latent Diffusion Model for DNA Sequence Generation
October 09, 2023
Zehui Li, Yuhao Ni, Tim August B. Huygelen, Akashaditya Das, Guoxuan Xia, Guy-Bart Stan, Yiren Zhao
The harnessing of machine learning, especially deep generative models, has
opened up promising avenues in the field of synthetic DNA sequence generation.
Whilst Generative Adversarial Networks (GANs) have gained traction for this
application, they often face issues such as limited sample diversity and mode
collapse. On the other hand, Diffusion Models are a promising new class of
generative models that are not burdened with these problems, enabling them to
reach the state-of-the-art in domains such as image generation. In light of
this, we propose a novel latent diffusion model, DiscDiff, tailored for
discrete DNA sequence generation. By simply embedding discrete DNA sequences
into a continuous latent space using an autoencoder, we are able to leverage
the powerful generative abilities of continuous diffusion models for the
generation of discrete data. Additionally, we introduce Fréchet
Reconstruction Distance (FReD) as a new metric to measure the sample quality of
DNA sequence generations. Our DiscDiff model demonstrates an ability to
generate synthetic DNA sequences that align closely with real DNA in terms of
Motif Distribution, Latent Embedding Distribution (FReD), and Chromatin
Profiles. Additionally, we contribute a comprehensive cross-species dataset of
150K unique promoter-gene sequences from 15 species, enriching resources for
future generative modelling in genomics. We will make our code public upon
publication.
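For reference, a Fréchet-style distance of the kind behind FReD can be sketched as the standard two-Gaussian formula on embedding sets; how the embeddings are obtained (here assumed to come from the autoencoder's reconstruction space) is specific to the paper.

# Sketch: Frechet distance between Gaussian fits of two embedding sets.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(emb_real, emb_gen):
    mu1, mu2 = emb_real.mean(0), emb_gen.mean(0)
    s1 = np.cov(emb_real, rowvar=False)
    s2 = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                     # drop numerical imaginary parts
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))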
DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models
October 09, 2023
Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
Diffusion models have gained prominence in generating high-quality sequences
of text. Nevertheless, current approaches predominantly represent discrete text
within a continuous diffusion space, which incurs substantial computational
overhead during training and results in slower sampling speeds. In this paper,
we introduce a soft absorbing state that facilitates the diffusion model in
learning to reconstruct discrete mutations based on the underlying Gaussian
space, thereby enhancing its capacity to recover conditional signals. During
the sampling phase, we employ state-of-the-art ODE solvers within the
continuous space to expedite the sampling process. Comprehensive experimental
evaluations reveal that our proposed method effectively accelerates the
training convergence by 4x and generates samples of similar quality 800x
faster, rendering it significantly closer to practical application.
The code is released at \url{https://github.com/Shark-NLP/DiffuSeq}.
DiffCPS: Diffusion Model based Constrained Policy Search for Offline Reinforcement Learning
October 09, 2023
Longxiang He, Linrui Zhang, Junbo Tan, Xueqian Wang
Constrained policy search (CPS) is a fundamental problem in offline
reinforcement learning, which is generally solved by advantage weighted
regression (AWR). However, previous methods may still encounter
out-of-distribution actions due to the limited expressivity of Gaussian-based
policies. On the other hand, directly applying the state-of-the-art models with
distribution expression capabilities (i.e., diffusion models) in the AWR
framework is insufficient since AWR requires exact policy probability
densities, which are intractable in diffusion models. In this paper, we propose
a novel approach called $\textbf{Diffusion Model based Constrained Policy
Search (DiffCPS)}$, which tackles the diffusion-based constrained policy search
without resorting to AWR. The theoretical analysis reveals our key insights by
leveraging the action distribution of the diffusion model to eliminate the
policy distribution constraint in the CPS and then utilizing the Evidence Lower
Bound (ELBO) of diffusion-based policy to approximate the KL constraint.
Consequently, DiffCPS admits the high expressivity of diffusion models while
circumventing the cumbersome density calculation brought by AWR. Extensive
experimental results based on the D4RL benchmark demonstrate the efficacy of
our approach. We empirically show that DiffCPS achieves better or at least
competitive performance compared to traditional AWR-based baselines as well as
recent diffusion-based offline RL methods. The code is now available at
$\href{https://github.com/felix-thu/DiffCPS}{https://github.com/felix-thu/DiffCPS}$.
Latent Diffusion Model for Medical Image Standardization and Enhancement
October 08, 2023
Md Selim, Jie Zhang, Faraneh Fathi, Michael A. Brooks, Ge Wang, Guoqiang Yu, Jin Chen
Computed tomography (CT) serves as an effective tool for lung cancer
screening, diagnosis, treatment, and prognosis, providing a rich source of
features to quantify temporal and spatial tumor changes. Nonetheless, the
diversity of CT scanners and customized acquisition protocols can introduce
significant inconsistencies in texture features, even when assessing the same
patient. This variability poses a fundamental challenge for subsequent research
that relies on consistent image features. Existing CT image standardization
models predominantly utilize GAN-based supervised or semi-supervised learning,
but their performance remains limited. We present DiffusionCT, an innovative
score-based DDPM model that operates in the latent space to transform disparate
non-standard distributions into a standardized form. The architecture comprises
a U-Net-based encoder-decoder, augmented by a DDPM model integrated at the
bottleneck position. First, the encoder-decoder is trained independently,
without embedding DDPM, to capture the latent representation of the input data.
Second, the latent DDPM model is trained while keeping the encoder-decoder
parameters fixed. Finally, the decoder uses the transformed latent
representation to generate a standardized CT image, providing a more consistent
basis for downstream analysis. Empirical tests on patient CT images indicate
notable improvements in image standardization using DiffusionCT. Additionally,
the model significantly reduces image noise in SPAD images, further validating
the effectiveness of DiffusionCT for advanced imaging tasks.
October 07, 2023
Zihan Zhou, Ruiying Liu, Tianshu Yu
physics.comp-ph, cs.AI, cs.LG
Diffusion-based generative models in SE(3)-invariant space have demonstrated
promising performance in molecular conformation generation, but typically
require solving stochastic differential equations (SDEs) with thousands of
update steps. Till now, it remains unclear how to effectively accelerate this
procedure explicitly in SE(3)-invariant space, which greatly hinders its wide
application in the real world. In this paper, we systematically study the
diffusion mechanism in SE(3)-invariant space through the lens of approximation
errors induced by existing methods. We thereby develop more precise
approximations in SE(3) in the context of projected differential equations. We
further provide theoretical analysis, as well as empirical evidence relating
hyper-parameters to such errors. Altogether, we propose a novel acceleration scheme for generating
molecular conformations in SE(3)-invariant space. Experimentally, our scheme
can generate high-quality conformations with 50x–100x speedup compared to
existing methods.
October 07, 2023
Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland
We propose DiffSpEx, a generative target speaker extraction method based on
score-based generative modelling through stochastic differential equations.
DiffSpEx deploys a continuous-time stochastic diffusion process in the complex
short-time Fourier transform domain, starting from the target speaker source
and converging to a Gaussian distribution centred on the mixture of sources.
For the reverse-time process, a parametrised score function is conditioned on a
target speaker embedding to extract the target speaker from the mixture of
sources. We utilise ECAPA-TDNN target speaker embeddings and condition the
score function alternately on the SDE time embedding and the target speaker
embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix
dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we
show that fine-tuning a pre-trained DiffSpEx model to a specific speaker
further improves performance, enabling personalisation in target speaker
extraction.
DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures
October 07, 2023
Wenhao Li, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu
Diffusion models have recently exhibited remarkable performance on synthetic
data. After a diffusion path is selected, a base model, such as UNet, operates
as a denoising autoencoder, primarily predicting noises that need to be
eliminated step by step. Consequently, it is crucial to employ a model that
aligns with the expected budgets to facilitate superior synthetic performance.
In this paper, we meticulously analyze the diffusion model and engineer a base
model search approach, denoted “DiffNAS”. Specifically, we leverage GPT-4 as a
supernet to expedite the search, supplemented with a search memory to enhance
the results. Moreover, we employ RFID as a proxy to promptly rank the
experimental outcomes produced by GPT-4. We also adopt a rapid-convergence
training strategy to boost search efficiency. Rigorous experimentation
corroborates that our algorithm can augment the search efficiency by 2 times
under GPT-based scenarios, while also attaining an FID of 2.82 on CIFAR10, a
0.37 improvement relative to the benchmark IDDPM algorithm.
October 06, 2023
Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali
Common target sound extraction (TSE) approaches have primarily relied on
discriminative methods to separate the target sound while minimizing
interference from unwanted sources, with varying success in
separating the target from the background. This study introduces DPM-TSE, a
first generative method based on diffusion probabilistic modeling (DPM) for
target sound extraction, to achieve both cleaner target renderings as well as
improved separability from unwanted sounds. The technique also tackles common
background noise issues with DPM by introducing a correction method for noise
schedules and sample steps. This approach is evaluated using both objective and
subjective quality metrics on the FSD Kaggle 2018 dataset. The results show
that DPM-TSE has a significant improvement in perceived quality in terms of
target extraction and purity.
Generative Diffusion From An Action Principle
October 06, 2023
Akhil Premkumar
Generative diffusion models synthesize new samples by reversing a diffusive
process that converts a given data set to generic noise. This is accomplished
by training a neural network to match the gradient of the log of the
probability distribution of a given data set, also called the score. By casting
reverse diffusion as an optimal control problem, we show that score matching
can be derived from an action principle, like the ones commonly used in
physics. We use this insight to demonstrate the connection between different
classes of diffusion models.
Diffusion Random Feature Model
October 06, 2023
Esha Saha, Giang Tran
Diffusion probabilistic models have been successfully used to generate data
from noise. However, most diffusion models are computationally expensive and
difficult to interpret with a lack of theoretical justification. Random feature
models on the other hand have gained popularity due to their interpretability
but their application to complex machine learning tasks remains limited. In
this work, we present a diffusion model-inspired deep random feature model that
is interpretable and gives comparable numerical results to a fully connected
neural network having the same number of trainable parameters. Specifically, we
extend existing results for random features and derive generalization bounds
between the distribution of sampled data and the true distribution using
properties of score matching. We validate our findings by generating samples on
the fashion MNIST dataset and instrumental audio data.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
October 06, 2023
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao
Latent Diffusion models (LDMs) have achieved remarkable results in
synthesizing high-resolution images. However, the iterative sampling process is
computationally intensive and leads to slow generation. Inspired by Consistency
Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling
swift inference with minimal steps on any pre-trained LDMs, including Stable
Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as
solving an augmented probability flow ODE (PF-ODE), LCMs are designed to
directly predict the solution of such ODE in latent space, mitigating the need
for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently
distilled from pre-trained classifier-free guided diffusion models, a
high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training.
Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method
that is tailored for fine-tuning LCMs on customized image datasets. Evaluation
on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve
state-of-the-art text-to-image generation performance with few-step inference.
Project Page: https://latent-consistency-models.github.io/
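A hedged sketch of the consistency-distillation objective implied above: the student maps adjacent points on the teacher-estimated PF-ODE trajectory to the same prediction. ode_solver_step stands in for one guided teacher ODE update and ema_student for the exponential-moving-average target network; both are assumptions, and details such as boundary conditions and skip parameterizations are omitted.

# Sketch: consistency distillation on adjacent PF-ODE trajectory points.
import torch
import torch.nn.functional as F

def consistency_distillation_loss(student, ema_student, ode_solver_step, z, t, t_prev, cond):
    z_t = z                                              # latent at noise level t
    with torch.no_grad():
        z_prev = ode_solver_step(z_t, t, t_prev, cond)   # one teacher ODE step toward data
        target = ema_student(z_prev, t_prev, cond)       # self-target from the EMA copy
    pred = student(z_t, t, cond)
    return F.mse_loss(pred, target)                      # enforce consistency along the trajectory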
VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification
October 06, 2023
Han Huang, Yan Huang, Liang Wang
Visible-Infrared person re-identification (VI-ReID) in real-world scenarios
poses a significant challenge due to the high cost of cross-modality data
annotation. Different sensing cameras, such as RGB/IR cameras for good/poor
lighting conditions, make it costly and error-prone to identify the same person
across modalities. To overcome this, we explore the use of single-modality
labeled data for the VI-ReID task, which is more cost-effective and practical.
By labeling pedestrians in only one modality (e.g., visible images) and
retrieving in another modality (e.g., infrared images), we aim to create a
training set containing both originally labeled and modality-translated data
using unpaired image-to-image translation techniques. In this paper, we propose
VI-Diff, a diffusion model that effectively addresses the task of
Visible-Infrared person image translation. Through comprehensive experiments,
we demonstrate that VI-Diff outperforms existing diffusion and GAN models,
making it a promising solution for VI-ReID with single-modality labeled data.
Our approach can be a promising solution to the VI-ReID task with
single-modality labeled data and serves as a good starting point for future
study. Code will be available.
Observation-Guided Diffusion Probabilistic Models
October 06, 2023
Junoh Kang, Jinyoung Choi, Sungik Choi, Bohyung Han
We propose a novel diffusion model called observation-guided diffusion
probabilistic model (OGDM), which effectively addresses the trade-off between
quality control and fast sampling. Our approach reestablishes the training
objective by integrating the guidance of the observation process with the
Markov chain in a principled way. This is achieved by introducing an additional
loss term derived from the observation based on the conditional discriminator
on noise level, which employs a Bernoulli distribution indicating whether its
input lies on the (noisy) real manifold or not. This strategy allows us to
optimize the more accurate negative log-likelihood induced in the inference
stage especially when the number of function evaluations is limited. The
proposed training method is also advantageous even when incorporated only into
the fine-tuning process, and it is compatible with various fast inference
strategies since our method yields better denoising networks using exactly the
same inference procedure without incurring extra computational cost. We
demonstrate the effectiveness of the proposed training algorithm using diverse
inference methods on strong diffusion model baselines.
Diffusion Models as Masked Audio-Video Learners
October 05, 2023
Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
cs.SD, cs.CV, cs.MM, eess.AS
Over the past several years, the synchronization between audio and visual
signals has been leveraged to learn richer audio-visual representations. Aided
by the large availability of unlabeled videos, many unsupervised training
frameworks have demonstrated impressive results in various downstream audio and
video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a
state-of-the-art audio-video pre-training framework. MAViL couples contrastive
learning with masked autoencoding to jointly reconstruct audio spectrograms and
video frames by fusing information from both modalities. In this paper, we
study the potential synergy between diffusion models and MAViL, seeking to
derive mutual benefits from these two frameworks. The incorporation of
diffusion into MAViL, combined with various training efficiency methodologies
that include the utilization of a masking ratio curriculum and adaptive batch
sizing, results in a notable 32% reduction in pre-training Floating-Point
Operations (FLOPS) and an 18% decrease in pre-training wall clock time.
Crucially, this enhanced efficiency does not compromise the model’s performance
in downstream audio-classification tasks when compared to MAViL’s performance.
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
October 05, 2023
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki
cs.CV, cs.AI, cs.LG, cs.RO
Text-to-image diffusion models have recently emerged at the forefront of
image generation, powered by very large-scale unsupervised or weakly supervised
text-to-image training datasets. Due to their unsupervised training,
controlling their behavior in downstream tasks, such as maximizing
human-perceived image quality, image-text alignment, or ethical image
generation, is difficult. Recent works finetune diffusion models to downstream
reward functions using vanilla reinforcement learning, notorious for the high
variance of the gradient estimators. In this paper, we propose AlignProp, a
method that aligns diffusion models to downstream reward functions using
end-to-end backpropagation of the reward gradient through the denoising
process. While a naive implementation of such backpropagation would require
prohibitive memory resources for storing the partial derivatives of modern
text-to-image models, AlignProp finetunes low-rank adapter weight modules and
uses gradient checkpointing to render its memory usage viable. We test
AlignProp in finetuning diffusion models to various objectives, such as
image-text semantic alignment, aesthetics, compressibility and controllability
of the number of objects present, as well as their combinations. We show
AlignProp achieves higher rewards in fewer training steps than alternatives,
while being conceptually simpler, making it a straightforward choice for
optimizing diffusion models for differentiable reward functions of interest.
Code and Visualization results are available at https://align-prop.github.io/.
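A minimal sketch of reward backpropagation through the sampling chain with gradient checkpointing to bound memory; LoRA adapter injection is omitted, and denoise_step and reward_model are hypothetical placeholders for the finetuned sampler step and a differentiable reward.

# Sketch: differentiate a reward through the full reverse chain with checkpointing.
import torch
from torch.utils.checkpoint import checkpoint

def reward_backprop_loss(denoise_step, reward_model, x_T, timesteps, cond):
    x = x_T
    for t in timesteps:                      # full reverse chain, kept differentiable
        x = checkpoint(denoise_step, x, t, cond, use_reentrant=False)
    return -reward_model(x, cond).mean()     # maximize reward = minimize its negative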
Stochastic interpolants with data-dependent couplings
October 05, 2023
Michael S. Albergo, Mark Goldstein, Nicholas M. Boffi, Rajesh Ranganath, Eric Vanden-Eijnden
Generative models inspired by dynamical transport of measure – such as flows
and diffusions – construct a continuous-time map between two probability
densities. Conventionally, one of these is the target density, only accessible
through samples, while the other is taken as a simple base density that is
data-agnostic. In this work, using the framework of stochastic interpolants, we
formalize how to \textit{couple} the base and the target densities, whereby
samples from the base are computed conditionally given samples from the target
in a way that is different from (but does not preclude) incorporating information
about class labels or continuous embeddings. This enables us to construct
dynamical transport maps that serve as conditional generative models. We show
that these transport maps can be learned by solving a simple square loss
regression problem analogous to the standard independent setting. We
demonstrate the usefulness of constructing dependent couplings in practice
through experiments in super-resolution and in-painting.
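A short sketch of the square-loss regression with a data-dependent coupling, assuming a linear interpolant and a hypothetical coupling function make_base that draws the base sample conditionally on the target (e.g. a degraded version of it); the paper's general interpolant family is broader than this.

# Sketch: velocity-field regression for a linear interpolant with a coupled base sample.
import torch
import torch.nn.functional as F

def interpolant_loss(velocity_net, x1, make_base):
    x0 = make_base(x1)                        # coupled base sample, e.g. blurred x1 + noise
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    x_t = (1 - t) * x0 + t * x1               # linear stochastic interpolant
    target_velocity = x1 - x0                 # time derivative of the interpolant
    pred = velocity_net(x_t, t.flatten())
    return F.mse_loss(pred, target_velocity)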
Multimarginal generative modeling with stochastic interpolants
October 05, 2023
Michael S. Albergo, Nicholas M. Boffi, Michael Lindsey, Eric Vanden-Eijnden
Given a set of $K$ probability densities, we consider the multimarginal
generative modeling problem of learning a joint distribution that recovers
these densities as marginals. The structure of this joint distribution should
identify multi-way correspondences among the prescribed marginals. We formalize
an approach to this task within a generalization of the stochastic interpolant
framework, leading to efficient learning algorithms built upon dynamical
transport of measure. Our generative models are defined by velocity and score
fields that can be characterized as the minimizers of simple quadratic
objectives, and they are defined on a simplex that generalizes the time
variable in the usual dynamical transport framework. The resulting transport on
the simplex is influenced by all marginals, and we show that multi-way
correspondences can be extracted. The identification of such correspondences
has applications to style transfer, algorithmic fairness, and data
decorruption. In addition, the multimarginal perspective enables an efficient
algorithm for reducing the dynamical transport cost in the ordinary
two-marginal setting. We demonstrate these capacities with several numerical
examples.
Wasserstein Distortion: Unifying Fidelity and Realism
October 05, 2023
Yang Qiu, Aaron B. Wagner, Johannes Ballé, Lucas Theis
cs.IT, cs.CV, eess.IV, math.IT
We introduce a distortion measure for images, Wasserstein distortion, that
simultaneously generalizes pixel-level fidelity on the one hand and realism on
the other. We show how Wasserstein distortion reduces mathematically to a pure
fidelity constraint or a pure realism constraint under different parameter
choices. Pairs of images that are close under Wasserstein distortion illustrate
its utility. In particular, we generate random textures that have high fidelity
to a reference texture in one location of the image and smoothly transition to
an independent realization of the texture as one moves away from this point.
Connections between Wasserstein distortion and models of the human visual
system are noted.
Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules with Desirable Properties
October 05, 2023
Siyuan Guo, Jihong Guan, Shuigeng Zhou
In the past decade, Artificial Intelligence driven drug design and discovery
has been a hot research topic, where an important branch is molecule generation
by generative models, from GAN-based models and VAE-based models to the latest
diffusion-based models. However, most existing models pursue only basic
properties like validity and uniqueness of the generated molecules; a few go
further to explicitly optimize a single important molecular property (e.g.
QED or PlogP), which leaves most generated molecules of little use in
practice. In this paper, we present a novel approach to generating molecules
with desirable properties, which expands the diffusion model framework with
multiple innovative designs. The novelty is two-fold. On the one hand,
considering that the structures of molecules are complex and diverse, and
molecular properties are usually determined by some substructures (e.g.
pharmacophores), we propose to perform diffusion on two structural levels:
molecules and molecular fragments respectively, with which a mixed Gaussian
distribution is obtained for the reverse diffusion process. To get desirable
molecular fragments, we develop a novel electronic effect based fragmentation
method. On the other hand, we introduce two ways to explicitly optimize
multiple molecular properties under the diffusion model framework. First, as
potential drug molecules must be chemically valid, we optimize molecular
validity by an energy-guidance function. Second, since potential drug molecules
should be desirable in various properties, we employ a multi-objective
mechanism to optimize multiple molecular properties simultaneously. Extensive
experiments with two benchmark datasets QM9 and ZINC250k show that the
molecules generated by our proposed method have better validity, uniqueness,
novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated
by current SOTA models.
Denoising Diffusion Step-aware Models
October 05, 2023
Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, Yingcong Chen
Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for
data generation across various domains. However, a significant bottleneck is
the necessity for whole-network computation during every step of the generative
process, leading to high computational overheads. This paper presents a novel
framework, Denoising Diffusion Step-aware Models (DDSM), to address this
challenge. Unlike conventional approaches, DDSM employs a spectrum of neural
networks whose sizes are adapted according to the importance of each generative
step, as determined through evolutionary search. This step-wise network
variation effectively circumvents redundant computational efforts, particularly
in less critical steps, thereby enhancing the efficiency of the diffusion
model. Furthermore, the step-aware design can be seamlessly integrated with
other efficiency-geared diffusion models such as DDIMs and latent diffusion,
thus broadening the scope of computational savings. Empirical evaluations
demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61%
for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all
without compromising the generation quality. Our code and models will be
publicly available.
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models
October 05, 2023
Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang
Diffusion models have demonstrated remarkable capabilities in image synthesis
and related generative tasks. Nevertheless, their practicality for low-latency
real-world applications is constrained by substantial computational costs and
latency issues. Quantization is a dominant way to compress and accelerate
diffusion models, where post-training quantization (PTQ) and quantization-aware
training (QAT) are two main approaches, each bearing its own properties. While
PTQ exhibits efficiency in terms of both time and data usage, it may lead to
diminished performance in low bit-width. On the other hand, QAT can alleviate
performance degradation but comes with substantial demands on computational and
data resources. To capitalize on the advantages while avoiding their respective
drawbacks, we introduce a data-free and parameter-efficient fine-tuning
framework for low-bit diffusion models, dubbed EfficientDM, to achieve
QAT-level performance with PTQ-like efficiency. Specifically, we propose a
quantization-aware variant of the low-rank adapter (QALoRA) that can be merged
with model weights and jointly quantized to low bit-width. The fine-tuning
process distills the denoising capabilities of the full-precision model into
its quantized counterpart, eliminating the requirement for training data. We
also introduce scale-aware optimization and employ temporal learned step-size
quantization to further enhance performance. Extensive experimental results
demonstrate that our method significantly outperforms previous PTQ-based
diffusion models while maintaining similar time and data efficiency.
Specifically, there is only a marginal 0.05 sFID increase when quantizing both
weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to
QAT-based methods, our EfficientDM also boasts a 16.2x faster quantization
speed with comparable generation quality.
Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization
October 04, 2023
Dinghuai Zhang, Ricky T. Q. Chen, Cheng-Hao Liu, Aaron Courville, Yoshua Bengio
cs.LG, cs.AI, stat.CO, stat.ME, stat.ML
We tackle the problem of sampling from intractable high-dimensional density
functions, a fundamental task that often appears in machine learning and
statistics. We extend recent sampling-based approaches that leverage controlled
stochastic processes to model approximate samples from these target densities.
The main drawback of these approaches is that the training objective requires
full trajectories to compute, resulting in sluggish credit assignment and a
learning signal present only at the terminal time. In this work, we present
Diffusion Generative Flow Samplers
(DGFS), a sampling-based framework where the learning process can be tractably
broken down into short partial trajectory segments, via parameterizing an
additional “flow function”. Our method takes inspiration from the theory
developed for generative flow networks (GFlowNets), allowing us to make use of
intermediate learning signals. Through various challenging experiments, we
demonstrate that DGFS achieves more accurate estimates of the normalization
constant than closely-related prior methods.
On Memorization in Diffusion Models
October 04, 2023
Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, Ye Wang
Due to their capacity to generate novel and high-quality samples, diffusion
models have attracted significant research interest in recent years. Notably,
the typical training objective of diffusion models, i.e., denoising score
matching, has a closed-form optimal solution that can only generate samples
replicating the training data. This indicates that a memorization behavior is
theoretically expected, which contradicts the common generalization ability of
state-of-the-art diffusion models, and thus calls for a deeper understanding.
Looking into this, we first observe that memorization behaviors tend to occur
on smaller-sized datasets, which motivates our definition of effective model
memorization (EMM), a metric measuring the maximum size of training data at
which a learned diffusion model approximates its theoretical optimum. Then, we
quantify the impact of the influential factors on these memorization behaviors
in terms of EMM, focusing primarily on data distribution, model configuration,
and training procedure. Besides comprehensive empirical results identifying the
influential factors, we surprisingly find that conditioning training data on
uninformative random labels can significantly trigger the memorization in
diffusion models. Our study holds practical significance for diffusion model
users and offers clues to theoretical research in deep generative models. Code
is available at https://github.com/sail-sg/DiffMemorize.
Generalization in diffusion models arises from geometry-adaptive harmonic representation
October 04, 2023
Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, Stéphane Mallat
High-quality samples generated with score-based reverse diffusion algorithms
provide evidence that deep neural networks (DNN) trained for denoising can
learn high-dimensional densities, despite the curse of dimensionality. However,
recent reports of memorization of the training set raise the question of
whether these networks are learning the “true” continuous density of the data.
Here, we show that two denoising DNNs trained on non-overlapping subsets of a
dataset learn nearly the same score function, and thus the same density, with a
surprisingly small number of training images. This strong generalization
demonstrates an alignment of powerful inductive biases in the DNN architecture
and/or training algorithm with properties of the data distribution. We analyze
these, demonstrating that the denoiser performs a shrinkage operation in a
basis adapted to the underlying image. Examination of these bases reveals
oscillating harmonic structures along contours and in homogeneous image
regions. We show that trained denoisers are inductively biased towards these
geometry-adaptive harmonic representations by demonstrating that they arise
even when the network is trained on image classes supported on low-dimensional
manifolds, for which the harmonic basis is suboptimal. Additionally, we show
that the denoising performance of the networks is near-optimal when trained on
regular image classes for which the optimal basis is known to be
geometry-adaptive and harmonic.
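A toy illustration of 'denoising as shrinkage in a basis': coefficients of the noisy image in an orthonormal basis are attenuated by Wiener-style factors. The fixed DCT basis below is only a stand-in; the paper's point is that trained denoisers behave like such shrinkage in a geometry-adaptive harmonic basis tailored to each image.

# Toy: Wiener-style shrinkage of DCT coefficients as a stand-in for basis-adapted denoising.
import numpy as np
from scipy.fft import dctn, idctn

def shrinkage_denoise(y, sigma, signal_power):
    """y: noisy image; signal_power: per-coefficient prior variance in the DCT basis."""
    coeffs = dctn(y, norm="ortho")
    gain = signal_power / (signal_power + sigma ** 2)   # Wiener shrinkage factors
    return idctn(gain * coeffs, norm="ortho")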
MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation
October 04, 2023
Yuan Zhong, Suhan Cui, Jiaqi Wang, Xiaochen Wang, Ziyi Yin, Yaqing Wang, Houping Xiao, Mengdi Huai, Ting Wang, Fenglong Ma
Health risk prediction is one of the fundamental tasks under predictive
modeling in the medical domain, which aims to forecast the potential health
risks that patients may face in the future using their historical Electronic
Health Records (EHR). Researchers have developed several risk prediction models
to handle the unique challenges of EHR data, such as its sequential nature,
high dimensionality, and inherent noise. These models have yielded impressive
results. Nonetheless, a key issue undermining their effectiveness is data
insufficiency. A variety of data generation and augmentation methods have been
introduced to mitigate this issue by expanding the size of the training data
set through the learning of underlying data distributions. However, the
performance of these methods is often limited due to their task-unrelated
design. To address these shortcomings, this paper introduces a novel,
end-to-end diffusion-based risk prediction model, named MedDiffusion. It
enhances risk prediction performance by creating synthetic patient data during
training to enlarge sample space. Furthermore, MedDiffusion discerns hidden
relationships between patient visits using a step-wise attention mechanism,
enabling the model to automatically retain the most vital information for
generating high-quality data. Experimental evaluation on four real-world
medical datasets demonstrates that MedDiffusion outperforms 14 cutting-edge
baselines in terms of PR-AUC, F1, and Cohen’s Kappa. We also conduct ablation
studies and benchmark our model against GAN-based alternatives to further
validate the rationality and adaptability of our model design. Additionally, we
analyze generated data to offer fresh insights into the model’s
interpretability.
Learning to Reach Goals via Diffusion
October 04, 2023
Vineet Jain, Siamak Ravanbakhsh
Diffusion models are a powerful class of generative models capable of mapping
random noise in high-dimensional spaces to a target manifold through iterative
denoising. In this work, we present a novel perspective on goal-conditioned
reinforcement learning by framing it within the context of diffusion modeling.
Analogous to the diffusion process, where Gaussian noise is used to create
random trajectories that walk away from the data manifold, we construct
trajectories that move away from potential goal states. We then learn a
goal-conditioned policy analogous to the score function. This approach, which
we call Merlin, can reach predefined or novel goals from an arbitrary initial
state without learning a separate value function. We consider three choices for
the noise model to replace Gaussian noise in diffusion - reverse play from the
buffer, reverse dynamics model, and a novel non-parametric approach. We
theoretically justify our approach and validate it on offline goal-reaching
tasks. Empirical results are competitive with state-of-the-art methods, which
suggests this perspective on diffusion for RL is a simple, scalable, and
effective direction for sequential decision-making.
SE(3)-Stochastic Flow Matching for Protein Backbone Generation
October 03, 2023
Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, Alexander Tong
The computational design of novel protein structures has the potential to
impact numerous scientific disciplines greatly. Toward this goal, we introduce
$\text{FoldFlow}$, a series of novel generative models of increasing modeling
power based on the flow-matching paradigm over $3\text{D}$ rigid motions –
i.e. the group $\text{SE(3)}$ – enabling accurate modeling of protein
backbones. We first introduce $\text{FoldFlow-Base}$, a simulation-free
approach to learning deterministic continuous-time dynamics and matching
invariant target distributions on $\text{SE(3)}$. We next accelerate training
by incorporating Riemannian optimal transport to create $\text{FoldFlow-OT}$,
leading to the construction of simpler and more stable flows. Finally, we
design $\text{FoldFlow-SFM}$ coupling both Riemannian OT and simulation-free
training to learn stochastic continuous-time dynamics over $\text{SE(3)}$. Our
family of $\text{FoldFlow}$ generative models offer several key advantages over
previous approaches to the generative modeling of proteins: they are more
stable and faster to train than diffusion-based approaches, and our models
enjoy the ability to map any invariant source distribution to any invariant
target distribution over $\text{SE(3)}$. Empirically, we validate our FoldFlow
models on protein backbone generation of up to $300$ amino acids leading to
high-quality designable, diverse, and novel samples.
Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models
October 03, 2023
Kota Sueyoshi, Takashi Matsubara
Diffusion models have achieved remarkable results in generating high-quality,
diverse, and creative images. However, when it comes to text-based image
generation, they often fail to capture the intended meaning presented in the
text. For instance, a specified object may not be generated, an unnecessary
object may be generated, and an adjective may alter objects it was not intended
to modify. Moreover, we found that relationships indicating possession between
objects are often overlooked. While users’ intentions in text are diverse,
existing methods tend to specialize in only some of these aspects. In this
paper, we propose Predicated Diffusion, a unified framework to express users’
intentions. We consider that the root of the above issues lies in the text
encoder, which often focuses only on individual words and neglects the logical
relationships between them. The proposed method does not solely rely on the
text encoder, but instead, represents the intended meaning in the text as
propositions using predicate logic and treats the pixels in the attention maps
as fuzzy predicates. This enables us to obtain a differentiable loss
function whose minimization makes the image fulfill the proposition. When
compared to several existing methods, we demonstrate that Predicated Diffusion
can generate images that are more faithful to various text prompts, as verified
by human evaluators and pretrained image-text models.
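As a rough illustration of the idea of treating attention maps as fuzzy predicates, the sketch below builds a differentiable loss from a single proposition using product fuzzy logic. The attention maps, the Reichenbach implication, and the mean-based aggregation are assumptions made for illustration, not the paper's exact formulation.

    import torch

    def fuzzy_implies(a, b):
        # Reichenbach implication 1 - a + a*b: a differentiable truth value in [0, 1].
        return 1.0 - a + a * b

    def proposition_loss(attn_subject, attn_attribute):
        # Proposition "every pixel attending to the subject also attends to its
        # attribute": aggregate per-pixel truth values with a mean (a soft
        # universal quantifier) and minimize the negation.
        truth = fuzzy_implies(attn_subject, attn_attribute)
        return 1.0 - truth.mean()

    # Stand-ins for cross-attention maps extracted from a diffusion U-Net,
    # normalized to [0, 1].
    attn_dog = torch.rand(16, 16, requires_grad=True)
    attn_brown = torch.rand(16, 16)
    loss = proposition_loss(attn_dog, attn_brown)
    loss.backward()   # this gradient could be used to steer the latent during sampling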
Conditional Diffusion Distillation
October 02, 2023
Kangfu Mei, Mauricio Delbracio, Hossein Talebi, Zhengzhong Tu, Vishal M. Patel, Peyman Milanfar
Generative diffusion models provide strong priors for text-to-image
generation and thereby serve as a foundation for conditional generation tasks
such as image editing, restoration, and super-resolution. However, one major
limitation of diffusion models is their slow sampling time. To address this
challenge, we present a novel conditional distillation method designed to
supplement the diffusion priors with the help of image conditions, allowing for
conditional sampling with very few steps. We directly distill the unconditional
pre-training in a single stage through joint-learning, largely simplifying the
previous two-stage procedures that involve both distillation and conditional
finetuning separately. Furthermore, our method enables a new
parameter-efficient distillation mechanism that distills each task with only a
small number of additional parameters combined with the shared frozen
unconditional backbone. Experiments across multiple tasks including
super-resolution, image editing, and depth-to-image generation demonstrate that
our method outperforms existing distillation techniques for the same sampling
time. Notably, our method is the first distillation strategy that can match the
performance of the much slower fine-tuned conditional diffusion models.
Mirror Diffusion Models for Constrained and Watermarked Generation
October 02, 2023
Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou, Molei Tao
Modern successes of diffusion models in learning complex, high-dimensional
data distributions are attributed, in part, to their capability to construct
diffusion processes with analytic transition kernels and score functions. The
tractability results in a simulation-free framework with stable regression
losses, from which reversed, generative processes can be learned at scale.
However, when data is confined to a constrained set as opposed to a standard
Euclidean space, these desirable characteristics appear to be lost based on
prior attempts. In this work, we propose Mirror Diffusion Models (MDM), a new
class of diffusion models that generate data on convex constrained sets without
losing any tractability. This is achieved by learning diffusion processes in a
dual space constructed from a mirror map, which, crucially, is a standard
Euclidean space. We derive efficient computation of mirror maps for popular
constrained sets, such as simplices and $\ell_2$-balls, showing significantly
improved performance of MDM over existing methods. For safety and privacy
purposes, we also explore constrained sets as a new mechanism to embed
invisible but quantitative information (i.e., watermarks) in generated data,
for which MDM serves as a compelling approach. Our work brings new algorithmic
opportunities for learning tractable diffusion on complex domains.
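To make the dual-space construction concrete, here is a minimal sketch of the entropic mirror map for the probability simplex, one of the constrained sets mentioned above: a diffusion model would be trained on the unconstrained dual points and its samples mapped back with the inverse map. The specific map is a standard textbook choice and an assumption here, not necessarily the one derived in the paper.

    import numpy as np

    def mirror_map_simplex(x, eps=1e-12):
        # Entropic mirror map: send a point of the probability simplex to an
        # unconstrained (Euclidean) dual space.
        return np.log(x + eps)

    def inverse_mirror_map_simplex(y):
        # Softmax maps any dual-space point back onto the simplex.
        y = y - y.max()
        e = np.exp(y)
        return e / e.sum()

    x = np.array([0.2, 0.5, 0.3])             # constrained data point
    y = mirror_map_simplex(x)                 # dual-space point, where the diffusion would run
    x_back = inverse_mirror_map_simplex(y)    # back on the simplex
    print(np.allclose(x, x_back))             # True up to numerical error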
Light Schrödinger Bridge
October 02, 2023
Alexander Korotin, Nikita Gushchin, Evgeny Burnaev
Despite the recent advances in the field of computational Schrödinger Bridges
(SB), most existing SB solvers are still heavyweight and require complex
optimization of several neural networks. It turns out that there is no
principal solver which plays the role of a simple-yet-effective baseline for SB,
just like, e.g., the $k$-means method in clustering, logistic regression in
classification, or the Sinkhorn algorithm in discrete optimal transport. We address
this issue and propose a novel fast and simple SB solver. Our development is a
smart combination of two ideas which recently appeared in the field: (a)
parameterization of the Schrödinger potentials with sum-exp quadratic functions
and (b) viewing the log-Schrödinger potentials as the energy functions. We show
that combined together these ideas yield a lightweight, simulation-free and
theoretically justified SB solver with a simple straightforward optimization
objective. As a result, it allows solving SB in moderate dimensions in a matter
of minutes on CPU without painful hyperparameter selection. Our light solver
resembles the Gaussian mixture model which is widely used for density
estimation. Inspired by this similarity, we also prove an important theoretical
result showing that our light solver is a universal approximator of SBs. The
code for the LightSB solver can be found at
https://github.com/ngushchin/LightSB
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
October 01, 2023
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon
cs.LG, cs.AI, cs.CV, stat.ML
Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion
model sampling at the cost of sample quality but lack a natural way to
trade-off quality for speed. To address this limitation, we propose Consistency
Trajectory Model (CTM), a generalization encompassing CM and score-based models
as special cases. CTM trains a single neural network that can – in a single
forward pass – output scores (i.e., gradients of log-density) and enables
unrestricted traversal between any initial and final time along the Probability
Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables
the efficient combination of adversarial training and denoising score matching
loss to enhance performance and achieves new state-of-the-art FIDs for
single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at
64$\times$64 resolution (FID 2.06). CTM also enables a new family of sampling schemes,
both deterministic and stochastic, involving long jumps along the ODE solution
trajectories. It consistently improves sample quality as computational budgets
increase, avoiding the degradation seen in CM. Furthermore, CTM’s access to the
score accommodates all diffusion model inference techniques, including exact
likelihood computation.
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models
September 30, 2023
Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Gaetan Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-jin Liu
The generation of stylistic 3D facial animations driven by speech poses a
significant challenge as it requires learning a many-to-many mapping between
speech, style, and the corresponding natural facial motion. However, existing
methods either employ a deterministic model for speech-to-motion mapping or
encode the style using a one-hot encoding scheme. Notably, the one-hot encoding
approach fails to capture the complexity of the style and thus limits
generalization ability. In this paper, we propose DiffPoseTalk, a generative
framework based on the diffusion model combined with a style encoder that
extracts style embeddings from short reference videos. During inference, we
employ classifier-free guidance to guide the generation process based on the
speech and style. We extend this to include the generation of head poses,
thereby enhancing user perception. Additionally, we address the shortage of
scanned 3D talking face data by training our model on reconstructed 3DMM
parameters from a high-quality, in-the-wild audio-visual dataset. Our extensive
experiments and user study demonstrate that our approach outperforms
state-of-the-art methods. The code and dataset will be made publicly available.
Efficient Planning with Latent Diffusion
September 30, 2023
Wenhao Li
Temporal abstraction and efficient planning pose significant challenges in
offline reinforcement learning, mainly when dealing with domains that involve
temporally extended tasks and delayed sparse rewards. Existing methods
typically plan in the raw action space and can be inefficient and inflexible.
Latent action spaces offer a more flexible paradigm, capturing only possible
actions within the behavior policy support and decoupling the temporal
structure between planning and modeling. However, current latent-action-based
methods are limited to discrete spaces and require expensive planning. This
paper presents a unified framework for continuous latent action space
representation learning and planning by leveraging latent, score-based
diffusion models. We establish the theoretical equivalence between planning in
the latent action space and energy-guided sampling with a pretrained diffusion
model and incorporate a novel sequence-level exact sampling method. Our
proposed method, $\texttt{LatentDiffuser}$, demonstrates competitive
performance on low-dimensional locomotion control tasks and surpasses existing
methods in higher-dimensional tasks.
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
September 30, 2023
Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks
Conditional generative models typically demand large annotated training sets
to achieve high-quality synthesis. As a result, there has been significant
interest in designing models that perform plug-and-play generation, i.e., to
use a predefined or pretrained model, which is not explicitly trained on the
generative task, to guide the generative process (e.g., using language).
However, such guidance is typically useful only towards synthesizing high-level
semantics rather than editing fine-grained details as in image-to-image
translation tasks. To this end, and capitalizing on the powerful fine-grained
generative control offered by the recent diffusion-based generative models, we
introduce Steered Diffusion, a generalized framework for photorealistic
zero-shot conditional image generation using a diffusion model trained for
unconditional generation. The key idea is to steer the image generation of the
diffusion model at inference time by designing a loss using a pre-trained
inverse model that characterizes the conditional task. This loss modulates the
sampling trajectory of the diffusion process. Our framework allows for easy
incorporation of multiple conditions during inference. We present experiments
using steered diffusion on several tasks including inpainting, colorization,
text-guided semantic editing, and image super-resolution. Our results
demonstrate clear qualitative and quantitative improvements over
state-of-the-art diffusion-based plug-and-play models while adding negligible
additional computational cost.
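A minimal sketch of the kind of loss-guided sampling described above, assuming a DDIM-style update, an epsilon-prediction denoiser, and a differentiable inverse model (e.g., a grayscale operator for colorization); the exact guidance scaling and schedule used in Steered Diffusion may differ.

    import torch

    def steered_sampling(denoiser, inverse_model, condition, x_T, alphas_cumprod,
                         guidance_scale=1.0):
        # alphas_cumprod: 1-D tensor of cumulative alpha-bar values, length T.
        x = x_T
        for t in reversed(range(len(alphas_cumprod))):
            a_t = alphas_cumprod[t]
            with torch.enable_grad():                 # allow grads even under no_grad callers
                x_in = x.detach().requires_grad_(True)
                eps = denoiser(x_in, t)                                # predicted noise
                x0_hat = (x_in - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # denoised estimate
                loss = ((inverse_model(x0_hat) - condition) ** 2).mean()
                grad = torch.autograd.grad(loss, x_in)[0]
            # Classifier-guidance-style correction: push the estimate toward lower loss.
            eps = eps.detach() + guidance_scale * (1 - a_t).sqrt() * grad
            x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
            a_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(a_t)
            x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps     # DDIM step
        return x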
FashionFlow: Leveraging Diffusion Models for Dynamic Fashion Video Synthesis from Static Imagery
September 29, 2023
Tasin Islam, Alina Miron, XiaoHui Liu, Yongmin Li
Our study introduces a new image-to-video generator called FashionFlow. By
utilising a diffusion model, we are able to create short videos from still
images. Our approach involves developing and connecting relevant components
with the diffusion model, which sets our work apart. The components include the
use of pseudo-3D convolutional layers to generate videos efficiently. VAE and
CLIP encoders capture vital characteristics from still images to influence the
diffusion model. Our research demonstrates a successful synthesis of fashion
videos featuring models posing from various angles, showcasing the fit and
appearance of the garment. Our findings hold great promise for improving and
enhancing the shopping experience for the online fashion industry.
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
September 29, 2023
Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method
for fine-tuning diffusion models to maximize differentiable reward functions,
such as scores from human preference models. We first show that it is possible
to backpropagate the reward function gradient through the full sampling
procedure, and that doing so achieves strong performance on a variety of
rewards, outperforming reinforcement learning-based approaches. We then propose
more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to
only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance
gradient estimates for the case when K=1. We show that our methods work well
for a variety of reward functions and can be used to substantially improve the
aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw
connections between our approach and prior work, providing a unifying
perspective on the design space of gradient-based fine-tuning algorithms.
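The truncated-backpropagation idea behind DRaFT-K can be sketched as follows, assuming a DDIM sampler, an epsilon-prediction UNet, and a differentiable reward; this is an illustrative reading rather than the authors' implementation, and DRaFT-LV's variance reduction is omitted.

    import torch

    def draft_k_loss(unet, reward_fn, x_T, alphas_cumprod, K=1):
        # Run the full sampler, but only keep the computation graph for the last
        # K denoising steps; the reward gradient then flows through those steps
        # into the UNet parameters.
        T = len(alphas_cumprod)
        x = x_T
        for i, t in enumerate(reversed(range(T))):
            track_grad = i >= T - K
            with torch.set_grad_enabled(track_grad):
                a_t = alphas_cumprod[t]
                a_prev = alphas_cumprod[t - 1] if t > 0 else torch.ones_like(a_t)
                eps = unet(x, t)
                x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
                x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps   # DDIM step
            if not track_grad:
                x = x.detach()
        return -reward_fn(x).mean()    # minimize the negative reward

    # Typical usage: loss = draft_k_loss(unet, aesthetic_reward, noise, a_bars, K=1)
    #                loss.backward(); optimizer.step()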
In search of dispersed memories: Generative diffusion models are associative memory networks
September 29, 2023
Luca Ambrogioni
Uncovering the mechanisms behind long-term memory is one of the most
fascinating open problems in neuroscience and artificial intelligence.
Artificial associative memory networks have been used to formalize important
aspects of biological memory. Generative diffusion models are a type of
generative machine learning techniques that have shown great performance in
many tasks. Like associative memory systems, these networks define a dynamical
system that converges to a set of target states. In this work we show that
generative diffusion models can be interpreted as energy-based models and that,
when trained on discrete patterns, their energy function is (asymptotically)
identical to that of modern Hopfield networks. This equivalence allows us to
interpret the supervised training of diffusion models as a synaptic learning
process that encodes the associative dynamics of a modern Hopfield network in
the weight structure of a deep neural network. Leveraging this connection, we
formulate a generalized framework for understanding the formation of long-term
memory, where creative generation and memory recall can be seen as parts of a
unified continuum.
Diffusion Models as Stochastic Quantization in Lattice Field Theory
September 29, 2023
Lingxiao Wang, Gert Aarts, Kai Zhou
In this work, we establish a direct connection between generative diffusion
models (DMs) and stochastic quantization (SQ). The DM is realized by
approximating the reversal of a stochastic process dictated by the Langevin
equation, generating samples from a prior distribution to effectively mimic the
target distribution. Using numerical simulations, we demonstrate that the DM
can serve as a global sampler for generating quantum lattice field
configurations in two-dimensional $\phi^4$ theory. We demonstrate that DMs can
notably reduce autocorrelation times in the Markov chain, especially in the
critical region where standard Markov Chain Monte-Carlo (MCMC) algorithms
experience critical slowing down. The findings can potentially inspire further
advancements in lattice field theory simulations, in particular in cases where
it is expensive to generate large ensembles.
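For context, the stochastic-quantization side of this correspondence is Langevin dynamics whose stationary distribution is proportional to $e^{-S[\phi]}$. A minimal sketch for a 2D lattice $\phi^4$ action follows; the action normalization and parameter values are illustrative assumptions, and the diffusion-model sampler trained in the paper is not shown.

    import numpy as np

    def drift(phi, m2=1.0, lam=1.0):
        # -dS/dphi for S = sum_x [ (1/2)(nabla phi)^2 + (m2/2) phi^2 + (lam/24) phi^4 ]
        laplacian = (np.roll(phi, 1, 0) + np.roll(phi, -1, 0)
                     + np.roll(phi, 1, 1) + np.roll(phi, -1, 1) - 4.0 * phi)
        return laplacian - m2 * phi - (lam / 6.0) * phi ** 3

    rng = np.random.default_rng(0)

    def langevin_step(phi, dt=1e-2):
        # One stochastic-quantization update; for small dt its long-time
        # distribution approaches exp(-S[phi]).
        noise = rng.standard_normal(phi.shape)
        return phi + dt * drift(phi) + np.sqrt(2.0 * dt) * noise

    phi = np.zeros((32, 32))
    for _ in range(1000):                       # thermalization
        phi = langevin_step(phi)
    print(phi.mean(), (phi ** 2).mean())        # simple lattice observables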
Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning
September 29, 2023
Zihan Ding, Chi Jin
Score-based generative models like the diffusion model have been shown to
be effective in modeling multi-modal data from image generation to
reinforcement learning (RL). However, the inference process of diffusion models
can be slow, which hinders their usage in RL with iterative sampling. We propose
to apply the consistency model as an efficient yet expressive policy
representation, namely consistency policy, with an actor-critic style algorithm
for three typical RL settings: offline, offline-to-online and online. For
offline RL, we demonstrate the expressiveness of generative models as policies
from multi-modal data. For offline-to-online RL, the consistency policy is
shown to be more computationally efficient than the diffusion policy, with
comparable performance. For online RL, the consistency policy demonstrates a
significant speedup and even higher average performance than the diffusion
policy.
Denoising Diffusion Bridge Models
September 29, 2023
Linqi Zhou, Aaron Lou, Samar Khanna, Stefano Ermon
Diffusion models are powerful generative models that map noise to data using
stochastic processes. However, for many applications such as image editing, the
model input comes from a distribution that is not random noise. As such,
diffusion models must rely on cumbersome methods like guidance or projected
sampling to incorporate this information in the generative process. In our
work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural
alternative to this paradigm based on diffusion bridges, a family of processes
that interpolate between two paired distributions given as endpoints. Our
method learns the score of the diffusion bridge from data and maps from one
endpoint distribution to the other by solving a (stochastic) differential
equation based on the learned score. Our method naturally unifies several
classes of generative models, such as score-based diffusion models and
OT-Flow-Matching, allowing us to adapt existing design and architectural
choices to our more general problem. Empirically, we apply DDBMs to challenging
image datasets in both pixel and latent space. On standard image translation
problems, DDBMs achieve significant improvement over baseline methods, and,
when we reduce the problem to image generation by setting the source
distribution to random noise, DDBMs achieve comparable FID scores to
state-of-the-art methods despite being built for a more general task.
Distilling ODE Solvers of Diffusion Models into Smaller Steps
September 28, 2023
Sanghwan Kim, Hao Tang, Fisher Yu
Distillation techniques have substantially improved the sampling speed of
diffusion models, allowing generation in only one or a few steps. However,
these distillation methods require extensive training for each
dataset, sampler, and network, which limits their practical applicability. To
address this limitation, we propose a straightforward distillation approach,
Distilled-ODE solvers (D-ODE solvers), that optimizes the ODE solver rather
than training the denoising network. D-ODE solvers are formulated by simply
applying a single parameter adjustment to existing ODE solvers. Subsequently,
D-ODE solvers with smaller steps are optimized by ODE solvers with larger steps
through distillation over a batch of samples. Our comprehensive experiments
indicate that D-ODE solvers outperform existing ODE solvers, including DDIM,
PNDM, DPM-Solver, DEIS, and EDM, especially when generating samples with fewer
steps. Our method incurs negligible computational overhead compared to previous
distillation techniques, enabling simple and rapid integration with previous
samplers. Qualitative analysis further shows that D-ODE solvers enhance image
quality while preserving the sampling trajectory of ODE solvers.
DiffGAN-F2S: Symmetric and Efficient Denoising Diffusion GANs for Structural Connectivity Prediction from Brain fMRI
September 28, 2023
Qiankun Zuo, Ruiheng Li, Yi Di, Hao Tian, Changhong Jing, Xuhang Chen, Shuqiang Wang
Mapping from functional connectivity (FC) to structural connectivity (SC) can
facilitate multimodal brain network fusion and discover potential biomarkers
for clinical implications. However, it is challenging to directly bridge the
reliable non-linear mapping relations between SC and functional magnetic
resonance imaging (fMRI). In this paper, a novel diffusion generative
adversarial network-based fMRI-to-SC (DiffGAN-F2S) model is proposed to predict
SC from brain fMRI in an end-to-end manner. To be specific, the proposed
DiffGAN-F2S leverages denoising diffusion probabilistic models (DDPMs) and
adversarial learning to efficiently generate high-fidelity SC through a few
steps from fMRI. By designing the dual-channel multi-head spatial attention
(DMSA) and graph convolutional modules, the symmetric graph generator first
captures global relations among direct and indirect connected brain regions,
then models the local brain region interactions. It can uncover the complex
mapping relations between fMRI and structural connectivity. Furthermore, the
spatially connected consistency loss is devised to constrain the generator to
preserve global-local topological information for accurate intrinsic SC
prediction. Testing on the public Alzheimer’s Disease Neuroimaging Initiative
(ADNI) dataset, the proposed model can effectively generate empirical
SC-preserved connectivity from four-dimensional imaging data and shows superior
performance in SC prediction compared with other related models. Furthermore,
the proposed model can identify the vast majority of important brain regions
and connections derived from the empirical method, providing an alternative way
to fuse multimodal brain networks and analyze clinical disease.
Exploiting the Signal-Leak Bias in Diffusion Models
September 27, 2023
Martin Nicolas Everaert, Athanasios Fitsios, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, Radhakrishna Achanta
There is a bias in the inference pipeline of most diffusion models. This bias
arises from a signal leak whose distribution deviates from the noise
distribution, creating a discrepancy between training and inference processes.
We demonstrate that this signal-leak bias is particularly significant when
models are tuned to a specific style, causing sub-optimal style matching.
Recent research tries to avoid the signal leakage during training. We instead
show how we can exploit this signal-leak bias in existing diffusion models to
allow more control over the generated images. This enables us to generate
images with more varied brightness, and images that better match a desired
style or color. By modeling the distribution of the signal leak in the spatial
frequency and pixel domains, and including a signal leak in the initial latent,
we generate images that better match expected results without any additional
training.
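A minimal sketch of the inference-time idea, assuming per-channel Gaussian statistics for the leak estimated from encoded target-style images; the paper also models the leak in the spatial-frequency domain, which is omitted here, and all names and values below are illustrative.

    import torch

    def initial_latent_with_leak(shape, leak_mean, leak_std, alpha_bar_T,
                                 generator=None):
        # Mix a sampled "signal leak" with Gaussian noise the same way the
        # forward process would at the final step T.
        noise = torch.randn(shape, generator=generator)
        leak = leak_mean + leak_std * torch.randn(shape, generator=generator)
        return alpha_bar_T.sqrt() * leak + (1.0 - alpha_bar_T).sqrt() * noise

    # Hypothetical per-channel statistics for a 4-channel latent.
    C = 4
    leak_mean = 0.2 * torch.ones(1, C, 1, 1)
    leak_std = 0.1 * torch.ones(1, C, 1, 1)
    z_T = initial_latent_with_leak((1, C, 64, 64), leak_mean, leak_std,
                                   alpha_bar_T=torch.tensor(0.0047))  # assumed value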
High Perceptual Quality Wireless Image Delivery with Denoising Diffusion Models
September 27, 2023
Selim F. Yilmaz, Xueyan Niu, Bo Bai, Wei Han, Lei Deng, Deniz Gunduz
eess.IV, cs.CV, cs.IT, cs.LG, cs.MM, math.IT
We consider the image transmission problem over a noisy wireless channel via
deep learning-based joint source-channel coding (DeepJSCC) along with a
denoising diffusion probabilistic model (DDPM) at the receiver. Specifically,
we are interested in the perception-distortion trade-off in the practical
finite block length regime, in which separate source and channel coding can be
highly suboptimal. We introduce a novel scheme that utilizes the range-null
space decomposition of the target image. We transmit the range-space of the
image after encoding and employ DDPM to progressively refine its null space
contents. Through extensive experiments, we demonstrate significant
improvements in distortion and perceptual quality of reconstructed images
compared to standard DeepJSCC and the state-of-the-art generative
learning-based method. We will publicly share our source code to facilitate
further research and reproducibility.
Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation
September 27, 2023
Xin Yuan, Michael Maire
We develop a neural network architecture which, trained in an unsupervised
manner as a denoising diffusion model, simultaneously learns to both generate
and segment images. Learning is driven entirely by the denoising diffusion
objective, without any annotation or prior knowledge about regions during
training. A computational bottleneck, built into the neural architecture,
encourages the denoising network to partition an input into regions, denoise
them in parallel, and combine the results. Our trained model generates both
synthetic images and, by simple examination of its internal predicted
partitions, a semantic segmentation of those images. Without any finetuning, we
directly apply our unsupervised model to the downstream task of segmenting real
images via noising and subsequently denoising them. Experiments demonstrate
that our model achieves accurate unsupervised image segmentation and
high-quality synthetic image generation across multiple datasets.
Learning Using Generated Privileged Information by Text-to-Image Diffusion Models
September 26, 2023
Rafael-Edy Menadil, Mariana-Iuliana Georgescu, Radu Tudor Ionescu
Learning Using Privileged Information is a particular type of knowledge
distillation where the teacher model benefits from an additional data
representation during training, called privileged information, improving the
student model, which does not see the extra representation. However, privileged
information is rarely available in practice. To this end, we propose a text
classification framework that harnesses text-to-image diffusion models to
generate artificial privileged information. The generated images and the
original text samples are further used to train multimodal teacher models based
on state-of-the-art transformer-based architectures. Finally, the knowledge
from multimodal teachers is distilled into a text-based (unimodal) student.
Hence, by employing a generative model to produce synthetic data as privileged
information, we guide the training of the student model. Our framework, called
Learning Using Generated Privileged Information (LUGPI), yields noticeable
performance gains on four text classification data sets, demonstrating its
potential in text classification without any additional cost during inference.
Diffusion-based Holistic Texture Rectification and Synthesis
September 26, 2023
Guoqing Hao, Satoshi Iizuka, Kensho Hara, Edgar Simo-Serra, Hirokatsu Kataoka, Kazuhiro Fukui
We present a novel framework for rectifying occlusions and distortions in
degraded texture samples from natural images. Traditional texture synthesis
approaches focus on generating textures from pristine samples, which
necessitate meticulous preparation by humans and are often unattainable in most
natural images. These challenges stem from the frequent occlusions and
distortions of texture samples in natural images due to obstructions and
variations in object surface geometry. To address these issues, we propose a
framework that synthesizes holistic textures from degraded samples in natural
images, extending the applicability of exemplar-based texture synthesis
techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM)
with a novel occlusion-aware latent transformer. This latent transformer not
only effectively encodes texture features from partially-observed samples
necessary for the generation process of the LDM, but also explicitly captures
long-range dependencies in samples with large occlusions. To train our model,
we introduce a method for generating synthetic data by applying geometric
transformations and free-form mask generation to clean textures. Experimental
results demonstrate that our framework significantly outperforms existing
methods both quantitatively and qualitatively. Furthermore, we conduct
comprehensive ablation studies to validate the different components of our
proposed framework. Results are corroborated by a perceptual user study which
highlights the efficiency of our proposed approach.
Text-image guided Diffusion Model for generating Deepfake celebrity interactions
September 26, 2023
Yunzhuo Chen, Nur Al Hasan Haldar, Naveed Akhtar, Ajmal Mian
Deepfake images are fast becoming a serious concern due to their realism.
Diffusion models have recently demonstrated highly realistic visual content
generation, which makes them an excellent potential tool for Deepfake
generation. To curb their exploitation for Deepfakes, it is imperative to first
explore the extent to which diffusion models can be used to generate realistic
content that is controllable with convenient prompts. This paper devises and
explores a novel method in that regard. Our technique alters the popular Stable
Diffusion model to generate a controllable high-quality Deepfake image with
text and image prompts. In addition, the original Stable Diffusion model struggles
to generate quality images that contain multiple persons. The modified
diffusion model addresses this problem by adding the input anchor image’s
latent at the beginning of inference rather than using a Gaussian random latent as
input. Hence, we focus on generating forged content for celebrity interactions,
which may be used to spread rumors. We also apply Dreambooth to enhance the
realism of our fake images. Dreambooth trains the pairing of center words and
specific features to produce more refined and personalized output images. Our
results show that with the devised scheme, it is possible to create fake visual
content with alarming realism, such that the content can serve as believable
evidence of meetings between powerful political figures.
Bootstrap Diffusion Model Curve Estimation for High Resolution Low-Light Image Enhancement
September 26, 2023
Jiancheng Huang, Yifan Liu, Shifeng Chen
Learning-based methods have attracted a lot of research attention and led to
significant improvements in low-light image enhancement. However, most of them
still suffer from two main problems: expensive computational cost in high
resolution images and unsatisfactory performance in simultaneous enhancement
and denoising. To address these problems, we propose BDCE, a bootstrap
diffusion model that exploits the learning of the distribution of the curve
parameters instead of the normal-light image itself. Specifically, we adopt the
curve estimation method to handle the high-resolution images, where the curve
parameters are estimated by our bootstrap diffusion model. In addition, a
denoise module is applied in each iteration of curve adjustment to denoise the
intermediate enhanced result of each iteration. We evaluate BDCE on commonly
used benchmark datasets, and extensive experiments show that it achieves
state-of-the-art qualitative and quantitative performance.
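For intuition, curve-based enhancement typically applies an iterative quadratic adjustment of pixel values; the sketch below assumes a Zero-DCE-style curve LE(x) = x + alpha * x * (1 - x), with the curve-parameter maps standing in for what BDCE's bootstrap diffusion model would predict (the per-iteration denoise module is omitted).

    import torch

    def apply_curves(image, alphas):
        # `image` in [0, 1]; `alphas` is a list of curve-parameter maps in [-1, 1],
        # applied once per iteration. The quadratic keeps values inside [0, 1].
        x = image
        for alpha in alphas:
            x = x + alpha * x * (1.0 - x)
        return x

    low_light = 0.3 * torch.rand(1, 3, 256, 256)            # dim synthetic input
    alphas = [0.6 * torch.ones_like(low_light)] * 4         # stand-in predictions
    enhanced = apply_curves(low_light, alphas)
    print(low_light.mean().item(), enhanced.mean().item())  # brightness increases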
Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation
September 25, 2023
Tsiry Mayet, Simon Bernard, Clement Chatelain, Romain Herault
Domain-to-domain translation involves generating a target domain sample given
a condition in the source domain. Most existing methods focus on fixed input
and output domains, i.e., they only work for specific configurations (e.g., for
two domains, either $D_1\rightarrow{}D_2$ or $D_2\rightarrow{}D_1$). This paper
proposes Multi-Domain Diffusion (MDD), a conditional diffusion framework for
multi-domain translation in a semi-supervised context. Unlike previous methods,
MDD does not require defining input and output domains, allowing translation
between any partition of domains within a set (such as $(D_1,
D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$,
etc. for 3 domains), without the need to train separate models for each domain
configuration. The key idea behind MDD is to leverage the noise formulation of
diffusion models by incorporating one noise level per domain, which allows
missing domains to be modeled with noise in a natural way. This transforms the
training task from a simple reconstruction task to a domain translation task,
where the model relies on less noisy domains to reconstruct more noisy domains.
We present results on a multi-domain (with more than two domains) synthetic
image translation dataset with challenging semantic domain inversion.
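The "one noise level per domain" idea can be sketched as a training step in which each domain receives its own diffusion time step, so that lightly noised domains act as conditions for heavily noised ones; the model interface and tensor layout below are assumptions made for illustration.

    import torch

    def mdd_training_step(model, domains, alphas_cumprod):
        # `domains` is a list of aligned tensors, one per domain, each of shape (B, C, H, W).
        B = domains[0].shape[0]
        T = len(alphas_cumprod)
        noisy, targets, timesteps = [], [], []
        for x in domains:                                    # independent noise level per domain
            t = torch.randint(0, T, (B,))
            a = alphas_cumprod[t].view(B, 1, 1, 1)
            eps = torch.randn_like(x)
            noisy.append(a.sqrt() * x + (1 - a).sqrt() * eps)
            targets.append(eps)
            timesteps.append(t)
        # The network sees all noised domains plus their per-domain time steps and
        # predicts the per-domain noise; a missing domain would simply get t = T - 1.
        eps_pred = model(torch.cat(noisy, dim=1), torch.stack(timesteps, dim=1))
        return ((eps_pred - torch.cat(targets, dim=1)) ** 2).mean()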
NetDiffus: Network Traffic Generation by Diffusion Models through Time-Series Imaging
September 23, 2023
Nirhoshan Sivaroopan, Dumindu Bandara, Chamara Madarasingha, Guillaume Jourjon, Anura Jayasumana, Kanchana Thilakarathna
Network data analytics are now at the core of almost every networking
solution. Nonetheless, limited access to networking data has been an enduring
challenge due to many reasons including complexity of modern networks,
commercial sensitivity, privacy and regulatory constraints. In this work, we
explore how to leverage recent advancements in Diffusion Models (DM) to
generate synthetic network traffic data. We develop an end-to-end framework -
NetDiffus that first converts one-dimensional time-series network traffic into
two-dimensional images, and then synthesizes representative images for the
original data. We demonstrate that NetDiffus outperforms the state-of-the-art
traffic generation methods based on Generative Adversarial Networks (GANs) by
providing a 66.4% increase in the fidelity of the generated data and an 18.1% increase
in downstream machine learning tasks. We evaluate NetDiffus on seven diverse
traffic traces and show that utilizing synthetic data significantly improves
traffic fingerprinting, anomaly detection and traffic classification.
Domain-Guided Conditional Diffusion Model for Unsupervised Domain Adaptation
September 23, 2023
Yulong Zhang, Shuhao Chen, Weisen Jiang, Yu Zhang, Jiangang Lu, James T. Kwok
Limited transferability hinders the performance of deep learning models when
applied to new application scenarios. Recently, Unsupervised Domain Adaptation
(UDA) has achieved significant progress in addressing this issue via learning
domain-invariant features. However, the performance of existing UDA methods is
constrained by the large domain shift and limited target domain data. To
alleviate these issues, we propose DomAin-guided Conditional Diffusion Model
(DACDM) to generate high-fidelity and diverse samples for the target domain.
In the proposed DACDM, by introducing class information, the labels of
generated samples can be controlled, and a domain classifier is further
introduced in DACDM to guide the generated samples for the target domain. The
generated samples help existing UDA methods transfer from the source domain to
the target domain more easily, thus improving the transfer performance.
Extensive experiments on various benchmarks demonstrate that DACDM brings a
large improvement to the performance of existing UDA methods.
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
September 22, 2023
Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy
We present MosaicFusion, a simple yet effective diffusion-based data
augmentation approach for large vocabulary instance segmentation. Our method is
training-free and does not rely on any label supervision. Two key designs
enable us to employ an off-the-shelf text-to-image diffusion model as a useful
dataset generator for object instances and mask annotations. First, we divide
an image canvas into several regions and perform a single round of diffusion
process to generate multiple instances simultaneously, conditioning on
different text prompts. Second, we obtain corresponding instance masks by
aggregating cross-attention maps associated with object prompts across layers
and diffusion time steps, followed by simple thresholding and edge-aware
refinement processing. Without bells and whistles, our MosaicFusion can produce
a significant amount of synthetic labeled data for both rare and novel
categories. Experimental results on the challenging LVIS long-tailed and
open-vocabulary benchmarks demonstrate that MosaicFusion can significantly
improve the performance of existing instance segmentation models, especially
for rare and novel categories. Code will be released at
https://github.com/Jiahao000/MosaicFusion.
Deep learning probability flows and entropy production rates in active matter
September 22, 2023
Nicholas M. Boffi, Eric Vanden-Eijnden
cond-mat.stat-mech, cond-mat.soft, cs.LG, cs.NA, math.NA
Active matter systems, from self-propelled colloids to motile bacteria, are
characterized by the conversion of free energy into useful work at the
microscopic scale. These systems generically involve physics beyond the reach
of equilibrium statistical mechanics, and a persistent challenge has been to
understand the nature of their nonequilibrium states. The entropy production
rate and the magnitude of the steady-state probability current provide
quantitative ways to do so by measuring the breakdown of time-reversal symmetry
and the strength of nonequilibrium transport of measure. Yet, their efficient
computation has remained elusive, as they depend on the system’s unknown and
high-dimensional probability density. Here, building upon recent advances in
generative modeling, we develop a deep learning framework that estimates the
score of this density. We show that the score, together with the microscopic
equations of motion, gives direct access to the entropy production rate, the
probability current, and their decomposition into local contributions from
individual particles, spatial regions, and degrees of freedom. To represent the
score, we introduce a novel, spatially-local transformer-based network
architecture that learns high-order interactions between particles while
respecting their underlying permutation symmetry. We demonstrate the broad
utility and scalability of the method by applying it to several
high-dimensional systems of interacting active particles undergoing
motility-induced phase separation (MIPS). We show that a single instance of our
network trained on a system of 4096 particles at one packing fraction can
generalize to other regions of the phase diagram, including systems with as
many as 32768 particles. We use this observation to quantify the spatial
structure of the departure from equilibrium in MIPS as a function of the number
of particles and the packing fraction.
A Diffusion-Model of Joint Interactive Navigation
September 21, 2023
Matthew Niedoba, Jonathan Wilder Lavington, Yunpeng Liu, Vasileios Lioutas, Justice Sefas, Xiaoxuan Liang, Dylan Green, Setareh Dabiri, Berend Zwartsenberg, Adam Scibior, Frank Wood
Simulation of autonomous vehicle systems requires that simulated traffic
participants exhibit diverse and realistic behaviors. The use of prerecorded
real-world traffic scenarios in simulation ensures realism but the rarity of
safety critical events makes large scale collection of driving scenarios
expensive. In this paper, we present DJINN, a diffusion-based method of
generating traffic scenarios. Our approach jointly diffuses the trajectories of
all agents, conditioned on a flexible set of state observations from the past,
present, or future. On popular trajectory forecasting datasets, we report
state-of-the-art performance on joint trajectory metrics. In addition, we demonstrate
how DJINN flexibly enables direct test-time sampling from a variety of valuable
conditional distributions including goal-based sampling, behavior-class
sampling, and scenario editing.
Latent Diffusion Models for Structural Component Design
September 20, 2023
Ethan Herron, Jaydeep Rade, Anushrut Jignasu, Baskar Ganapathysubramanian, Aditya Balu, Soumik Sarkar, Adarsh Krishnamurthy
Recent advances in generative modeling, namely diffusion models, have
revolutionized the field, enabling high-quality image generation
tailored to user needs. This paper proposes a framework for the generative
design of structural components. Specifically, we employ a Latent Diffusion
model to generate potential designs of a component that can satisfy a set of
problem-specific loading conditions. One of the distinct advantages our
approach offers over other generative approaches, such as generative
adversarial networks (GANs), is that it permits the editing of existing
designs. We train our model using a dataset of geometries obtained from
structural topology optimization utilizing the SIMP algorithm. Consequently,
our framework generates inherently near-optimal designs. Our work presents
quantitative results that support the structural performance of the generated
designs and the variability in potential candidate designs. Furthermore, we
provide evidence of the scalability of our framework by operating over voxel
domains with resolutions varying from $32^3$ to $128^3$. Our framework can be
used as a starting point for generating novel near-optimal designs similar to
topology-optimized designs.
Light Field Diffusion for Single-View Novel View Synthesis
September 20, 2023
Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Xiaohui Xie
Single-view novel view synthesis, the task of generating images from new
viewpoints based on a single reference image, is an important but challenging
task in computer vision. Recently, Denoising Diffusion Probabilistic Model
(DDPM) has become popular in this area due to its strong ability to generate
high-fidelity images. However, current diffusion-based methods directly rely on
camera pose matrices as viewing conditions, globally and implicitly introducing
3D constraints. These methods may suffer from inconsistency among generated
images from different perspectives, especially in regions with intricate
textures and structures. In this work, we present Light Field Diffusion (LFD),
a conditional diffusion-based model for single-view novel view synthesis.
Unlike previous methods that employ camera pose matrices, LFD transforms the
camera view information into light field encoding and combines it with the
reference image. This design introduces local pixel-wise constraints within the
diffusion models, thereby encouraging better multi-view consistency.
Experiments on several datasets show that our LFD can efficiently generate
high-fidelity images and maintain better 3D consistency even in intricate
regions. Our method can generate images with higher quality than NeRF-based
models, and we obtain sample quality similar to other diffusion-based models
but with only one-third of the model size.
Assessing the capacity of a denoising diffusion probabilistic model to reproduce spatial context
September 19, 2023
Rucha Deshpande, Muzaffer Özbey, Hua Li, Mark A. Anastasio, Frank J. Brooks
eess.IV, cs.CV, cs.LG, stat.ML
Diffusion models have emerged as a popular family of deep generative models
(DGMs). In the literature, it has been claimed that one class of diffusion
models – denoising diffusion probabilistic models (DDPMs) – demonstrate
superior image synthesis performance as compared to generative adversarial
networks (GANs). To date, these claims have been evaluated using either
ensemble-based methods designed for natural images, or conventional measures of
image quality such as structural similarity. However, there remains an
important need to understand the extent to which DDPMs can reliably learn
medical imaging domain-relevant information, which is referred to as ‘spatial
context’ in this work. To address this, a systematic assessment of the ability
of DDPMs to learn spatial context relevant to medical imaging applications is
reported for the first time. A key aspect of the studies is the use of
stochastic context models (SCMs) to produce training data. In this way, the
ability of the DDPMs to reliably reproduce spatial context can be
quantitatively assessed by use of post-hoc image analyses. Error-rates in
DDPM-generated ensembles are reported, and compared to those corresponding to a
modern GAN. The studies reveal new and important insights regarding the
capacity of DDPMs to learn spatial context. Notably, the results demonstrate
that DDPMs hold significant capacity for generating contextually correct images
that are ‘interpolated’ between training samples, which may benefit
data-augmentation tasks in ways that GANs cannot.
PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance
September 19, 2023
Peiqing Yang, Shangchen Zhou, Qingyi Tao, Chen Change Loy
Exploiting pre-trained diffusion models for restoration has recently become a
favored alternative to the traditional task-specific training approach.
Previous works have achieved noteworthy success by limiting the solution space
using explicit degradation models. However, these methods often fall short when
faced with complex degradations as they generally cannot be precisely modeled.
In this paper, we propose PGDiff by introducing partial guidance, a fresh
perspective that is more adaptable to real-world degradations compared to
existing works. Rather than specifically defining the degradation process, our
approach models the desired properties, such as image structure and color
statistics of high-quality images, and applies this guidance during the reverse
diffusion process. These properties are readily available and make no
assumptions about the degradation process. When combined with a diffusion
prior, this partial guidance can deliver appealing results across a range of
restoration tasks. Additionally, PGDiff can be extended to handle composite
tasks by consolidating multiple high-quality image properties, achieved by
integrating the guidance from respective tasks. Experimental results
demonstrate that our method not only outperforms existing diffusion-prior-based
approaches but also competes favorably with task-specific models.
Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
September 19, 2023
Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi
cs.SD, cs.LG, cs.MM, eess.AS
Diffusion models power a vast majority of text-to-audio (TTA) generation
methods. Unfortunately, these models suffer from slow inference speed due to
iterative queries to the underlying denoising network, making them unsuitable for
scenarios with inference-time or computational constraints. This work modifies
the recently proposed consistency distillation framework to train TTA models
that require only a single neural network query. In addition to incorporating
classifier-free guidance into the distillation process, we leverage the
availability of generated audio during distillation training to fine-tune the
consistency TTA model with novel loss functions in the audio space, such as the
CLAP score. Our objective and subjective evaluation results on the AudioCaps
dataset show that consistency models retain diffusion models’ high generation
quality and diversity while reducing the number of queries by a factor of 400.
Forgedit: Text Guided Image Editing via Learning and Forgetting
September 19, 2023
Shiwen Zhang, Shuai Xiao, Weilin Huang
Text guided image editing on real images given only the image and the target
text prompt as inputs, is a very general and challenging problem, which
requires the editing model to reason by itself which part of the image should
be edited, to preserve the characteristics of original image, and also to
perform complicated non-rigid editing. Previous fine-tuning based solutions are
time-consuming and vulnerable to overfitting, limiting their editing
capabilities. To tackle these issues, we design a novel text guided image
editing method, Forgedit. First, we propose a novel fine-tuning framework which
learns to reconstruct the given image in less than one minute by vision
language joint learning. Then we introduce vector subtraction and vector
projection to explore the proper text embedding for editing. We also find a
general property of UNet structures in Diffusion Models and, inspired by this
finding, we design forgetting strategies to diminish the fatal overfitting
issues and significantly boost the editing abilities of Diffusion Models. Our
method, Forgedit, implemented with Stable Diffusion, achieves new
state-of-the-art results on the challenging text guided image editing benchmark
TEdBench, surpassing the previous SOTA method Imagic with Imagen, in terms of
both CLIP score and LPIPS score. Codes are available at
https://github.com/witcherofresearch/Forgedit.
Learning End-to-End Channel Coding with Diffusion Models
September 19, 2023
Muah Kim, Rick Fritschek, Rafael F. Schaefer
The training of neural encoders via deep learning necessitates a
differentiable channel model due to the backpropagation algorithm. This
requirement can be sidestepped by approximating either the channel distribution
or its gradient through pilot signals in real-world scenarios. The initial
approach draws upon the latest advancements in image generation, utilizing
generative adversarial networks (GANs) or their enhanced variants to generate
channel distributions. In this paper, we address this channel approximation
challenge with diffusion models, which have demonstrated high sample quality in
image generation. We offer an end-to-end channel coding framework underpinned
by diffusion models and propose an efficient training algorithm. Our
simulations with various channel models establish that our diffusion models
learn the channel distribution accurately, thereby achieving near-optimal
end-to-end symbol error rates (SERs). We also note a significant advantage of
diffusion models: A robust generalization capability in high signal-to-noise
ratio regions, in contrast to GAN variants that suffer from an error floor.
Furthermore, we examine the trade-off between sample quality and sampling
speed, when an accelerated sampling algorithm is deployed, and investigate the
effect of the noise scheduling on this trade-off. With an apt choice of noise
scheduling, sampling time can be significantly reduced with a minor increase in
SER.
Diffusion-based speech enhancement with a weighted generative-supervised learning loss
September 19, 2023
Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel
cs.CV, cs.SD, eess.AS, eess.SP, stat.ML
Diffusion-based generative models have recently gained attention in speech
enhancement (SE), providing an alternative to conventional supervised methods.
These models transform clean speech training samples into Gaussian noise
centered at noisy speech, and subsequently learn a parameterized model to
reverse this process, conditionally on noisy speech. Unlike supervised methods,
generative-based SE approaches usually rely solely on an unsupervised loss,
which may result in less efficient incorporation of conditioned noisy speech.
To address this issue, we propose augmenting the original diffusion training
objective with a mean squared error (MSE) loss, measuring the discrepancy
between estimated enhanced speech and ground-truth clean speech at each reverse
process iteration. Experimental results demonstrate the effectiveness of our
proposed methodology.
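A compact sketch of the proposed objective, assuming an epsilon-prediction network conditioned on the noisy speech and a variance-preserving forward process; the weighting and exact parameterization used in the paper may differ.

    import torch

    def weighted_se_loss(score_model, clean, noisy_speech, t, alphas_cumprod, lam=0.5):
        a = alphas_cumprod[t].view(-1, 1, 1)                     # (B, 1, 1) for (B, C, L) audio
        eps = torch.randn_like(clean)
        x_t = a.sqrt() * clean + (1 - a).sqrt() * eps            # diffused clean speech
        eps_pred = score_model(x_t, t, cond=noisy_speech)        # conditioned on noisy speech
        generative = ((eps_pred - eps) ** 2).mean()              # usual diffusion loss
        x0_hat = (x_t - (1 - a).sqrt() * eps_pred) / a.sqrt()    # implied enhanced speech
        supervised = ((x0_hat - clean) ** 2).mean()              # MSE against ground truth
        return generative + lam * supervised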
AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration
September 19, 2023
Lijiang Li, Huixia Li, Xiawu Zheng, Jie Wu, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan, Fei Chao, Rongrong Ji
Diffusion models are emerging expressive generative models, in which a large
number of time steps (inference steps) are required for a single image
generation. To accelerate such a tedious process, uniformly reducing the number of
steps is considered an undisputed principle of diffusion models. We consider that
such a uniform assumption is not the optimal solution in practice; i.e., we can
find different optimal time steps for different models. Therefore, we propose
to search the optimal time steps sequence and compressed model architecture in
a unified framework to achieve effective image generation for diffusion models
without any further training. Specifically, we first design a unified search
space that consists of all possible time steps and various architectures. Then,
a two stage evolutionary algorithm is introduced to find the optimal solution
in the designed search space. To further accelerate the search process, we
employ the FID score between generated and real samples to estimate the performance
of the sampled examples. As a result, the proposed method is (i) training-free,
obtaining the optimal time steps and model architecture without any training
process; (ii) orthogonal to most advanced diffusion samplers, so it can be
integrated to gain better sample quality; and (iii) generalized, where the
searched time steps and architectures can be directly applied to different
diffusion models with the same guidance scale. Experimental results show that
our method achieves excellent performance by using only a few time steps, e.g.
17.86 FID score on ImageNet 64 $\times$ 64 with only four steps, compared to
138.66 with DDIM. The code is available at
https://github.com/lilijiangg/AutoDiffusion.
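A toy sketch of the time-step part of the search (architecture search omitted): a small evolutionary loop over subsets of time steps, scored by a caller-supplied fitness function standing in for the paper's FID estimate on generated samples. Everything below is illustrative.

    import random

    def evolve_time_steps(fitness, T=1000, k=4, pop_size=16, generations=20, seed=0):
        # `fitness(candidate)` should return a score to minimize (e.g. an FID
        # estimate from a small batch of samples generated with those steps).
        rng = random.Random(seed)

        def random_candidate():
            return tuple(sorted(rng.sample(range(T), k), reverse=True))

        def mutate(cand):
            steps = set(cand)
            steps.discard(rng.choice(cand))          # drop one step ...
            while len(steps) < k:
                steps.add(rng.randrange(T))          # ... and replace it randomly
            return tuple(sorted(steps, reverse=True))

        population = [random_candidate() for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=fitness)
            parents = population[: pop_size // 2]
            children = [mutate(rng.choice(parents)) for _ in range(pop_size - len(parents))]
            population = parents + children
        return min(population, key=fitness)

    # Toy fitness that prefers evenly spaced steps (stand-in for a real FID estimate).
    best = evolve_time_steps(lambda s: sum(abs((a - b) - 250) for a, b in zip(s, s[1:])))
    print(best)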
Diffusion Methods for Generating Transition Paths
September 19, 2023
Luke Triplett, Jianfeng Lu
physics.comp-ph, cs.LG, stat.ML
In this work, we seek to simulate rare transitions between metastable states
using score-based generative models. An efficient method for generating
high-quality transition paths is valuable for the study of molecular systems
since data is often difficult to obtain. We develop two novel methods for path
generation in this paper: a chain-based approach and a midpoint-based approach.
The first biases the original dynamics to facilitate transitions, while the
second mirrors splitting techniques and breaks down the original transition
into smaller transitions. Numerical results of generated transition paths for
the Müller potential and for Alanine dipeptide demonstrate the effectiveness
of these approaches in both the data-rich and data-scarce regimes.
Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
September 18, 2023
Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
Tabular data is hard to acquire and is subject to missing values. This paper
proposes a novel approach to generate and impute mixed-type (continuous and
categorical) tabular data using score-based diffusion and conditional flow
matching. Contrary to previous work that relies on neural networks to learn the
score function or the vector field, we instead rely on XGBoost, a popular
Gradient-Boosted Tree (GBT) method. We empirically show on 27 different
datasets that our approach i) generates highly realistic synthetic data when
the training dataset is either clean or tainted by missing data and ii)
generates diverse plausible data imputations. Furthermore, our method
outperforms deep-learning generation methods on data generation and is
competitive on data imputation. Finally, it can be trained in parallel using
CPUs without the need for a GPU. To make it easily accessible, we release our
code through a Python library and an R package.
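A rough sketch of the flow-matching variant with gradient-boosted trees: regress the conditional velocity target on linearly interpolated points with one XGBoost regressor per feature, then integrate the learned field from noise to data. This is a plausible reading under simplifying assumptions (continuous features only, plain Euler integration), not the paper's exact recipe.

    import numpy as np
    import xgboost as xgb

    def fit_flow_gbt(X, n_estimators=200, seed=0):
        # X: (n, d) array of continuous features. Train one regressor per feature
        # to predict the velocity (x1 - x0) at x_t = (1 - t) x0 + t x1.
        rng = np.random.default_rng(seed)
        x1, x0 = X, rng.standard_normal(X.shape)
        t = rng.uniform(size=(X.shape[0], 1))
        inputs = np.hstack([(1 - t) * x0 + t * x1, t])
        target = x1 - x0
        models = []
        for d in range(X.shape[1]):
            m = xgb.XGBRegressor(n_estimators=n_estimators, max_depth=6)
            m.fit(inputs, target[:, d])
            models.append(m)
        return models

    def sample_flow_gbt(models, n_samples, dim, n_steps=50, seed=1):
        # Euler-integrate the learned velocity field from Gaussian noise to data.
        rng = np.random.default_rng(seed)
        x = rng.standard_normal((n_samples, dim))
        for k in range(n_steps):
            t = np.full((n_samples, 1), k / n_steps)
            v = np.column_stack([m.predict(np.hstack([x, t])) for m in models])
            x = x + v / n_steps
        return x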
What is a Fair Diffusion Model? Designing Generative Text-To-Image Models to Incorporate Various Worldviews
September 18, 2023
Zoe De Simone, Angie Boggust, Arvind Satyanarayan, Ashia Wilson
cs.LG, cs.AI, cs.CV, cs.CY
Generative text-to-image (GTI) models produce high-quality images from short
textual descriptions and are widely used in academic and creative domains.
However, GTI models frequently amplify biases from their training data, often
producing prejudiced or stereotypical images. Yet, current bias mitigation
strategies are limited and primarily focus on enforcing gender parity across
occupations. To enhance GTI bias mitigation, we introduce DiffusionWorldViewer,
a tool to analyze and manipulate GTI models’ attitudes, values, stories, and
expectations of the world that impact their generated images. Through an
interactive interface deployed as a web-based GUI and Jupyter Notebook plugin,
DiffusionWorldViewer categorizes existing demographics of GTI-generated images
and provides interactive methods to align image demographics with user
worldviews. In a study with 13 GTI users, we find that DiffusionWorldViewer
allows users to represent their varied viewpoints about what GTI outputs are
fair and, in doing so, challenges current notions of fairness that assume a
universal worldview.
Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer
September 18, 2023
Peter Ochieng
Diffusion-based vocoders have been criticised for being slow due to the many
steps required during sampling. Moreover, the model’s loss function that is
popularly implemented is designed such that the target is the original input
$x_0$ or error $\epsilon_0$. For early time steps of the reverse process, this
results in large prediction errors, which can lead to speech distortions and
increase the learning time. We propose a setup where the targets are the
different outputs of forward process time steps with a goal to reduce the
magnitude of prediction errors and reduce the training time. We use the
different layers of a neural network (NN) to perform denoising by training them
to learn to generate representations similar to the noised outputs in the
forward process of the diffusion. The NN layers learn to progressively denoise
the input in the reverse process until finally the final layer estimates the
clean speech. To avoid a 1:1 mapping between layers of the neural network and the
forward process steps, we define a skip parameter $\tau>1$ such that an NN
layer is trained to cumulatively remove the noise injected in the $\tau$ steps
in the forward process. This significantly reduces the number of data
distribution recovery steps and, consequently, the time to generate speech. We
show through extensive evaluation that the proposed technique generates
high-fidelity speech in competitive time that outperforms current
state-of-the-art tools. The proposed technique is also able to generalize well
to unseen speech.
Gradpaint: Gradient-Guided Inpainting with Diffusion Models
September 18, 2023
Asya Grechka, Guillaume Couairon, Matthieu Cord
Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved
remarkable results in conditional and unconditional image generation. The
pre-trained models can be adapted without further training to different
downstream tasks, by guiding their iterative denoising process at inference
time to satisfy additional constraints. For the specific task of image
inpainting, the current guiding mechanism relies on copying-and-pasting the
known regions from the input image at each denoising step. However, diffusion
models are strongly conditioned by the initial random noise, and therefore
struggle to harmonize predictions inside the inpainting mask with the real
parts of the input image, often producing results with unnatural artifacts.
Our method, dubbed GradPaint, steers the generation towards a globally
coherent image. At each step in the denoising process, we leverage the model’s
“denoised image estimation” by calculating a custom loss measuring its
coherence with the masked input image. Our guiding mechanism uses the gradient
obtained from backpropagating this loss through the diffusion model itself.
GradPaint generalizes well to diffusion models trained on various datasets,
improving upon current state-of-the-art supervised and unsupervised methods.
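A minimal sketch of the gradient-guidance step described above follows, using a toy noise-prediction network as a stand-in for a pretrained DDPM. The coherence loss here is a simple masked reconstruction term, which only approximates GradPaint's custom loss, and the guidance scale is a hypothetical parameter.

```python
import torch

# Dummy stand-in for a pretrained noise-prediction network eps_theta(x_t, t).
class DummyEps(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim + 1, 128),
                                       torch.nn.SiLU(),
                                       torch.nn.Linear(128, dim))
    def forward(self, x, t):
        return self.net(torch.cat([x, t.expand(x.shape[0], 1)], dim=1))

model = DummyEps()
T = 100
betas = torch.linspace(1e-4, 2e-2, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

y = torch.randn(1, 64)                      # known image (flattened), hypothetical
mask = (torch.rand(1, 64) > 0.5).float()    # 1 = known pixel, 0 = hole
x = torch.randn(1, 64)                      # start from pure noise
guidance_scale = 5.0                        # hypothetical strength of the correction

for i in reversed(range(T)):
    t = torch.tensor([[i / T]])
    a_bar = alphas_bar[i]
    x = x.detach().requires_grad_(True)
    eps = model(x, t)
    x0_hat = (x - torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a_bar)  # denoised estimate
    loss = ((mask * (x0_hat - y)) ** 2).sum()       # coherence with the known region
    grad = torch.autograd.grad(loss, x)[0]          # backprop through the model itself
    with torch.no_grad():
        # plain DDPM mean step, followed by the guidance correction
        mean = (x - betas[i] / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(1 - betas[i])
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[i]) * noise - guidance_scale * grad
```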
Progressive Text-to-Image Diffusion with Soft Latent Direction
September 18, 2023
YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
In spite of the rapidly evolving landscape of text-to-image generation, the
synthesis and manipulation of multiple entities while adhering to specific
relational constraints pose enduring challenges. This paper introduces an
innovative progressive synthesis and editing operation that systematically
incorporates entities into the target image, ensuring their adherence to
spatial and relational constraints at each sequential step. Our key insight
stems from the observation that while a pre-trained text-to-image diffusion
model adeptly handles one or two entities, it often falters when dealing with a
greater number. To address this limitation, we propose harnessing the
capabilities of a Large Language Model (LLM) to decompose intricate and
protracted text descriptions into coherent directives adhering to stringent
formats. To facilitate the execution of directives involving distinct semantic
operations, namely insertion, editing, and erasing, we formulate the Stimulus,
Response, and Fusion (SRF) framework. Within this framework, latent regions are
gently stimulated in alignment with each operation, followed by the fusion of
the responsive latent components to achieve cohesive entity manipulation. Our
proposed framework yields notable advancements in object synthesis,
particularly when confronted with intricate and lengthy textual inputs.
Consequently, it establishes a new benchmark for text-to-image generation
tasks, further elevating the field’s performance standards.
Regularised Diffusion-Shock Inpainting
September 15, 2023
Kristina Schaefer, Joachim Weickert
We introduce regularised diffusion–shock (RDS) inpainting as a modification
of diffusion–shock inpainting from our SSVM 2023 conference paper. RDS
inpainting combines two carefully chosen components: homogeneous diffusion and
coherence-enhancing shock filtering. It benefits from the complementary synergy
of its building blocks: The shock term propagates edge data with perfect
sharpness and directional accuracy over large distances due to its high degree
of anisotropy. Homogeneous diffusion fills large areas efficiently. The second
order equation underlying RDS inpainting inherits a maximum–minimum principle
from its components, which is also fulfilled in the discrete case, in contrast
to competing anisotropic methods. The regularisation addresses the largest
drawback of the original model: It allows a drastic reduction in model
parameters without any loss in quality. Furthermore, we extend RDS inpainting
to vector-valued data. Our experiments show a performance that is comparable to
or better than many inpainting methods based on partial differential equations
and related integrodifferential models.
DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation
September 13, 2023
Zhichao Wu, Qiulin Li, Sixing Liu, Qun Yang
In the Text-to-speech (TTS) task, the latent diffusion model has excellent
fidelity and generalization, but its expensive resource consumption and slow
inference speed have always been challenging. This paper proposes the Discrete
Diffusion Model with Contrastive Learning for Text-to-Speech Generation (DCTTS).
The following contributions are made by DCTTS: 1) The TTS diffusion model based
on discrete space significantly lowers the computational consumption of the
diffusion model and improves sampling speed; 2) The contrastive learning method
based on discrete space is used to enhance the alignment connection between
speech and text and improve sampling quality; and 3) It uses an efficient text
encoder to simplify the model’s parameters and increase computational
efficiency. The experimental results demonstrate that the approach proposed in
this paper has outstanding speech synthesis quality and sampling speed while
significantly reducing the resource consumption of the diffusion model. The
synthesized samples are available at https://github.com/lawtherWu/DCTTS.
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models
September 12, 2023
Zalan Fabian, Berk Tinaz, Mahdi Soltanolkotabi
eess.IV, cs.LG, I.2.6; I.4.5
Inverse problems arise in a multitude of applications, where the goal is to
recover a clean signal from noisy and possibly (non)linear observations. The
difficulty of a reconstruction problem depends on multiple factors, such as the
structure of the ground truth signal, the severity of the degradation, the
implicit bias of the reconstruction model and the complex interactions between
the above factors. This results in natural sample-by-sample variation in the
difficulty of a reconstruction task, which is often overlooked by contemporary
techniques. Recently, diffusion-based inverse problem solvers have established
new state-of-the-art in various reconstruction tasks. However, they have the
drawback of being computationally prohibitive. Our key observation in this
paper is that most existing solvers lack the ability to adapt their compute
power to the difficulty of the reconstruction task, resulting in long inference
times, subpar performance and wasteful resource allocation. We propose a novel
method that we call severity encoding, to estimate the degradation severity of
noisy, degraded signals in the latent space of an autoencoder. We show that the
estimated severity has strong correlation with the true corruption level and
can give useful hints at the difficulty of reconstruction problems on a
sample-by-sample basis. Furthermore, we propose a reconstruction method based
on latent diffusion models that leverages the predicted degradation severities
to fine-tune the reverse diffusion sampling trajectory and thus achieve
sample-adaptive inference times. We utilize latent diffusion posterior sampling
to maintain data consistency with observations. We perform experiments on both
linear and nonlinear inverse problems and demonstrate that our technique
achieves performance comparable to state-of-the-art diffusion-based techniques,
with significant improvements in computational efficiency.
On the Contraction Coefficient of the Schrödinger Bridge for Stochastic Linear Systems
September 12, 2023
Alexis M. H. Teter, Yongxin Chen, Abhishek Halder
math.OC, cs.LG, cs.SY, eess.SY, stat.ML
Schr"{o}dinger bridge is a stochastic optimal control problem to steer a
given initial state density to another, subject to controlled diffusion and
deadline constraints. A popular method to numerically solve the Schr"{o}dinger
bridge problems, in both classical and in the linear system settings, is via
contractive fixed point recursions. These recursions can be seen as dynamic
versions of the well-known Sinkhorn iterations, and under mild assumptions,
they solve the so-called Schr"{o}dinger systems with guaranteed linear
convergence. In this work, we study a priori estimates for the contraction
coefficients associated with the convergence of respective Schr"{o}dinger
systems. We provide new geometric and control-theoretic interpretations for the
same. Building on these newfound interpretations, we point out the possibility
of improved computation for the worst-case contraction coefficients of linear
SBPs by preconditioning the endpoint support sets.
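For context, the Schrödinger system solved by these fixed-point recursions can be written, in its standard static form with endpoint densities $\rho_0, \rho_1$ and transition kernel $K$, as
\[ \rho_0(x) = \hat{\varphi}(x)\int K(x,y)\,\varphi(y)\,\mathrm{d}y, \qquad \rho_1(y) = \varphi(y)\int K(x,y)\,\hat{\varphi}(x)\,\mathrm{d}x, \]
with a Sinkhorn-like recursion alternately updating
\[ \varphi^{(k+1)}(y) = \frac{\rho_1(y)}{\int K(x,y)\,\hat{\varphi}^{(k)}(x)\,\mathrm{d}x}, \qquad \hat{\varphi}^{(k+1)}(x) = \frac{\rho_0(x)}{\int K(x,y)\,\varphi^{(k+1)}(y)\,\mathrm{d}y}. \]
This is the textbook formulation rather than the paper's linear-system variant; the contraction coefficients studied above govern how fast such a recursion converges.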
Reasoning with Latent Diffusion in Offline Reinforcement Learning
September 12, 2023
Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, Glen Berseth
Offline reinforcement learning (RL) holds promise as a means to learn
high-reward policies from a static dataset, without the need for further
environment interactions. However, a key challenge in offline RL lies in
effectively stitching portions of suboptimal trajectories from the static
dataset while avoiding extrapolation errors arising due to a lack of support in
the dataset. Existing approaches use conservative methods that are tricky to
tune and struggle with multi-modal data (as we show) or rely on noisy Monte
Carlo return-to-go samples for reward conditioning. In this work, we propose a
novel approach that leverages the expressiveness of latent diffusion to model
in-support trajectory sequences as compressed latent skills. This facilitates
learning a Q-function while avoiding extrapolation error via
batch-constraining. The latent space is also expressive and gracefully copes
with multi-modal data. We show that the learned temporally-abstract latent
space encodes richer task-specific information for offline RL tasks as compared
to raw state-actions. This improves credit assignment and facilitates faster
reward propagation during Q-learning. Our method demonstrates state-of-the-art
performance on the D4RL benchmarks, particularly excelling in long-horizon,
sparse-reward tasks.
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
September 12, 2023
Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu
Diffusion models have revolutionized text-to-image generation with their
exceptional quality and creativity. However, their multi-step sampling process is
known to be slow, often requiring tens of inference steps to obtain
satisfactory results. Previous attempts to improve their sampling speed and
reduce computational costs through distillation have been unsuccessful in
achieving a functional one-step model. In this paper, we explore a recent
method called Rectified Flow, which, thus far, has only been applied to small
datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which
straightens the trajectories of probability flows, refines the coupling between
noises and images, and facilitates the distillation process with student
models. We propose a novel text-conditioned pipeline to turn Stable Diffusion
(SD) into an ultra-fast one-step model, in which we find reflow plays a
critical role in improving the assignment between noise and images. Leveraging
our new pipeline, we create, to the best of our knowledge, the first one-step
diffusion-based text-to-image generator with SD-level image quality, achieving
an FID (Frechet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing
the previous state-of-the-art technique, progressive distillation, by a
significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an
expanded network with 1.7B parameters, we further improve the FID to $22.4$. We
call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow
yields an FID of $13.1$ in just $0.09$ seconds, the best in the $\leq 0.1$ second
regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second). Notably,
the training of InstaFlow only costs 199 A100 GPU days. Project
page:~\url{https://github.com/gnobitab/InstaFlow}.
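For reference, the reflow procedure mentioned above can be summarized by the rectified-flow objective: given a coupling $(X_0, X_1)$ of noise and (teacher-generated) images, a velocity field $v$ is fit on straight-line interpolants,
\[ X_t = t\,X_1 + (1-t)\,X_0, \qquad \min_{v}\ \mathbb{E}_{t\sim\mathcal{U}[0,1]}\,\mathbb{E}_{(X_0,X_1)} \bigl\| (X_1 - X_0) - v(X_t, t) \bigr\|^2, \]
and the resulting model is sampled with very few (ultimately one) Euler steps. This is the generic formulation; the text-conditioned pipeline and distillation details of InstaFlow are not captured here.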
Elucidating the solution space of extended reverse-time SDE for diffusion models
September 12, 2023
Qinpeng Cui, Xinyi Zhang, Zongqing Lu, Qingmin Liao
Diffusion models (DMs) demonstrate potent image generation capabilities in
various generative modeling tasks. Nevertheless, their primary limitation lies
in slow sampling speed, requiring hundreds or thousands of sequential function
evaluations through large neural networks to generate high-quality images.
Sampling from DMs can be seen alternatively as solving corresponding stochastic
differential equations (SDEs) or ordinary differential equations (ODEs). In
this work, we formulate the sampling process as an extended reverse-time SDE
(ER SDE), unifying prior explorations into ODEs and SDEs. Leveraging the
semi-linear structure of ER SDE solutions, we offer exact solutions and
arbitrarily high-order approximate solutions for VP SDE and VE SDE,
respectively. Based on the solution space of the ER SDE, we yield mathematical
insights elucidating the superior performance of ODE solvers over SDE solvers
in terms of fast sampling. Additionally, we unveil that VP SDE solvers stand on
par with their VE SDE counterparts. Finally, we devise fast and training-free
samplers, ER-SDE-Solvers, achieving state-of-the-art performance across all
stochastic samplers. Experimental results demonstrate achieving 3.45 FID in 20
function evaluations and 2.24 FID in 50 function evaluations on the ImageNet
$64\times64$ dataset.
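As background, the extended reverse-time SDE studied here generalizes the standard reverse-time SDE of score-based diffusion, which for a forward process with drift $f$ and diffusion coefficient $g$ reads
\[ \mathrm{d}x = \bigl[f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\bigr]\,\mathrm{d}t + g(t)\,\mathrm{d}\bar{w}_t, \]
where $\bar{w}_t$ is a reverse-time Wiener process. The ER SDE of the paper introduces additional freedom in how the stochastic term is scaled, unifying the ODE and SDE cases as the abstract describes; the display above only fixes the baseline notation.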
Diffusion-based Adversarial Purification for Robust Deep MRI Reconstruction
September 11, 2023
Ismail Alkhouri, Shijun Liang, Rongrong Wang, Qing Qu, Saiprasad Ravishankar
Deep learning (DL) techniques have been extensively employed in magnetic
resonance imaging (MRI) reconstruction, delivering notable performance
enhancements over traditional non-DL methods. Nonetheless, recent studies have
identified vulnerabilities in these models during testing, namely, their
susceptibility to (\textit{i}) worst-case measurement perturbations and to
(\textit{ii}) variations in training/testing settings like acceleration factors
and k-space sampling locations. This paper addresses the robustness challenges
by leveraging diffusion models. In particular, we present a robustification
strategy that improves the resilience of DL-based MRI reconstruction methods by
utilizing pretrained diffusion models as noise purifiers. In contrast to
conventional robustification methods for DL-based MRI reconstruction, such as
adversarial training (AT), our proposed approach eliminates the need to tackle
a minimax optimization problem. It only necessitates fine-tuning on purified
examples. Our experimental results highlight the efficacy of our approach in
mitigating the aforementioned instabilities when compared to leading
robustification approaches for deep MRI reconstruction, including AT and
randomized smoothing.
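The purification step itself is conceptually simple; a minimal sketch (with a toy noise predictor standing in for the pretrained diffusion model and a deterministic DDIM-style reverse pass) is shown below. The MRI reconstruction network, its fine-tuning on purified examples, and all measurement-domain details are omitted.

```python
import torch

def purify(x, eps_model, alphas_bar, t_star):
    """Diffusion purification: noise x up to step t_star, then denoise back (DDIM, eta=0)."""
    a = alphas_bar[t_star]
    x_t = torch.sqrt(a) * x + torch.sqrt(1 - a) * torch.randn_like(x)   # forward noising
    for i in range(t_star, 0, -1):
        a_i, a_prev = alphas_bar[i], alphas_bar[i - 1]
        eps = eps_model(x_t, torch.full((x.shape[0],), i))
        x0_hat = (x_t - torch.sqrt(1 - a_i) * eps) / torch.sqrt(a_i)    # predicted clean signal
        x_t = torch.sqrt(a_prev) * x0_hat + torch.sqrt(1 - a_prev) * eps  # deterministic DDIM step
    return x_t

# Example with a toy "model" in place of a pretrained network.
alphas_bar = torch.cumprod(1 - torch.linspace(1e-4, 2e-2, 100), dim=0)
toy_model = lambda x, t: torch.zeros_like(x)        # stand-in noise predictor
clean = purify(torch.randn(4, 32), toy_model, alphas_bar, t_star=30)
```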
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
September 11, 2023
Yufei Ye, Poorvi Hebbar, Abhinav Gupta, Shubham Tulsiani
We tackle the task of reconstructing hand-object interactions from short
video clips. Given an input video, our approach casts 3D inference as a
per-video optimization and recovers a neural 3D representation of the object
shape, as well as the time-varying motion and hand articulation. While the
input video naturally provides some multi-view cues to guide 3D inference,
these are insufficient on their own due to occlusions and limited viewpoint
variations. To obtain accurate 3D, we augment the multi-view signals with
generic data-driven priors to guide reconstruction. Specifically, we learn a
diffusion network to model the conditional distribution of (geometric)
renderings of objects conditioned on hand configuration and category label, and
leverage it as a prior to guide the novel-view renderings of the reconstructed
scene. We empirically evaluate our approach on egocentric videos across 6
object categories, and observe significant improvements over prior single-view
and multi-view methods. Finally, we demonstrate our system’s ability to
reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person
interactions.
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
September 11, 2023
Anna Deichler, Shivam Mehta, Simon Alexanderson, Jonas Beskow
eess.AS, cs.HC, cs.LG, cs.SD, 68T42, I.2.6; I.2.7
This paper describes a system developed for the GENEA (Generation and
Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our
solution builds on an existing diffusion-based motion synthesis model. We
propose a contrastive speech and motion pretraining (CSMP) module, which learns
a joint embedding for speech and gesture with the aim to learn a semantic
coupling between these modalities. The output of the CSMP module is used as a
conditioning signal in the diffusion-based gesture synthesis model in order to
achieve semantically-aware co-speech gesture generation. Our entry achieved
highest human-likeness and highest speech appropriateness rating among the
submitted entries. This indicates that our system is a promising approach to
achieve human-like co-speech gestures in agents that carry semantic meaning.
Treatment-aware Diffusion Probabilistic Model for Longitudinal MRI Generation and Diffuse Glioma Growth Prediction
September 11, 2023
Qinghui Liu, Elies Fuster-Garcia, Ivar Thokle Hovden, Donatas Sederevicius, Karoline Skogen, Bradley J MacIntosh, Edvard Grødem, Till Schellhorn, Petter Brandal, Atle Bjørnerud, Kyrre Eeg Emblem
Diffuse gliomas are malignant brain tumors that grow widespread through the
brain. The complex interactions between neoplastic cells and normal tissue, as
well as the treatment-induced changes often encountered, make glioma tumor
growth modeling challenging. In this paper, we present a novel end-to-end
network capable of generating future tumor masks and realistic MRIs of how the
tumor will look at any future time points for different treatment plans. Our
approach is based on cutting-edge diffusion probabilistic models and
deep-segmentation neural networks. We included sequential multi-parametric
magnetic resonance images (MRI) and treatment information as conditioning
inputs to guide the generative diffusion process. This allows for tumor growth
estimates at any given time point. We trained the model using real-world
postoperative longitudinal MRI data with glioma tumor growth trajectories
represented as tumor segmentation maps over time. The model has demonstrated
promising performance across a range of tasks, including the generation of
high-quality synthetic MRIs with tumor masks, time-series tumor segmentations,
and uncertainty estimates. Combined with the treatment-aware generated MRIs,
the tumor growth predictions with uncertainty estimates can provide useful
information for clinical decision-making.
Diff-Privacy: Diffusion-based Face Privacy Protection
September 11, 2023
Xiao He, Mingrui Zhu, Dongxin Chen, Nannan Wang, Xinbo Gao
Privacy protection has become a top priority as the proliferation of AI
techniques has led to widespread collection and misuse of personal data.
Anonymization and visual identity information hiding are two important facial
privacy protection tasks that aim to remove identification characteristics from
facial images at the human perception level. However, they have a significant
difference in that the former aims to prevent the machine from recognizing
correctly, while the latter needs to ensure the accuracy of machine
recognition. Therefore, it is difficult to train a model to complete these two
tasks simultaneously. In this paper, we unify the task of anonymization and
visual identity information hiding and propose a novel face privacy protection
method based on diffusion models, dubbed Diff-Privacy. Specifically, we train
our proposed multi-scale image inversion module (MSI) to obtain a set of SDM
format conditional embeddings of the original image. Based on the conditional
embeddings, we design corresponding embedding scheduling strategies and
construct different energy functions during the denoising process to achieve
anonymization and visual identity information hiding. Extensive experiments
have been conducted to validate the effectiveness of our proposed framework in
protecting facial privacy.
Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood
September 10, 2023
Yaxuan Zhu, Jianwen Xie, Yingnian Wu, Ruiqi Gao
Training energy-based models (EBMs) with maximum likelihood estimation on
high-dimensional data can be both challenging and time-consuming. As a result,
there is a noticeable gap in sample quality between EBMs and other generative
frameworks like GANs and diffusion models. To close this gap, inspired by the
recent efforts of learning EBMs by maximizing diffusion recovery likelihood
(DRL), we propose cooperative diffusion recovery likelihood (CDRL), an
effective approach to tractably learn and sample from a series of EBMs defined
on increasingly noisy versions of a dataset, paired with an initializer model
for each EBM. At each noise level, the initializer model learns to amortize the
sampling process of the EBM, and the two models are jointly estimated within a
cooperative training framework. Samples from the initializer serve as starting
points that are refined by a few sampling steps from the EBM. With the refined
samples, the EBM is optimized by maximizing recovery likelihood, while the
initializer is optimized by learning from the difference between the refined
samples and the initial samples. We develop a new noise schedule and a variance
reduction technique to further improve the sample quality. Combining these
advances, we significantly boost the FID scores compared to existing EBM
methods on CIFAR-10 and ImageNet 32x32, with a 2x speedup over DRL. In
addition, we extend our method to compositional generation and image inpainting
tasks, and showcase the compatibility of CDRL with classifier-free guidance for
conditional generation, achieving similar trade-offs between sample quality and
sample diversity as in diffusion models.
SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
September 10, 2023
Shuchen Xue, Mingyang Yi, Weijian Luo, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, Zhi-Ming Ma
Diffusion Probabilistic Models (DPMs) have achieved considerable success in
generation tasks. As sampling from DPMs is equivalent to solving a diffusion SDE
or ODE, which is time-consuming, numerous fast sampling methods built upon
improved differential equation solvers have been proposed. The majority of such
techniques consider solving the diffusion ODE due to its superior efficiency.
However, stochastic sampling could offer additional advantages in generating
diverse and high-quality data. In this work, we engage in a comprehensive
analysis of stochastic sampling from two aspects: variance-controlled diffusion
SDE and linear multi-step SDE solver. Based on our analysis, we propose
SA-Solver, which is an improved efficient stochastic Adams method for solving
diffusion SDE to generate data with high quality. Our experiments show that
SA-Solver achieves: 1) improved or comparable performance compared with the
existing state-of-the-art sampling methods for few-step sampling; 2) SOTA FID
scores on substantial benchmark datasets under a suitable number of function
evaluations (NFEs).
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
September 10, 2023
Guisheng Liu, Yi Li, Zhengcong Fei, Haiyan Fu, Xiangyang Luo, Yanqing Guo
While impressive performance has been achieved in image captioning, the
limited diversity of the generated captions and the large parameter scale
remain major barriers to the real-world application of these systems. In this
work, we propose a lightweight image captioning network in combination with
continuous diffusion, called Prefix-diffusion. To achieve diversity, we design
an efficient method that injects prefix image embeddings into the denoising
process of the diffusion model. In order to reduce trainable parameters, we
employ a pre-trained model to extract image features and further design an
extra mapping network. Prefix-diffusion is able to generate diverse captions
with relatively few parameters, while maintaining the fluency and relevance of
the captions benefiting from the generative capabilities of the diffusion
model. Our work paves the way for scaling up diffusion models for image
captioning, and achieves promising performance compared with recent approaches.
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
September 08, 2023
Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng
Recent advancements in diffusion models have showcased their impressive
capacity to generate visually striking images. Nevertheless, ensuring a close
match between the generated image and the given prompt remains a persistent
challenge. In this work, we identify that a crucial factor leading to the
text-image mismatch issue is the inadequate cross-modality relation learning
between the prompt and the output image. To better align the prompt and image
content, we advance the cross-attention with an adaptive mask, which is
conditioned on the attention maps and the prompt embeddings, to dynamically
adjust the contribution of each text token to the image features. This
mechanism explicitly diminishes the ambiguity in semantic information embedding
from the text encoder, leading to a boost of text-to-image consistency in the
synthesized images. Our method, termed MaskDiffusion, is training-free and
hot-pluggable for popular pre-trained diffusion models. When applied to the
latent diffusion models, our MaskDiffusion can significantly improve the
text-to-image consistency with negligible computation overhead compared to the
original diffusion models.
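A toy version of the adaptive cross-attention mask might look as follows; here the mask is derived only from each text token's attention map (keeping responses above a fraction of that token's peak), whereas the actual method also conditions on the prompt embeddings. The function name and threshold are illustrative assumptions.

```python
import torch

def masked_cross_attention(q, k, v, threshold=0.3):
    """Toy adaptive attention mask: suppress query-token pairs whose attention is
    weak relative to that token's peak response (an assumption, not the exact
    MaskDiffusion conditioning)."""
    d = q.shape[-1]
    attn = torch.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)   # (B, Npixels, Ntokens)
    peak = attn.amax(dim=1, keepdim=True)                            # per-token max over pixels
    mask = (attn >= threshold * peak).float()                        # keep strong responses only
    attn = attn * mask
    attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)     # renormalize over tokens
    return attn @ v

# Example shapes: 2 images, 64 latent "pixels", 8 text tokens, dim 16.
q, k, v = torch.randn(2, 64, 16), torch.randn(2, 8, 16), torch.randn(2, 8, 16)
out = masked_cross_attention(q, k, v)   # (2, 64, 16)
```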
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
September 07, 2023
Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
We present InstructDiffusion, a unifying and generic framework for aligning
computer vision tasks with human instructions. Unlike existing approaches that
integrate prior knowledge and pre-define the output space (e.g., categories and
coordinates) for each vision task, we cast diverse vision tasks into a
human-intuitive image-manipulating process whose output space is a flexible and
interactive pixel space. Concretely, the model is built upon the diffusion
process and is trained to predict pixels according to user instructions, such
as encircling the man’s left shoulder in red or applying a blue mask to the
left car. InstructDiffusion could handle a variety of vision tasks, including
understanding tasks (such as segmentation and keypoint detection) and
generative tasks (such as editing and enhancement). It even exhibits the
ability to handle unseen tasks and outperforms prior methods on novel datasets.
This represents a significant step towards a generalist modeling interface for
vision tasks, advancing artificial general intelligence in the field of
computer vision.
DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection
September 07, 2023
Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma
Data is the cornerstone of deep learning. This paper reveals that the
recently developed Diffusion Model is a scalable data engine for object
detection. Existing methods for scaling up detection-oriented data often
require manual collection or generative models to obtain target images,
followed by data augmentation and labeling to produce training pairs, which are
costly, complex, or lacking diversity. To address these issues, we
present DiffusionEngine (DE), a data scaling-up engine that provides
high-quality detection-oriented training pairs in a single stage. DE consists
of a pre-trained diffusion model and an effective Detection-Adapter,
contributing to generating scalable, diverse and generalizable detection data
in a plug-and-play manner. Detection-Adapter is learned to align the implicit
semantic and location knowledge in off-the-shelf diffusion models with
detection-aware signals to make better bounding-box predictions. Additionally,
we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing
detection benchmarks for facilitating follow-up research. Extensive experiments
demonstrate that data scaling-up via DE can achieve significant improvements in
diverse scenarios, such as various detection algorithms, self-supervised
pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised
learning. For example, when using DE with a DINO-based adapter to scale up
data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.
Text-to-feature diffusion for audio-visual few-shot learning
September 07, 2023
Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Training deep learning models for video classification from audio-visual data
commonly requires immense amounts of labeled training data collected via a
costly process. A challenging and underexplored, yet much cheaper, setup is
few-shot learning from video data. In particular, the inherently multi-modal
nature of video data with sound and visual information has not been leveraged
extensively for the few-shot video classification task. Therefore, we introduce
a unified audio-visual few-shot video classification benchmark on three
datasets, i.e., VGGSound-FSL, UCF-FSL, and ActivityNet-FSL, where we
adapt and compare ten methods. In addition, we propose AV-DIFF, a
text-to-feature diffusion framework, which first fuses the temporal and
audio-visual features via cross-modal attention and then generates multi-modal
features for the novel classes. We show that AV-DIFF obtains state-of-the-art
performance on our proposed benchmark for audio-visual (generalised) few-shot
learning. Our benchmark paves the way for effective audio-visual classification
when only limited labeled data is available. Code and data are available at
https://github.com/ExplainableML/AVDIFF-GFSL.
DiffDefense: Defending against Adversarial Attacks via Diffusion Models
September 07, 2023
Hondamunige Prasanna Silva, Lorenzo Seidenari, Alberto Del Bimbo
This paper presents a novel reconstruction method that leverages Diffusion
Models to protect machine learning classifiers against adversarial attacks, all
without requiring any modifications to the classifiers themselves. The
susceptibility of machine learning models to minor input perturbations renders
them vulnerable to adversarial attacks. While diffusion-based methods are
typically disregarded for adversarial defense due to their slow reverse
process, this paper demonstrates that our proposed method offers robustness
against adversarial threats while preserving clean accuracy, speed, and
plug-and-play compatibility. Code at:
https://github.com/HondamunigePrasannaSilva/DiffDefence.
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
September 07, 2023
Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu
Inspired by the remarkable success of Latent Diffusion Models (LDMs) for
image synthesis, we study LDM for text-to-video generation, which is a
formidable challenge due to the computational and memory constraints during
both model training and inference. A single LDM is usually only capable of
generating a very limited number of video frames. Some existing works focus on
separate prediction models for generating more video frames, which, however,
suffer from additional training cost and frame-level jittering. In this paper, we
propose a framework called “Reuse and Diffuse”, dubbed $\textit{VidRD}$, to
produce more frames following the frames already generated by an LDM.
Conditioned on an initial video clip with a small number of frames, additional
frames are iteratively generated by reusing the original latent features and
following the previous diffusion process. Besides, for the autoencoder used for
translation between pixel space and latent space, we inject temporal layers
into its decoder and fine-tune these layers for higher temporal consistency. We
also propose a set of strategies for composing video-text data that involve
diverse content from multiple existing datasets including video datasets for
action recognition and image-text datasets. Extensive experiments show that our
method achieves good results in both quantitative and qualitative evaluations.
Our project page is available
$\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}$.
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter
September 06, 2023
Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu
Recent research has explored the utilization of pre-trained text-image
discriminative models, such as CLIP, to tackle the challenges associated with
open-vocabulary semantic segmentation. However, it is worth noting that the
alignment process based on contrastive learning employed by these models may
unintentionally result in the loss of crucial localization information and
object completeness, which are essential for achieving accurate semantic
segmentation. More recently, there has been an emerging interest in extending
the application of diffusion models beyond text-to-image generation tasks,
particularly in the domain of semantic segmentation. These approaches utilize
diffusion models either for generating annotated data or for extracting
features to facilitate semantic segmentation. This typically involves training
segmentation models by generating a considerable amount of synthetic data or
incorporating additional mask annotations. To this end, we uncover the
potential of generative text-to-image conditional diffusion models as highly
efficient open-vocabulary semantic segmenters, and introduce a novel
training-free approach named DiffSegmenter. Specifically, by feeding an input
image and candidate classes into an off-the-shelf pre-trained conditional
latent diffusion model, the cross-attention maps produced by the denoising
U-Net are directly used as segmentation scores, which are further refined and
completed by the followed self-attention maps. Additionally, we carefully
design effective textual prompts and a category filtering mechanism to further
enhance the segmentation results. Extensive experiments on three benchmark
datasets show that the proposed DiffSegmenter achieves impressive results for
open-vocabulary semantic segmentation.
Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
September 06, 2023
Hyunwoo Ryu, Jiwoo Kim, Hyunseok An, Junwoo Chang, Joohwan Seo, Taehan Kim, Yubin Kim, Chaewon Hwang, Jongeun Choi, Roberto Horowitz
Diffusion generative modeling has become a promising approach for learning
robotic manipulation tasks from stochastic human demonstrations. In this paper,
we present Diffusion-EDFs, a novel SE(3)-equivariant diffusion-based approach
for visual robotic manipulation tasks. We show that our proposed method
achieves remarkable data efficiency, requiring only 5 to 10 human
demonstrations for effective end-to-end training in less than an hour.
Furthermore, our benchmark experiments demonstrate that our approach has
superior generalizability and robustness compared to state-of-the-art methods.
Lastly, we validate our methods with real hardware experiments. Project
Website: https://sites.google.com/view/diffusion-edfs/home
Diffusion on the Probability Simplex
September 05, 2023
Griffin Floto, Thorsteinn Jonsson, Mihai Nica, Scott Sanner, Eric Zhengyu Zhu
Diffusion models learn to reverse the progressive noising of a data
distribution to create a generative model. However, the desired continuous
nature of the noising process can be at odds with discrete data. To deal with
this tension between continuous and discrete objects, we propose a method of
performing diffusion on the probability simplex. Using the probability simplex
naturally creates an interpretation where points correspond to categorical
probability distributions. Our method uses the softmax function applied to an
Ornstein-Uhlenbeck process, a well-known stochastic differential equation. We
find that our methodology also naturally extends to include diffusion on the
unit cube, which has applications for bounded image generation.
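Concretely, the construction applies the softmax map to an Ornstein-Uhlenbeck process: with parameters $\theta, \sigma > 0$,
\[ \mathrm{d}x_t = -\theta\, x_t\,\mathrm{d}t + \sigma\,\mathrm{d}W_t, \qquad y_t = \operatorname{softmax}(x_t), \quad (y_t)_i = \frac{e^{(x_t)_i}}{\sum_j e^{(x_t)_j}}, \]
so that $y_t$ always lies in the interior of the probability simplex. The specific parameterization and the reverse process used by the authors are not spelled out in the abstract; the display above only fixes notation for the forward construction.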
Diffusion-based 3D Object Detection with Random Boxes
September 05, 2023
Xin Zhou, Jinghua Hou, Tingting Yao, Dingkang Liang, Zhe Liu, Zhikang Zou, Xiaoqing Ye, Jianwei Cheng, Xiang Bai
3D object detection is an essential task for achieving autonomous driving.
Existing anchor-based detection methods rely on empirical heuristic settings of
anchors, which makes the algorithms lack elegance. In recent years, we have
witnessed the rise of several generative models, among which diffusion models
show great potential for learning the transformation of two distributions. Our
proposed Diff3Det migrates the diffusion model to proposal generation for 3D
object detection by considering the detection boxes as generative targets.
During training, the object boxes diffuse from the ground truth boxes to the
Gaussian distribution, and the decoder learns to reverse this noise process. In
the inference stage, the model progressively refines a set of random boxes to
the prediction results. We provide detailed experiments on the KITTI benchmark
and achieve promising performance compared to classical anchor-based 3D
detection methods.
Diffusion Generative Inverse Design
September 05, 2023
Marin Vlastelica, Tatiana López-Guevara, Kelsey Allen, Peter Battaglia, Arnaud Doucet, Kimberley Stachenfeld
Inverse design refers to the problem of optimizing the input of an objective
function in order to enact a target outcome. For many real-world engineering
problems, the objective function takes the form of a simulator that predicts
how the system state will evolve over time, and the design challenge is to
optimize the initial conditions that lead to a target outcome. Recent
developments in learned simulation have shown that graph neural networks (GNNs)
can be used for accurate, efficient, differentiable estimation of simulator
dynamics, and support high-quality design optimization with gradient- or
sampling-based optimization procedures. However, optimizing designs from
scratch requires many expensive model queries, and these procedures exhibit
basic failures on either non-convex or high-dimensional problems. In this work,
we show how denoising diffusion models (DDMs) can be used to solve inverse
design problems efficiently and propose a particle sampling algorithm for
further improving their efficiency. We perform experiments on a number of fluid
dynamics design challenges, and find that our approach substantially reduces
the number of calls to the simulator compared to standard techniques.
sasdim: self-adaptive noise scaling diffusion model for spatial time series imputation
September 05, 2023
Shunyang Zhang, Senzhang Wang, Xianzhen Tan, Ruochen Liu, Jian Zhang, Jianxin Wang
Spatial time series imputation is critically important to many real
applications such as intelligent transportation and air quality monitoring.
Although recent transformer and diffusion model based approaches have achieved
significant performance gains compared with conventional statistics-based
methods, spatial time series imputation still remains a challenging issue
due to the complex spatio-temporal dependencies and the noise uncertainty of
the spatial time series data. Especially, recent diffusion process based models
may introduce random noise to the imputations, and thus cause negative impact
on the model performance. To this end, we propose a self-adaptive noise scaling
diffusion model named SaSDim to more effectively perform spatial time series
imputation. Specifically, we propose a new loss function that can scale the noise
to a similar intensity, and propose an across-spatial-temporal global
convolution module to more effectively capture the dynamic spatial-temporal
dependencies. Extensive experiments conducted on three real world datasets
verify the effectiveness of SaSDim by comparison with current state-of-the-art
baselines.
Gradient Domain Diffusion Models for Image Synthesis
September 05, 2023
Yuanhao Gong
cs.CV, cs.LG, cs.MM, cs.PF, eess.IV
Diffusion models are getting popular in generative image and video synthesis.
However, due to the diffusion process, they require a large number of steps to
converge. To tackle this issue, in this paper, we propose to perform the
diffusion process in the gradient domain, where the convergence becomes faster.
There are two reasons. First, thanks to the Poisson equation, the gradient
domain is mathematically equivalent to the original image domain. Therefore,
each diffusion step in the image domain has a unique corresponding gradient
domain representation. Second, the gradient domain is much sparser than the
image domain. As a result, gradient domain diffusion models converge faster.
Several numerical experiments confirm that the gradient domain diffusion models
are more efficient than the original diffusion models. The proposed method can
be applied in a wide range of applications such as image processing, computer
vision and machine learning tasks.
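The equivalence invoked here is the standard Poisson relation between an image and its gradient field: writing $\mathbf{g} = \nabla u$ for the gradients of an image $u$, the image is recovered (up to boundary conditions and an additive constant) by solving
\[ \Delta u = \operatorname{div}(\mathbf{g}), \]
so any sample produced in the gradient domain can be mapped back to the image domain with a Poisson solve.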
Softmax Bias Correction for Quantized Generative Models
September 04, 2023
Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel
Post-training quantization (PTQ) is the go-to compression technique for large
generative models, such as stable diffusion or large language models. PTQ
methods commonly keep the softmax activation in higher precision as it has been
shown to be very sensitive to quantization noise. However, this can lead to a
significant runtime and power overhead during inference on resource-constrained
edge devices. In this work, we investigate the source of the softmax
sensitivity to quantization and show that the quantization operation leads to a
large bias in the softmax output, causing accuracy degradation. To overcome
this issue, we propose an offline bias correction technique that improves the
quantizability of softmax without additional compute during deployment, as it
can be readily absorbed into the quantization parameters. We demonstrate the
effectiveness of our method on stable diffusion v1.5 and 125M-size OPT language
model, achieving significant accuracy improvement for 8-bit quantized softmax.
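A minimal sketch of the offline correction idea follows, with naive uniform quantization standing in for the deployed integer kernel and random logits standing in for calibration activations; the exact statistics used, and how the correction is absorbed into the quantization parameters, differ in the actual method.

```python
import numpy as np

def quantize(x, n_bits=8):
    """Naive symmetric uniform quantization (a stand-in for the deployed int kernel)."""
    scale = np.abs(x).max() / (2 ** (n_bits - 1) - 1)
    return np.round(x / scale).clip(-2 ** (n_bits - 1), 2 ** (n_bits - 1) - 1) * scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Offline: estimate the mean bias of the quantized softmax on calibration activations.
rng = np.random.default_rng(0)
calib = rng.normal(size=(1024, 77))                   # hypothetical attention logits
bias = (softmax(calib) - quantize(softmax(calib))).mean(axis=0)

# Deployment: the correction is a constant vector; in the actual method it is folded
# into the quantization parameters, here it is simply added after the quantized softmax.
logits = rng.normal(size=(4, 77))
corrected = quantize(softmax(logits)) + bias
```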
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
September 04, 2023
Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang
Diffusion models achieved great success in image synthesis, but still face
challenges in high-resolution generation. Through the lens of discrete cosine
transformation, we find the main reason is that \emph{the same noise level on a
higher resolution results in a higher Signal-to-Noise Ratio in the frequency
domain}. In this work, we present Relay Diffusion Model (RDM), which transfers
a low-resolution image or noise into an equivalent high-resolution one for
diffusion model via blurring diffusion and block noise. Therefore, the
diffusion process can continue seamlessly in any new resolution or model
without restarting from pure noise or low-resolution conditioning. RDM achieves
state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256,
surpassing previous works such as ADM, LDM and DiT by a large margin. All the
codes and checkpoints are open-sourced at
\url{https://github.com/THUDM/RelayDiffusion}.
GenSelfDiff-HIS: Generative Self-Supervision Using Diffusion for Histopathological Image Segmentation
September 04, 2023
Vishnuvardhan Purma, Suhas Srinath, Seshan Srirangarajan, Aanchal Kakkar, Prathosh A. P
Histopathological image segmentation is a laborious and time-intensive task,
often requiring analysis from experienced pathologists for accurate
examinations. To reduce this burden, supervised machine-learning approaches
have been adopted using large-scale annotated datasets for histopathological
image analysis. However, in several scenarios, the availability of large-scale
annotated data is a bottleneck while training such models. Self-supervised
learning (SSL) is an alternative paradigm that provides some respite by
constructing models utilizing only the unannotated data which is often
abundant. The basic idea of SSL is to train a network to perform one or many
pseudo or pretext tasks on unannotated data and use it subsequently as the
basis for a variety of downstream tasks. It is seen that the success of SSL
depends critically on the considered pretext task. While there have been many
efforts in designing pretext tasks for classification problems, there haven’t
been many attempts on SSL for histopathological segmentation. Motivated by
this, we propose an SSL approach for segmenting histopathological images via
generative diffusion models in this paper. Our method is based on the
observation that diffusion models effectively solve an image-to-image
translation task akin to a segmentation task. Hence, we propose generative
diffusion as the pretext task for histopathological image segmentation. We also
propose a multi-loss function-based fine-tuning for the downstream task. We
validate our method using several metrics on two publicly available datasets
along with a newly proposed head and neck (HN) cancer dataset containing
hematoxylin and eosin (H\&E) stained images along with annotations. Codes will
be made public at
https://github.com/PurmaVishnuVardhanReddy/GenSelfDiff-HIS.git.
FinDiff: Diffusion Models for Financial Tabular Data Generation
September 04, 2023
Timur Sattarov, Marco Schreyer, Damian Borth
The sharing of microdata, such as fund holdings and derivative instruments,
by regulatory institutions presents a unique challenge due to strict data
confidentiality and privacy regulations. These challenges often hinder the
ability of both academics and practitioners to conduct collaborative research
effectively. The emergence of generative models, particularly diffusion models,
capable of synthesizing data mimicking the underlying distributions of
real-world data presents a compelling solution. This work introduces ‘FinDiff’,
a diffusion model designed to generate real-world financial tabular data for a
variety of regulatory downstream tasks, for example economic scenario modeling,
stress tests, and fraud detection. The model uses embedding encodings to model
mixed modality financial data, comprising both categorical and numeric
attributes. The performance of FinDiff in generating synthetic tabular
financial data is evaluated against state-of-the-art baseline models using
three real-world financial datasets (including two publicly available datasets
and one proprietary dataset). Empirical results demonstrate that FinDiff excels
in generating synthetic tabular financial data with high fidelity, privacy, and
utility.
Accelerating Markov Chain Monte Carlo sampling with diffusion models
September 04, 2023
N. T. Hunt-Smith, W. Melnitchouk, F. Ringer, N. Sato, A. W Thomas, M. J. White
Global fits of physics models require efficient methods for exploring
high-dimensional and/or multimodal posterior functions. We introduce a novel
method for accelerating Markov Chain Monte Carlo (MCMC) sampling by pairing a
Metropolis-Hastings algorithm with a diffusion model that can draw global
samples with the aim of approximating the posterior. We briefly review
diffusion models in the context of image synthesis before providing a
streamlined diffusion model tailored towards low-dimensional data arrays. We
then present our adapted Metropolis-Hastings algorithm which combines local
proposals with global proposals taken from a diffusion model that is regularly
trained on the samples produced during the MCMC run. Our approach leads to a
significant reduction in the number of likelihood evaluations required to
obtain an accurate representation of the Bayesian posterior across several
analytic functions, as well as for a physical example based on a global
analysis of parton distribution functions. Our method is extensible to other
MCMC techniques, and we briefly compare our method to similar approaches based
on normalizing flows. A code implementation can be found at
https://github.com/NickHunt-Smith/MCMC-diffusion.
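The following sketch illustrates mixing local random-walk proposals with global independence proposals that are periodically refit to the chain. A Gaussian KDE stands in for the diffusion model so that the proposal density needed in the acceptance ratio stays tractable, and the target is a toy bimodal posterior rather than a physics likelihood.

```python
import numpy as np
from scipy.stats import gaussian_kde, multivariate_normal

def log_post(x):
    """Toy bimodal target density (stand-in for a physics posterior)."""
    return np.logaddexp(multivariate_normal.logpdf(x, mean=[-3.0, 0.0]),
                        multivariate_normal.logpdf(x, mean=[3.0, 0.0]))

rng = np.random.default_rng(0)
x = np.zeros(2)
chain, global_model = [x], None

for step in range(5000):
    if global_model is not None and rng.random() < 0.1:
        # Global (independence) proposal drawn from the fitted generative model.
        prop = global_model.resample(1).ravel()
        log_q_ratio = float(global_model.logpdf(x) - global_model.logpdf(prop))
    else:
        # Local random-walk proposal.
        prop = x + 0.5 * rng.normal(size=2)
        log_q_ratio = 0.0
    # Metropolis-Hastings acceptance; each kernel is corrected separately.
    if np.log(rng.random()) < log_post(prop) - log_post(x) + log_q_ratio:
        x = prop
    chain.append(x)
    if step % 1000 == 999:
        # Periodically refit the global proposal on the samples collected so far.
        # (The paper trains a diffusion model here; a KDE keeps the density tractable.)
        global_model = gaussian_kde(np.array(chain).T)

samples = np.array(chain)
```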
Diffusion Models with Deterministic Normalizing Flow Priors
September 03, 2023
Mohsen Zand, Ali Etemad, Michael Greenspan
For faster sampling and higher sample quality, we propose DiNof
($\textbf{Di}$ffusion with $\textbf{No}$rmalizing $\textbf{f}$low priors), a
technique that makes use of normalizing flows and diffusion models. We use
normalizing flows to parameterize the noisy data at any arbitrary step of the
diffusion process and utilize it as the prior in the reverse diffusion process.
More specifically, the forward noising process turns a data distribution into
partially noisy data, which are subsequently transformed into a Gaussian
distribution by a nonlinear process. The backward denoising procedure begins
with a prior created by sampling from the Gaussian distribution and applying
the invertible normalizing flow transformations deterministically. To generate
the data distribution, the prior then undergoes the remaining diffusion
stochastic denoising procedure. Through the reduction of the number of total
diffusion steps, we are able to speed up both the forward and backward
processes. More importantly, we improve the expressive power of diffusion
models by employing both deterministic and stochastic mappings. Experiments on
standard image generation datasets demonstrate the advantage of the proposed
method over existing approaches. On the unconditional CIFAR10 dataset, for
example, we achieve an FID of 2.01 and an Inception score of 9.96. Our method
also demonstrates competitive performance on CelebA-HQ-256 dataset as it
obtains an FID score of 7.11. Code is available at
https://github.com/MohsenZand/DiNof.
NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement
September 03, 2023
Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou
The goal of speech enhancement (SE) is to eliminate the background
interference from the noisy speech signal. Generative models such as diffusion
models (DM) have been applied to the task of SE because of better
generalization in unseen noisy scenes. Technical routes for the DM-based SE
methods can be summarized into three types: task-adapted diffusion process
formulation, generator-plus-conditioner (GPC) structures and the multi-stage
frameworks. We focus on the first two approaches, which are constructed under
the GPC architecture and use the task-adapted diffusion process to better deal
with the real noise. However, the performance of these SE models is limited by
the following issues: (a) Non-Gaussian noise estimation in the task-adapted
diffusion process. (b) Conditional domain bias caused by the weak conditioner
design in the GPC structure. (c) Large amount of residual noise caused by
unreasonable interpolation operations during inference. To solve the above
problems, we propose a noise-aware diffusion-based SE model (NADiffuSE) to
boost the SE performance, where the noise representation is extracted from the
noisy speech signal and introduced as a global conditional information for
estimating the non-Gaussian components. Furthermore, the anchor-based inference
algorithm is employed to achieve a compromise between the speech distortion and
noise residual. In order to mitigate the performance degradation caused by the
conditional domain bias in the GPC framework, we investigate three model
variants, all of which can be viewed as multi-stage SE based on the
preprocessing networks for Mel spectrograms. Experimental results show that
NADiffuSE outperforms other DM-based SE models under the GPC infrastructure.
Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/.
VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders
September 03, 2023
Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang
Large-scale text-to-image diffusion models have shown impressive capabilities
for generative tasks by leveraging strong vision-language alignment from
pre-training. However, most vision-language discriminative tasks require
extensive fine-tuning on carefully-labeled datasets to acquire such alignment,
with great cost in time and computing resources. In this work, we explore
directly applying a pre-trained generative diffusion model to the challenging
discriminative task of visual grounding without any fine-tuning and additional
training dataset. Specifically, we propose VGDiffZero, a simple yet effective
zero-shot visual grounding framework based on text-to-image diffusion models.
We also design a comprehensive region-scoring method considering both global
and local contexts of each isolated proposal. Extensive experiments on RefCOCO,
RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on
zero-shot visual grounding. Our code is available at
https://github.com/xuyang-liu16/VGDiffZero.
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech – A Study between English and Mandarin
September 02, 2023
Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie
While the performance of cross-lingual TTS based on monolingual corpora has
been significantly improved recently, generating cross-lingual speech still
suffers from the foreign accent problem, leading to limited naturalness.
Besides, current cross-lingual methods ignore modeling emotion, which is
indispensable paralinguistic information in speech delivery. In this paper, we
propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer
method that can transfer emotion from a source speaker to the intra- and
cross-lingual target speakers. Specifically, to relieve the foreign accent
problem while improving the emotion expressiveness, the terminal distribution
of the forward diffusion process is parameterized into a speaker-irrelevant but
emotion-related linguistic prior by a prior text encoder with the emotion
embedding as a condition. To address the weaker emotional expressiveness
problem caused by speaker disentanglement in emotion embedding, a novel
orthogonal projection based emotion disentangling module (OP-EDM) is proposed
to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover,
a condition-enhanced DPM decoder is introduced to strengthen the modeling
ability of the speaker and the emotion in the reverse diffusion process to
further improve emotion expressiveness in speech delivery. Cross-lingual
emotion transfer experiments show the superiority of DiCLET-TTS over various
competitive models and the good design of OP-EDM in learning speaker-irrelevant
but emotion-discriminative embedding.
September 02, 2023
Yu Guan, Chuanming Yu, Shiyu Lu, Zhuoxu Cui, Dong Liang, Qiegen Liu
Most existing MRI reconstruction methods perform targeted reconstruction of
the entire MR image without taking specific tissue regions into consideration.
This may fail to emphasize the reconstruction accuracy on important tissues
for diagnosis. In this study, leveraging a combination of the properties of
k-space data and the diffusion process, our novel scheme focuses on mining the
multi-frequency prior with different strategies to preserve fine texture
details in the reconstructed image. In addition, a diffusion process can
converge more quickly if its target distribution closely resembles the noise
distribution in the process. This can be accomplished through various
high-frequency prior extractors. The finding further solidifies the
effectiveness of the score-based generative model. On top of all the
advantages, our method improves the accuracy of MRI reconstruction and
accelerates the sampling process. Experimental results verify that the proposed
method successfully obtains more accurate reconstruction and outperforms
state-of-the-art methods.
Diffusion Modeling with Domain-conditioned Prior Guidance for Accelerated MRI and qMRI Reconstruction
September 02, 2023
Wanyu Bian, Albert Jang, Fang Liu
This study introduces a novel approach for image reconstruction based on a
diffusion model conditioned on the native data domain. Our method is applied to
multi-coil MRI and quantitative MRI reconstruction, leveraging the
domain-conditioned diffusion model within the frequency and parameter domains.
The prior MRI physics are used as embeddings in the diffusion model, enforcing
data consistency to guide the training and sampling process, characterizing MRI
k-space encoding in MRI reconstruction, and leveraging MR signal modeling for
qMRI reconstruction. Furthermore, a gradient descent optimization is
incorporated into the diffusion steps, enhancing feature learning and improving
denoising. The proposed method demonstrates significant promise, particularly
for reconstructing images at high acceleration factors. Notably, it maintains
great reconstruction accuracy and efficiency for static and quantitative MRI
reconstruction across diverse anatomical structures. Beyond its immediate
applications, this method provides potential generalization capability, making
it adaptable to inverse problems across various domains.
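The interplay between the generative update and the data-consistency gradient step can be sketched as follows (a generic outline under standard multi-coil MRI assumptions; reverse_update, forward_op, and adjoint_op are assumed interfaces rather than the authors' implementation):

import torch

def reverse_step_with_data_consistency(x, t, reverse_update, forward_op, adjoint_op, y,
                                        step_size=1.0):
    """One reverse-diffusion update followed by a gradient step on ||A x - y||^2.

    reverse_update(x, t)    : one step of the conditioned reverse diffusion sampler
    forward_op / adjoint_op : MRI forward operator A (coils + FFT + mask) and its adjoint A^H
    y                       : acquired undersampled k-space data
    Hedged sketch of interleaving generation with data consistency; the paper
    additionally embeds the MRI/qMRI physics as conditioning in the model itself.
    """
    # 1) Generative update from the diffusion model.
    x = reverse_update(x, t)
    # 2) Data-consistency update: the gradient of 0.5 * ||A x - y||^2 is A^H (A x - y).
    x = x - step_size * adjoint_op(forward_op(x) - y)
    return x
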
Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution
September 01, 2023
Charles Laroche, Andrés Almansa, Eva Coupete
Using diffusion models to solve inverse problems is a growing field of
research. Current methods assume the degradation to be known and provide
impressive results in terms of restoration quality and diversity. In this work,
we leverage the efficiency of those models to jointly estimate the restored
image and unknown parameters of the degradation model such as blur kernel. In
particular, we designed an algorithm based on the well-known
Expectation-Maximization (EM) estimation method and diffusion models. Our
method alternates between approximating the expected log-likelihood of the
inverse problem using samples drawn from a diffusion model and a maximization
step to estimate unknown model parameters. For the maximization step, we also
introduce a novel blur kernel regularization based on a Plug-and-Play denoiser.
Since diffusion models are slow to run, we also provide a fast version of our
algorithm. Extensive experiments on blind image deblurring demonstrate the
effectiveness of our method when compared to other state-of-the-art approaches.
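The alternating structure, a Monte Carlo E-step with diffusion posterior samples and an M-step that refits the blur kernel under a Plug-and-Play prior, can be sketched as follows (a schematic sketch; sample_posterior, apply_blur, and denoise_kernel are placeholder interfaces):

import torch

def diffusion_em_deblur(y, kernel_init, sample_posterior, apply_blur, denoise_kernel,
                        n_iters=10, n_samples=4, lr=1e-2):
    """Blind deblurring by alternating diffusion-based E-steps and kernel M-steps.

    y                : observed blurry image
    sample_posterior : draws x ~ p(x | y, kernel) with a diffusion model (E-step)
    apply_blur       : forward model, convolves an image with the kernel
    denoise_kernel   : Plug-and-Play denoiser acting as the kernel regularizer
    Hedged sketch of the EM structure, not the paper's exact algorithm.
    """
    kernel = kernel_init.clone().requires_grad_(True)
    for _ in range(n_iters):
        # E-step (Monte Carlo): approximate E_x[log p(y | x, kernel)] with a few
        # posterior samples drawn by the diffusion model.
        samples = [sample_posterior(y, kernel.detach()) for _ in range(n_samples)]

        # M-step: one gradient step maximizing the expected log-likelihood w.r.t. the kernel.
        loss = sum(torch.mean((apply_blur(x, kernel) - y) ** 2) for x in samples) / n_samples
        loss.backward()
        with torch.no_grad():
            kernel -= lr * kernel.grad
            kernel.grad = None
            kernel.copy_(denoise_kernel(kernel))  # Plug-and-Play regularization of the kernel
            kernel.clamp_(min=0)
            kernel /= kernel.sum()                # keep a valid blur kernel (non-negative, sums to 1)
    return kernel.detach()
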
DiffuGen: Adaptable Approach for Generating Labeled Image Datasets using Stable Diffusion Models
September 01, 2023
Michael Shenoda, Edward Kim
Generating high-quality labeled image datasets is crucial for training
accurate and robust machine learning models in the field of computer vision.
However, the process of manually labeling real images is often time-consuming
and costly. To address these challenges associated with dataset generation, we
introduce “DiffuGen,” a simple and adaptable approach that harnesses the power
of stable diffusion models to create labeled image datasets efficiently. By
leveraging stable diffusion models, our approach not only ensures the quality
of generated datasets but also provides a versatile solution for label
generation. In this paper, we present the methodology behind DiffuGen, which
combines the capabilities of diffusion models with two distinct labeling
techniques: unsupervised and supervised. Distinctively, DiffuGen employs prompt
templating for adaptable image generation and textual inversion to enhance
diffusion model capabilities.
Diffusion Model with Clustering-based Conditioning for Food Image Generation
September 01, 2023
Yue Han, Jiangpeng He, Mridul Gupta, Edward J. Delp, Fengqing Zhu
Image-based dietary assessment serves as an efficient and accurate solution
for recording and analyzing nutrition intake using eating occasion images as
input. Deep learning-based techniques are commonly used to perform image
analysis such as food classification, segmentation, and portion size
estimation, which rely on large amounts of food images with annotations for
training. However, such data dependency poses significant barriers to
real-world applications, because acquiring a substantial, diverse, and balanced
set of food images can be challenging. One potential solution is to use
synthetic food images for data augmentation. Although existing work has
explored the use of generative adversarial networks (GAN) based structures for
generation, the quality of synthetic food images still remains subpar. In
addition, while diffusion-based generative models have shown promising results
for general image generation tasks, the generation of food images can be
challenging due to the substantial intra-class variance. In this paper, we
investigate the generation of synthetic food images based on the conditional
diffusion model and propose an effective clustering-based training framework,
named ClusDiff, for generating high-quality and representative food images. The
proposed method is evaluated on the Food-101 dataset and shows improved
performance when compared with existing image generation works. We also
demonstrate that the synthetic food images generated by ClusDiff can help
address the severe class imbalance issue in long-tailed food classification
using the VFN-LT dataset.
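The clustering-based conditioning can be sketched as assigning each image a (class, cluster) condition before diffusion training, so that intra-class variance is split across finer conditions (an illustrative sketch; the feature extractor and the joint-label encoding are assumptions, not the ClusDiff specifics):

import torch
from sklearn.cluster import KMeans

def build_cluster_conditions(images, class_labels, feature_extractor, n_clusters=4):
    """Assign each training image a (class, cluster) condition for diffusion training.

    Illustrative sketch: cluster image features within each food class so the
    conditional diffusion model sees a finer-grained, lower-variance condition.
    """
    feats = feature_extractor(images).detach().cpu().numpy()   # (N, D) embeddings
    conditions = torch.zeros(len(images), dtype=torch.long)
    for c in class_labels.unique():
        idx = (class_labels == c).nonzero(as_tuple=True)[0]
        k = min(n_clusters, len(idx))
        cluster_ids = KMeans(n_clusters=k, n_init=10).fit_predict(feats[idx.numpy()])
        # Encode the joint (class, cluster) label as a single integer condition.
        conditions[idx] = c * n_clusters + torch.as_tensor(cluster_ids)
    return conditions
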
BuilDiff: 3D Building Shape Generation using Single-Image Conditional Point Cloud Diffusion Models
August 31, 2023
Yao Wei, George Vosselman, Michael Ying Yang
3D building generation with low data acquisition costs, such as single
image-to-3D, is becoming increasingly important. However, most of the existing
single image-to-3D building creation works are restricted to those images with
specific viewing angles, hence they are difficult to scale to general-view
images that commonly appear in practical cases. To fill this gap, we propose a
novel 3D building shape generation method exploiting point cloud diffusion
models with image conditioning schemes, which demonstrates flexibility to the
input images. By coupling two conditional diffusion models and introducing a
regularization strategy during the denoising process, our method is able to
synthesize building roofs while maintaining the overall structures. We validate
our framework on two newly built datasets and extensive experiments show that
our method outperforms previous works in terms of building generation quality.
August 31, 2023
Sirui Xu, Zhengyuan Li, Yu-Xiong Wang, Liang-Yan Gui
This paper addresses a novel task of anticipating 3D human-object
interactions (HOIs). Most existing research on HOI synthesis lacks
comprehensive whole-body interactions with dynamic objects, e.g., often limited
to manipulating small or static objects. Our task is significantly more
challenging, as it requires modeling dynamic objects with various shapes,
capturing whole-body motion, and ensuring physically valid interactions. To
this end, we propose InterDiff, a framework comprising two key steps: (i)
interaction diffusion, where we leverage a diffusion model to encode the
distribution of future human-object interactions; (ii) interaction correction,
where we introduce a physics-informed predictor to correct denoised HOIs in a
diffusion step. Our key insight is to inject prior knowledge that the
interactions under reference with respect to contact points follow a simple
pattern and are easily predictable. Experiments on multiple human-object
interaction datasets demonstrate the effectiveness of our method for this task,
capable of producing realistic, vivid, and remarkably long-term 3D HOI
predictions.
Diffusion Models for Interferometric Satellite Aperture Radar
August 31, 2023
Alexandre Tuel, Thomas Kerdreux, Claudia Hulbert, Bertrand Rouet-Leduc
Probabilistic Diffusion Models (PDMs) have recently emerged as a very
promising class of generative models, achieving high performance in natural
image generation. However, their performance relative to non-natural images,
like radar-based satellite data, remains largely unknown. Generating large
amounts of synthetic (and especially labelled) satellite data is crucial to
implement deep-learning approaches for the processing and analysis of
(interferometric) satellite aperture radar data. Here, we leverage PDMs to
generate several radar-based satellite image datasets. We show that PDMs
succeed in generating images with complex and realistic structures, but that
sampling time remains an issue. Indeed, accelerated sampling strategies, which
work well on simple image datasets like MNIST, fail on our radar datasets. We
provide a simple and versatile open-source codebase
(https://github.com/thomaskerdreux/PDM_SAR_InSAR_generation) to train, sample and
evaluate PDMs using any dataset on a single GPU.
Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models
August 31, 2023
Minheng Ni, Yabo Zhang, Kailai Feng, Xiaoming Li, Yiwen Guo, Wangmeng Zuo
Zero-shot referring image segmentation is a challenging task because it aims
to find an instance segmentation mask based on the given referring
descriptions, without training on this type of paired data. Current zero-shot
methods mainly focus on using pre-trained discriminative models (e.g., CLIP).
However, we have observed that generative models (e.g., Stable Diffusion) have
potentially understood the relationships between various visual elements and
text descriptions, which are rarely investigated in this task. In this work, we
introduce a novel Referring Diffusional segmentor (Ref-Diff) for this task,
which leverages the fine-grained multi-modal information from generative
models. We demonstrate that without a proposal generator, a generative model
alone can achieve comparable performance to existing SOTA weakly-supervised
models. When we combine both generative and discriminative models, our Ref-Diff
outperforms these competing methods by a significant margin. This indicates
that generative models are also beneficial for this task and can complement
discriminative models for better referring segmentation. Our code is publicly
available at https://github.com/kodenii/Ref-Diff.
Terrain Diffusion Network: Climatic-Aware Terrain Generation with Geological Sketch Guidance
August 31, 2023
Zexin Hu, Kun Hu, Clinton Mo, Lei Pan, Zhiyong Wang
Sketch-based terrain generation seeks to create realistic landscapes for
virtual environments in various applications such as computer games, animation
and virtual reality. Recently, deep learning based terrain generation has
emerged, notably the ones based on generative adversarial networks (GAN).
However, these methods often struggle to fulfill the requirements of flexible
user control and maintain generative diversity for realistic terrain.
Therefore, we propose a novel diffusion-based method, namely terrain diffusion
network (TDN), which actively incorporates user guidance for enhanced
controllability, taking into account terrain features like rivers, ridges,
basins, and peaks. Instead of adhering to a conventional monolithic denoising
process, which often compromises the fidelity of terrain details or the
alignment with user control, a multi-level denoising scheme is proposed to
generate more realistic terrains by taking into account fine-grained details,
particularly those related to climatic patterns influenced by erosion and
tectonic activities. Specifically, three terrain synthesisers are designed for
structural, intermediate, and fine-grained level denoising purposes, which
allow each synthesiser to concentrate on a distinct terrain aspect. Moreover, to
maximise the efficiency of our TDN, we further introduce terrain and sketch
latent spaces for the synthesizers with pre-trained terrain autoencoders.
Comprehensive experiments on a new dataset constructed from NASA Topology
Images clearly demonstrate the effectiveness of our proposed method, achieving
the state-of-the-art performance. Our code and dataset will be publicly
available.
Diffusion Inertial Poser: Human Motion Reconstruction from Arbitrary Sparse IMU Configurations
August 31, 2023
Tom Van Wouwe, Seunghwan Lee, Antoine Falisse, Scott Delp, C. Karen Liu
Motion capture from a limited number of inertial measurement units (IMUs) has
important applications in health, human performance, and virtual reality.
Real-world limitations and application-specific goals dictate different IMU
configurations (i.e., number of IMUs and chosen attachment body segments),
trading off accuracy and practicality. Although recent works were successful in
accurately reconstructing whole-body motion from six IMUs, these systems only
work with a specific IMU configuration. Here we propose a single diffusion
generative model, Diffusion Inertial Poser (DiffIP), which reconstructs human
motion in real-time from arbitrary IMU configurations. We show that DiffIP has
the benefit of flexibility with respect to the IMU configuration while being as
accurate as the state-of-the-art for the commonly used six IMU configuration.
Our system enables selecting an optimal configuration for different
applications without retraining the model. For example, when only four IMUs are
available, DiffIP found that the configuration that minimizes errors in joint
kinematics instruments the thighs and forearms. However, global translation
reconstruction is better when instrumenting the feet instead of the thighs.
Although our approach is agnostic to the underlying model, we built DiffIP
based on physiologically realistic musculoskeletal models to enable use in
biomedical research and health applications.
MFR-Net: Multi-faceted Responsive Listening Head Generation via Denoising Diffusion Model
August 31, 2023
Jin Liu, Xi Wang, Xiaomeng Fu, Yesheng Chai, Cai Yu, Jiao Dai, Jizhong Han
Face-to-face communication is a common scenario including roles of speakers
and listeners. Most existing research methods focus on producing speaker
videos, while the generation of listener heads remains largely overlooked.
Responsive listening head generation is an important task that aims to model
face-to-face communication scenarios by generating a listener head video given
a speaker video and a listener head image. An ideal generated responsive
listening video should respond to the speaker by expressing attitude or
viewpoint, while maintaining diversity in interaction patterns and accuracy in
listener identity information. To achieve this goal, we propose the
\textbf{M}ulti-\textbf{F}aceted \textbf{R}esponsive Listening Head Generation
Network (MFR-Net). Specifically, MFR-Net employs the probabilistic denoising
diffusion model to predict diverse head pose and expression features. In order
to perform multi-faceted response to the speaker video, while maintaining
accurate listener identity preservation, we design the Feature Aggregation
Module to boost listener identity features and fuse them with other
speaker-related features. Finally, a renderer finetuned with identity
consistency loss produces the final listening head videos. Our extensive
experiments demonstrate that MFR-Net achieves multi-faceted responses not only
in diversity and speaker identity information but also in attitude and
viewpoint expression.
Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images
August 31, 2023
Qingping Zheng, Yuanfan Guo, Jiankang Deng, Jianhua Han, Ying Li, Songcen Xu, Hang Xu
Stable diffusion, a generative model used in text-to-image synthesis,
frequently encounters resolution-induced composition problems when generating
images of varying sizes. This issue primarily stems from the model being
trained on pairs of single-scale images and their corresponding text
descriptions. Moreover, direct training on images of unlimited sizes is
unfeasible, as it would require an immense number of text-image pairs and
entail substantial computational expenses. To overcome these challenges, we
propose a two-stage pipeline named Any-Size-Diffusion (ASD), designed to
efficiently generate well-composed images of any size, while minimizing the
need for high-memory GPU resources. Specifically, the initial stage, dubbed Any
Ratio Adaptability Diffusion (ARAD), leverages a selected set of images with a
restricted range of ratios to optimize the text-conditional diffusion model,
thereby improving its ability to adjust composition to accommodate diverse
image sizes. To support the creation of images at any desired size, we further
introduce a technique called Fast Seamless Tiled Diffusion (FSTD) at the
subsequent stage. This method allows for the rapid enlargement of the ASD
output to any high-resolution size, avoiding seaming artifacts or memory
overloads. Experimental results on the LAION-COCO and MM-CelebA-HQ benchmarks
demonstrate that ASD can produce well-structured images of arbitrary sizes,
cutting down the inference time by 2x compared to the traditional tiled
algorithm.
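The tiled enlargement stage can be sketched as denoising overlapping tiles with the base model and averaging them back together, which is the generic mechanism that keeps seams from appearing (a simplified tiled-diffusion sketch assuming the tiles exactly cover the latent; it is not the FSTD implementation, which adds a fast seam-handling strategy):

import torch

def denoise_tiled(latent, t, denoiser, tile=64, overlap=16):
    """Apply a denoiser to overlapping tiles of a large latent and blend the results.

    latent  : (B, C, H, W) latent larger than the model's training resolution
    denoiser: model that denoises a (B, C, tile, tile) latent at timestep t
    Generic sketch with uniform blending weights; assumes (H - tile) and (W - tile)
    are multiples of the stride so the tiles cover the whole latent.
    """
    B, C, H, W = latent.shape
    out = torch.zeros_like(latent)
    weight = torch.zeros(1, 1, H, W)
    stride = tile - overlap
    for y in range(0, max(H - tile, 0) + 1, stride):
        for x in range(0, max(W - tile, 0) + 1, stride):
            patch = latent[:, :, y:y + tile, x:x + tile]
            out[:, :, y:y + tile, x:x + tile] += denoiser(patch, t)
            weight[:, :, y:y + tile, x:x + tile] += 1.0
    return out / weight.clamp(min=1.0)   # average overlapping predictions to hide seams
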
LightGrad: Lightweight Diffusion Probabilistic Model for Text-to-Speech
August 31, 2023
Jie Chen, Xingchen Song, Zhendong Peng, Binbin Zhang, Fuping Pan, Zhiyong Wu
Recent advances in neural text-to-speech (TTS) models bring thousands of TTS
applications into daily life, where models are deployed in the cloud to provide
services for customers. Among these models are diffusion probabilistic models
(DPMs), which can be stably trained and are more parameter-efficient compared
with other generative models. As transmitting data between customers and the
cloud introduces high latency and the risk of exposing private data, deploying
TTS models on edge devices is preferred. When implementing DPMs onto edge
devices, there are two practical problems. First, current DPMs are not
lightweight enough for resource-constrained devices. Second, DPMs require many
denoising steps in inference, which increases latency. In this work, we present
LightGrad, a lightweight DPM for TTS. LightGrad is equipped with a lightweight
U-Net diffusion decoder and a training-free fast sampling technique, reducing
both model parameters and inference latency. Streaming inference is also
implemented in LightGrad to reduce latency further. Compared with Grad-TTS,
LightGrad achieves a 62.2% reduction in parameters and a 65.7% reduction in latency,
while preserving comparable speech quality on both Chinese Mandarin and English
in 4 denoising steps.
Conditioning Score-Based Generative Models by Neuro-Symbolic Constraints
August 31, 2023
Davide Scassola, Sebastiano Saccani, Ginevra Carbone, Luca Bortolussi
Score-based and diffusion models have emerged as effective approaches for
both conditional and unconditional generation. Still, conditional generation is
based on either a specific training of a conditional model or classifier
guidance, which requires training a noise-dependent classifier, even when the
classifier for uncorrupted data is given. We propose an approach to sample from
unconditional score-based generative models enforcing arbitrary logical
constraints, without any additional training. Firstly, we show how to
manipulate the learned score in order to sample from an un-normalized
distribution conditional on a user-defined constraint. Then, we define a
flexible and numerically stable neuro-symbolic framework for encoding soft
logical constraints. Combining these two ingredients we obtain a general, but
approximate, conditional sampling algorithm. We further developed effective
heuristics aimed at improving the approximation. Finally, we show the
effectiveness of our approach for various types of constraints and data:
tabular data, images and time series.
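The score manipulation at the core of the method can be sketched in a few lines: add the gradient of a differentiable soft-constraint log-likelihood to the learned unconditional score at each sampling step (illustrative; the paper additionally defines a neuro-symbolic constraint language and stabilizing heuristics):

import torch

def constrained_score(x, t, score_model, constraint_logp, guidance_scale=1.0):
    """Score of the un-normalized conditional p(x) * exp(constraint_logp(x)).

    score_model     : unconditional score network s_theta(x, t) ~ grad_x log p_t(x)
    constraint_logp : differentiable soft logical constraint, larger = more satisfied
    Hedged sketch: combines the learned score with the constraint gradient.
    """
    x = x.detach().requires_grad_(True)
    logc = constraint_logp(x).sum()
    grad_c = torch.autograd.grad(logc, x)[0]
    return score_model(x, t) + guidance_scale * grad_c
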
Latent Painter
August 31, 2023
Shih-Chieh Su
Latent diffusers have revolutionized generative AI and inspired creative art.
When denoising the latent, the predicted original image at each step
collectively animates the formation. However, the animation is limited by the
denoising nature of the diffuser, and only renders a sharpening process. This
work presents Latent Painter, which uses the latent as the canvas, and the
diffuser predictions as the plan, to generate painting animation. Latent
Painter can also transition one generated image into another, including between
images from two different sets of checkpoints.
A Recycling Training Strategy for Medical Image Segmentation with Diffusion Denoising Models
August 30, 2023
Yunguan Fu, Yiwen Li, Shaheer U Saeed, Matthew J Clarkson, Yipeng Hu
Denoising diffusion models have found applications in image segmentation by
generating segmented masks conditioned on images. Existing studies
predominantly focus on adjusting model architecture or improving inference,
such as test-time sampling strategies. In this work, we focus on improving the
training strategy and propose a novel recycling method. During each training
step, a segmentation mask is first predicted given an image and a random noise.
This predicted mask, which replaces the conventional ground truth mask, is used
for the denoising task during training. This approach can be interpreted as
aligning the training strategy with inference by eliminating the dependence on
ground truth masks for generating noisy samples. Our proposed method
significantly outperforms standard diffusion training, self-conditioning, and
existing recycling strategies across multiple medical imaging data sets: muscle
ultrasound, abdominal CT, prostate MR, and brain MR. This holds for two widely
adopted sampling strategies: denoising diffusion probabilistic model and
denoising diffusion implicit model. Importantly, existing diffusion models
often display a declining or unstable performance during inference, whereas our
novel recycling consistently enhances or maintains performance. We show that,
under a fair comparison with the same network architectures and computing
budget, the proposed recycling-based diffusion models achieved on-par
performance with non-diffusion-based supervised training. By ensembling the
proposed diffusion and the non-diffusion models, significant improvements to
the non-diffusion models have been observed across all applications,
demonstrating the value of this novel training method. This paper summarizes
these quantitative results and discusses their values, with a fully
reproducible JAX-based implementation, released at
https://github.com/mathpluscode/ImgX-DiffSeg.
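The recycling step can be sketched as: first predict a mask from the image and pure noise, then build the noisy training input from that prediction instead of the ground-truth mask (a PyTorch-style sketch of the idea; the authors' reproducible implementation is the JAX code linked above, and the supervision target here is an assumption):

import torch

def recycling_training_step(image, gt_mask, model, add_noise, loss_fn, t_sampler):
    """One training step of the recycling strategy for diffusion-based segmentation.

    model(noisy_mask, image, t) predicts a segmentation mask (assumed interface).
    add_noise(mask, t) applies the forward diffusion at timestep t.
    Hedged sketch of the training-strategy idea described in the abstract.
    """
    t = t_sampler()                                    # random diffusion timestep
    # 1) "Recycling" pass: predict a mask from the image and pure noise (no GT involved),
    #    mirroring what the model sees at inference time.
    with torch.no_grad():
        pure_noise = torch.randn_like(gt_mask)
        pred_mask = model(pure_noise, image, t)

    # 2) Use the predicted mask, not the ground truth, to build the noisy input.
    noisy_mask = add_noise(pred_mask, t)

    # 3) Standard denoising objective (here assumed to be supervised by the GT mask).
    output = model(noisy_mask, image, t)
    return loss_fn(output, gt_mask)
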
Modality Cycles with Masked Conditional Diffusion for Unsupervised Anomaly Segmentation in MRI
August 30, 2023
Ziyun Liang, Harry Anthony, Felix Wagner, Konstantinos Kamnitsas
Unsupervised anomaly segmentation aims to detect patterns that are distinct
from any patterns processed during training, commonly called abnormal or
out-of-distribution patterns, without providing any associated manual
segmentations. Since anomalies during deployment can lead to model failure,
detecting the anomaly can enhance the reliability of models, which is valuable
in high-risk domains like medical imaging. This paper introduces Masked
Modality Cycles with Conditional Diffusion (MMCCD), a method that enables
segmentation of anomalies across diverse patterns in multimodal MRI. The method
is based on two fundamental ideas. First, we propose the use of cyclic modality
translation as a mechanism for enabling abnormality detection.
Image-translation models learn tissue-specific modality mappings, which are
characteristic of tissue physiology. Thus, these learned mappings fail to
translate tissues or image patterns that have never been encountered during
training, and the error enables their segmentation. Furthermore, we combine
image translation with a masked conditional diffusion model, which attempts to
‘imagine’ what tissue exists under a masked area, further exposing unknown
patterns as the generative model fails to recreate them. We evaluate our method
on a proxy task by training on healthy-looking slices of BraTS2021
multi-modality MRIs and testing on slices with tumors. We show that our method
compares favorably to previous unsupervised approaches based on image
reconstruction and denoising with autoencoders and diffusion models.
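The scoring principle, masking patches, letting the conditional generative model ‘imagine’ what should be underneath, and flagging what it cannot recreate, can be sketched as follows (a simplified illustration; the cyclic modality translation of MMCCD is folded into a single generic reconstruct call):

import torch

def anomaly_map(image, reconstruct, mask_size=16, stride=16):
    """Unsupervised anomaly map from masked reconstruction error.

    reconstruct(image, mask) : model that re-generates the masked area conditioned
                               on the visible context (e.g. a masked conditional
                               diffusion model, possibly after modality translation).
    Regions the generative model fails to recreate receive a high anomaly score.
    Hedged sketch of the scoring principle, not the MMCCD pipeline.
    """
    B, C, H, W = image.shape
    score = torch.zeros(B, 1, H, W)
    for y in range(0, H - mask_size + 1, stride):
        for x in range(0, W - mask_size + 1, stride):
            mask = torch.zeros(B, 1, H, W)
            mask[:, :, y:y + mask_size, x:x + mask_size] = 1.0   # area to be "imagined"
            recon = reconstruct(image, mask)
            err = (recon - image).abs().mean(dim=1, keepdim=True)
            score += err * mask                                  # keep the error inside the mask
    return score
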
SignDiff: Learning Diffusion Models for American Sign Language Production
August 30, 2023
Sen Fang, Chunyu Sui, Xuedong Zhang, Yapeng Tian
The field of Sign Language Production (SLP) lacked a large-scale, pre-trained
model based on deep learning for continuous American Sign Language (ASL)
production in the past decade. This limitation hampers communication for all
individuals with disabilities relying on ASL. To address this issue, we
undertook the secondary development and utilization of How2Sign, one of the
largest publicly available ASL datasets. Despite its significance, prior
researchers in the field of sign language have not effectively employed this
corpus due to the intricacies involved in American Sign Language Production
(ASLP).
To conduct large-scale ASLP, we propose SignDiff based on the latest work in
related fields, which is a dual-condition diffusion pre-training model that can
generate human sign language speakers from a skeleton pose. SignDiff has a
novel Frame Reinforcement Network called FR-Net, similar to dense human pose
estimation work, which enhances the correspondence between text lexical symbols
and sign language dense pose frames and reduces the occurrence of multiple fingers
in the diffusion model. In addition, our ASLP method proposes two new improved
modules and a new loss function to improve the accuracy and quality of sign
language skeletal posture and enhance the ability of the model to train on
large-scale data.
We propose the first baseline for ASL production and report BLEU-4 scores of
17.19 and 12.85 on the How2Sign dev/test sets. We also evaluated our
model on the previous mainstream dataset PHOENIX14T, and our main experiments
achieved SOTA results. In addition, our image quality far exceeds all previous
results by 10 percentage points on the SSIM indicator.
Finally, we conducted ablation studies and qualitative evaluations for
discussion.
RetroBridge: Modeling Retrosynthesis with Markov Bridges
August 30, 2023
Ilia Igashov, Arne Schneuing, Marwin Segler, Michael Bronstein, Bruno Correia
q-bio.QM, cs.LG, q-bio.BM
Retrosynthesis planning is a fundamental challenge in chemistry which aims at
designing reaction pathways from commercially available starting materials to a
target molecule. Each step in multi-step retrosynthesis planning requires
accurate prediction of possible precursor molecules given the target molecule
and confidence estimates to guide heuristic search algorithms. We model
single-step retrosynthesis planning as a distribution learning problem in a
discrete state space. First, we introduce the Markov Bridge Model, a generative
framework aimed to approximate the dependency between two intractable discrete
distributions accessible via a finite sample of coupled data points. Our
framework is based on the concept of a Markov bridge, a Markov process pinned
at its endpoints. Unlike diffusion-based methods, our Markov Bridge Model does
not need a tractable noise distribution as a sampling proxy and directly
operates on the input product molecules as samples from the intractable prior
distribution. We then address the retrosynthesis planning problem with our
novel framework and introduce RetroBridge, a template-free retrosynthesis
modeling approach that achieves state-of-the-art results on standard evaluation
benchmarks.
DiffuVolume: Diffusion Model for Volume based Stereo Matching
August 30, 2023
Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, Wei-shi Zheng
Stereo matching is a significant part in many computer vision tasks and
driving-based applications. Recently cost volume-based methods have achieved
great success benefiting from the rich geometry information in paired images.
However, the redundancy of cost volume also interferes with the model training
and limits the performance. To construct a more precise cost volume, we
pioneer the application of diffusion models to stereo matching. Our method, termed
DiffuVolume, considers the diffusion model as a cost volume filter, which will
recurrently remove the redundant information from the cost volume. Two main
designs make our method not trivial. Firstly, to make the diffusion model more
adaptive to stereo matching, we eschew the traditional manner of directly
adding noise into the image but embed the diffusion model into a task-specific
module. In this way, we outperform the traditional diffusion stereo matching
method by 22% EPE improvement and 240 times inference acceleration. Secondly,
DiffuVolume can be easily embedded into any volume-based stereo matching
network, boosting performance with only a slight rise in parameters (about 2%). By adding
the DiffuVolume into well-performed methods, we outperform all the published
methods on Scene Flow, KITTI2012, KITTI2015 benchmarks and zero-shot
generalization setting. It is worth mentioning that the proposed model ranks
1st on the KITTI 2012 leaderboard and 2nd on the KITTI 2015 leaderboard as of
July 15, 2023.
Elucidating the Exposure Bias in Diffusion Models
August 29, 2023
Mang Ning, Mingxiao Li, Jianlin Su, Albert Ali Salah, Itir Onal Ertugrul
Diffusion models have demonstrated impressive generative capabilities, but
their \textit{exposure bias} problem, described as the input mismatch between
training and sampling, lacks in-depth exploration. In this paper, we
systematically investigate the exposure bias problem in diffusion models by
first analytically modelling the sampling distribution, based on which we then
attribute the prediction error at each sampling step as the root cause of the
exposure bias issue. Furthermore, we discuss potential solutions to this issue
and propose an intuitive metric for it. Along with the elucidation of exposure
bias, we propose a simple, yet effective, training-free method called Epsilon
Scaling to alleviate the exposure bias. We show that Epsilon Scaling explicitly
moves the sampling trajectory closer to the vector field learned in the
training phase by scaling down the network output (Epsilon), mitigating the
input mismatch between training and sampling. Experiments on various diffusion
frameworks (ADM, DDPM/DDIM, EDM, LDM), unconditional and conditional settings,
and deterministic vs. stochastic sampling verify the effectiveness of our
method. Remarkably, our ADM-ES, as a SOTA stochastic sampler, obtains 2.17 FID
on CIFAR-10 under 100-step unconditional generation. The code is available at
\url{https://github.com/forever208/ADM-ES} and
\url{https://github.com/forever208/EDM-ES}.
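Because Epsilon Scaling is training-free, it amounts to a small change inside the sampler: divide the predicted noise by a factor slightly above one before taking the update (an illustrative DDPM-style step; the constant scale used here stands in for the schedule studied in the paper):

import torch

@torch.no_grad()
def ddpm_step_with_epsilon_scaling(x_t, t, model, alphas_cumprod, betas, scale=1.005):
    """One DDPM reverse step where the predicted noise is scaled down (Epsilon Scaling).

    scale > 1 shrinks the network output, pulling the sampling trajectory back
    toward the vector field seen during training and mitigating exposure bias.
    Illustrative sketch; choosing the scaling factor/schedule follows the paper.
    """
    eps = model(x_t, t) / scale                        # <-- the Epsilon Scaling change
    alpha_bar = alphas_cumprod[t]
    alpha = 1.0 - betas[t]
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bar) * eps) / torch.sqrt(alpha)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise
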
Data-iterative Optimization Score Model for Stable Ultra-Sparse-View CT Reconstruction
August 28, 2023
Weiwen Wu, Yanyang Wang
Score-based generative models (SGMs) have gained prominence in sparse-view CT
reconstruction for their precise sampling of complex distributions. In
SGM-based reconstruction, data consistency in the score-based diffusion model
ensures close adherence of generated samples to observed data distribution,
crucial for improving image quality. Shortcomings in data consistency
characterization manifest in three aspects. Firstly, data from the optimization
process can lead to artifacts in reconstructed images. Secondly, it often
neglects that the generation model and original data constraints are
independently completed, fragmenting unity. Thirdly, it predominantly focuses
on constraining intermediate results in the inverse sampling process, rather
than ideal real images. Thus, we propose an iterative optimization data scoring
model. This paper introduces the data-iterative optimization score-based model
(DOSM), integrating innovative data consistency into the Stochastic
Differential Equation, a valuable constraint for ultra-sparse-view CT
reconstruction. The novelty of this data consistency element lies in its sole
reliance on original measurement data to confine generation outcomes,
effectively balancing measurement data and generative model constraints.
Additionally, we pioneer an inference strategy that traces back from current
iteration results to ideal truth, enhancing reconstruction stability. We
leverage conventional iteration techniques to optimize DOSM updates.
Quantitative and qualitative results from 23 views of numerical and clinical
cardiac datasets demonstrate DOSM’s superiority over other methods. Remarkably,
even with 10 views, our method achieves excellent performance.
DiffSmooth: Certifiably Robust Learning via Diffusion Models and Local Smoothing
August 28, 2023
Jiawei Zhang, Zhongzhu Chen, Huan Zhang, Chaowei Xiao, Bo Li
Diffusion models have been leveraged to perform adversarial purification and
thus provide both empirical and certified robustness for a standard model. On
the other hand, different robustly trained smoothed models have been studied to
improve the certified robustness. Thus, it raises a natural question: Can
diffusion model be used to achieve improved certified robustness on those
robustly trained smoothed models? In this work, we first theoretically show
that recovered instances by diffusion models are in the bounded neighborhood of
the original instance with high probability; and the “one-shot” denoising
diffusion probabilistic models (DDPM) can approximate the mean of the generated
distribution of a continuous-time diffusion model, which approximates the
original instance under mild conditions. Inspired by our analysis, we propose a
certifiably robust pipeline DiffSmooth, which first performs adversarial
purification via diffusion models and then maps the purified instances to a
common region via a simple yet effective local smoothing strategy. We conduct
extensive experiments on different datasets and show that DiffSmooth achieves
SOTA-certified robustness compared with eight baselines. For instance,
DiffSmooth improves the SOTA-certified accuracy from $36.0\%$ to $53.0\%$ under
$\ell_2$ radius $1.5$ on ImageNet. The code is available at
[https://github.com/javyduck/DiffSmooth].
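The two-stage pipeline, diffusion purification followed by local smoothing, can be outlined as below (a simplified sketch of the prediction path only; purify and base_classifier are assumed interfaces, and the randomized-smoothing certification wrapper is omitted):

import torch

def diffsmooth_predict(x, purify, base_classifier, n_local=10, local_sigma=0.25):
    """Predict a label with diffusion purification followed by local smoothing.

    x               : a single image tensor of shape (C, H, W)
    purify          : maps a (possibly adversarial) input back toward the data
                      manifold using a diffusion model
    base_classifier : returns class logits for a batch of images
    Hedged sketch of the pipeline structure, not the certified procedure.
    """
    x_pur = purify(x)                                  # adversarial purification
    # Local smoothing: vote over small Gaussian perturbations around the purified input.
    noisy = x_pur.unsqueeze(0) + local_sigma * torch.randn(n_local, *x_pur.shape)
    logits = base_classifier(noisy)                    # (n_local, num_classes)
    votes = logits.argmax(dim=-1)
    return torch.mode(votes, dim=0).values             # majority-vote prediction
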
Voice Conversion with Denoising Diffusion Probabilistic GAN Models
August 28, 2023
Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao
Voice conversion is a method that allows for the transformation of speaking
style while maintaining the integrity of linguistic information. There are many
researchers using deep generative models for voice conversion tasks. Generative
Adversarial Networks (GANs) can quickly generate high-quality samples, but the
generated samples lack diversity. The samples generated by the Denoising
Diffusion Probabilistic Models (DDPMs) are better than GANs in terms of mode
coverage and sample diversity. But the DDPMs have high computational costs and
the inference speed is slower than GANs. In order to make GANs and DDPMs more
practical, we propose DiffGAN-VC, a variant of GANs and DDPMs, to achieve
non-parallel many-to-many voice conversion (VC). We use large steps to achieve
denoising, and also introduce a multimodal conditional GAN to model the
denoising diffusion generative adversarial network. According to both objective
and subjective evaluation experiments, DiffGAN-VC has been shown to achieve
high voice quality on non-parallel data sets. Compared with the CycleGAN-VC
method, DiffGAN-VC achieves higher speaker similarity, naturalness, and sound
quality.
Score-Based Generative Models for PET Image Reconstruction
August 27, 2023
Imraj RD Singh, Alexander Denker, Riccardo Barbano, Željko Kereta, Bangti Jin, Kris Thielemans, Peter Maass, Simon Arridge
eess.IV, cs.AI, cs.CV, cs.LG, 15A29, 45Q05, I.4.9; J.2; I.2.1
Score-based generative models have demonstrated highly promising results for
medical image reconstruction tasks in magnetic resonance imaging or computed
tomography. However, their application to Positron Emission Tomography (PET) is
still largely unexplored. PET image reconstruction involves a variety of
challenges, including Poisson noise with high variance and a wide dynamic
range. To address these challenges, we propose several PET-specific adaptations
of score-based generative models. The proposed framework is developed for both
2D and 3D PET. In addition, we provide an extension to guided reconstruction
using magnetic resonance images. We validate the approach through extensive 2D
and 3D $\textit{in-silico}$ experiments with a model trained on
patient-realistic data without lesions, and evaluate on data without lesions as
well as out-of-distribution data with lesions. This demonstrates the proposed
method’s robustness and significant potential for improved PET reconstruction.
Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective
August 27, 2023
Davide Ghio, Yatin Dandi, Florent Krzakala, Lenka Zdeborová
cond-mat.dis-nn, cond-mat.stat-mech, cs.LG
Recent years witnessed the development of powerful generative models based on
flows, diffusion or autoregressive neural networks, achieving remarkable
success in generating data from examples with applications in a broad range of
areas. A theoretical analysis of the performance and understanding of the
limitations of these methods remain, however, challenging. In this paper, we
undertake a step in this direction by analysing the efficiency of sampling by
these methods on a class of problems with a known probability distribution and
comparing it with the sampling performance of more traditional methods such as
the Monte Carlo Markov chain and Langevin dynamics. We focus on a class of
probability distributions widely studied in the statistical physics of
disordered systems that relate to spin glasses, statistical inference and
constraint satisfaction problems.
We leverage the fact that sampling via flow-based, diffusion-based or
autoregressive network methods can be equivalently mapped to the analysis of a
Bayes optimal denoising of a modified probability measure. Our findings
demonstrate that these methods encounter difficulties in sampling stemming from
the presence of a first-order phase transition along the algorithm’s denoising
path. Our conclusions go both ways: we identify regions of parameters where
these methods are unable to sample efficiently, while that is possible using
standard Monte Carlo or Langevin approaches. We also identify regions where the
opposite happens: standard approaches are inefficient while the discussed
generative methods work well.
DiffI2I: Efficient Diffusion Model for Image-to-Image Translation
August 26, 2023
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Radu Timofte, Luc Van Gool
The Diffusion Model (DM) has emerged as the SOTA approach for image
synthesis. However, the existing DM cannot perform well on some image-to-image
translation (I2I) tasks. Different from image synthesis, some I2I tasks, such
as super-resolution, require generating results in accordance with GT images.
Traditional DMs for image synthesis require extensive iterations and large
denoising models to estimate entire images, which gives their strong generative
ability but also leads to artifacts and inefficiency for I2I. To tackle this
challenge, we propose a simple, efficient, and powerful DM framework for I2I,
called DiffI2I. Specifically, DiffI2I comprises three key components: a compact
I2I prior extraction network (CPEN), a dynamic I2I transformer (DI2Iformer),
and a denoising network. We train DiffI2I in two stages: pretraining and DM
training. For pretraining, GT and input images are fed into CPEN$_{S1}$ to
capture a compact I2I prior representation (IPR) guiding the DI2Iformer. In the
second stage, the DM is trained to use only the input images to estimate the
same IPR as CPEN$_{S1}$. Compared to traditional DMs, the compact IPR enables
DiffI2I to obtain more accurate outcomes and employ a lighter denoising network
and fewer iterations. Through extensive experiments on various I2I tasks, we
demonstrate that DiffI2I achieves SOTA performance while significantly reducing
computational burdens.
Residual Denoising Diffusion Models
August 25, 2023
Jiawei Liu, Qiang Wang, Huijie Fan, Yinong Wang, Yandong Tang, Liangqiong Qu
We propose residual denoising diffusion models (RDDM), a novel dual diffusion
process that decouples the traditional single denoising diffusion process into
residual diffusion and noise diffusion. This dual diffusion framework expands
the denoising-based diffusion models, initially uninterpretable for image
restoration, into a unified and interpretable model for both image generation
and restoration by introducing residuals. Specifically, our residual diffusion
represents directional diffusion from the target image to the degraded input
image and explicitly guides the reverse generation process for image
restoration, while noise diffusion represents random perturbations in the
diffusion process. The residual prioritizes certainty, while the noise
emphasizes diversity, enabling RDDM to effectively unify tasks with varying
certainty or diversity requirements, such as image generation and restoration.
We demonstrate that our sampling process is consistent with that of DDPM and
DDIM through coefficient transformation, and propose a partially
path-independent generation process to better understand the reverse process.
Notably, our RDDM enables a generic UNet, trained with only an $\ell _1$ loss
and a batch size of 1, to compete with state-of-the-art image restoration
methods. We provide code and pre-trained models to encourage further
exploration, application, and development of our innovative framework
(https://github.com/nachifur/RDDM).
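The dual forward process can be sketched by mixing a residual term (pointing from the target image toward the degraded input) and a noise term, each with its own cumulative schedule (a rough sketch of the decomposition described above; the schedules here are illustrative placeholders):

import torch

def rddm_forward(x0, x_degraded, t, alpha_bar, beta_bar):
    """Forward process of a residual + noise dual diffusion (RDDM-style sketch).

    x0         : clean target image
    x_degraded : degraded input image (defines the residual direction)
    alpha_bar  : cumulative residual schedule in [0, 1]
    beta_bar   : cumulative noise schedule controlling injected randomness
    Illustrative sketch: x_t drifts from x0 toward x_degraded while noise is added.
    """
    residual = x_degraded - x0                   # certainty-carrying direction
    noise = torch.randn_like(x0)                 # diversity-carrying perturbation
    x_t = x0 + alpha_bar[t] * residual + beta_bar[t] * noise
    return x_t, residual, noise
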
Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model
August 25, 2023
Xunpeng Yi, Han Xu, Hao Zhang, Linfeng Tang, Jiayi Ma
In this paper, we rethink the low-light image enhancement task and propose a
physically explainable and generative diffusion model for low-light image
enhancement, termed as Diff-Retinex. We aim to integrate the advantages of the
physical model and the generative network. Furthermore, we hope to supplement
and even deduce the information missing in the low-light image through the
generative network. Therefore, Diff-Retinex formulates the low-light image
enhancement problem into Retinex decomposition and conditional image
generation. In the Retinex decomposition, we integrate the superiority of
attention in Transformer and meticulously design a Retinex Transformer
decomposition network (TDN) to decompose the image into illumination and
reflectance maps. Then, we design multi-path generative diffusion networks to
reconstruct the normal-light Retinex probability distribution and solve the
various degradations in these components respectively, including dark
illumination, noise, color deviation, loss of scene contents, etc. Owing to
generative diffusion model, Diff-Retinex puts the restoration of low-light
subtle detail into practice. Extensive experiments conducted on real-world
low-light datasets qualitatively and quantitatively demonstrate the
effectiveness, superiority, and generalization of the proposed method.
A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions
August 25, 2023
Tianyi Zhang, Zheng Wang, Jing Huang, Mohiuddin Muhammad Tasnim, Wei Shi
Recently, there has been significant progress in the development of large
models. Following the success of ChatGPT, numerous language models have been
introduced, demonstrating remarkable performance. Similar advancements have
also been observed in image generation models, such as Google’s Imagen model,
OpenAI’s DALL-E 2, and stable diffusion models, which have exhibited impressive
capabilities in generating images. However, similar to large language models,
these models still encounter unresolved challenges. Fortunately, the
availability of open-source stable diffusion models and their underlying
mathematical principles has enabled the academic community to extensively
analyze the performance of current image generation models and make
improvements based on this stable diffusion framework. This survey aims to
examine the existing issues and the current solutions pertaining to image
generation models.
Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion
August 23, 2023
Junjiao Tian, Lavisha Aggarwal, Andrea Colaco, Zsolt Kira, Mar Gonzalez-Franco
Producing quality segmentation masks for images is a fundamental problem in
computer vision. Recent research has explored large-scale supervised training
to enable zero-shot segmentation on virtually any image style and unsupervised
training to enable segmentation without dense annotations. However,
constructing a model capable of segmenting anything in a zero-shot manner
without any annotations is still challenging. In this paper, we propose to
utilize the self-attention layers in stable diffusion models to achieve this
goal because the pre-trained stable diffusion model has learned inherent
concepts of objects within its attention layers. Specifically, we introduce a
simple yet effective iterative merging process based on measuring KL divergence
among attention maps to merge them into valid segmentation masks. The proposed
method does not require any training or language dependency to extract quality
segmentation for any images. On COCO-Stuff-27, our method surpasses the prior
unsupervised zero-shot SOTA method by an absolute 26% in pixel accuracy and 17%
in mean IoU. The project page is at
\url{https://sites.google.com/view/diffseg/home}.
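The merging rule can be sketched as: treat each self-attention map as a distribution over spatial locations and iteratively merge pairs whose symmetric KL divergence falls below a threshold (a compact sketch of the merging loop only; attention extraction from Stable Diffusion and the multi-resolution aggregation are abstracted away):

import torch

def kl_merge(attn_maps, threshold=1.0, eps=1e-8):
    """Iteratively merge attention maps with small symmetric KL divergence.

    attn_maps : (N, H*W) tensor, each row a non-negative attention map
    Returns a list of merged maps (proto segmentation masks).
    Sketch of the KL-based merging idea; anchor selection used in the paper is omitted.
    """
    maps = [m / (m.sum() + eps) for m in attn_maps]        # normalize to distributions
    merged = True
    while merged and len(maps) > 1:
        merged = False
        for i in range(len(maps)):
            for j in range(i + 1, len(maps)):
                p, q = maps[i], maps[j]
                kl = torch.sum(p * torch.log((p + eps) / (q + eps))) \
                   + torch.sum(q * torch.log((q + eps) / (p + eps)))
                if kl < threshold:                         # similar enough -> merge
                    fused = (p + q) / 2.0
                    maps = [m for k, m in enumerate(maps) if k not in (i, j)] + [fused]
                    merged = True
                    break
            if merged:
                break
    return maps
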
Renormalizing Diffusion Models
August 23, 2023
Jordan Cotler, Semon Rezchikov
We explain how to use diffusion models to learn inverse renormalization group
flows of statistical and quantum field theories. Diffusion models are a class
of machine learning models which have been used to generate samples from
complex distributions, such as the distribution of natural images. These models
achieve sample generation by learning the inverse process to a diffusion
process which adds noise to the data until the distribution of the data is pure
noise. Nonperturbative renormalization group schemes in physics can naturally
be written as diffusion processes in the space of fields. We combine these
observations in a concrete framework for building ML-based models for studying
field theories, in which the models learn the inverse process to an
explicitly-specified renormalization group scheme. We detail how these models
define a class of adaptive bridge (or parallel tempering) samplers for lattice
field theory. Because renormalization group schemes have a physical meaning, we
provide explicit prescriptions for how to compare results derived from models
associated to several different renormalization group schemes of interest. We
also explain how to use diffusion models in a variational method to find ground
states of quantum systems. We apply some of our methods to numerically find RG
flows of interacting statistical field theories. From the perspective of
machine learning, our work provides an interpretation of multiscale diffusion
models, and gives physically-inspired suggestions for diffusion models which
should have novel properties.
Improving Generative Model-based Unfolding with Schrödinger Bridges
August 23, 2023
Sascha Diefenbacher, Guan-Horng Liu, Vinicius Mikuni, Benjamin Nachman, Weili Nie
Machine learning-based unfolding has enabled unbinned and high-dimensional
differential cross section measurements. Two main approaches have emerged in
this research area: one based on discriminative models and one based on
generative models. The main advantage of discriminative models is that they
learn a small correction to a starting simulation while generative models scale
better to regions of phase space with little data. We propose to use
Schrödinger Bridges and diffusion models to create SBUnfold, an unfolding
approach that combines the strengths of both discriminative and generative
models. The key feature of SBUnfold is that its generative model maps one set
of events into another without having to go through a known probability density
as is the case for normalizing flows and standard diffusion models. We show
that SBUnfold achieves excellent performance compared to state-of-the-art
methods on a synthetic Z+jets dataset.
August 23, 2023
Giovanni Conforti, Alain Durmus, Marta Gentiloni Silveri
math.ST, stat.ML, stat.TH
Diffusion models are a new class of generative models that revolve around the
estimation of the score function associated with a stochastic differential
equation. Subsequent to its acquisition, the approximated score function is
then harnessed to simulate the corresponding time-reversal process, ultimately
enabling the generation of approximate data samples. Despite the evident
practical significance these models carry, a notable challenge persists in the
form of a lack of comprehensive quantitative results, especially in scenarios
involving non-regular scores and estimators. In almost all reported bounds in
Kullback Leibler (KL) divergence, it is assumed that either the score function
or its approximation is Lipschitz uniformly in time. However, this condition is
very restrictive in practice or appears to be difficult to establish.
To circumvent this issue, previous works mainly focused on establishing
convergence bounds in KL for an early stopped version of the diffusion model
and a smoothed version of the data distribution, or assuming that the data
distribution is supported on a compact manifold. These explorations have led
to interesting bounds in either Wasserstein or Fortet-Mourier metrics. However,
the question remains about the relevance of such early-stopping procedures or
compactness conditions, and in particular whether there exists a natural and
mild condition ensuring explicit and sharp convergence bounds in KL.
In this article, we tackle the aforementioned limitations by focusing on
score diffusion models with fixed step size stemming from the
Ornstein-Uhlenbeck semigroup and its kinetic counterpart. Our study provides a
rigorous analysis, yielding simple, improved and sharp convergence bounds in KL
applicable to any data distribution with finite Fisher information with respect
to the standard Gaussian distribution.
August 23, 2023
Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, Quanquan Gu
The recent surge of generative AI has been fueled by the generative power of
diffusion probabilistic models and the scalable capabilities of large language
models. Despite their potential, it remains elusive whether diffusion language
models can solve general language tasks comparable to their autoregressive
counterparts. This paper demonstrates that scaling diffusion models w.r.t.
data, sizes, and tasks can effectively make them strong language learners. We
build competent diffusion language models at scale by first acquiring knowledge
from massive data via masked language modeling pretraining thanks to their
intrinsic connections. We then reprogram pretrained masked language models into
diffusion language models via diffusive adaptation, wherein task-specific
finetuning and instruction finetuning are explored to unlock their versatility
in solving general language tasks. Experiments show that scaling diffusion
language models consistently improves performance across downstream language
tasks. We further discover that instruction finetuning can elicit zero-shot and
few-shot in-context learning abilities that help tackle many unseen tasks by
following natural language instructions, and show promise in advanced and
challenging abilities such as reasoning.
Quantum-Noise-driven Generative Diffusion Models
August 23, 2023
Marco Parigi, Stefano Martina, Filippo Caruso
quant-ph, cond-mat.dis-nn, cs.AI, cs.LG, stat.ML
Generative models realized with machine learning techniques are powerful
tools to infer complex and unknown data distributions from a finite number of
training samples in order to produce new synthetic data. Diffusion models are
an emerging framework that has recently surpassed the performance of
generative adversarial networks in creating synthetic text and high-quality
images. Here, we propose and discuss the quantum generalization of diffusion
models, i.e., three quantum-noise-driven generative diffusion models that could
be experimentally tested on real quantum systems. The idea is to harness unique
quantum features, in particular the non-trivial interplay among coherence,
entanglement and noise that the currently available noisy quantum processors do
unavoidably suffer from, in order to overcome the main computational burdens of
classical diffusion models during inference. Hence, we suggest to exploit
quantum noise not as an issue to be detected and solved but instead as a very
remarkably beneficial key ingredient to generate much more complex probability
distributions that would be difficult or even impossible to express
classically, and from which a quantum processor might sample more efficiently
than a classical one. An example of numerical simulations for a hybrid
classical-quantum generative diffusion model is also included. Therefore, our
results are expected to pave the way for new quantum-inspired or quantum-based
generative diffusion algorithms that more powerfully address classical tasks
such as data generation/prediction, with widespread real-world applications ranging from
climate forecasting to neuroscience, from traffic flow analysis to financial
forecasting.
Audio Generation with Multiple Conditional Diffusion Model
August 23, 2023
Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
cs.SD, cs.CL, cs.LG, eess.AS
Text-based audio generation models have limitations as they cannot encompass
all the information in audio, leading to restricted controllability when
relying solely on text. To address this issue, we propose a novel model that
enhances the controllability of existing pre-trained text-to-audio models by
incorporating additional conditions including content (timestamp) and style
(pitch contour and energy contour) as supplements to the text. This approach
achieves fine-grained control over the temporal order, pitch, and energy of
generated audio. To preserve the diversity of generation, we employ a trainable
control condition encoder that is enhanced by a large language model and a
trainable Fusion-Net to encode and fuse the additional conditions while keeping
the weights of the pre-trained text-to-audio model frozen. Due to the lack of
suitable datasets and evaluation metrics, we consolidate existing datasets into
a new dataset comprising the audio and corresponding conditions and use a
series of evaluation metrics to evaluate the controllability performance.
Experimental results demonstrate that our model successfully achieves
fine-grained control to accomplish controllable audio generation. Audio samples
and our dataset are publicly available at
https://conditionaudiogen.github.io/conditionaudiogen/
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion
August 23, 2023
Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro
cs.CV, cs.AI, cs.MM, eess.IV
Speech-driven 3D facial animation has gained significant attention for its
ability to create realistic and expressive facial animations in 3D space based
on speech. Learning-based methods have shown promising progress in achieving
accurate facial motion synchronized with speech. However, the one-to-many nature of
speech-to-3D facial synthesis has not been fully explored: while the lip
accurately synchronizes with the speech content, other facial attributes beyond
speech-related motions are variable with respect to the speech. To account for
the potential variance in the facial attributes within a single speech, we
propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
DF-3DFace captures the complex one-to-many relationships between speech and 3D
face based on diffusion. It concurrently achieves aligned lip motion by
exploiting audio-mesh synchronization and masked conditioning. Furthermore, the
proposed method jointly models identity and pose in addition to facial motions
so that it can generate 3D face animation without requiring a reference
identity mesh and produce natural head poses. We contribute a new large-scale
3D facial mesh dataset, 3D-HDTF, to enable the synthesis of variations in
identities, poses, and facial motions of 3D face meshes. Extensive experiments
demonstrate that our method successfully generates highly variable facial
shapes and motions from speech and simultaneously achieves more realistic
facial animation than the state-of-the-art methods.
MatFuse: Controllable Material Generation with Diffusion Models
August 22, 2023
Giuseppe Vecchio, Renato Sortino, Simone Palazzo, Concetto Spampinato
Creating high-quality materials in computer graphics is a challenging and
time-consuming task, which requires great expertise. To simplify this process, we
introduce MatFuse, a unified approach that harnesses the generative power of
diffusion models to simplify the creation of SVBRDF maps. Our pipeline
integrates multiple sources of conditioning, including color palettes,
sketches, text, and pictures, for a fine-grained control and flexibility in
material synthesis. This design enables the combination of diverse information
sources (e.g., sketch + text), enhancing creative possibilities in line with
the principle of compositionality. Additionally, we propose a multi-encoder
compression model with a two-fold purpose: it improves reconstruction
performance by learning a separate latent representation for each map and
enables map-level material editing capabilities. We demonstrate the
effectiveness of MatFuse under multiple conditioning settings and explore the
potential of material editing. We also quantitatively assess the quality of the
generated materials in terms of CLIP-IQA and FID scores. Source code for
training MatFuse will be made publicly available at
https://gvecchio.com/matfuse.
DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment
August 22, 2023
Xujie Zhang, Binbin Yang, Michael C. Kampffmeyer, Wenqing Zhang, Shiyue Zhang, Guansong Lu, Liang Lin, Hang Xu, Xiaodan Liang
Cross-modal garment synthesis and manipulation will significantly benefit the
way fashion designers generate garments and modify their designs via flexible
linguistic interfaces. Current approaches follow the general text-to-image
paradigm and mine cross-modal relations via simple cross-attention modules,
neglecting the structural correspondence between visual and textual
representations in the fashion design domain. In this work, we instead
introduce DiffCloth, a diffusion-based pipeline for cross-modal garment
synthesis and manipulation, which empowers diffusion models with flexible
compositionality in the fashion domain by structurally aligning the cross-modal
semantics. Specifically, we formulate the part-level cross-modal alignment as a
bipartite matching problem between the linguistic Attribute-Phrases (AP) and
the visual garment parts which are obtained via constituency parsing and
semantic segmentation, respectively. To mitigate the issue of attribute
confusion, we further propose a semantic-bundled cross-attention to preserve
the spatial structure similarities between the attention maps of attribute
adjectives and part nouns in each AP. Moreover, DiffCloth allows for
manipulation of the generated results by simply replacing APs in the text
prompts. The manipulation-irrelevant regions are recognized by blended masks
obtained from the bundled attention maps of the APs and kept unchanged.
Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth
both yields state-of-the-art garment synthesis results by leveraging the
inherent structural information and supports flexible manipulation with region
consistency.
Hey That’s Mine Imperceptible Watermarks are Preserved in Diffusion Generated Outputs
August 22, 2023
Luke Ditria, Tom Drummond
Generative models have seen an explosion in popularity with the release of
huge generative Diffusion models like Midjourney and Stable Diffusion to the
public. Because of this new ease of access, questions surrounding the automated
collection of data and issues regarding content ownership have started to
build. In this paper we present new work which aims to provide ways of
protecting content when shared to the public. We show that a generative
Diffusion model trained on data that has been imperceptibly watermarked will
generate new images with these watermarks present. We further show that if a
given watermark is correlated with a certain feature of the training data, the
generated images will also have this correlation. Using statistical tests we
show that we are able to determine whether a model has been trained on marked
data, and what data was marked. As a result our system offers a solution to
protect intellectual property when sharing content online.
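To make the detection idea concrete, here is a generic one-sided permutation test comparing watermark-detector scores on generated images with scores on reference images; this is an illustrative stand-in, not necessarily the statistical test the authors use.
```python
import numpy as np

def watermark_permutation_pvalue(scores_generated, scores_reference, n_perm=10_000, seed=0):
    """One-sided permutation test: are detector scores on generated images
    systematically higher than on watermark-free reference images?"""
    rng = np.random.default_rng(seed)
    observed = scores_generated.mean() - scores_reference.mean()
    pooled = np.concatenate([scores_generated, scores_reference])
    n = len(scores_generated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if pooled[:n].mean() - pooled[n:].mean() >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # a small p-value suggests the model saw marked data
```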
Diffusion Model as Representation Learner
August 21, 2023
Xingyi Yang, Xinchao Wang
Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive
results on various generative tasks. Despite this promise, however, the learned
representations of pre-trained DPMs have not been fully understood.
In this paper, we conduct an in-depth investigation of the representation power
of DPMs, and propose a novel knowledge transfer method that leverages the
knowledge acquired by generative DPMs for recognition tasks. Our study begins
by examining the feature space of DPMs, revealing that DPMs are inherently
denoising autoencoders that balance the representation learning with
regularizing model capacity. To this end, we introduce a novel knowledge
transfer paradigm named RepFusion. Our paradigm extracts representations at
different time steps from off-the-shelf DPMs and dynamically employs them as
supervision for student networks, in which the optimal time is determined
through reinforcement learning. We evaluate our approach on several image
classification, semantic segmentation, and landmark detection benchmarks, and
demonstrate that it outperforms state-of-the-art methods. Our results uncover
the potential of DPMs as a powerful tool for representation learning and
provide insights into the usefulness of generative models beyond sample
generation. The code is available at
\url{https://github.com/Adamdad/Repfusion}.
Spiking-Diffusion: Vector Quantized Discrete Diffusion Model with Spiking Neural Networks
August 20, 2023
Mingxuan Liu, Jie Gan, Rui Wen, Tao Li, Yongli Chen, Hong Chen
Spiking neural networks (SNNs) have tremendous potential for energy-efficient
neuromorphic chips due to their binary and event-driven architecture. SNNs have
been primarily used in classification tasks, but limited exploration on image
generation tasks. To fill the gap, we propose a Spiking-Diffusion model, which
is based on the vector quantized discrete diffusion model. First, we develop a
vector quantized variational autoencoder with SNNs (VQ-SVAE) to learn a
discrete latent space for images. In VQ-SVAE, image features are encoded using
both the spike firing rate and postsynaptic potential, and an adaptive spike
generator is designed to restore embedding features in the form of spike
trains. Next, we perform absorbing state diffusion in the discrete latent space
and construct a spiking diffusion image decoder (SDID) with SNNs to denoise the
image. Our work is the first to build the diffusion model entirely from SNN
layers. Experimental results on MNIST, FMNIST, KMNIST, Letters, and Cifar10
demonstrate that Spiking-Diffusion outperforms the existing SNN-based
generation model. We achieve FIDs of 37.50, 91.98, 59.23, 67.41, and 120.5 on
the above datasets respectively, with reductions of 58.60\%, 18.75\%, 64.51\%,
29.75\%, and 44.88\% in FIDs compared with the state-of-the-art work. Our code will
be available at \url{https://github.com/Arktis2022/Spiking-Diffusion}.
Contrastive Diffusion Model with Auxiliary Guidance for Coarse-to-Fine PET Reconstruction
August 20, 2023
Zeyu Han, Yuhan Wang, Luping Zhou, Peng Wang, Binyu Yan, Jiliu Zhou, Yan Wang, Dinggang Shen
To obtain high-quality positron emission tomography (PET) scans while
reducing radiation exposure to the human body, various approaches have been
proposed to reconstruct standard-dose PET (SPET) images from low-dose PET
(LPET) images. One widely adopted technique is the generative adversarial
networks (GANs), yet recently, diffusion probabilistic models (DPMs) have
emerged as a compelling alternative due to their improved sample quality and
higher log-likelihood scores compared to GANs. Despite this, DPMs suffer from
two major drawbacks in real clinical settings, i.e., the computationally
expensive sampling process and the insufficient preservation of correspondence
between the conditioning LPET image and the reconstructed PET (RPET) image. To
address the above limitations, this paper presents a coarse-to-fine PET
reconstruction framework that consists of a coarse prediction module (CPM) and
an iterative refinement module (IRM). The CPM generates a coarse PET image via
a deterministic process, and the IRM samples the residual iteratively. By
delegating most of the computational overhead to the CPM, the overall sampling
speed of our method can be significantly improved. Furthermore, two additional
strategies, i.e., an auxiliary guidance strategy and a contrastive diffusion
strategy, are proposed and integrated into the reconstruction process, which
can enhance the correspondence between the LPET image and the RPET image,
further improving clinical reliability. Extensive experiments on two human
brain PET datasets demonstrate that our method outperforms the state-of-the-art
PET reconstruction methods. The source code is available at
\url{https://github.com/Show-han/PET-Reconstruction}.
Semi-Implicit Variational Inference via Score Matching
August 19, 2023
Longlin Yu, Cheng Zhang
Semi-implicit variational inference (SIVI) greatly enriches the
expressiveness of variational families by considering implicit variational
distributions defined in a hierarchical manner. However, due to the intractable
densities of variational distributions, current SIVI approaches often use
surrogate evidence lower bounds (ELBOs) or employ expensive inner-loop MCMC
runs for unbiased ELBOs for training. In this paper, we propose SIVI-SM, a new
method for SIVI based on an alternative training objective via score matching.
Leveraging the hierarchical structure of semi-implicit variational families,
the score matching objective allows a minimax formulation where the intractable
variational densities can be naturally handled with denoising score matching.
We show that SIVI-SM closely matches the accuracy of MCMC and outperforms
ELBO-based SIVI methods in a variety of Bayesian inference tasks.
AltDiffusion: A Multilingual Text-to-Image Diffusion Model
August 19, 2023
Fulong Ye, Guang Liu, Xinya Wu, Ledell Wu
Large Text-to-Image(T2I) diffusion models have shown a remarkable capability
to produce photorealistic and diverse images based on text inputs. However,
existing works only support limited language input, e.g., English, Chinese, and
Japanese, leaving users beyond these languages underserved and blocking the
global expansion of T2I models. Therefore, this paper presents AltDiffusion, a
novel multilingual T2I diffusion model that supports eighteen different
languages. Specifically, we first train a multilingual text encoder based on
knowledge distillation. Then we plug it into a pretrained English-only
diffusion model and train the model with a two-stage schema to enhance its
multilingual capability, consisting of a concept alignment stage and a quality
improvement stage on a large-scale multilingual dataset. Furthermore, we introduce a new
benchmark, which includes Multilingual-General-18(MG-18) and
Multilingual-Cultural-18(MC-18) datasets, to evaluate the capabilities of T2I
diffusion models for generating high-quality images and capturing
culture-specific concepts in different languages. Experimental results on both
MG-18 and MC-18 demonstrate that AltDiffusion outperforms current
state-of-the-art T2I models, e.g., Stable Diffusion, in multilingual
understanding, especially with respect to culture-specific concepts, while
still having comparable capability for generating high-quality images. All
source code and checkpoints could be found in
https://github.com/superhero-7/AltDiffuson.
Monte Carlo guided Diffusion for Bayesian linear inverse problems
August 15, 2023
Gabriel Cardoso, Yazid Janati El Idrissi, Sylvain Le Corff, Eric Moulines
Ill-posed linear inverse problems arise frequently in various applications,
from computational photography to medical imaging. A recent line of research
exploits Bayesian inference with informative priors to handle the ill-posedness
of such problems. Amongst such priors, score-based generative models (SGM) have
recently been successfully applied to several different inverse problems. In
this study, we exploit the particular structure of the prior defined by the SGM
to define a sequence of intermediate linear inverse problems. As the noise
level decreases, the posteriors of these inverse problems get closer to the
target posterior of the original inverse problem. To sample from this sequence
of posteriors, we propose the use of Sequential Monte Carlo (SMC) methods. The
proposed algorithm, MCGDiff, is shown to be theoretically grounded and we
provide numerical simulations showing that it outperforms competing baselines
when dealing with ill-posed inverse problems in a Bayesian setting.
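The following is a bare-bones sequential Monte Carlo skeleton of the sampling scheme: particles are moved along the reverse diffusion and reweighted/resampled towards each intermediate posterior. The propagate and log_weight callables are left abstract here; MCGDiff's actual proposal and weights, which exploit the linear observation model, are more specific.
```python
import numpy as np

def smc_reverse_diffusion(propagate, log_weight, n_particles=512, n_steps=100, dim=2, seed=0):
    """propagate(x, t): one reverse-diffusion move applied to all particles at step t.
       log_weight(x, t): log-weights tilting particles toward the intermediate posterior."""
    rng = np.random.default_rng(seed)
    particles = rng.standard_normal((n_particles, dim))       # start from the reference Gaussian
    for t in range(n_steps, 0, -1):
        particles = propagate(particles, t)                   # move along the reverse SDE
        logw = log_weight(particles, t)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        idx = rng.choice(n_particles, size=n_particles, p=w)  # multinomial resampling
        particles = particles[idx]
    return particles                                          # approximate posterior samples
```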
DiffSED: Sound Event Detection with Denoising Diffusion
August 14, 2023
Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian Zhu
Sound Event Detection (SED) aims to predict the temporal boundaries of all
the events of interest and their class labels, given an unconstrained audio
sample. Taking either the split-and-classify (i.e., frame-level) strategy or the
more principled event-level modeling approach, all existing methods consider
the SED problem from the discriminative learning perspective. In this work, we
reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals
in a denoising diffusion process, conditioned on a target audio sample. During
training, our model learns to reverse the noising process by converting noisy
latent queries to the groundtruth versions in the elegant Transformer decoder
framework. Doing so enables the model to generate accurate event boundaries from
even noisy queries during inference. Extensive experiments on the Urban-SED and
EPIC-Sounds datasets demonstrate that our model significantly outperforms
existing alternatives, with 40+% faster convergence in training.
Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage
August 14, 2023
Dario Cioni, Lorenzo Berlincioni, Federico Becattini, Alberto del Bimbo
Cultural heritage applications and advanced machine learning models are
creating a fruitful synergy to provide effective and accessible ways of
interacting with artworks. Smart audio-guides, personalized art-related content
and gamification approaches are just a few examples of how technology can be
exploited to provide additional value to artists or exhibitions. Nonetheless,
from a machine learning point of view, the amount of available artistic data is
often not enough to train effective models. Off-the-shelf computer vision
modules can still be exploited to some extent, yet a severe domain shift is
present between art images and standard natural image datasets used to train
such models. As a result, this can lead to degraded performance. This paper
introduces a novel approach to address the challenges of limited annotated data
and domain shifts in the cultural heritage domain. By leveraging generative
vision-language models, we augment art datasets by generating diverse
variations of artworks conditioned on their captions. This augmentation
strategy enhances dataset diversity, bridging the gap between natural images
and artworks, and improving the alignment of visual cues with knowledge from
general-purpose datasets. The generated variations help train vision and
language models that have a deeper understanding of artistic characteristics
and are able to generate better captions with appropriate jargon.
Shape-guided Conditional Latent Diffusion Models for Synthesising Brain Vasculature
August 13, 2023
Yash Deo, Haoran Dou, Nishant Ravikumar, Alejandro F. Frangi, Toni Lassila
The Circle of Willis (CoW) is the part of cerebral vasculature responsible
for delivering blood to the brain. Understanding the diverse anatomical
variations and configurations of the CoW is paramount to advance research on
cerebrovascular diseases and refine clinical interventions. However,
comprehensive investigation of less prevalent CoW variations remains
challenging because of the dominance of a few commonly occurring
configurations. We propose a novel generative approach utilising a conditional
latent diffusion model with shape and anatomical guidance to generate realistic
3D CoW segmentations, including different phenotypical variations. Our
conditional latent diffusion model incorporates shape guidance to better
preserve vessel continuity and demonstrates superior performance when compared
to alternative generative models, including conditional variants of 3D GAN and
3D VAE. We observed that our model generated CoW variants that are more
realistic and demonstrate higher visual fidelity than competing approaches with
an FID score 53\% better than the best-performing GAN-based model.
TextDiff: Mask-Guided Residual Diffusion Models for Scene Text Image Super-Resolution
August 13, 2023
Baolin Liu, Zongyuan Yang, Pengfei Wang, Junjie Zhou, Ziqi Liu, Ziyi Song, Yan Liu, Yongping Xiong
The goal of scene text image super-resolution is to reconstruct
high-resolution text-line images from unrecognizable low-resolution inputs. The
existing methods relying on the optimization of pixel-level loss tend to yield
text edges that exhibit a notable degree of blurring, thereby exerting a
substantial impact on both the readability and recognizability of the text. To
address these issues, we propose TextDiff, the first diffusion-based framework
tailored for scene text image super-resolution. It contains two modules: the
Text Enhancement Module (TEM) and the Mask-Guided Residual Diffusion Module
(MRD). The TEM generates an initial deblurred text image and a mask that
encodes the spatial location of the text. The MRD is responsible for
effectively sharpening the text edge by modeling the residuals between the
ground-truth images and the initial deblurred images. Extensive experiments
demonstrate that our TextDiff achieves state-of-the-art (SOTA) performance on
public benchmark datasets and can improve the readability of scene text images.
Moreover, our proposed MRD module is plug-and-play and effectively sharpens
the text edges produced by SOTA methods. This enhancement not only improves the
readability and recognizability of the results generated by SOTA methods but
also requires no additional joint training. Code is available at
https://github.com/Lenubolim/TextDiff.
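Schematically, inference can be read as a coarse prediction plus a mask-guided residual correction; the snippet below is a conceptual sketch with assumed interfaces (tem and sample_residual are placeholders), not the released code.
```python
# Conceptual sketch; `tem` and `sample_residual` are assumed callables, not TextDiff's API.
def textdiff_super_resolve(tem, sample_residual, lr_image):
    coarse, text_mask = tem(lr_image)               # TEM: initial deblurred image + text-location mask
    residual = sample_residual(coarse, text_mask)   # MRD: diffusion-sampled residual vs. ground truth
    return coarse + text_mask * residual            # correction concentrated on text regions
```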
CLE Diffusion: Controllable Light Enhancement Diffusion Model
August 13, 2023
Yuyang Yin, Dejia Xu, Chuangchuang Tan, Ping Liu, Yao Zhao, Yunchao Wei
Low light enhancement has gained increasing importance with the rapid
development of visual creation and editing. However, most existing enhancement
algorithms are designed to homogeneously increase the brightness of images to a
pre-defined extent, limiting the user experience. To address this issue, we
propose Controllable Light Enhancement Diffusion Model, dubbed CLE Diffusion, a
novel diffusion framework to provide users with rich controllability. Built
with a conditional diffusion model, we introduce an illumination embedding to
let users control their desired brightness level. Additionally, we incorporate
the Segment-Anything Model (SAM) to enable user-friendly region
controllability, where users can click on objects to specify the regions they
wish to enhance. Extensive experiments demonstrate that CLE Diffusion achieves
competitive performance regarding quantitative metrics, qualitative results,
and versatile controllability. Project page:
https://yuyangyin.github.io/CLEDiffusion/
LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts
August 13, 2023
Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin
Thanks to the rapid development of diffusion models, unprecedented progress
has been witnessed in image synthesis. Prior works mostly rely on pre-trained
linguistic models, but a text is often too abstract to properly specify all the
spatial properties of an image, e.g., the layout configuration of a scene,
leading to the sub-optimal results of complex scene generation. In this paper,
we achieve accurate complex scene generation by proposing a semantically
controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from
the previous Layout-to-Image generation (L2I) methods that only explore
category-aware relationships, LAW-Diffusion introduces a spatial dependency
parser to encode the location-aware semantic coherence across objects as a
layout embedding and produces a scene with perceptually harmonious object
styles and contextual relations. To be specific, we delicately instantiate each
object’s regional semantics as an object region map and leverage a
location-aware cross-object attention module to capture the spatial
dependencies among those disentangled representations. We further propose an
adaptive guidance schedule for our layout guidance to mitigate the trade-off
between the regional semantic alignment and the texture fidelity of generated
objects. Moreover, LAW-Diffusion allows for instance reconfiguration while
maintaining the other regions in a synthesized image by introducing a
layout-aware latent grafting mechanism to recompose its local regional
semantics. To better verify the plausibility of generated scenes, we propose a
new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS), to
measure how well the images preserve the rational and harmonious relations among
contextual objects. Comprehensive experiments demonstrate that our
LAW-Diffusion yields the state-of-the-art generative performance, especially
with coherent object relations.
Accelerating Diffusion-based Combinatorial Optimization Solvers by Progressive Distillation
August 12, 2023
Junwei Huang, Zhiqing Sun, Yiming Yang
Graph-based diffusion models have shown promising results in terms of
generating high-quality solutions to NP-complete (NPC) combinatorial
optimization (CO) problems. However, those models are often inefficient in
inference, due to the iterative evaluation nature of the denoising diffusion
process. This paper proposes to use progressive distillation to speed up the
inference by taking fewer steps (e.g., forecasting two steps ahead within a
single step) during the denoising process. Our experimental results show that
the progressively distilled model can perform inference 16 times faster with
only 0.019% degradation in performance on the TSP-50 dataset.
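A toy sketch of the progressive-distillation objective under assumed interfaces (a small MLP denoiser and a DDIM-like deterministic update stand in for the graph-diffusion solver): the student is trained so that one of its steps reproduces two consecutive teacher steps.
```python
import torch
import torch.nn as nn

dim = 16
teacher = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
student = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
student.load_state_dict(teacher.state_dict())    # initialise the student from the teacher
for p in teacher.parameters():
    p.requires_grad_(False)

def step(model, x, t, dt):
    """One deterministic denoising update: x <- x - dt * predicted noise direction."""
    t_in = torch.full((x.shape[0], 1), t)
    return x - dt * model(torch.cat([x, t_in], dim=-1))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for _ in range(200):
    x = torch.randn(32, dim)                      # noisy samples at time t
    t, dt = 1.0, 0.1
    with torch.no_grad():                         # two small teacher steps
        target = step(teacher, step(teacher, x, t, dt), t - dt, dt)
    pred = step(student, x, t, 2 * dt)            # one big student step
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```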
EquiDiff: A Conditional Equivariant Diffusion Model For Trajectory Prediction
August 12, 2023
Kehua Chen, Xianda Chen, Zihan Yu, Meixin Zhu, Hai Yang
Accurate trajectory prediction is crucial for the safe and efficient
operation of autonomous vehicles. The growing popularity of deep learning has
led to the development of numerous methods for trajectory prediction. While
deterministic deep learning models have been widely used, deep generative
models have gained popularity as they learn data distributions from training
data and account for trajectory uncertainties. In this study, we propose
EquiDiff, a deep generative model for predicting future vehicle trajectories.
EquiDiff is based on the conditional diffusion model, which generates future
trajectories by incorporating historical information and random Gaussian noise.
The backbone model of EquiDiff is an SO(2)-equivariant transformer that fully
utilizes the geometric properties of location coordinates. In addition, we
employ Recurrent Neural Networks and Graph Attention Networks to extract social
interactions from historical trajectories. To evaluate the performance of
EquiDiff, we conduct extensive experiments on the NGSIM dataset. Our results
demonstrate that EquiDiff outperforms other baseline models in short-term
prediction, but has slightly higher errors for long-term prediction.
Furthermore, we conduct an ablation study to investigate the contribution of
each component of EquiDiff to the prediction accuracy. Additionally, we present
a visualization of the generation process of our diffusion model, providing
insights into the uncertainty of the prediction.
Mirror Diffusion Models
August 11, 2023
Jaesung Tae
Diffusion models have successfully been applied to generative tasks in
various continuous domains. However, applying diffusion to discrete categorical
data remains a non-trivial task. Moreover, generation in continuous domains
often requires clipping in practice, which motivates the need for a theoretical
framework for adapting diffusion to constrained domains. Inspired by the mirror
Langevin algorithm for the constrained sampling problem, in this theoretical
report we propose Mirror Diffusion Models (MDMs). We demonstrate MDMs in the
context of simplex diffusion and propose natural extensions to popular domains
such as image and text generation.
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
August 11, 2023
Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen
Current deep networks are very data-hungry and benefit from training on
large-scale datasets, which are often time-consuming to collect and annotate. By
contrast, synthetic data can be generated infinitely using generative models
such as DALL-E and diffusion models, with minimal effort and cost. In this
paper, we present DatasetDM, a generic dataset generation model that can
produce diverse synthetic images and the corresponding high-quality perception
annotations (e.g., segmentation masks, and depth). Our method builds upon the
pre-trained diffusion model and extends text-guided image synthesis to
perception data generation. We show that the rich latent code of the diffusion
model can be effectively decoded as accurate perception annotations using a
decoder module. Training the decoder requires less than 1% (around 100) of the
images to be manually labeled, enabling the generation of an infinitely large
annotated dataset. Then these synthetic data can be used for training various
perception models for downstream tasks. To showcase the power of the proposed
approach, we generate datasets with rich dense pixel-wise labels for a wide
range of downstream tasks, including semantic segmentation, instance
segmentation, and depth estimation. Notably, it achieves 1) state-of-the-art
results on semantic segmentation and instance segmentation; 2) significantly
more robust on domain generalization than using the real data alone; and
state-of-the-art results in zero-shot segmentation setting; and 3) flexibility
for efficient application and novel task composition (e.g., image editing). The
project website and code can be found at
https://weijiawu.github.io/DatasetDM_page/ and
https://github.com/showlab/DatasetDM, respectively.
Diffusion-based Visual Counterfactual Explanations – Towards Systematic Quantitative Evaluation
August 11, 2023
Philipp Vaeth, Alexander M. Fruehwald, Benjamin Paassen, Magda Gregorova
Latest methods for visual counterfactual explanations (VCE) harness the power
of deep generative models to synthesize new examples of high-dimensional images
of impressive quality. However, it is currently difficult to compare the
performance of these VCE methods as the evaluation procedures largely vary and
often boil down to visual inspection of individual examples and small scale
user studies. In this work, we propose a framework for systematic, quantitative
evaluation of the VCE methods and a minimal set of metrics to be used. We use
this framework to explore the effects of certain crucial design choices in the
latest diffusion-based generative models for VCEs of natural image
classification (ImageNet). We conduct a battery of ablation-like experiments,
generating thousands of VCEs for a suite of classifiers of various complexity,
accuracy and robustness. Our findings suggest multiple directions for future
advancements and improvements of VCE methods. By sharing our methodology and
our approach to tackle the computational challenges of such a study on a
limited hardware setup (including the complete code base), we offer valuable
guidance for researchers in the field, fostering consistency and transparency in
the assessment of counterfactual explanations.
Head Rotation in Denoising Diffusion Models
August 11, 2023
Andrea Asperti, Gabriele Colasuonno, Antonio Guerra
cs.CV, I.2.10; I.4.5; I.5
Denoising Diffusion Models (DDM) are emerging as the cutting-edge technology
in the realm of deep generative modeling, challenging the dominance of
Generative Adversarial Networks. However, effectively exploring the latent
space’s semantics and identifying compelling trajectories for manipulating and
editing important attributes of the generated samples remains challenging,
primarily due to the high-dimensional nature of the latent space. In this
study, we specifically concentrate on face rotation, which is known to be one
of the most intricate editing operations. By leveraging a recent embedding
technique for Denoising Diffusion Implicit Models (DDIM), we achieve, in many
cases, noteworthy manipulations encompassing a wide rotation angle of $\pm
30^\circ$, preserving the distinct characteristics of the individual. Our
methodology exploits the computation of trajectories approximating clouds of
latent representations of dataset samples with different yaw rotations through
linear regression. Specific trajectories are obtained by restricting the
analysis to subsets of data sharing significant attributes with the source
image. One of these attributes is the light provenance: a byproduct of our
research is a labeling of CelebA, categorizing images into three major groups
based on the illumination direction: left, center, and right.
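A toy numpy illustration of the regression-based trajectory idea: synthetic vectors stand in for DDIM latent embeddings, and the yaw labels, data sizes and editing step are assumptions for illustration only.
```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 512
yaws = rng.uniform(-30, 30, size=(200, 1))               # yaw labels (degrees) of a data subset
true_axis = rng.normal(size=(1, latent_dim))             # unknown "rotation" direction (toy ground truth)
latents = yaws @ true_axis + 0.1 * rng.normal(size=(200, latent_dim))   # stand-in DDIM embeddings

# Least-squares regression of the latents on the yaw angle gives the editing direction (slope).
A = np.hstack([yaws, np.ones_like(yaws)])                 # design matrix [yaw, 1]
coef, *_ = np.linalg.lstsq(A, latents, rcond=None)        # shape (2, latent_dim)
slope = coef[0]

source_latent = latents[0]
edited_latent = source_latent + 20.0 * slope              # move along the trajectory: rotate by ~+20 degrees
```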
Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning
August 11, 2023
Chun-Mei Feng, Kai Yu, Yong Liu, Salman Khan, Wangmeng Zuo
Benefiting from prompt tuning, recent years have witnessed the promising
performance of pre-trained vision-language models, e.g., CLIP, on versatile
downstream tasks. In this paper, we focus on a particular setting of learning
adaptive prompts on the fly for each test sample from an unseen new domain,
which is known as test-time prompt tuning (TPT). Existing TPT methods typically
rely on data augmentation and confidence selection. However, conventional data
augmentation techniques, e.g., random resized crops, suffer from a lack of
data diversity, while entropy-based confidence selection alone is not
sufficient to guarantee prediction fidelity. To address these issues, we
propose a novel TPT method, named DiffTPT, which leverages pre-trained
diffusion models to generate diverse and informative new data. Specifically, we
incorporate data augmented by both the conventional method and pre-trained
Stable Diffusion to exploit their respective merits, improving the model's ability to
adapt to unknown new test data. Moreover, to ensure the prediction fidelity of
generated data, we introduce a cosine similarity-based filtration technique to
select the generated data with higher similarity to the single test sample. Our
experiments on test datasets with distribution shifts and unseen categories
demonstrate that DiffTPT improves the zero-shot accuracy by an average of
5.13\% compared to the state-of-the-art TPT method. Our code and models will be
publicly released.
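The cosine-similarity filtration step can be sketched as follows; the feature extractor, keep ratio, and tensor shapes are assumptions, not DiffTPT's exact implementation.
```python
import torch
import torch.nn.functional as F

def filter_augmentations(test_feat, gen_feats, keep_ratio=0.5):
    """test_feat: (d,) feature of the single test image; gen_feats: (n, d) features of
    diffusion-generated views. Returns indices of the most similar generated samples."""
    sims = F.cosine_similarity(gen_feats, test_feat.unsqueeze(0), dim=-1)   # (n,)
    k = max(1, int(keep_ratio * gen_feats.shape[0]))
    return sims.topk(k).indices

# Example with random stand-in features.
kept = filter_augmentations(torch.randn(256), torch.randn(64, 256))
```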
Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation
August 11, 2023
Yuki Endo
Text-to-image synthesis has achieved high-quality results with recent
advances in diffusion models. However, text input alone has high spatial
ambiguity and limited user controllability. Most existing methods allow spatial
control through additional visual guidance (e.g., sketches and semantic masks)
but require additional training with annotated images. In this paper, we
propose a method for spatially controlling text-to-image generation without
further training of diffusion models. Our method is based on the insight that
the cross-attention maps reflect the positional relationship between words and
pixels. Our aim is to control the attention maps according to given semantic
masks and text prompts. To this end, we first explore a simple approach of
directly swapping the cross-attention maps with constant maps computed from the
semantic regions. Some prior works also allow training-free spatial control of
text-to-image diffusion models by directly manipulating cross-attention maps.
However, these approaches still suffer from misalignment to given masks because
manipulated attention maps are far from actual ones learned by diffusion
models. To address this issue, we propose masked-attention guidance, which can
generate images more faithful to semantic masks via indirect control of
attention to each word and pixel by manipulating noise images fed to diffusion
models. Masked-attention guidance can be easily integrated into pre-trained
off-the-shelf diffusion models (e.g., Stable Diffusion) and applied to the
tasks of text-guided image editing. Experiments show that our method enables
more accurate spatial control than baselines qualitatively and quantitatively.
Masked Diffusion as Self-supervised Representation Learner
August 10, 2023
Zixuan Pan, Jianxu Chen, Yiyu Shi
Denoising diffusion probabilistic models have recently demonstrated
state-of-the-art generative performance and have been used as strong
pixel-level representation learners. This paper decomposes the interrelation
between the generative capability and representation learning ability inherent
in diffusion models. We present the masked diffusion model (MDM), a scalable
self-supervised representation learner for semantic segmentation, substituting
the conventional additive Gaussian noise of traditional diffusion with a
masking mechanism. Our proposed approach convincingly surpasses prior
benchmarks, demonstrating remarkable advancements in both medical and natural
image semantic segmentation tasks, particularly in few-shot scenarios.
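An illustrative sketch of swapping additive Gaussian noise for a masking corruption; the masking granularity and linear schedule below are assumptions, not necessarily the paper's exact choices.
```python
import torch

def mask_corrupt(x, t, T):
    """x: (B, C, H, W) images. Mask out roughly a fraction t/T of spatial positions,
    shared across channels, instead of adding Gaussian noise."""
    keep_prob = 1.0 - t / T
    keep = (torch.rand(x.shape[0], 1, x.shape[2], x.shape[3]) < keep_prob).float()
    return x * keep, keep

# The denoising network is then trained to reconstruct x (or its masked part) from x_corrupt.
x = torch.randn(4, 3, 32, 32)
x_corrupt, keep = mask_corrupt(x, t=500, T=1000)
```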
Improved Multi-Shot Diffusion-Weighted MRI with Zero-Shot Self-Supervised Learning Reconstruction
August 09, 2023
Jaejin Cho, Yohan Jun, Xiaoqing Wang, Caique Kobayashi, Berkin Bilgic
Diffusion MRI is commonly performed using echo-planar imaging (EPI) due to
its rapid acquisition time. However, the resolution of diffusion-weighted
images is often limited by magnetic field inhomogeneity-related artifacts and
blurring induced by T2- and T2*-relaxation effects. To address these
limitations, multi-shot EPI (msEPI) combined with parallel imaging techniques
is frequently employed. Nevertheless, reconstructing msEPI can be challenging
due to phase variation between multiple shots. In this study, we introduce a
novel msEPI reconstruction approach called zero-MIRID (zero-shot
self-supervised learning of Multi-shot Image Reconstruction for Improved
Diffusion MRI). This method jointly reconstructs msEPI data by incorporating
deep learning-based image regularization techniques. The network incorporates
CNN denoisers in both k- and image-spaces, while leveraging virtual coils to
enhance image reconstruction conditioning. By employing a self-supervised
learning technique and dividing sampled data into three groups, the proposed
approach achieves superior results compared to the state-of-the-art parallel
imaging method, as demonstrated in an in-vivo experiment.
Do Diffusion Models Suffer Error Propagation? Theoretical Analysis and Consistency Regularization
August 09, 2023
Yangming Li, Mihaela van der Schaar
Although diffusion models (DMs) have shown promising performances in a number
of tasks (e.g., speech synthesis and image generation), they might suffer from
error propagation because of their sequential structure. However, this is not
certain because some sequential models, such as Conditional Random Fields (CRFs),
are free from this problem. To address this issue, we develop a theoretical
framework to mathematically formulate error propagation in the architecture of
DMs. The framework contains three elements: modular error, cumulative
error, and propagation equation. The modular and cumulative errors are related
by the equation, which interprets that DMs are indeed affected by error
propagation. Our theoretical study also suggests that the cumulative error is
closely related to the generation quality of DMs. Based on this finding, we
apply the cumulative error as a regularization term to reduce error
propagation. Because the term is computationally intractable, we derive its
upper bound and design a bootstrap algorithm to efficiently estimate the bound
for optimization. We have conducted extensive experiments on multiple image
datasets, showing that our proposed regularization reduces error propagation,
significantly improves vanilla DMs, and outperforms previous baselines.
IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Models
August 09, 2023
Fadi Boutros, Jonas Henry Grebe, Arjan Kuijper, Naser Damer
The availability of large-scale authentic face databases has been crucial to
the significant advances made in face recognition research over the past
decade. However, legal and ethical concerns led to the recent retraction of
many of these databases by their creators, raising questions about the
continuity of future face recognition research without one of its key
resources. Synthetic datasets have emerged as a promising alternative to
privacy-sensitive authentic data for face recognition development. However,
recent synthetic datasets that are used to train face recognition models suffer
either from limitations in intra-class diversity or cross-class (identity)
discrimination, leading to suboptimal accuracies, far below those achieved by
models trained on authentic data. This paper targets
this issue by proposing IDiff-Face, a novel approach based on conditional
latent diffusion models for synthetic identity generation with realistic
identity variations for face recognition training. Through extensive
evaluations, our proposed synthetic-based face recognition approach pushed the
limits of state-of-the-art performances, achieving, for example, 98.00%
accuracy on the Labeled Faces in the Wild (LFW) benchmark, far ahead of recent
synthetic-based face recognition solutions (95.40%) and narrowing the gap to
authentic-based face recognition (99.82%).
3D Scene Diffusion Guidance using Scene Graphs
August 08, 2023
Mohammad Naanaa, Katharina Schmid, Yinyu Nie
Guided synthesis of high-quality 3D scenes is a challenging task. Diffusion
models have shown promise in generating diverse data, including 3D scenes.
However, current methods rely directly on text embeddings for controlling the
generation, limiting the incorporation of complex spatial relationships between
objects. We propose a novel approach for 3D scene diffusion guidance using
scene graphs. To leverage the relative spatial information the scene graphs
provide, we make use of relational graph convolutional blocks within our
denoising network. We show that our approach significantly improves the
alignment between scene description and generated scene.
Linear Convergence Bounds for Diffusion Models via Stochastic Localization
August 07, 2023
Joe Benton, Valentin De Bortoli, Arnaud Doucet, George Deligiannidis
Diffusion models are a powerful method for generating approximate samples
from high-dimensional data distributions. Several recent results have provided
polynomial bounds on the convergence rate of such models, assuming
$L^2$-accurate score estimators. However, up until now the best known such
bounds were either superlinear in the data dimension or required strong
smoothness assumptions. We provide the first convergence bounds which are
linear in the data dimension (up to logarithmic factors) assuming only finite
second moments of the data distribution. We show that diffusion models require
at most $\tilde O(\frac{d \log^2(1/\delta)}{\varepsilon^2})$ steps to
approximate an arbitrary data distribution on $\mathbb{R}^d$ corrupted with
Gaussian noise of variance $\delta$ to within $\varepsilon^2$ in
Kullback–Leibler divergence. Our proof builds on the Girsanov-based methods of
previous works. We introduce a refined treatment of the error arising from the
discretization of the reverse SDE, which is based on tools from stochastic
localization.
Diffusion Model in Causal Inference with Unmeasured Confounders
August 07, 2023
Tatsuhiro Shimizu
We study how to extend the use of the diffusion model to answer causal
questions from observational data in the presence of unmeasured
confounders. In Pearl’s framework of using a Directed Acyclic Graph (DAG) to
capture the causal intervention, a Diffusion-based Causal Model (DCM) was
proposed incorporating the diffusion model to answer the causal questions more
accurately, assuming that all of the confounders are observed. However,
unmeasured confounders exist in practice, which hinders DCM from being
applicable. To alleviate this limitation of DCM, we propose an extended model
called Backdoor Criterion based DCM (BDCM), whose idea is rooted in the
Backdoor criterion to find the variables in DAG to be included in the decoding
process of the diffusion model so that we can extend DCM to the case with
unmeasured confounders. Synthetic data experiments demonstrate that our
proposed model captures the counterfactual distribution more precisely than DCM
in the presence of unmeasured confounders.
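For reference, the adjustment licensed by the backdoor criterion is the standard identity (stated here in its textbook form, not the paper's notation): if a set of variables $Z$ satisfies the backdoor criterion relative to $(X, Y)$ in the DAG, then $P(Y \mid \mathrm{do}(X=x)) = \sum_{z} P(Y \mid X=x, Z=z)\, P(Z=z)$, so the interventional distribution is identified from observational quantities. BDCM uses this criterion to select which variables from the DAG enter the decoding process of the diffusion model.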
Energy-Guided Diffusion Model for CBCT-to-CT Synthesis
August 07, 2023
Linjie Fu, Xia Li, Xiuding Cai, Dong Miao, Yu Yao, Yali Shen
eess.IV, cs.CV, physics.med-ph
Cone Beam CT (CBCT) plays a crucial role in Adaptive Radiation Therapy (ART)
by accurately providing radiation treatment when organ anatomy changes occur.
However, CBCT images suffer from scatter noise and artifacts, making relying
solely on CBCT for precise dose calculation and accurate tissue localization
challenging. Therefore, there is a need to improve CBCT image quality and
Hounsfield Unit (HU) accuracy while preserving anatomical structures. To
enhance the role and application value of CBCT in ART, we propose an
energy-guided diffusion model (EGDiff) and conduct experiments on a chest tumor
dataset to generate synthetic CT (sCT) from CBCT. The experimental results
demonstrate impressive performance with an average absolute error of
26.87$\pm$6.14 HU, a structural similarity index measurement of 0.850$\pm$0.03,
a peak signal-to-noise ratio of the sCT of 19.83$\pm$1.39 dB, and a normalized
cross-correlation of the sCT of 0.874$\pm$0.04. These results indicate that our
method outperforms state-of-the-art unsupervised synthesis methods in accuracy
and visual quality, producing superior sCT images.
Implicit Graph Neural Diffusion Based on Constrained Dirichlet Energy Minimization
August 07, 2023
Guoji Fu, Mohammed Haroon Dupty, Yanfei Dong, Lee Wee Sun
Implicit graph neural networks (GNNs) have emerged as a potential approach to
enable GNNs to capture long-range dependencies effectively. However, poorly
designed implicit GNN layers can experience over-smoothing or may have limited
adaptability to learn data geometry, potentially hindering their performance in
graph learning problems. To address these issues, we introduce a geometric
framework to design implicit graph diffusion layers based on a parameterized
graph Laplacian operator. Our framework allows learning the geometry of vertex
and edge spaces, as well as the graph gradient operator from data. We further
show how implicit GNN layers can be viewed as the fixed-point solution of a
Dirichlet energy minimization problem and give conditions under which it may
suffer from over-smoothing. To overcome the over-smoothing problem, we design
our implicit graph diffusion layer as the solution of a Dirichlet energy
minimization problem with constraints on vertex features, enabling it to trade
off smoothing with the preservation of node feature information. With an
appropriate hyperparameter set to be larger than the largest eigenvalue of the
parameterized graph Laplacian, our framework guarantees a unique equilibrium
and quick convergence. Our models demonstrate better performance than leading
implicit and explicit GNNs on benchmark datasets for node and graph
classification tasks, with substantial accuracy improvements observed for some
datasets.
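A worked toy example of the Dirichlet-energy view, under a simplified energy with an identity data-fidelity term (the paper's parameterized Laplacian and constraints are richer): the layer output is the minimizer of $\lambda\,\mathrm{tr}(Z^\top L Z) + \lVert Z - X\rVert_F^2$, namely $Z^\star = (I + \lambda L)^{-1}X$, which a fixed-point iteration also reaches.
```python
import numpy as np

# Path graph on 4 nodes and its unnormalised Laplacian.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A
X = np.random.default_rng(0).normal(size=(4, 8))      # input node features

lam = 0.2                                             # smoothing strength
Z_closed = np.linalg.solve(np.eye(4) + lam * L, X)    # closed-form minimiser

# The same equilibrium via fixed-point iteration (converges here because the
# spectral radius of lam * L is below 1 for this graph and this lam).
Z = X.copy()
for _ in range(500):
    Z = X - lam * (L @ Z)
assert np.allclose(Z, Z_closed, atol=1e-8)
```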
Generative Approach for Probabilistic Human Mesh Recovery using Diffusion Models
August 05, 2023
Hanbyel Cho, Junmo Kim
This work focuses on the problem of reconstructing a 3D human body mesh from
a given 2D image. Despite the inherent ambiguity of the task of human mesh
recovery, most existing works have adopted a method of regressing a single
output. In contrast, we propose a generative framework, called
“Diffusion-based Human Mesh Recovery (Diff-HMR)”, that takes advantage of the
denoising diffusion process to account for multiple plausible outcomes. During
the training phase, the SMPL parameters are diffused from ground-truth
parameters to a random distribution, and Diff-HMR learns the reverse process of
this diffusion. In the inference phase, the model progressively refines the
given random SMPL parameters into the corresponding parameters that align with
the input image. Diff-HMR, being a generative approach, is capable of
generating diverse results for the same input image as the input noise varies.
We conduct validation experiments, and the results demonstrate that the
proposed framework effectively models the inherent ambiguity of the task of
human mesh recovery in a probabilistic manner. The code is available at
https://github.com/hanbyel0105/Diff-HMR
DermoSegDiff: A Boundary-aware Segmentation Diffusion Model for Skin Lesion Delineation
August 05, 2023
Afshin Bozorgpour, Yousef Sadegheih, Amirhossein Kazerouni, Reza Azad, Dorit Merhof
Skin lesion segmentation plays a critical role in the early detection and
accurate diagnosis of dermatological conditions. Denoising Diffusion
Probabilistic Models (DDPMs) have recently gained attention for their
exceptional image-generation capabilities. Building on these advancements, we
propose DermoSegDiff, a novel framework for skin lesion segmentation that
incorporates boundary information during the learning process. Our approach
introduces a novel loss function that prioritizes the boundaries during
training, gradually reducing the significance of other regions. We also
introduce a novel U-Net-based denoising network that proficiently integrates
noise and semantic information inside the network. Experimental results on
multiple skin segmentation datasets demonstrate the superiority of DermoSegDiff
over existing CNN, transformer, and diffusion-based approaches, showcasing its
effectiveness and generalization in various scenarios. The implementation is
publicly accessible on
\href{https://github.com/mindflow-institue/dermosegdiff}{GitHub}
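A sketch of a boundary-weighted denoising loss in PyTorch; the dilation/erosion construction and the weighting constants are illustrative choices, not necessarily DermoSegDiff's exact loss.
```python
import torch
import torch.nn.functional as F

def boundary_weights(mask, kernel_size=5, boost=4.0):
    """mask: (B, 1, H, W) binary lesion mask as a float tensor. The boundary band is the
    difference between a dilation and an erosion of the mask."""
    pad = kernel_size // 2
    dilated = F.max_pool2d(mask, kernel_size, stride=1, padding=pad)
    eroded = -F.max_pool2d(-mask, kernel_size, stride=1, padding=pad)
    boundary = (dilated - eroded).clamp(0, 1)
    return 1.0 + boost * boundary                 # base weight 1, boosted near the boundary

def boundary_weighted_mse(noise_pred, noise_true, mask):
    w = boundary_weights(mask)
    return (w * (noise_pred - noise_true) ** 2).mean()
```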
DiffDance: Cascaded Human Motion Diffusion Model for Dance Generation
August 05, 2023
Qiaosong Qi, Le Zhuo, Aixi Zhang, Yue Liao, Fei Fang, Si Liu, Shuicheng Yan
cs.GR, cs.CV, cs.SD, eess.AS
When hearing music, it is natural for people to dance to its rhythm.
Automatic dance generation, however, is a challenging task due to the physical
constraints of human motion and rhythmic alignment with target music.
Conventional autoregressive methods introduce compounding errors during
sampling and struggle to capture the long-term structure of dance sequences. To
address these limitations, we present a novel cascaded motion diffusion model,
DiffDance, designed for high-resolution, long-form dance generation. This model
comprises a music-to-dance diffusion model and a sequence super-resolution
diffusion model. To bridge the gap between music and motion for conditional
generation, DiffDance employs a pretrained audio representation learning model
to extract music embeddings and further align its embedding space to motion via
contrastive loss. When training our cascaded diffusion model, we also
incorporate multiple geometric losses to constrain the model outputs to be
physically plausible and add a dynamic loss weight that adaptively changes over
diffusion timesteps to facilitate sample diversity. Through comprehensive
experiments performed on the benchmark dataset AIST++, we demonstrate that
DiffDance is capable of generating realistic dance sequences that align
effectively with the input music. These results are comparable to those
achieved by state-of-the-art autoregressive methods.
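The music-to-motion alignment can be written as a symmetric InfoNCE-style contrastive loss over paired clips; the snippet below is a generic formulation and may differ from the paper's exact loss.
```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(music_emb, motion_emb, temperature=0.07):
    """music_emb, motion_emb: (B, d) embeddings of paired music/motion clips.
    Matched pairs sit on the diagonal of the similarity matrix."""
    music = F.normalize(music_emb, dim=-1)
    motion = F.normalize(motion_emb, dim=-1)
    logits = music @ motion.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(music.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```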
Diffusion-Augmented Depth Prediction with Sparse Annotations
August 04, 2023
Jiaqi Li, Yiran Wang, Zihao Huang, Jinghong Zheng, Ke Xian, Zhiguo Cao, Jianming Zhang
Depth estimation aims to predict dense depth maps. In autonomous driving
scenes, sparsity of annotations makes the task challenging. Supervised models
produce concave objects due to insufficient structural information. They
overfit to valid pixels and fail to restore spatial structures. Self-supervised
methods have been proposed for this problem, but their robustness is limited by pose
estimation, leading to erroneous results in natural scenes. In this paper, we
propose a supervised framework termed Diffusion-Augmented Depth Prediction
(DADP). We leverage the structural characteristics of the diffusion model to
enforce the depth structures of depth models in a plug-and-play manner. An
object-guided integrality loss is also proposed to further enhance regional
structure integrality by fetching objective information. We evaluate DADP on
three driving benchmarks and achieve significant improvements in depth
structures and robustness. Our work provides a new perspective on depth
estimation with sparse annotations in autonomous driving scenes.
Diffusion probabilistic models enhance variational autoencoder for crystal structure generative modeling
August 04, 2023
Teerachote Pakornchote, Natthaphon Choomphon-anomakhun, Sorrjit Arrerut, Chayanon Atthapak, Sakarn Khamkaeo, Thiparat Chotibut, Thiti Bovornratanaraks
cs.LG, cond-mat.mtrl-sci, cond-mat.stat-mech, physics.comp-ph, 68T07
The crystal diffusion variational autoencoder (CDVAE) is a machine learning
model that leverages score matching to generate realistic crystal structures
that preserve crystal symmetry. In this study, we leverage novel diffusion
probabilistic (DP) models to denoise atomic coordinates rather than adopting
the standard score matching approach in CDVAE. Our proposed DP-CDVAE model can
reconstruct and generate crystal structures whose qualities are statistically
comparable to those of the original CDVAE. Furthermore, notably, when comparing
the carbon structures generated by the DP-CDVAE model with relaxed structures
obtained from density functional theory calculations, we find that the DP-CDVAE
generated structures are remarkably closer to their respective ground states.
The energy differences between these structures and the true ground states are,
on average, 68.1 meV/atom lower than those of structures generated by the original CDVAE.
This significant improvement in the energy accuracy highlights the
effectiveness of the DP-CDVAE model in generating crystal structures that
better represent their ground-state configurations.
Improved Order Analysis and Design of Exponential Integrator for Diffusion Models Sampling
August 04, 2023
Qinsheng Zhang, Jiaming Song, Yongxin Chen
Efficient differential equation solvers have significantly reduced the
sampling time of diffusion models (DMs) while retaining high sampling quality.
Among these solvers, exponential integrators (EI) have gained prominence by
demonstrating state-of-the-art performance. However, existing high-order
EI-based sampling algorithms rely on degenerate EI solvers, resulting in
inferior error bounds and reduced accuracy in contrast to the theoretically
anticipated results under optimal settings. This situation makes the sampling
quality extremely vulnerable to seemingly innocuous design choices such as
timestep schedules. For example, an inefficient timestep scheduler might
necessitate twice the number of steps to achieve a quality comparable to that
obtained through carefully optimized timesteps. To address this issue, we
reevaluate the design of high-order differential solvers for DMs. Through a
thorough order analysis, we reveal that the degeneration of existing high-order
EI solvers can be attributed to the absence of essential order conditions. By
reformulating the differential equations in DMs and capitalizing on the theory
of exponential integrators, we propose refined EI solvers that fulfill all the
order conditions, which we designate as Refined Exponential Solver (RES).
Utilizing these improved solvers, RES exhibits more favorable error bounds
theoretically and achieves superior sampling efficiency and stability in
practical applications. For instance, a simple switch from the single-step
DPM-Solver++ to our order-satisfied RES solver when Number of Function
Evaluations (NFE) $=9$, results in a reduction of numerical defects by $25.2\%$
and FID improvement of $25.4\%$ (16.77 vs 12.51) on a pre-trained ImageNet
diffusion model.
SDDM: Score-Decomposed Diffusion Models on Manifolds for Unpaired Image-to-Image Translation
August 04, 2023
Shikun Sun, Longhui Wei, Junliang Xing, Jia Jia, Qi Tian
Recent score-based diffusion models (SBDMs) show promising results in
unpaired image-to-image translation (I2I). However, existing methods, either
energy-based or statistically-based, provide no explicit form of the interfered
intermediate generative distributions. This work presents a new
score-decomposed diffusion model (SDDM) on manifolds to explicitly optimize the
tangled distributions during image generation. SDDM derives manifolds to make
the distributions of adjacent time steps separable and decompose the score
function or energy guidance into an “image denoising” part and a “content
refinement” part. To refine the image at the same noise level, we equalize
the refinement parts of the score function and energy guidance, which permits
multi-objective optimization on the manifold. We also leverage the block
adaptive instance normalization module to construct manifolds with lower
dimensions but still concentrated with the perturbed reference image. SDDM
outperforms existing SBDM-based methods with much fewer diffusion steps on
several I2I benchmarks.
Diffusion Models for Counterfactual Generation and Anomaly Detection in Brain Images
August 03, 2023
Alessandro Fontanella, Grant Mair, Joanna Wardlaw, Emanuele Trucco, Amos Storkey
Segmentation masks of pathological areas are useful in many medical
applications, such as brain tumour and stroke management. Moreover, healthy
counterfactuals of diseased images can be used to enhance radiologists’
training files and to improve the interpretability of segmentation models. In
this work, we present a weakly supervised method to generate a healthy version
of a diseased image and then use it to obtain a pixel-wise anomaly map. To do
so, we start by considering a saliency map that approximately covers the
pathological areas, obtained with ACAT. Then, we propose a technique that
allows to perform targeted modifications to these regions, while preserving the
rest of the image. In particular, we employ a diffusion model trained on
healthy samples and combine Denoising Diffusion Probabilistic Model (DDPM) and
Denoising Diffusion Implicit Model (DDIM) at each step of the sampling process.
DDPM is used to modify the areas affected by a lesion within the saliency map,
while DDIM guarantees reconstruction of the normal anatomy outside of it. The
two parts are also fused at each timestep, to guarantee the generation of a
sample with a coherent appearance and a seamless transition between edited and
unedited parts. We verify that when our method is applied to healthy samples,
the input images are reconstructed without significant modifications. We
compare our approach with alternative weakly supervised methods on IST-3 for
stroke lesion segmentation and on BraTS2021 for brain tumour segmentation,
where we improve the DICE score of the best competing method from $0.6534$ to
$0.7056$.
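A minimal sketch of the per-step DDPM/DDIM fusion described above, assuming a standard noise-prediction network and cumulative schedule alpha_bar; the mask, model, and schedule names are placeholders rather than the paper's implementation.

import torch

@torch.no_grad()
def fused_step(x_t, t, s, mask, eps_model, alpha_bar):
    """One reverse step that edits inside `mask` and reconstructs outside it."""
    a_t, a_s = alpha_bar(t), alpha_bar(s)
    eps = eps_model(x_t, t)
    x0 = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # predicted clean image

    # DDIM (eta = 0): deterministic, preserves normal anatomy outside the mask.
    x_ddim = a_s.sqrt() * x0 + (1 - a_s).sqrt() * eps

    # DDPM-like (eta = 1): re-injects noise, allowing edits inside the mask.
    var = (1 - a_s) / (1 - a_t) * (1 - a_t / a_s)
    x_ddpm = (a_s.sqrt() * x0 + (1 - a_s - var).clamp(min=0).sqrt() * eps
              + var.sqrt() * torch.randn_like(x_t))

    return mask * x_ddpm + (1 - mask) * x_ddim         # fuse at every timestep
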
Synthesising Rare Cataract Surgery Samples with Guided Diffusion Models
August 03, 2023
Yannik Frisch, Moritz Fuchs, Antoine Sanner, Felix Anton Ucar, Marius Frenzel, Joana Wasielica-Poslednik, Adrian Gericke, Felix Mathias Wagner, Thomas Dratsch, Anirban Mukhopadhyay
Cataract surgery is a frequently performed procedure that demands automation
and advanced assistance systems. However, gathering and annotating data for
training such systems is resource intensive. The publicly available data also
comprises severe imbalances inherent to the surgical process. Motivated by
this, we analyse cataract surgery video data for the worst-performing phases of
a pre-trained downstream tool classifier. The analysis demonstrates that
imbalances deteriorate the classifier’s performance on underrepresented cases.
To address this challenge, we utilise a conditional generative model based on
Denoising Diffusion Implicit Models (DDIM) and Classifier-Free Guidance (CFG).
Our model can synthesise diverse, high-quality examples based on complex
multi-class multi-label conditions, such as surgical phases and combinations of
surgical tools. We affirm that the synthesised samples display tools that the
classifier recognises. These samples are hard to differentiate from real
images, even for clinical experts with more than five years of experience.
Further, our synthetically extended data can improve the data sparsity problem
for the downstream task of tool classification. The evaluations demonstrate
that the model can generate valuable unseen examples, allowing the tool
classifier to improve by up to 10% for rare cases. Overall, our approach can
facilitate the development of automated assistance systems for cataract surgery
by providing a reliable source of realistic synthetic data, which we make
available for everyone.
DiffGANPaint: Fast Inpainting Using Denoising Diffusion GANs
August 03, 2023
Moein Heidari, Alireza Morsali, Tohid Abedini, Samin Heydarian
Free-form image inpainting is the task of reconstructing parts of an image
specified by an arbitrary binary mask. In this task, it is typically desired to
generalize model capabilities to unseen mask types, rather than learning
certain mask distributions. Capitalizing on the advances in diffusion models,
in this paper, we propose a Denoising Diffusion Probabilistic Model (DDPM)
based model capable of filling in missing pixels quickly, as it models the backward
diffusion process using the generator of a generative adversarial network (GAN)
to reduce the sampling cost of diffusion models. Experiments on
general-purpose image inpainting datasets verify that our approach performs
superior or on par with most contemporary works.
Diffusion-based Time Series Data Imputation for Microsoft 365
August 03, 2023
Fangkai Yang, Wenjie Yin, Lu Wang, Tianci Li, Pu Zhao, Bo Liu, Paul Wang, Bo Qiao, Yudong Liu, Mårten Björkman, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Reliability is extremely important for large-scale cloud systems like
Microsoft 365. Cloud failures such as disk failure, node failure, etc. threaten
service reliability, resulting in online service interruptions and economic
loss. Existing works focus on predicting cloud failures and proactively taking
action before failures happen. However, they suffer from poor data quality, such as
missing data during model training and prediction, which limits their performance. In
this paper, we focus on enhancing data quality through data imputation by the
proposed Diffusion+, a sample-efficient diffusion model, to impute the missing
data efficiently based on the observed data. Our experiments and application
practice show that our model contributes to improving the performance of the
downstream failure prediction task.
Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS
August 03, 2023
Myeongjin Ko, Yong-Hoon Choi
The diffusion model is capable of generating high-quality data through a
probabilistic approach. However, it suffers from the drawback of slow
generation speed due to the requirement of a large number of time steps. To
address this limitation, recent models such as denoising diffusion implicit
models (DDIM) focus on generating samples without directly modeling the
probability distribution, while models like denoising diffusion generative
adversarial networks (GAN) combine diffusion processes with GANs. In the field
of speech synthesis, a recent diffusion speech synthesis model called
DiffGAN-TTS, utilizing the structure of GANs, has been introduced and
demonstrates superior performance in both speech quality and generation speed.
In this paper, to further enhance the performance of DiffGAN-TTS, we propose a
speech synthesis model with two discriminators: a diffusion discriminator for
learning the distribution of the reverse process and a spectrogram
discriminator for learning the distribution of the generated data. Objective
metrics such as structural similarity index measure (SSIM), mel-cepstral
distortion (MCD), F0 root mean squared error (F0 RMSE), short-time objective
intelligibility (STOI), perceptual evaluation of speech quality (PESQ), as well
as subjective metrics like mean opinion score (MOS), are used to evaluate the
performance of the proposed model. The evaluation results show that the
proposed model outperforms recent state-of-the-art models such as FastSpeech2
and DiffGAN-TTS in various metrics. Our implementation and audio samples are
located on GitHub.
Motion Planning Diffusion: Learning and Planning of Robot Motions with Diffusion Models
August 03, 2023
Joao Carvalho, An T. Le, Mark Baierl, Dorothea Koert, Jan Peters
Learning priors on trajectory distributions can help accelerate robot motion
planning optimization. Given previously successful plans, learning trajectory
generative models as priors for a new planning problem is highly desirable.
Prior works propose several ways of utilizing this prior to bootstrap the
motion planning problem, either by sampling the prior for initializations or by using
the prior distribution in a maximum-a-posteriori formulation for trajectory
optimization. In this work, we propose learning diffusion models as priors. We
then can sample directly from the posterior trajectory distribution conditioned
on task goals, by leveraging the inverse denoising process of diffusion models.
Furthermore, diffusion has been recently shown to effectively encode data
multimodality in high-dimensional settings, which is particularly well-suited
for large trajectory datasets. To demonstrate our method's efficacy, we compare
our proposed method - Motion Planning Diffusion - against several baselines in
simulated planar robot and 7-dof robot arm manipulator environments. To assess
the generalization capabilities of our method, we test it in environments with
previously unseen obstacles. Our experiments show that diffusion models are
strong priors to encode high-dimensional trajectory distributions of robot
motions.
Reverse Stable Diffusion: What prompt was used to generate this image?
August 02, 2023
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah
Text-to-image diffusion models such as Stable Diffusion have recently
attracted the interest of many researchers, and inverting the diffusion process
can play an important role in better understanding the generative process and
how to engineer prompts in order to obtain the desired images. To this end, we
introduce the new task of predicting the text prompt given an image generated
by a generative diffusion model. We combine a series of white-box and black-box
models (with and without access to the weights of the diffusion network) to
deal with the proposed task. We propose a novel learning framework comprising
a joint prompt regression and multi-label vocabulary classification
objective that generates improved prompts. To further improve our method, we
employ a curriculum learning procedure that promotes the learning of
image-prompt pairs with lower labeling noise (i.e. that are better aligned),
and an unsupervised domain-adaptive kernel learning method that uses the
similarities between samples in the source and target domains as extra
features. We conduct experiments on the DiffusionDB data set, predicting text
prompts from images generated by Stable Diffusion. Our novel learning framework
produces excellent results on the aforementioned task, yielding the highest
gains when applied on the white-box model. In addition, we make an interesting
discovery: training a diffusion model on the prompt generation task can make
the model generate images that are much better aligned with the input prompts,
when the model is directly reused for text-to-image generation.
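A minimal sketch of a joint prompt-regression and multi-label vocabulary-classification objective of the kind described above; the two-head architecture, feature dimensions, and loss weighting are illustrative assumptions, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptPredictor(nn.Module):
    """Two-head model: regress a prompt embedding and classify vocabulary words."""

    def __init__(self, feat_dim=768, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.regressor = nn.Linear(feat_dim, embed_dim)
        self.classifier = nn.Linear(feat_dim, vocab_size)

    def forward(self, image_features):
        return self.regressor(image_features), self.classifier(image_features)

def joint_loss(pred_embed, word_logits, target_embed, word_targets, w=0.5):
    # Cosine regression toward the ground-truth prompt embedding plus
    # binary cross-entropy over vocabulary membership (multi-label).
    reg = 1 - F.cosine_similarity(pred_embed, target_embed, dim=-1).mean()
    cls = F.binary_cross_entropy_with_logits(word_logits, word_targets)
    return reg + w * cls
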
Training Data Protection with Compositional Diffusion Models
August 02, 2023
Aditya Golatkar, Alessandro Achille, Ashwin Swaminathan, Stefano Soatto
cs.LG, cs.AI, cs.CR, cs.CV
We introduce Compartmentalized Diffusion Models (CDM), a method to train
different diffusion models (or prompts) on distinct data sources and
arbitrarily compose them at inference time. The individual models can be
trained in isolation, at different times, and on different distributions and
domains and can be later composed to achieve performance comparable to a
paragon model trained on all data simultaneously. Furthermore, each model only
contains information about the subset of the data it was exposed to during
training, enabling several forms of training data protection. In particular,
CDMs are the first method to enable both selective forgetting and continual
learning for large-scale diffusion models, as well as allowing serving
customized models based on the user’s access rights. CDMs also allow
determining the importance of a subset of the data in generating particular
samples.
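A minimal sketch of composing separately trained diffusion models at inference by mixing their noise predictions; the uniform weighting below is a simplifying assumption rather than the weighting scheme used by CDM.

import torch

@torch.no_grad()
def composed_eps(x_t, t, models, weights=None):
    """Mix the noise predictions of diffusion models trained on separate shards.

    Dropping a model from the list approximates forgetting its data shard;
    restricting the list to the models a user may access serves customization.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    eps = torch.zeros_like(x_t)
    for w, model in zip(weights, models):
        eps = eps + w * model(x_t, t)
    return eps
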
From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion
August 02, 2023
Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez
Deep generative models can generate high-fidelity audio conditioned on
various types of representations (e.g., mel-spectrograms, Mel-frequency
Cepstral Coefficients (MFCC)). Recently, such models have been used to
synthesize audio waveforms conditioned on highly compressed representations.
Although such methods produce impressive results, they are prone to generate
audible artifacts when the conditioning is flawed or imperfect. An alternative
modeling approach is to use diffusion models. However, these have mainly been
used as speech vocoders (i.e., conditioned on mel-spectrograms) or for generating
relatively low sampling rate signals. In this work, we propose a high-fidelity
multi-band diffusion-based framework that generates any type of audio modality
(e.g., speech, music, environmental sounds) from low-bitrate discrete
representations. At equal bit rate, the proposed approach outperforms
state-of-the-art generative techniques in terms of perceptual quality. Training
and evaluation code, along with audio samples, are available on the
facebookresearch/audiocraft Github page.
Patched Denoising Diffusion Models For High-Resolution Image Synthesis
August 02, 2023
Zheng Ding, Mengqi Zhang, Jiajun Wu, Zhuowen Tu
We propose an effective denoising diffusion model for generating
high-resolution images (e.g., 1024$\times$512), trained on small-size image
patches (e.g., 64$\times$64). We name our algorithm Patch-DM, in which a new
feature collage strategy is designed to avoid the boundary artifact when
synthesizing large-size images. Feature collage systematically crops and
combines partial features of the neighboring patches to predict the features of
a shifted image patch, allowing the seamless generation of the entire image due
to the overlap in the patch feature space. Patch-DM produces high-quality image
synthesis results on our newly collected dataset of nature images
(1024$\times$512), as well as on standard benchmarks of smaller sizes
(256$\times$256), including LSUN-Bedroom, LSUN-Church, and FFHQ. We compare our
method with previous patch-based generation methods and achieve
state-of-the-art FID scores on all four datasets. Further, Patch-DM also
reduces memory complexity compared to the classic diffusion models.
Contrast-augmented Diffusion Model with Fine-grained Sequence Alignment for Markup-to-Image Generation
August 02, 2023
Guojin Zhong, Jin Yuan, Pan Wang, Kailun Yang, Weili Guan, Zhiyong Li
The recently rising markup-to-image generation poses greater challenges as
compared to natural image generation, due to its low tolerance for errors as
well as the complex sequence and context correlations between markup and
rendered image. This paper proposes a novel model named “Contrast-augmented
Diffusion Model with Fine-grained Sequence Alignment” (FSA-CDM), which
introduces contrastive positive/negative samples into the diffusion model to
boost performance for markup-to-image generation. Technically, we design a
fine-grained cross-modal alignment module to well explore the sequence
similarity between the two modalities for learning robust feature
representations. To improve the generalization ability, we propose a
contrast-augmented diffusion model to explicitly explore positive and negative
samples by maximizing a novel contrastive variational objective, which is
mathematically inferred to provide a tighter bound for the model’s
optimization. Moreover, the context-aware cross attention module is developed
to capture the contextual information within markup language during the
denoising process, yielding better noise prediction results. Extensive
experiments are conducted on four benchmark datasets from different domains,
and the experimental results demonstrate the effectiveness of the proposed
components in FSA-CDM, significantly exceeding state-of-the-art performance with
about 2%-12% improvements in DTW. The code will be released at
https://github.com/zgj77/FSACDM.
DiffusePast: Diffusion-based Generative Replay for Class Incremental Semantic Segmentation
August 02, 2023
Jingfan Chen, Yuxi Wang, Pengfei Wang, Xiao Chen, Zhaoxiang Zhang, Zhen Lei, Qing Li
Class Incremental Semantic Segmentation (CISS) extends the traditional
segmentation task by incrementally learning newly added classes. Previous work
has introduced generative replay, which involves replaying old class samples
generated from a pre-trained GAN, to address the issues of catastrophic
forgetting and privacy concerns. However, the generated images lack semantic
precision and exhibit out-of-distribution characteristics, resulting in
inaccurate masks that further degrade the segmentation performance. To tackle
these challenges, we propose DiffusePast, a novel framework featuring a
diffusion-based generative replay module that generates semantically accurate
images with more reliable masks guided by different instructions (e.g., text
prompts or edge maps). Specifically, DiffusePast introduces a dual-generator
paradigm, which focuses on generating old class images that align with the
distribution of downstream datasets while preserving the structure and layout
of the original images, enabling more precise masks. To adapt to the novel
visual concepts of newly added classes continuously, we incorporate class-wise
token embedding when updating the dual-generator. Moreover, we assign adequate
pseudo-labels of old classes to the background pixels in the new step images,
further mitigating the forgetting of previously learned knowledge. Through
comprehensive experiments, our method demonstrates competitive performance
across mainstream benchmarks, striking a better balance between the performance
of old and novel classes.
Universal Adversarial Defense in Remote Sensing Based on Pre-trained Denoising Diffusion Models
July 31, 2023
Weikang Yu, Yonghao Xu, Pedram Ghamisi
Deep neural networks (DNNs) have achieved tremendous success in many remote
sensing (RS) applications, yet they remain vulnerable to adversarial
perturbations. Unfortunately, current adversarial defense approaches in RS
studies usually suffer from performance fluctuation and unnecessary re-training
costs due to the need for prior knowledge of the adversarial perturbations
among RS data. To circumvent these challenges, we propose a universal
adversarial defense approach in RS imagery (UAD-RS) using pre-trained diffusion
models to defend the common DNNs against multiple unknown adversarial attacks.
Specifically, the generative diffusion models are first pre-trained on
different RS datasets to learn generalized representations in various data
domains. After that, a universal adversarial purification framework is
developed using the forward and reverse process of the pre-trained diffusion
models to purify the perturbations from adversarial samples. Furthermore, an
adaptive noise level selection (ANLS) mechanism is built to capture the optimal
noise level of the diffusion model that can achieve the best purification
results closest to the clean samples according to their Frechet Inception
Distance (FID) in deep feature space. As a result, only a single pre-trained
diffusion model is needed for the universal purification of adversarial samples
on each dataset, which significantly alleviates the re-training efforts and
maintains high performance without prior knowledge of the adversarial
perturbations. Experiments on four heterogeneous RS datasets regarding scene
classification and semantic segmentation verify that UAD-RS outperforms
state-of-the-art adversarial purification approaches with a universal defense
against seven commonly existing adversarial perturbations. Codes and the
pre-trained models are available online (https://github.com/EricYu97/UAD-RS).
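A minimal sketch of diffusion-based purification with adaptive noise-level selection as described above; `denoise` stands in for the full reverse process and `distance` for the FID computed against clean reference features, and all names are placeholders.

import torch

@torch.no_grad()
def purify(x_adv, t_star, alpha_bar, denoise):
    """Diffuse the input to noise level t_star, then run the reverse process."""
    a = alpha_bar(t_star)
    x_t = a.sqrt() * x_adv + (1 - a).sqrt() * torch.randn_like(x_adv)
    return denoise(x_t, t_star)      # iterative reverse diffusion back to t = 0

@torch.no_grad()
def select_noise_level(x_adv, candidates, alpha_bar, denoise, distance):
    """Try several noise levels and keep the one whose purified output scores
    lowest under `distance` (a stand-in for FID against clean features)."""
    best_t, best_d = None, float("inf")
    for t_star in candidates:
        d = distance(purify(x_adv, t_star, alpha_bar, denoise))
        if d < best_d:
            best_t, best_d = t_star, d
    return best_t
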
DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation
July 31, 2023
Runyang Feng, Yixing Gao, Tze Ho Elden Tse, Xueqing Ma, Hyung Jin Chang
Denoising diffusion probabilistic models that were initially proposed for
realistic image generation have recently shown success in various perception
tasks (e.g., object detection and image segmentation) and are increasingly
gaining attention in computer vision. However, extending such models to
multi-frame human pose estimation is non-trivial due to the presence of the
additional temporal dimension in videos. More importantly, learning
representations that focus on keypoint regions is crucial for accurate
localization of human joints. Nevertheless, it remains unclear how to adapt
diffusion-based methods to achieve this objective. In
this paper, we present DiffPose, a novel diffusion architecture that formulates
video-based human pose estimation as a conditional heatmap generation problem.
First, to better leverage temporal information, we propose SpatioTemporal
Representation Learner which aggregates visual evidence across frames and uses
the resulting features in each denoising step as a condition. In addition, we
present a mechanism called Lookup-based MultiScale Feature Interaction that
determines the correlations between local joints and global contexts across
multiple scales. This mechanism generates delicate representations that focus
on keypoint regions. Altogether, by extending diffusion models, we show two
unique characteristics from DiffPose on pose estimation task: (i) the ability
to combine multiple sets of pose estimates to improve prediction accuracy,
particularly for challenging joints, and (ii) the ability to adjust the number
of iterative steps for feature refinement without retraining the model.
DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017,
PoseTrack2018, and PoseTrack21.
Contrastive Conditional Latent Diffusion for Audio-visual Segmentation
July 31, 2023
Yuxin Mao, Jing Zhang, Mochu Xiang, Yunqiu Lv, Yiran Zhong, Yuchao Dai
cs.CV, cs.MM, cs.SD, eess.AS
We propose a latent diffusion model with contrastive learning for
audio-visual segmentation (AVS) to extensively explore the contribution of
audio. We interpret AVS as a conditional generation task, where audio is
defined as the conditional variable for sound producer(s) segmentation. With
our new interpretation, it is especially necessary to model the correlation
between audio and the final segmentation map to ensure its contribution. We
introduce a latent diffusion model to our framework to achieve
semantic-correlated representation learning. Specifically, our diffusion model
learns the conditional generation process of the ground-truth segmentation map,
leading to ground-truth aware inference when we perform the denoising process
at the test stage. As a conditional diffusion model, we argue it is essential
to ensure that the conditional variable contributes to model output. We then
introduce contrastive learning to our framework to learn audio-visual
correspondence, which is proven consistent with maximizing the mutual
information between model prediction and the audio data. In this way, our
latent diffusion model via contrastive learning explicitly maximizes the
contribution of audio for AVS. Experimental results on the benchmark dataset
verify the effectiveness of our solution. Code and results are online via our
project page: https://github.com/OpenNLPLab/DiffusionAVS.
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training
July 31, 2023
Hyung-Seok Oh, Sang-Hoon Lee, Seong-Whan Lee
Expressive text-to-speech systems have undergone significant advancements
owing to prosody modeling, but conventional methods can still be improved.
Traditional approaches have relied on the autoregressive method to predict the
quantized prosody vector; however, it suffers from the issues of long-term
dependency and slow inference. This study proposes a novel approach called
DiffProsody in which expressive speech is synthesized using a diffusion-based
latent prosody generator and prosody conditional adversarial training. Our
findings confirm the effectiveness of our prosody generator in generating a
prosody vector. Furthermore, our prosody conditional discriminator
significantly improves the quality of the generated speech by accurately
emulating prosody. We use denoising diffusion generative adversarial networks
to improve the prosody generation speed. Consequently, DiffProsody is capable
of generating prosody 16 times faster than the conventional diffusion model.
The superior performance of our proposed method has been demonstrated via
experiments.
Don’t be so negative! Score-based Generative Modeling with Oracle-assisted Guidance
July 31, 2023
Saeid Naderiparizi, Xiaoxuan Liang, Berend Zwartsenberg, Frank Wood
The maximum likelihood principle advocates parameter estimation via
optimization of the data likelihood function. Models estimated in this way can
exhibit a variety of generalization characteristics dictated by, e.g.
architecture, parameterization, and optimization bias. This work addresses
model learning in a setting where there further exists side-information in the
form of an oracle that can label samples as being outside the support of the
true data generating distribution. Specifically we develop a new denoising
diffusion probabilistic modeling (DDPM) methodology, Gen-neG, that leverages
this additional side-information. Our approach builds on generative adversarial
networks (GANs) and discriminator guidance in diffusion models to guide the
generation process towards the positive support region indicated by the oracle.
We empirically establish the utility of Gen-neG in applications including
collision avoidance in self-driving simulators and safety-guarded human motion
generation.
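A minimal sketch of oracle-assisted (discriminator) guidance in the spirit described above: a classifier trained with oracle-labelled negatives shifts the score toward the positive support; the log-sigmoid form and the guidance scale are assumptions, not the exact Gen-neG objective.

import torch
import torch.nn.functional as F

def guided_score(x_t, t, score_model, discriminator, scale=1.0):
    """Score of the pre-trained model plus an oracle-classifier guidance term.

    `discriminator(x, t)` returns the logit of P(x lies in the true support);
    the gradient of its log-sigmoid pushes samples toward the support region.
    """
    x = x_t.detach().requires_grad_(True)
    log_in_support = F.logsigmoid(discriminator(x, t)).sum()
    grad = torch.autograd.grad(log_in_support, x)[0]
    return score_model(x_t, t) + scale * grad
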
July 31, 2023
Baoquan Zhang, Chuyao Luo, Demin Yu, Huiwei Lin, Xutao Li, Yunming Ye, Bowen Zhang
Equipping a deep model with the ability of few-shot learning, i.e., learning
quickly from only a few examples, is a core challenge for artificial
intelligence. Gradient-based meta-learning approaches effectively address the
challenge by learning how to learn novel tasks. Its key idea is learning a deep
model in a bi-level optimization manner, where the outer-loop process learns a
shared gradient descent algorithm (i.e., its hyperparameters), while the
inner-loop process leverages it to optimize a task-specific model using only a
few labeled examples. Although these existing methods have shown superior
performance, the outer-loop process requires calculating second-order
derivatives along the inner optimization path, which imposes considerable
memory burdens and the risk of vanishing gradients. Drawing inspiration from
recent progress in diffusion models, we find that the inner-loop gradient
descent process can actually be viewed as a reverse (i.e., denoising) process
of diffusion, where the target of denoising is the model weights rather than the
original data. Based on this observation, we propose to model the gradient
descent optimizer as a diffusion model and then present a novel
task-conditional diffusion-based meta-learning, called MetaDiff, that
effectively models the optimization process of model weights from Gaussian
noise to target weights in a denoising manner. Thanks to the training
efficiency of diffusion models, MetaDiff does not need to differentiate
through the inner-loop path, so the memory burden and the risk of
vanishing gradients are effectively alleviated. Experimental results show that
our MetaDiff outperforms the state-of-the-art gradient-based meta-learning
family in few-shot learning tasks.
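A minimal sketch of the weight-denoising view described above, assuming a task-conditional denoiser over flattened, batched weight vectors; all names are placeholders and the exact conditioning used by MetaDiff may differ.

import torch

@torch.no_grad()
def denoise_weights(w_T, task_embedding, denoiser, num_steps):
    """Produce task-specific weights by iteratively denoising Gaussian weights.

    w_T has shape (batch, dim): a batch of Gaussian-initialised weight vectors.
    The task-conditional denoiser predicts progressively cleaner weights, so no
    inner-loop computation graph (and no second-order backprop) is required.
    """
    w = w_T
    for step in reversed(range(num_steps)):
        t = torch.full((w.shape[0],), step, dtype=torch.long)
        w = denoiser(w, t, task_embedding)   # next, less-noisy weight estimate
    return w
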
UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
July 29, 2023
Sen Fang, Bowen Gao, Yangjian Wu, Teik Toe Teoh
Multimodal large models have been recognized for their advantages in various
performance and downstream tasks. The development of these models is crucial
towards achieving general artificial intelligence in the future. In this paper,
we propose a novel universal language representation learning method called
UniBriVL, which is based on Bridging-Vision-and-Language (BriVL). Universal
BriVL embeds audio, image, and text into a shared space, enabling the
realization of various multimodal applications. Our approach addresses major
challenges in robust language (both text and audio) representation learning and
effectively captures the correlation between audio and image. Additionally, we
demonstrate the qualitative evaluation of the generated images from UniBriVL,
which serves to highlight the potential of our approach in creating images from
audio. Overall, our experimental results demonstrate the efficacy of UniBriVL
in downstream tasks and its ability to choose appropriate images from audio.
The proposed approach has the potential for various applications such as speech
recognition, music signal processing, and captioning systems.
AbDiffuser: Full-Atom Generation of In-Vitro Functioning Antibodies
July 28, 2023
Karolis Martinkus, Jan Ludwiczak, Kyunghyun Cho, Wei-Ching Liang, Julien Lafrance-Vanasse, Isidro Hotzel, Arvind Rajpal, Yan Wu, Richard Bonneau, Vladimir Gligorijevic, Andreas Loukas
We introduce AbDiffuser, an equivariant and physics-informed diffusion model
for the joint generation of antibody 3D structures and sequences. AbDiffuser is
built on top of a new representation of protein structure, relies on a novel
architecture for aligned proteins, and utilizes strong diffusion priors to
improve the denoising process. Our approach improves protein diffusion by
taking advantage of domain knowledge and physics-based constraints; handles
sequence-length changes; and reduces memory complexity by an order of magnitude,
enabling backbone and side chain generation. We validate AbDiffuser in silico
and in vitro. Numerical experiments showcase the ability of AbDiffuser to
generate antibodies that closely track the sequence and structural properties
of a reference set. Laboratory experiments confirm that all 16 HER2 antibodies
discovered were expressed at high levels and that 57.1% of selected designs
were tight binders.
TEDi: Temporally-Entangled Diffusion for Long-Term Motion Synthesis
July 27, 2023
Zihan Zhang, Richard Liu, Kfir Aberman, Rana Hanocka
The gradual nature of a diffusion process that synthesizes samples in small
increments constitutes a key ingredient of Denoising Diffusion Probabilistic
Models (DDPM), which have presented unprecedented quality in image synthesis
and been recently explored in the motion domain. In this work, we propose to
adapt the gradual diffusion concept (operating along a diffusion time-axis)
into the temporal-axis of the motion sequence. Our key idea is to extend the
DDPM framework to support temporally varying denoising, thereby entangling the
two axes. Using our special formulation, we iteratively denoise a motion buffer
that contains a set of increasingly-noised poses, which auto-regressively
produces an arbitrarily long stream of frames. With a stationary diffusion
time-axis, in each diffusion step we increment only the temporal-axis of the
motion such that the framework produces a new, clean frame which is removed
from the beginning of the buffer, followed by a newly drawn noise vector that
is appended to it. This new mechanism paves the way towards a new framework for
long-term motion synthesis with applications to character animation and other
domains.
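A minimal sketch of the temporally entangled buffer described above: noise level grows with buffer position, one denoising pass is applied per emitted frame, the clean front frame is popped, and fresh noise is appended at the back; `denoise_buffer` is a placeholder for the temporally varying denoiser.

import numpy as np

def tedi_stream(denoise_buffer, buffer, num_frames, rng=None):
    """Generate an arbitrarily long motion stream from an entangled pose buffer.

    `buffer` holds L poses whose noise level grows with index (slot i sits at
    diffusion time roughly i / L). Each iteration applies one denoising pass,
    emits the now-clean front frame, shifts the buffer, and appends a freshly
    drawn Gaussian pose at the noisiest end.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    frames = []
    for _ in range(num_frames):
        buffer = denoise_buffer(buffer)              # temporally varying denoising
        frames.append(buffer[0].copy())              # front slot is (almost) clean
        buffer = np.roll(buffer, -1, axis=0)         # shift every pose forward
        buffer[-1] = rng.standard_normal(buffer.shape[1:])  # new pure-noise pose
    return np.stack(frames)
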
LLDiffusion: Learning Degradation Representations in Diffusion Models for Low-Light Image Enhancement
July 27, 2023
Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tae-Kyun Kim, Wei Liu, Hongdong Li
Current deep learning methods for low-light image enhancement (LLIE)
typically rely on pixel-wise mapping learned from paired data. However, these
methods often overlook the importance of considering degradation
representations, which can lead to sub-optimal outcomes. In this paper, we
address this limitation by proposing a degradation-aware learning scheme for
LLIE using diffusion models, which effectively integrates degradation and image
priors into the diffusion process, resulting in improved image enhancement. Our
proposed degradation-aware learning scheme is based on the understanding that
degradation representations play a crucial role in accurately modeling and
capturing the specific degradation patterns present in low-light images. To
this end, first, a joint learning framework for both image generation and image
enhancement is presented to learn the degradation representations. Second, to
leverage the learned degradation representations, we develop a Low-Light
Diffusion model (LLDiffusion) with a well-designed dynamic diffusion module.
This module takes into account both the color map and the latent degradation
representations to guide the diffusion process. By incorporating these
conditioning factors, the proposed LLDiffusion can effectively enhance
low-light images, considering both the inherent degradation patterns and the
desired color fidelity. Finally, we evaluate our proposed method on several
well-known benchmark datasets, including synthetic and real-world unpaired
datasets. Extensive experiments on public benchmarks demonstrate that our
LLDiffusion outperforms state-of-the-art LLIE methods both quantitatively and
qualitatively. The source code and pre-trained models are available at
https://github.com/TaoWangzj/LLDiffusion.
Artifact Restoration in Histology Images with Diffusion Probabilistic Models
July 26, 2023
Zhenqi He, Junjun He, Jin Ye, Yiqing Shen
Histological whole slide images (WSIs) are often compromised by
artifacts, such as tissue folding and bubbles, which will increase the
examination difficulty for both pathologists and Computer-Aided Diagnosis (CAD)
systems. Existing approaches to restoring artifact images are confined to
Generative Adversarial Networks (GANs), where the restoration process is
formulated as an image-to-image transfer. Those methods are prone to
mode collapse and unexpected mistransfer of the stain style, leading to
unsatisfactory and unrealistic restored images. Innovatively, we make the first
attempt at a denoising diffusion probabilistic model for histological artifact
restoration, namely ArtiFusion. Specifically, ArtiFusion formulates the artifact
region restoration as a gradual denoising process, and its training relies
solely on artifact-free images to simplify the training complexity. Furthermore,
to capture local-global correlations in the regional artifact restoration, a
novel Swin-Transformer denoising architecture is designed, along with a time
token scheme. Our extensive evaluations demonstrate the effectiveness of
ArtiFusion as a pre-processing method for histology analysis, which can
successfully preserve the tissue structures and stain style in artifact-free
regions during the restoration. Code is available at
https://github.com/zhenqi-he/ArtiFusion.
Pre-Training with Diffusion models for Dental Radiography segmentation
July 26, 2023
Jérémy Rousseau, Christian Alaka, Emma Covili, Hippolyte Mayard, Laura Misrachi, Willy Au
Medical radiography segmentation, and specifically dental radiography, is
highly limited by the cost of labeling which requires specific expertise and
labor-intensive annotations. In this work, we propose a straightforward
pre-training method for semantic segmentation leveraging Denoising Diffusion
Probabilistic Models (DDPM), which have shown impressive results for generative
modeling. Our straightforward approach achieves remarkable performance in terms
of label efficiency and does not require architectural modifications between
pre-training and downstream tasks. We propose to first pre-train a Unet by
exploiting the DDPM training objective, and then fine-tune the resulting model
on a segmentation task. Our experimental results on the segmentation of dental
radiographs demonstrate that the proposed method is competitive with
state-of-the-art pre-training methods.
MCMC-Correction of Score-Based Diffusion Models for Model Composition
July 26, 2023
Anders Sjöberg, Jakob Lindqvist, Magnus Önnheim, Mats Jirstrand, Lennart Svensson
Diffusion models can be parameterised in terms of either a score or an energy
function. The energy parameterisation has better theoretical properties, mainly
that it enables an extended sampling procedure with a Metropolis–Hastings
correction step, based on the change in total energy in the proposed samples.
However, it seems to yield slightly worse performance, and more importantly,
due to the widespread popularity of score-based diffusion, there is limited
availability of off-the-shelf pre-trained energy-based models. This limitation
undermines the purpose of model composition, which aims to combine pre-trained
models to sample from new distributions. Our proposal, however, suggests
retaining the score parameterization and instead computing the energy-based
acceptance probability through line integration of the score function. This
allows us to re-use existing diffusion models and still combine the reverse
process with various Markov-Chain Monte Carlo (MCMC) methods. We evaluate our
method on a 2D experiment and find that it achieves similar or arguably better
performance than the energy parameterisation.
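A minimal sketch of the proposed energy recovery: since the score is the negative energy gradient, integrating it along the straight line between the current sample and the proposal approximates their energy difference, which then enters a Metropolis-Hastings acceptance test (assuming a symmetric proposal). The trapezoidal quadrature and point count are free choices.

import torch

@torch.no_grad()
def energy_difference(score, x_a, x_b, t, n_points=16):
    """Approximate E(x_a) - E(x_b) by a line integral of the score.

    Since score(x) = -grad E(x), integrating <score(x(u)), x_b - x_a> along
    x(u) = x_a + u * (x_b - x_a), u in [0, 1], recovers E(x_a) - E(x_b).
    """
    direction = x_b - x_a
    vals = []
    for u in torch.linspace(0.0, 1.0, n_points):
        x_u = x_a + u * direction
        vals.append((score(x_u, t) * direction).flatten(1).sum(dim=1))
    vals = torch.stack(vals, dim=0)                      # (n_points, batch)
    du = 1.0 / (n_points - 1)
    return du * (0.5 * (vals[0] + vals[-1]) + vals[1:-1].sum(dim=0))

@torch.no_grad()
def mh_accept(score, x_cur, x_prop, t):
    # Accept with probability min(1, exp(E(x_cur) - E(x_prop))), i.e. the usual
    # Metropolis-Hastings ratio for p(x) proportional to exp(-E(x)).
    log_ratio = energy_difference(score, x_cur, x_prop, t)
    return torch.rand_like(log_ratio) < log_ratio.exp().clamp(max=1.0)
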
Diff-E: Diffusion-based Learning for Decoding Imagined Speech EEG
July 26, 2023
Soowon Kim, Young-Eun Lee, Seo-Hyun Lee, Seong-Whan Lee
eess.AS, cs.CL, cs.HC, cs.LG, 68T10
Decoding EEG signals for imagined speech is a challenging task due to the
high-dimensional nature of the data and low signal-to-noise ratio. In recent
years, denoising diffusion probabilistic models (DDPMs) have emerged as
promising approaches for representation learning in various domains. Our study
proposes a novel method for decoding EEG signals for imagined speech using
DDPMs and a conditional autoencoder named Diff-E. Results indicate that Diff-E
significantly improves the accuracy of decoding EEG signals for imagined speech
compared to traditional machine learning techniques and baseline models. Our
findings suggest that DDPMs can be an effective tool for EEG signal decoding,
with potential implications for the development of brain-computer interfaces
that enable communication through imagined speech.
Composite Diffusion | whole >= Σparts
July 25, 2023
Vikram Jamwal, Ramaneswaran S
cs.CV, cs.AI, cs.GR, cs.HC, I.3.3; I.4.6; I.4.9; J.5
For an artist or a graphic designer, the spatial layout of a scene is a
critical design choice. However, existing text-to-image diffusion models
provide limited support for incorporating spatial information. This paper
introduces Composite Diffusion as a means for artists to generate high-quality
images by composing from the sub-scenes. The artists can specify the
arrangement of these sub-scenes through a flexible free-form segment layout.
They can describe the content of each sub-scene primarily using natural text
and additionally by utilizing reference images or control inputs such as line
art, scribbles, human pose, canny edges, and more.
We provide a comprehensive and modular method for Composite Diffusion that
enables alternative ways of generating, composing, and harmonizing sub-scenes.
Further, we wish to evaluate the composite image for effectiveness in both
image quality and achieving the artist’s intent. We argue that existing image
quality metrics lack a holistic evaluation of image composites. To address
this, we propose novel quality criteria especially relevant to composite
generation.
We believe that our approach provides an intuitive method of art creation.
Through extensive user surveys, quantitative and qualitative analysis, we show
how it achieves greater spatial, semantic, and creative control over image
generation. In addition, our methods do not need to retrain or modify the
architecture of the base diffusion models and can work in a plug-and-play
manner with the fine-tuned models.
Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry
July 24, 2023
Yong-Hyun Park, Mingi Kwon, Jaewoong Choi, Junghyo Jo, Youngjung Uh
Despite the success of diffusion models (DMs), we still lack a thorough
understanding of their latent space. To understand the latent space
$\mathbf{x}_t \in \mathcal{X}$, we analyze it from a geometrical perspective.
Our approach involves deriving the local latent basis within $\mathcal{X}$ by
leveraging the pullback metric associated with their encoding feature maps.
Remarkably, our discovered local latent basis enables image editing
capabilities by moving $\mathbf{x}_t$, the latent space of DMs, along the basis
vector at specific timesteps. We further analyze how the geometric structure of
DMs evolves over diffusion timesteps and differs across different text
conditions. This confirms the known phenomenon of coarse-to-fine generation, as
well as reveals novel insights such as the discrepancy between $\mathbf{x}_t$
across timesteps, the effect of dataset complexity, and the time-varying
influence of text prompts. To the best of our knowledge, this paper is the
first to present image editing through $\mathbf{x}$-space traversal, editing
only once at specific timestep $t$ without any additional training, and
providing thorough analyses of the latent structure of DMs. The code to
reproduce our experiments can be found at
https://github.com/enkeejunior1/Diffusion-Pullback.
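A simplified stand-in for the pullback construction described above: the dominant right-singular vectors of the Jacobian of an encoding feature map at x_t, estimated by power iteration with jvp/vjp, are taken as the local latent basis; `feature_map` and the deflation scheme are assumptions, not the paper's exact procedure.

import torch
from torch.autograd.functional import jvp, vjp

def local_latent_basis(feature_map, x_t, n_vectors=3, n_iters=10):
    """Estimate dominant directions of the pullback metric at x_t.

    Power iteration on J^T J (J = Jacobian of `feature_map` at x_t), using
    jvp for J v and vjp for J^T u, with deflation against directions already
    found. The returned unit vectors span a local basis in x_t-space.
    """
    basis = []
    for _ in range(n_vectors):
        v = torch.randn_like(x_t)
        for _ in range(n_iters):
            for b in basis:                      # deflate previous directions
                v = v - (v * b).sum() * b
            _, Jv = jvp(feature_map, (x_t,), (v,))
            _, JTJv = vjp(feature_map, (x_t,), Jv)
            v = JTJv[0] / JTJv[0].norm()
        basis.append(v)
    return basis

def edit_latent(x_t, basis_vector, step=1.0):
    # Move the noisy latent along one discovered direction to edit the image.
    return x_t + step * basis_vector
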
AdvDiff: Generating Unrestricted Adversarial Examples using Diffusion Models
July 24, 2023
Xuelong Dai, Kaisheng Liang, Bin Xiao
Unrestricted adversarial attacks present a serious threat to deep learning
models and adversarial defense techniques. They pose severe security problems
for deep learning applications because they can effectively bypass defense
mechanisms. However, previous attack methods often utilize Generative
Adversarial Networks (GANs), which are not theoretically provable and thus
generate unrealistic examples by incorporating adversarial objectives,
especially for large-scale datasets like ImageNet. In this paper, we propose a
new method, called AdvDiff, to generate unrestricted adversarial examples with
diffusion models. We design two novel adversarial guidance techniques to
conduct adversarial sampling in the reverse generation process of diffusion
models. These two techniques are effective and stable, generating high-quality,
realistic adversarial examples by interpretably integrating the gradients of the
target classifier. Experimental results on MNIST and ImageNet datasets
demonstrate that AdvDiff is effective at generating unrestricted adversarial
examples and outperforms GAN-based methods in terms of attack performance
and generation quality.
DiAMoNDBack: Diffusion-denoising Autoregressive Model for Non-Deterministic Backmapping of Cα Protein Traces
July 23, 2023
Michael S. Jones, Kirill Shmilovich, Andrew L. Ferguson
Coarse-grained molecular models of proteins permit access to length and time
scales unattainable by all-atom models and the simulation of processes that
occur on long-time scales such as aggregation and folding. The reduced
resolution realizes computational accelerations but an atomistic representation
can be vital for a complete understanding of mechanistic details. Backmapping
is the process of restoring all-atom resolution to coarse-grained molecular
models. In this work, we report DiAMoNDBack (Diffusion-denoising Autoregressive
Model for Non-Deterministic Backmapping) as an autoregressive denoising
diffusion probability model to restore all-atom details to coarse-grained
protein representations retaining only Cα coordinates. The
autoregressive generation process proceeds from the protein N-terminus to
C-terminus in a residue-by-residue fashion conditioned on the Cα trace
and previously backmapped backbone and side chain atoms within the local
neighborhood. The local and autoregressive nature of our model makes it
transferable between proteins. The stochastic nature of the denoising diffusion
process means that the model generates a realistic ensemble of backbone and
side chain all-atom configurations consistent with the coarse-grained Cα
trace. We train DiAMoNDBack over 65k+ structures from Protein Data Bank (PDB)
and validate it in applications to a hold-out PDB test set,
intrinsically-disordered protein structures from the Protein Ensemble Database
(PED), molecular dynamics simulations of fast-folding mini-proteins from DE
Shaw Research, and coarse-grained simulation data. We achieve state-of-the-art
reconstruction performance in terms of correct bond formation, avoidance of
side chain clashes, and diversity of the generated side chain configurational
states. We make the DiAMoNDBack model publicly available as a free and open-source
Python package.
Synthesis of Batik Motifs using a Diffusion – Generative Adversarial Network
July 22, 2023
One Octadion, Novanto Yudistira, Diva Kurnianingtyas
Batik, a unique blend of art and craftsmanship, is a distinct artistic and
technological creation for Indonesian society. Research on batik motifs is
primarily focused on classification. However, further studies may extend to the
synthesis of batik patterns. Generative Adversarial Networks (GANs) have been
an important deep learning model for generating synthetic data, but often face
challenges in the stability and consistency of results. This research focuses
on the use of StyleGAN2-Ada and Diffusion techniques to produce realistic and
high-quality synthetic batik patterns. StyleGAN2-Ada is a variation of the GAN
model that separates the style and content aspects in an image, whereas
diffusion techniques introduce random noise into the data. In the context of
batik, StyleGAN2-Ada and Diffusion are used to produce realistic synthetic
batik patterns. This study also made adjustments to the model architecture and
used a well-curated batik dataset. The main goal is to assist batik designers
or craftsmen in producing unique and quality batik motifs with efficient
production time and costs. Based on qualitative and quantitative evaluations,
the results show that the model tested is capable of producing authentic and
quality batik patterns, with finer details and rich artistic variations. The
dataset and code can be accessed
here: https://github.com/octadion/diffusion-stylegan2-ada-pytorch
July 22, 2023
Yi Qin, Xiaomeng Li
Unsupervised deformable image registration is one of the challenging tasks in
medical imaging. Obtaining a high-quality deformation field while preserving
deformation topology remains demanding amid a series of deep-learning-based
solutions. Meanwhile, the diffusion model’s latent feature space shows
potential in modeling the deformation semantics. To fully exploit the diffusion
model’s ability to guide the registration task, we present two modules:
Feature-wise Diffusion-Guided Module (FDG) and Score-wise Diffusion-Guided
Module (SDG). Specifically, FDG uses the diffusion model’s multi-scale semantic
features to guide the generation of the deformation field. SDG uses the
diffusion score to guide the optimization process for preserving deformation
topology with barely any additional computation. Experiment results on the 3D
medical cardiac image registration task validate our model’s ability to provide
refined deformation fields with preserved topology effectively. Code is
available at: https://github.com/xmed-lab/FSDiffReg.git.
PartDiff: Image Super-resolution with Partial Diffusion Models
July 21, 2023
Kai Zhao, Alex Ling Yu Hung, Kaifeng Pang, Haoxin Zheng, Kyunghyun Sung
Denoising diffusion probabilistic models (DDPMs) have achieved impressive
performance on various image generation tasks, including image
super-resolution. By learning to reverse the process of gradually diffusing the
data distribution into Gaussian noise, DDPMs generate new data by iteratively
denoising from random noise. Despite their impressive performance,
diffusion-based generative models suffer from high computational costs due to
the large number of denoising steps. In this paper, we first observed that the
intermediate latent states gradually converge and become indistinguishable when
diffusing a pair of low- and high-resolution images. This observation inspired
us to propose the Partial Diffusion Model (PartDiff), which diffuses the image
to an intermediate latent state instead of pure random noise, where the
intermediate latent state is approximated by the latent of diffusing the
low-resolution image. During generation, Partial Diffusion Models start
denoising from the intermediate distribution and perform only a part of the
denoising steps. Additionally, to mitigate the error caused by the
approximation, we introduce “latent alignment”, which aligns the latent between
low- and high-resolution images during training. Experiments on both magnetic
resonance imaging (MRI) and natural images show that, compared to plain
diffusion-based super-resolution methods, Partial Diffusion Models
significantly reduce the number of denoising steps without sacrificing the
quality of generation.
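A minimal sketch of partial diffusion for super-resolution as described above: the upsampled low-resolution image is diffused to an intermediate step t_mid and the reverse process runs only from there; `denoise_step`, `alpha_bar`, and the bicubic upsampling are placeholder choices, and the latent-alignment training step is omitted.

import torch
import torch.nn.functional as F

@torch.no_grad()
def partdiff_sr(lr_image, t_mid, alpha_bar, denoise_step, scale=4):
    """Partial-diffusion super-resolution: denoise only from t_mid, not from T.

    The upsampled low-resolution image stands in for the intermediate latent of
    the unknown high-resolution image, so only t_mid reverse steps are needed.
    """
    x0_guess = F.interpolate(lr_image, scale_factor=scale, mode="bicubic")
    a = alpha_bar(t_mid)
    x = a.sqrt() * x0_guess + (1 - a).sqrt() * torch.randn_like(x0_guess)
    for t in range(t_mid, 0, -1):
        x = denoise_step(x, t)       # one reverse (denoising) step
    return x
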
FEDD – Fair, Efficient, and Diverse Diffusion-based Lesion Segmentation and Malignancy Classification
July 21, 2023
Héctor Carrión, Narges Norouzi
Skin diseases affect millions of people worldwide, across all ethnicities.
Increasing diagnosis accessibility requires fair and accurate segmentation and
classification of dermatology images. However, the scarcity of annotated
medical images, especially for rare diseases and underrepresented skin tones,
poses a challenge to the development of fair and accurate models. In this
study, we introduce a Fair, Efficient, and Diverse Diffusion-based framework
for skin lesion segmentation and malignancy classification. FEDD leverages
semantically meaningful feature embeddings learned through a denoising
diffusion probabilistic backbone and processes them via linear probes to
achieve state-of-the-art performance on Diverse Dermatology Images (DDI). We
achieve an improvement in intersection over union of 0.18, 0.13, 0.06, and 0.07
while using only 5%, 10%, 15%, and 20% labeled samples, respectively.
Additionally, FEDD trained on 10% of DDI demonstrates malignancy classification
accuracy of 81%, 14% higher compared to the state-of-the-art. We showcase high
efficiency in data-constrained scenarios while providing fair performance for
diverse skin tones and rare malignancy conditions. Our newly annotated DDI
segmentation masks and training code can be found on
https://github.com/hectorcarrion/fedd.
Predict, Refine, Synthesize: Self-Guiding Diffusion Models for Probabilistic Time Series Forecasting
July 21, 2023
Marcel Kollovieh, Abdul Fatir Ansari, Michael Bohlke-Schneider, Jasper Zschiegner, Hao Wang, Yuyang Wang
Diffusion models have achieved state-of-the-art performance in generative
modeling tasks across various domains. Prior works on time series diffusion
models have primarily focused on developing conditional models tailored to
specific forecasting or imputation tasks. In this work, we explore the
potential of task-agnostic, unconditional diffusion models for several time
series applications. We propose TSDiff, an unconditionally-trained diffusion
model for time series. Our proposed self-guidance mechanism enables
conditioning TSDiff for downstream tasks during inference, without requiring
auxiliary networks or altering the training procedure. We demonstrate the
effectiveness of our method on three different time series tasks: forecasting,
refinement, and synthetic data generation. First, we show that TSDiff is
competitive with several task-specific conditional forecasting methods
(predict). Second, we leverage the learned implicit probability density of
TSDiff to iteratively refine the predictions of base forecasters with reduced
computational overhead over reverse diffusion (refine). Notably, the generative
performance of the model remains intact – downstream forecasters trained on
synthetic samples from TSDiff outperform forecasters that are trained on
samples from other state-of-the-art generative time series models, occasionally
even outperforming models trained on real data (synthesize).
Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning
July 21, 2023
Jian Ma, Junhao Liang, Chen Chen, Haonan Lu
Recent progress in personalized image generation using diffusion models has
been significant. However, development in the area of open-domain and
non-fine-tuning personalized image generation is proceeding rather slowly. In
this paper, we propose Subject-Diffusion, a novel open-domain personalized
image generation model that, in addition to not requiring test-time
fine-tuning, also only requires a single reference image to support
personalized generation of single or multiple subjects in any domain. Firstly, we
construct an automatic data labeling tool and use the LAION-Aesthetics dataset
to construct a large-scale dataset consisting of 76M images and their
corresponding subject detection bounding boxes, segmentation masks and text
descriptions. Secondly, we design a new unified framework that combines text
and image semantics by incorporating coarse location and fine-grained reference
image control to maximize subject fidelity and generalization. Furthermore, we
also adopt an attention control mechanism to support multi-subject generation.
Extensive qualitative and quantitative results demonstrate that our method
outperforms other SOTA frameworks in single, multiple, and human customized
image generation. Please refer to our project page:
https://oppo-mente-lab.github.io/subject_diffusion/
DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport
July 21, 2023
Zezeng Li, ShengHao Li, Zhanpeng Wang, Na Lei, Zhongxuan Luo, Xianfeng Gu
Sampling from diffusion probabilistic models (DPMs) can be viewed as a
piecewise distribution transformation, which generally requires hundreds or
thousands of steps of the inverse diffusion trajectory to get a high-quality
image. Recent progress in designing fast samplers for DPMs achieves a trade-off
between sampling speed and sample quality by knowledge distillation or
adjusting the variance schedule or the denoising equation. However, these samplers cannot be
optimal in both aspects and often suffer from mode mixture when using few steps. To
tackle this problem, we innovatively regard inverse diffusion as an optimal
transport (OT) problem between latents at different stages and propose the
DPM-OT, a unified learning framework for fast DPMs with a direct expressway
represented by OT map, which can generate high-quality samples within around 10
function evaluations. By calculating the semi-discrete optimal transport map
between the data latents and the white noise, we obtain an expressway from the
prior distribution to the data distribution, while significantly alleviating
the problem of mode mixture. In addition, we give the error bound of the
proposed method, which theoretically guarantees the stability of the algorithm.
Extensive experiments validate the effectiveness and advantages of DPM-OT in
terms of speed and quality (FID and mode mixture), thus representing an
efficient solution for generative modeling. Source codes are available at
https://github.com/cognaclee/DPM-OT
Progressive distillation diffusion for raw music generation
July 20, 2023
Svetlana Pavlova
This paper aims to apply a new deep learning approach to the task of
generating raw audio files. It is based on diffusion models, a recent type of
deep generative model. This new type of method has recently shown outstanding
results with image generation. A lot of focus has been given to those models by
the computer vision community. On the other hand, far less attention has been
given to other types of applications, such as music generation in the waveform
domain. In this paper, a model for unconditional generation applied to music is
implemented: progressive distillation diffusion with a 1D U-Net. Then, a
comparison of different diffusion parameters and their effect on the final
result is presented. One big advantage of the methods implemented in this work
is that the model is able to handle progressive audio processing and
generation, using a transformation from 1-channel 128 x 384 to 3-channel
128 x 128 mel-spectrograms and looped generation. The empirical comparisons are
carried out across different self-collected datasets.
Diffusion Models for Probabilistic Deconvolution of Galaxy Images
July 20, 2023
Zhiwei Xue, Yuhang Li, Yash Patel, Jeffrey Regier
astro-ph.IM, cs.LG, stat.AP
Telescopes capture images with a particular point spread function (PSF).
Inferring what an image would have looked like with a much sharper PSF, a
problem known as PSF deconvolution, is ill-posed because PSF convolution is not
an invertible transformation. Deep generative models are appealing for PSF
deconvolution because they can infer a posterior distribution over candidate
images that, if convolved with the PSF, could have generated the observation.
However, classical deep generative models such as VAEs and GANs often provide
inadequate sample diversity. As an alternative, we propose a classifier-free
conditional diffusion model for PSF deconvolution of galaxy images. We
demonstrate that this diffusion model captures a greater diversity of possible
deconvolutions compared to a conditional VAE.
Diffusion Sampling with Momentum for Mitigating Divergence Artifacts
July 20, 2023
Suttisak Wizadwongsa, Worameth Chinchuthakun, Pramook Khungurn, Amit Raj, Supasorn Suwajanakorn
Despite the remarkable success of diffusion models in image generation, slow
sampling remains a persistent issue. To accelerate the sampling process, prior
studies have reformulated diffusion sampling as an ODE/SDE and introduced
higher-order numerical methods. However, these methods often produce divergence
artifacts, especially with a low number of sampling steps, which limits the
achievable acceleration. In this paper, we investigate the potential causes of
these artifacts and suggest that the small stability regions of these methods
could be the principal cause. To address this issue, we propose two novel
techniques. The first technique involves the incorporation of Heavy Ball (HB)
momentum, a well-known technique for improving optimization, into existing
diffusion numerical methods to expand their stability regions. We also prove
that the resulting methods have first-order convergence. The second technique,
called Generalized Heavy Ball (GHVB), constructs a new high-order method that
offers a variable trade-off between accuracy and artifact suppression.
Experimental results show that our techniques are highly effective in reducing
artifacts and improving image quality, surpassing state-of-the-art diffusion
solvers on both pixel-based and latent-based diffusion models for low-step
sampling. Our research provides novel insights into the design of numerical
methods for future diffusion work.
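The core idea of adding Heavy Ball momentum to a diffusion solver can be illustrated with a plain Euler step whose drift is replaced by a running momentum term. The following is a schematic sketch under that reading of the abstract; `drift_fn` and the exact update form are assumptions, not the paper's HB/GHVB formulation.

```python
import torch

@torch.no_grad()
def euler_hb_sample(drift_fn, x_T, timesteps, beta=0.8):
    """Euler sampling of a diffusion ODE with a Heavy-Ball-style momentum
    buffer on the drift. `drift_fn(x, t)` is a hypothetical function
    returning dx/dt of the probability-flow ODE (e.g. built from a trained
    noise predictor); the update form is schematic, not the paper's GHVB."""
    x = x_T
    velocity = torch.zeros_like(x)
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        dt = t_next - t_cur
        # Mix the new drift with the previous velocity (momentum), then step.
        velocity = beta * velocity + (1.0 - beta) * drift_fn(x, t_cur)
        x = x + dt * velocity
    return x

if __name__ == "__main__":
    toy_drift = lambda x, t: -x                      # toy linear drift
    ts = torch.linspace(1.0, 0.0, 21)                # 20 sampling steps
    print(euler_hb_sample(toy_drift, torch.randn(4, 3, 32, 32), ts).shape)
```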
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
July 20, 2023
Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou
Recent text-to-image diffusion models have demonstrated an astonishing
capacity to generate high-quality images. However, researchers mainly studied
the way of synthesizing images with only text prompts. While some works have
explored using other modalities as conditions, considerable paired data, e.g.,
box/mask-image pairs, and fine-tuning time are required for nurturing models.
As such paired data is time-consuming and labor-intensive to acquire and
restricted to a closed set, this potentially becomes the bottleneck for
applications in an open world. This paper focuses on the simplest form of
user-provided conditions, e.g., box or scribble. To mitigate the aforementioned
problem, we propose a training-free method to control objects and contexts in
the synthesized images adhering to the given spatial conditions. Specifically,
three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints,
are designed and seamlessly integrated into the denoising step of diffusion
models, requiring neither additional training nor massive annotated layout data.
Extensive experimental results demonstrate that the proposed constraints can
control what and where to present in the images while retaining the ability of
Diffusion models to synthesize with high fidelity and diverse concept coverage.
The code is publicly available at https://github.com/showlab/BoxDiff.
AdjointDPM: Adjoint Sensitivity Method for Gradient Backpropagation of Diffusion Probabilistic Models
July 20, 2023
Jiachun Pan, Jun Hao Liew, Vincent Y. F. Tan, Jiashi Feng, Hanshu Yan
Existing customization methods require access to multiple reference examples
to align pre-trained diffusion probabilistic models (DPMs) with user-provided
concepts. This paper aims to address the challenge of DPM customization when
the only available supervision is a differentiable metric defined on the
generated contents. Since the sampling procedure of DPMs involves recursive
calls to the denoising UNet, naïve gradient backpropagation requires storing
the intermediate states of all iterations, resulting in extremely high memory
consumption. To overcome this issue, we propose a novel method AdjointDPM,
which first generates new samples from diffusion models by solving the
corresponding probability-flow ODEs. It then uses the adjoint sensitivity
method to backpropagate the gradients of the loss to the models’ parameters
(including conditioning signals, network weights, and initial noises) by
solving another augmented ODE. To reduce numerical errors in both the forward
generation and gradient backpropagation processes, we further reparameterize
the probability-flow ODE and augmented ODE as simple non-stiff ODEs using
exponential integration. Finally, we demonstrate the effectiveness of
AdjointDPM on three interesting tasks: converting visual effects into
identification text embeddings, finetuning DPMs for specific types of
stylization, and optimizing initial noise to generate adversarial samples for
security auditing.
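The memory argument above (adjoint sensitivities instead of storing every intermediate state) can be illustrated with an off-the-shelf adjoint ODE solver. The sketch below uses the torchdiffeq library's `odeint_adjoint` on a toy drift network standing in for a denoising U-Net; it is a generic illustration of adjoint backpropagation through an ODE, not the AdjointDPM implementation (which additionally reparameterizes the ODEs with exponential integration).

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint as odeint    # pip install torchdiffeq

class ToyProbabilityFlow(nn.Module):
    """Stand-in for a probability-flow ODE whose drift has learnable
    parameters (a small MLP instead of a denoising U-Net)."""
    def __init__(self, dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(),
                                 nn.Linear(64, dim))

    def forward(self, t, x):
        t_feat = t.expand(x.shape[0], 1)             # broadcast scalar time
        return self.net(torch.cat([x, t_feat], dim=-1))

if __name__ == "__main__":
    ode = ToyProbabilityFlow()
    x_T = torch.randn(16, 8)                         # "initial noise"
    t_grid = torch.linspace(1.0, 0.0, 10)
    x_0 = odeint(ode, x_T, t_grid)[-1]               # generate by solving the ODE
    loss = x_0.pow(2).mean()                         # differentiable metric on samples
    loss.backward()                                  # gradients via the adjoint ODE
    print(ode.net[0].weight.grad.shape)
```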
PreDiff: Precipitation Nowcasting with Latent Diffusion Models
July 19, 2023
Zhihan Gao, Xingjian Shi, Boran Han, Hao Wang, Xiaoyong Jin, Danielle Maddix, Yi Zhu, Mu Li, Yuyang Wang
Earth system forecasting has traditionally relied on complex physical models
that are computationally expensive and require significant domain expertise. In
the past decade, the unprecedented increase in spatiotemporal Earth observation
data has enabled data-driven forecasting models using deep learning techniques.
These models have shown promise for diverse Earth system forecasting tasks but
either struggle with handling uncertainty or neglect domain-specific prior
knowledge, resulting in averaging possible futures to blurred forecasts or
generating physically implausible predictions. To address these limitations, we
propose a two-stage pipeline for probabilistic spatiotemporal forecasting: 1)
We develop PreDiff, a conditional latent diffusion model capable of
probabilistic forecasts. 2) We incorporate an explicit knowledge alignment
mechanism to align forecasts with domain-specific physical constraints. This is
achieved by estimating the deviation from imposed constraints at each denoising
step and adjusting the transition distribution accordingly. We conduct
empirical studies on two datasets: N-body MNIST, a synthetic dataset with
chaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset.
Specifically, we impose the law of conservation of energy in N-body MNIST and
anticipated precipitation intensity in SEVIR. Experiments demonstrate the
effectiveness of PreDiff in handling uncertainty, incorporating domain-specific
prior knowledge, and generating forecasts that exhibit high operational
utility.
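A minimal way to picture the knowledge-alignment mechanism is to shift each denoising transition mean against the gradient of a constraint-violation term. The sketch below follows that reading; `constraint_fn`, the step size, and the energy example are illustrative assumptions rather than PreDiff's exact formulation.

```python
import torch

def knowledge_aligned_mean(mu, x_t, constraint_fn, step_size=0.1):
    """Nudge a denoising transition mean `mu` against the gradient of a
    constraint-violation functional evaluated at the current state `x_t`.
    `constraint_fn` is a hypothetical differentiable scalar, e.g. the
    squared deviation of total energy from a target value."""
    x = x_t.detach().requires_grad_(True)
    grad, = torch.autograd.grad(constraint_fn(x), x)
    return mu - step_size * grad

if __name__ == "__main__":
    target_energy = 1.0
    energy_violation = lambda x: (x.pow(2).sum() - target_energy) ** 2
    mu = torch.zeros(1, 1, 8, 8)
    x_t = torch.randn(1, 1, 8, 8)
    print(knowledge_aligned_mean(mu, x_t, energy_violation).shape)
```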
TokenFlow: Consistent Diffusion Features for Consistent Video Editing
July 19, 2023
Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel
The generative AI revolution has recently expanded to videos. Nevertheless,
current state-of-the-art video models are still lagging behind image models in
terms of visual quality and user control over the generated content. In this
work, we present a framework that harnesses the power of a text-to-image
diffusion model for the task of text-driven video editing. Specifically, given
a source video and a target text-prompt, our method generates a high-quality
video that adheres to the target text, while preserving the spatial layout and
motion of the input video. Our method is based on a key observation that
consistency in the edited video can be obtained by enforcing consistency in the
diffusion feature space. We achieve this by explicitly propagating diffusion
features based on inter-frame correspondences, readily available in the model.
Thus, our framework does not require any training or fine-tuning, and can work
in conjunction with any off-the-shelf text-to-image editing method. We
demonstrate state-of-the-art editing results on a variety of real-world videos.
Webpage: https://diffusion-tokenflow.github.io/
BSDM: Background Suppression Diffusion Model for Hyperspectral Anomaly Detection
July 19, 2023
Jitao Ma, Weiying Xie, Yunsong Li, Leyuan Fang
Hyperspectral anomaly detection (HAD) is widely used in Earth observation and
deep space exploration. A major challenge for HAD is the complex background of
the input hyperspectral images (HSIs), resulting in anomalies confused in the
background. On the other hand, the lack of labeled samples for HSIs leads to
poor generalization of existing HAD methods. This paper makes the first
attempt to study a new and generalizable background learning problem without
labeled samples. We present a novel solution BSDM (background suppression
diffusion model) for HAD, which can simultaneously learn latent background
distributions and generalize to different datasets for suppressing complex
background. It is featured in three aspects: (1) For the complex background of
HSIs, we design pseudo background noise and learn the potential background
distribution in it with a diffusion model (DM). (2) For the generalizability
problem, we apply a statistical offset module so that the BSDM adapts to
datasets of different domains without labeling samples. (3) For achieving
background suppression, we innovatively improve the inference process of DM by
feeding the original HSIs into the denoising network, which removes the
background as noise. Our work paves a new way for background suppression in HAD
that can improve HAD performance without the prerequisite of manually labeled
data. Assessments and generalization experiments of four HAD methods on several
real HSI datasets demonstrate the above three unique properties of the proposed
method. The code is available at https://github.com/majitao-xd/BSDM-HAD.
DiffDP: Radiotherapy Dose Prediction via a Diffusion Model
July 19, 2023
Zhenghao Feng, Lu Wen, Peng Wang, Binyu Yan, Xi Wu, Jiliu Zhou, Yan Wang
eess.IV, cs.CV, physics.med-ph
Currently, deep learning (DL) has achieved the automatic prediction of dose
distribution in radiotherapy planning, enhancing its efficiency and quality.
However, existing methods suffer from the over-smoothing problem for their
commonly used L_1 or L_2 loss with posterior average calculations. To alleviate
this limitation, we innovatively introduce a diffusion-based dose prediction
(DiffDP) model for predicting the radiotherapy dose distribution of cancer
patients. Specifically, the DiffDP model contains a forward process and a
reverse process. In the forward process, DiffDP gradually transforms dose
distribution maps into Gaussian noise by adding small noise and trains a noise
predictor to predict the noise added in each timestep. In the reverse process,
it removes the noise from the original Gaussian noise in multiple steps with
the well-trained noise predictor and finally outputs the predicted dose
distribution map. To ensure the accuracy of the prediction, we further design a
structure encoder to extract anatomical information from patient anatomy images
and enable the noise predictor to be aware of the dose constraints within
several essential organs, i.e., the planning target volume and organs at risk.
Extensive experiments on an in-house dataset with 130 rectum cancer patients
demonstrate the superiority of the proposed method.
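The forward-process/noise-predictor training loop described above is the standard DDPM recipe with an extra anatomical conditioning input. The following is a generic sketch of one such training step; `noise_predictor`, `anatomy_feat`, and the schedule are placeholders, not the DiffDP code.

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(noise_predictor, dose_map, anatomy_feat, alphas_cumprod):
    """One noise-prediction training step: the forward process adds Gaussian
    noise at a random timestep and the predictor regresses that noise.
    `noise_predictor(x_t, t, cond)` and `anatomy_feat` (the structure
    encoder output) are hypothetical interfaces, not the DiffDP code."""
    b = dose_map.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=dose_map.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(dose_map)
    x_t = a_bar.sqrt() * dose_map + (1 - a_bar).sqrt() * noise   # forward process
    return F.mse_loss(noise_predictor(x_t, t, anatomy_feat), noise)

if __name__ == "__main__":
    betas = torch.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    toy_predictor = lambda x_t, t, cond: torch.zeros_like(x_t)  # placeholder net
    dose, anatomy = torch.rand(2, 1, 64, 64), torch.rand(2, 1, 64, 64)
    print(ddpm_training_step(toy_predictor, dose, anatomy, alphas_cumprod).item())
```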
Text2Layer: Layered Image Generation using Latent Diffusion Model
July 19, 2023
Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien
Layer compositing is one of the most popular image editing workflows among
both amateurs and professionals. Motivated by the success of diffusion models,
we explore layer compositing from a layered image generation perspective.
Instead of generating an image, we propose to generate background, foreground,
layer mask, and the composed image simultaneously. To achieve layered image
generation, we train an autoencoder that is able to reconstruct layered images
and train diffusion models on the latent representation. One benefit of the
proposed problem is to enable better compositing workflows in addition to the
high-quality image output. Another benefit is producing higher-quality layer
masks compared to masks produced by a separate step of image segmentation.
Experimental results show that the proposed method is able to generate
high-quality layered images and initiates a benchmark for future work.
Polyffusion: A Diffusion Model for Polyphonic Score Generation with Internal and External Controls
July 19, 2023
Lejun Min, Junyan Jiang, Gus Xia, Jingwei Zhao
We propose Polyffusion, a diffusion model that generates polyphonic music
scores by regarding music as image-like piano roll representations. The model
is capable of controllable music generation with two paradigms: internal
control and external control. Internal control refers to the process in which
users pre-define a part of the music and then let the model infill the rest,
similar to the task of masked music generation (or music inpainting). External
control conditions the model with external yet related information, such as
chord, texture, or other features, via the cross-attention mechanism. We show
that by using internal and external controls, Polyffusion unifies a wide range
of music creation tasks, including melody generation given accompaniment,
accompaniment generation given melody, arbitrary music segment inpainting, and
music arrangement given chords or textures. Experimental results show that our
model significantly outperforms existing Transformer and sampling-based
baselines, and using pre-trained disentangled representations as external
conditions yields more effective controls.
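Internal control of the kind described (infilling a user-specified part of the piano roll) is commonly realized by overwriting the known region with a re-noised copy of the user content at every reverse step. The sketch below shows that generic mask-based infilling pattern; it is an assumption about the mechanism, not Polyffusion's implementation.

```python
import torch

@torch.no_grad()
def infill_reverse_step(denoise_step, x_t, t, known_roll, known_mask, alphas_cumprod):
    """One reverse step with mask-based infilling of a piano-roll image:
    the region the user fixed (`known_mask` == 1) is overwritten with a
    re-noised copy of `known_roll`, the rest is generated. `denoise_step`
    is a hypothetical function returning x_{t-1} from the trained model."""
    x_prev = denoise_step(x_t, t)
    if t > 0:
        a_bar = alphas_cumprod[t - 1]
        noised_known = a_bar.sqrt() * known_roll + (1 - a_bar).sqrt() * torch.randn_like(known_roll)
    else:
        noised_known = known_roll
    return known_mask * noised_known + (1 - known_mask) * x_prev

if __name__ == "__main__":
    T = 100
    a_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)
    toy_step = lambda x, t: 0.95 * x                    # toy reverse step
    roll = torch.rand(1, 1, 128, 128)                   # user-specified content
    mask = torch.zeros_like(roll)
    mask[..., :64] = 1.0                                # fix the first half
    x = torch.randn_like(roll)
    for t in range(T - 1, -1, -1):
        x = infill_reverse_step(toy_step, x, t, roll, mask, a_bar)
    print(x.shape)
```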
DreaMR: Diffusion-driven Counterfactual Explanation for Functional MRI
July 18, 2023
Hasan Atakan Bedel, Tolga Çukur
Deep learning analyses have offered sensitivity leaps in detection of
cognitive states from functional MRI (fMRI) measurements across the brain. Yet,
as deep models perform hierarchical nonlinear transformations on their input,
interpreting the association between brain responses and cognitive states is
challenging. Among common explanation approaches for deep fMRI classifiers,
attribution methods show poor specificity and perturbation methods show limited
plausibility. While counterfactual generation promises to address these
limitations, previous methods use variational or adversarial priors that yield
suboptimal sample fidelity. Here, we introduce the first diffusion-driven
counterfactual method, DreaMR, to enable fMRI interpretation with high
specificity, plausibility and fidelity. DreaMR performs diffusion-based
resampling of an input fMRI sample to alter the decision of a downstream
classifier, and then computes the minimal difference between the original and
counterfactual samples for explanation. Unlike conventional diffusion methods,
DreaMR leverages a novel fractional multi-phase-distilled diffusion prior to
improve sampling efficiency without compromising fidelity, and it employs a
transformer architecture to account for long-range spatiotemporal context in
fMRI scans. Comprehensive experiments on neuroimaging datasets demonstrate the
superior specificity, fidelity and efficiency of DreaMR in sample generation
over state-of-the-art counterfactual methods for fMRI interpretation.
Autoregressive Diffusion Model for Graph Generation
July 17, 2023
Lingkai Kong, Jiaming Cui, Haotian Sun, Yuchen Zhuang, B. Aditya Prakash, Chao Zhang
Diffusion-based graph generative models have recently obtained promising
results for graph generation. However, existing diffusion-based graph
generative models are mostly one-shot generative models that apply Gaussian
diffusion in the dequantized adjacency matrix space. Such a strategy can suffer
from difficulty in model training, slow sampling speed, and incapability of
incorporating constraints. We propose an \emph{autoregressive diffusion} model
for graph generation. Unlike existing methods, we define a node-absorbing
diffusion process that operates directly in the discrete graph space. For
forward diffusion, we design a \emph{diffusion ordering network}, which learns
a data-dependent node absorbing ordering from graph topology. For reverse
generation, we design a \emph{denoising network} that uses the reverse node
ordering to efficiently reconstruct the graph by predicting, one node at a
time, the type of the new node and its edges to previously denoised nodes.
Based on the permutation invariance of graphs, we show that the two networks can be
jointly trained by optimizing a simple lower bound of data likelihood. Our
experiments on six diverse generic graph datasets and two molecule datasets
show that our model achieves better or comparable generation performance with
previous state-of-the-art, and meanwhile enjoys fast generation speed.
Diffusion Models Beat GANs on Image Classification
July 17, 2023
Soumik Mukhopadhyay, Matthew Gwilliam, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Srinidhi Hegde, Tianyi Zhou, Abhinav Shrivastava
While many unsupervised learning models focus on one family of tasks, either
generative or discriminative, we explore the possibility of a unified
representation learner: a model which uses a single pre-training stage to
address both families of tasks simultaneously. We identify diffusion models as
a prime candidate. Diffusion models have risen to prominence as a
state-of-the-art method for image generation, denoising, inpainting,
super-resolution, manipulation, etc. Such models involve training a U-Net to
iteratively predict and remove noise, and the resulting model can synthesize
high fidelity, diverse, novel images. The U-Net architecture, as a
convolution-based architecture, generates a diverse set of feature
representations in the form of intermediate feature maps. We present our
findings that these embeddings are useful beyond the noise prediction task, as
they contain discriminative information and can also be leveraged for
classification. We explore optimal methods for extracting and using these
embeddings for classification tasks, demonstrating promising results on the
ImageNet classification task. We find that with careful feature selection and
pooling, diffusion models outperform comparable generative-discriminative
methods such as BigBiGAN for classification tasks. We investigate diffusion
models in the transfer learning regime, examining their performance on several
fine-grained visual classification datasets. We compare these embeddings to
those generated by competing architectures and pre-trainings for classification
tasks.
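The feature-extraction-and-pooling recipe can be pictured as a frozen noise-prediction U-Net, a forward hook on an intermediate block, global average pooling, and a linear probe. The sketch below follows that pattern with a toy network; the choice of layer, pooling, and probe here is illustrative, not the configuration selected in the paper.

```python
import torch
import torch.nn as nn

class DiffusionLinearProbe(nn.Module):
    """Linear probe on pooled intermediate features of a frozen
    noise-prediction network."""
    def __init__(self, unet, feature_module, feat_dim, num_classes):
        super().__init__()
        self.unet = unet.eval()
        self.classifier = nn.Linear(feat_dim, num_classes)
        self._feat = None
        feature_module.register_forward_hook(self._hook)   # grab activations

    def _hook(self, module, inputs, output):
        self._feat = output

    def forward(self, x, t):
        with torch.no_grad():
            self.unet(x, t)                      # run the frozen U-Net
        pooled = self._feat.mean(dim=(2, 3))     # global average pooling
        return self.classifier(pooled)

if __name__ == "__main__":
    class ToyUNet(nn.Module):
        def __init__(self):
            super().__init__()
            self.mid = nn.Conv2d(3, 16, 3, padding=1)
            self.out = nn.Conv2d(16, 3, 3, padding=1)
        def forward(self, x, t):                 # t ignored in this toy model
            return self.out(self.mid(x))

    unet = ToyUNet()
    probe = DiffusionLinearProbe(unet, unet.mid, feat_dim=16, num_classes=10)
    print(probe(torch.randn(2, 3, 32, 32), t=torch.tensor([50, 50])).shape)
```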
Flow Matching in Latent Space
July 17, 2023
Quan Dao, Hao Phung, Binh Nguyen, Anh Tran
Flow matching is a recent framework to train generative models that exhibits
impressive empirical performance while being relatively easier to train
compared with diffusion-based models. Despite its advantageous properties,
prior methods still face the challenges of expensive computing and a large
number of function evaluations of off-the-shelf solvers in the pixel space.
Furthermore, although latent-based generative methods have shown great success
in recent years, this particular model type remains underexplored in this area.
In this work, we propose to apply flow matching in the latent spaces of
pretrained autoencoders, which offers improved computational efficiency and
scalability for high-resolution image synthesis. This enables flow-matching
training on constrained computational resources while maintaining their quality
and flexibility. Additionally, our work stands as a pioneering contribution in
the integration of various conditions into flow matching for conditional
generation tasks, including label-conditioned image generation, image
inpainting, and semantic-to-image generation. Through extensive experiments,
our approach demonstrates its effectiveness in both quantitative and
qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church &
Bedroom, and ImageNet. We also provide a theoretical control of the
Wasserstein-2 distance between the reconstructed latent flow distribution and
true data distribution, showing it is upper-bounded by the latent flow matching
objective. Our code will be available at
https://github.com/VinAIResearch/LFM.git.
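The training objective sketched by the abstract is the standard flow-matching regression loss, computed on latents of a frozen autoencoder along a straight interpolation path. Below is a minimal sketch under those assumptions; the encoder and velocity-network interfaces are placeholders, not the authors' code.

```python
import torch
import torch.nn.functional as F

def latent_flow_matching_loss(velocity_net, encoder, images):
    """Flow-matching regression loss in the latent space of a frozen
    autoencoder, using a straight-line interpolation path between prior
    noise and data latents. Interface names are illustrative."""
    with torch.no_grad():
        z1 = encoder(images)                        # data latents
    z0 = torch.randn_like(z1)                       # prior samples
    t = torch.rand(z1.shape[0], *([1] * (z1.dim() - 1)), device=z1.device)
    z_t = (1 - t) * z0 + t * z1                     # point on the straight path
    target_v = z1 - z0                              # its constant velocity
    return F.mse_loss(velocity_net(z_t, t), target_v)

if __name__ == "__main__":
    toy_encoder = lambda x: x.flatten(1)[:, :16]    # toy "pretrained encoder"
    toy_vnet = lambda z, t: torch.zeros_like(z)     # toy velocity field
    print(latent_flow_matching_loss(toy_vnet, toy_encoder, torch.randn(4, 3, 8, 8)).item())
```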
SEMI-DiffusionInst: A Diffusion Model Based Approach for Semiconductor Defect Classification and Segmentation
July 17, 2023
Vic De Ridder, Bappaditya Dey, Sandip Halder, Bartel Van Waeyenberge
With continuous progression of Moore’s Law, integrated circuit (IC) device
complexity is also increasing. Scanning Electron Microscope (SEM) image based
extensive defect inspection and accurate metrology extraction are two main
challenges in advanced node (2 nm and beyond) technology. Deep learning (DL)
algorithm based computer vision approaches gained popularity in semiconductor
defect inspection over last few years. In this research work, a new
semiconductor defect inspection framework “SEMI-DiffusionInst” is investigated
and compared to previous frameworks. To the best of the authors’ knowledge,
this work is the first demonstration to accurately detect and precisely segment
semiconductor defect patterns by using a diffusion model. Different feature
extractor networks as backbones and data sampling strategies are investigated
towards achieving a balanced trade-off between precision and computing
efficiency. Our proposed approach outperforms previous work on overall mAP and
performs comparably or better for almost all defect classes (per-class APs).
The bounding box and segmentation mAPs achieved by the proposed
SEMI-DiffusionInst model are improved by 3.83% and 2.10%, respectively. Among
individual defect types, detection precision on line collapse and thin bridge
defects is improved by approximately 15% for both defect types. It has
also been shown that by tuning inference hyperparameters, inference time can be
improved significantly without compromising model precision. Finally, certain
limitations and future work strategy to overcome them are discussed.
Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions
July 17, 2023
Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka, Komei Sugiura
In this study, we aim to develop a model that comprehends a natural language
instruction (e.g., “Go to the living room and get the nearest pillow to the
radio art on the wall”) and generates a segmentation mask for the target
everyday object. The task is challenging because it requires (1) the
understanding of the referring expressions for multiple objects in the
instruction, (2) the prediction of the target phrase of the sentence among the
multiple phrases, and (3) the generation of pixel-wise segmentation masks
rather than bounding boxes. Studies have been conducted on language-based
segmentation methods; however, they sometimes mask irrelevant regions for
complex sentences. In this paper, we propose the Multimodal Diffusion
Segmentation Model (MDSM), which generates a mask in the first stage and
refines it in the second stage. We introduce a crossmodal parallel feature
extraction mechanism and extend diffusion probabilistic models to handle
crossmodal features. To validate our model, we built a new dataset based on the
well-known Matterport3D and REVERIE datasets. This dataset consists of
instructions with complex referring expressions accompanied by real indoor
environmental images that feature various target objects, in addition to
pixel-wise segmentation masks. The performance of MDSM surpassed that of the
baseline method by a large margin of +10.13 mean IoU.
Synthetic Lagrangian Turbulence by Generative Diffusion Models
July 17, 2023
Tianyi Li, Luca Biferale, Fabio Bonaccorso, Martino Andrea Scarpolini, Michele Buzzicotti
physics.flu-dyn, cond-mat.stat-mech, cs.CE, cs.LG, nlin.CD
Lagrangian turbulence lies at the core of numerous applied and fundamental
problems related to the physics of dispersion and mixing in engineering,
bio-fluids, atmosphere, oceans, and astrophysics. Despite exceptional
theoretical, numerical, and experimental efforts conducted over the past thirty
years, no existing models are capable of faithfully reproducing statistical and
topological properties exhibited by particle trajectories in turbulence. We
propose a machine learning approach, based on a state-of-the-art Diffusion
Model, to generate single-particle trajectories in three-dimensional turbulence
at high Reynolds numbers, thereby bypassing the need for direct numerical
simulations or experiments to obtain reliable Lagrangian data. Our model
demonstrates the ability to quantitatively reproduce all relevant statistical
benchmarks over the entire range of time scales, including the presence of
fat-tailed distributions of the velocity increments, anomalous power laws, and
enhancement of intermittency around the dissipative scale. The model exhibits
good generalizability for extreme events, achieving unprecedented intensity and
rarity. This paves the way for producing synthetic high-quality datasets for
pre-training various downstream applications of Lagrangian turbulence.
Manifold-Guided Sampling in Diffusion Models for Unbiased Image Generation
July 17, 2023
Xingzhe Su, Yi Ren, Wenwen Qiang, Zeen Song, Hang Gao, Fengge Wu, Changwen Zheng
Diffusion models are a potent class of generative models capable of producing
high-quality images. However, they can face challenges related to data bias,
favoring specific modes of data, especially when the training data does not
accurately represent the true data distribution and exhibits skewed or
imbalanced patterns. For instance, the CelebA dataset contains more female
images than male images, leading to biased generation results and impacting
downstream applications. To address this issue, we propose a novel method that
leverages manifold guidance to mitigate data bias in diffusion models. Our key
idea is to estimate the manifold of the training data using an unsupervised
approach, and then use it to guide the sampling process of diffusion models.
This encourages the generated images to be uniformly distributed on the data
manifold without altering the model architecture or necessitating labels or
retraining. Theoretical analysis and empirical evidence demonstrate the
effectiveness of our method in improving the quality and unbiasedness of image
generation compared to standard diffusion models.
Solving Inverse Problems with Latent Diffusion Models via Hard Data Consistency
July 16, 2023
Bowen Song, Soo Min Kwon, Zecheng Zhang, Xinyu Hu, Qing Qu, Liyue Shen
Diffusion models have recently emerged as powerful generative priors for
solving inverse problems. However, training diffusion models in the pixel space
is both data-intensive and computationally demanding, which restricts their
applicability as priors for high-dimensional real-world data such as medical
images. Latent diffusion models, which operate in a much lower-dimensional
space, offer a solution to these challenges. However, incorporating latent
diffusion models to solve inverse problems remains a challenging problem due to
the nonlinearity of the encoder and decoder. To address these issues, we
propose \textit{ReSample}, an algorithm that can solve general inverse problems
with pre-trained latent diffusion models. Our algorithm incorporates data
consistency by solving an optimization problem during the reverse sampling
process, a concept that we term as hard data consistency. Upon solving this
optimization problem, we propose a novel resampling scheme to map the
measurement-consistent sample back onto the noisy data manifold and
theoretically demonstrate its benefits. Lastly, we apply our algorithm to solve
a wide range of linear and nonlinear inverse problems in both natural and
medical images, demonstrating that our approach outperforms existing
state-of-the-art approaches, including those based on pixel-space diffusion
models.
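The two ingredients described, an inner optimization that enforces measurement consistency and a resampling step that maps the result back onto the noisy manifold, can be sketched as follows. The decoder, measurement operator, and optimizer settings are illustrative stand-ins rather than the ReSample algorithm's exact choices.

```python
import torch

def hard_data_consistency(z0_hat, decoder, forward_op, y, n_steps=20, lr=1e-2):
    """Inner optimization that refines the current latent estimate so the
    decoded image matches the measurements y under the operator A.
    The decoder, operator, and optimizer settings are illustrative."""
    z = z0_hat.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        opt.zero_grad()
        loss = (forward_op(decoder(z)) - y).pow(2).mean()
        loss.backward()
        opt.step()
    return z.detach()

def stochastic_resample(z0_consistent, t, alphas_cumprod):
    """Map the measurement-consistent latent back onto the noisy manifold
    at time t by re-noising it with the forward process."""
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * z0_consistent + (1 - a_bar).sqrt() * torch.randn_like(z0_consistent)

if __name__ == "__main__":
    toy_decoder = lambda z: z                       # identity "decoder"
    A = lambda x: x[:, ::2]                         # toy subsampling operator
    y = torch.randn(1, 8)                           # toy measurements
    a_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 100), dim=0)
    z_fit = hard_data_consistency(torch.randn(1, 16), toy_decoder, A, y)
    print(stochastic_resample(z_fit, t=50, alphas_cumprod=a_bar).shape)
```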
Diffusion to Confusion: Naturalistic Adversarial Patch Generation Based on Diffusion Model for Object Detector
July 16, 2023
Shuo-Yen Lin, Ernie Chu, Che-Hsien Lin, Jun-Cheng Chen, Jia-Ching Wang
Many physical adversarial patch generation methods are widely proposed to
protect personal privacy from malicious monitoring using object detectors.
However, they usually fail to generate satisfactory patch images in terms of
both stealthiness and attack performance without making huge efforts on careful
hyperparameter tuning. To address this issue, we propose a novel naturalistic
adversarial patch generation method based on diffusion models (DMs). Through
sampling the optimal image from a DM pretrained on natural images, it
allows us to stably craft high-quality and naturalistic physical adversarial
patches to humans without suffering from serious mode collapse problems as
other deep generative models. To the best of our knowledge, we are the first to
propose DM-based naturalistic adversarial patch generation for object
detectors. With extensive quantitative, qualitative, and subjective
experiments, the results demonstrate the effectiveness of the proposed approach
to generate better-quality and more naturalistic adversarial patches while
achieving attack performance comparable to other state-of-the-art patch
generation methods. We also show various generation trade-offs under different
conditions.
LafitE: Latent Diffusion Model with Feature Editing for Unsupervised Multi-class Anomaly Detection
July 16, 2023
Haonan Yin, Guanlong Jiao, Qianhui Wu, Borje F. Karlsson, Biqing Huang, Chin Yew Lin
In the context of flexible manufacturing systems that are required to produce
different types and quantities of products with minimal reconfiguration, this
paper addresses the problem of unsupervised multi-class anomaly detection:
develop a unified model to detect anomalies from objects belonging to multiple
classes when only normal data is accessible. We first explore the
generative-based approach and investigate latent diffusion models for
reconstruction to mitigate the notorious "identity shortcut" issue in
auto-encoder based methods. We then introduce a feature editing strategy that
modifies the input feature space of the diffusion model to further alleviate
"identity shortcuts" and meanwhile improve the reconstruction quality of
normal regions, leading to fewer false positive predictions. Moreover, we are
the first who pose the problem of hyperparameter selection in unsupervised
anomaly detection, and propose a solution of synthesizing anomaly data for a
pseudo validation set to address this problem. Extensive experiments on
benchmark datasets MVTec-AD and MPDD show that the proposed LafitE, i.e., Latent
Diffusion Model with Feature Editing, outperforms state-of-the-art methods by a
significant margin in terms of average AUROC. The hyperparameters selected via
our pseudo validation set are well-matched to the real test set.
ExposureDiffusion: Learning to Expose for Low-light Image Enhancement
July 15, 2023
Yufei Wang, Yi Yu, Wenhan Yang, Lanqing Guo, Lap-Pui Chau, Alex C. Kot, Bihan Wen
Previous raw image-based low-light image enhancement methods predominantly
relied on feed-forward neural networks to learn deterministic mappings from
low-light to normally-exposed images. However, they failed to capture critical
distribution information, leading to visually undesirable results. This work
addresses the issue by seamlessly integrating a diffusion model with a
physics-based exposure model. Different from a vanilla diffusion model that has
to perform Gaussian denoising, with the injected physics-based exposure model,
our restoration process can directly start from a noisy image instead of pure
noise. As such, our method obtains significantly improved performance and
reduced inference time compared with vanilla diffusion models. To make full use
of the advantages of different intermediate steps, we further propose an
adaptive residual layer that effectively screens out the side-effect in the
iterative refinement when the intermediate results have been already
well-exposed. The proposed framework is compatible with real-paired datasets,
real/synthetic noise models, and different backbone networks. We evaluate the
proposed method on various
public benchmarks, achieving promising results with consistent improvements
using different exposure models and backbones. Besides, the proposed method
achieves better generalization capacity for unseen amplifying ratios and better
performance than a larger feedforward neural model when few parameters are
adopted.
Diffusion Based Multi-Agent Adversarial Tracking
July 12, 2023
Sean Ye, Manisha Natarajan, Zixuan Wu, Matthew Gombolay
Target tracking plays a crucial role in real-world scenarios, particularly in
drug-trafficking interdiction, where the knowledge of an adversarial target’s
location is often limited. Improving autonomous tracking systems will enable
unmanned aerial, surface, and underwater vehicles to better assist in
interdicting smugglers that use manned surface, semi-submersible, and aerial
vessels. As unmanned drones proliferate, accurate autonomous target estimation
is even more crucial for security and safety. This paper presents Constrained
Agent-based Diffusion for Enhanced Multi-Agent Tracking (CADENCE), an approach
aimed at generating comprehensive predictions of adversary locations by
leveraging past sparse state information. To assess the effectiveness of this
approach, we evaluate predictions on single-target and multi-target pursuit
environments, employing Monte-Carlo sampling of the diffusion model to estimate
the probability associated with each generated trajectory. We propose a novel
cross-attention based diffusion model that utilizes constraint-based sampling
to generate multimodal track hypotheses. Our single-target model surpasses the
performance of all baseline methods on Average Displacement Error (ADE) for
predictions across all time horizons.
DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation
July 12, 2023
Yipeng Leng, Qiangjuan Huang, Zhiyuan Wang, Yangyang Liu, Haoyu Zhang
Diffusion probabilistic models (DPMs) have shown remarkable results on
various image synthesis tasks such as text-to-image generation and image
inpainting. However, compared to other generative methods like VAEs and GANs,
DPMs lack a low-dimensional, interpretable, and well-decoupled latent code.
Recently, diffusion autoencoders (Diff-AE) were proposed to explore the
potential of DPMs for representation learning via autoencoding. Diff-AE
provides an accessible latent space that exhibits remarkable interpretability,
allowing us to manipulate image attributes based on latent codes from the
space. However, previous works are not generic as they only operated on a few
limited attributes. To further explore the latent space of Diff-AE and achieve
a generic editing pipeline, we propose a module called Group-supervised
AutoEncoder (dubbed GAE) for Diff-AE to achieve better disentanglement of the
latent code. Our proposed GAE is trained via an attribute-swap strategy to
acquire the latent codes for multi-attribute image manipulation based on
examples. We empirically demonstrate that our method enables
multiple-attributes manipulation and achieves convincing sample quality and
attribute alignments, while significantly reducing computational requirements
compared to pixel-based approaches for representational decoupling. Code will
be released soon.
Metropolis Sampling for Constrained Diffusion Models
July 11, 2023
Nic Fishman, Leo Klarner, Emile Mathieu, Michael Hutchinson, Valentin de Bortoli
Denoising diffusion models have recently emerged as the predominant paradigm
for generative modelling on image domains. In addition, their extension to
Riemannian manifolds has facilitated a range of applications across the natural
sciences. While many of these problems stand to benefit from the ability to
specify arbitrary, domain-informed constraints, this setting is not covered by
the existing (Riemannian) diffusion model methodology. Recent work has
attempted to address this issue by constructing novel noising processes based
on the reflected Brownian motion and logarithmic barrier methods. However, the
associated samplers are either computationally burdensome or only apply to
convex subsets of Euclidean space. In this paper, we introduce an alternative,
simple noising scheme based on Metropolis sampling that affords substantial
gains in computational efficiency and empirical performance compared to the
earlier samplers. Of independent interest, we prove that this new process
corresponds to a valid discretisation of the reflected Brownian motion. We
demonstrate the scalability and flexibility of our approach on a range of
problem settings with convex and non-convex constraints, including applications
from geospatial modelling, robotics and protein design.
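The flavour of the proposed noising scheme can be conveyed by a simple Metropolis-style step that proposes a Gaussian move and rejects it whenever it leaves the constraint set. The sketch below is that simplified Euclidean picture; the paper's construction on Riemannian manifolds and its correspondence to reflected Brownian motion are not captured here.

```python
import torch

def metropolis_constrained_step(x, step_size, in_constraint_set):
    """Propose a Gaussian move and keep it only for samples that stay inside
    the constraint set; rejected samples remain where they are. This is a
    simplified Euclidean picture of the scheme described above."""
    proposal = x + step_size.sqrt() * torch.randn_like(x)
    accept = in_constraint_set(proposal)                        # bool per sample
    accept = accept.view(-1, *([1] * (x.dim() - 1))).to(x.dtype)
    return accept * proposal + (1 - accept) * x

if __name__ == "__main__":
    inside_unit_ball = lambda p: p.norm(dim=-1) <= 1.0          # toy convex set
    x = torch.zeros(8, 2)
    for _ in range(200):
        x = metropolis_constrained_step(x, torch.tensor(0.01), inside_unit_ball)
    print(x.norm(dim=-1).max().item() <= 1.0)                   # never leaves the ball
```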
Geometric Neural Diffusion Processes
July 11, 2023
Emile Mathieu, Vincent Dutordoir, Michael J. Hutchinson, Valentin De Bortoli, Yee Whye Teh, Richard E. Turner
Denoising diffusion models have proven to be a flexible and effective
paradigm for generative modelling. Their recent extension to infinite
dimensional Euclidean spaces has allowed for the modelling of stochastic
processes. However, many problems in the natural sciences incorporate
symmetries and involve data living in non-Euclidean spaces. In this work, we
extend the framework of diffusion models to incorporate a series of geometric
priors in infinite-dimension modelling. We do so by a) constructing a noising
process which admits, as limiting distribution, a geometric Gaussian process
that transforms under the symmetry group of interest, and b) approximating the
score with a neural network that is equivariant w.r.t. this group. We show that
with these conditions, the generative functional model admits the same
symmetry. We demonstrate scalability and capacity of the model, using a novel
Langevin-based conditional sampler, to fit complex scalar and vector fields,
with Euclidean and spherical codomain, on synthetic and real-world weather
data.
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
July 10, 2023
Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai
With the advance of text-to-image models (e.g., Stable Diffusion) and
corresponding personalization techniques such as DreamBooth and LoRA, everyone
can manifest their imagination into high-quality images at an affordable cost.
Subsequently, there is a great demand for image animation techniques to further
combine generated static images with motion dynamics. In this report, we
propose a practical framework to animate most of the existing personalized
text-to-image models once and for all, saving efforts in model-specific tuning.
At the core of the proposed framework is to insert a newly initialized motion
modeling module into the frozen text-to-image model and train it on video clips
to distill reasonable motion priors. Once trained, by simply injecting this
motion modeling module, all personalized versions derived from the same base
T2I readily become text-driven models that produce diverse and personalized
animated images. We conduct our evaluation on several public representative
personalized text-to-image models across anime pictures and realistic
photographs, and demonstrate that our proposed framework helps these models
generate temporally smooth animation clips while preserving the domain and
diversity of their outputs. Code and pre-trained weights will be publicly
available at https://animatediff.github.io/ .
Timbre transfer using image-to-image denoising diffusion models
July 10, 2023
Luca Comanducci, Fabio Antonacci, Augusto Sarti
Timbre transfer techniques aim at converting the sound of a musical piece
generated by one instrument into the same one as if it was played by another
instrument, while maintaining as much as possible the content in terms of
musical characteristics such as melody and dynamics. Following their recent
breakthroughs in deep learning-based generation, we apply Denoising Diffusion
Models (DDMs) to perform timbre transfer. Specifically, we apply the recently
proposed Denoising Diffusion Implicit Models (DDIMs), which accelerate
the sampling procedure. Inspired by the recent application of DDMs to image
translation problems we formulate the timbre transfer task similarly, by first
converting the audio tracks into log mel spectrograms and by conditioning the
generation of the desired timbre spectrogram through the input timbre
spectrogram. We perform both one-to-one and many-to-many timbre transfer, by
converting audio waveforms containing only single instruments and multiple
instruments, respectively. We compare the proposed technique with existing
state-of-the-art methods both through listening tests and objective measures in
order to demonstrate the effectiveness of the proposed model.
Geometric Constraints in Probabilistic Manifolds: A Bridge from Molecular Dynamics to Structured Diffusion Processes
July 10, 2023
Justin Diamond, Markus Lill
Understanding the macroscopic characteristics of biological complexes demands
precision and specificity in statistical ensemble modeling. One of the primary
challenges in this domain lies in sampling from particular subsets of the
state-space, driven either by existing structural knowledge or specific areas
of interest within the state-space. We propose a method that enables sampling
from distributions that rigorously adhere to arbitrary sets of geometric
constraints in Euclidean spaces. This is achieved by integrating a constraint
projection operator within the well-regarded architecture of Denoising
Diffusion Probabilistic Models, a framework founded in generative modeling and
probabilistic inference. The significance of this work becomes apparent, for
instance, in the context of deep learning-based drug design, where it is
imperative to maintain specific molecular profile interactions to realize the
desired therapeutic outcomes and guarantee safety.
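The constraint-projection idea can be summarized as interleaving a projection operator with the reverse diffusion steps. The sketch below shows that generic pattern; `denoise_step` and `project` are hypothetical callables, and the box-constraint example is purely illustrative.

```python
import torch

@torch.no_grad()
def projected_reverse_diffusion(denoise_step, project, x_T, timesteps):
    """Reverse diffusion with a constraint-projection operator applied after
    every denoising step. `denoise_step(x, t)` and `project(x)` are
    hypothetical callables: one reverse step of a trained DDPM, and a map
    onto the feasible set defined by the geometric constraints."""
    x = x_T
    for t in timesteps:
        x = denoise_step(x, t)
        x = project(x)                     # enforce the constraints
    return x

if __name__ == "__main__":
    project_box = lambda x: x.clamp(-2.0, 2.0)               # toy constraint set
    toy_step = lambda x, t: x + 0.05 * torch.randn_like(x)   # toy reverse step
    out = projected_reverse_diffusion(toy_step, project_box,
                                      torch.randn(10, 3), range(50, 0, -1))
    print(out.abs().max().item() <= 2.0)
```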
DIFF-NST: Diffusion Interleaving for deFormable Neural Style Transfer
July 09, 2023
Dan Ruta, Gemma Canet Tarrés, Andrew Gilbert, Eli Shechtman, Nicholas Kolkin, John Collomosse
Neural Style Transfer (NST) is the field of study applying neural techniques
to modify the artistic appearance of a content image to match the style of a
reference style image. Traditionally, NST methods have focused on texture-based
image edits, affecting mostly low level information and keeping most image
structures the same. However, style-based deformation of the content is
desirable for some styles, especially in cases where the style is abstract or
the primary concept of the style is in its deformed rendition of some content.
With the recent introduction of diffusion models, such as Stable Diffusion, we
can access far more powerful image generation techniques, enabling new
possibilities. In our work, we propose using this new class of models to
perform style transfer while enabling deformable style transfer, an elusive
capability in previous models. We show how leveraging the priors of these
models can expose new artistic controls at inference time, and we document our
findings in exploring this new direction for the field of style transfer.
Score-based Conditional Generation with Fewer Labeled Data by Self-calibrating Classifier Guidance
July 09, 2023
Paul Kuo-Ming Huang, Si-An Chen, Hsuan-Tien Lin
Score-based generative models (SGMs) are a popular family of deep generative
models that achieve leading image generation quality. Early studies extend SGMs
to tackle class-conditional generation by coupling an unconditional SGM with
the guidance of a trained classifier. Nevertheless, such classifier-guided SGMs
do not always achieve accurate conditional generation, especially when trained
with fewer labeled data. We argue that the problem is rooted in the
classifier’s tendency to overfit without coordinating with the underlying
unconditional distribution. We propose improving classifier-guided SGMs by
letting the classifier regularize itself to respect the unconditional
distribution. Our key idea is to use principles from energy-based models to
convert the classifier into another view of the unconditional SGM. Then, existing
loss for the unconditional SGM can be leveraged to achieve regularization by
calibrating the classifier’s internal unconditional scores. The regularization
scheme can be applied to not only the labeled data but also unlabeled ones to
further improve the classifier. Empirical results show that the proposed
approach significantly improves conditional generation quality across various
percentages of fewer labeled data. The results confirm the potential of the
proposed approach for generative modeling with limited labeled data.
Measuring the Success of Diffusion Models at Imitating Human Artists
July 08, 2023
Stephen Casper, Zifan Guo, Shreya Mogulothu, Zachary Marinov, Chinmay Deshpande, Rui-Jie Yew, Zheng Dai, Dylan Hadfield-Menell
Modern diffusion models have set the state-of-the-art in AI image generation.
Their success is due, in part, to training on Internet-scale data which often
includes copyrighted work. This prompts questions about the extent to which
these models learn from, imitate, or copy the work of human artists. This work
suggests that tying copyright liability to the capabilities of the model may be
useful given the evolving ecosystem of generative models. Specifically, much of
the legal analysis of copyright and generative systems focuses on the use of
protected data for training. As a result, the connections between data,
training, and the system are often obscured. In our approach, we consider
simple image classification techniques to measure a model’s ability to imitate
specific artists. Specifically, we use Contrastive Language-Image Pretrained
(CLIP) encoders to classify images in a zero-shot fashion. Our process first
prompts a model to imitate a specific artist. Then, we test whether CLIP can be
used to reclassify the artist (or the artist’s work) from the imitation. If
these tests match the imitation back to the original artist, this suggests the
model can imitate that artist’s expression. Our approach is simple and
quantitative. Furthermore, it uses standard techniques and does not require
additional training. We demonstrate our approach with an audit of Stable
Diffusion’s capacity to imitate 70 professional digital artists with
copyrighted work online. When Stable Diffusion is prompted to imitate an artist
from this set, we find that the artist can be identified from the imitation
with an average accuracy of 81.0%. Finally, we also show that a sample of the
artist’s work can be matched to these imitation images with a high degree of
statistical reliability. Overall, these results suggest that Stable Diffusion
is broadly successful at imitating individual human artists.
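The zero-shot reclassification step can be reproduced in a few lines with an off-the-shelf CLIP checkpoint. The sketch below uses the Hugging Face transformers CLIP interface; the prompt template and checkpoint name are assumptions, not the audit's exact protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def classify_imitation(image_path, artist_names,
                       model_name="openai/clip-vit-base-patch32"):
    """Zero-shot CLIP classification of a generated image against a list of
    artist names. The prompt template and checkpoint are assumptions, not
    the exact protocol used in the audit."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    prompts = [f"an artwork in the style of {name}" for name in artist_names]
    inputs = processor(text=prompts, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image        # (1, num_artists)
    probs = logits.softmax(dim=-1)[0]
    return artist_names[int(probs.argmax())], probs

# Example (hypothetical file and artist list):
# best, probs = classify_imitation("imitation.png", ["Artist A", "Artist B"])
```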
Stimulating the Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling
July 08, 2023
Tong Li, Hansen Feng, Lizhi Wang, Zhiwei Xiong, Hua Huang
Image denoising is a fundamental problem in computational photography, where
achieving high perception with low distortion is highly demanding. Current
methods either struggle with perceptual quality or suffer from significant
distortion. Recently, the emerging diffusion model has achieved
state-of-the-art performance in various tasks and demonstrates great potential
for image denoising. However, stimulating diffusion models for image denoising
is not straightforward and requires solving several critical problems. For one
thing, the input inconsistency hinders the connection between diffusion models
and image denoising. For another, the content inconsistency between the
generated image and the desired denoised image introduces distortion. To tackle
these problems, we present a novel strategy called the Diffusion Model for
Image Denoising (DMID) by understanding and rethinking the diffusion model from
a denoising perspective. Our DMID strategy includes an adaptive embedding
method that embeds the noisy image into a pre-trained unconditional diffusion
model and an adaptive ensembling method that reduces distortion in the denoised
image. Our DMID strategy achieves state-of-the-art performance on both
distortion-based and perception-based metrics, for both Gaussian and real-world
image denoising. The code is available at https://github.com/Li-Tong-621/DMID.
Unsupervised 3D out-of-distribution detection with latent diffusion models
July 07, 2023
Mark S. Graham, Walter Hugo Lopez Pinaya, Paul Wright, Petru-Daniel Tudosiu, Yee H. Mah, James T. Teo, H. Rolf Jäger, David Werring, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
Methods for out-of-distribution (OOD) detection that scale to 3D data are
crucial components of any real-world clinical deep learning system. Classic
denoising diffusion probabilistic models (DDPMs) have been recently proposed as
a robust way to perform reconstruction-based OOD detection on 2D datasets, but
do not trivially scale to 3D data. In this work, we propose to use Latent
Diffusion Models (LDMs), which enable the scaling of DDPMs to high-resolution
3D medical data. We validate the proposed approach on near- and far-OOD
datasets and compare it to a recently proposed, 3D-enabled approach using
Latent Transformer Models (LTMs). Not only does the proposed LDM-based approach
achieve statistically significant better performance, it also shows less
sensitivity to the underlying latent representation, more favourable memory
scaling, and produces better spatial anomaly maps. Code is available at
https://github.com/marksgraham/ddpm-ood
AutoDecoding Latent 3D Diffusion Models
July 07, 2023
Evangelos Ntavelis, Aliaksandr Siarohin, Kyle Olszewski, Chaoyang Wang, Luc Van Gool, Sergey Tulyakov
We present a novel approach to the generation of static and articulated 3D
assets that has a 3D autodecoder at its core. The 3D autodecoder framework
embeds properties learned from the target dataset in the latent space, which
can then be decoded into a volumetric representation for rendering
view-consistent appearance and geometry. We then identify the appropriate
intermediate volumetric latent space, and introduce robust normalization and
de-normalization operations to learn a 3D diffusion from 2D images or monocular
videos of rigid or articulated objects. Our approach is flexible enough to use
either existing camera supervision or no camera information at all – instead
efficiently learning it during training. Our evaluations demonstrate that our
generation results outperform state-of-the-art alternatives on various
benchmark datasets and metrics, including multi-view image datasets of
synthetic objects, real in-the-wild videos of moving people, and a large-scale,
real video dataset of static objects.
Simulation-free Schrödinger bridges via score and flow matching
July 07, 2023
Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, Yoshua Bengio
We present simulation-free score and flow matching ([SF]$^2$M), a
simulation-free objective for inferring stochastic dynamics given unpaired
samples drawn from arbitrary source and target distributions. Our method
generalizes both the score-matching loss used in the training of diffusion
models and the recently proposed flow matching loss used in the training of
continuous normalizing flows. [SF]$^2$M interprets continuous-time stochastic
generative modeling as a Schrödinger bridge problem. It relies on static
entropy-regularized optimal transport, or a minibatch approximation, to
efficiently learn the SB without simulating the learned stochastic process. We
find that [SF]$^2$M is more efficient and gives more accurate solutions to the
SB problem than simulation-based methods from prior work. Finally, we apply
[SF]$^2$M to the problem of learning cell dynamics from snapshot data. Notably,
[SF]$^2$M is the first method to accurately model cell dynamics in high
dimensions and can recover known gene regulatory networks from simulated data.
Hyperspectral and Multispectral Image Fusion Using the Conditional Denoising Diffusion Probabilistic Model
July 07, 2023
Shuaikai Shi, Lijun Zhang, Jie Chen
Hyperspectral images (HSI) have a large amount of spectral information
reflecting the characteristics of matter, while their spatial resolution is low
due to the limitations of imaging technology. Complementary to this are
multispectral images (MSI), e.g., RGB images, with high spatial resolution but
insufficient spectral bands. Hyperspectral and multispectral image fusion is a
technique for acquiring ideal images that have both high spatial and high
spectral resolution cost-effectively. Many existing HSI and MSI fusion
algorithms rely on known imaging degradation models, which are often not
available in practice. In this paper, we propose a deep fusion method based on
the conditional denoising diffusion probabilistic model, called DDPM-Fus.
Specifically, the DDPM-Fus contains the forward diffusion process which
gradually adds Gaussian noise to the high spatial resolution HSI (HrHSI) and
another reverse denoising process which learns to predict the desired HrHSI
from its noisy version conditioning on the corresponding high spatial
resolution MSI (HrMSI) and low spatial resolution HSI (LrHSI). Once the
training is complete, the proposed DDPM-Fus implements the reverse process on
the test HrMSI and LrHSI to generate the fused HrHSI. Experiments conducted on
one indoor and two remote sensing datasets show the superiority of the proposed
model when compared with other advanced deep learning-based fusion methods. The
code of this work will be open-sourced at:
https://github.com/shuaikaishi/DDPMFus for reproducibility.
DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models
July 05, 2023
Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang
Despite the ability of existing large-scale text-to-image (T2I) models to
generate high-quality images from detailed textual descriptions, they often
lack the ability to precisely edit the generated or real images. In this paper,
we propose a novel image editing method, DragonDiffusion, enabling Drag-style
manipulation on Diffusion models. Specifically, we construct classifier
guidance based on the strong correspondence of intermediate features in the
diffusion model. It can transform the editing signals into gradients via
feature correspondence loss to modify the intermediate representation of the
diffusion model. Based on this guidance strategy, we also build a multi-scale
guidance to consider both semantic and geometric alignment. Moreover, a
cross-branch self-attention is added to maintain the consistency between the
original image and the editing result. Our method, through an efficient design,
achieves various editing modes for the generated or real images, such as object
moving, object resizing, object appearance replacement, and content dragging.
It is worth noting that all editing and content preservation signals come from
the image itself, and the model does not require fine-tuning or additional
modules. Our source code will be available at
https://github.com/MC-E/DragonDiffusion.
RADiff: Controllable Diffusion Models for Radio Astronomical Maps Generation
July 05, 2023
Renato Sortino, Thomas Cecconello, Andrea DeMarco, Giuseppe Fiameni, Andrea Pilzer, Andrew M. Hopkins, Daniel Magro, Simone Riggi, Eva Sciacca, Adriano Ingallinera, Cristobal Bordiu, Filomena Bufano, Concetto Spampinato
Along with the nearing completion of the Square Kilometre Array (SKA), comes
an increasing demand for accurate and reliable automated solutions to extract
valuable information from the vast amount of data it will acquire.
Automated source finding is a particularly important task in this context, as
it enables the detection and classification of astronomical objects.
Deep-learning-based object detection and semantic segmentation models have
proven to be suitable for this purpose. However, training such deep networks
requires a high volume of labeled data, which is not trivial to obtain in the
context of radio astronomy. Since data needs to be manually labeled by experts,
this process is not scalable to large dataset sizes, limiting the possibilities
of leveraging deep networks to address several tasks. In this work, we propose
RADiff, a generative approach based on conditional diffusion models trained
over an annotated radio dataset to generate synthetic images, containing radio
sources of different morphologies, to augment existing datasets and reduce the
problems caused by class imbalances. We also show that it is possible to
generate fully-synthetic image-annotation pairs to automatically augment any
annotated dataset. We evaluate the effectiveness of this approach by training a
semantic segmentation model on a real dataset augmented in two ways: 1) using
synthetic images obtained from real masks, and 2) generating images from
synthetic semantic masks. We show an improvement in performance when applying
augmentation, gaining up to 18% in performance when using real masks and 4%
when augmenting with synthetic masks. Finally, we employ this model to generate
large-scale radio maps with the objective of simulating Data Challenges.
SVDM: Single-View Diffusion Model for Pseudo-Stereo 3D Object Detection
July 05, 2023
Yuguang Shi
One of the key problems in 3D object detection is to reduce the accuracy gap
between methods based on LiDAR sensors and those based on monocular cameras. A
recently proposed framework for monocular 3D detection based on Pseudo-Stereo
has received considerable attention in the community. However, several
problems have been identified in existing practice: (1) the monocular depth
estimator and the Pseudo-Stereo detector must be trained separately, (2) the
framework is difficult to make compatible with different stereo detectors, and
(3) the overall computation is heavy, which slows inference. In this work, we
propose an end-to-end, efficient pseudo-stereo 3D detection framework by
introducing a Single-View Diffusion Model (SVDM) that uses a few iterations to
gradually deliver right informative pixels to the left image. SVDM allows the
entire pseudo-stereo 3D detection pipeline to be trained end-to-end and can
benefit from the training of stereo detectors. Afterwards, we further explore
the application of SVDM in depth-free stereo 3D detection, and the final
framework is compatible with most stereo detectors. Among multiple benchmarks
on the KITTI dataset, we achieve new state-of-the-art performance.
Diffusion Models for Computational Design at the Example of Floor Plans
July 05, 2023
Joern Ploennigs, Markus Berger
AI image generators based on diffusion models have recently been widely
discussed for their capability to create images from simple text prompts.
However, for practical use in civil engineering they need to be able to create
specific construction plans under given constraints. Within this paper we
explore the capabilities of these diffusion-based AI generators for
computational design using the example of floor plans and identify their
current limitations. We explain how the diffusion models work and propose new
diffusion models with improved semantic encoding. In several experiments we
show that we can improve the validity of generated floor plans from 6% to 90%,
as well as query performance, for different examples. We identify shortcomings
and derive future research challenges for
those models and discuss the need to combine diffusion models with building
information modelling. With this we provide key insights into the current state
and future directions for diffusion models in civil engineering.
DiffFlow: A Unified SDE Framework for Score-Based Diffusion Models and Generative Adversarial Networks
July 05, 2023
Jingwei Zhang, Han Shi, Jincheng Yu, Enze Xie, Zhenguo Li
stat.ML, cs.CV, cs.LG, math.AP
Generative models can be categorized into two types: explicit generative
models that define explicit density forms and allow exact likelihood inference,
such as score-based diffusion models (SDMs) and normalizing flows; implicit
generative models that directly learn a transformation from the prior to the
data distribution, such as generative adversarial nets (GANs). While these two
types of models have shown great success, they suffer from respective
limitations that hinder them from achieving fast sampling and high sample
quality simultaneously. In this paper, we propose a unified theoretic framework
for SDMs and GANs. We show that: i) the learning dynamics of both SDMs and
GANs can be described as a novel SDE named Discriminator Denoising Diffusion
Flow (DiffFlow) where the drift can be determined by some weighted combinations
of scores of the real data and the generated data; ii) by adjusting the
relative weights between different score terms, we can obtain a smooth
transition between SDMs and GANs while the marginal distribution of the SDE
remains invariant to the change of the weights; iii) we prove the asymptotic
optimality and the maximum likelihood training scheme of the DiffFlow dynamics; iv)
under our unified theoretic framework, we introduce several instantiations of
the DiffFlow that provide new algorithms beyond GANs and SDMs with exact
likelihood inference and have the potential to achieve a flexible trade-off
between high sample quality and fast sampling speed.
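To make the drift idea concrete, here is a minimal numerical sketch (not the paper's exact SDE): an Euler-Maruyama step whose drift mixes a score of the real-data distribution with a score of the currently generated distribution. The 1D Gaussian targets, weights, and step sizes are purely illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def score_gauss(x, mu, sigma):
    """Score d/dx log N(mu, sigma^2) evaluated at x."""
    return -(x - mu) / sigma**2

def diffflow_step(x, w_real, w_gen, dt=1e-2):
    # Illustrative drift: a weighted combination of the real-data score and the
    # score of the particles' own (generated) distribution.
    drift = w_real * score_gauss(x, mu=2.0, sigma=1.0) \
          + w_gen * score_gauss(x, mu=x.mean(), sigma=x.std() + 1e-6)
    return x + drift * dt + np.sqrt(dt) * rng.standard_normal(x.shape)

x = rng.standard_normal(1000)        # "generated" particles, initialized at N(0, 1)
for _ in range(2000):
    x = diffflow_step(x, w_real=1.0, w_gen=-0.5)
print(x.mean())                      # the particle mean drifts toward the real-data mean 2.0
```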
Prompting Diffusion Representations for Cross-Domain Semantic Segmentation
July 05, 2023
Rui Gong, Martin Danelljan, Han Sun, Julio Delgado Mangas, Luc Van Gool
While originally designed for image generation, diffusion models have
recently shown to provide excellent pretrained feature representations for
semantic segmentation. Intrigued by this result, we set out to explore how well
diffusion-pretrained representations generalize to new domains, a crucial
ability for any representation. We find that diffusion-pretraining achieves
extraordinary domain generalization results for semantic segmentation,
outperforming both supervised and self-supervised backbone networks. Motivated
by this, we investigate how to utilize the model’s unique ability of taking an
input prompt, in order to further enhance its cross-domain performance. We
introduce a scene prompt and a prompt randomization strategy to help further
disentangle the domain-invariant information when training the segmentation
head. Moreover, we propose a simple but highly effective approach for test-time
domain adaptation, based on learning a scene prompt on the target domain in an
unsupervised manner. Extensive experiments conducted on four synthetic-to-real
and clear-to-adverse weather benchmarks demonstrate the effectiveness of our
approaches. Without resorting to any complex techniques, such as image
translation, augmentation, or rare-class sampling, we set a new
state-of-the-art on all benchmarks. Our implementation will be publicly
available at \url{https://github.com/ETHRuiGong/PTDiffSeg}.
Monte Carlo Sampling without Isoperimetry: A Reverse Diffusion Approach
July 05, 2023
Xunpeng Huang, Hanze Dong, Yifan Hao, Yian Ma, Tong Zhang
The efficacy of modern generative models is commonly contingent upon the
precision of score estimation along the diffusion path, with a focus on
diffusion models and their ability to generate high-quality data samples. This
study delves into the application of reverse diffusion to Monte Carlo sampling.
It is shown that score estimation can be transformed into a mean estimation
problem via the decomposition of the transition kernel. By estimating the mean
of the posterior distribution, we derive a novel Monte Carlo sampling algorithm
from the reverse diffusion process, which is distinct from traditional Markov
Chain Monte Carlo (MCMC) methods. We calculate the error requirements and
sample size for the posterior distribution, and use the result to derive an
algorithm that can approximate the target distribution to any desired accuracy.
Additionally, by estimating the log-Sobolev constant of the posterior
distribution, we show under suitable conditions the problem of sampling from
the posterior can be easier than direct sampling from the target distribution
using traditional MCMC techniques. For Gaussian mixture models, we demonstrate
that the new algorithm achieves significant improvement over the traditional
Langevin-style MCMC sampling methods both theoretically and practically. Our
algorithm offers a new perspective and solution beyond classical MCMC
algorithms for challenging complex distributions.
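As a rough illustration of the reverse-diffusion sampling idea, the sketch below estimates the score of the diffused density through the posterior mean E[x0 | x_t], using self-normalized importance sampling with a Gaussian proposal; the toy target, time point, and sample count are assumptions, not the paper's algorithm or guarantees.
```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Negative log-density (up to a constant) of a toy 2D target p ∝ exp(-f)."""
    return 0.5 * np.sum(x**2, axis=-1) - np.log(np.cosh(3.0 * x[..., 0]))

def score_via_posterior_mean(x_t, t, n_samples=4096):
    # Forward (OU) noising: x_t = exp(-t) * x_0 + sqrt(1 - exp(-2t)) * z.
    a, var = np.exp(-t), 1.0 - np.exp(-2.0 * t)
    # The Gaussian factor of p(x_0 | x_t), viewed as a function of x_0, is the proposal.
    prop_mean, prop_var = x_t / a, var / a**2
    x0 = prop_mean + np.sqrt(prop_var) * rng.standard_normal((n_samples, x_t.size))
    logw = -f(x0)                                 # self-normalized importance weights ∝ exp(-f)
    w = np.exp(logw - logw.max())
    m = (w[:, None] * x0).sum(0) / w.sum()        # ≈ E[x_0 | x_t]
    return (a * m - x_t) / var                    # score of the diffused density at x_t

print(score_via_posterior_mean(np.array([0.5, -0.2]), t=0.3))
```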
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
July 04, 2023
Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, Robin Rombach
We present SDXL, a latent diffusion model for text-to-image synthesis.
Compared to previous versions of Stable Diffusion, SDXL leverages a three times
larger UNet backbone: The increase of model parameters is mainly due to more
attention blocks and a larger cross-attention context as SDXL uses a second
text encoder. We design multiple novel conditioning schemes and train SDXL on
multiple aspect ratios. We also introduce a refinement model which is used to
improve the visual fidelity of samples generated by SDXL using a post-hoc
image-to-image technique. We demonstrate that SDXL shows drastically improved
performance compared to previous versions of Stable Diffusion and achieves
results competitive with those of black-box state-of-the-art image generators.
In the spirit of promoting open research and fostering transparency in large
model training and evaluation, we provide access to code and model weights at
https://github.com/Stability-AI/generative-models
ProtoDiffusion: Classifier-Free Diffusion Guidance with Prototype Learning
July 04, 2023
Gulcin Baykal, Halil Faruk Karagoz, Taha Binhuraib, Gozde Unal
Diffusion models are generative models that have shown significant advantages
compared to other generative models in terms of higher generation quality and
more stable training. However, the computational cost of training diffusion
models is considerably higher. In this work, we incorporate prototype
learning into diffusion models to achieve high generation quality faster than
the original diffusion model. Instead of randomly initialized class embeddings,
we use separately learned class prototypes as the conditioning information to
guide the diffusion process. We observe that our method, called ProtoDiffusion,
achieves better performance in the early stages of training compared to the
baseline method, signifying that using the learned prototypes shortens the
training time. We demonstrate the performance of ProtoDiffusion using various
datasets and experimental settings, achieving the best performance in shorter
times across all settings.
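A minimal sketch of the conditioning idea, assuming class prototypes have already been learned by a separate encoder (the random tensor below is only a stand-in); it shows how prototypes could replace a randomly initialized class-embedding table when conditioning a denoiser.
```python
import torch
import torch.nn as nn

num_classes, cond_dim = 10, 128

# Hypothetical prototypes: e.g., per-class mean features from a separately
# trained encoder. The random tensor is a placeholder for those values.
class_prototypes = torch.randn(num_classes, cond_dim)

class PrototypeCondition(nn.Module):
    """Maps class labels to conditioning vectors via learned prototypes."""
    def __init__(self, prototypes, trainable=False):
        super().__init__()
        self.proto = nn.Parameter(prototypes.clone(), requires_grad=trainable)
        self.proj = nn.Linear(prototypes.shape[1], prototypes.shape[1])

    def forward(self, labels):                  # labels: (batch,)
        return self.proj(self.proto[labels])

cond = PrototypeCondition(class_prototypes)
labels = torch.randint(0, num_classes, (4,))
c = cond(labels)                                # pass c to the denoiser, e.g. eps_theta(x_t, t, c)
print(c.shape)                                  # torch.Size([4, 128])
```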
Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
July 04, 2023
Xiang Li, Varun Belagali, Jinghuan Shang, Michael S. Ryoo
Sequence modeling approaches have shown promising results in robot imitation
learning. Recently, diffusion models have been adopted for behavioral cloning
in a sequence modeling fashion, benefiting from their exceptional capabilities
in modeling complex data distributions. The standard diffusion-based policy
iteratively generates action sequences from random noise conditioned on the
input states. Nonetheless, the model for diffusion policy can be further
improved in terms of visual representations. In this work, we propose Crossway
Diffusion, a simple yet effective method to enhance diffusion-based visuomotor
policy learning via a carefully designed state decoder and an auxiliary
self-supervised learning (SSL) objective. The state decoder reconstructs raw
image pixels and other state information from the intermediate representations
of the reverse diffusion process. The whole model is jointly optimized by the
SSL objective and the original diffusion loss. Our experiments demonstrate the
effectiveness of Crossway Diffusion in various simulated and real-world robot
tasks, confirming its consistent advantages over the standard diffusion-based
policy and substantial improvements over the baselines.
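The training objective can be pictured as a standard noise-prediction loss plus an auxiliary reconstruction term from a state decoder attached to intermediate features. The toy policy, noising schedule, and loss weight below are illustrative assumptions, not the authors' architecture.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDiffusionPolicy(nn.Module):
    """Toy stand-in: a shared encoder, a noise-prediction head, and a state decoder."""
    def __init__(self, obs_dim=32, act_dim=8, hid=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, hid), nn.ReLU())
        self.eps_head = nn.Linear(hid, act_dim)        # predicts the injected noise
        self.state_decoder = nn.Linear(hid, obs_dim)   # auxiliary SSL reconstruction

    def forward(self, obs, noisy_act, t):
        h = self.enc(torch.cat([obs, noisy_act, t], dim=-1))
        return self.eps_head(h), self.state_decoder(h)

model = TinyDiffusionPolicy()
obs, act = torch.randn(16, 32), torch.randn(16, 8)
t = torch.rand(16, 1)
eps = torch.randn_like(act)
noisy_act = torch.sqrt(1 - t) * act + torch.sqrt(t) * eps          # toy noising schedule

eps_hat, obs_hat = model(obs, noisy_act, t)
loss = F.mse_loss(eps_hat, eps) + 0.1 * F.mse_loss(obs_hat, obs)   # diffusion loss + SSL term
loss.backward()
print(float(loss))
```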
Collaborative Score Distillation for Consistent Visual Synthesis
July 04, 2023
Subin Kim, Kyungmin Lee, June Suk Choi, Jongheon Jeong, Kihyuk Sohn, Jinwoo Shin
Generative priors of large-scale text-to-image diffusion models enable a wide
range of new generation and editing applications on diverse visual modalities.
However, when adapting these priors to complex visual modalities, often
represented as multiple images (e.g., video), achieving consistency across a
set of images is challenging. In this paper, we address this challenge with a
novel method, Collaborative Score Distillation (CSD). CSD is based on the Stein
Variational Gradient Descent (SVGD). Specifically, we propose to consider
multiple samples as “particles” in the SVGD update and combine their score
functions to distill generative priors over a set of images synchronously.
Thus, CSD facilitates seamless integration of information across 2D images,
leading to a consistent visual synthesis across multiple samples. We show the
effectiveness of CSD in a variety of tasks, encompassing the visual editing of
panorama images, videos, and 3D scenes. Our results underline the competency of
CSD as a versatile method for enhancing inter-sample consistency, thereby
broadening the applicability of text-to-image diffusion models.
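Since CSD builds on SVGD, a bare-bones SVGD sketch may help: multiple samples act as "particles" whose score terms are combined through a kernel. The analytic Gaussian score below merely stands in for the distilled text-to-image prior.
```python
import numpy as np

rng = np.random.default_rng(0)

def score(x):
    """Score of a toy 2D target N(0, I); in CSD this would come from the diffusion prior."""
    return -x

def rbf_kernel(x):
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)      # pairwise squared distances
    h = np.median(d2) / np.log(len(x) + 1.0) + 1e-8            # median-heuristic bandwidth
    k = np.exp(-d2 / h)
    grad_k = -2.0 / h * (x[:, None, :] - x[None, :, :]) * k[:, :, None]   # d k(x_j, x_i)/d x_j
    return k, grad_k

def svgd_step(x, step=0.3):
    k, grad_k = rbf_kernel(x)
    # phi(x_i) = mean_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
    phi = (k[:, :, None] * score(x)[:, None, :] + grad_k).mean(axis=0)
    return x + step * phi

particles = rng.normal(3.0, 0.5, size=(64, 2))          # samples as particles, started off-target
for _ in range(1000):
    particles = svgd_step(particles)
print(particles.mean(axis=0), particles.std(axis=0))     # drifts toward N(0, I)
```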
DiT-3D: Exploring Plain Diffusion Transformers for 3D Shape Generation
July 04, 2023
Shentong Mo, Enze Xie, Ruihang Chu, Lewei Yao, Lanqing Hong, Matthias Nießner, Zhenguo Li
Recent Diffusion Transformers (e.g., DiT) have demonstrated their powerful
effectiveness in generating high-quality 2D images. However, it remains unclear
whether the Transformer architecture performs equally well in 3D
shape generation, as previous 3D diffusion methods mostly adopted the U-Net
architecture. To bridge this gap, we propose a novel Diffusion Transformer for
3D shape generation, namely DiT-3D, which can directly operate the denoising
process on voxelized point clouds using plain Transformers. Compared to
existing U-Net approaches, our DiT-3D is more scalable in model size and
produces much higher quality generations. Specifically, the DiT-3D adopts the
design philosophy of DiT but modifies it by incorporating 3D positional and
patch embeddings to adaptively aggregate input from voxelized point clouds. To
reduce the computational cost of self-attention in 3D shape generation, we
incorporate 3D window attention into Transformer blocks, as the increased 3D
token length resulting from the additional dimension of voxels can lead to high
computation. Finally, linear and devoxelization layers are used to predict the
denoised point clouds. In addition, our transformer architecture supports
efficient fine-tuning from 2D to 3D, where the pre-trained DiT-2D checkpoint on
ImageNet can significantly improve DiT-3D on ShapeNet. Experimental results on
the ShapeNet dataset demonstrate that the proposed DiT-3D achieves
state-of-the-art performance in high-fidelity and diverse 3D point cloud
generation. In particular, our DiT-3D decreases the 1-Nearest Neighbor Accuracy
of the state-of-the-art method by 4.59 and increases the Coverage metric by
3.51 when evaluated on Chamfer Distance.
Synchronous Image-Label Diffusion Probability Model with Application to Stroke Lesion Segmentation on Non-contrast CT
July 04, 2023
Jianhai Zhang, Tonghua Wan, Ethan MacDonald, Bijoy Menon, Aravind Ganesh, Qiu Wu
Stroke lesion volume is a key radiologic measurement for assessing the
prognosis of Acute Ischemic Stroke (AIS) patients, but it is challenging to
measure automatically on Non-Contrast CT (NCCT) scans. Recent diffusion
probabilistic models have shown potential for image segmentation. In this
paper, a novel Synchronous image-label Diffusion Probability Model (SDPM) is
proposed for stroke lesion segmentation on NCCT using a Markov diffusion
process. The proposed SDPM is fully based on a Latent
Variable Model (LVM), offering a complete probabilistic elaboration. An
additional net-stream, parallel with a noise prediction stream, is introduced
to obtain initial noisy label estimates for efficiently inferring the final
labels. By optimizing the specified variational boundaries, the trained model
can infer multiple label estimates for reference given noisy input images. The
proposed model was assessed on three stroke lesion datasets
including one public and two private datasets. Compared to several U-net and
transformer-based segmentation methods, our proposed SDPM model is able to
achieve state-of-the-art performance. The code is publicly available.
Training Energy-Based Models with Diffusion Contrastive Divergences
July 04, 2023
Weijian Luo, Hao Jiang, Tianyang Hu, Jiacheng Sun, Zhenguo Li, Zhihua Zhang
Energy-Based Models (EBMs) have been widely used for generative modeling.
Contrastive Divergence (CD), a prevailing training objective for EBMs, requires
sampling from the EBM with Markov Chain Monte Carlo methods (MCMCs), which
leads to an irreconcilable trade-off between the computational burden and the
validity of the CD. Running MCMCs till convergence is computationally
intensive. On the other hand, short-run MCMC brings in an extra non-negligible
parameter gradient term that is difficult to handle. In this paper, we provide
a general interpretation of CD, viewing it as a special instance of our
proposed Diffusion Contrastive Divergence (DCD) family. By replacing the
Langevin dynamic used in CD with other EBM-parameter-free diffusion processes,
we propose a more efficient divergence. We show that the proposed DCDs are both
more computationally efficient than CD and free of its non-negligible gradient
term. We conduct intensive experiments, including both synthetic data modeling
and high-dimensional image denoising and generation, to
show the advantages of the proposed DCDs. On the synthetic data learning and
image denoising experiments, our proposed DCD outperforms CD by a large margin.
In image generation experiments, the proposed DCD is capable of training an
energy-based model for generating the CelebA $32\times 32$ dataset, which is
comparable to existing EBMs.
Improved sampling via learned diffusions
July 03, 2023
Lorenz Richter, Julius Berner, Guan-Horng Liu
cs.LG, math.OC, math.PR, stat.ML
Recently, a series of papers proposed deep learning-based approaches to
sample from unnormalized target densities using controlled diffusion processes.
In this work, we identify these approaches as special cases of the
Schrödinger bridge problem, seeking the most likely stochastic evolution
between a given prior distribution and the specified target. We further
generalize this framework by introducing a variational formulation based on
divergences between path space measures of time-reversed diffusion processes.
This abstract perspective leads to practical losses that can be optimized by
gradient-based algorithms and includes previous objectives as special cases. At
the same time, it allows us to consider divergences other than the reverse
Kullback-Leibler divergence that is known to suffer from mode collapse. In
particular, we propose the so-called log-variance loss, which exhibits
favorable numerical properties and leads to significantly improved performance
across all considered approaches.
Squeezing Large-Scale Diffusion Models for Mobile
July 03, 2023
Jiwoong Choi, Minkyu Kim, Daehyun Ahn, Taesu Kim, Yulhwa Kim, Dongwon Jo, Hyesung Jeon, Jae-Joon Kim, Hyungjun Kim
The emergence of diffusion models has greatly broadened the scope of
high-fidelity image synthesis, resulting in notable advancements in both
practical implementation and academic research. With the active adoption of the
model in various real-world applications, the need for on-device deployment has
grown considerably. However, deploying large diffusion models such as Stable
Diffusion with more than one billion parameters to mobile devices poses
distinctive challenges due to the limited computational and memory resources,
which may vary according to the device. In this paper, we present the
challenges and solutions for deploying Stable Diffusion on mobile devices with
the TensorFlow Lite framework, which supports both iOS and Android devices. The
resulting Mobile Stable Diffusion achieves an inference latency of less than 7
seconds for 512x512 image generation on Android devices with mobile
GPUs.
Learning Mixtures of Gaussians Using the DDPM Objective
July 03, 2023
Kulin Shah, Sitan Chen, Adam Klivans
Recent works have shown that diffusion models can learn essentially any
distribution provided one can perform score estimation. Yet it remains poorly
understood under what settings score estimation is possible, let alone when
practical gradient-based algorithms for this task can provably succeed.
In this work, we give the first provably efficient results along these lines
for one of the most fundamental distribution families, Gaussian mixture models.
We prove that gradient descent on the denoising diffusion probabilistic model
(DDPM) objective can efficiently recover the ground truth parameters of the
mixture model in the following two settings: 1) We show gradient descent with
random initialization learns mixtures of two spherical Gaussians in $d$
dimensions with $1/\text{poly}(d)$-separated centers. 2) We show gradient
descent with a warm start learns mixtures of $K$ spherical Gaussians with
$\Omega(\sqrt{\log(\min(K,d))})$-separated centers. A key ingredient in our
proofs is a new connection between score-based methods and two other approaches
to distribution learning, the EM algorithm and spectral methods.
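A small worked example in the spirit of the first setting (one fixed noise level, with the Bayes-optimal denoiser of the symmetric two-Gaussian mixture written in closed form and the mean as the unknown parameter): gradient-based training of the DDPM objective recovers the mixture mean up to sign. Scales, learning rate, and step count are arbitrary choices, not the paper's analysis.
```python
import torch

torch.manual_seed(0)
d, n = 8, 4096
mu_true = 2.0 * torch.randn(d) / d**0.5
sign = torch.randint(0, 2, (n, 1)).float() * 2 - 1
x0 = sign * mu_true + torch.randn(n, d)              # 0.5 N(mu, I) + 0.5 N(-mu, I)

a = torch.tensor(0.8)                                # sqrt(alpha_bar) at one fixed noise level
s2 = 1.0 - a**2
mu = (0.1 * torch.randn(d)).requires_grad_(True)     # near-random initialization
opt = torch.optim.Adam([mu], lr=0.05)

for step in range(1000):
    eps = torch.randn_like(x0)
    xt = a * x0 + torch.sqrt(s2) * eps
    # Bayes-optimal denoiser E[x0 | xt] for this family, with mu as the unknown parameter.
    x0_hat = a * xt + s2 * torch.tanh(a * (xt @ mu))[:, None] * mu
    loss = ((x0_hat - x0) ** 2).sum(-1).mean()       # DDPM-style denoising objective
    opt.zero_grad(); loss.backward(); opt.step()

# Recovered up to the inherent sign ambiguity of the symmetric mixture.
print(float(torch.minimum((mu - mu_true).norm(), (mu + mu_true).norm())))
```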
Transport, Variational Inference and Diffusions: with Applications to Annealed Flows and Schrödinger Bridges
July 03, 2023
Francisco Vargas, Shreyas Padhy, Denis Blessing, Nikolas Nüsken
Connecting optimal transport and variational inference, we present a
principled and systematic framework for sampling and generative modelling
centred around divergences on path space. Our work culminates in the
development of the \emph{Controlled Monte Carlo Diffusion} sampler (CMCD) for
Bayesian computation, a score-based annealing technique that crucially adapts
both forward and backward dynamics in a diffusion model. On the way, we clarify
the relationship between the EM-algorithm and iterative proportional fitting
(IPF) for Schrödinger bridges, deriving as well a regularised objective
that bypasses the iterative bottleneck of standard IPF-updates. Finally, we
show that CMCD has a strong foundation in the Jarzynski and Crooks identities
from statistical physics, and that it convincingly outperforms competing
approaches across a wide array of experiments.
DifFSS: Diffusion Model for Few-Shot Semantic Segmentation
July 03, 2023
Weimin Tan, Siyuan Chen, Bo Yan
Diffusion models have demonstrated excellent performance in image generation.
Although various few-shot semantic segmentation (FSS) models with different
network structures have been proposed, performance improvement has reached a
bottleneck. This paper presents the first work to leverage the diffusion model
for the FSS task, called DifFSS. DifFSS, a novel FSS paradigm, can further improve
the performance of the state-of-the-art FSS models by a large margin without
modifying their network structure. Specifically, we utilize the powerful
generation ability of diffusion models to generate diverse auxiliary support
images by using the semantic mask, scribble or soft HED boundary of the support
image as control conditions. This generation process simulates the variety
within the class of the query image, such as color, texture variation,
lighting, etc. As a result, FSS models can refer to more diverse support
images, yielding more robust representations, thereby achieving a consistent
improvement in segmentation performance. Extensive experiments on three
publicly available datasets based on existing advanced FSS models demonstrate
the effectiveness of the diffusion model for the FSS task. Furthermore, we explore
in detail the impact of different input settings of the diffusion model on
segmentation performance. We hope this completely new paradigm will bring
inspiration to the study of the FSS task integrated with AI-generated content. Code
is available at https://github.com/TrinitialChan/DifFSS
ImDiffusion: Imputed Diffusion Models for Multivariate Time Series Anomaly Detection
July 03, 2023
Yuhang Chen, Chaoyun Zhang, Minghua Ma, Yudong Liu, Ruomeng Ding, Bowen Li, Shilin He, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang
Anomaly detection in multivariate time series data is of paramount importance
for ensuring the efficient operation of large-scale systems across diverse
domains. However, accurately detecting anomalies in such data poses significant
challenges. Existing approaches, including forecasting and reconstruction-based
methods, struggle to address these challenges effectively. To overcome these
limitations, we propose a novel anomaly detection framework named ImDiffusion,
which combines time series imputation and diffusion models to achieve accurate
and robust anomaly detection. The imputation-based approach employed by
ImDiffusion leverages the information from neighboring values in the time
series, enabling precise modeling of temporal and inter-correlated
dependencies, reducing uncertainty in the data, thereby enhancing the
robustness of the anomaly detection process. ImDiffusion further leverages
diffusion models as time series imputers to accurately capture complex
dependencies. We leverage the step-by-step denoised outputs generated during
the inference process to serve as valuable signals for anomaly prediction,
resulting in improved accuracy and robustness of the detection process.
We evaluate the performance of ImDiffusion via extensive experiments on
benchmark datasets. The results demonstrate that our proposed framework
significantly outperforms state-of-the-art approaches in terms of detection
accuracy and timeliness. ImDiffusion has further been integrated into a real
production system at Microsoft, where we observe a remarkable 11.4% increase in
detection F1 score compared to the legacy approach. To the best of our
knowledge, ImDiffusion represents a pioneering approach that combines
imputation-based techniques with time series anomaly detection, while
introducing the novel use of diffusion models to the field.
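The imputation-based scoring idea can be sketched as: hide entries, impute them, and flag timesteps whose imputations disagree with the observations. The linear-interpolation imputer below is a stand-in for the diffusion imputer, and the masking ratios and round counts are assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)

def toy_imputer(series, mask):
    """Stand-in for a diffusion-based imputer: linear interpolation per channel."""
    out = series.copy()
    t = np.arange(len(series))
    for c in range(series.shape[1]):
        known = ~mask[:, c]
        out[~known, c] = np.interp(t[~known], t[known], series[known, c])
    return out

def anomaly_scores(series, n_rounds=4, mask_ratio=0.5):
    scores = np.zeros(series.shape[0])
    counts = np.zeros(series.shape[0])
    for _ in range(n_rounds):
        mask = rng.random(series.shape) < mask_ratio        # True = hidden from the imputer
        imputed = toy_imputer(series, mask)
        err = np.abs(imputed - series).mean(axis=1)         # per-timestep imputation error
        hit = mask.any(axis=1)
        scores[hit] += err[hit]; counts[hit] += 1
    return scores / np.maximum(counts, 1)

# Smooth multivariate series with an injected anomaly at t = 150.
t = np.linspace(0, 8 * np.pi, 300)
x = np.stack([np.sin(t), np.cos(t)], axis=1) + 0.05 * rng.standard_normal((300, 2))
x[150] += 3.0
print(int(np.argmax(anomaly_scores(x))))                    # ~150
```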
Variational Autoencoding Molecular Graphs with Denoising Diffusion Probabilistic Model
July 02, 2023
Daiki Koge, Naoaki Ono, Shigehiko Kanaya
In data-driven drug discovery, designing molecular descriptors is a very
important task. Deep generative models such as variational autoencoders (VAEs)
offer a potential solution by designing descriptors as probabilistic latent
vectors derived from molecular structures. These models can be trained on large
datasets, which have only molecular structures, and applied to transfer
learning. Nevertheless, the approximate posterior distribution of the latent
vectors of the usual VAE assumes a simple multivariate Gaussian distribution
with zero covariance, which may limit the performance of representing the
latent features. To overcome this limitation, we propose a novel molecular deep
generative model that incorporates a hierarchical structure into the
probabilistic latent vectors. We achieve this by a denoising diffusion
probabilistic model (DDPM). Through experiments on small datasets of physical
properties and activity, we demonstrate that our model can design effective
molecular latent vectors for molecular property prediction. The results
highlight the superior prediction performance and robustness of our model
compared to existing approaches.
LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance
July 02, 2023
Linoy Tsaban, Apolinário Passos
Recent large-scale text-guided diffusion models provide powerful
image-generation capabilities. Currently, significant effort is devoted to
enabling the modification of these images using text only, as a means to offer
intuitive and versatile editing. However, editing proves to be difficult for
these generative models due to the inherent nature of editing techniques, which
involves preserving certain content from the original image. Conversely, in
text-based models, even minor modifications to the text prompt frequently
lead to an entirely distinct result, making it exceedingly challenging to attain
one-shot generation that accurately corresponds to the user’s intent. In
addition, to edit a real image using these state-of-the-art tools, one must
first invert the image into the pre-trained model’s domain - adding another
factor affecting the edit quality, as well as latency. In this exploratory
report, we propose LEDITS - a combined lightweight approach for real-image
editing, incorporating the Edit Friendly DDPM inversion technique with Semantic
Guidance, thus extending Semantic Guidance to real image editing, while
harnessing the editing capabilities of DDPM inversion as well. This approach
achieves versatile edits, both subtle and extensive as well as alterations in
composition and style, while requiring no optimization nor extensions to the
architecture.
MissDiff: Training Diffusion Models on Tabular Data with Missing Values
July 02, 2023
Yidong Ouyang, Liyan Xie, Chongxuan Li, Guang Cheng
The diffusion model has shown remarkable performance in modeling data
distributions and synthesizing data. However, the vanilla diffusion model
requires complete or fully observed data for training. Incomplete data is a
common issue in various real-world applications, including healthcare and
finance, particularly when dealing with tabular datasets. This work presents a
unified and principled diffusion-based framework for learning from data with
missing values under various missing mechanisms. We first observe that the
widely adopted “impute-then-generate” pipeline may lead to a biased learning
objective. Then we propose to mask the regression loss of Denoising Score
Matching in the training phase. We prove the proposed method is consistent in
learning the score of data distributions, and the proposed training objective
serves as an upper bound for the negative likelihood in certain cases. The
proposed framework is evaluated on multiple tabular datasets using realistic
and efficacious metrics and is demonstrated to outperform the state-of-the-art
diffusion model on tabular data with the “impute-then-generate” pipeline by a
large margin.
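The masked-loss idea reduces to evaluating the denoising regression only on observed entries, so missing values never contribute a biased target. A minimal sketch with a toy noising schedule and a zero denoiser as a stand-in:
```python
import torch

def masked_dsm_loss(eps_hat, eps, observed_mask):
    """observed_mask: 1 where the tabular entry is observed, 0 where it is missing."""
    se = (eps_hat - eps) ** 2 * observed_mask
    return se.sum() / observed_mask.sum().clamp(min=1)

x0 = torch.randn(32, 10)                         # a batch of tabular rows
observed = (torch.rand_like(x0) > 0.3).float()   # ~30% of the entries missing
t = torch.rand(32, 1)
eps = torch.randn_like(x0)
xt = torch.sqrt(1 - t) * (x0 * observed) + torch.sqrt(t) * eps   # toy noising of observed data

eps_hat = torch.zeros_like(xt)                   # stand-in for a denoiser eps_theta(xt, t)
print(float(masked_dsm_loss(eps_hat, eps, observed)))
```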
Practical and Asymptotically Exact Conditional Sampling in Diffusion Models
June 30, 2023
Luhuan Wu, Brian L. Trippe, Christian A. Naesseth, David M. Blei, John P. Cunningham
Diffusion models have been successful on a range of conditional generation
tasks including molecular design and text-to-image generation. However, these
achievements have primarily depended on task-specific conditional training or
error-prone heuristic approximations. Ideally, a conditional generation method
should provide exact samples for a broad range of conditional distributions
without requiring task-specific training. To this end, we introduce the Twisted
Diffusion Sampler, or TDS. TDS is a sequential Monte Carlo (SMC) algorithm that
targets the conditional distributions of diffusion models. The main idea is to
use twisting, an SMC technique that enjoys good computational efficiency, to
incorporate heuristic approximations without compromising asymptotic exactness.
We first find in simulation and on MNIST image inpainting and class-conditional
generation tasks that TDS provides a computational-statistical trade-off:
it yields more accurate approximations with many particles, yet already improves
empirically over heuristics with as few as two particles. We then turn to
motif-scaffolding, a core task in protein design, using a TDS extension to
Riemannian diffusion models. On benchmark test cases, TDS allows flexible
conditioning criteria and often outperforms the state of the art.
Counting Guidance for High Fidelity Text-to-Image Synthesis
June 30, 2023
Wonjun Kang, Kevin Galim, Hyung Il Koo
Recently, the quality and performance of text-to-image generation
significantly advanced due to the impressive results of diffusion models.
However, text-to-image diffusion models still fail to generate high fidelity
content with respect to the input prompt. One problem where text-to-image
diffusion models struggle is generating the exact number of objects specified in
the text prompt. For example, given the prompt “five apples and ten lemons on a table”,
diffusion-generated images usually contain the wrong number of objects. In this
paper, we propose a method to improve diffusion models to focus on producing
the correct object count given the input prompt. We adopt a counting network
that performs reference-less class-agnostic counting for any given image. We
calculate the gradients of the counting network and refine the predicted noise
for each step. To handle multiple types of objects in the prompt, we use novel
attention map guidance to obtain high-fidelity masks for each object. Finally,
we guide the denoising process by the calculated gradients for each object.
Through extensive experiments and evaluation, we demonstrate that our proposed
guidance method greatly improves the fidelity of diffusion models to object
count.
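A hedged sketch of what gradient-based counting guidance could look like in a classifier-guidance style: differentiate a counting loss through the predicted clean image and shift the predicted noise. `count_net` is a hypothetical stand-in, and the update rule and scale are illustrative, not the paper's exact formulation.
```python
import torch

def count_net(img):
    """Hypothetical counter: just a differentiable stand-in for a counting network."""
    return img.clamp(0, 1).mean(dim=(1, 2, 3)) * 20.0

def counting_guided_eps(xt, eps_hat, alpha_bar, target_count, scale=5.0):
    xt = xt.detach().requires_grad_(True)
    # Predicted clean image from the current noise estimate (standard DDPM relation).
    x0_hat = (xt - torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_bar)
    loss = ((count_net(x0_hat) - target_count) ** 2).sum()
    grad = torch.autograd.grad(loss, xt)[0]
    # Nudge the noise estimate along the counting-loss gradient.
    return eps_hat + scale * torch.sqrt(1 - alpha_bar) * grad

xt = torch.randn(2, 3, 32, 32)
eps_hat = torch.randn_like(xt)                    # stand-in for eps_theta(xt, t)
alpha_bar = torch.tensor(0.5)
print(counting_guided_eps(xt, eps_hat, alpha_bar, target_count=5.0).shape)
```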
Class-Incremental Learning using Diffusion Model for Distillation and Replay
June 30, 2023
Quentin Jodelet, Xin Liu, Yin Jun Phua, Tsuyoshi Murata
Class-incremental learning aims to learn new classes in an incremental
fashion without forgetting the previously learned ones. Several research works
have shown how additional data can be used by incremental models to help
mitigate catastrophic forgetting. In this work, following the recent
breakthrough in text-to-image generative models and their wide distribution, we
propose the use of a pretrained Stable Diffusion model as a source of
additional data for class-incremental learning. Compared to competitive methods
that rely on external, often unlabeled, datasets of real images, our approach
can generate synthetic samples belonging to the same classes as the previously
encountered images. This allows us to use those additional data samples not
only in the distillation loss but also for replay in the classification loss.
Experiments on the competitive benchmarks CIFAR100, ImageNet-Subset, and
ImageNet demonstrate how this new approach can be used to further improve the
performance of state-of-the-art methods for class-incremental learning on large
scale datasets.
Filtered-Guided Diffusion: Fast Filter Guidance for Black-Box Diffusion Models
June 29, 2023
Zeqi Gu, Abe Davis
Recent advances in diffusion-based generative models have shown incredible
promise for Image-to-Image translation and editing. Most recent work in this
space relies on additional training or architecture-specific adjustments to the
diffusion process. In this work, we show that much of this low-level control
can be achieved without additional training or any access to features of the
diffusion model. Our method simply applies a filter to the input of each
diffusion step based on the output of the previous step in an adaptive manner.
Notably, this approach does not depend on any specific architecture or sampler
and can be done without access to internal features of the network, making it
easy to combine with other techniques, samplers, and diffusion architectures.
Furthermore, it has negligible cost to performance, and allows for more
continuous adjustment of guidance strength than other approaches. We show FGD
offers a fast and strong baseline that is competitive with recent
architecture-dependent approaches. Furthermore, FGD can also be used as a
simple add-on to enhance the structural guidance of other state-of-the-art I2I
methods. Finally, our derivation of this method helps to understand the impact
of self attention, a key component of other recent architecture-specific I2I
approaches, in a more architecture-independent way. Project page:
https://github.com/jaclyngu/FilteredGuidedDiffusion
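One plausible and deliberately simplified instantiation of training-free filter guidance, not necessarily the paper's exact filter: before each denoising step, nudge the low-frequency content of the current sample toward a reference while leaving its high frequencies untouched. Only the noisy sample is modified, so no model internals or retraining are needed.
```python
import numpy as np
from scipy.ndimage import gaussian_filter

def lowpass(img, sigma=4.0):
    return gaussian_filter(img, sigma=(0, sigma, sigma))     # blur H and W, keep channels

def filter_guidance(x_t, reference, strength=0.3, sigma=4.0):
    """x_t, reference: float arrays of shape (C, H, W) on a comparable intensity scale."""
    return x_t + strength * (lowpass(reference, sigma) - lowpass(x_t, sigma))

x_t = np.random.randn(3, 64, 64).astype(np.float32)          # current noisy sample
reference = np.zeros((3, 64, 64), dtype=np.float32)
reference[:, 16:48, 16:48] = 1.0                              # coarse structure to transfer
x_t = filter_guidance(x_t, reference)                         # then run the usual denoising step
print(x_t.shape)
```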
Spiking Denoising Diffusion Probabilistic Models
June 29, 2023
Jiahang Cao, Ziqing Wang, Hanzhong Guo, Hao Cheng, Qiang Zhang, Renjing Xu
Spiking neural networks (SNNs) have ultra-low energy consumption and high
biological plausibility due to their binary and bio-driven nature compared with
artificial neural networks (ANNs). While previous research has primarily
focused on enhancing the performance of SNNs in classification tasks, the
generative potential of SNNs remains relatively unexplored. In our paper, we
put forward Spiking Denoising Diffusion Probabilistic Models (SDDPM), a new
class of SNN-based generative models that achieve high sample quality. To fully
exploit the energy efficiency of SNNs, we propose a purely Spiking U-Net
architecture, which achieves comparable performance to its ANN counterpart
using only 4 time steps, resulting in significantly reduced energy consumption.
Extensive experimental results reveal that our approach achieves
state-of-the-art on the generative tasks and substantially outperforms other
SNN-based generative models, achieving up to 12x and 6x improvement on the
CIFAR-10 and the CelebA datasets, respectively. Moreover, we propose a
threshold-guided strategy that can further improve performance by 2.69% in
a training-free manner. The SDDPM symbolizes a significant advancement in the
field of SNN generation, injecting new perspectives and potential avenues of
exploration. Our code is available at https://github.com/AndyCao1125/SDDPM.
DreamDiffusion: Generating High-Quality Images from Brain EEG Signals
June 29, 2023
Yunpeng Bai, Xintao Wang, Yan-pei Cao, Yixiao Ge, Chun Yuan, Ying Shan
This paper introduces DreamDiffusion, a novel method for generating
high-quality images directly from brain electroencephalogram (EEG) signals,
without the need to translate thoughts into text. DreamDiffusion leverages
pre-trained text-to-image models and employs temporal masked signal modeling to
pre-train the EEG encoder for effective and robust EEG representations.
Additionally, the method further leverages the CLIP image encoder to provide
extra supervision to better align EEG, text, and image embeddings with limited
EEG-image pairs. Overall, the proposed method overcomes the challenges of using
EEG signals for image generation, such as noise, limited information, and
individual differences, and achieves promising results. Quantitative and
qualitative results demonstrate the effectiveness of the proposed method as a
significant step towards portable and low-cost “thoughts-to-image”, with
potential applications in neuroscience and computer vision. The code is
available here \url{https://github.com/bbaaii/DreamDiffusion}.
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
June 29, 2023
Simian Luo, Chuanhao Yan, Chenxu Hu, Hang Zhao
cs.SD, cs.CV, cs.LG, eess.AS
The Video-to-Audio (V2A) model has recently gained attention for its
practical application in generating audio directly from silent videos,
particularly in video/film production. However, previous methods in V2A have
limited generation quality in terms of temporal synchronization and
audio-visual relevance. We present Diff-Foley, a synchronized Video-to-Audio
synthesis method with a latent diffusion model (LDM) that generates
high-quality audio with improved synchronization and audio-visual relevance. We
adopt contrastive audio-visual pretraining (CAVP) to learn more temporally and
semantically aligned features, then train an LDM with CAVP-aligned visual
features on spectrogram latent space. The CAVP-aligned features enable LDM to
capture the subtler audio-visual correlation via a cross-attention module. We
further significantly improve sample quality with “double guidance”. Diff-Foley
achieves state-of-the-art V2A performance on a current large-scale V2A dataset.
Furthermore, we demonstrate Diff-Foley’s practical applicability and
generalization capabilities via downstream finetuning. Project Page: see
https://diff-foley.github.io/
DiffusionSTR: Diffusion Model for Scene Text Recognition
June 29, 2023
Masato Fujitake
This paper presents Diffusion Model for Scene Text Recognition
(DiffusionSTR), an end-to-end text recognition framework using diffusion models
for recognizing text in the wild. While existing studies have viewed the scene
text recognition task as an image-to-text transformation, we rethink it as a
text-to-text transformation conditioned on images within a diffusion model. We
show for the first time
that the diffusion model can be applied to text recognition. Furthermore,
experimental results on publicly available datasets show that the proposed
method achieves competitive accuracy compared to state-of-the-art methods.
Self-Supervised MRI Reconstruction with Unrolled Diffusion Models
June 29, 2023
Yilmaz Korkmaz, Tolga Cukur, Vishal Patel
Magnetic Resonance Imaging (MRI) produces excellent soft tissue contrast,
although it is an inherently slow imaging modality. Promising deep learning
methods have recently been proposed to reconstruct accelerated MRI scans.
However, existing methods still suffer from various limitations regarding image
fidelity, contextual sensitivity, and reliance on fully-sampled acquisitions
for model training. To comprehensively address these limitations, we propose a
novel self-supervised deep reconstruction model, named Self-Supervised
Diffusion Reconstruction (SSDiffRecon). SSDiffRecon expresses a conditional
diffusion process as an unrolled architecture that interleaves cross-attention
transformers for reverse diffusion steps with data-consistency blocks for
physics-driven processing. Unlike recent diffusion methods for MRI
reconstruction, a self-supervision strategy is adopted to train SSDiffRecon
using only undersampled k-space data. Comprehensive experiments on public brain
MR datasets demonstrate the superiority of SSDiffRecon against
state-of-the-art supervised and self-supervised baselines in terms of
reconstruction speed and quality. Implementation will be available at
https://github.com/yilmazkorkmaz1/SSDiffRecon.
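For context, a data-consistency block of the kind such unrolled architectures interleave with network stages can be sketched as replacing the predicted k-space values at sampled locations with the acquired measurements; single-coil Cartesian sampling is assumed, and this is not the authors' exact implementation.
```python
import torch

def data_consistency(image, measured_kspace, sampling_mask):
    """image, measured_kspace: (H, W) complex; sampling_mask: (H, W) bool."""
    k_pred = torch.fft.fft2(image)
    k_dc = torch.where(sampling_mask, measured_kspace, k_pred)   # keep acquired samples
    return torch.fft.ifft2(k_dc)

H = W = 64
ground_truth = torch.randn(H, W, dtype=torch.complex64)
mask = torch.rand(H, W) < 0.3                                    # ~30% of k-space acquired
y = torch.fft.fft2(ground_truth) * mask.to(torch.complex64)      # undersampled measurements

stage_output = torch.zeros(H, W, dtype=torch.complex64)          # stand-in for a network stage
refined = data_consistency(stage_output, y, mask)
print(refined.shape)
```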
DiffComplete: Diffusion-based Generative 3D Shape Completion
June 28, 2023
Ruihang Chu, Enze Xie, Shentong Mo, Zhenguo Li, Matthias Nießner, Chi-Wing Fu, Jiaya Jia
We introduce a new diffusion-based approach for shape completion on 3D range
scans. Compared with prior deterministic and probabilistic methods, we strike a
balance between realism, multi-modality, and high fidelity. We propose
DiffComplete by casting shape completion as a generative task conditioned on
the incomplete shape. Our key designs are two-fold. First, we devise a
hierarchical feature aggregation mechanism to inject conditional features in a
spatially-consistent manner. So, we can capture both local details and broader
contexts of the conditional inputs to control the shape completion. Second, we
propose an occupancy-aware fusion strategy in our model to enable the
completion of multiple partial shapes and introduce higher flexibility on the
input conditions. DiffComplete sets a new SOTA performance (e.g., 40% decrease
on l_1 error) on two large-scale 3D shape completion benchmarks. Our completed
shapes not only have a realistic outlook compared with the deterministic
methods but also exhibit high similarity to the ground truths compared with the
probabilistic alternatives. Further, DiffComplete has strong generalizability
on objects of entirely unseen classes for both synthetic and real data,
eliminating the need for model re-training in various applications.
PFB-Diff: Progressive Feature Blending Diffusion for Text-driven Image Editing
June 28, 2023
Wenjing Huang, Shikui Tu, Lei Xu
Diffusion models have showcased their remarkable capability to synthesize
diverse and high-quality images, sparking interest in their application for
real image editing. However, existing diffusion-based approaches for local
image editing often suffer from undesired artifacts due to the pixel-level
blending of the noised target images and diffusion latent variables, which lack
the necessary semantics for maintaining image consistency. To address these
issues, we propose PFB-Diff, a Progressive Feature Blending method for
Diffusion-based image editing. Unlike previous methods, PFB-Diff seamlessly
integrates text-guided generated content into the target image through
multi-level feature blending. The rich semantics encoded in deep features and
the progressive blending scheme from high to low levels ensure semantic
coherence and high quality in edited images. Additionally, we introduce an
attention masking mechanism in the cross-attention layers to confine the impact
of specific words to desired regions, further improving the performance of
background editing. PFB-Diff can effectively address various editing tasks,
including object/background replacement and object attribute editing. Our
method demonstrates its superior performance in terms of image fidelity,
editing accuracy, efficiency, and faithfulness to the original image, without
the need for fine-tuning or training.
Easing Color Shifts in Score-Based Diffusion Models
June 27, 2023
Katherine Deck, Tobias Bischoff
Generated images of score-based models can suffer from errors in their
spatial means, an effect referred to as a color shift, which grows with image
size. This paper investigates a previously introduced approach to mitigate
color shifts in score-based diffusion models. We quantify the performance of a
nonlinear bypass connection in the score network, designed to process the
spatial mean of the input and to predict the mean of the score function. We
show that this network architecture substantially improves the resulting
quality of the generated images, and that this improvement is approximately
independent of the size of the generated images. As a result, this modified
architecture offers a simple solution for the color shift problem across image
sizes. We additionally discuss the origin of color shifts in an idealized
setting in order to motivate the approach.
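A hedged sketch of the mean-bypass idea: split the input into its spatial mean and a zero-mean residual, let a small nonlinear branch predict the mean of the score, and let the main network handle only the residual. The "main network" below is a one-layer placeholder and the exact wiring is an assumption.
```python
import torch
import torch.nn as nn

class MeanBypassScoreNet(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.main = nn.Conv2d(channels, channels, 3, padding=1)       # stand-in for a U-Net
        self.mean_branch = nn.Sequential(                             # nonlinear bypass on the spatial mean
            nn.Linear(channels + 1, 64), nn.SiLU(), nn.Linear(64, channels)
        )

    def forward(self, x, t):
        mean = x.mean(dim=(2, 3))                                      # (B, C) spatial means
        residual = x - mean[:, :, None, None]
        score_res = self.main(residual)
        score_res = score_res - score_res.mean(dim=(2, 3), keepdim=True)   # keep it zero-mean
        score_mean = self.mean_branch(torch.cat([mean, t[:, None]], dim=-1))
        return score_res + score_mean[:, :, None, None]

net = MeanBypassScoreNet()
x, t = torch.randn(4, 3, 32, 32), torch.rand(4)
print(net(x, t).shape)                                                 # torch.Size([4, 3, 32, 32])
```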
PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment
June 27, 2023
Jianyuan Wang, Christian Rupprecht, David Novotny
Camera pose estimation is a long-standing computer vision problem that to
date often relies on classical methods, such as handcrafted keypoint matching,
RANSAC and bundle adjustment. In this paper, we propose to formulate the
Structure from Motion (SfM) problem inside a probabilistic diffusion framework,
modelling the conditional distribution of camera poses given input images. This
novel view of an old problem has several advantages. (i) The nature of the
diffusion framework mirrors the iterative procedure of bundle adjustment. (ii)
The formulation allows a seamless integration of geometric constraints from
epipolar geometry. (iii) It excels in typically difficult scenarios such as
sparse views with wide baselines. (iv) The method can predict intrinsics and
extrinsics for an arbitrary number of images. We demonstrate that our method
PoseDiffusion significantly improves over the classic SfM pipelines and the
learned approaches on two real-world datasets. Finally, it is observed that our
method can generalize across datasets without further training. Project page:
https://posediffusion.github.io/
Anomaly Detection in Networks via Score-Based Generative Models
June 27, 2023
Dmitrii Gavrilev, Evgeny Burnaev
Node outlier detection in attributed graphs is a challenging problem for
which there is no method that would work well across different datasets.
Motivated by the state-of-the-art results of score-based models in graph
generative modeling, we propose to incorporate them into the aforementioned
problem. Our method achieves competitive results on small-scale graphs. We
provide an empirical analysis of the Dirichlet energy, and show that generative
models might struggle to accurately reconstruct it.
Learning from Invalid Data: On Constraint Satisfaction in Generative Models
June 27, 2023
Giorgio Giannone, Lyle Regenwetter, Akash Srivastava, Dan Gutfreund, Faez Ahmed
Generative models have demonstrated impressive results in vision, language,
and speech. However, even with massive datasets, they struggle with precision,
generating physically invalid or factually incorrect data. This is particularly
problematic when the generated data must satisfy constraints, for example, to
meet product specifications in engineering design or to adhere to the laws of
physics in a natural scene. To improve precision while preserving diversity and
fidelity, we propose a novel training mechanism that leverages datasets of
constraint-violating data points, which we consider invalid. Our approach
minimizes the divergence between the generative distribution and the valid
prior while maximizing the divergence with the invalid distribution. We
demonstrate how generative models like GANs and DDPMs that we augment to train
with invalid data vastly outperform their standard counterparts which solely
train on valid data points. For example, our training procedure generates up to
98 % fewer invalid samples on 2D densities, improves connectivity and stability
four-fold on a stacking block problem, and improves constraint satisfaction by
15 % on a structural topology optimization benchmark in engineering design. We
also analyze how the quality of the invalid data affects the learning procedure
and the generalization properties of models. Finally, we demonstrate
significant improvements in sample efficiency, showing that a tenfold increase
in valid samples leads to a negligible difference in constraint satisfaction,
while less than 10 % invalid samples lead to a tenfold improvement. Our
proposed mechanism offers a promising solution for improving precision in
generative models while preserving diversity and fidelity, particularly in
domains where constraint satisfaction is critical and data is limited, such as
engineering design, robotics, and medicine.
Equivariant flow matching
June 26, 2023
Leon Klein, Andreas Krämer, Frank Noé
stat.ML, cs.LG, physics.chem-ph, physics.comp-ph
Normalizing flows are a class of deep generative models that are especially
interesting for modeling probability distributions in physics, where the exact
likelihood of flows allows reweighting to known target energy functions and
computing unbiased observables. For instance, Boltzmann generators tackle the
long-standing sampling problem in statistical physics by training flows to
produce equilibrium samples of many-body systems such as small molecules and
proteins. To build effective models for such systems, it is crucial to
incorporate the symmetries of the target energy into the model, which can be
achieved by equivariant continuous normalizing flows (CNFs). However, CNFs can
be computationally expensive to train and generate samples from, which has
hampered their scalability and practical application. In this paper, we
introduce equivariant flow matching, a new training objective for equivariant
CNFs that is based on the recently proposed optimal transport flow matching.
Equivariant flow matching exploits the physical symmetries of the target energy
for efficient, simulation-free training of equivariant CNFs. We demonstrate the
effectiveness of flow matching on rotation and permutation invariant
many-particle systems and a small molecule, alanine dipeptide, where for the
first time we obtain a Boltzmann generator with significant sampling efficiency
without relying on tailored internal coordinate featurization. Our results show
that the equivariant flow matching objective yields flows with shorter
integration paths, improved sampling efficiency, and higher scalability
compared to existing methods.
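For orientation, a minimal conditional flow-matching training loop is sketched below; the equivariance and optimal-transport pairing that the paper adds on top are omitted, and the toy ring-shaped data stands in for molecular configurations. Pair a prior sample with a data sample, interpolate, and regress a vector field onto the straight-line velocity.
```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim = 2
v_net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def sample_data(n):                                  # toy data: points on a noisy ring
    theta = 2 * torch.pi * torch.rand(n)
    return torch.stack([theta.cos(), theta.sin()], dim=1) * 2.0 + 0.05 * torch.randn(n, dim)

for step in range(2000):
    x1 = sample_data(256)                            # data sample
    x0 = torch.randn(256, dim)                       # prior sample
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                       # linear interpolation path
    target_v = x1 - x0                               # its constant velocity
    pred_v = v_net(torch.cat([xt, t], dim=1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(float(loss))       # to sample, integrate dx/dt = v_net(x, t) from t = 0 to 1
```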
DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology
June 23, 2023
Marco Aversa, Gabriel Nobis, Miriam Hägele, Kai Standvoss, Mihaela Chirica, Roderick Murray-Smith, Ahmed Alaa, Lukas Ruff, Daniela Ivanova, Wojciech Samek, Frederick Klauschen, Bruno Sanguinetti, Luis Oala
We present DiffInfinite, a hierarchical diffusion model that generates
arbitrarily large histological images while preserving long-range correlation
structural information. Our approach first generates synthetic segmentation
masks, subsequently used as conditions for the high-fidelity generative
diffusion process. The proposed sampling method can be scaled up to any desired
image size while only requiring small patches for fast training. Moreover, it
can be parallelized more efficiently than previous large-content generation
methods while avoiding tiling artifacts. The training leverages classifier-free
guidance to augment a small, sparsely annotated dataset with unlabelled data.
Our method alleviates unique challenges in histopathological imaging practice:
large-scale information, costly manual annotation, and protective data
handling. The biological plausibility of DiffInfinite data is evaluated in a
survey by ten experienced pathologists as well as a downstream classification
and segmentation task. Samples from the model score strongly on anti-copying
metrics which is relevant for the protection of patient data.
Directional diffusion models for graph representation learning
June 22, 2023
Run Yang, Yuling Yang, Fan Zhou, Qiang Sun
In recent years, diffusion models have achieved remarkable success in various
domains of artificial intelligence, such as image synthesis, super-resolution,
and 3D molecule generation. However, the application of diffusion models in
graph learning has received relatively little attention. In this paper, we
address this gap by investigating the use of diffusion models for unsupervised
graph representation learning. We begin by identifying the anisotropic
structures of graphs and a crucial limitation of the vanilla forward diffusion
process in learning anisotropic structures. This process relies on continuously
adding isotropic Gaussian noise to the data, which may convert the
anisotropic signals to noise too quickly. This rapid conversion hampers the
training of denoising neural networks and impedes the acquisition of
semantically meaningful representations in the reverse process. To address this
challenge, we propose a new class of models called {\it directional diffusion
models}. These models incorporate data-dependent, anisotropic, and directional
noises in the forward diffusion process. To assess the efficacy of our proposed
models, we conduct extensive experiments on 12 publicly available datasets,
focusing on two distinct graph representation learning tasks. The experimental
results demonstrate the superiority of our models over state-of-the-art
baselines, indicating their effectiveness in capturing meaningful graph
representations. Our studies not only provide valuable insights into the
forward process of diffusion models but also highlight the wide-ranging
potential of these models for various graph-related tasks.
DiMSam: Diffusion Models as Samplers for Task and Motion Planning under Partial Observability
June 22, 2023
Xiaolin Fang, Caelan Reed Garrett, Clemens Eppner, Tomás Lozano-Pérez, Leslie Pack Kaelbling, Dieter Fox
cs.RO, cs.AI, cs.CV, cs.LG
Task and Motion Planning (TAMP) approaches are effective at planning
long-horizon autonomous robot manipulation. However, it can be difficult to
apply them to domains where the environment and its dynamics are not fully
known. We propose to overcome these limitations by leveraging deep generative
modeling, specifically diffusion models, to learn constraints and samplers that
capture these difficult-to-engineer aspects of the planning model. These
learned samplers are composed and combined within a TAMP solver in order to
find action parameter values jointly that satisfy the constraints along a plan.
To tractably make predictions for unseen objects in the environment, we define
these samplers on low-dimensional learned latent embeddings of changing object
state. We evaluate our approach in an articulated object manipulation domain
and show how the combination of classical TAMP, generative learning, and latent
embeddings enables long-horizon constraint-based reasoning. We also apply the
learned sampler in the real world. More details are available at
https://sites.google.com/view/dimsam-tamp
Continuous Layout Editing of Single Images with Diffusion Models
June 22, 2023
Zhiyuan Zhang, Zhitong Huang, Jing Liao
Recent advancements in large-scale text-to-image diffusion models have
enabled many applications in image editing. However, none of these methods have
been able to edit the layout of single existing images. To address this gap, we
propose the first framework for layout editing of a single image while
preserving its visual properties, thus allowing for continuous editing on a
single image. Our approach is achieved through two key modules. First, to
preserve the characteristics of multiple objects within an image, we
disentangle the concepts of different objects and embed them into separate
textual tokens using a novel method called masked textual inversion. Next, we
propose a training-free optimization method to perform layout control for a
pre-trained diffusion model, which allows us to regenerate images with learned
concepts and align them with user-specified layouts. As the first framework to
edit the layout of existing images, we demonstrate that our method is effective
and outperforms other baselines that were modified to support this task. Our
code will be freely available for public use upon acceptance.
Towards More Realistic Membership Inference Attacks on Large Diffusion Models
June 22, 2023
Jan Dubiński, Antoni Kowalczuk, Stanisław Pawlak, Przemysław Rokita, Tomasz Trzciński, Paweł Morawiecki
Generative diffusion models, including Stable Diffusion and Midjourney, can
generate visually appealing, diverse, and high-resolution images for various
applications. These models are trained on billions of internet-sourced images,
raising significant concerns about the potential unauthorized use of
copyright-protected images. In this paper, we examine whether it is possible to
determine if a specific image was used in the training set, a problem known in
the cybersecurity community and referred to as a membership inference attack.
Our focus is on Stable Diffusion, and we address the challenge of designing a
fair evaluation framework to answer this membership question. We propose a
methodology to establish a fair evaluation setup and apply it to Stable
Diffusion, enabling potential extensions to other generative models. Utilizing
this evaluation setup, we execute membership attacks (both known and newly
introduced). Our research reveals that previously proposed evaluation setups do
not provide a full understanding of the effectiveness of membership inference
attacks. We conclude that the membership inference attack remains a significant
challenge for large diffusion models (often deployed as black-box systems),
indicating that related privacy and copyright issues will persist in the
foreseeable future.
Wind Noise Reduction with a Diffusion-based Stochastic Regeneration Model
June 22, 2023
Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning, Timo Gerkmann
In this paper we present a method for single-channel wind noise reduction
using our previously proposed diffusion-based stochastic regeneration model
combining predictive and generative modelling. We introduce a non-additive
speech in noise model to account for the non-linear deformation of the membrane
caused by the wind flow and possible clipping. We show that our stochastic
regeneration model outperforms other neural-network-based wind noise reduction
methods as well as purely predictive and generative models, on a dataset using
simulated and real-recorded wind noise. We further show that the proposed
method generalizes well by testing on an unseen dataset with real-recorded wind
noise. Audio samples, data generation scripts and code for the proposed methods
can be found online (https://uhh.de/inf-sp-storm-wind).
DiffWA: Diffusion Models for Watermark Attack
With the rapid development of deep neural networks (DNNs), many robust blind
watermarking algorithms and frameworks have been proposed and have achieved good
results. At present, watermark attack algorithms cannot keep pace with watermark
embedding algorithms, and many of them only aim to interfere with normal watermark
extraction while causing substantial visual degradation to the image. To this end, we
propose DiffWA, a conditional diffusion model with distance guidance for
watermark attack, which can restore the image while removing the embedded
watermark. The core of our method is training an image-to-image conditional
diffusion model on unwatermarked images and guiding the conditional model using
a distance guidance when sampling, so that the model generates unwatermarked
images that are similar to the original images. We conducted experiments on
CIFAR-10 using our proposed models. The results show that the model can remove
the watermark effectively, raising the bit error rate of watermark extraction
above 0.4. At the same time, the attacked image maintains good visual quality,
with PSNR above 31 and SSIM above 0.97 compared with the original image.
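As a rough illustration of distance guidance during sampling, a minimal DDPM-style reverse step is sketched below: the gradient of a squared distance between the current clean-image estimate and the watermarked input nudges generation back toward the original content. The function names, choice of distance, and guidance scale are assumptions, not the paper's exact formulation.

```python
import torch

def distance_guided_step(denoiser, x_t, t, x_wm, alphas, alphas_cumprod, guidance_scale=1.0):
    """One DDPM reverse step with distance guidance toward the watermarked image x_wm.
    `alphas` / `alphas_cumprod` are the usual DDPM schedules (1-D tensors)."""
    a_t, a_bar = alphas[t], alphas_cumprod[t]
    t_batch = torch.full((x_t.shape[0],), t)
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        eps = denoiser(x_in, t_batch)
        x0_hat = (x_in - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # clean-image estimate
        dist = ((x0_hat - x_wm) ** 2).mean()                        # distance to the input image
        grad = torch.autograd.grad(dist, x_in)[0]
    mean = (x_t - (1 - a_t) / (1 - a_bar).sqrt() * eps.detach()) / a_t.sqrt()
    mean = mean - guidance_scale * grad                             # pull toward original content
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + (1 - a_t).sqrt() * noise
```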
Semi-Implicit Denoising Diffusion Models (SIDDMs)
June 21, 2023
Yanwu Xu, Mingming Gong, Shaoan Xie, Wei Wei, Matthias Grundmann, Kayhan Batmanghelich, Tingbo Hou
Despite the proliferation of generative models, achieving fast sampling
during inference without compromising sample diversity and quality remains
challenging. Existing models such as Denoising Diffusion Probabilistic Models
(DDPM) deliver high-quality, diverse samples but are slowed by an inherently
high number of iterative steps. The Denoising Diffusion Generative Adversarial
Networks (DDGAN) attempted to circumvent this limitation by integrating a GAN
model for larger jumps in the diffusion process. However, DDGAN encountered
scalability limitations when applied to large datasets. To address these
limitations, we introduce a novel approach that tackles the problem by matching
implicit and explicit factors. More specifically, our approach involves
utilizing an implicit model to match the marginal distributions of noisy data
and the explicit conditional distribution of the forward diffusion. This
combination allows us to effectively match the joint denoising distributions.
Unlike DDPM but similar to DDGAN, we do not enforce a parametric distribution
for the reverse step, enabling us to take large steps during inference. Similar
to the DDPM but unlike DDGAN, we take advantage of the exact form of the
diffusion process. We demonstrate that our proposed method obtains comparable
generative performance to diffusion-based models and vastly superior results to
models with a small number of sampling steps.
Diffusion Posterior Sampling for Informed Single-Channel Dereverberation
June 21, 2023
Jean-Marie Lemercier, Simon Welker, Timo Gerkmann
We present in this paper an informed single-channel dereverberation method
based on conditional generation with diffusion models. With knowledge of the
room impulse response, the anechoic utterance is generated via reverse
diffusion using a measurement consistency criterion coupled with a neural
network that represents the clean speech prior. The proposed approach is
largely more robust to measurement noise compared to a state-of-the-art
informed single-channel dereverberation method, especially for non-stationary
noise. Furthermore, we compare to other blind dereverberation methods using
diffusion models and show superiority of the proposed approach for large
reverberation times. We motivate the presented algorithm by introducing an
extension for blind dereverberation allowing joint estimation of the room
impulse response and anechoic speech. Audio samples and code can be found
online (https://uhh.de/inf-sp-derev-dps).
HumanDiffusion: diffusion model using perceptual gradients
June 21, 2023
Yota Ueda, Shinnosuke Takamichi, Yuki Saito, Norihiro Takamune, Hiroshi Saruwatari
We propose {\it HumanDiffusion,} a diffusion model trained from humans’
perceptual gradients to learn an acceptable range of data for humans (i.e.,
human-acceptable distribution). Conventional HumanGAN aims to model a
human-acceptable distribution wider than the real-data distribution by training
a neural network-based generator with human-based discriminators. However,
HumanGAN training tends to converge to a meaningless distribution due to
gradient vanishing or mode collapse and requires careful heuristics. In
contrast, our HumanDiffusion learns the human-acceptable distribution through
Langevin dynamics based on gradients of human perceptual evaluations. Our
training iterates a process to diffuse real data to cover a wider
human-acceptable distribution and can avoid the issues in the HumanGAN
training. The evaluation results demonstrate that our HumanDiffusion can
successfully represent the human-acceptable distribution without any heuristics
for the training.
DiffuseIR: Diffusion Models For Isotropic Reconstruction of 3D Microscopic Images
June 21, 2023
Mingjie Pan, Yulu Gan, Fangxu Zhou, Jiaming Liu, Aimin Wang, Shanghang Zhang, Dawei Li
Three-dimensional microscopy is often limited by anisotropic spatial
resolution, resulting in lower axial resolution than lateral resolution.
Current State-of-The-Art (SoTA) isotropic reconstruction methods utilizing deep
neural networks can achieve impressive super-resolution performance in fixed
imaging settings. However, their generality in practical use is limited by
degraded performance caused by artifacts and blurring when facing unseen
anisotropic factors. To address these issues, we propose DiffuseIR, an
unsupervised method for isotropic reconstruction based on diffusion models.
First, we pre-train a diffusion model to learn the structural distribution of
biological tissue from lateral microscopic images, enabling it to generate
naturally high-resolution images. Then we use low-axial-resolution microscopy
images to condition the generation process of the diffusion model and generate
high-axial-resolution reconstruction results. Since the diffusion model learns
the universal structural distribution of biological tissues, which is
independent of the axial resolution, DiffuseIR can reconstruct authentic images
with unseen low-axial resolutions into a high-axial resolution without
requiring re-training. The proposed DiffuseIR achieves SoTA performance in
experiments on EM data and can even compete with supervised methods.
HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models
June 21, 2023
Chanyue Wu, Dong Wang, Hanyu Mao, Ying Li
Despite the proven significance of hyperspectral images (HSIs) in performing
various computer vision tasks, their potential is adversely affected by the
low-resolution (LR) property in the spatial domain, resulting from multiple
physical factors. Inspired by recent advancements in deep generative models, we
propose an HSI Super-resolution (SR) approach with Conditional Diffusion Models
(HSR-Diff) that merges a high-resolution (HR) multispectral image (MSI) with
the corresponding LR-HSI. HSR-Diff generates an HR-HSI via repeated refinement,
in which the HR-HSI is initialized with pure Gaussian noise and iteratively
refined. At each iteration, the noise is removed with a Conditional Denoising
Transformer (CDFormer) that is trained on denoising at different noise levels,
conditioned on the hierarchical feature maps of HR-MSI and LR-HSI. In addition,
a progressive learning strategy is employed to exploit the global information
of full-resolution images. Systematic experiments have been conducted on four
public datasets, demonstrating that HSR-Diff outperforms state-of-the-art
methods.
Ambigram Generation by A Diffusion Model
June 21, 2023
Takahiro Shirakawa, Seiichi Uchida
Ambigrams are graphical letter designs that can be read not only from the
original direction but also from a rotated direction (especially with 180
degrees). Designing ambigrams is difficult even for human experts because
maintaining dual readability from both directions is challenging. This
paper proposes an ambigram generation model. As its generation module, we use a
diffusion model, which has recently been used to generate high-quality
photographic images. By specifying a pair of letter classes, such as ‘A’ and
‘B’, the proposed model generates various ambigram images which can be read as
‘A’ from the original direction and ‘B’ from a direction rotated 180 degrees.
Quantitative and qualitative analyses of experimental results show that the
proposed model can generate high-quality and diverse ambigrams. In addition, we
define ambigramability, an objective measure of how easy it is to generate
ambigrams for each letter pair. For example, the pair of ‘A’ and ‘V’ shows a
high ambigramability (that is, it is easy to generate their ambigrams), and the
pair of ‘D’ and ‘K’ shows a lower ambigramability. The ambigramability gives
various hints of the ambigram generation not only for computers but also for
human experts. The code can be found at
(https://github.com/univ-esuty/ambifusion).
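One plausible way to realize dual readability during sampling is to average two class-conditional noise predictions, one computed on the image as-is for the first letter and one on the 180-degree-rotated image for the second letter. The sketch below assumes a class-conditional `denoiser(x, t, class_id)` in NCHW layout and may differ from the paper's exact combination rule.

```python
import torch

def ambigram_eps(denoiser, x_t, t_batch, class_a, class_b):
    """Average two class-conditional noise predictions so the sample reads as
    `class_a` upright and `class_b` when rotated by 180 degrees."""
    rot180 = lambda z: torch.flip(z, dims=(2, 3))            # NCHW: flipping H and W = 180-degree rotation
    eps_a = denoiser(x_t, t_batch, class_a)                  # first letter, seen upright
    eps_b = rot180(denoiser(rot180(x_t), t_batch, class_b))  # second letter, seen upside down
    return 0.5 * (eps_a + eps_b)
```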
TauPETGen: Text-Conditional Tau PET Image Synthesis Based on Latent Diffusion Models
June 21, 2023
Se-In Jang, Cristina Lois, Emma Thibault, J. Alex Becker, Yafei Dong, Marc D. Normandin, Julie C. Price, Keith A. Johnson, Georges El Fakhri, Kuang Gong
In this work, we developed a novel text-guided image synthesis technique
which could generate realistic tau PET images from textual descriptions and the
subject’s MR image. The generated tau PET images have the potential to be used
in examining relations between different measures and also increasing the
public availability of tau PET datasets. The method was based on latent
diffusion models. Both textual descriptions and the subject’s MR prior image
were utilized as conditions during image generation. The subject’s MR image can
provide anatomical details, while the text descriptions, such as gender, scan
time, cognitive test scores, and amyloid status, can provide further guidance
regarding where the tau neurofibrillary tangles might be deposited. Preliminary
experimental results based on clinical [18F]MK-6240 datasets demonstrate the
feasibility of the proposed method in generating realistic tau PET images at
different clinical stages.
Reward Shaping via Diffusion Process in Reinforcement Learning
June 20, 2023
Peeyush Kumar
Reinforcement Learning (RL) models have continually evolved to navigate the
exploration-exploitation trade-off in uncertain Markov Decision Processes
(MDPs). In this study, I leverage the principles of stochastic thermodynamics
and system dynamics to explore reward shaping via diffusion processes. This
provides an elegant framework for thinking about the exploration-exploitation
trade-off. This article sheds light on relationships between information
entropy, stochastic system dynamics, and their influences on entropy
production. This exploration allows us to construct a dual-pronged framework
that can be interpreted as either a maximum entropy program for deriving
efficient policies or a modified cost optimization program accounting for
informational costs and benefits. This work presents a novel perspective on the
physical nature of information and its implications for online learning in
MDPs, consequently providing a better understanding of information-oriented
formulations in RL.
Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision
June 20, 2023
Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, Vincent Sitzmann
Denoising diffusion models are a powerful type of generative models used to
capture complex distributions of real-world signals. However, their
applicability is limited to scenarios where training samples are readily
available, which is not always the case in real-world applications. For
example, in inverse graphics, the goal is to generate samples from a
distribution of 3D scenes that align with a given image, but ground-truth 3D
scenes are unavailable and only 2D images are accessible. To address this
limitation, we propose a novel class of denoising diffusion probabilistic
models that learn to sample from distributions of signals that are never
directly observed. Instead, these signals are measured indirectly through a
known differentiable forward model, which produces partial observations of the
unknown signal. Our approach involves integrating the forward model directly
into the denoising process. This integration effectively connects the
generative modeling of observations with the generative modeling of the
underlying signals, allowing for end-to-end training of a conditional
generative model over signals. During inference, our approach enables sampling
from the distribution of underlying signals that are consistent with a given
partial observation. We demonstrate the effectiveness of our method on three
challenging computer vision tasks. For instance, in the context of inverse
graphics, our model enables direct sampling from the distribution of 3D scenes
that align with a single 2D input image.
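The central idea, supervising a signal prediction only through a known differentiable forward model, can be sketched as an observation-space loss like the one below. This deliberately glosses over how the noisy signal estimate is produced during training, which is where the paper's actual construction lies; `denoiser` and `forward_model` are hypothetical placeholders.

```python
import torch

def observation_space_loss(denoiser, forward_model, signal_noisy, t_batch, observation):
    """Supervise the predicted signal only through a differentiable forward model."""
    signal_hat = denoiser(signal_noisy, t_batch)   # prediction of the never-observed signal
    obs_hat = forward_model(signal_hat)            # e.g., a differentiable renderer
    return ((obs_hat - observation) ** 2).mean()   # loss lives entirely in observation space
```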
Masked Diffusion Models are Fast Learners
June 20, 2023
Jiachen Lei, Qinglong Wang, Peng Cheng, Zhongjie Ba, Zhan Qin, Zhibo Wang, Zhenguang Liu, Kui Ren
The diffusion model has emerged as the \emph{de-facto} model for image
generation, yet its heavy training overhead hinders broader adoption in the
research community. We observe that diffusion models are commonly trained to
learn all fine-grained visual information from scratch. This paradigm may cause
unnecessary training costs and hence warrants in-depth investigation. In this
work, we show that it suffices to train a strong diffusion model by first
pre-training the model to learn some primer distribution that loosely
characterizes the unknown real image distribution. Then the pre-trained model
can be fine-tuned for various generation tasks efficiently. In the pre-training
stage, we propose to mask a high proportion (e.g., up to 90\%) of input images
to approximately represent the primer distribution and introduce a masked
denoising score matching objective to train a model to denoise visible areas.
In the subsequent fine-tuning stage, we efficiently train the diffusion model
without masking. Utilizing this two-stage training framework, we achieve
significant training acceleration and a new FID score record of 6.27 on CelebA-HQ
$256 \times 256$ for ViT-based diffusion models. The generalizability of a
pre-trained model further helps build models that perform better than ones
trained from scratch on different downstream datasets. For instance, a
diffusion model pre-trained on VGGFace2 attains a 46\% quality improvement when
fine-tuned on a different dataset that contains only 3000 images. Our code is
available at \url{https://github.com/jiachenlei/maskdm}.
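A minimal sketch of a masked denoising score matching objective is shown below: the image is noised, a high fraction of patches is hidden, and the epsilon-prediction loss is computed only over visible pixels. The shapes, patch size, and zero-masking of hidden regions are assumptions; the paper's ViT-based masking is more involved.

```python
import torch

def masked_dsm_loss(denoiser, x0, alphas_cumprod, mask_ratio=0.9, patch=16):
    """Noise the image, hide most patches, and compute the epsilon-prediction loss
    only over visible pixels."""
    b, c, h, w = x0.shape
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # Per-patch visibility mask (1 = visible), upsampled to pixel resolution.
    keep = (torch.rand(b, 1, h // patch, w // patch) > mask_ratio).float()
    vis = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    eps_hat = denoiser(x_t * vis, t)                 # hidden regions are simply zeroed here
    return (((eps_hat - noise) ** 2) * vis).sum() / (vis.sum() * c).clamp(min=1)
```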
GD-VDM: Generated Depth for better Diffusion-based Video Generation
June 19, 2023
Ariel Lapid, Idan Achituve, Lior Bracha, Ethan Fetaya
The field of generative models has recently witnessed significant progress,
with diffusion models showing remarkable performance in image generation. In
light of this success, there is a growing interest in exploring the application
of diffusion models to other modalities. One such challenge is the generation
of coherent videos of complex scenes, which poses several technical
difficulties, such as capturing temporal dependencies and generating long,
high-resolution videos. This paper proposes GD-VDM, a novel diffusion model for
video generation, demonstrating promising results. GD-VDM is based on a
two-phase generation process involving generating depth videos followed by a
novel diffusion Vid2Vid model that generates a coherent real-world video. We
evaluated GD-VDM on the Cityscapes dataset and found that it generates more
diverse and complex scenes compared to natural baselines, demonstrating the
efficacy of our approach.
Diffusion model based data generation for partial differential equations
June 19, 2023
Rucha Apte, Sheel Nidhan, Rishikesh Ranade, Jay Pathak
In a preliminary attempt to address the problem of data scarcity in
physics-based machine learning, we introduce a novel methodology for data
generation in physics-based simulations. Our motivation is to overcome the
limitations posed by the limited availability of numerical data. To achieve
this, we leverage a diffusion model that allows us to generate synthetic data
samples and test them for two canonical cases: (a) the steady 2-D Poisson
equation, and (b) the forced unsteady 2-D Navier-Stokes (NS)
vorticity-transport equation in a confined box. By comparing the generated
data samples against outputs from classical solvers, we assess their accuracy
and examine their adherence to the underlying physics laws. In this way, we
emphasize the importance of not only satisfying visual and statistical
comparisons with solver data but also ensuring the generated data’s conformity
to physics laws, thus enabling their effective utilization in downstream tasks.
Score-based Data Assimilation
June 18, 2023
François Rozet, Gilles Louppe
Data assimilation, in its most comprehensive form, addresses the Bayesian
inverse problem of identifying plausible state trajectories that explain noisy
or incomplete observations of stochastic dynamical systems. Various approaches
have been proposed to solve this problem, including particle-based and
variational methods. However, most algorithms depend on the transition dynamics
for inference, which becomes intractable for long time horizons or for
high-dimensional systems with complex dynamics, such as oceans or atmospheres.
In this work, we introduce score-based data assimilation for trajectory
inference. We learn a score-based generative model of state trajectories based
on the key insight that the score of an arbitrarily long trajectory can be
decomposed into a series of scores over short segments. After training,
inference is carried out using the score model, in a non-autoregressive manner
by generating all states simultaneously. Quite distinctively, we decouple the
observation model from the training procedure and use it only at inference to
guide the generative process, which enables a wide range of zero-shot
observation scenarios. We present theoretical and empirical evidence supporting
the effectiveness of our method.
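The key insight, that the score of a long trajectory can be assembled from scores of short segments, might look roughly like the sliding-window accumulation below; the exact composition rule and overlap weighting follow the paper and are only approximated here, with `segment_score` as a hypothetical segment-level score network.

```python
import torch

def composed_trajectory_score(segment_score, x_t, t_batch, seg_len, stride):
    """Approximate the score of a long trajectory (batch x T x dim) by accumulating
    the outputs of a model trained on short segments of length `seg_len`."""
    b, T, d = x_t.shape
    score = torch.zeros_like(x_t)
    counts = torch.zeros(1, T, 1)
    for start in range(0, T - seg_len + 1, stride):
        sl = slice(start, start + seg_len)
        score[:, sl] += segment_score(x_t[:, sl], t_batch)   # local segment score
        counts[:, sl] += 1.0
    return score / counts.clamp(min=1.0)                     # average over overlapping windows
```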
Point-Cloud Completion with Pretrained Text-to-image Diffusion Models
June 18, 2023
Yoni Kasten, Ohad Rahamim, Gal Chechik
Point-cloud data collected in real-world applications are often incomplete.
Data is typically missing due to objects being observed from partial
viewpoints, which only capture a specific perspective or angle. Additionally,
data can be incomplete due to occlusion and low-resolution sampling. Existing
completion approaches rely on datasets of predefined objects to guide the
completion of noisy and incomplete point clouds. However, these approaches
perform poorly when tested on Out-Of-Distribution (OOD) objects that are
poorly represented in the training dataset. Here we leverage recent advances in
text-guided image generation, which lead to major breakthroughs in text-guided
shape generation. We describe an approach called SDS-Complete that uses a
pre-trained text-to-image diffusion model and leverages the text semantics of a
given incomplete point cloud of an object, to obtain a complete surface
representation. SDS-Complete can complete a variety of objects using test-time
optimization without expensive collection of 3D information. We evaluate
SDS-Complete on incomplete scanned objects captured by real-world depth sensors
and LiDAR scanners. We find that it effectively reconstructs objects that are
absent from common datasets, reducing Chamfer loss by 50% on average compared
with current methods. Project page: https://sds-complete.github.io/
GenPose: Generative Category-level Object Pose Estimation via Diffusion Models
June 18, 2023
Jiyao Zhang, Mingdong Wu, Hao Dong
Object pose estimation plays a vital role in embodied AI and computer vision,
enabling intelligent agents to comprehend and interact with their surroundings.
Despite the practicality of category-level pose estimation, current approaches
encounter challenges with partially observed point clouds, known as the
multi-hypothesis issue. In this study, we propose a novel solution by reframing
category-level object pose estimation as conditional generative modeling,
departing from traditional point-to-point regression. Leveraging score-based
diffusion models, we estimate object poses by sampling candidates from the
diffusion model and aggregating them through a two-step process: filtering out
outliers via likelihood estimation and subsequently mean-pooling the remaining
candidates. To avoid the costly integration process when estimating the
likelihood, we introduce an alternative method that trains an energy-based
model from the original score-based model, enabling end-to-end likelihood
estimation. Our approach achieves state-of-the-art performance on the REAL275
dataset, surpassing 50% and 60% on the strict 5°2cm and 5°5cm metrics,
respectively. Furthermore, our method demonstrates strong generalizability to
novel categories sharing similar symmetric properties without fine-tuning and
can readily adapt to object pose tracking tasks, yielding comparable results to
the current state-of-the-art baselines.
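The candidate-aggregation step can be sketched as below: sample pose candidates, rank them with an energy model standing in for the likelihood, discard the least likely, and mean-pool the rest. `pose_sampler` and `energy_model` are hypothetical interfaces, and naively averaging rotation parameters is a simplification; a faithful version would average on the rotation manifold (e.g., via quaternions).

```python
import torch

def aggregate_pose_candidates(pose_sampler, energy_model, obs, num_candidates=50, keep_ratio=0.6):
    """Sample candidates, rank with an energy model (lower energy = higher likelihood),
    drop outliers, and mean-pool the survivors."""
    poses = pose_sampler(obs, num_candidates)          # (num_candidates, pose_dim)
    energies = energy_model(obs, poses)                # (num_candidates,)
    k = max(1, int(keep_ratio * num_candidates))
    keep = torch.topk(-energies, k).indices            # the k most likely candidates
    return poses[keep].mean(dim=0)                     # naive averaging; see caveat above
```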
Image Harmonization with Diffusion Model
June 17, 2023
Jiajie Li, Jian Wang, Chen Wang, Jinjun Xiong
Image composition in image editing involves merging a foreground image with a
background image to create a composite. Inconsistent lighting conditions
between the foreground and background often result in unrealistic composites.
Image harmonization addresses this challenge by adjusting illumination and
color to achieve visually appealing and consistent outputs. In this paper, we
present a novel approach for image harmonization by leveraging diffusion
models. We conduct a comparative analysis of two conditional diffusion models,
namely Classifier-Guidance and Classifier-Free. Our focus is on addressing the
challenge of adjusting illumination and color in foreground images to create
visually appealing outputs that seamlessly blend with the background. Through
this research, we establish a solid groundwork for future investigations in the
realm of diffusion model-based image harmonization.
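For reference, the classifier-free variant compared in the paper blends conditional and unconditional noise predictions with the standard formula below; the `denoiser` signature and the convention that `None` selects the unconditional branch are assumptions.

```python
import torch

def classifier_free_eps(denoiser, x_t, t_batch, cond, guidance_weight=2.0):
    """Blend conditional and unconditional noise predictions (classifier-free guidance)."""
    eps_cond = denoiser(x_t, t_batch, cond)     # conditioned on composite/background information
    eps_uncond = denoiser(x_t, t_batch, None)   # None assumed to select the unconditional branch
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```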
Text-Driven Foley Sound Generation With Latent Diffusion Model
June 17, 2023
Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang
cs.SD, cs.AI, cs.LG, eess.AS
Foley sound generation aims to synthesise the background sound for multimedia
content. Previous models usually employ a large development set with labels as
input (e.g., single numbers or one-hot vector). In this work, we propose a
diffusion model based system for Foley sound generation with text conditions.
To alleviate the data scarcity issue, our model is initially pre-trained with
large-scale datasets and fine-tuned to this task via transfer learning using
the contrastive language-audio pretraining (CLAP) technique. We have observed
that the feature embedding extracted by the text encoder can significantly
affect the performance of the generation model. Hence, we introduce a trainable
layer after the encoder to improve the text embedding produced by the encoder.
In addition, we further refine the generated waveform by generating multiple
candidate audio clips simultaneously and selecting the best one, which is
determined in terms of the similarity score between the embedding of the
candidate clips and the embedding of the target text label. Using the proposed
method, our system ranks ${1}^{st}$ among the systems submitted to DCASE
Challenge 2023 Task 7. The results of the ablation studies illustrate that the
proposed techniques significantly improve sound generation performance. The
codes for implementing the proposed system are available online.
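The generate-then-select step can be sketched as below: embed each candidate clip with a CLAP-style audio encoder and keep the clip whose embedding has the highest cosine similarity to the target text embedding; the encoder interface and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def select_best_clip(candidates, text_embedding, audio_encoder):
    """Keep the candidate clip whose embedding is most similar to the text embedding."""
    audio_emb = audio_encoder(candidates)                                     # (K, D)
    sims = F.cosine_similarity(audio_emb, text_embedding.unsqueeze(0), dim=-1)  # (K,)
    return candidates[sims.argmax()]
```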
Building the Bridge of Schrödinger: A Continuous Entropic Optimal Transport Benchmark
June 16, 2023
Nikita Gushchin, Alexander Kolesov, Petr Mokrov, Polina Karpikova, Andrey Spiridonov, Evgeny Burnaev, Alexander Korotin
Over the last several years, there has been significant progress in
developing neural solvers for the Schrödinger Bridge (SB) problem and
applying them to generative modelling. This new research field is justifiably
fruitful as it is interconnected with the practically well-performing diffusion
models and theoretically grounded entropic optimal transport (EOT). Still, the
area lacks non-trivial tests allowing a researcher to understand how well the
methods solve SB or its equivalent continuous EOT problem. We fill this gap and
propose a novel way to create pairs of probability distributions for which the
ground truth OT solution is known by the construction. Our methodology is
generic and works for a wide range of OT formulations, in particular, it covers
the EOT which is equivalent to SB (the main interest of our study). This
development allows us to create continuous benchmark distributions with the
known EOT and SB solutions on high-dimensional spaces such as spaces of images.
As an illustration, we use these benchmark pairs to test how well existing
neural EOT/SB solvers actually compute the EOT solution. Our code for
constructing benchmark pairs under different setups is available at:
https://github.com/ngushchin/EntropicOTBenchmark.
Towards Better Certified Segmentation via Diffusion Models
June 16, 2023
Othmane Laousy, Alexandre Araujo, Guillaume Chassagnon, Marie-Pierre Revel, Siddharth Garg, Farshad Khorrami, Maria Vakalopoulou
The robustness of image segmentation has been an important research topic in
the past few years as segmentation models have reached production-level
accuracy. However, like classification models, segmentation models can be
vulnerable to adversarial perturbations, which hinders their use in
critical-decision systems like healthcare or autonomous driving. Recently,
randomized smoothing has been proposed to certify segmentation predictions by
adding Gaussian noise to the input to obtain theoretical guarantees. However,
this method exhibits a trade-off between the amount of added noise and the
level of certification achieved. In this paper, we address the problem of
certifying segmentation prediction using a combination of randomized smoothing
and diffusion models. Our experiments show that combining randomized smoothing
and diffusion models significantly improves certified robustness, with results
indicating a mean improvement of 21 points in accuracy compared to previous
state-of-the-art methods on Pascal-Context and Cityscapes public datasets. Our
method is independent of the selected segmentation model and does not need any
additional specialized training procedure.
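A bare-bones version of the denoise-then-segment smoothing procedure is sketched below: Gaussian noise is added, a diffusion denoiser cleans the input, the segmentation model predicts, and a per-pixel majority vote is taken. Actual certification additionally requires abstention rules and statistical confidence bounds, which are omitted; the one-shot `denoiser(noisy, sigma)` interface is an assumption.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def smoothed_segmentation(segmenter, denoiser, image, sigma=0.25, n_samples=100):
    """Add Gaussian noise, denoise with a diffusion model, segment, and take a
    per-pixel majority vote over the noise draws."""
    votes = None
    for _ in range(n_samples):
        noisy = image + sigma * torch.randn_like(image)
        denoised = denoiser(noisy, sigma)             # one-shot diffusion denoising step
        logits = segmenter(denoised)                  # (1, num_classes, H, W)
        pred = logits.argmax(dim=1)                   # (1, H, W)
        onehot = F.one_hot(pred, logits.shape[1]).permute(0, 3, 1, 2)
        votes = onehot if votes is None else votes + onehot
    return votes.argmax(dim=1)                        # per-pixel majority class
```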
Drag-guided diffusion models for vehicle image generation
June 16, 2023
Nikos Arechiga, Frank Permenter, Binyang Song, Chenyang Yuan
Denoising diffusion models trained at web-scale have revolutionized image
generation. The application of these tools to engineering design is an
intriguing possibility, but is currently limited by their inability to parse
and enforce concrete engineering constraints. In this paper, we take a step
towards this goal by proposing physics-based guidance, which enables
optimization of a performance metric (as predicted by a surrogate model) during
the generation process. As a proof-of-concept, we add drag guidance to Stable
Diffusion, which allows this tool to generate images of novel vehicles while
simultaneously minimizing their predicted drag coefficients.
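Conceptually, physics-based guidance adds the gradient of a surrogate drag predictor, evaluated on the current clean-image estimate, to the noise prediction, as in the short sketch below; the scaling and the exact point at which the surrogate is applied are assumptions rather than the paper's formulation.

```python
import torch

def drag_guided_eps(denoiser, drag_surrogate, x_t, t_batch, a_bar, weight=1.0):
    """Adjust the noise prediction with the gradient of a surrogate drag predictor
    evaluated on the current clean-image estimate (Tweedie formula)."""
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        eps = denoiser(x_in, t_batch)
        x0_hat = (x_in - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
        drag = drag_surrogate(x0_hat).sum()                   # predicted drag coefficient(s)
        grad = torch.autograd.grad(drag, x_in)[0]
    return eps.detach() + weight * (1 - a_bar).sqrt() * grad  # steer sampling toward lower drag
```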
Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models
June 16, 2023
Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, Jong Chul Ye
cs.CV, cs.AI, cs.CL, cs.LG
Despite the remarkable performance of text-to-image diffusion models in image
generation tasks, recent studies have raised the issue that generated images
sometimes cannot capture the intended semantic contents of the text prompts,
a phenomenon often called semantic misalignment. To address this, here
we present a novel energy-based model (EBM) framework for adaptive context
control by modeling the posterior of context vectors. Specifically, we first
formulate EBMs of latent image representations and text embeddings in each
cross-attention layer of the denoising autoencoder. Then, we obtain the
gradient of the log posterior of context vectors, which can be updated and
transferred to the subsequent cross-attention layer, thereby implicitly
minimizing a nested hierarchy of energy functions. Our latent EBMs further
allow zero-shot compositional generation as a linear combination of
cross-attention outputs from different contexts. Using extensive experiments,
we demonstrate that the proposed method is highly effective in handling various
image generation tasks, including multi-concept generation, text-guided image
inpainting, and real and synthetic image editing. Code:
https://github.com/EnergyAttention/Energy-Based-CrossAttention.
The Big Data Myth: Using Diffusion Models for Dataset Generation to Train Deep Detection Models
June 16, 2023
Roy Voetman, Maya Aghaei, Klaas Dijkstra
Despite the notable accomplishments of deep object detection models, a major
challenge that persists is the requirement for extensive amounts of training
data. The process of procuring such real-world data is a laborious undertaking,
which has prompted researchers to explore new avenues of research, such as
synthetic data generation techniques. This study presents a framework for the
generation of synthetic datasets by fine-tuning pretrained stable diffusion
models. The synthetic datasets are then manually annotated and employed for
training various object detection models. These detectors are evaluated on a
real-world test set of 331 images and compared against a baseline model that
was trained on real-world images. The results of this study reveal that the
object detection models trained on synthetic data perform similarly to the
baseline model. In the context of apple detection in orchards, the average
precision deviation from the baseline ranges from 0.09 to 0.12. This study
illustrates the potential of synthetic data generation techniques as a viable
alternative to the collection of extensive training data for the training of
deep models.
Evaluating the Robustness of Text-to-image Diffusion Models against Real-world Attacks
June 16, 2023
Hongcheng Gao, Hao Zhang, Yinpeng Dong, Zhijie Deng
cs.CR, cs.AI, cs.CV, cs.LG
Text-to-image (T2I) diffusion models (DMs) have shown promise in generating
high-quality images from textual descriptions. The real-world applications of
these models require particular attention to their safety and fidelity, but
this has not been sufficiently explored. One fundamental question is whether
existing T2I DMs are robust against variations over input texts. To answer it,
this work provides the first robustness evaluation of T2I DMs against
real-world attacks. Unlike prior studies that focus on malicious attacks
involving apocryphal alterations to the input texts, we consider an attack
space spanned by realistic errors (e.g., typo, glyph, phonetic) that humans can
make, to ensure semantic consistency. Given the inherent randomness of the
generation process, we develop novel distribution-based attack objectives to
mislead T2I DMs. We perform attacks in a black-box manner without any knowledge
of the model. Extensive experiments demonstrate the effectiveness of our method
for attacking popular T2I DMs and simultaneously reveal their non-trivial
robustness issues. Moreover, we provide an in-depth analysis of our method to
show that it is not designed to attack the text encoder in T2I DMs solely.
Edit-DiffNeRF: Editing 3D Neural Radiance Fields using 2D Diffusion Model
June 15, 2023
Lu Yu, Wei Xiang, Kang Han
Recent research has demonstrated that the combination of pretrained diffusion
models with neural radiance fields (NeRFs) has emerged as a promising approach
for text-to-3D generation. Simply coupling NeRF with diffusion models will
result in cross-view inconsistency and degradation of stylized view syntheses.
To address this challenge, we propose the Edit-DiffNeRF framework, which is
composed of a frozen diffusion model, a proposed delta module to edit the
latent semantic space of the diffusion model, and a NeRF. Instead of training
the entire diffusion model for each scene, our method focuses on editing the latent
semantic space in frozen pretrained diffusion models by the delta module. This
fundamental change to the standard diffusion framework enables us to make
fine-grained modifications to the rendered views and effectively consolidate
these instructions in a 3D scene via NeRF training. As a result, we are able to
produce an edited 3D scene that faithfully aligns to input text instructions.
Furthermore, to ensure semantic consistency across different viewpoints, we
propose a novel multi-view semantic consistency loss that extracts a latent
semantic embedding from the input view as a prior and aims to reconstruct it in
different views. Our proposed method has been shown to effectively edit
real-world 3D scenes, resulting in 25% improvement in the alignment of the
performed 3D edits with text instructions compared to prior work.
R2-Diff: Denoising by diffusion as a refinement of retrieved motion for image-based motion prediction
June 15, 2023
Takeru Oba, Norimichi Ukita
cs.CV, cs.LG, cs.RO, 68T40
Image-based motion prediction is one of the essential techniques for robot
manipulation. Among the various prediction models, we focus on diffusion models
because they have achieved state-of-the-art performance in various
applications. In image-based motion prediction, diffusion models stochastically
predict contextually appropriate motion by gradually denoising random Gaussian
noise based on the image context. While diffusion models are able to predict
various motions by changing the random noise, they sometimes fail to predict a
contextually appropriate motion based on the image because the random noise is
sampled independently of the image context. To solve this problem, we propose
R2-Diff. In R2-Diff, a motion retrieved from a dataset based on image
similarity is fed into a diffusion model instead of random noise. Then, the
retrieved motion is refined through the denoising process of the diffusion
model. Since the retrieved motion is almost appropriate to the context, it
becomes easier to predict contextually appropriate motion. However, traditional
diffusion models are not optimized to refine the retrieved motion. Therefore,
we propose a method for tuning the hyperparameters based on the distance to
the nearest-neighbor motion in the dataset, to optimize the diffusion model
for refinement. Furthermore, we propose an image-based retrieval method to
retrieve the nearest neighbor motion in inference. Our proposed retrieval
efficiently computes the similarity based on the image features along the
motion trajectory. We demonstrate that R2-Diff accurately predicts appropriate
motions and achieves high task success rates compared to recent
state-of-the-art models in robot manipulation.
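One way to realize "refine a retrieved motion instead of pure noise" is an SDEdit-style partial denoising: noise the retrieved trajectory up to an intermediate step and run the image-conditioned reverse process from there, as sketched below. The fixed `start_t` is a stand-in for the paper's hyperparameter tuning based on nearest-neighbor distances.

```python
import torch

@torch.no_grad()
def refine_retrieved_motion(denoiser, retrieved_motion, image_ctx, alphas, alphas_cumprod,
                            start_t=100):
    """Noise the retrieved motion up to step `start_t`, then run the image-conditioned
    reverse process from there instead of from pure Gaussian noise."""
    a_bar = alphas_cumprod[start_t]
    x_t = a_bar.sqrt() * retrieved_motion + (1 - a_bar).sqrt() * torch.randn_like(retrieved_motion)
    for t in range(start_t, -1, -1):
        a_t, a_bar_t = alphas[t], alphas_cumprod[t]
        eps = denoiser(x_t, torch.full((x_t.shape[0],), t), image_ctx)
        mean = (x_t - (1 - a_t) / (1 - a_bar_t).sqrt() * eps) / a_t.sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_t = mean + (1 - a_t).sqrt() * noise
    return x_t
```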
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis
June 15, 2023
Shivam Mehta, Siyang Wang, Simon Alexanderson, Jonas Beskow, Éva Székely, Gustav Eje Henter
eess.AS, cs.AI, cs.CV, cs.HC, cs.LG, 68T07 (Primary), 68T42 (Secondary), I.2.7; I.2.6; G.3; H.5.5
With read-aloud speech synthesis achieving high naturalness scores, there is
a growing research interest in synthesising spontaneous speech. However, human
spontaneous face-to-face conversation has both spoken and non-verbal aspects
(here, co-speech gestures). Only recently has research begun to explore the
benefits of jointly synthesising these two modalities in a single system. The
previous state of the art used non-probabilistic methods, which fail to capture
the variability of human speech and motion, and risk producing oversmoothing
artefacts and sub-optimal synthesis quality. We present the first
diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to
synthesise speech and gestures together. Our method can be trained on small
datasets from scratch. Furthermore, we describe a set of careful uni- and
multi-modal subjective tests for evaluating integrated speech and gesture
synthesis systems, and use them to validate our proposed approach. Please see
https://shivammehta25.github.io/Diff-TTSG/ for video examples, data, and code.
Fit Like You Sample: Sample-Efficient Generalized Score Matching from Fast Mixing Markov Chains
June 15, 2023
Yilong Qin, Andrej Risteski
Score matching is an approach to learning probability distributions
parametrized up to a constant of proportionality (e.g. Energy-Based Models).
The idea is to fit the score of the distribution, rather than the likelihood,
thus avoiding the need to evaluate the constant of proportionality. While
there’s a clear algorithmic benefit, the statistical “cost” can be steep:
recent work by Koehler et al. 2022 showed that for distributions that have poor
isoperimetric properties (a large Poincaré or log-Sobolev constant), score
matching is substantially statistically less efficient than maximum likelihood.
However, many natural realistic distributions, e.g. multimodal distributions as
simple as a mixture of two Gaussians in one dimension, have a poor Poincaré
constant.
In this paper, we show a close connection between the mixing time of a broad
class of Markov processes with generator $\mathcal{L}$ and an appropriately
chosen generalized score matching loss that tries to fit $\frac{\mathcal{O}
p}{p}$. This allows us to adapt techniques to speed up Markov chains to
construct better score-matching losses. In particular, “preconditioning” the
diffusion can be translated to an appropriate “preconditioning” of the score
loss. Lifting the chain by adding a temperature like in simulated tempering can
be shown to result in a Gaussian-convolution annealed score matching loss,
similar to Song and Ermon, 2019. Moreover, we show that if the distribution
being learned is a finite mixture of Gaussians in $d$ dimensions with a shared
covariance, the sample complexity of annealed score matching is polynomial in
the ambient dimension, the diameter of the means, and the smallest and largest
eigenvalues of the covariance – obviating the Poincaré constant-based lower
bounds of the basic score matching loss shown in Koehler et al. 2022.
ArtFusion: Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models
June 15, 2023
Dar-Yen Chen
Arbitrary Style Transfer (AST) aims to transform images by adopting the style
from any selected artwork. Nonetheless, the need to accommodate diverse and
subjective user preferences poses a significant challenge. While some users
wish to preserve distinct content structures, others might favor a more
pronounced stylization. Despite advances in feed-forward AST methods, their
limited customizability hinders their practical application. We propose a new
approach, ArtFusion, which provides a flexible balance between content and
style. In contrast to traditional methods reliant on biased similarity losses,
ArtFusion utilizes our innovative Dual Conditional Latent Diffusion
Probabilistic Models (Dual-cLDM). This approach mitigates repetitive patterns
and enhances subtle artistic aspects like brush strokes and genre-specific
features. Despite the promising results of conditional diffusion probabilistic
models (cDM) in various generative tasks, their introduction to style transfer
is challenging due to the requirement for paired training data. ArtFusion
successfully navigates this issue, offering more practical and controllable
stylization. A key element of our approach involves using a single image for
both content and style during model training, all the while maintaining
effective stylization during inference. ArtFusion outperforms existing
approaches in controllability and faithful presentation of artistic
details, providing evidence of its superior style transfer capabilities.
Furthermore, the Dual-cLDM utilized in ArtFusion carries the potential for a
variety of complex multi-condition generative tasks, thus greatly broadening
the impact of our research.
Diffusion Models for Zero-Shot Open-Vocabulary Segmentation
June 15, 2023
Laurynas Karazija, Iro Laina, Andrea Vedaldi, Christian Rupprecht
The variety of objects in the real world is nearly unlimited and is thus
impossible to capture using models trained on a fixed set of categories. As a
result, in recent years, open-vocabulary methods have attracted the interest of
the community. This paper proposes a new method for zero-shot open-vocabulary
segmentation. Prior work largely relies on contrastive training using
image-text pairs, leveraging grouping mechanisms to learn image features that
are both aligned with language and well-localised. This however can introduce
ambiguity as the visual appearance of images with similar captions often
varies. Instead, we leverage the generative properties of large-scale
text-to-image diffusion models to sample a set of support images for a given
textual category. This provides a distribution of appearances for a given text
circumventing the ambiguity problem. We further propose a mechanism that
considers the contextual background of the sampled images to better localise
objects and segment the background directly. We show that our method can be
used to ground several existing pre-trained self-supervised feature extractors
in natural language and provide explainable predictions by mapping back to
regions in the support set. Our proposal is training-free, relying on
pre-trained components only, yet, shows strong performance on a range of
open-vocabulary segmentation benchmarks, obtaining a lead of more than 10% on
the Pascal VOC benchmark.
A Score-based Nonlinear Filter for Data Assimilation
June 15, 2023
Feng Bao, Zezhong Zhang, Guannan Zhang
We introduce a score-based generative sampling method for solving the
nonlinear filtering problem with robust accuracy. A major drawback of existing
nonlinear filtering methods, e.g., particle filters, is the low stability. To
overcome this issue, we adopt the diffusion model framework to solve the
nonlinear filtering problem. Instead of storing the information of the
filtering density in a finite number of Monte Carlo samples, in the score-based
filter we store the information of the filtering density in the score model.
Then, via the reverse-time diffusion sampler, we can generate unlimited samples
to characterize the filtering density. Moreover, with the powerful expressive
capabilities of deep neural networks, it has been demonstrated that a
well-trained score model in a diffusion framework can produce samples from complex target
distributions in very high dimensional spaces. Extensive numerical experiments
show that our score-based filter could potentially address the curse of
dimensionality in very high dimensional problems.
Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative Models
June 15, 2023
Gen Li, Yuting Wei, Yuxin Chen, Yuejie Chi
stat.ML, cs.IT, cs.LG, math.IT, math.ST, stat.TH
Diffusion models, which convert noise into new data instances by learning to
reverse a Markov diffusion process, have become a cornerstone in contemporary
generative modeling. While their practical power has now been widely
recognized, the theoretical underpinnings remain far from mature. In this work,
we develop a suite of non-asymptotic theory towards understanding the data
generation process of diffusion models in discrete time, assuming access to
$\ell_2$-accurate estimates of the (Stein) score functions. For a popular
deterministic sampler (based on the probability flow ODE), we establish a
convergence rate proportional to $1/T$ (with $T$ the total number of steps),
improving upon past results; for another mainstream stochastic sampler (i.e., a
type of the denoising diffusion probabilistic model), we derive a convergence
rate proportional to $1/\sqrt{T}$, matching the state-of-the-art theory.
Imposing only minimal assumptions on the target data distribution (e.g., no
smoothness assumption is imposed), our results characterize how $\ell_2$ score
estimation errors affect the quality of the data generation processes. In
contrast to prior works, our theory is developed based on an elementary yet
versatile non-asymptotic approach without resorting to toolboxes for SDEs and
ODEs. Further, we design two accelerated variants, improving the convergence to
$1/T^2$ for the ODE-based sampler and $1/T$ for the DDPM-type sampler, which
might be of independent theoretical and empirical interest.
Training Diffusion Classifiers with Denoising Assistance
June 15, 2023
Chandramouli Sastry, Sri Harsha Dumpala, Sageev Oore
Score-matching and diffusion models have emerged as state-of-the-art
generative models for both conditional and unconditional generation.
Classifier-guided diffusion models are created by training a classifier on
samples obtained from the forward-diffusion process (i.e., from data to noise).
In this paper, we propose denoising-assisted (DA) classifiers wherein the
diffusion classifier is trained using both noisy and denoised examples as
simultaneous inputs to the model. We differentiate between denoising-assisted
(DA) classifiers and noisy classifiers, which are diffusion classifiers that
are only trained on noisy examples. Our experiments on CIFAR-10 and ImageNet
show that DA-classifiers improve over noisy classifiers both quantitatively in
terms of generalization to test data and qualitatively in terms of
perceptually-aligned classifier-gradients and generative modeling metrics.
Finally, we describe a semi-supervised framework for training diffusion
classifiers and our experiments, that also include positive-unlabeled settings,
demonstrate improved generalization of DA-classifiers over noisy classifiers.
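A compact sketch of the denoising-assisted training step is given below: the classifier receives both the noisy image and its one-step denoised estimate (fused here by channel concatenation, which is an assumption) and is trained with cross-entropy.

```python
import torch
import torch.nn.functional as F

def da_classifier_loss(classifier, denoiser, x0, labels, alphas_cumprod):
    """Train a classifier on (noisy image, one-step denoised estimate) pairs."""
    b = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    with torch.no_grad():
        eps_hat = denoiser(x_t, t)
        x0_hat = (x_t - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()   # denoised estimate
    logits = classifier(torch.cat([x_t, x0_hat], dim=1), t)            # both inputs at once
    return F.cross_entropy(logits, labels)
```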
DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks in the Physical World
June 15, 2023
Caixin Kang, Yinpeng Dong, Zhengyi Wang, Shouwei Ruan, Yubo Chen, Hang Su, Xingxing Wei
cs.CV, cs.AI, cs.CR, cs.LG
Adversarial attacks, particularly patch attacks, pose significant threats to
the robustness and reliability of deep learning models. Developing reliable
defenses against patch attacks is crucial for real-world applications, yet
current research in this area is unsatisfactory. In this paper, we propose
DIFFender, a novel defense method that leverages a text-guided diffusion model
to defend against adversarial patches. DIFFender includes two main stages:
patch localization and patch restoration. In the localization stage, we find
and exploit an intriguing property of the diffusion model to precisely identify
the locations of adversarial patches. In the restoration stage, we employ the
diffusion model to reconstruct the adversarial regions in the images while
preserving the integrity of the visual content. Thanks to the former finding,
these two stages can be simultaneously guided by a unified diffusion model.
Thus, we can utilize the close interaction between them to improve the whole
defense performance. Moreover, we propose a few-shot prompt-tuning algorithm to
fine-tune the diffusion model, enabling the pre-trained diffusion model to
adapt to the defense task easily. We conduct extensive experiments on image
classification, face recognition, and further in the physical world,
demonstrating that our proposed method exhibits superior robustness under
strong adaptive attacks and generalizes well across various scenarios, diverse
classifiers, and multiple patch attack methods.
Unbalanced Diffusion Schrödinger Bridge
June 15, 2023
Matteo Pariset, Ya-Ping Hsieh, Charlotte Bunne, Andreas Krause, Valentin De Bortoli
Schr"odinger bridges (SBs) provide an elegant framework for modeling the
temporal evolution of populations in physical, chemical, or biological systems.
Such natural processes are commonly subject to changes in population size over
time due to the emergence of new species or birth and death events. However,
existing neural parameterizations of SBs such as diffusion Schrödinger
bridges (DSBs) are restricted to settings in which the endpoints of the
stochastic process are both probability measures and assume conservation of
mass constraints. To address this limitation, we introduce unbalanced DSBs
which model the temporal evolution of marginals with arbitrary finite mass.
This is achieved by deriving the time reversal of stochastic differential
equations with killing and birth terms. We present two novel algorithmic
schemes that comprise a scalable objective function for training unbalanced
DSBs and provide a theoretical analysis alongside challenging applications on
predicting heterogeneous molecular single-cell responses to various cancer
drugs and simulating the emergence and spread of new viral variants.
Relation-Aware Diffusion Model for Controllable Poster Layout Generation
June 15, 2023
Fengheng Li, An Liu, Wei Feng, Honghe Zhu, Yaoyu Li, Zheng Zhang, Jingjing Lv, Xin Zhu, Junjie Shen, Zhangang Lin, Jingping Shao
Poster layout is a crucial aspect of poster design. Prior methods primarily
focus on the correlation between visual content and graphic elements. However,
a pleasant layout should also consider the relationship between visual and
textual contents and the relationship between elements. In this study, we
introduce a relation-aware diffusion model for poster layout generation that
incorporates these two relationships in the generation process. Firstly, we
devise a visual-textual relation-aware module that aligns the visual and
textual representations across modalities, thereby enhancing the layout’s
efficacy in conveying textual information. Subsequently, we propose a geometry
relation-aware module that learns the geometry relationship between elements by
comprehensively considering contextual information. Additionally, the proposed
method can generate diverse layouts based on user constraints. To advance
research in this field, we have constructed a poster layout dataset named
CGL-Dataset V2. Our proposed method outperforms state-of-the-art methods on
CGL-Dataset V2. The data and code will be available at
https://github.com/liuan0803/RADM.
Annotator Consensus Prediction for Medical Image Segmentation with Diffusion Models
June 15, 2023
Tomer Amit, Shmuel Shichrur, Tal Shaharabany, Lior Wolf
A major challenge in the segmentation of medical images is the large inter-
and intra-observer variability in annotations provided by multiple experts. To
address this challenge, we propose a novel method for multi-expert prediction
using diffusion models. Our method leverages the diffusion-based approach to
incorporate information from multiple annotations and fuse it into a unified
segmentation map that reflects the consensus of multiple experts. We evaluate
the performance of our method on several datasets of medical segmentation
annotated by multiple experts and compare it with state-of-the-art methods. Our
results demonstrate the effectiveness and robustness of the proposed method.
Our code is publicly available at
https://github.com/tomeramit/Annotator-Consensus-Prediction.
When Hyperspectral Image Classification Meets Diffusion Models: An Unsupervised Feature Learning Framework
June 15, 2023
Jingyi Zhou, Jiamu Sheng, Jiayuan Fan, Peng Ye, Tong He, Bin Wang, Tao Chen
Learning effective spectral-spatial features is important for the
hyperspectral image (HSI) classification task, but the majority of existing HSI
classification methods still struggle to model complex spectral-spatial
relations and to characterize low-level details and high-level semantics
comprehensively. As a new class of record-breaking generative models, diffusion
models are capable of modeling complex relations to understand inputs well, as
well as learning both high-level and low-level visual features. Meanwhile, diffusion
models can capture more abundant features by taking advantage of the extra and
unique dimension of timestep t. In view of these, we propose an unsupervised
spectral-spatial feature learning framework based on the diffusion model for
HSI classification for the first time, named Diff-HSI. Specifically, we first
pretrain the diffusion model with unlabeled HSI patches for unsupervised
feature learning, and then exploit intermediate hierarchical features from
different timesteps for classification. For better using the abundant
timestep-wise features, we design a timestep-wise feature bank and a dynamic
feature fusion module to construct timestep-wise features, adaptively learning
informative multi-timestep representations. Finally, an ensemble of linear
classifiers is applied to perform HSI classification. Extensive experiments are
conducted on three public HSI datasets, and our results demonstrate that
Diff-HSI outperforms state-of-the-art supervised and unsupervised methods for
HSI classification.
Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
June 15, 2023
Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, Gal Chechik
Text-conditioned image generation models often generate incorrect
associations between entities and their visual attributes. This reflects an
impaired mapping between linguistic binding of entities and modifiers in the
prompt and visual binding of the corresponding elements in the generated image.
As one notable example, a query like “a pink sunflower and a yellow flamingo”
may incorrectly produce an image of a yellow sunflower and a pink flamingo. To
remedy this issue, we propose SynGen, an approach which first syntactically
analyses the prompt to identify entities and their modifiers, and then uses a
novel loss function that encourages the cross-attention maps to agree with the
linguistic binding reflected by the syntax. Specifically, we encourage large
overlap between attention maps of entities and their modifiers, and small
overlap with other entities and modifier words. The loss is optimized during
inference, without retraining or fine-tuning the model. Human evaluation on
three datasets, including one new and challenging set, demonstrates significant
improvements of SynGen compared with current state-of-the-art methods. This
work highlights how making use of sentence structure during inference can
efficiently and substantially improve the faithfulness of text-to-image
generation.
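The binding objective can be sketched as an overlap loss on normalized cross-attention maps: entity-modifier pairs are pulled together while unrelated token pairs are pushed apart. The overlap measure below (summed elementwise minimum) is an assumption; the paper defines its own distance between attention distributions, and the loss is minimized over the latent at inference time rather than by fine-tuning the model.

```python
import torch

def attention_binding_loss(attn_maps, pairs, unrelated):
    """attn_maps: (num_tokens, H, W) cross-attention maps; `pairs` holds
    (entity, modifier) index pairs to pull together, `unrelated` holds index
    pairs to push apart."""
    maps = attn_maps.flatten(1)
    maps = maps / maps.sum(dim=-1, keepdim=True).clamp(min=1e-8)   # per-token distributions
    def overlap(i, j):                                             # overlap in [0, 1]
        return torch.minimum(maps[i], maps[j]).sum()
    pos = torch.stack([1.0 - overlap(i, j) for i, j in pairs]).mean()
    neg = torch.stack([overlap(i, j) for i, j in unrelated]).mean()
    return pos + neg
```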
Taming Diffusion Models for Music-driven Conducting Motion Generation
June 15, 2023
Zhuoran Zhao, Jinbin Bai, Delong Chen, Debang Wang, Yubo Pan
eess.AS, cs.AI, cs.MM, cs.SD
Generating the motion of orchestral conductors from a given piece of symphony
music is a challenging task since it requires a model to learn semantic music
features and capture the underlying distribution of real conducting motion.
Prior works have applied Generative Adversarial Networks (GAN) to this task,
but the promising diffusion model, which recently showed its advantages in
terms of both training stability and output quality, has not been exploited in
this context. This paper presents Diffusion-Conductor, a novel DDIM-based
approach for music-driven conducting motion generation, which integrates the
diffusion model into a two-stage learning framework. We further propose a random
masking strategy to improve the feature robustness, and use a pair of geometric
loss functions to impose additional regularizations and increase motion
diversity. We also design several novel metrics, including Frechet Gesture
Distance (FGD) and Beat Consistency Score (BC) for a more comprehensive
evaluation of the generated motion. Experimental results demonstrate the
advantages of our model.
InfoDiffusion: Representation Learning with Information Maximizing Diffusion Models
June 14, 2023
Yingheng Wang, Yair Schiff, Aaron Gokaslan, Weishen Pan, Fei Wang, Christopher De Sa, Volodymyr Kuleshov
While diffusion models excel at generating high-quality samples, their latent
variables typically lack semantic meaning and are not suitable for
representation learning. Here, we propose InfoDiffusion, an algorithm that
augments diffusion models with low-dimensional latent variables that capture
high-level factors of variation in the data. InfoDiffusion relies on a learning
objective regularized with the mutual information between observed and hidden
variables, which improves latent space quality and prevents the latents from
being ignored by expressive diffusion-based decoders. Empirically, we find that
InfoDiffusion learns disentangled and human-interpretable latent
representations that are competitive with state-of-the-art generative and
contrastive methods, while retaining the high sample quality of diffusion
models. Our method enables manipulating the attributes of generated images and
has the potential to assist tasks that require exploring a learned latent space
to generate quality samples, e.g., generative design.
Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis
June 14, 2023
Zhiyu Jin, Xuli Shen, Bin Li, Xiangyang Xue
Diffusion models (DMs) have recently gained attention with state-of-the-art
performance in text-to-image synthesis. Abiding by the tradition in deep
learning, DMs are trained and evaluated on the images with fixed sizes.
However, users demand images with specific sizes and various aspect ratios.
This paper focuses on adapting text-to-image diffusion models to
handle such variety while maintaining visual fidelity. First, we observe that,
during the synthesis, lower resolution images suffer from incomplete object
portrayal, while higher resolution images exhibit repetitively disordered
presentation. Next, we establish a statistical relationship indicating that
attention entropy changes with token quantity, suggesting that models aggregate
spatial information in proportion to image resolution. The subsequent
interpretation of our observations is that objects are incompletely depicted
due to limited spatial information for low resolutions, while repetitively
disorganized presentation arises from redundant spatial information for high
resolutions. From this perspective, we propose a scaling factor to alleviate
the change of attention entropy and mitigate the defective pattern observed.
Extensive experimental results validate the efficacy of the proposed scaling
factor, enabling models to achieve better visual effects, image quality, and
text alignment. Notably, these improvements are achieved without additional
training or fine-tuning techniques.
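To make the entropy argument concrete, here is a minimal sketch of attention whose softmax temperature is adjusted with the token count. The specific factor sqrt(log n / log n_train) and the function name are assumptions chosen so that attention entropy stays roughly constant across resolutions, not necessarily the exact factor proposed in the paper.

import math
import torch
import torch.nn.functional as F

def scaled_attention(q, k, v, n_train_tokens):
    # Scaled dot-product attention whose temperature is adjusted by the ratio
    # of token counts, so attention entropy stays roughly stable when the image
    # (and hence token count) differs from training time. q, k, v: (B, tokens, dim).
    n_tokens = q.shape[1]
    d = q.shape[-1]
    # Assumed scaling factor: grows slowly with token count relative to training.
    scale = math.sqrt(math.log(n_tokens) / math.log(n_train_tokens)) / math.sqrt(d)
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

# Toy usage: a 64x64 latent grid at inference for a model trained on 32x32 grids.
torch.manual_seed(0)
q = k = v = torch.randn(1, 4096, 64)
out = scaled_attention(q, k, v, n_train_tokens=1024)
print(out.shape)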
TryOnDiffusion: A Tale of Two UNets
June 14, 2023
Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, Ira Kemelmacher-Shlizerman
Given two images depicting a person and a garment worn by another person, our
goal is to generate a visualization of how the garment might look on the input
person. A key challenge is to synthesize a photorealistic detail-preserving
visualization of the garment, while warping the garment to accommodate a
significant body pose and shape change across the subjects. Previous methods
either focus on garment detail preservation without effective pose and shape
variation, or allow try-on with the desired shape and pose but lack garment
details. In this paper, we propose a diffusion-based architecture that unifies
two UNets (referred to as Parallel-UNet), which allows us to preserve garment
details and warp the garment for significant pose and body change in a single
network. The key ideas behind Parallel-UNet include: 1) the garment is warped
implicitly via a cross-attention mechanism, 2) garment warp and person blend
happen as part of a unified process as opposed to a sequence of two separate
tasks. Experimental results indicate that TryOnDiffusion achieves
state-of-the-art performance both qualitatively and quantitatively.
On the Robustness of Latent Diffusion Models
June 14, 2023
Jianping Zhang, Zhuoer Xu, Shiwen Cui, Changhua Meng, Weibin Wu, Michael R. Lyu
Latent diffusion models achieve state-of-the-art performance on a variety of
generative tasks, such as image synthesis and image editing. However, the
robustness of latent diffusion models is not well studied. Previous works only
focus on the adversarial attacks against the encoder or the output image under
white-box settings, regardless of the denoising process. Therefore, in this
paper, we aim to analyze the robustness of latent diffusion models more
thoroughly. We first study the influence of the components inside latent
diffusion models on their white-box robustness. In addition to white-box
scenarios, we evaluate the black-box robustness of latent diffusion models via
transfer attacks, where we consider both prompt-transfer and model-transfer
settings and possible defense mechanisms. However, all these explorations need
a comprehensive benchmark dataset, which is missing in the literature.
Therefore, to facilitate the research of the robustness of latent diffusion
models, we propose two automatic dataset construction pipelines for two kinds
of image editing models and release the whole dataset. Our code and dataset are
available at \url{https://github.com/jpzhang1810/LDM-Robustness}.
Data Augmentation for Seizure Prediction with Generative Diffusion Model
June 14, 2023
Kai Shu, Yuchang Zhao, Le Wu, Aiping Liu, Ruobing Qian, Xun Chen
Objective: Seizure prediction is of great importance to improve the life of
patients. The focal point is to distinguish preictal states from interictal
ones. With the development of machine learning, seizure prediction methods have
achieved significant progress. However, the severe imbalance problem between
preictal and interictal data still poses a great challenge, restricting the
performance of classifiers. Data augmentation is an intuitive way to solve this
problem. Existing data augmentation methods generate samples by overlapping or
recombining data. The distribution of generated samples is limited by the original
data, because such transformations cannot fully explore the feature space and
offer new information. As the epileptic EEG representation varies among
seizures, these generated samples cannot provide enough diversity to achieve
high performance on a new seizure. As a consequence, we propose a novel data
augmentation method with diffusion model called DiffEEG. Methods: Diffusion
models are a class of generative models that consist of two processes.
Specifically, in the diffusion process, the model adds noise to the input EEG
sample step by step and converts the noisy sample into output random noise,
exploring the distribution of data by minimizing the loss between the output
and the noise added. In the denoising process, the model samples the synthetic
data by removing the noise gradually, diffusing the data distribution to
outward areas and narrowing the distance between different clusters. Results:
We compared DiffEEG with existing methods, and integrated them into three
representative classifiers. The experiments indicate that DiffEEG can further
improve their performance and shows superiority over existing methods. Conclusion:
This paper proposes a novel and effective method to solve the imbalance
problem and demonstrates the effectiveness and generality of our method.
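The diffusion and denoising description above corresponds to standard DDPM-style training applied to EEG windows. Below is a minimal, generic sketch of one such training step on multichannel EEG; the window size, channel count, noise schedule, and the stand-in model are placeholder assumptions, not DiffEEG's architecture.

import torch

def ddpm_training_step(model, eeg_batch, alphas_cumprod, opt):
    # One diffusion-process training step on EEG windows: add noise at a random
    # timestep and regress the added noise. model(x_t, t) is assumed to return
    # a noise prediction with the same shape as its input.
    b = eeg_batch.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,))
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(eeg_batch)
    x_t = a_bar.sqrt() * eeg_batch + (1 - a_bar).sqrt() * noise
    loss = ((model(x_t, t) - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage: 18-channel, 1024-sample EEG windows and a stand-in model that ignores t.
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
net = torch.nn.Conv1d(18, 18, 3, padding=1)
model_fn = lambda x, t: net(x)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
loss = ddpm_training_step(model_fn, torch.randn(8, 18, 1024), alphas_cumprod, opt)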
GBSD: Generative Bokeh with Stage Diffusion
June 14, 2023
Jieren Deng, Xin Zhou, Hao Tian, Zhihong Pan, Derek Aguiar
The bokeh effect is an artistic technique that blurs out-of-focus areas in a
photograph and has gained interest due to recent developments in text-to-image
synthesis and the ubiquity of smart-phone cameras and photo-sharing apps. Prior
work on rendering bokeh effects has focused on post hoc image manipulation to
produce similar blurring effects in existing photographs using classical
computer graphics or neural rendering techniques, but either suffers from depth
discontinuity artifacts or is restricted to reproducing bokeh effects that are
present in the training data. More recent diffusion based models can synthesize
images with an artistic style, but either require the generation of
high-dimensional masks, expensive fine-tuning, or affect global image
characteristics. In this paper, we present GBSD, the first generative
text-to-image model that synthesizes photorealistic images with a bokeh style.
Motivated by how image synthesis occurs progressively in diffusion models, our
approach combines latent diffusion models with a 2-stage conditioning algorithm
to render bokeh effects on semantically defined objects. Since we can focus the
effect on objects, this semantic bokeh effect is more versatile than classical
rendering techniques. We evaluate GBSD both quantitatively and qualitatively
and demonstrate its ability to be applied in both text-to-image and
image-to-image settings.
Adding 3D Geometry Control to Diffusion Models
June 13, 2023
Wufei Ma, Qihao Liu, Jiahao Wang, Xiaoding Yuan, Angtian Wang, Yi Zhang, Zihao Xiao, Guofeng Zhang, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu, Alan Yuille
Diffusion models have emerged as a powerful method of generative modeling
across a range of fields, capable of producing stunning photo-realistic images
from natural language descriptions. However, these models lack explicit control
over the 3D structure in the generated images. Consequently, this hinders our
ability to obtain detailed 3D annotations for the generated images or to craft
instances with specific poses and distances. In this paper, we propose a simple
yet effective method that incorporates 3D geometry control into diffusion
models. Our method exploits ControlNet, which extends diffusion models by using
visual prompts in addition to text prompts. We generate images of the 3D
objects taken from 3D shape repositories (e.g., ShapeNet and Objaverse), render
them from a variety of poses and viewing directions, compute the edge maps of
the rendered images, and use these edge maps as visual prompts to generate
realistic images. With explicit 3D geometry control, we can easily change the
3D structures of the objects in the generated images and obtain ground-truth 3D
annotations automatically. This allows us to improve a wide range of vision
tasks, e.g., classification and 3D pose estimation, in both in-distribution
(ID) and out-of-distribution (OOD) settings. We demonstrate the effectiveness
of our method through extensive experiments on ImageNet-100, ImageNet-R,
PASCAL3D+, ObjectNet3D, and OOD-CV. The results show that our method
significantly outperforms existing methods across multiple benchmarks, e.g.,
3.8 percentage points on ImageNet-100 using DeiT-B and 3.5 percentage points on
PASCAL3D+ & ObjectNet3D using NeMo.
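A minimal sketch of the recipe described above using the diffusers ControlNet pipeline: render a 3D asset at a known pose, compute its edge map, and use that edge map as the visual prompt. The checkpoint names, the input file render.png, the prompt, and the Canny thresholds are illustrative assumptions rather than the paper's exact setup.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Rendering of a 3D object at a chosen pose (produced beforehand from a shape repository).
render = cv2.imread("render.png")
edges = cv2.Canny(render, 100, 200)                     # edge map = geometry-only visual prompt
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The generated image inherits the 3D pose of the rendering, so the known camera
# pose of the render doubles as a ground-truth 3D annotation for the output.
image = pipe("a photo of a chair in a living room", image=edges,
             num_inference_steps=30).images[0]
image.save("generated.png")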
DORSal: Diffusion for Object-centric Representations of Scenes
June 13, 2023
Allan Jabri, Sjoerd van Steenkiste, Emiel Hoogeboom, Mehdi S. M. Sajjadi, Thomas Kipf
Recent progress in 3D scene understanding enables scalable learning of
representations across large datasets of diverse scenes. As a consequence,
generalization to unseen scenes and objects, rendering novel views from just a
single or a handful of input images, and controllable scene generation that
supports editing, are now possible. However, training jointly on a large number
of scenes typically compromises rendering quality when compared to single-scene
optimized models such as NeRFs. In this paper, we leverage recent progress in
diffusion models to equip 3D scene representation learning models with the
ability to render high-fidelity novel views, while retaining benefits such as
object-level scene editing to a large degree. In particular, we propose DORSal,
which adapts a video diffusion architecture for 3D scene generation conditioned
on frozen object-centric slot-based representations of scenes. On both complex
synthetic multi-object scenes and on the real-world large-scale Street View
dataset, we show that DORSal enables scalable neural rendering of 3D scenes
with object-level editing and improves upon existing approaches.
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
June 13, 2023
Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi
We present Viewset Diffusion, a diffusion-based generator that outputs 3D
objects while only using multi-view 2D data for supervision. We note that there
exists a one-to-one mapping between viewsets, i.e., collections of several 2D
views of an object, and 3D models. Hence, we train a diffusion model to
generate viewsets, but design the neural network generator to reconstruct
internally corresponding 3D models, thus generating those too. We fit a
diffusion model to a large number of viewsets for a given category of objects.
The resulting generator can be conditioned on zero, one or more input views.
Conditioned on a single view, it performs 3D reconstruction accounting for the
ambiguity of the task and allowing to sample multiple solutions compatible with
the input. The model performs reconstruction efficiently, in a feed-forward
manner, and is trained using only rendering losses, with as few as three views
per viewset. Project page: szymanowiczs.github.io/viewset-diffusion.
Paste, Inpaint and Harmonize via Denoising: Subject-Driven Image Editing with Pre-Trained Diffusion Model
June 13, 2023
Xin Zhang, Jiaxian Guo, Paul Yoo, Yutaka Matsuo, Yusuke Iwasawa
Text-to-image generative models have attracted rising attention for flexible
image editing via user-specified descriptions. However, text descriptions alone
are not enough to elaborate the details of subjects, often compromising the
subjects’ identity or requiring additional per-subject fine-tuning. We
introduce a new framework called \textit{Paste, Inpaint and Harmonize via
Denoising} (PhD), which leverages an exemplar image in addition to text
descriptions to specify user intentions. In the pasting step, an off-the-shelf
segmentation model is employed to identify a user-specified subject within an
exemplar image which is subsequently inserted into a background image to serve
as an initialization capturing both scene context and subject identity in one.
To guarantee the visual coherence of the generated or edited image, we
introduce an inpainting and harmonizing module to guide the pre-trained
diffusion model to seamlessly blend the inserted subject into the scene. As we
keep the pre-trained diffusion model frozen, we preserve its
strong image synthesis ability and text-driven ability, thus achieving
high-quality results and flexible editing with diverse texts. In our
experiments, we apply PhD to both subject-driven image editing tasks and
explore text-driven scene generation given a reference subject. Both
quantitative and qualitative comparisons with baseline methods demonstrate that
our approach achieves state-of-the-art performance in both tasks. More
qualitative results can be found at
\url{https://sites.google.com/view/phd-demo-page}.
I See Dead People: Gray-Box Adversarial Attack on Image-To-Text Models
June 13, 2023
Raz Lapid, Moshe Sipper
Modern image-to-text systems typically adopt the encoder-decoder framework,
which comprises two main components: an image encoder, responsible for
extracting image features, and a transformer-based decoder, used for generating
captions. Taking inspiration from the analysis of neural networks’ robustness
against adversarial perturbations, we propose a novel gray-box algorithm for
creating adversarial examples in image-to-text models. Unlike image
classification tasks that have a finite set of class labels, finding visually
similar adversarial examples in an image-to-text task poses greater challenges
because the captioning system allows for a virtually infinite space of possible
captions. In this paper, we present a gray-box adversarial attack on
image-to-text, both untargeted and targeted. We formulate the process of
discovering adversarial perturbations as an optimization problem that uses only
the image-encoder component, meaning the proposed attack is language-model
agnostic. Through experiments conducted on the ViT-GPT2 model, which is the
most-used image-to-text model on Hugging Face, and the Flickr30k dataset, we
demonstrate that our proposed attack successfully generates visually similar
adversarial examples, both with untargeted and targeted captions. Notably, our
attack operates in a gray-box manner, requiring no knowledge about the decoder
module. We also show that our attacks fool the popular open-source platform
Hugging Face.
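The untargeted variant can be sketched as a projected gradient attack that only ever queries the image encoder, pushing the perturbed image's features away from the clean ones. The stand-in encoder, step sizes, and perturbation budget below are assumptions; in practice the captioner's own ViT encoder would take the encoder's place.

import torch
import torch.nn.functional as F
from torchvision.models import resnet18

def untargeted_encoder_attack(encoder, x, eps=8/255, alpha=1/255, steps=20):
    # Gray-box, encoder-only PGD: maximize the feature distance between the
    # perturbed and clean image, never touching the language decoder.
    # x: (B, 3, H, W) with values in [0, 1].
    encoder.eval()
    with torch.no_grad():
        clean_feat = encoder(x)
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        feat = encoder((x + delta).clamp(0, 1))
        loss = -F.cosine_similarity(feat.flatten(1), clean_feat.flatten(1)).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the (negative similarity) loss
            delta.clamp_(-eps, eps)              # keep the perturbation imperceptible
            delta.grad.zero_()
    return (x + delta).detach().clamp(0, 1)

# Toy usage with a stand-in image encoder.
encoder = torch.nn.Sequential(*list(resnet18(weights=None).children())[:-1], torch.nn.Flatten())
x_adv = untargeted_encoder_attack(encoder, torch.rand(1, 3, 224, 224))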
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding
June 13, 2023
Chenpeng Du, Yiwei Guo, Feiyu Shen, Zhijun Liu, Zheng Liang, Xie Chen, Shuai Wang, Hui Zhang, Kai Yu
The utilization of discrete speech tokens, divided into semantic tokens and
acoustic tokens, has been proven superior to traditional acoustic feature
mel-spectrograms in terms of naturalness and robustness for text-to-speech
(TTS) synthesis. Recent popular models, such as VALL-E and SPEAR-TTS, allow
zero-shot speaker adaptation through auto-regressive (AR) continuation of
acoustic tokens extracted from a short speech prompt. However, these AR models
are restricted to generating speech only in a left-to-right direction, making
them unsuitable for speech editing where both preceding and following contexts
are provided. Furthermore, these models rely on acoustic tokens, which have
audio quality limitations imposed by the performance of audio codec models. In
this study, we propose a unified context-aware TTS framework called UniCATS,
which is capable of both speech continuation and editing. UniCATS comprises two
components, an acoustic model CTX-txt2vec and a vocoder CTX-vec2wav.
CTX-txt2vec employs contextual VQ-diffusion to predict semantic tokens from the
input text, enabling it to incorporate the semantic context and maintain
seamless concatenation with the surrounding context. Following that,
CTX-vec2wav utilizes contextual vocoding to convert these semantic tokens into
waveforms, taking into consideration the acoustic context. Our experimental
results demonstrate that CTX-vec2wav outperforms HifiGAN and AudioLM in terms
of speech resynthesis from semantic tokens. Moreover, we show that UniCATS
achieves state-of-the-art performance in both speech continuation and editing.
Deep Ultrasound Denoising Using Diffusion Probabilistic Models
June 12, 2023
Hojat Asgariandehkordi, Sobhan Goudarzi, Adrian Basarab, Hassan Rivaz
Ultrasound images are widespread in medical diagnosis for musculoskeletal,
cardiac, and obstetrical imaging due to the efficiency and non-invasiveness of
the acquisition methodology. However, the acquired images are degraded by
acoustic (e.g. reverberation and clutter) and electronic sources of noise. To
improve the Peak Signal to Noise Ratio (PSNR) of the images, previous denoising
methods often remove the speckles, which could be informative for radiologists
and also for quantitative ultrasound. Herein, a method based on the recent
Denoising Diffusion Probabilistic Models (DDPM) is proposed. It iteratively
enhances the image quality by eliminating the noise while preserving the
speckle texture. It is worth noting that the proposed method is trained in a
completely unsupervised manner, and no annotated data is required. The
experimental blind test results show that our method outperforms the previous
nonlocal means denoising methods in terms of PSNR and Generalized Contrast to
Noise Ratio (GCNR) while preserving speckles.
Controlling Text-to-Image Diffusion by Orthogonal Finetuning
June 12, 2023
Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf
cs.CV, cs.AI, cs.GR, cs.LG
Large text-to-image diffusion models have impressive capabilities in
generating photorealistic images from text prompts. How to effectively guide or
control these powerful models to perform different downstream tasks becomes an
important open problem. To tackle this challenge, we introduce a principled
finetuning method – Orthogonal Finetuning (OFT), for adapting text-to-image
diffusion models to downstream tasks. Unlike existing methods, OFT can provably
preserve hyperspherical energy which characterizes the pairwise neuron
relationship on the unit hypersphere. We find that this property is crucial for
preserving the semantic generation ability of text-to-image diffusion models.
To improve finetuning stability, we further propose Constrained Orthogonal
Finetuning (COFT) which imposes an additional radius constraint on the
hypersphere. Specifically, we consider two important finetuning text-to-image
tasks: subject-driven generation where the goal is to generate subject-specific
images given a few images of a subject and a text prompt, and controllable
generation where the goal is to enable the model to take in additional control
signals. We empirically show that our OFT framework outperforms existing
methods in generation quality and convergence speed.
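The key constraint, multiplying frozen weights by a learned orthogonal matrix, can be sketched with a Cayley parameterization of orthogonal matrices. The block-diagonal structure used for efficiency and COFT's radius constraint are omitted, and the class and variable names are assumptions.

import torch

class OrthogonalFinetune(torch.nn.Module):
    # Finetune a frozen linear weight by multiplying it with a learned orthogonal
    # matrix R = (I - S)^{-1}(I + S), where S = A - A^T is skew-symmetric (Cayley
    # transform). Orthogonality preserves the pairwise angles between neurons.
    def __init__(self, frozen_weight):
        super().__init__()
        d = frozen_weight.shape[0]
        self.register_buffer("w0", frozen_weight)       # pretrained weight, kept frozen
        self.a = torch.nn.Parameter(torch.zeros(d, d))  # only trainable parameter; R = I at init

    def forward(self, x):
        s = self.a - self.a.t()                          # skew-symmetric
        eye = torch.eye(s.shape[0], device=s.device)
        r = torch.linalg.solve(eye - s, eye + s)         # orthogonal by construction
        w = r @ self.w0                                  # rotated weight
        return x @ w.t()

# Toy usage: wrap a pretrained 64x32 weight; training starts exactly at the pretrained model.
torch.manual_seed(0)
layer = OrthogonalFinetune(torch.randn(64, 32))
out = layer(torch.randn(8, 32))
print(out.shape)  # (8, 64)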
Fill-Up: Balancing Long-Tailed Data with Generative Models
June 12, 2023
Joonghyuk Shin, Minguk Kang, Jaesik Park
Modern text-to-image synthesis models have achieved an exceptional level of
photorealism, generating high-quality images from arbitrary text descriptions.
In light of the impressive synthesis ability, several studies have exhibited
promising results in exploiting generated data for image recognition. However,
directly supplementing data-hungry situations in the real world (e.g. few-shot
or long-tailed scenarios) with existing approaches results in marginal
performance gains, as they fail to thoroughly reflect the distribution of the
real data. Through extensive experiments, this paper proposes a new image
synthesis pipeline for long-tailed situations using Textual Inversion. The
study demonstrates that generated images from textual-inverted text tokens
effectively align with the real domain, significantly enhancing the
recognition ability of a standard ResNet50 backbone. We also show that
real-world data imbalance scenarios can be successfully mitigated by filling up
the imbalanced data with synthetic images. In conjunction with techniques in
the area of long-tailed recognition, our method achieves state-of-the-art
results on standard long-tailed benchmarks when trained from scratch.
Latent Dynamical Implicit Diffusion Processes
June 12, 2023
Mohammad R. Rezaei
Latent dynamical models are commonly used to learn the distribution of a
latent dynamical process that represents a sequence of noisy data samples.
However, producing samples from such models with high fidelity is challenging
due to the complexity and variability of latent and observation dynamics.
Recent advances in diffusion-based generative models, such as DDPM and NCSN,
have shown promising alternatives to state-of-the-art latent generative models,
such as Neural ODEs, RNNs, and Normalizing flow networks, for generating
high-quality sequential samples from a prior distribution. However, their
application in modeling sequential data with latent dynamical models is yet to
be explored. Here, we propose a novel latent variable model named latent
dynamical implicit diffusion processes (LDIDPs), which utilizes implicit
diffusion processes to sample from dynamical latent processes and generate
sequential observation samples accordingly. We tested LDIDPs on synthetic and
simulated neural decoding problems. We demonstrate that LDIDPs can accurately
learn the dynamics over latent dimensions. Furthermore, the implicit sampling
method allows for the computationally efficient generation of high-quality
sequential data samples from the latent and observation spaces.
Fast Diffusion Model
June 12, 2023
Zike Wu, Pan Zhou, Kenji Kawaguchi, Hanwang Zhang
Diffusion models (DMs) have been adopted across diverse fields for their
remarkable ability to capture intricate data distributions. In this paper,
we propose a Fast Diffusion Model (FDM) to significantly speed up DMs from a
stochastic optimization perspective for both faster training and sampling. We
first find that the diffusion process of DMs accords with the stochastic
optimization process of stochastic gradient descent (SGD) on a stochastic
time-variant problem. Then, inspired by momentum SGD that uses both gradient
and an extra momentum to achieve faster and more stable convergence than SGD,
we integrate momentum into the diffusion process of DMs. This comes with a
unique challenge of deriving the noise perturbation kernel from the
momentum-based diffusion process. To this end, we frame the process as a Damped
Oscillation system whose critically damped state – the kernel solution –
avoids oscillation and yields a faster convergence speed of the diffusion
process. Empirical results show that our FDM can be applied to several popular
DM frameworks, e.g., VP, VE, and EDM, and reduces their training cost by about
50% with comparable image synthesis performance on CIFAR-10, FFHQ, and AFHQv2
datasets. Moreover, FDM decreases their sampling steps by about 3x to achieve
similar performance under the same samplers. The code is available at
https://github.com/sail-sg/FDM.
VillanDiffusion: A Unified Backdoor Attack Framework for Diffusion Models
June 12, 2023
Sheng-Yen Chou, Pin-Yu Chen, Tsung-Yi Ho
Diffusion Models (DMs) are state-of-the-art generative models that learn a
reversible corruption process from iterative noise addition and denoising. They
are the backbone of many generative AI applications, such as text-to-image
conditional generation. However, recent studies have shown that basic
unconditional DMs (e.g., DDPM and DDIM) are vulnerable to backdoor injection, a
type of output manipulation attack triggered by a maliciously embedded pattern
at model input. This paper presents a unified backdoor attack framework
(VillanDiffusion) to expand the current scope of backdoor analysis for DMs. Our
framework covers mainstream unconditional and conditional DMs (denoising-based
and score-based) and various training-free samplers for holistic evaluations.
Experiments show that our unified framework facilitates the backdoor analysis
of different DM configurations and provides new insights into caption-based
backdoor attacks on DMs. Our code is available on GitHub:
\url{https://github.com/IBM/villandiffusion}
On Kinetic Optimal Probability Paths for Generative Models
June 11, 2023
Neta Shaul, Ricky T. Q. Chen, Maximilian Nickel, Matt Le, Yaron Lipman
Recent successful generative models are trained by fitting a neural network
to an a-priori defined tractable probability density path taking noise to
training examples. In this paper we investigate the space of Gaussian
probability paths, which includes diffusion paths as an instance, and look for
an optimal member in some useful sense. In particular, minimizing the Kinetic
Energy (KE) of a path is known to make particles’ trajectories simple, hence
easier to sample, and empirically improve performance in terms of likelihood of
unseen data and sample generation quality. We investigate Kinetic Optimal (KO)
Gaussian paths and offer the following observations: (i) We show the KE takes a
simplified form on the space of Gaussian paths, where the data is incorporated
only through a single, one dimensional scalar function, called the \emph{data
separation function}. (ii) We characterize the KO solutions with a one
dimensional ODE. (iii) We approximate data-dependent KO paths by approximating
the data separation function and minimizing the KE. (iv) We prove that the data
separation function converges to $1$ in the general case of an arbitrary
normalized dataset consisting of $n$ samples in $d$ dimensions as
$n/\sqrt{d}\rightarrow 0$. A consequence of this result is that the Conditional
Optimal Transport (Cond-OT) path becomes \emph{kinetic optimal} as
$n/\sqrt{d}\rightarrow 0$. We further support this theory with empirical
experiments on ImageNet.
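For orientation, the kinetic energy of a probability path $(p_t, u_t)$ satisfying the continuity equation is typically defined as
\[
\mathrm{KE}(p,u) \;=\; \tfrac{1}{2}\int_0^1 \mathbb{E}_{x\sim p_t}\big[\|u_t(x)\|^2\big]\,\mathrm{d}t,
\qquad \partial_t p_t + \nabla\cdot(p_t u_t) = 0,
\]
and the Gaussian paths considered here condition on a training example $x_1$ via
\[
p_t(x\mid x_1) \;=\; \mathcal{N}\big(x;\, \alpha_t x_1,\; \sigma_t^2 I\big),
\]
with schedules $(\alpha_t, \sigma_t)$ interpolating between noise and data. These are the standard definitions given for context, not the paper's simplified expression of the KE on Gaussian paths.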
Language-Guided Traffic Simulation via Scene-Level Diffusion
June 10, 2023
Ziyuan Zhong, Davis Rempe, Yuxiao Chen, Boris Ivanovic, Yulong Cao, Danfei Xu, Marco Pavone, Baishakhi Ray
Realistic and controllable traffic simulation is a core capability that is
necessary to accelerate autonomous vehicle (AV) development. However, current
approaches for controlling learning-based traffic models require significant
domain expertise and are difficult for practitioners to use. To remedy this, we
present CTG++, a scene-level conditional diffusion model that can be guided by
language instructions. Developing this requires tackling two challenges: the
need for a realistic and controllable traffic model backbone, and an effective
method to interface with a traffic model using language. To address these
challenges, we first propose a scene-level diffusion model equipped with a
spatio-temporal transformer backbone, which generates realistic and
controllable traffic. We then harness a large language model (LLM) to convert a
user’s query into a loss function, guiding the diffusion model towards
query-compliant generation. Through comprehensive evaluation, we demonstrate
the effectiveness of our proposed method in generating realistic,
query-compliant traffic simulations.
Boosting GUI Prototyping with Diffusion Models
June 09, 2023
Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray
GUI (graphical user interface) prototyping is a widely-used technique in
requirements engineering for gathering and refining requirements, reducing
development risks and increasing stakeholder engagement. However, GUI
prototyping can be a time-consuming and costly process. In recent years, deep
learning models such as Stable Diffusion have emerged as a powerful
text-to-image tool capable of generating detailed images based on text prompts.
In this paper, we propose UI-Diffuser, an approach that leverages Stable
Diffusion to generate mobile UIs through simple textual descriptions and UI
components. Preliminary results show that UI-Diffuser provides an efficient and
cost-effective way to generate mobile GUI designs while reducing the need for
extensive prototyping efforts. This approach has the potential to significantly
improve the speed and efficiency of GUI prototyping in requirements
engineering.
Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models
June 08, 2023
Nan Liu, Yilun Du, Shuang Li, Joshua B. Tenenbaum, Antonio Torralba
Text-to-image generative models have enabled high-resolution image synthesis
across different domains, but require users to specify the content they wish to
generate. In this paper, we consider the inverse problem – given a collection
of different images, can we discover the generative concepts that represent
each image? We present an unsupervised approach to discover generative concepts
from a collection of images, disentangling different art styles in paintings,
objects, and lighting from kitchen scenes, and discovering image classes given
ImageNet images. We show how such generative concepts can accurately represent
the content of images, be recombined and composed to generate new artistic and
hybrid images, and be further used as a representation for downstream
classification tasks.
PriSampler: Mitigating Property Inference of Diffusion Models
June 08, 2023
Hailong Hu, Jun Pang
Diffusion models have been remarkably successful in data synthesis. Such
successes have also driven diffusion models to apply to sensitive data, such as
human face data, but this might bring about severe privacy concerns. In this
work, we systematically present the first privacy study about property
inference attacks against diffusion models, in which adversaries aim to extract
sensitive global properties of the training set from a diffusion model, such as
the proportion of the training data for certain sensitive properties.
Specifically, we consider the most practical attack scenario: adversaries are
only allowed to obtain synthetic data. Under this realistic scenario, we
evaluate the property inference attacks on different types of samplers and
diffusion models. A broad range of evaluations shows that various diffusion
models and their samplers are all vulnerable to property inference attacks.
Furthermore, one case study on off-the-shelf pre-trained diffusion models also
demonstrates the effectiveness of the attack in practice. Finally, we propose a
new model-agnostic plug-in method PriSampler to mitigate the property inference
of diffusion models. PriSampler can be directly applied to well-trained
diffusion models and support both stochastic and deterministic sampling.
Extensive experiments illustrate the effectiveness of our defense, reducing
adversaries' inference of property proportions to nearly random guessing.
PriSampler also shows its significantly superior performance to diffusion
models trained with differential privacy on both model utility and defense
performance.
SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions
June 08, 2023
Yuseung Lee, Kunho Kim, Hyunjin Kim, Minhyuk Sung
The remarkable capabilities of pretrained image diffusion models have been
utilized not only for generating fixed-size images but also for creating
panoramas. However, naive stitching of multiple images often results in visible
seams. Recent techniques have attempted to address this issue by performing
joint diffusions in multiple windows and averaging latent features in
overlapping regions. However, these approaches, which focus on seamless montage
generation, often yield incoherent outputs by blending different scenes within
a single image. To overcome this limitation, we propose SyncDiffusion, a
plug-and-play module that synchronizes multiple diffusions through gradient
descent from a perceptual similarity loss. Specifically, we compute the
gradient of the perceptual loss using the predicted denoised images at each
denoising step, providing meaningful guidance for achieving coherent montages.
Our experimental results demonstrate that our method produces significantly
more coherent outputs compared to previous methods (66.35% vs. 33.65% in our
user study) while still maintaining fidelity (as assessed by GIQA) and
compatibility with the input prompt (as measured by CLIP score). We further
demonstrate the versatility of our method across three plug-and-play
applications: layout-guided image generation, conditional image generation and
360-degree panorama generation. Our project page is at
https://syncdiffusion.github.io.
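A minimal sketch of the synchronization idea, assuming access to a function that maps a window's latent to its predicted clean (denoised) image at the current step; the LPIPS backbone, the anchor-window choice, and the gradient weight are illustrative assumptions.

import torch
import lpips

perc = lpips.LPIPS(net="vgg")  # perceptual similarity used as the coherence signal

def sync_step(latents, decode_x0, anchor_idx=0, weight=20.0):
    # Nudge every window's latent down the gradient of a perceptual loss between
    # its predicted clean image and the anchor window's, before the usual
    # denoising update. decode_x0 maps a latent to a predicted image in [-1, 1].
    with torch.no_grad():
        anchor_img = decode_x0(latents[anchor_idx])
    new_latents = []
    for i, z in enumerate(latents):
        if i == anchor_idx:
            new_latents.append(z)
            continue
        z = z.detach().requires_grad_(True)
        loss = perc(decode_x0(z), anchor_img).mean()
        grad, = torch.autograd.grad(loss, z)
        new_latents.append((z - weight * grad).detach())
    return new_latents

# Toy usage with a stand-in decoder (a real pipeline would predict x0 from the
# UNet's noise estimate and decode it with the VAE).
latents = [torch.randn(1, 3, 64, 64) for _ in range(3)]
decode = lambda z: torch.tanh(torch.nn.functional.interpolate(z, size=(256, 256), mode="bilinear"))
latents = sync_step(latents, decode)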
Multi-Architecture Multi-Expert Diffusion Models
June 08, 2023
Yunsung Lee, Jin-Young Kim, Hyojun Go, Myeongho Jeong, Shinhyeok Oh, Seungtaek Choi
In this paper, we address the performance degradation of efficient diffusion
models by introducing Multi-architecturE Multi-Expert diffusion models (MEME).
We identify the need for tailored operations at different time-steps in
diffusion processes and leverage this insight to create compact yet
high-performing models. MEME assigns distinct architectures to different
time-step intervals, balancing convolution and self-attention operations based
on observed frequency characteristics. We also introduce a soft interval
assignment strategy for comprehensive training. Empirically, MEME operates 3.3
times faster than baselines while improving image generation quality (FID
scores) by 0.62 (FFHQ) and 0.37 (CelebA). Beyond validating the effectiveness
of assigning a more suitable architecture per time-step interval, where efficient
models outperform larger ones, we argue that MEME opens a new design choice for
diffusion models that can be easily applied in other scenarios, such as large
multi-expert models.
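At its core, the multi-expert design is a routing table from timestep intervals to architectures. A minimal sketch follows, with the interval boundary and the stand-in experts as placeholder assumptions (the soft interval assignment used during training is omitted).

import bisect
import torch

class MultiExpertDenoiser(torch.nn.Module):
    # Route each diffusion timestep to the expert responsible for its interval.
    # Experts near high-noise timesteps could be convolution-heavy and experts
    # near low-noise timesteps attention-heavy; here both are stand-in layers.
    def __init__(self, experts, boundaries):
        super().__init__()
        assert len(experts) == len(boundaries) + 1
        self.experts = torch.nn.ModuleList(experts)
        self.boundaries = boundaries            # sorted timestep cut points

    def forward(self, x, t):
        idx = bisect.bisect_right(self.boundaries, int(t))
        return self.experts[idx](x)

# Two placeholder experts covering t in [0, 500) and [500, 1000).
experts = [torch.nn.Linear(16, 16), torch.nn.Linear(16, 16)]
model = MultiExpertDenoiser(experts, boundaries=[500])
out = model(torch.randn(4, 16), t=750)   # handled by the second expert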
Instructed Diffuser with Temporal Condition Guidance for Offline Reinforcement Learning
June 08, 2023
Jifeng Hu, Yanchao Sun, Sili Huang, SiYuan Guo, Hechang Chen, Li Shen, Lichao Sun, Yi Chang, Dacheng Tao
Recent works have shown the potential of diffusion models in computer vision
and natural language processing. Apart from the classical supervised learning
fields, diffusion models have also shown strong competitiveness in
reinforcement learning (RL) by formulating decision-making as sequential
generation. However, incorporating temporal information of sequential data and
utilizing it to guide diffusion models to perform better generation is still an
open challenge. In this paper, we take one step forward to investigate
controllable generation with temporal conditions that are refined from temporal
information. We observe the importance of temporal conditions in sequential
generation in sufficient explorative scenarios and provide a comprehensive
discussion and comparison of different temporal conditions. Based on the
observations, we propose an effective temporally-conditional diffusion model
coined Temporally-Composable Diffuser (TCD), which extracts temporal
information from interaction sequences and explicitly guides generation with
temporal conditions. Specifically, we separate the sequences into three parts
according to time expansion and identify historical, immediate, and prospective
conditions accordingly. Each condition preserves non-overlapping temporal
information of sequences, enabling more controllable generation when we jointly
use them to guide the diffuser. Finally, we conduct extensive experiments and
analysis to reveal the favorable applicability of TCD in offline RL tasks,
where our method reaches or matches the best performance of prior
SOTA baselines.
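The split into historical, immediate, and prospective conditions can be sketched directly; the window lengths and the transition format below are placeholder assumptions.

from typing import Dict, List

def temporal_conditions(trajectory: List[dict], t: int,
                        hist_len: int = 8, future_len: int = 8) -> Dict[str, list]:
    # Split an interaction sequence around decision time t into three
    # non-overlapping temporal conditions: historical (well before t),
    # immediate (the most recent transition), and prospective (the upcoming
    # portion, e.g. target returns), which jointly guide generation.
    return {
        "historical": trajectory[max(0, t - hist_len):max(0, t - 1)],
        "immediate": trajectory[max(0, t - 1):t],
        "prospective": trajectory[t:t + future_len],
    }

# Toy usage on a dummy trajectory of 20 transitions.
traj = [{"obs": i, "act": i % 3, "rew": 1.0} for i in range(20)]
conds = temporal_conditions(traj, t=10)
print({k: len(v) for k, v in conds.items()})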
Interpreting and Improving Diffusion Models Using the Euclidean Distance Function
June 08, 2023
Frank Permenter, Chenyang Yuan
cs.LG, cs.CV, math.OC, stat.ML
Denoising is intuitively related to projection. Indeed, under the manifold
hypothesis, adding random noise is approximately equivalent to orthogonal
perturbation. Hence, learning to denoise is approximately learning to project.
In this paper, we use this observation to reinterpret denoising diffusion
models as approximate gradient descent applied to the Euclidean distance
function. We then provide a straightforward convergence analysis of the DDIM
sampler under simple assumptions on the projection-error of the denoiser.
Finally, we propose a new sampler based on two simple modifications to DDIM
using insights from our theoretical results. In as few as 5-10 function
evaluations, our sampler achieves state-of-the-art FID scores on pretrained
CIFAR-10 and CelebA models and can generate high quality samples on latent
diffusion models.
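The interpretation above rests on a standard fact about distance functions: wherever the projection onto the data manifold $\mathcal{M}$ is unique,
\[
\nabla_x\, \tfrac{1}{2}\,\mathrm{dist}^2(x, \mathcal{M}) \;=\; x - \operatorname{proj}_{\mathcal{M}}(x),
\]
so if a trained denoiser satisfies $D_\theta(x_t) \approx \operatorname{proj}_{\mathcal{M}}(x_t)$, then a denoising update of the generic form
\[
x_{t-1} \;=\; x_t - \eta_t\,\big(x_t - D_\theta(x_t)\big)
\]
is approximately a gradient step on the squared distance to the manifold. The step sizes $\eta_t$ and the DDIM-specific form of this update are worked out in the paper; the display here only makes the stated intuition explicit.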
WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models
June 07, 2023
Changhoon Kim, Kyle Min, Maitreya Patel, Sheng Cheng, Yezhou Yang
The rapid advancement of generative models, facilitating the creation of
hyper-realistic images from textual descriptions, has concurrently escalated
critical societal concerns such as misinformation. Although providing some
mitigation, traditional fingerprinting mechanisms fall short in attributing
responsibility for the malicious use of synthetic images. This paper introduces
a novel approach to model fingerprinting that assigns responsibility for the
generated images, thereby serving as a potential countermeasure to model
misuse. Our method modifies generative models based on each user’s unique
digital fingerprint, imprinting a unique identifier onto the resultant content
that can be traced back to the user. This approach, incorporating fine-tuning
into Text-to-Image (T2I) tasks using the Stable Diffusion Model, demonstrates
near-perfect attribution accuracy with a minimal impact on output quality.
Through extensive evaluation, we show that our method outperforms baseline
methods with an average improvement of 11\% in handling image post-processing.
Our method presents a promising and novel avenue for accountable model
distribution and responsible use.
ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
June 07, 2023
Maitreya Patel, Tejas Gokhale, Chitta Baral, Yezhou Yang
The ability to understand visual concepts and replicate and compose these
concepts from images is a central goal for computer vision. Recent advances in
text-to-image (T2I) models have led to high-definition and realistic image
quality generation by learning from large databases of images and their
descriptions. However, the evaluation of T2I models has focused on photorealism
and limited qualitative measures of visual understanding. To quantify the
ability of T2I models in learning and synthesizing novel visual concepts, we
introduce ConceptBed, a large-scale dataset that consists of 284 unique visual
concepts, 5K unique concept compositions, and 33K composite text prompts. Along
with the dataset, we propose an evaluation metric, Concept Confidence Deviation
(CCD), that uses the confidence of oracle concept classifiers to measure the
alignment between concepts generated by T2I generators and concepts contained
in ground truth images. We evaluate visual concepts that are either objects,
attributes, or styles, and also evaluate four dimensions of compositionality:
counting, attributes, relations, and actions. Our human study shows that CCD is
highly correlated with human understanding of concepts. Our results point to a
trade-off between learning the concepts and preserving the compositionality
which existing approaches struggle to overcome.
On the Design Fundamentals of Diffusion Models: A Survey
June 07, 2023
Ziyi Chang, George Alex Koulieris, Hubert P. H. Shum
Diffusion models are generative models, which gradually add and remove noise
to learn the underlying distribution of training data for data generation. The
components of diffusion models have gained significant attention with many
design choices proposed. Existing reviews have primarily focused on
higher-level solutions, thereby covering less on the design fundamentals of
components. This study seeks to address this gap by providing a comprehensive
and coherent review on component-wise design choices in diffusion models.
Specifically, we organize this review according to their three key components,
namely the forward process, the reverse process, and the sampling procedure.
This allows us to provide a fine-grained perspective of diffusion models,
benefiting future studies in the analysis of individual components, the
applicability of design choices, and the implementation of diffusion models.
Multi-modal Latent Diffusion
June 07, 2023
Mustapha Bounoua, Giulio Franzese, Pietro Michiardi
Multi-modal data-sets are ubiquitous in modern applications, and multi-modal
Variational Autoencoders are a popular family of models that aim to learn a
joint representation of the different modalities. However, existing approaches
suffer from a coherence-quality tradeoff, where models with good generation
quality lack generative coherence across modalities, and vice versa. We discuss
the limitations underlying the unsatisfactory performance of existing methods,
to motivate the need for a different approach. We propose a novel method that
uses a set of independently trained, uni-modal, deterministic autoencoders.
Individual latent variables are concatenated into a common latent space, which
is fed to a masked diffusion model to enable generative modeling. We also
introduce a new multi-time training method to learn the conditional score
network for multi-modal diffusion. Our methodology substantially outperforms
competitors in both generation quality and coherence, as shown through an
extensive experimental campaign.
Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance
June 07, 2023
Gihyun Kwon, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion models have shown significant progress in image translation tasks
recently. However, due to their stochastic nature, there’s often a trade-off
between style transformation and content preservation. Current strategies aim
to disentangle style and content, preserving the source image’s structure while
successfully transitioning from a source to a target domain under text or
one-shot image conditions. Yet, these methods often require computationally
intense fine-tuning of diffusion models or additional neural networks. To
address these challenges, here we present an approach that guides the reverse
process of diffusion sampling by applying asymmetric gradient guidance. This
results in quicker and more stable image manipulation for both text-guided and
image-guided image translation. Our model’s adaptability allows it to be
implemented with both image- and latent-diffusion models. Experiments show that
our method outperforms various state-of-the-art models in image translation
tasks.
ScoreCL: Augmentation-Adaptive Contrastive Learning via Score-Matching Function
June 07, 2023
JinYoung Kim, Soonwoo Kwon, Hyojun Go, Yunsung Lee, Seungtaek Choi
Self-supervised contrastive learning (CL) has achieved state-of-the-art
performance in representation learning by minimizing the distance between
positive pairs while maximizing that of negative ones. Recently, it has been
verified that the model learns better representation with diversely augmented
positive pairs because they enable the model to be more view-invariant.
However, only a few studies on CL have considered the difference between
augmented views, and those have not gone beyond hand-crafted findings. In this
paper, we first observe that the score-matching function can measure how much
data has changed from the original through augmentation. With the observed
property, every pair in CL can be weighted adaptively by the difference of
score values, resulting in boosting the performance of the existing CL method.
We show the generality of our method, referred to as ScoreCL, by consistently
improving various CL methods, SimCLR, SimSiam, W-MSE, and VICReg, up to 3%p in
k-NN evaluation on CIFAR-10, CIFAR-100, and ImageNet-100. Moreover, we have
conducted exhaustive experiments and ablations, including results on diverse
downstream tasks, comparison with possible baselines, and improvement when used
with other proposed augmentation methods. We hope our exploration will inspire
more research in exploiting the score matching for CL.
A Survey on Generative Diffusion Models for Structured Data
June 07, 2023
Heejoon Koo, To Eun Kim
In recent years, generative diffusion models have achieved a rapid paradigm
shift in deep generative models by showing groundbreaking performance across
various applications. Meanwhile, structured data, encompassing tabular and time
series data, has received comparatively limited attention from the deep
learning research community, despite its omnipresence and extensive
applications. Thus, there is still a lack of literature and reviews on
structured data modelling via diffusion models, compared to other data
modalities such as visual and textual data. To address this gap, we present a
comprehensive review of recently proposed diffusion models in the field of
structured data. First, this survey provides a concise overview of the
score-based diffusion model theory, subsequently proceeding to the technical
descriptions of the majority of pioneering works that used structured data in
both data-driven general tasks and domain-specific applications. Thereafter, we
analyse and discuss the limitations and challenges shown in existing works and
suggest potential research directions. We hope this review serves as a catalyst
for the research community, promoting developments in generative diffusion
models for structured data.
Phoenix: A Federated Generative Diffusion Model
June 07, 2023
Fiona Victoria Stanley Jothiraj, Afra Mashhadi
Generative AI has made impressive strides in enabling users to create diverse
and realistic visual content such as images, videos, and audio. However,
training generative models on large centralized datasets can pose challenges in
terms of data privacy, security, and accessibility. Federated learning (FL) is
an approach that uses decentralized techniques to collaboratively train a
shared deep learning model while retaining the training data on individual edge
devices to preserve data privacy. This paper proposes a novel method for
training a Denoising Diffusion Probabilistic Model (DDPM) across multiple data
sources using FL techniques. Diffusion models, a newly emerging class of
generative models, show promising results, achieving higher-quality images than
Generative Adversarial Networks (GANs). Our proposed method Phoenix is an
unconditional diffusion model that leverages strategies to improve the data
diversity of generated samples even when trained on data with statistical
heterogeneity or Non-IID (Non-Independent and Identically Distributed) data. We
demonstrate how our approach outperforms the default diffusion model in an FL
setting. These results indicate that high-quality samples can be generated by
maintaining data diversity, preserving privacy, and reducing communication
between data sources, offering exciting new possibilities in the field of
generative AI.
Conditional Diffusion Models for Weakly Supervised Medical Image Segmentation
June 06, 2023
Xinrong Hu, Yu-Jen Chen, Tsung-Yi Ho, Yiyu Shi
Recent advances in denoising diffusion probabilistic models have shown great
success in image synthesis tasks. While there are already works exploring the
potential of this powerful tool in image semantic segmentation, its application
in weakly supervised semantic segmentation (WSSS) remains relatively
under-explored. Observing that conditional diffusion models (CDM) are capable of
generating images subject to specific distributions, in this work, we utilize
category-aware semantic information underlying the CDM to obtain the prediction mask
of the target object with only image-level annotations. More specifically, we
locate the desired class by approximating the derivative of the output of the CDM
w.r.t. the input condition. Our method is different from previous diffusion
model methods with guidance from an external classifier, which accumulates
noises in the background during the reconstruction process. Our method
outperforms state-of-the-art CAM and diffusion model methods on two public
medical image segmentation datasets, which demonstrates that CDM is a promising
tool in WSSS. Also, experiments show our method is more time-efficient than
existing diffusion model methods, making it practical for wider applications.
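One cheap proxy for the derivative of the model output with respect to the condition is a finite difference between conditional and unconditional noise predictions; the sketch below uses that proxy. The model signature, channel aggregation, and threshold are assumptions rather than the paper's exact procedure.

import torch

@torch.no_grad()
def class_saliency_mask(eps_model, x_t, t, cond, null_cond, threshold=0.5):
    # Approximate the sensitivity of a conditional diffusion model to its class
    # condition by a finite difference of noise predictions, then turn it into a
    # rough localization mask. eps_model(x, t, c) is assumed to return the
    # predicted noise; cond / null_cond are the class and null embeddings.
    diff = (eps_model(x_t, t, cond) - eps_model(x_t, t, null_cond)).abs()
    saliency = diff.mean(dim=1, keepdim=True)                     # aggregate over channels
    saliency = (saliency - saliency.amin()) / (saliency.amax() - saliency.amin() + 1e-8)
    return (saliency > threshold).float(), saliency

# Toy usage with a stand-in noise predictor.
eps_model = lambda x, t, c: x * 0.1 + c.view(1, -1, 1, 1)
mask, sal = class_saliency_mask(eps_model, torch.randn(1, 3, 32, 32), t=500,
                                cond=torch.tensor([0.5, 0.2, 0.1]),
                                null_cond=torch.zeros(3))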
Logic Diffusion for Knowledge Graph Reasoning
June 06, 2023
Xiaoying Xie, Biao Gong, Yiliang Lv, Zhen Han, Guoshuai Zhao, Xueming Qian
Most recent works focus on answering first-order logical queries to explore
knowledge graph reasoning via multi-hop logic predictions. However,
existing reasoning models are limited by the circumscribed logical paradigms of
training samples, which leads to weak generalization to unseen logic. To
address these issues, we propose a plug-in module called Logic Diffusion (LoD)
to discover unseen queries from surroundings and achieve dynamic equilibrium
between different kinds of patterns. The basic idea of LoD is relation
diffusion and sampling sub-logic by random walking as well as a special
training mechanism called gradient adaption. Besides, LoD is accompanied by a
novel loss function to further achieve robust logical diffusion when facing
noisy data in training or testing sets. Extensive experiments on four public
datasets demonstrate the superiority of mainstream knowledge graph reasoning
models with LoD over state-of-the-art. Moreover, our ablation study proves the
general effectiveness of LoD on the noise-rich knowledge graph.
DFormer: Diffusion-guided Transformer for Universal Image Segmentation
June 06, 2023
Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, Yanwei Pang
This paper introduces an approach, named DFormer, for universal image
segmentation. The proposed DFormer views the universal image segmentation task as a
denoising process using a diffusion model. DFormer first adds various levels of
Gaussian noise to ground-truth masks, and then learns a model to predict
denoising masks from corrupted masks. Specifically, we take deep pixel-level
features along with the noisy masks as inputs to generate mask features and
attention masks, employing a diffusion-based decoder to perform mask prediction
gradually. At inference, our DFormer directly predicts the masks and
corresponding categories from a set of randomly-generated masks. Extensive
experiments reveal the merits of our proposed contributions on different image
segmentation tasks: panoptic segmentation, instance segmentation, and semantic
segmentation. Our DFormer outperforms the recent diffusion-based panoptic
segmentation method Pix2Seq-D with a gain of 3.6% on MS COCO val2017 set.
Further, DFormer achieves promising semantic segmentation performance
outperforming the recent diffusion-based method by 2.2% on ADE20K val set. Our
source code and models will be publicly available at https://github.com/cp3wan/DFormer.
Protecting the Intellectual Property of Diffusion Models by the Watermark Diffusion Process
June 06, 2023
Sen Peng, Yufei Chen, Cong Wang, Xiaohua Jia
Diffusion models have rapidly become a vital part of deep generative
architectures, given today’s increasing demands. Obtaining large,
high-performance diffusion models demands significant resources, highlighting
their importance as intellectual property worth protecting. However, existing
watermarking techniques for ownership verification are insufficient when
applied to diffusion models. Very recent research in watermarking diffusion
models either exposes watermarks during task generation, which harms the
imperceptibility, or is developed for conditional diffusion models that require
prompts to trigger the watermark. This paper introduces WDM, a novel
watermarking solution for diffusion models without imprinting the watermark
during task generation. It involves training a model to concurrently learn a
Watermark Diffusion Process (WDP) for embedding watermarks alongside the
standard diffusion process for task generation. We provide a detailed
theoretical analysis of WDP training and sampling, relating it to a shifted
Gaussian diffusion process via the same reverse noise. Extensive experiments
are conducted to validate the effectiveness and robustness of our approach in
various trigger and watermark data configurations.
DreamSparse: Escaping from Plato’s Cave with 2D Diffusion Model Given Sparse Views
June 06, 2023
Paul Yoo, Jiaxian Guo, Yutaka Matsuo, Shixiang Shane Gu
Synthesizing novel view images from a few views is a challenging but
practical problem. Existing methods often struggle with producing high-quality
results or necessitate per-object optimization in such few-view settings due to
the insufficient information provided. In this work, we explore leveraging the
strong 2D priors in pre-trained diffusion models for synthesizing novel view
images. 2D diffusion models, nevertheless, lack 3D awareness, leading to
distorted image synthesis and compromising the identity. To address these
problems, we propose DreamSparse, a framework that enables the frozen
pre-trained diffusion model to generate geometry and identity-consistent novel
view image. Specifically, DreamSparse incorporates a geometry module designed
to capture 3D features from sparse views as a 3D prior. Subsequently, a spatial
guidance model is introduced to convert these 3D feature maps into spatial
information for the generative process. This information is then used to guide
the pre-trained diffusion model, enabling it to generate geometrically
consistent images without tuning it. Leveraging the strong image priors in the
pre-trained diffusion models, DreamSparse is capable of synthesizing
high-quality novel views for both object and scene-level images and
generalising to open-set images. Experimental results demonstrate that our
framework can effectively synthesize novel view images from sparse views and
outperforms baselines in both trained and open-set category images. More
results can be found on our project page:
https://sites.google.com/view/dreamsparse-webpage.
Brain tumor segmentation using synthetic MR images – A comparison of GANs and diffusion models
June 05, 2023
Muhammad Usman Akbar, Måns Larsson, Anders Eklund
Large annotated datasets are required for training deep learning models, but
in medical imaging data sharing is often complicated due to ethics,
anonymization and data protection legislation. Generative AI models, such as
generative adversarial networks (GANs) and diffusion models, can today produce
very realistic synthetic images, and can potentially facilitate data sharing.
However, in order to share synthetic medical images it must first be
demonstrated that they can be used for training different networks with
acceptable performance. Here, we therefore comprehensively evaluate four GANs
(progressive GAN, StyleGAN 1-3) and a diffusion model for the task of brain
tumor segmentation (using two segmentation networks, U-Net and a Swin
transformer). Our results show that segmentation networks trained on synthetic
images reach Dice scores that are 80%-90% of those obtained when training with
real images, but that memorization of the training images can be a problem for
diffusion models if the original dataset is too small. Our conclusion is that
sharing synthetic medical images is a viable alternative to sharing real images,
but that further work is required. The trained generative models and the
generated synthetic images are shared on the AIDA data hub.
Faster Training of Diffusion Models and Improved Density Estimation via Parallel Score Matching
June 05, 2023
Etrit Haxholli, Marco Lorenzi
In Diffusion Probabilistic Models (DPMs), the task of modeling the score
evolution via a single time-dependent neural network necessitates extended
training periods and may potentially impede modeling flexibility and capacity.
To counteract these challenges, we propose leveraging the independence of
learning tasks at different time points inherent to DPMs. More specifically, we
partition the learning task by utilizing independent networks, each dedicated
to learning the evolution of scores within a specific time sub-interval.
Further, inspired by residual flows, we extend this strategy to its logical
conclusion by employing separate networks to independently model the score at
each individual time point. As empirically demonstrated on synthetic and image
datasets, our approach not only significantly accelerates the training process
by introducing an additional layer of parallelization atop data
parallelization, but it also enhances density estimation performance when
compared to the conventional training methodology for DPMs.
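To make the partitioning idea concrete, here is a minimal sketch (not the authors' code) of routing each timestep to an independent score network responsible for its sub-interval of [0, 1]; the `make_net` constructor and the uniform-interval routing are assumptions for illustration.

```python
import torch
import torch.nn as nn

def make_subinterval_models(num_intervals, make_net):
    # One independently trained score network per time sub-interval of [0, 1].
    return nn.ModuleList([make_net() for _ in range(num_intervals)])

def score(models, x, t):
    # Route each (x, t) pair to the network that owns t's sub-interval.
    num_intervals = len(models)
    idx = torch.clamp((t * num_intervals).long(), max=num_intervals - 1)
    out = torch.zeros_like(x)
    for k in range(num_intervals):
        mask = idx == k
        if mask.any():
            out[mask] = models[k](x[mask], t[mask])
    return out
```

Because each network only ever sees its own sub-interval, the per-interval training losses are independent and can be optimized on separate devices in parallel.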
Enhance Diffusion to Improve Robust Generalization
June 05, 2023
Jianhui Sun, Sanchit Sinha, Aidong Zhang
Deep neural networks are susceptible to human imperceptible adversarial
perturbations. One of the strongest defense mechanisms is \emph{Adversarial
Training} (AT). In this paper, we aim to address two predominant problems in
AT. First, there is still little consensus on how to set hyperparameters with a
performance guarantee for AT research, and customized settings impede a fair
comparison between different model designs in AT research. Second, the robustly
trained neural networks struggle to generalize well and suffer from tremendous
overfitting. This paper focuses on the primary AT framework - Projected
Gradient Descent Adversarial Training (PGD-AT). We approximate the dynamic of
PGD-AT by a continuous-time Stochastic Differential Equation (SDE), and show
that the diffusion term of this SDE determines the robust generalization. An
immediate implication of this theoretical finding is that robust generalization
is positively correlated with the ratio between learning rate and batch size.
We further propose a novel approach, \emph{Diffusion Enhanced Adversarial
Training} (DEAT), to manipulate the diffusion term to improve robust
generalization with virtually no extra computational burden. We theoretically
show that DEAT obtains a tighter generalization bound than PGD-AT. Our
empirical investigation is extensive and firmly attests that DEAT universally
outperforms PGD-AT by a significant margin.
SwinRDM: Integrate SwinRNN with Diffusion Model towards High-Resolution and High-Quality Weather Forecasting
June 05, 2023
Lei Chen, Fei Du, Yuan Hu, Fan Wang, Zhibin Wang
cs.AI, cs.CV, physics.ao-ph
Data-driven medium-range weather forecasting has attracted much attention in
recent years. However, the forecasting accuracy at high resolution is
unsatisfactory currently. Pursuing high-resolution and high-quality weather
forecasting, we develop a data-driven model SwinRDM which integrates an
improved version of SwinRNN with a diffusion model. SwinRDM performs
predictions at 0.25-degree resolution and achieves superior forecasting
accuracy to IFS (Integrated Forecast System), the state-of-the-art operational
NWP model, on representative atmospheric variables including 500 hPa
geopotential (Z500), 850 hPa temperature (T850), 2-m temperature (T2M), and
total precipitation (TP), at lead times of up to 5 days. We propose to leverage
a two-step strategy to achieve high-resolution predictions at 0.25-degree
considering the trade-off between computation memory and forecasting accuracy.
Recurrent predictions for future atmospheric fields are firstly performed at
1.40625-degree resolution, and then a diffusion-based super-resolution model is
leveraged to recover the high spatial resolution and finer-scale atmospheric
details. SwinRDM pushes forward the performance and potential of data-driven
models by a large margin towards operational applications.
Stable Diffusion is Unstable
June 05, 2023
Chengbin Du, Yanxi Li, Zhongwei Qiu, Chang Xu
Recently, text-to-image models have been thriving. Despite their powerful
generative capacity, our research has uncovered a lack of robustness in this
generation process. Specifically, the introduction of small perturbations to
the text prompts can result in the blending of primary subjects with other
categories or their complete disappearance in the generated images. In this
paper, we propose Auto-attack on Text-to-image Models (ATM), a gradient-based
approach, to effectively and efficiently generate such perturbations. By
learning a Gumbel Softmax distribution, we can make the discrete process of
word replacement or extension continuous, thus ensuring the differentiability
of the perturbation generation. Once the distribution is learned, ATM can
sample multiple attack samples simultaneously. These attack samples can prevent
the generative model from generating the desired subjects without compromising
image quality. ATM has achieved a 91.1% success rate in short-text attacks and
an 81.2% success rate in long-text attacks. Further empirical analysis revealed
four attack patterns based on: 1) the variability in generation speed, 2) the
similarity of coarse-grained characteristics, 3) the polysemy of words, and 4)
the positioning of words.
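The Gumbel-Softmax relaxation at the core of ATM can be sketched as follows (a hedged illustration, not the authors' implementation; the vocabulary size, embedding table, and shapes are placeholders):

```python
import torch
import torch.nn.functional as F

def soft_token_embeddings(attack_logits, embedding_table, tau=0.5, hard=False):
    # attack_logits: (seq_len, vocab) learnable parameters of the attack
    # embedding_table: (vocab, dim) frozen token embeddings of the text encoder
    # Gumbel-Softmax makes the discrete word choice differentiable.
    y = F.gumbel_softmax(attack_logits, tau=tau, hard=hard)  # (seq_len, vocab)
    return y @ embedding_table                               # (seq_len, dim)

# Hypothetical usage: optimize attack_logits with gradients flowing from an
# image-space objective (e.g., "the prompted subject disappears") back through
# the soft embeddings fed into the frozen text-to-image model.
attack_logits = torch.zeros(8, 1000, requires_grad=True)
embedding_table = torch.randn(1000, 64)
soft_emb = soft_token_embeddings(attack_logits, embedding_table)
```

Once the distribution over replacements is learned, discrete attack prompts can be drawn by sampling (or by setting `hard=True`) without further optimization.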
Video Diffusion Models with Local-Global Context Guidance
June 05, 2023
Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang, You He
Diffusion models have emerged as a powerful paradigm in video synthesis tasks
including prediction, generation, and interpolation. Due to the limitation of
the computational budget, existing methods usually implement conditional
diffusion models with an autoregressive inference pipeline, in which the future
fragment is predicted based on the distribution of adjacent past frames.
However, conditioning on only a few previous frames cannot capture global
temporal coherence, leading to inconsistent or even implausible results
in long-term video prediction. In this paper, we propose a Local-Global Context
guided Video Diffusion model (LGC-VD) to capture multi-perception conditions
for producing high-quality videos in both conditional/unconditional settings.
In LGC-VD, the UNet is implemented with stacked residual blocks with
self-attention units, avoiding the undesirable computational cost in 3D Conv.
We construct a local-global context guidance strategy to capture the
multi-perceptual embedding of the past fragment to boost the consistency of
future prediction. Furthermore, we propose a two-stage training strategy to
alleviate the effect of noisy frames for more stable predictions. Our
experiments demonstrate that the proposed method achieves favorable performance
on video prediction, interpolation, and unconditional video generation. We
release code at https://github.com/exisas/LGC-VD.
PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model
June 05, 2023
Yizhe Zhang, Jiatao Gu, Zhuofeng Wu, Shuangfei Zhai, Josh Susskind, Navdeep Jaitly
Autoregressive models for text sometimes generate repetitive and low-quality
output because errors accumulate during the steps of generation. This issue is
often attributed to exposure bias - the difference between how a model is
trained, and how it is used during inference. Denoising diffusion models
provide an alternative approach in which a model can revisit and revise its
output. However, they can be computationally expensive and prior efforts on
text have led to models that produce less fluent output compared to
autoregressive models, especially for longer text and paragraphs. In this
paper, we propose PLANNER, a model that combines latent semantic diffusion with
autoregressive generation, to generate fluent text while exercising global
control over paragraphs. The model achieves this by combining an autoregressive
“decoding” module with a “planning” module that uses latent diffusion to
generate semantic paragraph embeddings in a coarse-to-fine manner. The proposed
method is evaluated on various conditional generation tasks, and results on
semantic generation, text completion and summarization show its effectiveness
in generating high-quality long-form text in an efficient manner.
Temporal Dynamic Quantization for Diffusion Models
June 04, 2023
Junhyuk So, Jungwon Lee, Daehyun Ahn, Hyungjun Kim, Eunhyeok Park
The diffusion model has gained popularity in vision applications due to its
remarkable generative performance and versatility. However, high storage and
computation demands, resulting from the model size and iterative generation,
hinder its use on mobile devices. Existing quantization techniques struggle to
maintain performance even in 8-bit precision due to the diffusion model’s
unique property of temporal variation in activation. We introduce a novel
quantization method that dynamically adjusts the quantization interval based on
time step information, significantly improving output quality. Unlike
conventional dynamic quantization techniques, our approach has no computational
overhead during inference and is compatible with both post-training
quantization (PTQ) and quantization-aware training (QAT). Our extensive
experiments demonstrate substantial improvements in output quality with the
quantized diffusion model across various datasets.
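A minimal sketch of a timestep-dependent quantization interval (an illustrative fake-quantization module, not the paper's method; the per-timestep parameterization and the straight-through estimator are assumptions):

```python
import torch
import torch.nn as nn

class TimestepQuant(nn.Module):
    """Fake-quantize activations with an interval (scale) chosen per timestep."""
    def __init__(self, num_timesteps, n_bits=8):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(num_timesteps))  # one interval per t
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, x, t):
        # t: (batch,) integer timesteps; broadcast the scale over feature dims
        scale = self.log_scale[t].exp().view(-1, *([1] * (x.dim() - 1)))
        x_int = torch.clamp(torch.round(x / scale), -self.qmax - 1, self.qmax)
        x_q = x_int * scale
        # Straight-through estimator keeps the interval trainable (QAT-style)
        # while inference only needs an indexed lookup of the scale.
        return x + (x_q - x).detach()
```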
Training Data Attribution for Diffusion Models
June 03, 2023
Zheng Dai, David K Gifford
Diffusion models have become increasingly popular for synthesizing
high-quality samples based on training datasets. However, given the oftentimes
enormous sizes of the training datasets, it is difficult to assess how training
data impact the samples produced by a trained diffusion model. The difficulty
of relating diffusion model inputs and outputs poses significant challenges to
model explainability and training data attribution. Here we propose a novel
solution that reveals how training data influence the output of diffusion
models through the use of ensembles. In our approach individual models in an
encoded ensemble are trained on carefully engineered splits of the overall
training data to permit the identification of influential training examples.
The resulting model ensembles enable efficient ablation of training data
influence, allowing us to assess the impact of training data on model outputs.
We demonstrate the viability of these ensembles as generative models and the
validity of our approach to assessing influence.
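One way to picture ensemble-based attribution is the following sketch (purely illustrative; the actual split design, influence estimator, and model outputs used by the authors may differ):

```python
import numpy as np

def split_influence(loss_matrix, membership):
    # loss_matrix: (num_models, num_queries) loss of each ensemble member on
    #              each query (e.g., a generated sample to attribute)
    # membership:  (num_models, num_splits) boolean; True if the model was
    #              trained on that split of the training data
    # Returns (num_splits, num_queries): the loss gap between models that did
    # not see a split and models that did; larger gaps suggest more influence.
    scores = []
    for s in range(membership.shape[1]):
        with_split = loss_matrix[membership[:, s]].mean(axis=0)
        without_split = loss_matrix[~membership[:, s]].mean(axis=0)
        scores.append(without_split - with_split)
    return np.stack(scores)
```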
Variational Gaussian Process Diffusion Processes
June 03, 2023
Prakhar Verma, Vincent Adam, Arno Solin
Diffusion processes are a class of stochastic differential equations (SDEs)
providing a rich family of expressive models that arise naturally in dynamic
modelling tasks. Probabilistic inference and learning under generative models
with latent processes endowed with a non-linear diffusion process prior are
intractable problems. We build upon work within variational inference,
approximating the posterior process as a linear diffusion process, and point
out pathologies in the approach. We propose an alternative parameterization of
the Gaussian variational process using a site-based exponential family
description. This allows us to trade a slow inference algorithm with
fixed-point iterations for a fast algorithm for convex optimization akin to
natural gradient descent, which also provides a better objective for learning
model parameters.
DYffusion: A Dynamics-informed Diffusion Model for Spatiotemporal Forecasting
June 03, 2023
Salva Rühling Cachay, Bo Zhao, Hailey Joren, Rose Yu
While diffusion models can successfully generate data and make predictions,
they are predominantly designed for static images. We propose an approach for
efficiently training diffusion models for probabilistic spatiotemporal
forecasting, where generating stable and accurate rollout forecasts remains
challenging. Our method, DYffusion, leverages the temporal dynamics in the
data, directly coupling it with the diffusion steps in the model. We train a
stochastic, time-conditioned interpolator and a forecaster network that mimic
the forward and reverse processes of standard diffusion models, respectively.
DYffusion naturally facilitates multi-step and long-range forecasting, allowing
for highly flexible, continuous-time sampling trajectories and the ability to
trade off performance for accelerated sampling at inference time. In addition,
the dynamics-informed diffusion process in DYffusion imposes a strong inductive
bias and significantly improves computational efficiency compared to
traditional Gaussian noise-based diffusion models. Our approach performs
competitively on probabilistic forecasting of complex dynamics in sea surface
temperatures, Navier-Stokes flows, and spring mesh systems.
Unlearnable Examples for Diffusion Models: Protect Data from Unauthorized Exploitation
June 02, 2023
Zhengyue Zhao, Jinhao Duan, Xing Hu, Kaidi Xu, Chenan Wang, Rui Zhang, Zidong Du, Qi Guo, Yunji Chen
Diffusion models have demonstrated remarkable performance in image generation
tasks, paving the way for powerful AIGC applications. However, these
widely-used generative models can also raise security and privacy concerns,
such as copyright infringement, and sensitive data leakage. To tackle these
issues, we propose a method, Unlearnable Diffusion Perturbation, to safeguard
images from unauthorized exploitation. Our approach involves designing an
algorithm to generate sample-wise perturbation noise for each image to be
protected. This imperceptible protective noise makes the data almost
unlearnable for diffusion models, i.e., diffusion models trained or fine-tuned
on the protected data cannot generate high-quality and diverse images related
to the protected training data. Theoretically, we frame this as a max-min
optimization problem and introduce EUDP, a noise scheduler-based method to
enhance the effectiveness of the protective noise. We evaluate our methods on
both Denoising Diffusion Probabilistic Model and Latent Diffusion Models,
demonstrating that training diffusion models on the protected data leads to a
significant reduction in the quality of the generated images. Especially, the
experimental results on Stable Diffusion demonstrate that our method
effectively safeguards images from being used to train Diffusion Models in
various tasks, such as training specific objects and styles. This achievement
holds significant importance in real-world scenarios, as it contributes to the
protection of privacy and copyright against AI-generated content.
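A heavily hedged sketch of the bounded, sample-wise perturbation machinery (a generic projected-gradient loop against a surrogate diffusion loss; the paper's exact max-min objective and the EUDP noise-scheduler weighting are not reproduced here):

```python
import torch

def protective_perturbation(x0, surrogate_model, diffusion_loss,
                            steps=50, eps=8 / 255, alpha=2 / 255, ascend=True):
    # x0: images to protect; surrogate_model: a frozen diffusion model;
    # diffusion_loss: callable returning the standard denoising training loss.
    delta = torch.zeros_like(x0, requires_grad=True)
    sign = 1.0 if ascend else -1.0
    for _ in range(steps):
        loss = diffusion_loss(surrogate_model, x0 + delta)
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta += sign * alpha * grad.sign()          # gradient step on the noise
            delta.clamp_(-eps, eps)                      # keep it imperceptible
            delta.copy_((x0 + delta).clamp(0, 1) - x0)   # stay in valid pixel range
    return delta.detach()
```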
DiffECG: A Generalized Probabilistic Diffusion Model for ECG Signals Synthesis
June 02, 2023
Nour Neifar, Achraf Ben-Hamadou, Afef Mdhaffar, Mohamed Jmaiel
In recent years, deep generative models have gained attention as a promising
data augmentation solution for heart disease detection using deep learning
approaches applied to ECG signals. In this paper, we introduce a novel approach
based on denoising diffusion probabilistic models for ECG synthesis that covers
three scenarios: heartbeat generation, partial signal completion, and full
heartbeat forecasting. Our approach represents the first generalized
conditional approach for ECG synthesis, and our experimental results
demonstrate its effectiveness for various ECG-related tasks. Moreover, we show
that our approach outperforms other state-of-the-art ECG generative models and
can enhance the performance of state-of-the-art classifiers.
Denoising Diffusion Semantic Segmentation with Mask Prior Modeling
June 02, 2023
Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang
The evolution of semantic segmentation has long been dominated by learning
more discriminative image representations for classifying each pixel. Despite
the prominent advancements, the priors of segmentation masks themselves, e.g.,
geometric and semantic constraints, are still under-explored. In this paper, we
propose to ameliorate the semantic segmentation quality of existing
discriminative approaches with a mask prior modeled by a recently-developed
denoising diffusion generative model. Beginning with a unified architecture
that adapts diffusion models for mask prior modeling, we focus this work on a
specific instantiation with discrete diffusion and identify a variety of key
design choices for its successful application. Our exploratory analysis
revealed several important findings, including: (1) a simple integration of
diffusion models into semantic segmentation is not sufficient, and a
poorly-designed diffusion process might lead to degradation in segmentation
performance; (2) during the training, the object to which noise is added is
more important than the type of noise; (3) during the inference, the strict
diffusion denoising scheme may not be essential and can be relaxed to a simpler
scheme that even works better. We evaluate the proposed prior modeling with
several off-the-shelf segmentors, and our experimental results on ADE20K and
Cityscapes demonstrate that our approach achieves competitive quantitative
performance and more appealing visual quality.
GANs Settle Scores!
June 02, 2023
Siddarth Asokan, Nishanth Shetty, Aadithya Srikanth, Chandra Sekhar Seelamantula
Generative adversarial networks (GANs) comprise a generator, trained to learn
the underlying distribution of the desired data, and a discriminator, trained
to distinguish real samples from those output by the generator. A majority of
GAN literature focuses on understanding the optimality of the discriminator
through integral probability metric (IPM) or divergence based analysis. In this
paper, we propose a unified approach to analyzing the generator optimization
through a variational approach. In $f$-divergence-minimizing GANs, we show that
the optimal generator is the one that matches the score of its output
distribution with that of the data distribution, while in IPM GANs, we show
that this optimal generator matches score-like functions, involving the
flow-field of the kernel associated with a chosen IPM constraint space.
Further, the IPM-GAN optimization can be seen as one of smoothed
score-matching, where the scores of the data and the generator distributions
are convolved with the kernel associated with the constraint. The proposed
approach serves to unify score-based training and existing GAN flavors,
leveraging results from normalizing flows, while also providing explanations
for empirical phenomena such as the stability of non-saturating GAN losses.
Based on these results, we propose novel alternatives to $f$-GAN and IPM-GAN
training based on score and flow matching, and discriminator-guided Langevin
sampling.
PolyDiffuse: Polygonal Shape Reconstruction via Guided Set Diffusion Models
June 02, 2023
Jiacheng Chen, Ruizhi Deng, Yasutaka Furukawa
This paper presents PolyDiffuse, a novel structured reconstruction algorithm
that transforms visual sensor data into polygonal shapes with Diffusion Models
(DM), an emerging machinery amid exploding generative AI, while formulating
reconstruction as a generation process conditioned on sensor data. The task of
structured reconstruction poses two fundamental challenges to DM: 1) A
structured geometry is a "set" (e.g., a set of polygons for a floorplan
geometry), where a sample of $N$ elements has $N!$ different but equivalent
representations, making the denoising highly ambiguous; and 2) A
"reconstruction" task has a single solution, where an initial noise needs to
be chosen carefully, while any initial noise works for a generation task. Our
technical contribution is the introduction of a Guided Set Diffusion Model
where 1) the forward diffusion process learns guidance networks to control
noise injection so that one representation of a sample remains distinct from
its other permutation variants, thus resolving denoising ambiguity; and 2) the
reverse denoising process reconstructs polygonal shapes, initialized and
directed by the guidance networks, as a conditional generation process subject
to the sensor data. We have evaluated our approach for reconstructing two types
of polygonal shapes: floorplan as a set of polygons and HD map for autonomous
cars as a set of polylines. Through extensive experiments on standard
benchmarks, we demonstrate that PolyDiffuse significantly advances the current
state of the art and enables broader practical applications.
Quantifying Sample Anonymity in Score-Based Generative Models with Adversarial Fingerprinting
June 02, 2023
Mischa Dombrowski, Bernhard Kainz
Recent advances in score-based generative models have led to a huge spike in
the development of downstream applications using generative models ranging from
data augmentation over image and video generation to anomaly detection. Despite
publicly available trained models, their potential to be used for privacy
preserving data sharing has not been fully explored yet. Training diffusion
models on private data and disseminating the models and weights rather than the
raw dataset paves the way for innovative large-scale data-sharing strategies,
particularly in healthcare, where safeguarding patients’ personal health
information is paramount. However, publishing such models without individual
consent of, e.g., the patients from whom the data was acquired, necessitates
guarantees that identifiable training samples will never be reproduced, thus
protecting personal health data and satisfying the requirements of policymakers
and regulatory bodies. This paper introduces a method for estimating the upper
bound of the probability of reproducing identifiable training images during the
sampling process. This is achieved by designing an adversarial approach that
searches for anatomic fingerprints, such as medical devices or dermal art,
which could potentially be employed to re-identify training images. Our method
harnesses the learned score-based model to estimate the probability of the
entire subspace of the score function that may be utilized for one-to-one
reproduction of training samples. To validate our estimates, we generate
anomalies containing a fingerprint and investigate whether generated samples
from trained generative models can be uniquely mapped to the original training
samples. Overall our results show that privacy-breaching images are reproduced
at sampling time if the models were trained without care.
Privacy Distillation: Reducing Re-identification Risk of Multimodal Diffusion Models
June 02, 2023
Virginia Fernandez, Pedro Sanchez, Walter Hugo Lopez Pinaya, Grzegorz Jacenków, Sotirios A. Tsaftaris, Jorge Cardoso
Knowledge distillation in neural networks refers to compressing a large model
or dataset into a smaller version of itself. We introduce Privacy Distillation,
a framework that allows a text-to-image generative model to teach another model
without exposing it to identifiable data. Here, we are interested in the
privacy issue faced by a data provider who wishes to share their data via a
multimodal generative model. A question that immediately arises is "How can a
data provider ensure that the generative model is not leaking identifiable
information about a patient?". Our solution consists of (1) training a first
diffusion model on real data, (2) generating a synthetic dataset using this
model and filtering it to exclude images with a re-identifiability risk, and (3)
training a second diffusion model on the filtered synthetic data only. We
showcase that datasets sampled from models trained with privacy distillation
can effectively reduce re-identification risk whilst maintaining downstream
performance.
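The three-step recipe reads naturally as a short pipeline; the sketch below uses hypothetical placeholder callables (`train_dm`, `sample`, `reidentification_risk`) rather than any real API:

```python
def privacy_distillation(real_data, train_dm, sample, reidentification_risk,
                         num_synthetic=100_000, risk_threshold=0.5):
    # (1) train a first diffusion model on the real, identifiable data
    teacher = train_dm(real_data)
    # (2) sample a synthetic dataset and filter out high re-identification risk
    synthetic = [sample(teacher) for _ in range(num_synthetic)]
    filtered = [img for img in synthetic
                if reidentification_risk(img) < risk_threshold]
    # (3) train a second diffusion model on the filtered synthetic data only;
    #     only this second model is ever shared with downstream users
    return train_dm(filtered)
```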
Diffusion Self-Guidance for Controllable Image Generation
June 01, 2023
Dave Epstein, Allan Jabri, Ben Poole, Alexei A. Efros, Aleksander Holynski
Large-scale generative models are capable of producing high-quality images
from detailed text descriptions. However, many aspects of an image are
difficult or impossible to convey through text. We introduce self-guidance, a
method that provides greater control over generated images by guiding the
internal representations of diffusion models. We demonstrate that properties
such as the shape, location, and appearance of objects can be extracted from
these representations and used to steer sampling. Self-guidance works similarly
to classifier guidance, but uses signals present in the pretrained model
itself, requiring no additional models or training. We show how a simple set of
properties can be composed to perform challenging image manipulations, such as
modifying the position or size of objects, merging the appearance of objects in
one image with the layout of another, composing objects from many images into
one, and more. We also show that self-guidance can be used to edit real images.
For results and an interactive demo, see our project page at
https://dave.ml/selfguidance/
SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds
June 01, 2023
Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, Jian Ren
Text-to-image diffusion models can create stunning images from natural
language descriptions that rival the work of professional artists and
photographers. However, these models are large, with complex network
architectures and tens of denoising iterations, making them computationally
expensive and slow to run. As a result, high-end GPUs and cloud-based inference
are required to run diffusion models at scale. This is costly and has privacy
implications, especially when user data is sent to a third party. To overcome
these challenges, we present a generic approach that, for the first time,
unlocks running text-to-image diffusion models on mobile devices in less than
$2$ seconds. We achieve this by introducing an efficient network architecture and
improving step distillation. Specifically, we propose an efficient UNet by
identifying the redundancy of the original model and reducing the computation
of the image decoder via data distillation. Further, we enhance the step
distillation by exploring training strategies and introducing regularization
from classifier-free guidance. Our extensive experiments on MS-COCO show that
our model with $8$ denoising steps achieves better FID and CLIP scores than
Stable Diffusion v$1.5$ with $50$ steps. Our work democratizes content creation
by bringing powerful text-to-image diffusion models to the hands of users.
Extracting Reward Functions from Diffusion Models
June 01, 2023
Felipe Nuti, Tim Franzmeyer, João F. Henriques
Diffusion models have achieved remarkable results in image generation, and
have similarly been used to learn high-performing policies in sequential
decision-making tasks. Decision-making diffusion models can be trained on
lower-quality data, and then be steered with a reward function to generate
near-optimal trajectories. We consider the problem of extracting a reward
function by comparing a decision-making diffusion model that models low-reward
behavior and one that models high-reward behavior; a setting related to inverse
reinforcement learning. We first define the notion of a relative reward
function of two diffusion models and show conditions under which it exists and
is unique. We then devise a practical learning algorithm for extracting it by
aligning the gradients of a reward function – parametrized by a neural network
– to the difference in outputs of both diffusion models. Our method finds
correct reward functions in navigation environments, and we demonstrate that
steering the base model with the learned reward functions results in
significantly increased performance in standard locomotion benchmarks. Finally,
we demonstrate that our approach generalizes beyond sequential decision-making
by learning a reward-like function from two large-scale image generation
diffusion models. The extracted reward function successfully assigns lower
rewards to harmful images.
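The gradient-alignment objective can be sketched as below (a hedged illustration assuming epsilon-prediction models over flat feature vectors; the actual parameterization, sign convention, and training details follow the paper):

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, 256), nn.SiLU(),
                                 nn.Linear(256, 1))

    def forward(self, x, t):
        return self.net(torch.cat([x, t[:, None]], dim=-1)).squeeze(-1)

def relative_reward_loss(reward_net, low_reward_model, high_reward_model, x, t):
    # Align grad_x r(x, t) with the difference between the two diffusion
    # models' outputs at the same noisy input and timestep.
    x = x.clone().requires_grad_(True)
    r = reward_net(x, t).sum()
    grad_r = torch.autograd.grad(r, x, create_graph=True)[0]
    with torch.no_grad():
        target = high_reward_model(x, t) - low_reward_model(x, t)
    return ((grad_r - target) ** 2).mean()
```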
Intriguing Properties of Text-guided Diffusion Models
June 01, 2023
Qihao Liu, Adam Kortylewski, Yutong Bai, Song Bai, Alan Yuille
Text-guided diffusion models (TDMs) are widely applied but can fail
unexpectedly. Common failures include: (i) natural-looking text prompts
generating images with the wrong content, or (ii) different random samples of
the latent variables that generate vastly different, and even unrelated,
outputs despite being conditioned on the same text prompt. In this work, we aim
to study and understand the failure modes of TDMs in more detail. To achieve
this, we propose SAGE, the first adversarial search method on TDMs that
systematically explores the discrete prompt space and the high-dimensional
latent space, to automatically discover undesirable behaviors and failure cases
in image generation. We use image classifiers as surrogate loss functions
during the search, and employ human inspection to validate the identified
failures. For the first time, our method enables efficient exploration of both
the discrete and intricate human language space and the challenging latent
space, overcoming the gradient vanishing problem. Then, we demonstrate the
effectiveness of SAGE on five widely used generative models and reveal four
typical failure modes: (1) We find a variety of natural text prompts that
generate images failing to capture the semantics of input texts. We further
discuss the underlying causes and potential solutions based on the results. (2)
We find regions in the latent space that lead to distorted images independent
of the text prompt, suggesting that parts of the latent space are not
well-structured. (3) We also find latent samples that result in natural-looking
images unrelated to the text prompt, implying a possible misalignment between
the latent and prompt spaces. (4) By appending a single adversarial token
embedding to any input prompts, we can generate a variety of specified target
objects. Project page: https://sage-diffusion.github.io/
The Hidden Language of Diffusion Models
June 01, 2023
Hila Chefer, Oran Lang, Mor Geva, Volodymyr Polosukhin, Assaf Shocher, Michal Irani, Inbar Mosseri, Lior Wolf
Text-to-image diffusion models have demonstrated an unparalleled ability to
generate high-quality, diverse images from a textual prompt. However, the
internal representations learned by these models remain an enigma. In this
work, we present Conceptor, a novel method to interpret the internal
representation of a textual concept by a diffusion model. This interpretation
is obtained by decomposing the concept into a small set of human-interpretable
textual elements. Applied over the state-of-the-art Stable Diffusion model,
Conceptor reveals non-trivial structures in the representations of concepts.
For example, we find surprising visual connections between concepts, that
transcend their textual semantics. We additionally discover concepts that rely
on mixtures of exemplars, biases, renowned artistic styles, or a simultaneous
fusion of multiple meanings of the concept. Through a large battery of
experiments, we demonstrate Conceptor’s ability to provide meaningful, robust,
and faithful decompositions for a wide variety of abstract, concrete, and
complex textual concepts, while naturally connecting each
decomposition element to its corresponding visual impact on the generated
images. Our code will be available at: https://hila-chefer.github.io/Conceptor/
Differential Diffusion: Giving Each Pixel Its Strength
June 01, 2023
Eran Levin, Ohad Fried
cs.CV, cs.AI, cs.GR, cs.LG, I.3.3
Text-based image editing has advanced significantly in recent years. With the
rise of diffusion models, image editing via textual instructions has become
ubiquitous. Unfortunately, current models lack the ability to customize the
quantity of the change per pixel or per image fragment, resorting to changing
the entire image in an equal amount, or editing a specific region using a
binary mask. In this paper, we suggest a new framework which enables the user
to customize the quantity of change for each image fragment, thereby enhancing
the flexibility and verbosity of modern diffusion models. Our framework does
not require model training or fine-tuning, but instead performs everything at
inference time, making it easily applicable to an existing model. We show both
qualitatively and quantitatively that our method allows better controllability
and can produce results which are unattainable by existing models. Our code is
available at: https://github.com/exx8/differential-diffusion
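One way to picture per-pixel change strength at inference time is the masked update below (a hedged sketch, not the released implementation; `reverse_step` and `add_noise` stand in for the sampler's usual operations, and the thresholding rule is a simplification):

```python
import torch

@torch.no_grad()
def differential_edit_step(x_t, x0_original, change_map, t, num_steps,
                           reverse_step, add_noise):
    # change_map in [0, 1]: per-pixel fraction of the trajectory allowed to edit it.
    x_prev = reverse_step(x_t, t)                 # ordinary denoising update
    remaining = t / num_steps                     # fraction of steps still to go
    frozen = (change_map < remaining).float()     # 1 = keep the original content
    x_orig_prev = add_noise(x0_original, max(t - 1, 0))  # original re-noised to t-1
    return frozen * x_orig_prev + (1 - frozen) * x_prev
```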
Inserting Anybody in Diffusion Models via Celeb Basis
June 01, 2023
Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng
Exquisite demand exists for customizing the pretrained large text-to-image
model, $\textit{e.g.}$, Stable Diffusion, to generate innovative concepts, such
as the users themselves. However, the newly-added concept from previous
customization methods often shows weaker combination abilities than the
original ones even given several images during training. We thus propose a new
personalization method that allows for the seamless integration of a unique
individual into the pre-trained diffusion model using just $\textbf{one facial
photograph}$ and only $\textbf{1024 learnable parameters}$ under $\textbf{3
minutes}$. This allows us to effortlessly generate stunning images of this person in
any pose or position, interacting with anyone and doing anything imaginable
from text prompts. To achieve this, we first analyze and build a well-defined
celeb basis from the embedding space of the pre-trained large text encoder.
Then, given one facial photo as the target identity, we generate its own
embedding by optimizing the weight of this basis and locking all other
parameters. Empowered by the proposed celeb basis, the new identity in our
customized model showcases a better concept combination ability than previous
personalization methods. Besides, our model can also learn several new
identities at once and interact with each other where the previous
customization model fails to. The code will be released.
Conditioning Diffusion Models via Attributes and Semantic Masks for Face Generation
June 01, 2023
Nico Giambi, Giuseppe Lisanti
Deep generative models have shown impressive results in generating realistic
images of faces. GANs managed to generate high-quality, high-fidelity images
when conditioned on semantic masks, but they still lack the ability to
diversify their output. Diffusion models partially solve this problem and are
able to generate diverse samples given the same condition. In this paper, we
propose a multi-conditioning approach for diffusion models via cross-attention
exploiting both attributes and semantic masks to generate high-quality and
controllable face images. We also studied the impact of applying
perceptual-focused loss weighting into the latent space instead of the pixel
space. Our method extends the previous approaches by introducing conditioning
on more than one set of features, guaranteeing a more fine-grained control over
the generated face images. We evaluate our approach on the CelebA-HQ dataset,
and we show that it can generate realistic and diverse samples while allowing
for fine-grained control over multiple attributes and semantic regions.
Additionally, we perform an ablation study to evaluate the impact of different
conditioning strategies on the quality and diversity of the generated images.
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning
June 01, 2023
Xiao Dong, Runhui Huang, Xiaoyong Wei, Zequn Jie, Jianxing Yu, Jian Yin, Xiaodan Liang
Recent advances in vision-language pre-training have enabled machines to
perform better in multimodal object discrimination (e.g., image-text semantic
alignment) and image synthesis (e.g., text-to-image generation). On the other
hand, fine-tuning pre-trained models with discriminative or generative
capabilities such as CLIP and Stable Diffusion on domain-specific datasets has
shown to be effective in various tasks by adapting to specific domains.
However, few studies have explored the possibility of learning both
discriminative and generative capabilities and leveraging their synergistic
effects to create a powerful and personalized multimodal model during
fine-tuning. This paper presents UniDiff, a unified multi-modal model that
integrates image-text contrastive learning (ITC), text-conditioned image
synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff effectively learns aligned semantics and mitigates the issue of
semantic collapse during fine-tuning on small datasets by leveraging RSC on
visual features from CLIP and diffusion models, without altering the
pre-trained model’s basic architecture. UniDiff demonstrates versatility in
both multi-modal understanding and generative tasks. Experimental results on
three datasets (Fashion-man, Fashion-woman, and E-commercial Product) showcase
substantial enhancements in vision-language retrieval and text-to-image
generation, illustrating the advantages of combining discriminative and
generative fine-tuning. The proposed UniDiff model establishes a robust
pipeline for personalized modeling and serves as a benchmark for future
comparisons in the field.
Dissecting Arbitrary-scale Super-resolution Capability from Pre-trained Diffusion Generative Models
June 01, 2023
Ruibin Li, Qihua Zhou, Song Guo, Jie Zhang, Jingcai Guo, Xinyang Jiang, Yifei Shen, Zhenhua Han
Diffusion-based Generative Models (DGMs) have achieved unparalleled
performance in synthesizing high-quality visual content, opening up the
opportunity to improve image super-resolution (SR) tasks. Recent solutions for
these tasks often train architecture-specific DGMs from scratch, or require
iterative fine-tuning and distillation on pre-trained DGMs, both of which take
considerable time and hardware investments. More seriously, since the DGMs are
established with a discrete pre-defined upsampling scale, they cannot well
match the emerging requirements of arbitrary-scale super-resolution (ASSR),
where a unified model adapts to arbitrary upsampling scales, instead of
preparing a series of distinct models for each case. These limitations beg an
intriguing question: can we identify the ASSR capability of existing
pre-trained DGMs without the need for distillation or fine-tuning? In this
paper, we take a step towards resolving this matter by proposing Diff-SR, a
first ASSR attempt based solely on pre-trained DGMs, without additional
training efforts. It is motivated by an exciting finding that a simple
methodology, which first injects a specific amount of noise into the
low-resolution images before invoking a DGM’s backward diffusion process,
outperforms current leading solutions. The key insight is determining a
suitable amount of noise to inject, i.e., small amounts lead to poor low-level
fidelity, while overly large amounts degrade the high-level signature. Through a
fine-grained theoretical analysis, we propose the Perceptual Recoverable
Field (PRF), a metric that achieves the optimal trade-off between these two
factors. Extensive experiments verify the effectiveness, flexibility, and
adaptability of Diff-SR, demonstrating superior performance to state-of-the-art
solutions under diverse ASSR environments.
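The "inject noise, then denoise" recipe amounts to starting the pre-trained model's reverse process from an intermediate timestep; a hedged sketch follows (Diff-SR's contribution is choosing that timestep via the Perceptual Recoverable Field, which here is just an argument, and `denoiser` is a placeholder for one reverse update):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def noise_then_denoise_sr(lr_image, denoiser, alphas_cumprod, t_inject, scale=4):
    # Upsample the low-resolution input, diffuse it to timestep t_inject,
    # then run the pre-trained backward process from there.
    x = F.interpolate(lr_image, scale_factor=scale, mode="bicubic")
    a_bar = alphas_cumprod[t_inject]
    x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * torch.randn_like(x)
    for t in range(t_inject, 0, -1):
        x_t = denoiser(x_t, t)   # one DDPM/DDIM-style reverse step (placeholder)
    return x_t
```

Too small a `t_inject` leaves upsampling artifacts untouched; too large a `t_inject` washes out the image's high-level content, which is exactly the trade-off the PRF metric is meant to balance.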
DiffRoom: Diffusion-based High-Quality 3D Room Reconstruction and Generation
June 01, 2023
Xiaoliang Ju, Zhaoyang Huang, Yijin Li, Guofeng Zhang, Yu Qiao, Hongsheng Li
We present DiffInDScene, a novel framework for tackling the problem of
high-quality 3D indoor scene generation, which is challenging due to the
complexity and diversity of the indoor scene geometry. Although diffusion-based
generative models have previously demonstrated impressive performance in image
generation and object-level 3D generation, they have not yet been applied to
room-level 3D generation due to their computationally intensive costs. In
DiffInDScene, we propose a cascaded 3D diffusion pipeline that is efficient and
possesses strong generative performance for Truncated Signed Distance Function
(TSDF). The whole pipeline is designed to run on a sparse occupancy space in a
coarse-to-fine fashion. Inspired by KinectFusion’s incremental alignment and
fusion of local TSDF volumes, we propose a diffusion-based SDF fusion approach
that iteratively diffuses and fuses local TSDF volumes, facilitating the
generation of an entire room environment. The generated results demonstrate
that our work is capable of achieving high-quality room generation directly in
three-dimensional space, starting from scratch. In addition to the scene
generation, the final part of DiffInDScene can be used as a post-processing
module to refine the 3D reconstruction results from multi-view stereo.
According to the user study, the mesh quality generated by our DiffInDScene can
even outperform the ground truth mesh provided by ScanNet. Please visit our
project page for the latest progress and demonstrations:
https://github.com/AkiraHero/diffindscene.
Image generation with shortest path diffusion
June 01, 2023
Ayan Das, Stathi Fotiadis, Anil Batra, Farhang Nabiei, FengTing Liao, Sattar Vakili, Da-shan Shiu, Alberto Bernacchia
The field of image generation has made significant progress thanks to the
introduction of Diffusion Models, which learn to progressively reverse a given
image corruption. Recently, a few studies introduced alternative ways of
corrupting images in Diffusion Models, with an emphasis on blurring. However,
these studies are purely empirical and it remains unclear what is the optimal
procedure for corrupting an image. In this work, we hypothesize that the
optimal procedure minimizes the length of the path taken when corrupting an
image towards a given final state. We propose the Fisher metric for the path
length, measured in the space of probability distributions. We compute the
shortest path according to this metric, and we show that it corresponds to a
combination of image sharpening, rather than blurring, and noise deblurring.
While the corruption was chosen arbitrarily in previous work, our Shortest Path
Diffusion (SPD) determines uniquely the entire spatiotemporal structure of the
corruption. We show that SPD improves on strong baselines without any
hyperparameter tuning, and outperforms all previous Diffusion Models based on
image blurring. Furthermore, any small deviation from the shortest path leads
to worse performance, suggesting that SPD provides the optimal procedure to
corrupt images. Our work sheds new light on observations made in recent works
and provides a new approach to improve diffusion models on images and other
types of data.
DiffPack: A Torsional Diffusion Model for Autoregressive Protein Side-Chain Packing
June 01, 2023
Yangtian Zhang, Zuobai Zhang, Bozitao Zhong, Sanchit Misra, Jian Tang
Proteins play a critical role in carrying out biological functions, and their
3D structures are essential in determining their functions. Accurately
predicting the conformation of protein side-chains given their backbones is
important for applications in protein structure prediction, design and
protein-protein interactions. Traditional methods are computationally intensive
and have limited accuracy, while existing machine learning methods treat the
problem as a regression task and overlook the restrictions imposed by the
constant covalent bond lengths and angles. In this work, we present DiffPack, a
torsional diffusion model that learns the joint distribution of side-chain
torsional angles, the only degrees of freedom in side-chain packing, by
diffusing and denoising on the torsional space. To avoid issues arising from
simultaneous perturbation of all four torsional angles, we propose
autoregressively generating the four torsional angles from $\chi_1$ to $\chi_4$
and training diffusion models for each torsional angle. We evaluate the method
on several benchmarks for protein side-chain packing and show that our method
achieves improvements of 11.9% and 13.5% in angle accuracy on CASP13 and
CASP14, respectively, with a significantly smaller model size (60x fewer
parameters). Additionally, we show the effectiveness of our method in enhancing
side-chain predictions in the AlphaFold2 model. Code will be available upon
acceptance.
Controllable Motion Diffusion Model
June 01, 2023
Yi Shi, Jingbo Wang, Xuekun Jiang, Bo Dai
Generating realistic and controllable motions for virtual characters is a
challenging task in computer animation, and its implications extend to games,
simulations, and virtual reality. Recent studies have drawn inspiration from
the success of diffusion models in image generation, demonstrating the
potential for addressing this task. However, the majority of these studies have
been limited to offline applications that target sequence-level generation,
producing all steps simultaneously. To enable real-time motion synthesis
with diffusion models in response to time-varying control signals, we propose
the framework of the Controllable Motion Diffusion Model (COMODO). Our
framework begins with an auto-regressive motion diffusion model (A-MDM), which
generates motion sequences step by step. In this way, simply using the standard
DDPM algorithm without any additional complexity, our framework is able to
generate high-fidelity motion sequences over extended periods with different
types of control signals. Then, we propose our reinforcement learning-based
controller and controlling strategies on top of the A-MDM model, so that our
framework can steer the motion synthesis process across multiple tasks,
including target reaching, joystick-based control, goal-oriented control, and
trajectory following. The proposed framework enables the real-time generation
of diverse motions that react adaptively to user commands on-the-fly, thereby
enhancing the overall user experience. Besides, it is compatible with the
inpainting-based editing methods and can predict much more diverse motions
without additional fine-tuning of the basic motion generation models. We
conduct comprehensive experiments to evaluate the effectiveness of our
framework in performing various tasks and compare its performance against
state-of-the-art methods.
Addressing Negative Transfer in Diffusion Models
June 01, 2023
Hyojun Go, JinYoung Kim, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, Seungtaek Choi
Diffusion-based generative models have achieved remarkable success in various
domains. They train a shared model on denoising tasks that encompass different
noise levels simultaneously, representing a form of multi-task learning (MTL).
However, analyzing and improving diffusion models from an MTL perspective
remains under-explored. In particular, MTL can sometimes lead to the well-known
phenomenon of negative transfer, which results in the performance degradation
of certain tasks due to conflicts between tasks. In this paper, we first aim to
analyze diffusion training from an MTL standpoint, presenting two key
observations: (O1) the task affinity between denoising tasks diminishes as the
gap between noise levels widens, and (O2) negative transfer can arise even in
diffusion training. Building upon these observations, we aim to enhance
diffusion training by mitigating negative transfer. To achieve this, we propose
leveraging existing MTL methods, but the presence of a huge number of denoising
tasks makes it computationally expensive to calculate the necessary per-task
loss or gradient. To address this challenge, we propose clustering the
denoising tasks into small task clusters and applying MTL methods to them.
Specifically, based on (O2), we employ interval clustering to enforce temporal
proximity among denoising tasks within clusters. We show that interval
clustering can be solved using dynamic programming, utilizing signal-to-noise
ratio, timestep, and task affinity for clustering objectives. Through this, our
approach addresses the issue of negative transfer in diffusion models by
allowing for efficient computation of MTL methods. We validate the efficacy of
proposed clustering and its integration with MTL methods through various
experiments, demonstrating 1) improved generation quality and 2) faster
training convergence of diffusion models.
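The interval clustering step admits a standard dynamic program; the sketch below partitions ordered timesteps into k contiguous clusters minimizing a summed within-cluster cost (the cost definition, e.g., spread of log-SNR or task-affinity terms, is left abstract and may differ from the paper's):

```python
import numpy as np

def interval_clustering(costs, k):
    # costs[i][j]: cost of grouping contiguous timesteps i..j into one cluster.
    T = len(costs)
    dp = np.full((k + 1, T + 1), np.inf)
    split = np.zeros((k + 1, T + 1), dtype=int)
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, T + 1):
            for i in range(c - 1, j):          # last cluster covers timesteps i..j-1
                cand = dp[c - 1][i] + costs[i][j - 1]
                if cand < dp[c][j]:
                    dp[c][j], split[c][j] = cand, i
    bounds, j = [], T                           # backtrack the cluster boundaries
    for c in range(k, 0, -1):
        i = split[c][j]
        bounds.append((i, j - 1))
        j = i
    return bounds[::-1]                         # list of (start, end) per cluster
```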
Low-Light Image Enhancement with Wavelet-based Diffusion Models
June 01, 2023
Hai Jiang, Ao Luo, Songchen Han, Haoqiang Fan, Shuaicheng Liu
Diffusion models have achieved promising results in image restoration tasks,
yet suffer from time-consuming, excessive computational resource consumption,
and unstable restoration. To address these issues, we propose a robust and
efficient Diffusion-based Low-Light image enhancement approach, dubbed DiffLL.
Specifically, we present a wavelet-based conditional diffusion model (WCDM)
that leverages the generative power of diffusion models to produce results with
satisfactory perceptual fidelity. Additionally, it also takes advantage of the
strengths of wavelet transformation to greatly accelerate inference and reduce
computational resource usage without sacrificing information. To avoid chaotic
content and diversity, we perform both forward diffusion and denoising in the
training phase of WCDM, enabling the model to achieve stable denoising and
reduce randomness during inference. Moreover, we further design a
high-frequency restoration module (HFRM) that utilizes the vertical and
horizontal details of the image to complement the diagonal information for
better fine-grained restoration. Extensive experiments on publicly available
real-world benchmarks demonstrate that our method outperforms the existing
state-of-the-art methods both quantitatively and visually, and it achieves
remarkable improvements in efficiency compared to previous diffusion-based
methods. In addition, we empirically show that the application for low-light
face detection also reveals the latent practical values of our method. Code is
available at https://github.com/JianghaiSCU/Diffusion-Low-Light.
Diffusion Brush: A Latent Diffusion Model-based Editing Tool for AI-generated Images
May 31, 2023
Peyman Gholami, Robert Xiao
cs.CV, cs.AI, cs.CL, cs.LG
Text-to-image generative models have made remarkable advancements in
generating high-quality images. However, generated images often contain
undesirable artifacts or other errors due to model limitations. Existing
techniques to fine-tune generated images are time-consuming (manual editing),
produce poorly-integrated results (inpainting), or result in unexpected changes
across the entire image (variation selection and prompt fine-tuning). In this
work, we present Diffusion Brush, a Latent Diffusion Model-based (LDM) tool to
efficiently fine-tune desired regions within an AI-synthesized image. Our
method introduces new random noise patterns at targeted regions during the
reverse diffusion process, enabling the model to efficiently make changes to
the specified regions while preserving the original context for the rest of the
image. We evaluate our method’s usability and effectiveness through a user
study with artists, comparing our technique against other state-of-the-art
image inpainting techniques and editing software for fine-tuning AI-generated
imagery.
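A hedged sketch of the targeted re-noising idea (not the authors' tool): fresh noise is injected only inside the user's mask for the early part of the reverse trajectory, while the rest of the latent keeps its original path so the surrounding context is preserved. `reverse_step` is a placeholder sampler update and the re-noising schedule is an assumption.

```python
import torch

@torch.no_grad()
def masked_region_resample(z_T, mask, reverse_step, num_steps, renoise_frac=0.3):
    # mask: 1 inside the region to regenerate, 0 elsewhere (latent resolution).
    z = z_T.clone()
    for t in reversed(range(num_steps)):
        if t > (1 - renoise_frac) * num_steps:
            fresh = torch.randn_like(z)
            z = torch.where(mask.bool(), fresh, z)   # new noise in target region only
        z = reverse_step(z, t)                       # placeholder denoising update
    return z
```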
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
May 31, 2023
Wei Xiao, Tsun-Hsuan Wang, Chuang Gan, Daniela Rus
cs.LG, cs.RO, cs.SY, eess.SY
Diffusion model-based approaches have shown promise in data-driven planning,
but there are no safety guarantees, thus making it hard to be applied for
safety-critical applications. To address these challenges, we propose a new
method, called SafeDiffuser, to ensure diffusion probabilistic models satisfy
specifications by using a class of control barrier functions. The key idea of
our approach is to embed the proposed finite-time diffusion invariance into the
denoising diffusion procedure, which enables trustworthy diffusion data
generation. Moreover, we demonstrate that our finite-time diffusion invariance
method through generative models not only maintains generalization performance
but also creates robustness in safe data generation. We test our method on a
series of safe planning tasks, including maze path generation, legged robot
locomotion, and 3D space manipulation, with results showing the advantages of
robustness and guarantees over vanilla diffusion models.
Understanding and Mitigating Copying in Diffusion Models
May 31, 2023
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein
Images generated by diffusion models like Stable Diffusion are increasingly
widespread. Recent works and even lawsuits have shown that these models are
prone to replicating their training data, unbeknownst to the user. In this
paper, we first analyze this memorization problem in text-to-image diffusion
models. While it is widely believed that duplicated images in the training set
are responsible for content replication at inference time, we observe that the
text conditioning of the model plays a similarly important role. In fact, we
see in our experiments that data replication often does not happen for
unconditional models, while it is common in the text-conditional case.
Motivated by our findings, we then propose several techniques for reducing data
replication at both training and inference time by randomizing and augmenting
image captions in the training set.
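The caption-level mitigation can be as simple as perturbing the text condition during training; a toy sketch of one such augmentation follows (token dropping plus random insertions; the paper evaluates several variants, and this is not their exact scheme):

```python
import random

def augment_caption(caption, vocab, p_drop=0.1, p_add=0.1):
    # Randomly drop tokens and insert random vocabulary words so that
    # near-duplicate training images stop sharing an identical text condition.
    tokens = caption.split()
    kept = [tok for tok in tokens if random.random() > p_drop]
    out = []
    for tok in kept:
        out.append(tok)
        if random.random() < p_add:
            out.append(random.choice(vocab))
    return " ".join(out) if out else caption
```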
Control4D: Dynamic Portrait Editing by Learning 4D GAN from 2D Diffusion-based Editor
May 31, 2023
Ruizhi Shao, Jingxiang Sun, Cheng Peng, Zerong Zheng, Boyao Zhou, Hongwen Zhang, Yebin Liu
We introduce Control4D, an innovative framework for editing dynamic 4D
portraits using text instructions. Our method addresses the prevalent
challenges in 4D editing, notably the inefficiencies of existing 4D
representations and the inconsistent editing effect caused by diffusion-based
editors. We first propose GaussianPlanes, a novel 4D representation that makes
Gaussian Splatting more structured by applying plane-based decomposition in 3D
space and time. This enhances both efficiency and robustness in 4D editing.
Furthermore, we propose to leverage a 4D generator to learn a more continuous
generation space from inconsistent edited images produced by the
diffusion-based editor, which effectively improves the consistency and quality
of 4D editing. Comprehensive evaluation demonstrates the superiority of
Control4D, including significantly reduced training time, high-quality
rendering, and spatial-temporal consistency in 4D portrait editing. The link to
our project website is https://control4darxiv.github.io.
A Unified Conditional Framework for Diffusion-based Image Restoration
May 31, 2023
Yi Zhang, Xiaoyu Shi, Dasong Li, Xiaogang Wang, Jian Wang, Hongsheng Li
Diffusion Probabilistic Models (DPMs) have recently shown remarkable
performance in image generation tasks, which are capable of generating highly
realistic images. When adopting DPMs for image restoration tasks, the crucial
aspect lies in how to integrate the conditional information to guide the DPMs
to generate accurate and natural output, which has been largely overlooked in
existing works. In this paper, we present a unified conditional framework based
on diffusion models for image restoration. We leverage a lightweight UNet to
predict initial guidance and the diffusion model to learn the residual of the
guidance. By carefully designing the basic module and integration module for
the diffusion model block, we integrate the guidance and other auxiliary
conditional information into every block of the diffusion model to achieve
spatially-adaptive generation conditioning. To handle high-resolution images,
we propose a simple yet effective inter-step patch-splitting strategy to
produce arbitrary-resolution images without grid artifacts. We evaluate our
conditional framework on three challenging tasks: extreme low-light denoising,
deblurring, and JPEG restoration, demonstrating its significant improvements in
perceptual quality and the generalization to restoration tasks.
Protein Design with Guided Discrete Diffusion
May 31, 2023
Nate Gruver, Samuel Stanton, Nathan C. Frey, Tim G. J. Rudner, Isidro Hotzel, Julien Lafrance-Vanasse, Arvind Rajpal, Kyunghyun Cho, Andrew Gordon Wilson
A popular approach to protein design is to combine a generative model with a
discriminative model for conditional sampling. The generative model samples
plausible sequences while the discriminative model guides a search for
sequences with high fitness. Given its broad success in conditional sampling,
classifier-guided diffusion modeling is a promising foundation for protein
design, leading many to develop guided diffusion models for structure with
inverse folding to recover sequences. In this work, we propose diffusioN
Optimized Sampling (NOS), a guidance method for discrete diffusion models that
follows gradients in the hidden states of the denoising network. NOS makes it
possible to perform design directly in sequence space, circumventing
significant limitations of structure-based methods, including scarce data and
challenging inverse design. Moreover, we use NOS to generalize LaMBO, a
Bayesian optimization procedure for sequence design that facilitates multiple
objectives and edit-based constraints. The resulting method, LaMBO-2, enables
discrete diffusions and stronger performance with limited edits through a novel
application of saliency maps. We apply LaMBO-2 to a real-world protein design
task, optimizing antibodies for higher expression yield and binding affinity to
several therapeutic targets under locality and developability constraints,
attaining a 99% expression rate and 40% binding rate in exploratory in vitro
experiments.
A Geometric Perspective on Diffusion Models
May 31, 2023
Defang Chen, Zhenyu Zhou, Jian-Ping Mei, Chunhua Shen, Chun Chen, Can Wang
Recent years have witnessed significant progress in developing effective
training and fast sampling techniques for diffusion models. A remarkable
advancement is the use of stochastic differential equations (SDEs) and their
marginal-preserving ordinary differential equations (ODEs) to describe data
perturbation and generative modeling in a unified framework. In this paper, we
carefully inspect the ODE-based sampling of a popular variance-exploding SDE
and reveal several intriguing structures of its sampling dynamics. We discover
that the data distribution and the noise distribution are smoothly connected
with a quasi-linear sampling trajectory and another implicit denoising
trajectory that even converges faster. Meanwhile, the denoising trajectory
governs the curvature of the corresponding sampling trajectory and its various
finite differences yield all second-order samplers used in practice.
Furthermore, we establish a theoretical relationship between the optimal
ODE-based sampling and the classic mean-shift (mode-seeking) algorithm, with
which we can characterize the asymptotic behavior of diffusion models and
identify the empirical score deviation.
Unsupervised Anomaly Detection in Medical Images Using Masked Diffusion Model
May 31, 2023
Hasan Iqbal, Umar Khalid, Jing Hua, Chen Chen
It can be challenging to identify brain MRI anomalies using supervised
deep-learning techniques due to anatomical heterogeneity and the requirement
for pixel-level labeling. Unsupervised anomaly detection approaches provide an
alternative solution by relying only on sample-level labels of healthy brains
to generate a desired representation to identify abnormalities at the pixel
level. Although generative models are crucial for generating such anatomically
consistent representations of healthy brains, accurately generating the
intricate anatomy of the human brain remains a challenge. In this study, we
present a method called masked-DDPM (mDDPM), which introduces masking-based
regularization to reframe the generation task of diffusion models.
Specifically, we introduce Masked Image Modeling (MIM) and Masked Frequency
Modeling (MFM) in our self-supervised approach that enables models to learn
visual representations from unlabeled data. To the best of our knowledge, this
is the first attempt to apply MFM to DDPM models for medical applications. We
evaluate our approach on datasets containing tumors and multiple sclerosis
lesions and exhibit the superior performance of our unsupervised method as
compared to the existing fully/weakly supervised baselines. Code is available
at https://github.com/hasan1292/mDDPM.
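As a rough, hedged illustration of the masking idea above, the PyTorch sketch below masks patches of the noised input before a standard DDPM noise-prediction loss; the patch size, masking ratio, placeholder denoiser, and noise schedule are illustrative assumptions, and the frequency-domain variant (MFM) is omitted.

    import torch

    def mask_patches(x, patch=8, ratio=0.5):
        # zero out a random subset of non-overlapping patches
        b, c, h, w = x.shape
        mask = (torch.rand(b, 1, h // patch, w // patch) > ratio).float()
        mask = torch.nn.functional.interpolate(mask, scale_factor=patch, mode="nearest")
        return x * mask

    def masked_ddpm_loss(denoiser, x0, alphas_bar):
        # standard epsilon-prediction loss, but the denoiser sees a masked noisy input
        t = torch.randint(0, len(alphas_bar), (x0.shape[0],))
        a = alphas_bar[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
        pred = denoiser(mask_patches(x_t), t)
        return ((pred - noise) ** 2).mean()

    # toy check with a stand-in denoiser
    alphas_bar = torch.linspace(0.999, 0.01, 1000)
    loss = masked_ddpm_loss(lambda x, t: x, torch.randn(4, 1, 32, 32), alphas_bar)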
Direct Diffusion Bridge using Data Consistency for Inverse Problems
May 31, 2023
Hyungjin Chung, Jeongsol Kim, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion model-based inverse problem solvers have shown impressive
performance, but are limited in speed, mostly as they require reverse diffusion
sampling starting from noise. Several recent works have tried to alleviate this
problem by building a diffusion process, directly bridging the clean and the
corrupted for specific inverse problems. In this paper, we first unify these
existing works under the name Direct Diffusion Bridges (DDB), showing that
while motivated by different theories, the resulting algorithms only differ in
the choice of parameters. Then, we highlight a critical limitation of the
current DDB framework, namely that it does not ensure data consistency. To
address this problem, we propose a modified inference procedure that imposes
data consistency without the need for fine-tuning. We term the resulting method
data Consistent DDB (CDDB), which outperforms its inconsistent counterpart in
terms of both perception and distortion metrics, thereby effectively pushing
the Pareto-frontier toward the optimum. Our proposed method achieves
state-of-the-art results on both evaluation criteria, showcasing its
superiority over existing methods. Code is available at
https://github.com/HJ-harry/CDDB
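A minimal sketch of how a data-consistency correction can be interleaved with a direct-bridge sampler, assuming a known linear forward operator A; the bridge update, schedule, and step size below are placeholders rather than the exact CDDB procedure.

    import torch

    def bridge_step(x_t, t, denoiser, alpha=0.1):
        # generic bridge update: blend the current state with a clean-image estimate
        x0_hat = denoiser(x_t, t)
        return alpha * x0_hat + (1.0 - alpha) * x_t

    def data_consistency(x, y, A, step_size=0.5):
        # one gradient step toward the measurement y under forward operator A
        x = x.detach().requires_grad_(True)
        residual = 0.5 * ((A(x) - y) ** 2).sum()
        grad, = torch.autograd.grad(residual, x)
        return (x - step_size * grad).detach()

    # toy usage with stand-in operator and denoiser
    A = lambda x: 0.5 * x
    y = torch.randn(1, 3, 8, 8)
    x = torch.randn(1, 3, 8, 8)
    denoiser = lambda x, t: x
    for t in torch.linspace(1.0, 0.0, 10):
        x = bridge_step(x, t, denoiser)
        x = data_consistency(x, y, A)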
Spontaneous symmetry breaking in generative diffusion models
May 31, 2023
Gabriel Raya, Luca Ambrogioni
Generative diffusion models have recently emerged as a leading approach for
generating high-dimensional data. In this paper, we show that the dynamics of
these models exhibit a spontaneous symmetry breaking that divides the
generative dynamics into two distinct phases: 1) A linear steady-state dynamics
around a central fixed-point and 2) an attractor dynamics directed towards the
data manifold. These two “phases” are separated by the change in stability of
the central fixed-point, with the resulting window of instability being
responsible for the diversity of the generated samples. Using both theoretical
and empirical evidence, we show that an accurate simulation of the early
dynamics does not significantly contribute to the final generation, since early
fluctuations are reverted to the central fixed point. To leverage this insight,
we propose a Gaussian late initialization scheme, which significantly improves
model performance, achieving up to 3x FID improvements on fast samplers, while
also increasing sample diversity (e.g., racial composition of generated CelebA
images). Our work offers a new way to understand the generative dynamics of
diffusion models that has the potential to bring about higher performance and
less biased fast-samplers.
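A hedged sketch of what a Gaussian late initialization could look like under a VP-style forward process: instead of starting the reverse chain from pure noise, sampling starts at an intermediate time from a Gaussian matched to the forward marginal. The schedule, the choice of t_init, and the per-pixel statistics are illustrative assumptions.

    import torch

    def late_init(data_mean, data_std, alpha_bar_tinit, shape):
        # forward marginal at t_init: N(sqrt(a) * mu, a * sigma^2 + (1 - a))
        a = alpha_bar_tinit
        mean = (a ** 0.5) * data_mean
        std = (a * data_std ** 2 + (1.0 - a)) ** 0.5
        return mean + std * torch.randn(shape)

    x_init = late_init(data_mean=torch.zeros(1, 3, 32, 32),
                       data_std=0.5 * torch.ones(1, 3, 32, 32),
                       alpha_bar_tinit=0.3,
                       shape=(16, 3, 32, 32))
    # ...then run the reverse sampler only from t_init down to 0.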
Mask, Stitch, and Re-Sample: Enhancing Robustness and Generalizability in Anomaly Detection through Automatic Diffusion Models
May 31, 2023
Cosmin I. Bercea, Michael Neumayr, Daniel Rueckert, Julia A. Schnabel
The introduction of diffusion models in anomaly detection has paved the way
for more effective and accurate image reconstruction in pathologies. However,
the current limitations in controlling noise granularity hinder diffusion
models’ ability to generalize across diverse anomaly types and compromise the
restoration of healthy tissues. To overcome these challenges, we propose
AutoDDPM, a novel approach that enhances the robustness of diffusion models.
AutoDDPM utilizes diffusion models to generate initial likelihood maps of
potential anomalies and seamlessly integrates them with the original image.
Through joint noised distribution re-sampling, AutoDDPM achieves harmonization
and in-painting effects. Our study demonstrates the efficacy of AutoDDPM in
replacing anomalous regions while preserving healthy tissues, considerably
surpassing diffusion models’ limitations. It also contributes valuable insights
and analysis on the limitations of current diffusion models, promoting robust
and interpretable anomaly detection in medical imaging - an essential aspect of
building autonomous clinical decision systems with higher interpretability.
DiffLoad: Uncertainty Quantification in Load Forecasting with Diffusion Model
May 31, 2023
Zhixian Wang, Qingsong Wen, Chaoli Zhang, Liang Sun, Yi Wang
Electrical load forecasting plays a crucial role in decision-making for power
systems, including unit commitment and economic dispatch. The integration of
renewable energy sources and the occurrence of external events, such as the
COVID-19 pandemic, have rapidly increased uncertainties in load forecasting.
The uncertainties in load forecasting can be divided into two types: epistemic
uncertainty and aleatoric uncertainty. Separating these types of uncertainties
can help decision-makers better understand where and to what extent the
uncertainty is, thereby enhancing their confidence in the following
decision-making. This paper proposes a diffusion-based Seq2Seq structure to
estimate epistemic uncertainty and employs the robust additive Cauchy
distribution to estimate aleatoric uncertainty. Our method not only ensures the
accuracy of load forecasting but also demonstrates the ability to separate the
two types of uncertainty and remains applicable to different load levels. The
relevant code can be found at
\url{https://anonymous.4open.science/r/DiffLoad-4714/}.
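For the aleatoric part, a Cauchy negative log-likelihood of the kind the abstract alludes to can be written compactly; the network outputs (location and log-scale per horizon step) and the shapes below are illustrative assumptions, not the paper's exact parameterization.

    import torch

    def cauchy_nll(loc, log_scale, target):
        # robust heavy-tailed likelihood: slower penalty growth than a Gaussian NLL
        scale = torch.exp(log_scale)
        z = (target - loc) / scale
        return (torch.log(torch.pi * scale) + torch.log1p(z ** 2)).mean()

    loc = torch.randn(32, 24)        # predicted load, (batch, horizon)
    log_scale = torch.zeros(32, 24)  # predicted log-scale
    target = torch.randn(32, 24)
    loss = cauchy_nll(loc, log_scale, target)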
Improving Handwritten OCR with Training Samples Generated by Glyph Conditional Denoising Diffusion Probabilistic Model
May 31, 2023
Haisong Ding, Bozhi Luan, Dongnan Gui, Kai Chen, Qiang Huo
Constructing a highly accurate handwritten OCR system requires large amounts
of representative training data, which is both time-consuming and expensive to
collect. To mitigate the issue, we propose a denoising diffusion probabilistic
model (DDPM) to generate training samples. This model conditions on a printed
glyph image and creates mappings between printed characters and handwritten
images, thus enabling the generation of photo-realistic handwritten samples
with diverse styles and unseen text contents. However, the text contents in
synthetic images are not always consistent with the glyph conditional images,
leading to unreliable labels of synthetic samples. To address this issue, we
further propose a progressive data filtering strategy to add those samples with
a high confidence of correctness to the training set. Experimental results on
the IAM benchmark task show that an OCR model trained with augmented
DDPM-synthesized training samples achieves about a 45% relative word error rate
reduction compared with one trained on real data only.
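A minimal, hedged sketch of such a progressive filtering loop: only synthetic samples whose recognized text matches the conditioning glyph text with high confidence are added before retraining. The recognize, synthesize, and train callables, the threshold, and the round count are placeholders.

    def filter_synthetic(samples, recognize, threshold=0.9):
        # keep (image, text) pairs the recognizer agrees with confidently
        kept = []
        for image, text in samples:
            pred, conf = recognize(image)
            if pred == text and conf >= threshold:
                kept.append((image, text))
        return kept

    def progressive_rounds(real_data, synthesize, recognize, train, rounds=3):
        train_set = list(real_data)
        model = train(train_set)
        for _ in range(rounds):
            synthetic = synthesize(n=10_000)
            train_set += filter_synthetic(synthetic, recognize)
            model = train(train_set)          # retrain on the enlarged set
        return model

    # toy check of the filter with a stand-in recognizer
    recognize = lambda image: ("hello", 0.95)
    kept = filter_synthetic([("img0", "hello"), ("img1", "world")], recognize)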
Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels
May 31, 2023
Jian Chen, Ruiyi Zhang, Tong Yu, Rohan Sharma, Zhiqiang Xu, Tong Sun, Changyou Chen
Learning from noisy labels is an important and long-standing problem in
machine learning for real applications. One of the main research lines focuses
on learning a label corrector to purify potential noisy labels. However, these
methods typically rely on strict assumptions and are limited to certain types
of label noise. In this paper, we reformulate the label-noise problem from a
generative-model perspective, $\textit{i.e.}$, labels are generated by
gradually refining an initial random guess. This new perspective immediately
enables existing powerful diffusion models to seamlessly learn the stochastic
generative process. Once the generative uncertainty is modeled, we can perform
classification inference using maximum likelihood estimation of labels. To
mitigate the impact of noisy labels, we propose the
$\textbf{L}$abel-$\textbf{R}$etrieval-$\textbf{A}$ugmented (LRA) diffusion
model, which leverages neighbor consistency to effectively construct
pseudo-clean labels for diffusion training. Our model is flexible and general,
allowing easy incorporation of different types of conditional information,
$\textit{e.g.}$, use of pre-trained models, to further boost model performance.
Extensive experiments are conducted for evaluation. Our model achieves new
state-of-the-art (SOTA) results on all the standard real-world benchmark
datasets. Remarkably, by incorporating conditional information from the
powerful CLIP model, our method can boost the current SOTA accuracy by 10-20
absolute points in many cases.
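One way to read the neighbor-consistency step is as a k-nearest-neighbor label retrieval in a pretrained feature space (e.g., CLIP); the sketch below is only a plausible rendering of that idea, with illustrative shapes and a simple majority vote rather than the paper's exact construction.

    import torch

    def retrieve_labels(features, noisy_labels, k=10, num_classes=10):
        # features are L2-normalized, so the dot product is cosine similarity
        sims = features @ features.T
        sims.fill_diagonal_(-float("inf"))            # exclude self-matches
        idx = sims.topk(k, dim=1).indices             # (N, k) neighbor indices
        neighbor_labels = noisy_labels[idx]           # (N, k) observed labels
        counts = torch.zeros(len(features), num_classes)
        counts.scatter_add_(1, neighbor_labels,
                            torch.ones_like(neighbor_labels, dtype=torch.float))
        return counts.argmax(dim=1)                   # majority-vote pseudo-label

    feats = torch.nn.functional.normalize(torch.randn(100, 512), dim=1)
    labels = torch.randint(0, 10, (100,))
    pseudo = retrieve_labels(feats, labels, k=5)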
Fine-grained Text Style Transfer with Diffusion-Based Language Models
May 31, 2023
Yiwei Lyu, Tiange Luo, Jiacheng Shi, Todd C. Hollon, Honglak Lee
Diffusion probabilistic models have shown great success in generating
high-quality images controllably, and researchers have tried to bring this
controllability into the text generation domain. Previous works on diffusion-based
language models have shown that they can be trained without external knowledge
(such as pre-trained weights) and still achieve stable performance and
controllability. In this paper, we trained a diffusion-based model on the
StylePTB dataset, the standard benchmark for fine-grained text style transfer.
The tasks in StylePTB require much more refined control over the output text
than the tasks evaluated in previous works, and our model was able to
achieve state-of-the-art performance on StylePTB on both individual and
compositional transfers. Moreover, our model, trained on limited data from
StylePTB without external knowledge, outperforms previous works that utilized
pretrained weights, embeddings, and external grammar parsers, and this may
indicate that diffusion-based language models have great potential under
low-resource settings.
May 31, 2023
Shaoyan Pan, Elham Abouei, Jacob Wynne, Tonghe Wang, Richard L. J. Qiu, Yuheng Li, Chih-Wei Chang, Junbo Peng, Justin Roper, Pretesh Patel, David S. Yu, Hui Mao, Xiaofeng Yang
Magnetic resonance imaging (MRI)-based synthetic computed tomography (sCT)
simplifies radiation therapy treatment planning by eliminating the need for CT
simulation and error-prone image registration, ultimately reducing patient
radiation dose and setup uncertainty. We propose an MRI-to-CT transformer-based
denoising diffusion probabilistic model (MC-DDPM) to transform MRI into
high-quality sCT to facilitate radiation treatment planning. MC-DDPM implements
diffusion processes with a shifted-window transformer network to generate sCT
from MRI. The proposed model consists of two processes: a forward process which
adds Gaussian noise to real CT scans, and a reverse process in which a
shifted-window transformer V-net (Swin-Vnet) denoises the noisy CT scans
conditioned on the MRI from the same patient to produce noise-free CT scans.
With an optimally trained Swin-Vnet, the reverse diffusion process was used to
generate sCT scans matching MRI anatomy. We evaluated the proposed method by
generating sCT from MRI on a brain dataset and a prostate dataset. Quantitative
evaluation was performed using the mean absolute error (MAE) in Hounsfield units
(HU), peak signal-to-noise ratio (PSNR), the multi-scale structural similarity
index (MS-SSIM), and normalized cross-correlation (NCC) between ground-truth
CTs and sCTs. MC-DDPM generated brain sCTs with state-of-the-art quantitative
results with MAE 43.317 HU, PSNR 27.046 dB, SSIM 0.965, and NCC 0.983. For the
prostate dataset, MC-DDPM achieved MAE 59.953 HU, PSNR 26.920 dB, SSIM 0.849,
and NCC 0.948. In conclusion, we have developed and validated a novel approach
for generating CT images from routine MRIs using a transformer-based DDPM. This
model effectively captures the complex relationship between CT and MRI images,
allowing for robust and high-quality synthetic CT (sCT) images to be generated
in minutes.
Ambient Diffusion: Learning Clean Distributions from Corrupted Data
May 30, 2023
Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alexandros G. Dimakis, Adam Klivans
cs.LG, cs.AI, cs.CV, cs.IT, math.IT
We present the first diffusion-based framework that can learn an unknown
distribution using only highly-corrupted samples. This problem arises in
scientific applications where uncorrupted samples are impossible or expensive
to acquire. Another benefit of our approach is the ability to train
generative models that are less likely to memorize individual training samples
since they never observe clean training data. Our main idea is to introduce
additional measurement distortion during the diffusion process and require the
model to predict the original corrupted image from the further corrupted image.
We prove that our method leads to models that learn the conditional expectation
of the full uncorrupted image given this additional measurement corruption.
This holds for any corruption process that satisfies some technical conditions
(and in particular includes inpainting and compressed sensing). We train models
on standard benchmarks (CelebA, CIFAR-10 and AFHQ) and show that we can learn
the distribution even when all the training samples have $90\%$ of their pixels
missing. We also show that we can finetune foundation models on small corrupted
datasets (e.g. MRI scans with block corruptions) and learn the clean
distribution without memorizing the training set.
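A hedged sketch of the training idea for inpainting-style corruption: drop an additional random subset of the observed pixels and supervise the model on the originally observed ones. The masking rate, loss weighting, and the way the mask is fed to the model are illustrative; the paper's handling of noise schedules is omitted.

    import torch

    def ambient_loss(model, x_corrupted, mask, t, extra_drop=0.1):
        # further corrupt: randomly hide an extra fraction of the observed pixels
        further = mask * (torch.rand_like(mask) > extra_drop).float()
        pred = model(x_corrupted * further, further, t)
        # supervise only on pixels that were observed in the original corruption
        return ((pred - x_corrupted) ** 2 * mask).sum() / mask.sum()

    # toy usage with a stand-in network and 90% missing pixels
    model = lambda x, m, t: x
    x = torch.randn(2, 3, 16, 16)
    mask = (torch.rand_like(x) > 0.9).float()
    loss = ambient_loss(model, x * mask, mask, t=torch.rand(2))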
Calliffusion: Chinese Calligraphy Generation and Style Transfer with Diffusion Modeling
May 30, 2023
Qisheng Liao, Gus Xia, Zhinuo Wang
In this paper, we propose Calliffusion, a system for generating high-quality
Chinese calligraphy using diffusion models. Our model architecture is based on
DDPM (Denoising Diffusion Probabilistic Models), and it is capable of
generating common characters in five different scripts and mimicking the styles
of famous calligraphers. Experiments demonstrate that our model can generate
calligraphy that is difficult to distinguish from real artworks and that our
controls for characters, scripts, and styles are effective. Moreover, we
demonstrate one-shot transfer learning, using LoRA (Low-Rank Adaptation) to
transfer Chinese calligraphy art styles to unseen characters and even
out-of-domain symbols such as English letters and digits.
Unsupervised Statistical Feature-Guided Diffusion Model for Sensor-based Human Activity Recognition
May 30, 2023
Si Zuo, Vitor Fortes Rey, Sungho Suh, Stephan Sigg, Paul Lukowicz
Recognizing human activities from sensor data is a vital task in various
domains, but obtaining diverse and labeled sensor data remains challenging and
costly. In this paper, we propose an unsupervised statistical feature-guided
diffusion model for sensor-based human activity recognition. The proposed
method aims to generate synthetic time-series sensor data without relying on
labeled data, addressing the scarcity and annotation difficulties associated
with real-world sensor data. By conditioning the diffusion model on statistical
information such as mean, standard deviation, Z-score, and skewness, we
generate diverse and representative synthetic sensor data. We conducted
experiments on public human activity recognition datasets and compared the
proposed method to conventional oversampling methods and state-of-the-art
generative adversarial network methods. The experimental results demonstrate
that the proposed method can improve the performance of human activity
recognition and outperform existing techniques.
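A small sketch of the kind of statistical conditioning vector the abstract describes, computed per channel from a window of sensor readings; the exact feature set, windowing, and normalization in the paper may differ.

    import torch

    def stat_condition(window):                    # window: (channels, time)
        mean = window.mean(dim=1)
        std = window.std(dim=1)
        z = (window - mean[:, None]) / (std[:, None] + 1e-8)   # per-channel z-scores
        skew = (z ** 3).mean(dim=1)
        return torch.cat([mean, std, skew])        # conditioning vector for the diffusion model

    cond = stat_condition(torch.randn(6, 128))     # e.g., 3-axis accelerometer + gyroscope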
DiffMatch: Diffusion Model for Dense Matching
May 30, 2023
Jisu Nam, Gyuseong Lee, Sunwoo Kim, Hyeonsu Kim, Hyoungwon Cho, Seyeon Kim, Seungryong Kim
The objective for establishing dense correspondence between paired images
consists of two terms: a data term and a prior term. While conventional
techniques focused on defining hand-designed prior terms, which are difficult
to formulate, recent approaches have focused on learning the data term with
deep neural networks without explicitly modeling the prior, assuming that the
model itself has the capacity to learn an optimal prior from a large-scale
dataset. The performance improvement was obvious; however, these approaches
often fail to address inherent ambiguities of matching, such as textureless
regions, repetitive patterns, and large displacements. To address this, we propose
DiffMatch, a novel conditional diffusion-based framework designed to explicitly
model both the data and prior terms. Unlike previous approaches, this is
accomplished by leveraging a conditional denoising diffusion model. DiffMatch
consists of two main components: conditional denoising diffusion module and
cost injection module. We stabilize the training process and reduce memory
usage with a stage-wise training strategy. Furthermore, to boost performance,
we introduce an inference technique that finds a better path to the accurate
matching field. Our experimental results demonstrate significant performance
improvements of our method over existing approaches, and the ablation studies
validate our design choices along with the effectiveness of each component.
Project page is available at https://ku-cvlab.github.io/DiffMatch/.
Nested Diffusion Processes for Anytime Image Generation
May 30, 2023
Noam Elata, Bahjat Kawar, Tomer Michaeli, Michael Elad
Diffusion models are the current state-of-the-art in image generation,
synthesizing high-quality images by breaking down the generation process into
many fine-grained denoising steps. Despite their good performance, diffusion
models are computationally expensive, requiring many neural function
evaluations (NFEs). In this work, we propose an anytime diffusion-based method
that can generate viable images when stopped at arbitrary times before
completion. Using existing pretrained diffusion models, we show that the
generation scheme can be recomposed as two nested diffusion processes, enabling
fast iterative refinement of a generated image. In experiments on ImageNet and
Stable Diffusion-based text-to-image generation, we show, both qualitatively
and quantitatively, that our method’s intermediate generation quality greatly
exceeds that of the original diffusion model, while the final generation result
remains comparable. We illustrate the applicability of Nested Diffusion in
several settings, including for solving inverse problems, and for rapid
text-based content creation by allowing user intervention throughout the
sampling process.
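A hedged sketch of an anytime nested sampler: an outer chain advances over coarse noise levels while, at each outer step, a short inner chain produces a viable intermediate image that can be returned if sampling stops early. The update rule, schedules, and placeholder step function are illustrative, not the exact recomposition used in the paper.

    import torch

    def nested_sample(denoise_step, x, outer_ts, inner_steps=5):
        intermediates = []
        for t in outer_ts:
            z = x.clone()
            for s in torch.linspace(float(t), 0.0, inner_steps):
                z = denoise_step(z, s)             # inner chain: quick refinement to a clean estimate
            intermediates.append(z)                # anytime output after this outer step
            x = denoise_step(x, t)                 # outer chain: advance by one step
        return intermediates

    denoise_step = lambda x, t: 0.9 * x            # stand-in for a trained sampler step
    outs = nested_sample(denoise_step, torch.randn(1, 3, 32, 32),
                         outer_ts=torch.linspace(1.0, 0.1, 10))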
DiffSketching: Sketch Control Image Synthesis with Diffusion Models
May 30, 2023
Qiang Wang, Di Kong, Fengyin Lin, Yonggang Qi
Creative sketching is a universal form of visual expression, but translating
an abstract sketch into an image is very challenging. Traditionally, a deep
learning model for sketch-to-image synthesis must cope with distorted input
sketches that lack visual detail and requires collecting large-scale
sketch-image datasets. We first study this task by using diffusion
models. Our model matches sketches through the cross domain constraints, and
uses a classifier to guide the image synthesis more accurately. Extensive
experiments confirm that our method is not only faithful to the user’s input
sketches but also maintains the diversity and imagination of the synthesized
images. Our model beats GAN-based methods in terms of generation quality and
human evaluation, and does not rely on massive sketch-image datasets.
Additionally, we present applications of our method in image editing and
interpolation.
HiFA: High-fidelity Text-to-3D with Advanced Diffusion Guidance
May 30, 2023
Junzhe Zhu, Peiye Zhuang
The advancements in automatic text-to-3D generation have been remarkable.
Most existing methods use pre-trained text-to-image diffusion models to
optimize 3D representations like Neural Radiance Fields (NeRFs) via
latent-space denoising score matching. Yet, these methods often result in
artifacts and inconsistencies across different views due to their suboptimal
optimization approaches and limited understanding of 3D geometry. Moreover, the
inherent constraints of NeRFs in rendering crisp geometry and stable textures
usually lead to a two-stage optimization to attain high-resolution details.
This work proposes holistic sampling and smoothing approaches to achieve
high-quality text-to-3D generation, all in a single-stage optimization. We
compute denoising scores in the text-to-image diffusion model’s latent and
image spaces. Instead of randomly sampling timesteps (also referred to as noise
levels in denoising score matching), we introduce a novel timestep annealing
approach that progressively reduces the sampled timestep throughout
optimization. To generate high-quality renderings in a single-stage
optimization, we propose regularization for the variance of z-coordinates along
NeRF rays. To address texture flickering issues in NeRFs, we introduce a kernel
smoothing technique that refines importance sampling weights coarse-to-fine,
ensuring accurate and thorough sampling in high-density regions. Extensive
experiments demonstrate the superiority of our method over previous approaches,
enabling the generation of highly detailed and view-consistent 3D assets
through a single-stage training process.
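As a hedged illustration of timestep annealing in score-distillation-style optimization, the sketch below shrinks the sampled timestep range as optimization proceeds; the linear decay and the bounds are assumptions, not the paper's exact schedule.

    import torch

    def annealed_timestep(step, max_steps, t_max=0.98, t_min=0.02):
        frac = step / max_steps
        upper = t_max - (t_max - t_min) * frac      # upper bound decays over training
        return t_min + (upper - t_min) * torch.rand(())

    # early iterations sample large (noisy) timesteps, late iterations small ones
    ts = [float(annealed_timestep(s, 10_000)) for s in range(0, 10_000, 2_000)]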
Generating Behaviorally Diverse Policies with Latent Diffusion Models
May 30, 2023
Shashank Hegde, Sumeet Batra, K. R. Zentner, Gaurav S. Sukhatme
Recent progress in Quality Diversity Reinforcement Learning (QD-RL) has
enabled learning a collection of behaviorally diverse, high performing
policies. However, these methods typically involve storing thousands of
policies, which results in high space-complexity and poor scaling to additional
behaviors. Condensing the archive into a single model while retaining the
performance and coverage of the original collection of policies has proved
challenging. In this work, we propose using diffusion models to distill the
archive into a single generative model over policy parameters. We show that our
method achieves a compression ratio of 13x while recovering 98% of the original
rewards and 89% of the original coverage. Further, the conditioning mechanism
of diffusion models allows for flexibly selecting and sequencing behaviors,
including using language. Project website:
https://sites.google.com/view/policydiffusion/home
Diffusion-Stego: Training-free Diffusion Generative Steganography via Message Projection
May 30, 2023
Daegyu Kim, Chaehun Shin, Jooyoung Choi, Dahuin Jung, Sungroh Yoon
Generative steganography is the process of hiding secret messages in
generated images instead of cover images. Existing studies on generative
steganography use GAN or Flow models to obtain high hiding message capacity and
anti-detection ability over cover images. However, they create relatively
unrealistic stego images because of the inherent limitations of generative
models. We propose Diffusion-Stego, a generative steganography approach based
on diffusion models which outperform other generative models in image
generation. Diffusion-Stego projects secret messages into latent noise of
diffusion models and generates stego images with an iterative denoising
process. Since the naive hiding of secret messages into noise boosts visual
degradation and decreases extracted message accuracy, we introduce message
projection, which hides messages into noise space while addressing these
issues. We suggest three options for message projection to adjust the trade-off
between extracted message accuracy, anti-detection ability, and image quality.
Diffusion-Stego is a training-free approach, so we can apply it to pre-trained
diffusion models which generate high-quality images, or even large-scale
text-to-image models, such as Stable Diffusion. Diffusion-Stego achieved a high
capacity of messages (3.0 bpp of binary messages with 98% accuracy, and 6.0 bpp
with 90% accuracy) as well as high quality (with an FID score of 2.77 for 1.0
bpp on the FFHQ 64$\times$64 dataset) that makes it challenging to distinguish
from real images in the PNG format.
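A toy sketch of one possible message projection: encode each bit as the sign of a latent-noise coordinate while keeping the magnitude Gaussian, so the latent still resembles standard normal noise fed to a deterministic sampler. This is only the simplest variant one could imagine; the paper's three projection options and its extraction pipeline are more elaborate.

    import torch

    def project_message(bits, shape):
        noise = torch.randn(shape).abs().flatten()       # Gaussian magnitudes
        signs = bits.float() * 2.0 - 1.0                 # {0,1} -> {-1,+1}
        latent = noise.clone()
        latent[: len(signs)] = noise[: len(signs)] * signs
        return latent.reshape(shape)

    def extract_message(latent, n_bits):
        return (latent.flatten()[:n_bits] > 0).long()

    bits = torch.randint(0, 2, (256,))
    z = project_message(bits, (1, 4, 64, 64))
    assert torch.equal(extract_message(z, 256), bits)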
LayerDiffusion: Layered Controlled Image Editing with Diffusion Models
May 30, 2023
Pengzhi Li, Qinxuan Huang, Yikang Ding, Zhiheng Li
Text-guided image editing has recently experienced rapid development.
However, simultaneously performing multiple editing actions on a single image,
such as background replacement and specific subject attribute changes, while
maintaining consistency between the subject and the background remains
challenging. In this paper, we propose LayerDiffusion, a semantic-based layered
controlled image editing method. Our method enables non-rigid editing and
attribute modification of specific subjects while preserving their unique
characteristics and seamlessly integrating them into new backgrounds. We
leverage a large-scale text-to-image model and employ a layered controlled
optimization strategy combined with layered diffusion training. During the
diffusion process, an iterative guidance strategy is used to generate a final
image that aligns with the textual description. Experimental results
demonstrate the effectiveness of our method in generating highly coherent
images that closely align with the given textual description. The edited images
maintain a high similarity to the features of the input image and surpass the
performance of current leading image editing methods. LayerDiffusion opens up
new possibilities for controllable image editing.
On Diffusion Modeling for Anomaly Detection
May 29, 2023
Victor Livernoche, Vineet Jain, Yashar Hezaveh, Siamak Ravanbakhsh
Known for their impressive performance in generative modeling, diffusion
models are attractive candidates for density-based anomaly detection. This
paper investigates different variations of diffusion modeling for unsupervised
and semi-supervised anomaly detection. In particular, we find that Denoising
Diffusion Probabilistic Models (DDPM) are performant on anomaly detection
benchmarks yet computationally expensive. By simplifying DDPM in application to
anomaly detection, we are naturally led to an alternative approach called
Diffusion Time Estimation (DTE). DTE estimates the distribution over diffusion
time for a given input and uses the mode or mean of this distribution as the
anomaly score. We derive an analytical form for this density and leverage a
deep neural network to improve inference efficiency. Through empirical
evaluations on the ADBench benchmark, we demonstrate that all diffusion-based
anomaly detection methods perform competitively for both semi-supervised and
unsupervised settings. Notably, DTE achieves orders of magnitude faster
inference time than DDPM, while outperforming it on this benchmark. These
results establish diffusion-based anomaly detection as a scalable alternative
to traditional methods and recent deep-learning techniques for standard
unsupervised and semi-supervised anomaly detection settings.
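A hedged toy rendering of the diffusion-time-estimation idea: train a small regressor to predict how noised an inlier sample is, then score test inputs by the predicted time (anomalies tend to look more diffused). The interpolation-style noising, architecture, and data here are placeholders rather than the paper's analytical estimator.

    import torch

    net = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    inliers = 0.1 * torch.randn(1024, 16)                 # toy inlier data

    for _ in range(200):
        t = torch.rand(1024, 1)
        x_t = (1 - t) * inliers + t * torch.randn_like(inliers)   # simple noising toward a Gaussian
        loss = ((net(x_t) - t) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

    scores = net(torch.randn(5, 16))                      # higher predicted time = more anomalous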
RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths
May 29, 2023
Zeyue Xue, Guanglu Song, Qiushan Guo, Boxiao Liu, Zhuofan Zong, Yu Liu, Ping Luo
Text-to-image generation has recently witnessed remarkable achievements. We
introduce a text-conditional image diffusion model, termed RAPHAEL, to generate
highly artistic images, which accurately portray the text prompts, encompassing
multiple nouns, adjectives, and verbs. This is achieved by stacking tens of
mixture-of-experts (MoEs) layers, i.e., space-MoE and time-MoE layers, enabling
billions of diffusion paths (routes) from the network input to the output. Each
path intuitively functions as a “painter” for depicting a particular textual
concept onto a specified image region at a diffusion timestep. Comprehensive
experiments reveal that RAPHAEL outperforms recent cutting-edge models, such as
Stable Diffusion, ERNIE-ViLG 2.0, DeepFloyd, and DALL-E 2, in terms of both
image quality and aesthetic appeal. Firstly, RAPHAEL exhibits superior
performance in switching images across diverse styles, such as Japanese comics,
realism, cyberpunk, and ink illustration. Secondly, a single model with three
billion parameters, trained on 1,000 A100 GPUs for two months, achieves a
state-of-the-art zero-shot FID score of 6.61 on the COCO dataset. Furthermore,
RAPHAEL significantly surpasses its counterparts in human evaluation on the
ViLG-300 benchmark. We believe that RAPHAEL holds the potential to propel the
frontiers of image generation research in both academia and industry, paving
the way for future breakthroughs in this rapidly evolving field. More details
can be found on a webpage: https://raphael-painter.github.io/.
Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors
May 29, 2023
Paul S. Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J. Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth A. Norman, Tanishq Mathew Abraham
We present MindEye, a novel fMRI-to-image approach to retrieve and
reconstruct viewed images from brain activity. Our model comprises two parallel
submodules that are specialized for retrieval (using contrastive learning) and
reconstruction (using a diffusion prior). MindEye can map fMRI brain activity
to any high dimensional multimodal latent space, like CLIP image space,
enabling image reconstruction using generative models that accept embeddings
from this latent space. We comprehensively compare our approach with other
existing methods, using both qualitative side-by-side comparisons and
quantitative evaluations, and show that MindEye achieves state-of-the-art
performance in both reconstruction and retrieval tasks. In particular, MindEye
can retrieve the exact original image even among highly similar candidates,
indicating that its brain embeddings retain fine-grained image-specific
information. This allows us to accurately retrieve images even from large-scale
databases like LAION-5B. We demonstrate through ablations that MindEye’s
performance improvements over previous methods result from specialized
submodules for retrieval and reconstruction, improved training techniques, and
training models with orders of magnitude more parameters. Furthermore, we show
that MindEye can better preserve low-level image features in the
reconstructions by using img2img, with outputs from a separate autoencoder. All
code is available on GitHub.
CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models
May 29, 2023
Zhongxi Chen, Ke Sun, Xianming Lin, Rongrong Ji
Camouflaged Object Detection (COD) is a challenging task in computer vision
due to the high similarity between camouflaged objects and their surroundings.
Existing COD methods primarily employ semantic segmentation, which suffers from
overconfident incorrect predictions. In this paper, we propose a new paradigm
that treats COD as a conditional mask-generation task leveraging diffusion
models. Our method, dubbed CamoDiffusion, employs the denoising process of
diffusion models to iteratively reduce the noise of the mask. Due to the
stochastic sampling process of diffusion, our model is capable of sampling
multiple possible predictions from the mask distribution, avoiding the problem
of overconfident point estimation. Moreover, we develop specialized learning
strategies that include an innovative ensemble approach for generating robust
predictions and tailored forward diffusion methods for efficient training,
specifically for the COD task. Extensive experiments on three COD datasets
attest to the superior performance of our model compared to existing
state-of-the-art methods, particularly on the most challenging COD10K dataset,
where our approach achieves 0.019 in terms of MAE.
Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models
May 29, 2023
Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, Zhihua Zhang
Due to the ease of training, ability to scale, and high sample quality,
diffusion models (DMs) have become the preferred option for generative
modeling, with numerous pre-trained models available for a wide variety of
datasets. Containing intricate information about data distributions,
pre-trained DMs are valuable assets for downstream applications. In this work,
we consider learning from pre-trained DMs and transferring their knowledge to
other generative models in a data-free fashion. Specifically, we propose a
general framework called Diff-Instruct to instruct the training of arbitrary
generative models as long as the generated samples are differentiable with
respect to the model parameters. Our proposed Diff-Instruct is built on a
rigorous mathematical foundation where the instruction process directly
corresponds to minimizing a novel divergence we call Integral Kullback-Leibler
(IKL) divergence. IKL is tailored for DMs by calculating the integral of the KL
divergence along a diffusion process, which we show to be more robust in
comparing distributions with misaligned supports. We also reveal non-trivial
connections of our method to existing works such as DreamFusion, and generative
adversarial training. To demonstrate the effectiveness and universality of
Diff-Instruct, we consider two scenarios: distilling pre-trained diffusion
models and refining existing GAN models. The experiments on distilling
pre-trained diffusion models show that Diff-Instruct results in
state-of-the-art single-step diffusion-based models. The experiments on
refining GAN models show that the Diff-Instruct can consistently improve the
pre-trained generators of GAN models across various settings.
Conditional Diffusion Models for Semantic 3D Medical Image Synthesis
May 29, 2023
Zolnamar Dorjsembe, Hsing-Kuo Pao, Sodtavilan Odonchimed, Furen Xiao
The demand for artificial intelligence (AI) in healthcare is rapidly
increasing. However, significant challenges arise from data scarcity and
privacy concerns, particularly in medical imaging. While existing generative
models have achieved success in image synthesis and image-to-image translation
tasks, there remains a gap in the generation of 3D semantic medical images. To
address this gap, we introduce Med-DDPM, a diffusion model specifically
designed for semantic 3D medical image synthesis, effectively tackling data
scarcity and privacy issues. The novelty of Med-DDPM lies in its incorporation
of semantic conditioning, enabling precise control during the image generation
process. Our model outperforms Generative Adversarial Networks (GANs) in terms
of stability and performance, generating diverse and anatomically coherent
images with high visual fidelity. Comparative analysis against state-of-the-art
augmentation techniques demonstrates that Med-DDPM produces comparable results,
highlighting its potential as a data augmentation tool for enhancing model
accuracy. In conclusion, Med-DDPM pioneers 3D semantic medical image synthesis
by delivering high-quality and anatomically coherent images. Furthermore, the
integration of semantic conditioning with Med-DDPM holds promise for image
anonymization in the field of biomedical imaging, showcasing the capabilities
of the model in addressing challenges related to data scarcity and privacy
concerns. Our code and model weights are publicly accessible on our GitHub
repository at https://github.com/mobaidoctor/med-ddpm/, facilitating
reproducibility.
Generating Driving Scenes with Diffusion
May 29, 2023
Ethan Pronovost, Kai Wang, Nick Roy
In this paper we describe a learned method of traffic scene generation
designed to simulate the output of the perception system of a self-driving car.
In our “Scene Diffusion” system, inspired by latent diffusion, we use a novel
combination of diffusion and object detection to directly create realistic and
physically plausible arrangements of discrete bounding boxes for agents. We
show that our scene generation model is able to adapt to different regions in
the US, producing scenarios that capture the intricacies of each region.
Cognitively Inspired Cross-Modal Data Generation Using Diffusion Models
May 28, 2023
Zizhao Hu, Mohammad Rostami
Most existing cross-modal generative methods based on diffusion models use
guidance to provide control over the latent space to enable conditional
generation across different modalities. Such methods focus on providing
guidance through separately-trained models, each for one modality. As a result,
these methods suffer from cross-modal information loss and are limited to
unidirectional conditional generation. Inspired by how humans synchronously
acquire multi-modal information and learn the correlation between modalities,
we explore a multi-modal diffusion model training and sampling scheme that uses
channel-wise image conditioning to learn cross-modality correlation during the
training phase to better mimic the learning process in the brain. Our empirical
results demonstrate that our approach can achieve data generation conditioned
on all correlated modalities.
Conditional score-based diffusion models for Bayesian inference in infinite dimensions
May 28, 2023
Lorenzo Baldassari, Ali Siahkoohi, Josselin Garnier, Knut Solna, Maarten V. de Hoop
stat.ML, cs.LG, math.AP, math.PR, 62F15, 65N21, 68Q32, 60Hxx, 60Jxx
Since their initial introduction, score-based diffusion models (SDMs) have
been successfully applied to solve a variety of linear inverse problems in
finite-dimensional vector spaces due to their ability to efficiently
approximate the posterior distribution. However, using SDMs for inverse
problems in infinite-dimensional function spaces has only been addressed
recently, primarily through methods that learn the unconditional score. While
this approach is advantageous for some inverse problems, it is mostly heuristic
and involves numerous computationally costly forward operator evaluations
during posterior sampling. To address these limitations, we propose a
theoretically grounded method for sampling from the posterior of
infinite-dimensional Bayesian linear inverse problems based on amortized
conditional SDMs. In particular, we prove that one of the most successful
approaches for estimating the conditional score in finite dimensions - the
conditional denoising estimator - can also be applied in infinite dimensions. A
significant part of our analysis is dedicated to demonstrating that extending
infinite-dimensional SDMs to the conditional setting requires careful
consideration, as the conditional score typically blows up for small times,
contrary to the unconditional score. We conclude by presenting stylized and
large-scale numerical examples that validate our approach, offer additional
insights, and demonstrate that our method enables large-scale,
discretization-invariant Bayesian inference.
Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models
May 27, 2023
Yusheng Tian, Guangyan Zhang, Tan Lee
This paper is about developing personalized speech synthesis systems with
recordings of mildly impaired speech. In particular, we consider consonant and
vowel alterations resulting from partial glossectomy, the surgical removal of
part of the tongue. The aim is to restore articulation in the synthesized
speech and maximally preserve the target speaker’s individuality. We propose to
tackle the problem with guided diffusion models. Specifically, a
diffusion-based speech synthesis model is trained on original recordings, to
capture and preserve the target speaker’s original articulation style. When
using the model for inference, a separately trained phone classifier will guide
the synthesis process towards proper articulation. Objective and subjective
evaluation results show that the proposed method substantially improves
articulation in the synthesized speech over original recordings, and preserves
more of the target speaker’s individuality than a voice conversion baseline.
MADiff: Offline Multi-agent Learning with Diffusion Models
May 27, 2023
Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, Weinan Zhang
Diffusion model (DM), as a powerful generative model, recently achieved huge
success in various scenarios including offline reinforcement learning, where
the policy learns to plan by generating trajectories during online evaluation.
However, despite their effectiveness in single-agent learning, it remains
unclear how DMs can operate in multi-agent problems, where agents can hardly
accomplish teamwork when each agent’s trajectory is modeled independently.
In this paper, we propose MADiff, a novel
generative multi-agent learning framework to tackle this problem. MADiff is
realized with an attention-based diffusion model to model the complex
coordination among behaviors of multiple diffusion agents. To the best of our
knowledge, MADiff is the first diffusion-based multi-agent offline RL
framework, which behaves as both a decentralized policy and a centralized
controller. During decentralized execution, MADiff simultaneously performs
teammate modeling, and the centralized controller can also be applied to
multi-agent trajectory prediction. Our experiments show the superior
performance of MADiff compared to baseline algorithms in a wide range of
multi-agent learning tasks, which emphasizes the effectiveness of MADiff in
modeling complex multi-agent interactions. Our code is available at
https://github.com/zbzhu99/madiff.
Contrast, Attend and Diffuse to Decode High-Resolution Images from Brain Activities
May 26, 2023
Jingyuan Sun, Mingxiao Li, Zijiao Chen, Yunhao Zhang, Shaonan Wang, Marie-Francine Moens
Decoding visual stimuli from neural responses recorded by functional Magnetic
Resonance Imaging (fMRI) presents an intriguing intersection between cognitive
neuroscience and machine learning, promising advancements in understanding
human visual perception and building non-invasive brain-machine interfaces.
However, the task is challenging due to the noisy nature of fMRI signals and
the intricate pattern of brain visual representations. To mitigate these
challenges, we introduce a two-phase fMRI representation learning framework.
The first phase pre-trains an fMRI feature learner with a proposed
Double-contrastive Mask Auto-encoder to learn denoised representations. The
second phase tunes the feature learner to attend to neural activation patterns
most informative for visual reconstruction with guidance from an image
auto-encoder. The optimized fMRI feature learner then conditions a latent
diffusion model to reconstruct image stimuli from brain activities.
Experimental results demonstrate our model’s superiority in generating
high-resolution and semantically accurate images, substantially exceeding
previous state-of-the-art methods by 39.34% in the 50-way-top-1 semantic
classification accuracy. Our research invites further exploration of the
decoding task’s potential and contributes to the development of non-invasive
brain-machine interfaces.
Functional Flow Matching
May 26, 2023
Gavin Kerrigan, Giosue Migliorini, Padhraic Smyth
We propose Functional Flow Matching (FFM), a function-space generative model
that generalizes the recently-introduced Flow Matching model to operate in
infinite-dimensional spaces. Our approach works by first defining a path of
probability measures that interpolates between a fixed Gaussian measure and the
data distribution, followed by learning a vector field on the underlying space
of functions that generates this path of measures. Our method does not rely on
likelihoods or simulations, making it well-suited to the function space
setting. We provide both a theoretical framework for building such models and
an empirical evaluation of our techniques. We demonstrate through experiments
on several real-world benchmarks that our proposed FFM method outperforms
several recently proposed function-space generative models.
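For readers unfamiliar with flow matching, the finite-dimensional objective that FFM lifts to function spaces can be sketched as regressing a vector field onto the straight-line velocity between a Gaussian sample and a data sample; the network and shapes below are illustrative.

    import torch

    def flow_matching_loss(v_net, x1):
        x0 = torch.randn_like(x1)                       # base Gaussian sample
        t = torch.rand(x1.shape[0], 1)
        x_t = (1 - t) * x0 + t * x1                     # point on the interpolating path
        target = x1 - x0                                # conditional velocity of that path
        return ((v_net(x_t, t) - target) ** 2).mean()

    class VNet(torch.nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.net = torch.nn.Sequential(torch.nn.Linear(dim + 1, 128),
                                           torch.nn.ReLU(),
                                           torch.nn.Linear(128, dim))
        def forward(self, x, t):
            return self.net(torch.cat([x, t], dim=1))

    loss = flow_matching_loss(VNet(8), torch.randn(64, 8))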
Flow Matching for Scalable Simulation-Based Inference
May 26, 2023
Maximilian Dax, Jonas Wildberger, Simon Buchholz, Stephen R. Green, Jakob H. Macke, Bernhard Schölkopf
Neural posterior estimation methods based on discrete normalizing flows have
become established tools for simulation-based inference (SBI), but scaling them
to high-dimensional problems can be challenging. Building on recent advances in
generative modeling, we here present flow matching posterior estimation (FMPE),
a technique for SBI using continuous normalizing flows. Like diffusion models,
and in contrast to discrete flows, flow matching allows for unconstrained
architectures, providing enhanced flexibility for complex data modalities. Flow
matching, therefore, enables exact density evaluation, fast training, and
seamless scalability to large architectures–making it ideal for SBI. We show
that FMPE achieves competitive performance on an established SBI benchmark, and
then demonstrate its improved scalability on a challenging scientific problem:
for gravitational-wave inference, FMPE outperforms methods based on comparable
discrete flows, reducing training time by 30% with substantially improved
accuracy. Our work underscores the potential of FMPE to enhance performance in
challenging inference scenarios, thereby paving the way for more advanced
applications to scientific problems.
High-Fidelity Image Compression with Score-based Generative Models
May 26, 2023
Emiel Hoogeboom, Eirikur Agustsson, Fabian Mentzer, Luca Versari, George Toderici, Lucas Theis
eess.IV, cs.CV, cs.LG, stat.ML
Despite the tremendous success of diffusion generative models in
text-to-image generation, replicating this success in the domain of image
compression has proven difficult. In this paper, we demonstrate that diffusion
can significantly improve perceptual quality at a given bit-rate, outperforming
state-of-the-art approaches PO-ELIC and HiFiC as measured by FID score. This is
achieved using a simple but theoretically motivated two-stage approach
combining an autoencoder targeting MSE followed by a further score-based
decoder. However, as we will show, implementation details matter and the
optimal design decisions can differ greatly from typical text-to-image models.
An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization
May 26, 2023
Fei Kong, Jinhao Duan, RuiPeng Ma, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
cs.SD, cs.AI, cs.LG, eess.AS
Recently, diffusion models have achieved remarkable success in generating
tasks, including image and audio generation. However, like other generative
models, diffusion models are prone to privacy issues. In this paper, we propose
an efficient query-based membership inference attack (MIA), namely Proximal
Initialization Attack (PIA), which utilizes the groundtruth trajectory obtained
by $\epsilon$ initialized at $t=0$ and the predicted point to infer membership.
Experimental results indicate that the proposed method can achieve competitive
performance with only two queries on both discrete-time and continuous-time
diffusion models. Moreover, previous works on the privacy of diffusion models
have focused on vision tasks without considering audio tasks. Therefore, we
also explore the robustness of diffusion models to MIA in the text-to-speech
(TTS) task, which is an audio generation task. To the best of our knowledge,
this work is the first to study the robustness of diffusion models to MIA in
the TTS task. Experimental results indicate that models with mel-spectrogram
(image-like) output are vulnerable to MIA, while models with audio output are
relatively robust to MIA. Code is available at
\url{https://github.com/kong13661/PIA}.
Accelerating Diffusion Models for Inverse Problems through Shortcut Sampling
May 26, 2023
Gongye Liu, Haoze Sun, Jiayi Li, Fei Yin, Yujiu Yang
Recently, diffusion models have demonstrated a remarkable ability to solve
inverse problems in an unsupervised manner. Existing methods mainly focus on
modifying the posterior sampling process while neglecting the potential of the
forward process. In this work, we propose Shortcut Sampling for Diffusion
(SSD), a novel pipeline for solving inverse problems. Instead of initiating
from random noise, the key concept of SSD is to find the “Embryo”, a
transitional state that bridges the measurement image y and the restored image
x. By utilizing the “shortcut” path of “input-Embryo-output”, SSD can achieve
precise and fast restoration. To obtain the Embryo in the forward process, we
propose Distortion Adaptive Inversion (DA Inversion). Moreover, we apply back
projection and attention injection as additional consistency constraints during
the generation process. Experimentally, we demonstrate the effectiveness of SSD
on several representative tasks, including super-resolution, deblurring, and
colorization. Compared to state-of-the-art zero-shot methods, our method
achieves competitive results with only 30 NFEs. Moreover, SSD with 100 NFEs can
outperform state-of-the-art zero-shot methods in certain tasks.
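A hedged sketch of the shortcut idea: rather than starting from pure noise, the degraded measurement is partially noised to an intermediate level and only the remaining portion of the reverse chain is run. The noising rule, the intermediate level, and the placeholder step function are assumptions; DA Inversion, back projection, and attention injection are omitted.

    import torch

    def shortcut_restore(y, denoise_step, alpha_bar_mid=0.5, steps=30):
        # forward shortcut: jump directly to an intermediate noise level from y
        x = (alpha_bar_mid ** 0.5) * y + ((1 - alpha_bar_mid) ** 0.5) * torch.randn_like(y)
        # reverse only the remaining part of the chain
        for t in torch.linspace(alpha_bar_mid, 0.0, steps):
            x = denoise_step(x, t)
        return x

    restored = shortcut_restore(torch.randn(1, 3, 64, 64),
                                denoise_step=lambda x, t: 0.95 * x)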
Error Bounds for Flow Matching Methods
May 26, 2023
Joe Benton, George Deligiannidis, Arnaud Doucet
Score-based generative models are a popular class of generative modelling
techniques relying on stochastic differential equations (SDE). From their
inception, it was realized that it was also possible to perform generation
using ordinary differential equations (ODE) rather than SDE. This led to the
introduction of the probability flow ODE approach and denoising diffusion
implicit models. Flow matching methods have recently further extended these
ODE-based approaches and approximate a flow between two arbitrary probability
distributions. Previous work derived bounds on the approximation error of
diffusion models under the stochastic sampling regime, given assumptions on the
$L^2$ loss. We present error bounds for the flow matching procedure using fully
deterministic sampling, assuming an $L^2$ bound on the approximation error and
a certain regularity condition on the data distributions.
Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
May 26, 2023
Xiang Li, Songxiang Liu, Max W. Y. Lam, Zhiyong Wu, Chao Weng, Helen Meng
Expressive human speech generally abounds with rich and flexible speech
prosody variations. The speech prosody predictors in existing expressive speech
synthesis methods mostly produce deterministic predictions, which are learned
by directly minimizing the norm of the prosody prediction error. Its unimodal
nature leads to a mismatch with the ground-truth distribution and harms the
model’s ability to make diverse predictions. Thus, we propose a novel prosody
predictor based on the denoising diffusion probabilistic model to take
advantage of its high-quality generative modeling and training stability.
Experiment results confirm that the proposed prosody predictor outperforms the
deterministic baseline on both the expressiveness and diversity of prediction
results with even fewer network parameters.
Tree-Based Diffusion Schrödinger Bridge with Applications to Wasserstein Barycenters
May 26, 2023
Maxence Noble, Valentin De Bortoli, Arnaud Doucet, Alain Durmus
Multi-marginal Optimal Transport (mOT), a generalization of OT, aims at
minimizing the integral of a cost function with respect to a distribution with
some prescribed marginals. In this paper, we consider an entropic version of
mOT with a tree-structured quadratic cost, i.e., a function that can be written
as a sum of pairwise cost functions between the nodes of a tree. To address
this problem, we develop Tree-based Diffusion Schrödinger Bridge (TreeDSB),
an extension of the Diffusion Schrödinger Bridge (DSB) algorithm. TreeDSB
corresponds to a dynamic and continuous state-space counterpart of the
multimarginal Sinkhorn algorithm. A notable use case of our methodology is to
compute Wasserstein barycenters which can be recast as the solution of a mOT
problem on a star-shaped tree. We demonstrate that our methodology can be
applied in high-dimensional settings such as image interpolation and Bayesian
fusion.
Score-based Diffusion Models for Bayesian Image Reconstruction
May 25, 2023
Michael T. McCann, Hyungjin Chung, Jong Chul Ye, Marc L. Klasky
This paper explores the use of score-based diffusion models for Bayesian
image reconstruction. Diffusion models are an efficient tool for generative
modeling. Diffusion models can also be used for solving image reconstruction
problems. We present a simple and flexible algorithm for training a diffusion
model and using it for maximum a posteriori reconstruction, minimum mean square
error reconstruction, and posterior sampling. We present experiments on both a
linear and a nonlinear reconstruction problem that highlight the strengths and
limitations of the approach.
Anomaly Detection in Satellite Videos using Diffusion Models
May 25, 2023
Akash Awasthi, Son Ly, Jaer Nizam, Samira Zare, Videet Mehta, Safwan Ahmed, Keshav Shah, Ramakrishna Nemani, Saurabh Prasad, Hien Van Nguyen
The definition of anomaly detection is the identification of an unexpected
event. Real-time detection of extreme events such as wildfires, cyclones, or
floods using satellite data has become crucial for disaster management.
Although several earth-observing satellites provide information about
disasters, satellites in geostationary orbit provide data at intervals as
frequent as every minute, effectively creating a video from space. Many
techniques have been proposed to identify anomalies in surveillance videos;
however, the available datasets lack such dynamic behavior, so we describe an
anomaly-detection framework that can operate on very high-frequency data to
find fast-moving anomalies. In this work, we present a diffusion model
which does not need any motion component to capture the fast-moving anomalies
and outperforms the other baseline methods.
Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models
May 25, 2023
Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, Kwan-Yee K. Wong
Text-to-Image diffusion models have made tremendous progress over the past
two years, enabling the generation of highly realistic images based on
open-domain text descriptions. However, despite their success, text
descriptions often struggle to adequately convey detailed controls, even when
composed of long and complex texts. Moreover, recent studies have also shown
that these models face challenges in understanding such complex texts and
generating the corresponding images. Therefore, there is a growing need to
enable more control modes beyond text description. In this paper, we introduce
Uni-ControlNet, a unified framework that allows for the simultaneous
utilization of different local controls (e.g., edge maps, depth maps,
segmentation masks) and global controls (e.g., CLIP image embeddings) in a
flexible and composable manner within one single model. Unlike existing
methods, Uni-ControlNet only requires the fine-tuning of two additional
adapters upon frozen pre-trained text-to-image diffusion models, eliminating
the huge cost of training from scratch. Moreover, thanks to some dedicated
adapter designs, Uni-ControlNet only necessitates a constant number (i.e., 2)
of adapters, regardless of the number of local or global controls used. This
not only reduces the fine-tuning costs and model size, making it more suitable
for real-world deployment, but also facilitates the composability of different
conditions. Through both quantitative and qualitative comparisons,
Uni-ControlNet demonstrates its superiority over existing methods in terms of
controllability, generation quality and composability. Code is available at
\url{https://github.com/ShihaoZhaoZSH/Uni-ControlNet}.
Parallel Sampling of Diffusion Models
May 25, 2023
Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, Nima Anari
Diffusion models are powerful generative models but suffer from slow
sampling, often taking 1000 sequential denoising steps for one sample. As a
result, considerable efforts have been directed toward reducing the number of
denoising steps, but these methods hurt sample quality. Instead of reducing the
number of denoising steps (trading quality for speed), in this paper we explore
an orthogonal approach: can we run the denoising steps in parallel (trading
compute for speed)? In spite of the sequential nature of the denoising steps,
we show that surprisingly it is possible to parallelize sampling via Picard
iterations, by guessing the solution of future denoising steps and iteratively
refining until convergence. With this insight, we present ParaDiGMS, a novel
method to accelerate the sampling of pretrained diffusion models by denoising
multiple steps in parallel. ParaDiGMS is the first diffusion sampling method
that enables trading compute for speed and is even compatible with existing
fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we
improve sampling speed by 2-4x across a range of robotics and image generation
models, giving state-of-the-art sampling speeds of 0.2s on 100-step
DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable
degradation of task reward, FID score, or CLIP score.
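A minimal sketch of the Picard-iteration idea described above: treat the whole denoising trajectory as the unknown, guess it, and recompute every step from the previous guess until the trajectory stops changing, so that all steps within one iteration can be evaluated in parallel. The function names and the toy `denoise_step` are placeholders, not the ParaDiGMS API.

```python
import torch

def denoise_step(x_t, t):
    # Placeholder for one deterministic denoising update x_{t-1} = f(x_t, t);
    # a real implementation would call a pretrained diffusion model here.
    return 0.9 * x_t

def picard_parallel_sample(x_T, n_steps=50, n_iters=20, tol=1e-4):
    """Refine a guessed trajectory x_T -> x_0 with Picard iterations.

    Each iteration recomputes every step from the *previous* trajectory,
    so the denoise_step calls within an iteration are independent and
    could be batched across devices."""
    traj = [x_T.clone() for _ in range(n_steps + 1)]      # initial guess
    for _ in range(n_iters):
        new_traj = [x_T.clone()]
        for k in range(n_steps):
            new_traj.append(denoise_step(traj[k], n_steps - k))
        drift = max((a - b).abs().max().item()
                    for a, b in zip(new_traj, traj))
        traj = new_traj
        if drift < tol:                                    # converged
            break
    return traj[-1]                                        # approximate x_0

sample = picard_parallel_sample(torch.randn(3, 32, 32))
```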
UDPM: Upsampling Diffusion Probabilistic Models
May 25, 2023
Shady Abu-Hussein, Raja Giryes
In recent years, Denoising Diffusion Probabilistic Models (DDPM) have caught
significant attention. By composing a Markovian process that starts in the data
domain and then gradually adds noise until reaching pure white noise, they
achieve superior performance in learning data distributions. Yet, these models
require a large number of diffusion steps to produce aesthetically pleasing
samples, which is inefficient. In addition, unlike common generative
adversarial networks, the latent space of diffusion models is not
interpretable. In this work, we propose to generalize the denoising diffusion
process into an Upsampling Diffusion Probabilistic Model (UDPM), in which we
reduce the latent variable dimension in addition to the traditional noise level
addition. As a result, we are able to sample images of size $256\times 256$
with only 7 diffusion steps, roughly two orders of magnitude fewer than
standard DDPMs. We formally develop the Markovian diffusion
processes of the UDPM, and demonstrate its generation capabilities on the
popular FFHQ, LSUN horses, ImageNet, and AFHQv2 datasets. Another favorable
property of UDPM is that it is very easy to interpolate its latent space, which
is not the case with standard diffusion models. Our code is available online
\url{https://github.com/shadyabh/UDPM}
Trans-Dimensional Generative Modeling via Jump Diffusion Models
May 25, 2023
Andrew Campbell, William Harvey, Christian Weilbach, Valentin De Bortoli, Tom Rainforth, Arnaud Doucet
We propose a new class of generative models that naturally handle data of
varying dimensionality by jointly modeling the state and dimension of each
datapoint. The generative process is formulated as a jump diffusion process
that makes jumps between different dimensional spaces. We first define a
dimension-destroying forward noising process, before deriving the
dimension-creating time-reversed generative process along with a novel evidence lower
bound training objective for learning to approximate it. Simulating our learned
approximation to the time-reversed generative process then provides an
effective way of sampling data of varying dimensionality by jointly generating
state values and dimensions. We demonstrate our approach on molecular and video
datasets of varying dimensionality, reporting better compatibility with
test-time diffusion guidance imputation tasks and improved interpolation
capabilities versus fixed dimensional models that generate state values and
dimensions separately.
Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
May 25, 2023
Xingqian Xu, Jiayi Guo, Zhangyang Wang, Gao Huang, Irfan Essa, Humphrey Shi
Text-to-image (T2I) research has grown explosively in the past year, owing to
the large-scale pre-trained diffusion models and many emerging personalization
and editing approaches. Yet, one pain point persists: text prompt engineering,
where searching for high-quality text prompts to obtain customized results is
more art than science. Moreover, as commonly argued, “an image is worth a
thousand words”: the attempt to describe a desired image with text often ends
up being ambiguous and cannot comprehensively cover delicate visual details,
hence necessitating additional controls from the visual domain. In this
paper, we take a bold step forward: taking “Text” out of a pre-trained T2I
diffusion model, to reduce the burdensome prompt engineering efforts for users.
Our proposed framework, Prompt-Free Diffusion, relies on only visual inputs to
generate new images: it takes a reference image as “context”, an optional image
structural conditioning, and an initial noise, with absolutely no text prompt.
The core architecture behind the scenes is the Semantic Context Encoder (SeeCoder),
substituting the commonly used CLIP-based or LLM-based text encoder. The
reusability of SeeCoder also makes it a convenient drop-in component: one can
also pre-train a SeeCoder in one T2I model and reuse it for another. Through
extensive experiments, Prompt-Free Diffusion is found to (i)
outperform prior exemplar-based image synthesis approaches; (ii) perform on par
with state-of-the-art T2I models using prompts following the best practice; and
(iii) be naturally extensible to other downstream applications such as anime
figure generation and virtual try-on, with promising quality. Our code and
models are open-sourced at https://github.com/SHI-Labs/Prompt-Free-Diffusion.
ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation
May 25, 2023
Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu
Score distillation sampling (SDS) has shown great promise in text-to-3D
generation by distilling pretrained large-scale text-to-image diffusion models,
but suffers from over-saturation, over-smoothing, and low-diversity problems.
In this work, we propose to model the 3D parameter as a random variable instead
of a constant as in SDS and present variational score distillation (VSD), a
principled particle-based variational framework to explain and address the
aforementioned issues in text-to-3D generation. We show that SDS is a special
case of VSD and leads to poor samples with both small and large CFG weights. In
comparison, VSD works well with various CFG weights as ancestral sampling from
diffusion models and simultaneously improves the diversity and sample quality
with a common CFG weight (i.e., $7.5$). We further present various improvements
in the design space for text-to-3D such as distillation time schedule and
density initialization, which are orthogonal to the distillation algorithm yet
not well explored. Our overall approach, dubbed ProlificDreamer, can generate
high rendering resolution (i.e., $512\times512$) and high-fidelity NeRF with
rich structure and complex effects (e.g., smoke and drops). Further,
initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and
photo-realistic. Project page and codes:
https://ml.cs.tsinghua.edu.cn/prolificdreamer/
Unifying GANs and Score-Based Diffusion as Generative Particle Models
May 25, 2023
Jean-Yves Franceschi, Mike Gartrell, Ludovic Dos Santos, Thibaut Issenhuth, Emmanuel de Bézenac, Mickaël Chen, Alain Rakotomamonjy
cs.LG, cs.CV, cs.NE, stat.ML
Particle-based deep generative models, such as gradient flows and score-based
diffusion models, have recently gained traction thanks to their striking
performance. Their principle of displacing particle distributions using
differential equations is conventionally seen as opposed to the previously
widespread generative adversarial networks (GANs), which involve training a
pushforward generator network. In this paper we challenge this interpretation,
and propose a novel framework that unifies particle and adversarial generative
models by framing generator training as a generalization of particle models.
This suggests that a generator is an optional addition to any such generative
model. Consequently, integrating a generator into a score-based diffusion model
and training a GAN without a generator naturally emerge from our framework. We
empirically test the viability of these original models as proofs of concepts
of potential applications of our framework.
DiffusionShield: A Watermark for Copyright Protection against Generative Diffusion Models
May 25, 2023
Yingqian Cui, Jie Ren, Han Xu, Pengfei He, Hui Liu, Lichao Sun, Yue Xing, Jiliang Tang
Recently, Generative Diffusion Models (GDMs) have showcased their remarkable
capabilities in learning and generating images. A large community around GDMs has
naturally emerged, further promoting their diversified application in
various fields. However, this unrestricted proliferation has raised serious
concerns about copyright protection. For example, artists including painters
and photographers are becoming increasingly concerned that GDMs could
effortlessly replicate their unique creative works without authorization. In
response to these challenges, we introduce a novel watermarking scheme,
DiffusionShield, tailored for GDMs. DiffusionShield protects images from
copyright infringement by GDMs through encoding the ownership information into
an imperceptible watermark and injecting it into the images. Its watermark can
be easily learned by GDMs and will be reproduced in their generated images. By
detecting the watermark from generated images, copyright infringement can be
exposed with evidence. Benefiting from the uniformity of the watermarks and the
joint optimization method, DiffusionShield ensures low distortion of the
original image, high watermark detection performance, and the ability to embed
lengthy messages. We conduct rigorous and comprehensive experiments to show the
effectiveness of DiffusionShield in defending against infringement by GDMs and
its superiority over traditional watermarking methods.
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
May 25, 2023
Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, Xinxiao Wu
Large pre-trained models have had a significant impact on computer vision by
enabling multi-modal learning, where the CLIP model has achieved impressive
results in image classification, object detection, and semantic segmentation.
However, the model’s performance on 3D point cloud processing tasks is limited
due to the domain gap between depth maps from 3D projection and training images
of CLIP. This paper proposes DiffCLIP, a new pre-training framework that
incorporates stable diffusion with ControlNet to minimize the domain gap in the
visual branch. Additionally, a style-prompt generation module is introduced for
few-shot tasks in the textual branch. Extensive experiments on the ModelNet10,
ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities
for 3D understanding. By using stable diffusion and style-prompt generation,
DiffCLIP achieves an accuracy of 43.2\% for zero-shot classification on OBJ_BG
of ScanObjectNN, which is state-of-the-art performance, and an accuracy of
80.6\% for zero-shot classification on ModelNet10, which is comparable to
state-of-the-art performance.
A Diffusion Probabilistic Prior for Low-Dose CT Image Denoising
May 25, 2023
Xuan Liu, Yaoqin Xie, Jun Cheng, Songhui Diao, Shan Tan, Xiaokun Liang
Denoising low-dose computed tomography (CT) images is a critical task in
medical image computing. Supervised deep learning-based approaches have made
significant advancements in this area in recent years. However, these methods
typically require pairs of low-dose and normal-dose CT images for training,
which are challenging to obtain in clinical settings. Existing unsupervised
deep learning-based methods often require training with a large number of
low-dose CT images or rely on specially designed data acquisition processes to
obtain training data. To address these limitations, we propose a novel
unsupervised method that only utilizes normal-dose CT images during training,
enabling zero-shot denoising of low-dose CT images. Our method leverages the
diffusion model, a powerful generative model. We begin by training a cascaded
unconditional diffusion model capable of generating high-quality normal-dose CT
images from low-resolution to high-resolution. The cascaded architecture makes
the training of high-resolution diffusion models more feasible. Subsequently,
we introduce low-dose CT images into the reverse process of the diffusion model
as the likelihood term, combine it with the prior provided by the diffusion model,
and iteratively solve multiple maximum a posteriori (MAP) problems to achieve
denoising. Additionally, we propose methods to adaptively adjust the
coefficients that balance the likelihood and prior in MAP estimations, allowing
for adaptation to different noise levels in low-dose CT images. We test our
method on low-dose CT datasets of different regions with varying dose levels.
The results demonstrate that our method outperforms the state-of-the-art
unsupervised method and surpasses several supervised deep learning-based
methods. Code is available at https://github.com/DeepXuan/Dn-Dp.
On Architectural Compression of Text-to-Image Diffusion Models
May 25, 2023
Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
Text-to-image (T2I) generation with Stable Diffusion models (SDMs) involves
high computing demands due to billion-scale parameters. To enhance efficiency,
recent studies have reduced sampling steps and applied network quantization
while retaining the original architectures. The lack of architectural reduction
attempts may stem from worries over expensive retraining for such massive
models. In this work, we uncover the surprising potential of block pruning and
feature distillation for low-cost general-purpose T2I. By removing several
residual and attention blocks from the U-Net of SDMs, we achieve a 30%-50%
reduction in model size, MACs, and latency. We show that distillation
retraining is effective even under limited resources: using only 13 A100 days
and a tiny dataset, our compact models can imitate the original SDMs (v1.4 and
v2.1-base with over 6,000 A100 days). Benefiting from the transferred
knowledge, our BK-SDMs deliver competitive results on zero-shot MS-COCO against
larger multi-billion parameter models. We further demonstrate the applicability
of our lightweight backbones in personalized generation and image-to-image
translation. Deployment of our models on edge devices attains 4-second
inference. We hope this work can help build small yet powerful diffusion models
with feasible training budgets. Code and models can be found at:
https://github.com/Nota-NetsPresso/BK-SDM
Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models
May 25, 2023
Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, Sungroh Yoon
Text-to-image diffusion models can generate diverse, high-fidelity images
based on user-provided text prompts. Recent research has extended these models
to support text-guided image editing. While text guidance is an intuitive
editing interface for users, it often fails to ensure the precise concept
conveyed by users. To address this issue, we propose Custom-Edit, in which we
(i) customize a diffusion model with a few reference images and then (ii)
perform text-guided editing. Our key discovery is that customizing only
language-relevant parameters with augmented prompts improves reference
similarity significantly while maintaining source similarity. Moreover, we
provide our recipe for each customization and editing process. We compare
popular customization methods and validate our findings on two editing methods
using various datasets.
Differentially Private Latent Diffusion Models
May 25, 2023
Saiyue Lyu, Margarita Vinaroz, Michael F. Liu, Mijung Park
Diffusion models (DMs) are widely used for generating high-quality
high-dimensional images in a non-differentially private manner. To address this
privacy challenge, recent papers suggest pre-training DMs with public data, then
fine-tuning them with private data using DP-SGD for a relatively short period.
In this paper, we further improve the current state of DMs with DP by adopting
the Latent Diffusion Models (LDMs). LDMs are equipped with powerful pre-trained
autoencoders that map the high-dimensional pixels into lower-dimensional latent
representations, in which DMs are trained, yielding a more efficient and fast
training of DMs. In our algorithm, DP-LDMs, rather than fine-tuning the entire
DMs, we fine-tune only the attention modules of LDMs at varying layers with
privacy-sensitive data, reducing the number of trainable parameters by roughly
90% and achieving better accuracy compared to fine-tuning the entire DMs.
The smaller parameter space to fine-tune with DP-SGD helps our algorithm
achieve new state-of-the-art results on several public-private benchmark data
pairs. Our approach also allows us to generate more realistic, high-dimensional
images (256x256) and those conditioned on text prompts with differential
privacy, which have not been attempted before us, to the best of our knowledge.
Our approach provides a promising direction for training more powerful, yet
training-efficient differentially private DMs, producing high-quality
high-dimensional DP images.
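A minimal sketch of the parameter-selection step the abstract describes, assuming a PyTorch U-Net whose attention modules contain "attn" or "attention" in their parameter names (an assumption; real LDM codebases differ). The DP-SGD wrapping itself, e.g. via Opacus, is only referenced in a comment and not shown.

```python
import torch.nn as nn

def freeze_all_but_attention(unet: nn.Module):
    """Freeze every parameter except those in (cross-)attention modules.

    The substring match on parameter names is illustrative only; adapt it
    to the naming scheme of the U-Net implementation actually used."""
    for name, param in unet.named_parameters():
        param.requires_grad = ("attn" in name) or ("attention" in name)
    return [p for p in unet.parameters() if p.requires_grad]

# Usage sketch: only the unfrozen parameters are handed to the optimizer,
# which a DP training setup would then wrap with a DP-SGD engine.
# trainable = freeze_all_but_attention(unet)
# optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```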
Score-Based Multimodal Autoencoders
May 25, 2023
Daniel Wesego, Amirmohammad Rooshenas
Multimodal Variational Autoencoders (VAEs) represent a promising group of
generative models that facilitate the construction of a tractable posterior
within the latent space, given multiple modalities. Daunhawer et al. (2022)
demonstrate that as the number of modalities increases, the generative quality
of each modality declines. In this study, we explore an alternative approach to
enhance the generative performance of multimodal VAEs by jointly modeling the
latent space of unimodal VAEs using score-based models (SBMs). The role of the
SBM is to enforce multimodal coherence by learning the correlation among the
latent variables. Consequently, our model combines the superior generative
quality of unimodal VAEs with coherent integration across different modalities.
Zero-shot Generation of Training Data with Denoising Diffusion Probabilistic Model for Handwritten Chinese Character Recognition
May 25, 2023
Dongnan Gui, Kai Chen, Haisong Ding, Qiang Huo
There are more than 80,000 character categories in Chinese, most of which
are rarely used. To build a high-performance handwritten Chinese character
recognition (HCCR) system supporting the full character set with a traditional
approach, many training samples need to be collected for each character category,
which is both time-consuming and expensive. In this paper, we propose a novel
approach to transforming Chinese character glyph images generated from font
libraries to handwritten ones with a denoising diffusion probabilistic model
(DDPM). Trained on handwritten samples of a small character set, the DDPM is
capable of mapping printed strokes to handwritten ones, which makes it possible
to generate photo-realistic and diverse style handwritten samples of unseen
character categories. Combining DDPM-synthesized samples of unseen categories
with real samples of other categories, we can build an HCCR system to support
the full character set. Experimental results on CASIA-HWDB dataset with 3,755
character categories show that HCCR systems trained with synthetic samples
perform similarly to the one trained with real samples in terms of
recognition accuracy. The proposed method has the potential to address HCCR
with a larger vocabulary.
Manifold Diffusion Fields
May 24, 2023
Ahmed A. Elhag, Joshua M. Susskind, Miguel Angel Bautista
We present Manifold Diffusion Fields (MDF), an approach to learn generative
models of continuous functions defined over Riemannian manifolds. Leveraging
insights from spectral geometry analysis, we define an intrinsic coordinate
system on the manifold via the eigenfunctions of the Laplace-Beltrami
operator. MDF represents functions using an explicit parametrization formed by
a set of multiple input-output pairs. Our approach allows us to sample continuous
functions on manifolds and is invariant with respect to rigid and isometric
transformations of the manifold. Empirical results on several datasets and
manifolds show that MDF can capture distributions of such functions with better
diversity and fidelity than previous approaches.
Solving Diffusion ODEs with Optimal Boundary Conditions for Better Image Super-Resolution
May 24, 2023
Yiyang Ma, Huan Yang, Wenhan Yang, Jianlong Fu, Jiaying Liu
Diffusion models, as a kind of powerful generative model, have given
impressive results on image super-resolution (SR) tasks. However, due to the
randomness introduced in the reverse process of diffusion models, the
performance of diffusion-based SR models fluctuates from one sampling run to
another, especially for samplers with few resampled steps. This inherent
randomness results in ineffectiveness and instability, making it challenging
for users to guarantee the quality of SR results. In contrast, our work treats
this randomness as an opportunity: fully analyzing and leveraging it leads to
an effective plug-and-play sampling method with the potential to benefit a
series of diffusion-based SR methods. In more detail, we propose to steadily
sample high-quality SR images
from pre-trained diffusion-based SR models by solving diffusion ordinary
differential equations (diffusion ODEs) with optimal boundary conditions (BCs)
and analyze the characteristics between the choices of BCs and their
corresponding SR results. Our analysis shows the route to obtain an
approximately optimal BC via an efficient exploration in the whole space. The
quality of SR results sampled by the proposed method with fewer steps
outperforms the quality of results sampled by current methods with randomness
from the same pre-trained diffusion-based SR model, which means that our
sampling method “boosts” current diffusion-based SR models without any
additional training.
Training Energy-Based Normalizing Flow with Score-Matching Objectives
May 24, 2023
Chen-Hao Chao, Wei-Fang Sun, Yen-Chang Hsu, Zsolt Kira, Chun-Yi Lee
In this paper, we establish a connection between the parameterization of
flow-based and energy-based generative models, and present a new flow-based
modeling approach called energy-based normalizing flow (EBFlow). We demonstrate
that by optimizing EBFlow with score-matching objectives, the computation of
Jacobian determinants for linear transformations can be entirely bypassed. This
feature enables the use of arbitrary linear layers in the construction of
flow-based models without increasing the computational time complexity of each
training iteration from $O(D^2L)$ to $O(D^3L)$ for an $L$-layered model that
accepts $D$-dimensional inputs. This makes the training of EBFlow more
efficient than the commonly-adopted maximum likelihood training method. In
addition to the reduction in runtime, we enhance the training stability and
empirical performance of EBFlow through a number of techniques developed based
on our analysis of the score-matching methods. The experimental results
demonstrate that our approach achieves a significant speedup compared to
maximum likelihood estimation while outperforming prior methods by a
noticeable margin in terms of negative log-likelihood (NLL).
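For context, a minimal sketch of a generic denoising score-matching loss, which illustrates why score-based objectives of this kind never need the log-determinant of a flow's Jacobian; this is a standard surrogate, not the authors' exact EBFlow objective, and the linear "score network" is a toy stand-in.

```python
import torch

def denoising_score_matching_loss(score_fn, x, sigma=0.1):
    """Match the model score on perturbed data to the Gaussian-perturbation
    score, -(x_tilde - x) / sigma^2. No Jacobian determinant is required."""
    eps = torch.randn_like(x)
    x_tilde = x + sigma * eps
    target = -(x_tilde - x) / sigma**2        # equals -eps / sigma
    pred = score_fn(x_tilde)
    return ((pred - target) ** 2).sum(dim=-1).mean()

# Toy usage with a linear stand-in for the score network.
score_net = torch.nn.Linear(10, 10)
x_batch = torch.randn(64, 10)
loss = denoising_score_matching_loss(score_net, x_batch)
loss.backward()
```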
Diffusion-Based Audio Inpainting
May 24, 2023
Eloi Moliner, Vesa Välimäki
Audio inpainting aims to reconstruct missing segments in corrupted
recordings. Most existing methods produce plausible reconstructions when the
gap lengths are short, but struggle to reconstruct gaps larger than about 100
ms. This paper explores recent advancements in deep learning and, particularly,
diffusion models, for the task of audio inpainting. The proposed method uses an
unconditionally trained generative model, which can be conditioned in a
zero-shot fashion for audio inpainting, and is able to regenerate gaps of any
size. An improved deep neural network architecture based on the constant-Q
transform, which allows the model to exploit pitch-equivariant symmetries in
audio, is also presented. The performance of the proposed algorithm is
evaluated through objective and subjective metrics for the task of
reconstructing short to mid-sized gaps, up to 300 ms. The results of a formal
listening test show that the proposed method delivers performance comparable
to the baselines for short gaps, such as 50 ms, while retaining
good audio quality and outperforming the baselines for wider gaps that are up
to 300 ms long. The method presented in this paper can be applied to restoring
sound recordings that suffer from severe local disturbances or dropouts, which
must be reconstructed.
Robust Classification via a Single Diffusion Model
May 24, 2023
Huanran Chen, Yinpeng Dong, Zhengyi Wang, Xiao Yang, Chengqi Duan, Hang Su, Jun Zhu
Recently, diffusion models have been successfully applied to improving
adversarial robustness of image classifiers by purifying the adversarial noises
or generating realistic data for adversarial training. However, the
diffusion-based purification can be evaded by stronger adaptive attacks while
adversarial training does not perform well under unseen threats, exhibiting
inevitable limitations of these methods. To better harness the expressive power
of diffusion models, in this paper we propose Robust Diffusion Classifier
(RDC), a generative classifier that is constructed from a pre-trained diffusion
model to be adversarially robust. Our method first maximizes the data
likelihood of a given input and then predicts the class probabilities of the
optimized input using the conditional likelihood of the diffusion model through
Bayes’ theorem. Since our method does not require training on particular
adversarial attacks, we demonstrate that it is more generalizable to defend
against multiple unseen threats. In particular, RDC achieves $73.24\%$ robust
accuracy against $\ell_\infty$ norm-bounded perturbations with
$\epsilon_\infty=8/255$ on CIFAR-10, surpassing the previous state-of-the-art
adversarial training models by $+2.34\%$. The findings highlight the potential
of generative classifiers by employing diffusion models for adversarial
robustness compared with the commonly studied discriminative classifiers.
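A minimal sketch of the Bayes-rule step behind a diffusion-based generative classifier: approximate log p(x | y) for each class with a Monte Carlo class-conditional denoising loss (an ELBO surrogate) and take a softmax under a uniform prior. This is not RDC's full procedure (which also maximizes the data likelihood of the input first); the noise schedule and `eps_model` stub are assumptions, and the snippet handles a single flattened input.

```python
import torch
import torch.nn.functional as F

def eps_model(x_t, t, y):
    # Placeholder class-conditional noise predictor; a real classifier
    # would call a pretrained conditional diffusion model here.
    return torch.zeros_like(x_t)

def diffusion_class_logits(x, classes, n_timesteps=50, n_mc=8):
    """Treat the negative denoising loss per class as an unnormalized
    log-posterior for a single input x of shape (1, D)."""
    logits = []
    for y in classes:
        losses = []
        for _ in range(n_mc):
            t = torch.randint(1, n_timesteps, (x.shape[0],))
            noise = torch.randn_like(x)
            alpha_bar = 1.0 - t.float().view(-1, 1) / n_timesteps  # toy schedule
            x_t = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * noise
            losses.append(F.mse_loss(eps_model(x_t, t, y), noise))
        logits.append(-torch.stack(losses).mean())
    return torch.stack(logits)

x = torch.randn(1, 32)                                    # toy flattened input
logits = diffusion_class_logits(x, classes=range(10))
probs = F.softmax(logits, dim=0)                          # Bayes posterior, uniform prior
```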
DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models
May 24, 2023
Sungnyun Kim, Junsoo Lee, Kibeom Hong, Daesik Kim, Namhyuk Ahn
In this study, we aim to extend the capabilities of diffusion-based
text-to-image (T2I) generation models by incorporating diverse modalities
beyond textual description, such as sketch, box, color palette, and style
embedding, within a single model. We thus design a multimodal T2I diffusion
model, coined as DiffBlender, by separating the channels of conditions into
three types, i.e., image forms, spatial tokens, and non-spatial tokens. The
unique architecture of DiffBlender facilitates adding new input modalities,
pioneering a scalable framework for conditional image generation. Notably, we
achieve this without altering the parameters of the existing generative model,
Stable Diffusion, updating only a few partial components. Our study establishes
new benchmarks in multimodal generation through quantitative and qualitative
comparisons with existing conditional generation methods. We demonstrate that
DiffBlender faithfully blends all the provided information and showcase its
various applications in detailed image synthesis.
Deceptive-NeRF: Enhancing NeRF Reconstruction using Pseudo-Observations from Diffusion Models
May 24, 2023
Xinhang Liu, Jiaben Chen, Shiu-hong Kao, Yu-Wing Tai, Chi-Keung Tang
We introduce Deceptive-NeRF, a novel methodology for few-shot NeRF
reconstruction, which leverages diffusion models to synthesize plausible
pseudo-observations to improve the reconstruction. This approach unfolds
through three key steps: 1) reconstructing a coarse NeRF from sparse input
data; 2) utilizing the coarse NeRF to render images and subsequently generating
pseudo-observations based on them; 3) training a refined NeRF model utilizing
input images augmented with pseudo-observations. We develop a deceptive
diffusion model that adeptly transitions RGB images and depth maps from coarse
NeRFs into photo-realistic pseudo-observations, all while preserving scene
semantics for reconstruction. Furthermore, we propose a progressive strategy
for training the Deceptive-NeRF, using the current NeRF renderings to create
pseudo-observations that enhance the next iteration’s NeRF. Extensive
experiments demonstrate that our approach is capable of synthesizing
photo-realistic novel views, even for highly complex scenes with very sparse
inputs. Codes will be released.
Unpaired Image-to-Image Translation via Neural Schrödinger Bridge
May 24, 2023
Beomsu Kim, Gihyun Kwon, Kwanyoung Kim, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion models are a powerful class of generative models which simulate
stochastic differential equations (SDEs) to generate data from noise. Although
diffusion models have achieved remarkable progress in recent years, they have
limitations in the unpaired image-to-image translation tasks due to the
Gaussian prior assumption. The Schrödinger Bridge (SB), which learns an SDE to
translate between two arbitrary distributions, has risen as an attractive
solution to this problem. However, none of the SB models so far have been
successful at unpaired translation between high-resolution images. In this
work, we propose the Unpaired Neural Schrödinger Bridge (UNSB), which
expresses the SB problem as a sequence of adversarial learning problems. This
allows us to incorporate advanced discriminators and regularization to learn
an SB between unpaired data. We demonstrate that UNSB is scalable and successfully
solves various unpaired image-to-image translation tasks. Code:
\url{https://github.com/cyclomon/UNSB}
DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion
May 24, 2023
Taesun Yeom, Minhyeok Lee
Class-conditional image generation using generative adversarial networks
(GANs) has been investigated through various techniques; however, it continues
to face challenges such as mode collapse, training instability, and low-quality
output in cases of datasets with high intra-class variation. Furthermore, most
GANs require many iterations to converge, resulting in poor iteration efficiency
during training. While Diffusion-GAN has shown potential in generating
realistic samples, it has a critical limitation in generating class-conditional
samples. To overcome these limitations, we propose a novel approach for
class-conditional image generation using GANs called DuDGAN, which incorporates
a dual diffusion-based noise injection process. Our method consists of three
unique networks: a discriminator, a generator, and a classifier. During the
training process, Gaussian-mixture noises are injected into the two noise-aware
networks, the discriminator and the classifier, in distinct ways. This noisy
data helps to prevent overfitting by gradually introducing more challenging
tasks, leading to improved model performance. As a result, our method
outperforms state-of-the-art conditional GAN models for image generation.
We evaluated our method using the AFHQ, Food-101, and CIFAR-10 datasets and
observed superior results across metrics such as FID, KID, Precision, and
Recall compared with competing models, highlighting
the effectiveness of our approach.
Generative Modeling through the Semi-dual Formulation of Unbalanced Optimal Transport
May 24, 2023
Jaemoo Choi, Jaewoong Choi, Myungjoo Kang
The Optimal Transport (OT) problem investigates a transport map that bridges two
distributions while minimizing a given cost function. In this regard, OT
between a tractable prior distribution and the data has been utilized for generative
modeling tasks. However, OT-based methods are susceptible to outliers and face
optimization challenges during training. In this paper, we propose a novel
generative model based on the semi-dual formulation of Unbalanced Optimal
Transport (UOT). Unlike OT, UOT relaxes the hard constraint on distribution
matching. This approach provides better robustness against outliers, stability
during training, and faster convergence. We validate these properties
empirically through experiments. Moreover, we study the theoretical upper-bound
of divergence between distributions in UOT. Our model outperforms existing
OT-based generative models, achieving FID scores of 2.97 on CIFAR-10 and 5.80
on CelebA-HQ-256. The code is available at
\url{https://github.com/Jae-Moo/UOTM}.
On the Generalization of Diffusion Model
May 24, 2023
Mingyang Yi, Jiacheng Sun, Zhenguo Li
Diffusion probabilistic generative models are widely used to generate
high-quality data. Though they can synthesize data that does not exist in the
training set, the rationale behind such generalization is still unexplored. In
this paper, we formally define the generalization of the generative model,
which is measured by the mutual information between the generated data and the
training set. The definition originates from the intuition that the model which
generates data with less correlation to the training set exhibits better
generalization ability. Meanwhile, we show that for the empirically optimal
diffusion model, the data generated by a deterministic sampler are all highly
related to the training set, implying poor generalization. This result contradicts
the observed extrapolation ability (generating unseen data) of trained diffusion
models, which approximate the empirical optimum. To understand this
contradiction, we empirically examine the difference between the sufficiently
trained diffusion model and the empirical optimum. We find that, even after
sufficient training, a slight difference between them remains, and this
difference is critical to making the diffusion model generalizable. Moreover,
we propose another training objective whose empirical optimal solution has no
potential generalization problem. We empirically show that the proposed
training objective returns a similar model to the original one, which further
verifies the generalization ability of the trained diffusion model.
Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models
May 24, 2023
Zhongjie Duan, Chengyu Wang, Cen Chen, Jun Huang, Weining Qian
In recent years, diffusion models have become the most popular and powerful
methods in the field of image synthesis, even rivaling human artists in
artistic creativity. However, the key issue currently limiting the application
of diffusion models is their extremely slow generation process. Although several
methods have been proposed to speed up the generation process, there still exists a
trade-off between efficiency and quality. In this paper, we first provide a
detailed theoretical and empirical analysis of the generation process of the
diffusion models based on schedulers. We transform the scheduler design problem
into the determination of several parameters, and further transform
the accelerated generation process into an expansion process of the linear
subspace. Based on these analyses, we consequently propose a novel method
called Optimal Linear Subspace Search (OLSS), which accelerates the generation
process by searching for the optimal approximation process of the complete
generation process in the linear subspaces spanned by latent variables. OLSS is
able to generate high-quality images with a very small number of steps. To
demonstrate the effectiveness of our method, we conduct extensive comparative
experiments on open-source diffusion models. Experimental results show that
with a given number of steps, OLSS can significantly improve the quality of
generated images. Using an NVIDIA A100 GPU, we make it possible to generate a
high-quality image with Stable Diffusion within only one second, without other
optimization techniques.
T1: Scaling Diffusion Probabilistic Fields to High-Resolution on Unified Visual Modalities
May 24, 2023
Kangfu Mei, Mo Zhou, Vishal M. Patel
Diffusion Probabilistic Field (DPF) models the distribution of continuous
functions defined over metric spaces. While DPF shows great potential for
unifying data generation of various modalities including images, videos, and 3D
geometry, it does not scale to a higher data resolution. This can be attributed
to the “scaling property”, where it is difficult for the model to capture
local structures through uniform sampling. To this end, we propose a new model
comprising a view-wise sampling algorithm to focus on local structure
learning and incorporating additional guidance, e.g., text descriptions, to
complement the global geometry. The model can be scaled to generate
high-resolution data while unifying multiple modalities. Experimental results
on data generation in various modalities demonstrate the effectiveness of our
model, as well as its potential as a foundation framework for scalable
modality-unified visual content generation.
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
May 23, 2023
Grace Luo, Lisa Dunlap, Dong Huk Park, Aleksander Holynski, Trevor Darrell
Diffusion models have been shown to be capable of generating high-quality
images, suggesting that they could contain meaningful internal representations.
Unfortunately, the feature maps that encode a diffusion model’s internal
information are spread not only over layers of the network, but also over
diffusion timesteps, making it challenging to extract useful descriptors. We
propose Diffusion Hyperfeatures, a framework for consolidating multi-scale and
multi-timestep feature maps into per-pixel feature descriptors that can be used
for downstream tasks. These descriptors can be extracted for both synthetic and
real images using the generation and inversion processes. We evaluate the
utility of our Diffusion Hyperfeatures on the task of semantic keypoint
correspondence: our method achieves superior performance on the SPair-71k real
image benchmark. We also demonstrate that our method is flexible and
transferable: our feature aggregation network trained on the inversion features
of real image pairs can be used on the generation features of synthetic image
pairs with unseen objects and compositions. Our code is available at
\url{https://diffusion-hyperfeatures.github.io}.
SEEDS: Exponential SDE Solvers for Fast High-Quality Sampling from Diffusion Models
May 23, 2023
Martin Gonzalez, Nelson Fernandez, Thuy Tran, Elies Gherbi, Hatem Hajri, Nader Masmoudi
cs.LG, cs.CV, cs.NA, math.NA, I.2.6
A potent class of generative models known as Diffusion Probabilistic Models
(DPMs) has become prominent. A forward diffusion process gradually adds noise
to data, while a model learns to gradually denoise. Sampling from pre-trained
DPMs is obtained by solving differential equations (DEs) defined by the learnt
model, a process which has been shown to be prohibitively slow. Numerous efforts
to speed up this process have consisted of crafting powerful ODE solvers.
Despite being quick, such solvers do not usually reach the optimal quality
achieved by available slow SDE solvers. Our goal is to propose SDE solvers that
reach optimal quality without requiring several hundreds or thousands of NFEs
to achieve that goal. We propose Stochastic Explicit Exponential
Derivative-free Solvers (SEEDS), improving and generalizing Exponential
Integrator approaches to the stochastic case on several frameworks. After
carefully analyzing the formulation of exact solutions of diffusion SDEs, we
craft SEEDS to analytically compute the linear part of such solutions. Inspired
by the Exponential Time-Differencing method, SEEDS use a novel treatment of the
stochastic components of solutions, enabling the analytical computation of
their variance, and contain high-order terms that allow them to reach optimal-quality
sampling $\sim3$-$5\times$ faster than previous SDE methods. We validate our
approach on several image generation benchmarks, showing that SEEDS outperform
or are competitive with previous SDE solvers. Contrary to the latter, SEEDS are
derivative and training free, and we fully prove strong convergence guarantees
for them.
Improved Convergence of Score-Based Diffusion Models via Prediction-Correction
May 23, 2023
Francesco Pedrotti, Jan Maas, Marco Mondelli
cs.LG, math.ST, stat.ML, stat.TH
Score-based generative models (SGMs) are powerful tools to sample from
complex data distributions. Their underlying idea is to (i) run a forward
process for time $T_1$ by adding noise to the data, (ii) estimate its score
function, and (iii) use such estimate to run a reverse process. As the reverse
process is initialized with the stationary distribution of the forward one, the
existing analysis paradigm requires $T_1\to\infty$. This is however
problematic: from a theoretical viewpoint, for a given precision of the score
approximation, the convergence guarantee fails as $T_1$ diverges; from a
practical viewpoint, a large $T_1$ increases computational costs and leads to
error propagation. This paper addresses the issue by considering a version of
the popular predictor-corrector scheme: after running the forward process, we
first estimate the final distribution via an inexact Langevin dynamics and then
revert the process. Our key technical contribution is to provide convergence
guarantees which only require running the forward process for a fixed finite
time $T_1$. Our bounds exhibit a mild logarithmic dependence on the input
dimension and the subgaussian norm of the target distribution, have minimal
assumptions on the data, and require only to control the $L^2$ loss on the
score approximation, which is the quantity minimized in practice.
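As a concrete illustration of the correction step described above, here is a minimal sketch of an (inexact) Langevin corrector applied to the terminal samples of a finite-time forward process before the reverse process is run. The stubbed score and step sizes are assumptions, not the paper's analysis.

```python
import torch

def langevin_correct(x, score_fn, n_steps=100, step_size=1e-3):
    """Unadjusted Langevin iterations targeting the distribution whose
    score is `score_fn`; used to correct the terminal distribution of a
    finite-time forward process before reverting it."""
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = x + step_size * score_fn(x) + (2 * step_size) ** 0.5 * noise
    return x

# Toy usage: nudge imperfect terminal samples toward N(0, I) (score = -x).
x_T1 = torch.randn(128, 2) * 1.5 + 0.3
x_corrected = langevin_correct(x_T1, lambda x: -x)
```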
Realistic Noise Synthesis with Diffusion Models
May 23, 2023
Qi Wu, Mingyan Han, Ting Jiang, Haoqiang Fan, Bing Zeng, Shuaicheng Liu
Deep image denoising models often rely on a large amount of training data for
high-quality performance. However, it is challenging to obtain a sufficient
amount of data under real-world scenarios for supervised training. As such,
synthesizing realistic noise becomes an important solution. However, existing
techniques have limitations in modeling complex noise distributions, resulting
in residual noise and edge artifacts in denoising methods relying on synthetic
data. To overcome these challenges, we propose a novel method that synthesizes
realistic noise using diffusion models, namely Realistic Noise Synthesize
Diffusor (RNSD). In particular, the proposed time-aware controlling module can
simulate various environmental conditions under given camera settings. RNSD can
incorporate guided multiscale content, such that more realistic noise with
spatial correlations can be generated at multiple frequencies. In addition, we
construct an inversion mechanism to predict the unknown camera setting, which
enables the extension of RNSD to datasets without setting information.
Extensive experiments demonstrate that our RNSD method significantly
outperforms existing methods not only in the realism of the synthesized noise
under multiple metrics, but also in single-image denoising performance.
Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models
May 23, 2023
Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, Xiaodong Lin
Recent text-to-image (T2I) diffusion models show outstanding performance in
generating high-quality images conditioned on textual prompts. However, they
fail to semantically align the generated images with the prompts due to their
limited compositional capabilities, leading to attribute leakage, entity
leakage, and missing entities. In this paper, we propose a novel attention mask
control strategy based on predicted object boxes to address these issues. In
particular, we first train a BoxNet to predict a box for each entity that
possesses the attribute specified in the prompt. Then, depending on the
predicted boxes, a unique mask control is applied to the cross- and
self-attention maps. Our approach produces a more semantically accurate
synthesis by constraining the attention regions of each token in the prompt to
the image. In addition, the proposed method is straightforward and effective
and can be readily integrated into existing cross-attention-based T2I
generators. We compare our approach to competing methods and demonstrate that
it can faithfully convey the semantics of the original text to the generated
content and achieve high availability as a ready-to-use plugin. Please refer to
https://github.com/OPPOMente-Lab/attention-mask-control.
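A minimal sketch of the masking step implied above: rasterize a predicted box into a binary spatial mask and use it to restrict where one text token can attend, then renormalize. Mask granularity, shapes, and the renormalization are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def box_to_mask(box, h, w):
    """Rasterize a box (x0, y0, x1, y1) with coordinates in [0, 1] to an h x w mask."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def masked_cross_attention(attn, token_idx, box, h, w):
    """Zero one token's attention outside its box, then renormalize.

    `attn` has shape (heads, h * w, n_tokens): spatial queries over text tokens."""
    mask = box_to_mask(box, h, w).flatten()                 # (h * w,)
    attn = attn.clone()
    attn[:, :, token_idx] = attn[:, :, token_idx] * mask    # broadcast over heads
    return attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

attn = torch.rand(8, 16 * 16, 77)                           # toy attention map
attn = masked_cross_attention(attn, token_idx=5, box=(0.2, 0.2, 0.8, 0.8), h=16, w=16)
```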
Unsafe Diffusion: On the Generation of Unsafe Images and Hateful Memes From Text-To-Image Models
May 23, 2023
Yiting Qu, Xinyue Shen, Xinlei He, Michael Backes, Savvas Zannettou, Yang Zhang
cs.CV, cs.CR, cs.CY, cs.LG, cs.SI
State-of-the-art Text-to-Image models like Stable Diffusion and DALLE$\cdot$2
are revolutionizing how people generate visual content. At the same time,
society has serious concerns about how adversaries can exploit such models to
generate unsafe images. In this work, we focus on demystifying the generation
of unsafe images and hateful memes from Text-to-Image models. We first
construct a typology of unsafe images consisting of five categories (sexually
explicit, violent, disturbing, hateful, and political). Then, we assess the
proportion of unsafe images generated by four advanced Text-to-Image models
using four prompt datasets. We find that these models can generate a
substantial percentage of unsafe images; across four models and four prompt
datasets, 14.56% of all generated images are unsafe. When comparing the four
models, we find different risk levels, with Stable Diffusion being the most
prone to generating unsafe content (18.92% of all generated images are unsafe).
Given Stable Diffusion’s tendency to generate more unsafe content, we evaluate
its potential to generate hateful meme variants if exploited by an adversary to
attack a specific individual or community. We employ three image editing
methods, DreamBooth, Textual Inversion, and SDEdit, which are supported by
Stable Diffusion. Our evaluation result shows that 24% of the generated images
using DreamBooth are hateful meme variants that present the features of the
original hateful meme and the target individual/community; these generated
images are comparable to hateful meme variants collected from the real world.
Overall, our results demonstrate that the danger of large-scale generation of
unsafe images is imminent. We discuss several mitigating measures, such as
curating training data, regulating prompts, and implementing safety filters,
and encourage better safeguard tools to be developed to prevent unsafe
generation.
Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models
May 23, 2023
Weifeng Chen, Yatai Ji, Jie Wu, Hefeng Wu, Pan Xie, Jiashi Li, Xin Xia, Xuefeng Xiao, Liang Lin
cs.CV, cs.AI, cs.LG, cs.MM
Recent advancements in diffusion models have unlocked unprecedented abilities
in visual creation. However, current text-to-video generation models struggle
with the trade-off among movement range, action coherence and object
consistency. To mitigate this issue, we present a controllable text-to-video
(T2V) diffusion model, called Control-A-Video, capable of maintaining
consistency while allowing customizable video synthesis. Based on a pre-trained
conditional text-to-image (T2I) diffusion model, our model aims to generate
videos conditioned on a sequence of control signals, such as edge or depth
maps. For the purpose of improving object consistency, Control-A-Video
integrates motion priors and content priors into video generation. We propose
two motion-adaptive noise initialization strategies, which are based on pixel
residual and optical flow, to introduce motion priors from input videos,
producing more coherent videos. Moreover, a first-frame conditioned controller
is proposed to generate videos from content priors of the first frame, which
facilitates the semantic alignment with text and allows longer video generation
in an auto-regressive manner. With the proposed architecture and strategies,
our model achieves resource-efficient convergence and generates consistent and
coherent videos with fine-grained control. Extensive experiments demonstrate
its success in various video generative tasks such as video editing and video
style transfer, outperforming previous methods in terms of consistency and
quality.
WaveDM: Wavelet-Based Diffusion Models for Image Restoration
May 23, 2023
Yi Huang, Jiancheng Huang, Jianzhuang Liu, Yu Dong, Jiaxi Lv, Shifeng Chen
The latest diffusion-based methods for many image restoration tasks outperform
traditional models, but they suffer from long inference times. To
tackle it, this paper proposes a Wavelet-Based Diffusion Model (WaveDM) with an
Efficient Conditional Sampling (ECS) strategy. WaveDM learns the distribution
of clean images in the wavelet domain conditioned on the wavelet spectrum of
degraded images after the wavelet transform, which saves time in each sampling
step compared to modeling in the spatial domain. In addition, ECS follows the
same procedure as the deterministic implicit sampling in the initial sampling
period and then stops to predict clean images directly, which reduces the
number of total sampling steps to around 5. Evaluations on four benchmark
datasets including image raindrop removal, defocus deblurring, demoiréing,
and denoising demonstrate that WaveDM achieves state-of-the-art performance
with the efficiency that is comparable to traditional one-pass methods and over
100 times faster than existing image restoration methods using vanilla
diffusion models.
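For orientation, a minimal sketch of the wavelet-domain representation idea using PyWavelets: transform an image into its sub-bands and stack them along the channel axis, which is the kind of tensor a wavelet-domain diffusion model would operate on. The Haar wavelet and channel layout are assumptions, not the paper's exact configuration.

```python
import numpy as np
import pywt

def to_wavelet_stack(img):
    """Haar DWT of a (C, H, W) image; stack the four sub-bands along the
    channel axis, giving a (4C, H/2, W/2) array."""
    cA, (cH, cV, cD) = pywt.dwt2(img, "haar", axes=(-2, -1))
    return np.concatenate([cA, cH, cV, cD], axis=0)

def from_wavelet_stack(stack):
    """Invert `to_wavelet_stack` back to a (C, H, W) image."""
    cA, cH, cV, cD = np.split(stack, 4, axis=0)
    return pywt.idwt2((cA, (cH, cV, cD)), "haar", axes=(-2, -1))

img = np.random.rand(3, 64, 64).astype(np.float32)
stack = to_wavelet_stack(img)        # shape (12, 32, 32)
recon = from_wavelet_stack(stack)    # ~ img up to numerical error
```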
DiffHand: End-to-End Hand Mesh Reconstruction via Diffusion Models
May 23, 2023
Lijun Li, Li'an Zhuo, Bang Zhang, Liefeng Bo, Chen Chen
Hand mesh reconstruction from a monocular image is a challenging task due
to depth ambiguity and severe occlusion; there remains a non-unique mapping
between the monocular image and the hand mesh. To address this, we develop
DiffHand, the first diffusion-based framework that approaches hand mesh
reconstruction as a denoising diffusion process. Our one-stage pipeline
utilizes noise to model the uncertainty distribution of the intermediate hand
mesh in a forward process. We reformulate the denoising diffusion process to
gradually refine the noisy hand mesh and then select the mesh with the highest
probability of being correct based on the image itself, rather than relying on
2D joints extracted beforehand. To better model the connectivity of hand
vertices, we design a novel network module called the cross-modality decoder.
Extensive experiments on the popular benchmarks demonstrate that our method
outperforms the state-of-the-art hand mesh reconstruction approaches,
achieving 5.8mm PA-MPJPE on the FreiHAND test set and 4.98mm PA-MPJPE on the
DexYCB test set.
Training Diffusion Models with Reinforcement Learning
May 22, 2023
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine
Diffusion models are a class of flexible generative models trained with an
approximation to the log-likelihood objective. However, most use cases of
diffusion models are not concerned with likelihoods, but instead with
downstream objectives such as human-perceived image quality or drug
effectiveness. In this paper, we investigate reinforcement learning methods for
directly optimizing diffusion models for such objectives. We describe how
posing denoising as a multi-step decision-making problem enables a class of
policy gradient algorithms, which we refer to as denoising diffusion policy
optimization (DDPO), that are more effective than alternative reward-weighted
likelihood approaches. Empirically, DDPO is able to adapt text-to-image
diffusion models to objectives that are difficult to express via prompting,
such as image compressibility, and those derived from human feedback, such as
aesthetic quality. Finally, we show that DDPO can improve prompt-image
alignment using feedback from a vision-language model without the need for
additional data collection or human annotation. The project’s website can be
found at http://rl-diffusion.github.io .
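A minimal sketch of the policy-gradient view described above: the log-probabilities of each stochastic denoising transition are weighted by a terminal reward, REINFORCE-style. This is a simplified stand-in, not the full DDPO algorithm (which adds importance sampling and clipping in its PPO-style variant); all tensors here are toy placeholders.

```python
import torch

def ddpo_reinforce_loss(log_probs, rewards):
    """REINFORCE-style objective over a denoising trajectory.

    log_probs: (B, T) log-probabilities of each stochastic denoising
               transition under the current model.
    rewards:   (B,) terminal reward per generated sample
               (e.g., an aesthetic or compressibility score)."""
    advantages = rewards - rewards.mean()            # simple mean baseline
    return -(log_probs.sum(dim=1) * advantages).mean()

# Toy usage with fake trajectory statistics.
log_probs = torch.randn(16, 50, requires_grad=True)
rewards = torch.rand(16)
loss = ddpo_reinforce_loss(log_probs, rewards)
loss.backward()
```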
GSURE-Based Diffusion Model Training with Corrupted Data
May 22, 2023
Bahjat Kawar, Noam Elata, Tomer Michaeli, Michael Elad
Diffusion models have demonstrated impressive results in both data generation
and downstream tasks such as inverse problems, text-based editing,
classification, and more. However, training such models usually requires large
amounts of clean signals which are often difficult or impossible to obtain. In
this work, we propose a novel training technique for generative diffusion
models based only on corrupted data. We introduce a loss function based on the
Generalized Stein’s Unbiased Risk Estimator (GSURE), and prove that under some
conditions, it is equivalent to the training objective used in fully supervised
diffusion models. We demonstrate our technique on face images as well as
Magnetic Resonance Imaging (MRI), where the use of undersampled data
significantly alleviates data collection costs. Our approach achieves
generative performance comparable to its fully supervised counterpart without
training on any clean signals. In addition, we deploy the resulting diffusion
model in various downstream tasks beyond the degradation present in the
training set, showcasing promising results.
AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
May 22, 2023
Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz
cs.SD, cs.CV, cs.LG, eess.AS
In recent years, image generation has shown a great leap in performance,
where diffusion models play a central role. Although such models generate
high-quality images, they are mainly conditioned on textual descriptions. This begs
the question: “how can we adapt such models to be conditioned on other
modalities?”. In this paper, we propose a novel method utilizing latent
diffusion models trained for text-to-image generation to generate images
conditioned on audio recordings. Using a pre-trained audio encoding model, the
proposed method encodes audio into a new token, which can be considered as an
adaptation layer between the audio and text representations. Such a modeling
paradigm requires a small number of trainable parameters, making the proposed
approach appealing for lightweight optimization. Results suggest the proposed
method is superior to the evaluated baseline methods, considering objective and
subjective metrics. Code and samples are available at:
https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
Hierarchical Integration Diffusion Model for Realistic Image Deblurring
May 22, 2023
Zheng Chen, Yulun Zhang, Ding Liu, Bin Xia, Jinjin Gu, Linghe Kong, Xin Yuan
Diffusion models (DMs) have recently been introduced in image deblurring and
exhibited promising performance, particularly in terms of details
reconstruction. However, the diffusion model requires a large number of
inference iterations to recover the clean image from pure Gaussian noise, which
consumes massive computational resources. Moreover, the distribution
synthesized by the diffusion model is often misaligned with the target results,
leading to restrictions in distortion-based metrics. To address the above
issues, we propose the Hierarchical Integration Diffusion Model (HI-Diff), for
realistic image deblurring. Specifically, we perform the DM in a highly
compacted latent space to generate the prior feature for the deblurring
process. The deblurring process is implemented by a regression-based method to
obtain better distortion accuracy. Meanwhile, the highly compact latent space
ensures the efficiency of the DM. Furthermore, we design the hierarchical
integration module to fuse the prior into the regression-based model from
multiple scales, enabling better generalization in complex blurry scenarios.
Comprehensive experiments on synthetic and real-world blur datasets demonstrate
that our HI-Diff outperforms state-of-the-art methods. Code and trained models
are available at https://github.com/zhengchen1999/HI-Diff.
Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?
May 22, 2023
Zheng Li, Yuxuan Li, Penghai Zhao, Renjie Song, Xiang Li, Jian Yang
Diffusion models have recently achieved astonishing performance in generating
high-fidelity photo-realistic images. Given their huge success, it is still
unclear whether synthetic images are applicable for knowledge distillation when
real images are unavailable. In this paper, we extensively study whether and
how synthetic images produced from state-of-the-art diffusion models can be
used for knowledge distillation without access to real images, and obtain three
key conclusions: (1) synthetic data from diffusion models can easily lead to
state-of-the-art performance among existing synthesis-based distillation
methods, (2) low-fidelity synthetic images are better teaching materials, and
(3) relatively weak classifiers are better teachers. Code is available at
https://github.com/zhengli97/DM-KD.
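A minimal sketch of the distillation setup the study evaluates: a student matches a frozen teacher's soft predictions on synthetic images. The linear "networks", the random tensors standing in for diffusion-generated images, and the temperature are illustrative assumptions.

    # Knowledge-distillation step on synthetic images (toy stand-ins).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()
    student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    tau = 4.0                                         # distillation temperature

    synthetic = torch.rand(64, 3, 32, 32)             # pretend diffusion samples
    with torch.no_grad():
        t_logits = teacher(synthetic)
    s_logits = student(synthetic)
    loss = F.kl_div(F.log_softmax(s_logits / tau, dim=-1),
                    F.softmax(t_logits / tau, dim=-1),
                    reduction="batchmean") * tau * tau
    opt.zero_grad(); loss.backward(); opt.step()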
ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer
May 22, 2023
Huadai Liu, Rongjie Huang, Xuan Lin, Wenqiang Xu, Maozong Zheng, Hong Chen, Jinzheng He, Zhou Zhao
Text-to-speech(TTS) has undergone remarkable improvements in performance,
particularly with the advent of Denoising Diffusion Probabilistic Models
(DDPMs). However, the perceived quality of audio depends not solely on its
content, pitch, rhythm, and energy, but also on the physical environment. In
this work, we propose ViT-TTS, the first visual TTS model with scalable
diffusion transformers. ViT-TTS complements the phoneme sequence with the visual
information to generate high-perceived audio, opening up new avenues for
practical applications of AR and VR to allow a more immersive and realistic
audio experience. To mitigate the data scarcity in learning visual acoustic
information, we 1) introduce a self-supervised learning framework to enhance
both the visual-text encoder and denoiser decoder; 2) leverage the diffusion
transformer scalable in terms of parameters and capacity to learn visual scene
information. Experimental results demonstrate that ViT-TTS achieves new
state-of-the-art results, outperforming cascaded systems and other baselines
regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h),
ViT-TTS achieves results comparable to rich-resource baselines. Audio samples
are available at https://ViT-TTS.github.io/.
Towards Globally Consistent Stochastic Human Motion Prediction via Motion Diffusion
May 21, 2023
Jiarui Sun, Girish Chowdhary
Stochastic Human Motion Prediction (HMP) aims to predict multiple possible
upcoming pose sequences based on past human motion trajectories. Although
previous approaches have shown impressive performance, they face several
issues, including complex training processes and a tendency to generate
predictions that are often inconsistent with the provided history, and
sometimes even becoming entirely unreasonable. To overcome these issues, we
propose DiffMotion, an end-to-end diffusion-based stochastic HMP framework.
DiffMotion’s motion predictor is composed of two modules, including (1) a
Transformer-based network for initial motion reconstruction from corrupted
motion, and (2) a Graph Convolutional Network (GCN) to refine the generated
motion considering past observations. Our method, facilitated by this novel
Transformer-GCN module design and a proposed variance scheduler, excels in
predicting accurate, realistic, and consistent motions, while maintaining an
appropriate level of diversity. Our results on benchmark datasets show that
DiffMotion significantly outperforms previous methods in terms of both accuracy
and fidelity, while demonstrating superior robustness.
DiffUCD: Unsupervised Hyperspectral Image Change Detection with Semantic Correlation Diffusion Model
May 21, 2023
Xiangrong Zhang, Shunli Tian, Guanchun Wang, Huiyu Zhou, Licheng Jiao
Hyperspectral image change detection (HSI-CD) has emerged as a crucial
research area in remote sensing due to its ability to detect subtle changes on
the earth’s surface. Recently, denoising diffusion probabilistic models
(DDPMs) have demonstrated remarkable performance in the generative domain. Apart
from their image generation capability, the denoising process in diffusion
models can comprehensively account for the semantic correlation of
spectral-spatial features in HSI, resulting in the retrieval of semantically
relevant features in the original image. In this work, we extend the diffusion
model’s application to the HSI-CD field and propose a novel unsupervised HSI-CD
with semantic correlation diffusion model (DiffUCD). Specifically, the semantic
correlation diffusion model (SCDM) leverages abundant unlabeled samples and
fully accounts for the semantic correlation of spectral-spatial features, which
mitigates pseudo change between multi-temporal images arising from inconsistent
imaging conditions. Besides, objects with the same semantic concept at the same
spatial location may exhibit inconsistent spectral signatures at different
times, resulting in pseudo change. To address this problem, we propose a
cross-temporal contrastive learning (CTCL) mechanism that aligns the spectral
feature representations of unchanged samples. By doing so, the spectral
difference invariant features caused by environmental changes can be obtained.
Experiments conducted on three publicly available datasets demonstrate that the
proposed method outperforms the other state-of-the-art unsupervised methods in
terms of Overall Accuracy (OA), Kappa Coefficient (KC), and F1 scores,
achieving improvements of approximately 3.95%, 8.13%, and 4.45%, respectively.
Notably, our method can achieve comparable results to those fully supervised
methods requiring numerous annotated samples.
Dual-Diffusion: Dual Conditional Denoising Diffusion Probabilistic Models for Blind Super-Resolution Reconstruction in RSIs
May 20, 2023
Mengze Xu, Jie Ma, Yuanyuan Zhu
Previous super-resolution reconstruction (SR) works are typically designed
under the assumption that the degradation operation is fixed, such as bicubic
downsampling. However, for remote sensing images, unexpected factors such as
weather and orbit altitude can blur the visual quality. Blind SR methods are
proposed to deal with various degradations. There are
two main challenges of blind SR in RSIs: 1) the accurate estimation of
degradation kernels; 2) the realistic image generation in the ill-posed
problem. To rise to the challenge, we propose a novel blind SR framework based
on dual conditional denoising diffusion probabilistic models (DDSR). In our
work, we introduce conditional denoising diffusion probabilistic models (DDPM)
from two aspects: the kernel estimation process and the reconstruction process,
termed dual-diffusion. For the kernel estimation process, conditioned on
low-resolution (LR) images, a new DDPM-based kernel predictor is constructed by
studying the invertible mapping between the kernel distribution and the latent
distribution. For the reconstruction process, regarding the predicted
degradation kernels and LR images as conditional information, we construct a
DDPM-based reconstructor to learn the mapping from LR images to HR
images. Comprehensive experiments show the superiority of our proposal compared
with SOTA blind SR methods. Source code is available at
https://github.com/Lincoln20030413/DDSR
Any-to-Any Generation via Composable Diffusion
May 19, 2023
Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, Mohit Bansal
cs.CV, cs.CL, cs.LG, cs.SD, eess.AS
We present Composable Diffusion (CoDi), a novel generative model capable of
generating any combination of output modalities, such as language, image,
video, or audio, from any combination of input modalities. Unlike existing
generative AI systems, CoDi can generate multiple modalities in parallel and
its input is not limited to a subset of modalities like text or image. Despite
the absence of training datasets for many combinations of modalities, we
propose to align modalities in both the input and output space. This allows
CoDi to freely condition on any input combination and generate any group of
modalities, even if they are not present in the training data. CoDi employs a
novel composable generation strategy which involves building a shared
multimodal space by bridging alignment in the diffusion process, enabling the
synchronized generation of intertwined modalities, such as temporally aligned
video and audio. Highly customizable and flexible, CoDi achieves strong
joint-modality generation quality, and outperforms or is on par with the
unimodal state-of-the-art for single-modality synthesis. The project page with
demonstrations and code is at https://codi-gen.github.io
The probability flow ODE is provably fast
May 19, 2023
Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, Adil Salim
cs.LG, math.ST, stat.ML, stat.TH
We provide the first polynomial-time convergence guarantees for the
probability flow ODE implementation (together with a corrector step) of
score-based generative modeling. Our analysis is carried out in the wake of
recent results obtaining such guarantees for the SDE-based implementation
(i.e., denoising diffusion probabilistic modeling or DDPM), but requires the
development of novel techniques for studying deterministic dynamics without
contractivity. Through the use of a specially chosen corrector step based on
the underdamped Langevin diffusion, we obtain better dimension dependence than
prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data
distribution), highlighting potential advantages of the ODE framework.
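To make the object of study concrete, here is a small sketch of the probability flow ODE for a variance-preserving diffusion, dx/dt = -0.5*beta(t)*(x + score(x, t)), integrated with plain Euler steps on a one-dimensional Gaussian target whose score is available in closed form. It omits the corrector step analyzed in the paper, and the constants are arbitrary.

    # Euler integration of the probability flow ODE with an analytic score.
    import numpy as np

    s0, beta = 2.0, 1.0                       # data std, constant noise rate
    T, steps = 1.0, 1000
    ts = np.linspace(T, 1e-3, steps)          # integrate backwards in time

    def marginal_var(t):
        a = np.exp(-beta * t)                 # alpha_bar(t) for constant beta
        return a * s0**2 + (1.0 - a)

    x = np.random.randn(10000) * np.sqrt(marginal_var(T))   # start at the prior marginal
    for i in range(steps - 1):
        t, dt = ts[i], ts[i + 1] - ts[i]      # dt < 0
        score = -x / marginal_var(t)          # score of N(0, marginal_var(t))
        x = x + (-0.5 * beta * (x + score)) * dt

    print("sample std:", x.std(), "target std:", s0)   # should be close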
Efficient Cross-Lingual Transfer for Chinese Stable Diffusion with Images as Pivots
May 19, 2023
Jinyi Hu, Xu Han, Xiaoyuan Yi, Yutong Chen, Wenhao Li, Zhiyuan Liu, Maosong Sun
Diffusion models have made impressive progress in text-to-image synthesis.
However, training such large-scale models (e.g., Stable Diffusion) from scratch
requires high computational costs and massive high-quality text-image pairs,
which becomes unaffordable in other languages. To handle this challenge, we
propose IAP, a simple but effective method to transfer English Stable Diffusion
into Chinese. IAP optimizes only a separate Chinese text encoder with all other
parameters fixed to align Chinese semantics space to the English one in CLIP.
To achieve this, we innovatively treat images as pivots and minimize the
distance of attentive features produced from cross-attention between images and
each language respectively. In this way, IAP establishes connections of
Chinese, English and visual semantics in CLIP’s embedding space efficiently,
advancing the quality of the generated image with direct Chinese prompts.
Experimental results show that our method outperforms several strong Chinese
diffusion models with only 5%-10% of the training data.
A Preliminary Study on Augmenting Speech Emotion Recognition using a Diffusion Model
May 19, 2023
Ibrahim Malik, Siddique Latif, Raja Jurdak, Björn Schuller
In this paper, we propose to utilise diffusion models for data augmentation
in speech emotion recognition (SER). In particular, we present an effective
approach to utilise improved denoising diffusion probabilistic models (IDDPM)
to generate synthetic emotional data. We condition the IDDPM with the textual
embedding from bidirectional encoder representations from transformers (BERT)
to generate high-quality synthetic emotional samples in different speakers’
voices (synthetic samples: https://emulationai.com/research/diffusion-ser). We
implement a series
of experiments and show that better quality synthetic data helps improve SER
performance. We compare results with generative adversarial networks (GANs) and
show that the proposed model generates better-quality synthetic samples that
can considerably improve the performance of SER when augmented with synthetic
data.
SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models
May 18, 2023
Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, Animesh Garg
Object-centric learning aims to represent visual data with a set of object
entities (a.k.a. slots), providing structured representations that enable
systematic generalization. Leveraging advanced architectures like Transformers,
recent approaches have made significant progress in unsupervised object
discovery. In addition, slot-based representations hold great potential for
generative modeling, such as controllable image generation and object
manipulation in image editing. However, current slot-based methods often
produce blurry images and distorted objects, exhibiting poor generative
modeling capabilities. In this paper, we focus on improving slot-to-image
decoding, a crucial aspect for high-quality visual generation. We introduce
SlotDiffusion – an object-centric Latent Diffusion Model (LDM) designed for
both image and video data. Thanks to the powerful modeling capacity of LDMs,
SlotDiffusion surpasses previous slot models in unsupervised object
segmentation and visual generation across six datasets. Furthermore, our
learned object features can be utilized by existing object-centric dynamics
models, improving video prediction quality and downstream temporal reasoning
tasks. Finally, we demonstrate the scalability of SlotDiffusion to
unconstrained real-world datasets such as PASCAL VOC and COCO, when integrated
with self-supervised pre-trained image encoders.
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
May 18, 2023
Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, Stefano Ermon, Yun Fu, Ran Xu
Achieving machine autonomy and human control often represent divergent
objectives in the design of interactive AI systems. Visual generative
foundation models such as Stable Diffusion show promise in navigating these
goals, especially when prompted with arbitrary languages. However, they often
fall short in generating images with spatial, structural, or geometric
controls. The integration of such controls, which can accommodate various
visual conditions in a single unified model, remains an unaddressed challenge.
In response, we introduce UniControl, a new generative foundation model that
consolidates a wide array of controllable condition-to-image (C2I) tasks within
a singular framework, while still allowing for arbitrary language prompts.
UniControl enables pixel-level-precise image generation, where visual
conditions primarily influence the generated structures and language prompts
guide the style and context. To equip UniControl with the capacity to handle
diverse visual conditions, we augment pretrained text-to-image diffusion models
and introduce a task-aware HyperNet to modulate the diffusion models, enabling
the adaptation to different C2I tasks simultaneously. Trained on nine unique
C2I tasks, UniControl demonstrates impressive zero-shot generation abilities
with unseen visual conditions. Experimental results show that UniControl often
surpasses the performance of single-task-controlled methods of comparable model
sizes. This control versatility positions UniControl as a significant
advancement in the realm of controllable visual generation.
Blackout Diffusion: Generative Diffusion Models in Discrete-State Spaces
May 18, 2023
Javier E Santos, Zachary R. Fox, Nicholas Lubbers, Yen Ting Lin
Typical generative diffusion models rely on a Gaussian diffusion process for
training the backward transformations, which can then be used to generate
samples from Gaussian noise. However, real world data often takes place in
discrete-state spaces, including many scientific applications. Here, we develop
a theoretical formulation for arbitrary discrete-state Markov processes in the
forward diffusion process using exact (as opposed to variational) analysis. We
relate the theory to the existing continuous-state Gaussian diffusion as well
as other approaches to discrete diffusion, and identify the corresponding
reverse-time stochastic process and score function in the continuous-time
setting, and the reverse-time mapping in the discrete-time setting. As an
example of this framework, we introduce "Blackout Diffusion", which learns to
produce samples from an empty image instead of from noise. Numerical
experiments on the CIFAR-10, Binarized MNIST, and CelebA datasets confirm the
feasibility of our approach. Generalizing from specific (Gaussian) forward
processes to discrete-state processes without a variational approximation sheds
light on how to interpret diffusion models, which we discuss.
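A toy example of a discrete-state forward process in this spirit: each unit of pixel intensity independently survives to time t with probability exp(-t), so images decay toward an empty (all-zero) state rather than toward Gaussian noise. The survival schedule and the random "image" are placeholders, not the paper's exact construction.

    # Discrete "blackout"-style forward process via binomial thinning of counts.
    import numpy as np

    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 256, size=(8, 8))        # discrete image with counts 0..255

    def forward_sample(x0, t):
        p_survive = np.exp(-t)                    # each count survives independently
        return rng.binomial(x0, p_survive)

    for t in [0.1, 1.0, 5.0]:
        xt = forward_sample(x0, t)
        print(f"t={t}: mean intensity {xt.mean():.1f} (x0 mean {x0.mean():.1f})")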
Unsupervised Pansharpening via Low-rank Diffusion Model
May 18, 2023
Xiangyu Rui, Xiangyong Cao, Li Pang, Zeyu Zhu, Zongsheng Yue, Deyu Meng
Hyperspectral pansharpening is a process of merging a high-resolution
panchromatic (PAN) image and a low-resolution hyperspectral (LRHS) image to
create a single high-resolution hyperspectral (HRHS) image. Existing
Bayesian-based HS pansharpening methods require designing handcrafted image priors
to characterize the image features, and deep learning-based HS pansharpening
methods usually require a large number of paired training data and suffer from
poor generalization ability. To address these issues, in this work, we propose
a low-rank diffusion model for hyperspectral pansharpening by simultaneously
leveraging the power of the pre-trained deep diffusion model and better
generalization ability of Bayesian methods. Specifically, we assume that the
HRHS image can be recovered from the product of two low-rank tensors, i.e., the
base tensor and the coefficient matrix. The base tensor lies on the image field
and has a low spectral dimension. Thus, we can conveniently utilize a
pre-trained remote sensing diffusion model to capture its image structures.
Additionally, we derive a simple yet quite effective way to pre-estimate the
coefficient matrix from the observed LRHS image, which preserves the spectral
information of the HRHS. Experimental results demonstrate that the proposed
method performs better than some popular traditional approaches and gains
better generalization ability than some DL-based methods. The code is released
in https://github.com/xyrui/PLRDiff.
Structural Pruning for Diffusion Models
May 18, 2023
Gongfan Fang, Xinyin Ma, Xinchao Wang
Generative modeling has recently undergone remarkable advancements, primarily
propelled by the transformative implications of Diffusion Probabilistic Models
(DPMs). The impressive capability of these models, however, often entails
significant computational overhead during both training and inference. To
tackle this challenge, we present Diff-Pruning, an efficient compression method
tailored for learning lightweight diffusion models from pre-existing ones,
without the need for extensive re-training. The essence of Diff-Pruning is
encapsulated in a Taylor expansion over pruned timesteps, a process that
disregards non-contributory diffusion steps and ensembles informative gradients
to identify important weights. Our empirical assessment, undertaken across
several datasets, highlights two primary benefits of our proposed method: 1)
Efficiency: it enables approximately a 50% reduction in FLOPs at a mere 10%
to 20% of the original training expenditure; 2) Consistency: the pruned
diffusion models inherently preserve generative behavior congruent with their
pre-trained models. Code is available at
\url{https://github.com/VainF/Diff-Pruning}.
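As a rough sketch of accumulating first-order (Taylor) importance only over informative timesteps, the code below scores each weight by |w * dL/dw| summed over a chosen subset of steps of a toy denoising loss and then picks a pruning threshold. The MLP denoiser, the linear schedule, and the retained-timestep subset are assumptions, not the authors' Diff-Pruning implementation.

    # Taylor importance accumulated over a subset of diffusion timesteps.
    import torch
    import torch.nn as nn

    T, dim = 1000, 16
    net = nn.Sequential(nn.Linear(dim + 1, 64), nn.ReLU(), nn.Linear(64, dim))
    kept_timesteps = range(0, T, 100)             # "informative" steps (placeholder)
    importance = {n: torch.zeros_like(p) for n, p in net.named_parameters()}

    x0 = torch.randn(32, dim)
    for t in kept_timesteps:
        a_bar = torch.tensor(1.0 - t / T)         # toy linear schedule
        noise = torch.randn_like(x0)
        xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
        t_feat = torch.full((x0.shape[0], 1), t / T)
        pred = net(torch.cat([xt, t_feat], dim=-1))
        loss = (pred - noise).pow(2).mean()
        net.zero_grad()
        loss.backward()
        for name, p in net.named_parameters():
            importance[name] += (p.detach() * p.grad).abs()

    # Weights with the smallest accumulated scores are pruning candidates.
    scores = torch.cat([v.flatten() for v in importance.values()])
    threshold = scores.kthvalue(int(0.5 * scores.numel())).values
    print("prune threshold:", float(threshold))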
Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data
May 18, 2023
Yusheng Tian, Wei Liu, Tan Lee
Creating synthetic voices with found data is challenging, as real-world
recordings often contain various types of audio degradation. One way to address
this problem is to pre-enhance the speech with an enhancement model and then
use the enhanced data for text-to-speech (TTS) model training. This paper
investigates the use of conditional diffusion models for generalized speech
enhancement, which aims at addressing multiple types of audio degradation
simultaneously. The enhancement is performed on the log Mel-spectrogram domain
to align with the TTS training objective. Text information is introduced as an
additional condition to improve the model robustness. Experiments on real-world
recordings demonstrate that the synthetic voice built on data enhanced by the
proposed model produces higher-quality synthetic speech, compared to those
trained on data enhanced by strong baselines. Code and pre-trained parameters
of the proposed enhancement model are available at
\url{https://github.com/dmse4tts/DMSE4TTS}
TextDiffuser: Diffusion Models as Text Painters
May 18, 2023
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
Diffusion models have gained increasing attention for their impressive
generation abilities but currently struggle with rendering accurate and
coherent text. To address this issue, we introduce TextDiffuser, focusing on
generating images with visually appealing text that is coherent with
backgrounds. TextDiffuser consists of two stages: first, a Transformer model
generates the layout of keywords extracted from text prompts, and then
diffusion models generate images conditioned on the text prompt and the
generated layout. Additionally, we contribute the first large-scale text images
dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs
with text recognition, detection, and character-level segmentation annotations.
We further collect the MARIO-Eval benchmark to serve as a comprehensive tool
for evaluating text rendering quality. Through experiments and user studies, we
show that TextDiffuser is flexible and controllable to create high-quality text
images using text prompts alone or together with text template images, and
conduct text inpainting to reconstruct incomplete images with text. The code,
model, and dataset will be available at \url{https://aka.ms/textdiffuser}.
LDM3D: Latent Diffusion Model for 3D
May 18, 2023
Gabriela Ben Melech Stan, Diana Wofk, Scottie Fox, Alex Redden, Will Saxton, Jean Yu, Estelle Aflalo, Shao-Yen Tseng, Fabio Nonato, Matthias Muller, Vasudev Lal
This research paper proposes a Latent Diffusion Model for 3D (LDM3D) that
generates both image and depth map data from a given text prompt, allowing
users to generate RGBD images from text prompts. The LDM3D model is fine-tuned
on a dataset of tuples containing an RGB image, depth map and caption, and
validated through extensive experiments. We also develop an application called
DepthFusion, which uses the generated RGB images and depth maps to create
immersive and interactive 360-degree-view experiences using TouchDesigner. This
technology has the potential to transform a wide range of industries, from
entertainment and gaming to architecture and design. Overall, this paper
presents a significant contribution to the field of generative AI and computer
vision, and showcases the potential of LDM3D and DepthFusion to revolutionize
content creation and digital experiences. A short video summarizing the
approach can be found at https://t.ly/tdi2.
DiffUTE: Universal Text Editing Diffusion Model
May 18, 2023
Haoxing Chen, Zhuoer Xu, Zhangxuan Gu, Jun Lan, Xing Zheng, Yaohui Li, Changhua Meng, Huijia Zhu, Weiqiang Wang
Diffusion model based language-guided image editing has achieved great
success recently. However, existing state-of-the-art diffusion models struggle
with rendering correct text and text style during generation. To tackle this
problem, we propose a universal self-supervised text editing diffusion model
(DiffUTE), which aims to replace or modify words in the source image with
another one while maintaining its realistic appearance. Specifically, we build
our model on a diffusion model and carefully modify the network structure to
enable the model for drawing multilingual characters with the help of glyph and
position information. Moreover, we design a self-supervised learning framework
to leverage large amounts of web data to improve the representation ability of
the model. Experimental results show that our method achieves an impressive
performance and enables controllable editing on in-the-wild images with high
fidelity. Our code will be available at
\url{https://github.com/chenhaoxing/DiffUTE}.
Democratized Diffusion Language Model
May 18, 2023
Nikita Balagansky, Daniil Gavrilov
Despite the potential benefits of Diffusion Models for NLP applications,
publicly available implementations, trained models, and reproducible training
procedures are still lacking. We present the Democratized
Diffusion Language Model (DDLM), based on the Continuous Diffusion for
Categorical Data (CDCD) framework, to address these challenges. We propose a
simplified training procedure for DDLM using the C4 dataset and perform an
in-depth analysis of the trained model’s behavior. Furthermore, we introduce a
novel early-exiting strategy for faster sampling with models trained with score
interpolation. Since no previous works aimed at solving downstream tasks with
pre-trained Diffusion LM (e.g., classification tasks), we experimented with
GLUE Benchmark to study the ability of DDLM to transfer knowledge. With this
paper, we propose available training and evaluation pipelines to other
researchers and pre-trained DDLM models, which could be used in future research
with Diffusion LMs.
Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders
May 18, 2023
Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji
Diffusion-based speech enhancement (SE) has been investigated recently, but
its decoding is very time-consuming. One solution is to initialize the decoding
process with the enhanced feature estimated by a predictive SE system. However,
this two-stage method ignores the complementarity between predictive and
diffusion SE. In this paper, we propose a unified system that integrates these
two SE modules. The system encodes both generative and predictive information,
and then applies both generative and predictive decoders, whose outputs are
fused. Specifically, the two SE modules are fused in the first and final
diffusion steps: the first step fusion initializes the diffusion process with
the predictive SE for improving the convergence, and the final step fusion
combines the two complementary SE outputs to improve the SE performance.
Experiments on the Voice-Bank dataset show that the diffusion score estimation
can benefit from the predictive information and speed up the decoding.
Discriminative Diffusion Models as Few-shot Vision and Language Learners
May 18, 2023
Xuehai He, Weixi Feng, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, William Yang Wang, Xin Eric Wang
Diffusion models, such as Stable Diffusion, have shown incredible performance
on text-to-image generation. Since text-to-image generation often requires
models to generate visual concepts with fine-grained details and attributes
specified in text prompts, can we leverage the powerful representations learned
by pre-trained diffusion models for discriminative tasks such as image-text
matching? To answer this question, we propose a novel approach, Discriminative
Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models
into few-shot discriminative learners. Our approach mainly uses the
cross-attention score of a Stable Diffusion model to capture the mutual
influence between visual and textual information and fine-tune the model via
efficient attention-based prompt learning to perform image-text matching. By
comparing DSD with state-of-the-art methods on several benchmark datasets, we
demonstrate the potential of using pre-trained diffusion models for
discriminative tasks with superior results on few-shot image-text matching.
Sampling, Diffusions, and Stochastic Localization
May 18, 2023
Andrea Montanari
Diffusions are a successful technique to sample from high-dimensional
distributions that can be either explicitly given or learnt from a collection of
samples. They implement a diffusion process whose endpoint is a sample from the
target distribution and whose drift is typically represented as a neural
network. Stochastic localization is a successful technique to prove mixing of
Markov Chains and other functional inequalities in high dimension. An
algorithmic version of stochastic localization was introduced in [EAMS2022], to
obtain an algorithm that samples from certain statistical mechanics models.
These notes have three objectives: (i) generalize the construction of [EAMS2022]
to other stochastic localization processes; (ii) clarify the connection between
diffusions and stochastic localization. In particular, we show that standard
denoising diffusions are stochastic localizations, and that other examples are
naturally suggested by the proposed viewpoint; (iii) describe some insights
that follow from this viewpoint.
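For readers unfamiliar with the object, the standard isotropic Gaussian stochastic localization process (a special case of the constructions discussed in these notes, stated here only as a reminder under the usual assumptions) is

    y_t = t\,x_* + W_t, \qquad x_* \sim \mu, \quad (W_t)_{t \ge 0} \ \text{a standard Brownian motion},

which admits the SDE representation

    dy_t = m(y_t, t)\,dt + dB_t, \qquad m(y, t) = \mathbb{E}[\, x_* \mid y_t = y \,].

Since $y_t / t \to x_*$ as $t \to \infty$, simulating this SDE with a learned approximation of the posterior mean $m$ (a denoiser) produces a sample from $\mu$, which is the sense in which denoising diffusions arise as stochastic localizations.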
Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
May 17, 2023
Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji
Despite tremendous progress in generating high-quality images using diffusion
models, synthesizing a sequence of animated frames that are both photorealistic
and temporally coherent is still in its infancy. While off-the-shelf
billion-scale datasets for image generation are available, collecting similar
video data of the same scale is still challenging. Also, training a video
diffusion model is computationally much more expensive than its image
counterpart. In this work, we explore finetuning a pretrained image diffusion
model with video data as a practical solution for the video synthesis task. We
find that naively extending the image noise prior to a video noise prior leads
to sub-optimal performance in video diffusion. Our carefully designed video noise
prior leads to substantially better performance. Extensive experimental
validation shows that our model, Preserve Your Own Correlation (PYoCo), attains
SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It
also achieves SOTA video generation quality on the small-scale UCF-101
benchmark with a $10\times$ smaller model using significantly less computation
than the prior art.
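A small sketch of what a temporally correlated video noise prior can look like: each frame's noise mixes a shared component with an independent component, so neighbouring frames are correlated while each frame keeps unit marginal variance. The mixing weight rho is a placeholder; this is not the paper's exact mixed/progressive formulation.

    # Correlated per-frame noise: shared + independent components.
    import numpy as np

    rng = np.random.default_rng(0)
    frames, c, h, w = 8, 3, 32, 32
    rho = 0.5                                        # placeholder mixing weight

    shared = rng.standard_normal((1, c, h, w))
    independent = rng.standard_normal((frames, c, h, w))
    noise = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * independent

    corr = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
    print(f"per-frame std ~ {noise.std():.2f}, inter-frame corr ~ {corr:.2f}")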
Raising the Bar for Certified Adversarial Robustness with Diffusion Models
May 17, 2023
Thomas Altstidl, David Dobre, Björn Eskofier, Gauthier Gidel, Leo Schwinn
Certified defenses against adversarial attacks offer formal guarantees on the
robustness of a model, making them more reliable than empirical methods such as
adversarial training, whose effectiveness is often later reduced by unseen
attacks. Still, the limited certified robustness that is currently achievable
has been a bottleneck for their practical adoption. Gowal et al. and Wang et
al. have shown that generating additional training data using state-of-the-art
diffusion models can considerably improve the robustness of adversarial
training. In this work, we demonstrate that a similar approach can
substantially improve deterministic certified defenses. In addition, we provide
a list of recommendations to scale the robustness of certified training
approaches. One of our main insights is that the generalization gap, i.e., the
difference between the training and test accuracy of the original model, is a
good predictor of the magnitude of the robustness improvement when using
additional generated data. Our approach achieves state-of-the-art deterministic
robustness certificates on CIFAR-10 for the $\ell_2$ ($\epsilon = 36/255$) and
$\ell_\infty$ ($\epsilon = 8/255$) threat models, outperforming the previous
best results by $+3.95\%$ and $+1.39\%$, respectively. Furthermore, we report
similar improvements for CIFAR-100.
Selective Guidance: Are All the Denoising Steps of Guided Diffusion Important?
May 16, 2023
Pareesa Ameneh Golnari, Zhewei Yao, Yuxiong He
This study examines the impact of optimizing the Stable Diffusion (SD) guided
inference pipeline. We propose optimizing certain denoising steps by limiting
the noise computation to conditional noise and eliminating unconditional noise
computation, thereby reducing the complexity of the target iterations by 50%.
Additionally, we demonstrate that later iterations of the SD are less sensitive
to optimization, making them ideal candidates for applying the suggested
optimization. Our experiments show that optimizing the last 20% of the
denoising loop iterations results in an 8.2% reduction in inference time with
almost no perceivable changes to the human eye. Furthermore, we found that by
extending the optimization to 50% of the last iterations, we can reduce
inference time by approximately 20.3%, while still generating visually pleasing
images.
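A hedged sketch of the optimization described above: a classifier-free guidance sampling loop that drops the unconditional forward pass (and hence the guidance term) for the final fraction of iterations. The "noise predictor", update rule, and constants are toy placeholders, not the Stable Diffusion pipeline.

    # CFG loop that skips the unconditional pass for the last skip_frac of steps.
    import numpy as np

    rng = np.random.default_rng(0)
    steps, skip_frac, scale = 50, 0.2, 7.5

    def eps_model(x, t, cond):            # placeholder noise predictor
        return 0.1 * x + (0.05 if cond else 0.0)

    x = rng.standard_normal(16)
    for i in range(steps):
        late = i >= int((1.0 - skip_frac) * steps)
        eps_c = eps_model(x, i, cond=True)
        if late:
            eps = eps_c                                   # conditional-only step
        else:
            eps_u = eps_model(x, i, cond=False)
            eps = eps_u + scale * (eps_c - eps_u)         # standard CFG
        x = x - 0.02 * eps                                # stand-in update rule
    print("done, final mean:", x.mean())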
A score-based operator Newton method for measure transport
May 16, 2023
Nisha Chandramoorthy, Florian Schaefer, Youssef Marzouk
math.ST, cs.LG, cs.NA, math.NA, stat.TH
We propose a new approach for sampling and Bayesian computation that uses the
score of the target distribution to construct a transport from a given
reference distribution to the target. Our approach is an infinite-dimensional
Newton method, involving an elliptic PDE, for finding a zero of a
"score-residual" operator. We use classical elliptic PDE theory to prove
convergence to a valid transport map. Our Newton iterates can be computed by
exploiting fast solvers for elliptic PDEs, resulting in new algorithms for
Bayesian inference and other sampling tasks. We identify elementary settings
where score-operator Newton transport achieves fast convergence while avoiding
mode collapse.
Expressiveness Remarks for Denoising Diffusion Models and Samplers
May 16, 2023
Francisco Vargas, Teodora Reu, Anna Kerekes
Denoising diffusion models are a class of generative models which have
recently achieved state-of-the-art results across many domains. Gradual noise
is added to the data using a diffusion process, which transforms the data
distribution into a Gaussian. Samples from the generative model are then
obtained by simulating an approximation of the time reversal of this diffusion
initialized by Gaussian samples. Recent research has explored adapting
diffusion models for sampling and inference tasks. In this paper, we leverage
known connections to stochastic control akin to the Föllmer drift to extend
established neural network approximation results for the Föllmer drift to
denoising diffusion models and samplers.
AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation
May 16, 2023
Tong Wu, Zhihao Fan, Xiao Liu, Yeyun Gong, Yelong Shen, Jian Jiao, Hai-Tao Zheng, Juntao Li, Zhongyu Wei, Jian Guo, Nan Duan, Weizhu Chen
Diffusion models have gained significant attention in the realm of image
generation due to their exceptional performance. Their success has been
recently expanded to text generation via generating all tokens within a
sequence concurrently. However, natural language exhibits a far more pronounced
sequential dependency in comparison to images, and the majority of existing
language models are trained with a left-to-right auto-regressive approach. To
account for the inherent sequential characteristic of natural language, we
introduce Auto-Regressive Diffusion (AR-Diffusion). AR-Diffusion ensures that
the generation of tokens on the right depends on the generated ones on the
left, a mechanism achieved through employing a dynamic number of denoising
steps that vary based on token position. This results in tokens on the left
undergoing fewer denoising steps than those on the right, thereby enabling them
to generate earlier and subsequently influence the generation of tokens on the
right. In a series of experiments on various text generation tasks, including
text summarization, machine translation, and common sense generation,
AR-Diffusion clearly demonstrated its superiority over existing diffusion
language models and that it can be $100\times$ to $600\times$ faster when
achieving comparable results. Our code is available at
https://github.com/microsoft/ProphetNet/tree/master/AR-diffusion.
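The position-dependent denoising idea can be sketched as follows: at every outer generation step, each token position is assigned its own timestep, with left positions reaching t = 0 (fully denoised) before right positions do. The linear interpolation below is an illustrative assumption, not the paper's exact movement schedule.

    # Per-position denoising timesteps: left tokens finish before right tokens.
    import numpy as np

    seq_len, T = 12, 1000
    global_steps = 10                              # coarse outer loop (placeholder)

    for s in range(global_steps):
        progress = (s + 1) / global_steps          # 0 -> 1 over the generation
        pos = np.arange(seq_len) / (seq_len - 1)   # 0 (leftmost) .. 1 (rightmost)
        t_per_token = np.clip(T * (pos + 1.0 - 2.0 * progress), 0, T).astype(int)
        print(f"step {s}: timesteps {t_per_token}")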
Discrete Diffusion Probabilistic Models for Symbolic Music Generation
May 16, 2023
Matthias Plasser, Silvan Peter, Gerhard Widmer
Denoising Diffusion Probabilistic Models (DDPMs) have made great strides in
generating high-quality samples in both discrete and continuous domains.
However, Discrete DDPMs (D3PMs) have yet to be applied to the domain of
Symbolic Music. This work presents the direct generation of Polyphonic Symbolic
Music using D3PMs. Our model exhibits state-of-the-art sample quality,
according to current quantitative evaluation metrics, and allows for flexible
infilling at the note level. We further show that our models are amenable to
post-hoc classifier guidance, widening the scope of possible applications.
However, we also cast a critical view on quantitative evaluation of music
sample quality via statistical metrics, and present a simple algorithm that can
confound our metrics with completely spurious, non-musical samples.
A Conditional Denoising Diffusion Probabilistic Model for Radio Interferometric Image Reconstruction
May 16, 2023
Ruoqi Wang, Zhuoyang Chen, Qiong Luo, Feng Wang
astro-ph.IM, astro-ph.GA, cs.CV, cs.LG, eess.IV
In radio astronomy, signals from radio telescopes are transformed into images
of observed celestial objects, or sources. However, these images, called dirty
images, contain real sources as well as artifacts due to signal sparsity and
other factors. Therefore, radio interferometric image reconstruction is
performed on dirty images, aiming to produce clean images in which artifacts
are reduced and real sources are recovered. So far, existing methods have
limited success on recovering faint sources, preserving detailed structures,
and eliminating artifacts. In this paper, we present VIC-DDPM, a Visibility and
Image Conditioned Denoising Diffusion Probabilistic Model. Our main idea is to
use both the original visibility data in the spectral domain and dirty images
in the spatial domain to guide the image generation process with DDPM. This
way, we can leverage DDPM to generate fine details and eliminate noise, while
utilizing visibility data to separate signals from noise and retaining spatial
information in dirty images. We have conducted experiments in comparison with
both traditional methods and recent deep learning based approaches. Our results
show that our method significantly improves the resulting images by reducing
artifacts, preserving fine details, and recovering dim sources. This
advancement further facilitates radio astronomical data analysis tasks on
celestial phenomena.
Denoising Diffusion Models for Plug-and-Play Image Restoration
May 15, 2023
Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, Luc Van Gool
Plug-and-play Image Restoration (IR) has been widely recognized as a flexible
and interpretable method for solving various inverse problems by utilizing any
off-the-shelf denoiser as the implicit image prior. However, most existing
methods focus on discriminative Gaussian denoisers. Although diffusion models
have shown impressive performance for high-quality image synthesis, their
potential to serve as a generative denoiser prior to the plug-and-play IR
methods remains to be further explored. While several other attempts have been
made to adopt diffusion models for image restoration, they either fail to
achieve satisfactory results or typically require an unacceptable number of
Neural Function Evaluations (NFEs) during inference. This paper proposes
DiffPIR, which integrates the traditional plug-and-play method into the
diffusion sampling framework. Compared to plug-and-play IR methods that rely on
discriminative Gaussian denoisers, DiffPIR is expected to inherit the
generative ability of diffusion models. Experimental results on three
representative IR tasks, including super-resolution, image deblurring, and
inpainting, demonstrate that DiffPIR achieves state-of-the-art performance on
both the FFHQ and ImageNet datasets in terms of reconstruction faithfulness and
perceptual quality with no more than 100 NFEs. The source code is available at
https://github.com/yuanzhi-zhu/DiffPIR.
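A toy version of a plug-and-play diffusion loop for a linear inverse problem y = A x + noise: each iteration takes a prior (denoising) step and a data-consistency proximal step, then re-noises to the next level. The analytic Gaussian "denoiser" and all constants are stand-ins for a trained diffusion prior, so this only illustrates the alternating structure, not DiffPIR itself.

    # Alternate denoising and data-consistency steps for y = A x + noise.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 32
    x_true = rng.standard_normal(n) * 2.0
    A = rng.standard_normal((n // 2, n)) / np.sqrt(n)
    y = A @ x_true + 0.05 * rng.standard_normal(n // 2)

    steps, sigma_max = 100, 3.0
    x = sigma_max * rng.standard_normal(n)
    for i in range(steps):
        sigma = sigma_max * (1.0 - i / steps) + 1e-3
        # prior step: MMSE denoiser for x ~ N(0, 4 I) under noise level sigma
        x0_hat = (4.0 / (4.0 + sigma**2)) * x
        # data-consistency step: proximal update toward A x = y
        rho = 1.0 / (sigma**2 + 1e-3)
        x0_hat = np.linalg.solve(rho * np.eye(n) + A.T @ A, rho * x0_hat + A.T @ y)
        # re-noise to the next (smaller) noise level
        sigma_next = sigma_max * (1.0 - (i + 1) / steps) + 1e-3
        x = x0_hat + sigma_next * rng.standard_normal(n)

    print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))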
Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models
May 15, 2023
Antoni Bigata Casademunt, Rodrigo Mira, Nikita Drobyshev, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic
Speech-driven animation has gained significant traction in recent years, with
current methods achieving near-photorealistic results. However, the field
remains underexplored regarding non-verbal communication despite evidence
demonstrating its importance in human interaction. In particular, generating
laughter sequences presents a unique challenge due to the intricacy and nuances
of this behaviour. This paper aims to bridge this gap by proposing a novel
model capable of generating realistic laughter sequences, given a still
portrait and an audio clip containing laughter. We highlight the failure cases
of traditional facial animation methods and leverage recent advances in
diffusion models to produce convincing laughter videos. We train our model on a
diverse set of laughter datasets and introduce an evaluation metric
specifically designed for laughter. When compared with previous speech-driven
approaches, our model achieves state-of-the-art performance across all metrics,
even when these are re-trained for laughter generation. Our code and project
are publicly available.
May 15, 2023
Ryan Webster
Recently, Carlini et al. demonstrated the widely used model Stable Diffusion
can regurgitate real training samples, which is troublesome from a copyright
perspective. In this work, we provide an efficient extraction attack on par
with the recent attack, using several orders of magnitude fewer network
evaluations. In the process, we expose a new phenomenon, which we dub template
verbatims, wherein a diffusion model regurgitates a training sample largely
intact. Template verbatims are harder to detect, as they require retrieval and
masking to correctly label. Furthermore, they are still generated by newer
systems, even those which de-duplicate their training set, and we give insight
into why they still appear during generation. We extract training images from
several state of the art systems, including Stable Diffusion 2.0, Deep Image
Floyd, and finally Midjourney v4. We release code to verify our extraction
attack, perform the attack, as well as all extracted prompts at
\url{https://github.com/ryanwebster90/onestep-extraction}.
Common Diffusion Noise Schedules and Sample Steps are Flawed
May 15, 2023
Shanchuan Lin, Bingchen Liu, Jiashi Li, Xiao Yang
We discover that common diffusion noise schedules do not enforce the last
timestep to have zero signal-to-noise ratio (SNR), and some implementations of
diffusion samplers do not start from the last timestep. Such designs are flawed
and do not reflect the fact that the model is given pure Gaussian noise at
inference, creating a discrepancy between training and inference. We show that
the flawed design causes real problems in existing implementations. In Stable
Diffusion, it severely limits the model to only generate images with medium
brightness and prevents it from generating very bright and dark samples. We
propose a few simple fixes: (1) rescale the noise schedule to enforce zero
terminal SNR; (2) train the model with v prediction; (3) change the sampler to
always start from the last timestep; (4) rescale classifier-free guidance to
prevent over-exposure. These simple changes ensure the diffusion process is
congruent between training and inference and allow the model to generate
samples more faithful to the original data distribution.
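Fix (1) above, rescaling the schedule so the last timestep has exactly zero SNR, can be sketched as a shift-and-rescale of sqrt(alpha_bar) that keeps the first value fixed. The linear beta schedule is just an example, and the exact procedure should be taken from the paper.

    # Rescale a noise schedule to enforce zero terminal SNR.
    import numpy as np

    T = 1000
    betas = np.linspace(1e-4, 0.02, T)                  # example linear schedule
    alphas_cumprod = np.cumprod(1.0 - betas)

    sqrt_ab = np.sqrt(alphas_cumprod)
    first, last = sqrt_ab[0], sqrt_ab[-1]
    sqrt_ab = (sqrt_ab - last) * first / (first - last) # shift, then rescale
    alphas_cumprod_fixed = sqrt_ab ** 2

    snr = alphas_cumprod_fixed / (1.0 - alphas_cumprod_fixed + 1e-12)
    print("terminal SNR before:", alphas_cumprod[-1] / (1 - alphas_cumprod[-1]))
    print("terminal SNR after :", snr[-1])              # exactly zero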
TESS: Text-to-Text Self-Conditioned Simplex Diffusion
May 15, 2023
Rabeeh Karimi Mahabadi, Jaesung Tae, Hamish Ivison, James Henderson, Iz Beltagy, Matthew E. Peters, Arman Cohan
Diffusion models have emerged as a powerful paradigm for generation,
obtaining strong performance in various domains with continuous-valued inputs.
Despite the promises of fully non-autoregressive text generation, applying
diffusion models to natural language remains challenging due to its discrete
nature. In this work, we propose Text-to-text Self-conditioned Simplex
Diffusion (TESS), a text diffusion model that is fully non-autoregressive,
employs a new form of self-conditioning, and applies the diffusion process on
the logit simplex space rather than the typical learned embedding space.
Through extensive experiments on natural language understanding and generation
tasks including summarization, text simplification, paraphrase generation, and
question generation, we demonstrate that TESS outperforms state-of-the-art
non-autoregressive models and is competitive with pretrained autoregressive
sequence-to-sequence models.
May 14, 2023
Wentao Hu, Xiurong Jiang, Jiarun Liu, Yuqi Yang, Hui Tian
In the field of few-shot learning (FSL), extensive research has focused on
improving network structures and training strategies. However, the role of data
processing modules has not been fully explored. Therefore, in this paper, we
propose Meta-DM, a generalized data processing module for FSL problems based on
diffusion models. Meta-DM is a simple yet effective module that can be easily
integrated with existing FSL methods, leading to significant performance
improvements in both supervised and unsupervised settings. We provide a
theoretical analysis of Meta-DM and evaluate its performance on several
algorithms. Our experiments show that combining Meta-DM with certain methods
achieves state-of-the-art results.
May 14, 2023
Raza Imam, Muhammad Huzaifa, Mohammed El-Amine Azz
Privacy and confidentiality of medical data are of utmost importance in
healthcare settings. ViTs, the SOTA vision model, rely on large amounts of
patient data for training, which raises concerns about data security and the
potential for unauthorized access. Adversaries may exploit vulnerabilities in
ViTs to extract sensitive patient information and compromise patient privacy.
This work addresses these vulnerabilities to ensure the trustworthiness and
reliability of ViTs in medical applications. In this work, we introduced a
defensive diffusion technique as an adversarial purifier to eliminate
adversarial noise introduced by attackers in the original image. By utilizing
the denoising capabilities of the diffusion model, we employ a reverse
diffusion process to effectively eliminate the adversarial noise from the
attack sample, resulting in a cleaner image that is then fed into the ViT
blocks. Our findings demonstrate the effectiveness of the diffusion model in
eliminating attack-agnostic adversarial noise from images. Additionally, we
propose combining knowledge distillation with our framework to obtain a
lightweight student model that is both computationally efficient and robust
against gray box attacks. Comparison of our method with a SOTA baseline method,
SEViT, shows that our work is able to outperform the baseline. Extensive
experiments conducted on a publicly available Tuberculosis X-ray dataset
validate the computational efficiency and improved robustness achieved by our
proposed architecture.
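A generic sketch of the purification idea, not the paper's exact defensive diffusion pipeline: diffuse the possibly-attacked input forward to an intermediate timestep t*, then run DDPM-style reverse steps back to t = 0. The analytic score for a standard-normal data distribution and all constants are assumptions standing in for a trained diffusion model.

    # Purify by partial forward diffusion followed by reverse denoising.
    import numpy as np

    rng = np.random.default_rng(0)
    T, t_star, n = 1000, 300, 64                # purification strength t_star
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alphas_bar = np.cumprod(alphas)

    x_adv = rng.standard_normal(n) + 0.3 * np.sign(rng.standard_normal(n))  # toy attack

    # forward: diffuse the input to timestep t_star
    x = (np.sqrt(alphas_bar[t_star]) * x_adv
         + np.sqrt(1 - alphas_bar[t_star]) * rng.standard_normal(n))

    # reverse: ancestral steps with a stand-in score for x0 ~ N(0, I)
    for t in range(t_star, 0, -1):
        score = -x / (alphas_bar[t] + (1 - alphas_bar[t]))   # marginal is N(0, I)
        eps_hat = -np.sqrt(1 - alphas_bar[t]) * score
        mean = (x - betas[t] / np.sqrt(1 - alphas_bar[t]) * eps_hat) / np.sqrt(alphas[t])
        x = mean + (np.sqrt(betas[t]) * rng.standard_normal(n) if t > 1 else 0.0)

    print("purified sample std:", x.std())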
Beware of diffusion models for synthesizing medical images – A comparison with GANs in terms of memorizing brain tumor images
May 12, 2023
Muhammad Usman Akbar, Wuhao Wang, Anders Eklund
Diffusion models were initially developed for text-to-image generation and
are now being utilized to generate high-quality synthetic images. Preceded by
GANs, diffusion models have shown impressive results using various evaluation
metrics. However, commonly used metrics such as FID and IS are not suitable for
determining whether diffusion models are simply reproducing the training
images. Here we train StyleGAN and diffusion models, using BRATS20, BRATS21 and
a chest x-ray pneumonia dataset, to synthesize brain MRI and chest x-ray
images, and measure the correlation between the synthetic images and all
training images. Our results show that diffusion models are more likely to
memorize the training images, compared to StyleGAN, especially for small
datasets and when using 2D slices from 3D volumes. Researchers should be
careful when using diffusion models for medical imaging, if the final goal is
to share the synthetic images.
Provably Convergent Schrödinger Bridge with Applications to Probabilistic Time Series Imputation
May 12, 2023
Yu Chen, Wei Deng, Shikai Fang, Fengpei Li, Nicole Tianjiao Yang, Yikai Zhang, Kashif Rasul, Shandian Zhe, Anderson Schneider, Yuriy Nevmyvaka
The Schrödinger bridge problem (SBP) is gaining increasing attention in
generative modeling and showing promising potential even in comparison with the
score-based generative models (SGMs). SBP can be interpreted as an
entropy-regularized optimal transport problem, which conducts projections onto
every other marginal alternatingly. However, in practice, only approximated
projections are accessible and their convergence is not well understood. To
fill this gap, we present a first convergence analysis of the Schrödinger
bridge algorithm based on approximated projections. As for its practical
applications, we apply SBP to probabilistic time series imputation by
generating missing values conditioned on observed data. We show that optimizing
the transport cost improves the performance and the proposed algorithm achieves
the state-of-the-art result in healthcare and environmental data while
exhibiting the advantage of exploring both temporal and feature patterns in
probabilistic time series imputation.
Exploiting Diffusion Prior for Real-World Image Super-Resolution
May 11, 2023
Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy
We present a novel approach to leverage prior knowledge encapsulated in
pre-trained text-to-image diffusion models for blind super-resolution (SR).
Specifically, by employing our time-aware encoder, we can achieve promising
restoration results without altering the pre-trained synthesis model, thereby
preserving the generative prior and minimizing training cost. To remedy the
loss of fidelity caused by the inherent stochasticity of diffusion models, we
employ a controllable feature wrapping module that allows users to balance
quality and fidelity by simply adjusting a scalar value during the inference
process. Moreover, we develop a progressive aggregation sampling strategy to
overcome the fixed-size constraints of pre-trained diffusion models, enabling
adaptation to resolutions of any size. A comprehensive evaluation of our method
using both synthetic and real-world benchmarks demonstrates its superiority
over current state-of-the-art approaches. Code and models are available at
https://github.com/IceClear/StableSR.
MolDiff: Addressing the Atom-Bond Inconsistency Problem in 3D Molecule Diffusion Generation
May 11, 2023
Xingang Peng, Jiaqi Guan, Qiang Liu, Jianzhu Ma
q-bio.BM, cs.LG, q-bio.QM
Deep generative models have recently achieved superior performance in 3D
molecule generation. Most of them first generate atoms and then add chemical
bonds based on the generated atoms in a post-processing manner. However, there
might be no corresponding bond solution for the temporally generated atoms as
their locations are generated without considering potential bonds. We define
this problem as the atom-bond inconsistency problem and claim it is the main
reason for current approaches to generating unrealistic 3D molecules. To
overcome this problem, we propose a new diffusion model called MolDiff which
can generate atoms and bonds simultaneously while still maintaining their
consistency by explicitly modeling the dependence between their relationships.
We evaluated the generation ability of our proposed model and the quality of
the generated molecules using criteria related to both geometry and chemical
properties. The empirical studies showed that our model outperforms previous
approaches, achieving a three-fold improvement in success rate and generating
molecules with significantly better quality.
Analyzing Bias in Diffusion-based Face Generation Models
May 10, 2023
Malsha V. Perera, Vishal M. Patel
Diffusion models are becoming increasingly popular in synthetic data
generation and image editing applications. However, these models can amplify
existing biases and propagate them to downstream applications. Therefore, it is
crucial to understand the sources of bias in their outputs. In this paper, we
investigate the presence of bias in diffusion-based face generation models with
respect to attributes such as gender, race, and age. Moreover, we examine how
dataset size affects the attribute composition and perceptual quality of both
diffusion and Generative Adversarial Network (GAN) based face generation models
across various attribute classes. Our findings suggest that diffusion models
tend to worsen distribution bias in the training data for various attributes,
which is heavily influenced by the size of the dataset. Conversely, GAN models
trained on balanced datasets with a larger number of samples show less bias
across different attributes.
Relightify: Relightable 3D Faces from a Single Image via Diffusion Models
May 10, 2023
Foivos Paraperas Papantoniou, Alexandros Lattas, Stylianos Moschoglou, Stefanos Zafeiriou
Following the remarkable success of diffusion models on image generation,
recent works have also demonstrated their impressive ability to address a
number of inverse problems in an unsupervised way, by properly constraining the
sampling process based on a conditioning input. Motivated by this, in this
paper, we present the first approach to use diffusion models as a prior for
highly accurate 3D facial BRDF reconstruction from a single image. We start by
leveraging a high-quality UV dataset of facial reflectance (diffuse and
specular albedo and normals), which we render under varying illumination
settings to simulate natural RGB textures and, then, train an unconditional
diffusion model on concatenated pairs of rendered textures and reflectance
components. At test time, we fit a 3D morphable model to the given image and
unwrap the face in a partial UV texture. By sampling from the diffusion model,
while retaining the observed texture part intact, the model inpaints not only
the self-occluded areas but also the unknown reflectance components, in a
single sequence of denoising steps. In contrast to existing methods, we
directly acquire the observed texture from the input image, thus, resulting in
more faithful and consistent reflectance estimation. Through a series of
qualitative and quantitative comparisons, we demonstrate superior performance
in both texture completion as well as reflectance reconstruction tasks.
Diffusion-based Signal Refiner for Speech Enhancement
May 10, 2023
Masato Hirano, Kazuki Shimada, Yuichiro Koyama, Shusuke Takahashi, Yuki Mitsufuji
We have developed a diffusion-based speech refiner that improves the
reference-free perceptual quality of the audio predicted by preceding
single-channel speech separation models. Although modern deep neural
network-based speech separation models have shown high performance in
reference-based metrics, they often produce perceptually unnatural artifacts.
The recent advancements made to diffusion models motivated us to tackle this
problem by restoring the degraded parts of initial separations with a
generative approach. Utilizing the denoising diffusion restoration model (DDRM)
as a basis, we propose a shared DDRM-based refiner that generates samples
conditioned on the global information of preceding outputs from arbitrary
speech separation models. We experimentally show that our refiner can provide a
clearer harmonic structure of speech and improves the reference-free metric of
perceptual quality for arbitrary preceding model architectures. Furthermore, we
tune the variance of the measurement noise based on preceding outputs, which
results in higher scores in both reference-free and reference-based metrics.
The separation quality can also be further improved by blending the
discriminative and generative outputs.
Controllable Light Diffusion for Portraits
May 08, 2023
David Futschik, Kelvin Ritland, James Vecore, Sean Fanello, Sergio Orts-Escolano, Brian Curless, Daniel Sýkora, Rohit Pandey
We introduce light diffusion, a novel method to improve lighting in
portraits, softening harsh shadows and specular highlights while preserving
overall scene illumination. Inspired by professional photographers’ diffusers
and scrims, our method softens lighting given only a single portrait photo.
Previous portrait relighting approaches focus on changing the entire lighting
environment, removing shadows (ignoring strong specular highlights), or
removing shading entirely. In contrast, we propose a learning based method that
allows us to control the amount of light diffusion and apply it on in-the-wild
portraits. Additionally, we design a method to synthetically generate plausible
external shadows with sub-surface scattering effects while conforming to the
shape of the subject’s face. Finally, we show how our approach can increase the
robustness of higher level vision applications, such as albedo estimation,
geometry estimation and semantic segmentation.
ReGeneration Learning of Diffusion Models with Rich Prompts for Zero-Shot Image Translation
May 08, 2023
Yupei Lin, Sen Zhang, Xiaojun Yang, Xiao Wang, Yukai Shi
Large-scale text-to-image models have demonstrated amazing ability to
synthesize diverse and high-fidelity images. However, these models are often
subject to several limitations. Firstly, they require the user to provide
precise and contextually relevant descriptions for the desired image
modifications. Secondly, current models can impose significant changes to the
original image content during the editing process. In this paper, we explore
ReGeneration learning in an image-to-image Diffusion model (ReDiffuser), which
preserves the content of the original image without human prompting and the
requisite editing direction is automatically discovered within the text
embedding space. To ensure consistent preservation of the shape during image
editing, we propose cross-attention guidance based on regeneration learning.
This novel approach allows for enhanced expression of the target domain
features while preserving the original shape of the image. In addition, we
introduce a cooperative update strategy, which allows for efficient
preservation of the original shape of an image, thereby improving the quality
and consistency of shape preservation throughout the editing process. Our
proposed method leverages an existing pre-trained text-image diffusion model
without any additional training. Extensive experiments show that the proposed
method outperforms existing work in both real and synthetic image editing.
Real-World Denoising via Diffusion Model
May 08, 2023
Cheng Yang, Lijing Liang, Zhixun Su
Real-world image denoising is an extremely important image processing
problem, which aims to recover clean images from noisy images captured in
natural environments. In recent years, diffusion models have achieved very
promising results in the field of image generation, outperforming previous
generation models. However, it has not been widely used in the field of image
denoising because it is difficult to control the appropriate position of the
added noise. Inspired by diffusion models, this paper proposes a novel general
denoising diffusion model that can be used for real-world image denoising. We
introduce a diffusion process with linear interpolation, and the intermediate
noisy image is interpolated from the original clean image and the corresponding
real-world noisy image, so that this diffusion model can handle the level of
added noise. In particular, we also introduce two sampling algorithms for this
diffusion model. The first one is a simple sampling procedure defined according
to the diffusion process, and the second one targets the problem of the first
one and makes a number of improvements. Our experimental results show that the
proposed method with a simple CNN-based UNet achieves results comparable to
those of a Transformer architecture. Both quantitative and qualitative
evaluations on real-world denoising benchmarks show that the proposed general
diffusion model performs nearly on par with state-of-the-art methods.
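As a rough illustration of the interpolation-based forward process described above, the sketch below linearly blends a clean image with its real-world noisy counterpart; the linear t/T schedule, tensor shapes, and function name are assumptions, not the paper's exact formulation.
import torch

def interpolate_forward(clean, real_noisy, t, T):
    # Intermediate state x_t as a linear interpolation between the clean image
    # and its real-world noisy counterpart (hypothetical linear schedule t/T).
    alpha = t / T
    return (1.0 - alpha) * clean + alpha * real_noisy

# toy usage: a batch of 3x64x64 image pairs
clean = torch.rand(4, 3, 64, 64)
noisy = clean + 0.1 * torch.randn_like(clean)
x_t = interpolate_forward(clean, noisy, t=250, T=1000)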
A Variational Perspective on Solving Inverse Problems with Diffusion Models
May 07, 2023
Morteza Mardani, Jiaming Song, Jan Kautz, Arash Vahdat
cs.LG, cs.CV, cs.NA, math.NA, stat.ML
Diffusion models have emerged as a key pillar of foundation models in visual
domains. One of their critical applications is to universally solve different
downstream inverse tasks via a single diffusion prior without re-training for
each task. Most inverse tasks can be formulated as inferring a posterior
distribution over data (e.g., a full image) given a measurement (e.g., a masked
image). This is however challenging in diffusion models since the nonlinear and
iterative nature of the diffusion process renders the posterior intractable. To
cope with this challenge, we propose a variational approach that by design
seeks to approximate the true posterior distribution. We show that our approach
naturally leads to regularization by denoising diffusion process (RED-Diff)
where denoisers at different timesteps concurrently impose different structural
constraints over the image. To gauge the contribution of denoisers from
different timesteps, we propose a weighting mechanism based on
signal-to-noise-ratio (SNR). Our approach provides a new variational
perspective for solving inverse problems with diffusion models, allowing us to
formulate sampling as stochastic optimization, where one can simply apply
off-the-shelf solvers with lightweight iterates. Our experiments for image
restoration tasks such as inpainting and superresolution demonstrate the
strengths of our method compared with state-of-the-art sampling-based diffusion
models.
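The following is a minimal sketch, in the spirit of the RED-Diff idea above, of one stochastic-optimization step that combines a data-fidelity term with an SNR-weighted regularization term from a frozen denoiser; the exact weighting, update rule, and all names here are simplified assumptions rather than the authors' implementation.
import torch

def red_diff_step(mu, y, forward_op, denoiser, alphas, sigmas, opt):
    # One optimization step: data fidelity on the measurement y plus an
    # SNR-weighted regularization-by-denoising term from a frozen denoiser.
    t = torch.randint(0, len(alphas), (1,)).item()
    eps = torch.randn_like(mu)
    x_t = alphas[t] * mu + sigmas[t] * eps
    eps_hat = denoiser(x_t, t).detach()            # pre-trained denoiser, kept frozen
    snr_weight = alphas[t] / sigmas[t]             # assumed SNR-based weighting
    loss = ((y - forward_op(mu)) ** 2).sum() + snr_weight * ((eps_hat - eps) * mu).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# toy usage with an identity forward operator and a dummy denoiser
mu = torch.zeros(1, 3, 8, 8, requires_grad=True)
y = torch.rand(1, 3, 8, 8)
alphas = torch.linspace(0.99, 0.01, 100)
sigmas = (1 - alphas ** 2).sqrt()
opt = torch.optim.Adam([mu], lr=1e-2)
red_diff_step(mu, y, lambda x: x, lambda x, t: torch.zeros_like(x), alphas, sigmas, opt)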
A Latent Diffusion Model for Protein Structure Generation
May 06, 2023
Cong Fu, Keqiang Yan, Limei Wang, Wing Yee Au, Michael McThrow, Tao Komikado, Koji Maruhashi, Kanji Uchino, Xiaoning Qian, Shuiwang Ji
Proteins are complex biomolecules that perform a variety of crucial functions
within living organisms. Designing and generating novel proteins can pave the
way for many future synthetic biology applications, including drug discovery.
However, it remains a challenging computational task due to the large modeling
space of protein structures. In this study, we propose a latent diffusion model
that can reduce the complexity of protein modeling while flexibly capturing the
distribution of natural protein structures in a condensed latent space.
Specifically, we propose an equivariant protein autoencoder that embeds
proteins into a latent space and then uses an equivariant diffusion model to
learn the distribution of the latent protein representations. Experimental
results demonstrate that our method can effectively generate novel protein
backbone structures with high designability and efficiency. The code will be
made publicly available at
https://github.com/divelab/AIRS/tree/main/OpenProt/LatentDiff
Efficient and Degree-Guided Graph Generation via Discrete Diffusion Modeling
May 06, 2023
Xiaohui Chen, Jiaxing He, Xu Han, Li-Ping Liu
Diffusion-based generative graph models have been proven effective in
generating high-quality small graphs. However, they scale poorly when
generating large graphs containing thousands of nodes with desired graph
statistics. In this work, we propose EDGE, a new diffusion-based generative
graph model that addresses generative tasks with large graphs. To improve
computation efficiency, we encourage graph sparsity by using a discrete
diffusion process that randomly removes edges at each time step and finally
obtains an empty graph. EDGE only focuses on a portion of nodes in the graph at
each denoising step. It makes much fewer edge predictions than previous
diffusion-based models. Moreover, EDGE admits explicitly modeling the node
degrees of the graphs, further improving the model performance. The empirical
study shows that EDGE is much more efficient than competing methods and can
generate large graphs with thousands of nodes. It also outperforms baseline
models in generation quality: graphs generated by our approach have more
similar graph statistics to those of the training graphs.
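A toy sketch of the edge-removal forward process described above: edges are dropped at random each step until the graph is empty. The keep-probability schedule and symmetric masking here are illustrative assumptions, not EDGE's exact noising process.
import numpy as np

def edge_removal_forward(adj, keep_probs, seed=0):
    # Discrete forward process that randomly drops edges at each step until
    # the adjacency matrix is (near-)empty.
    rng = np.random.default_rng(seed)
    states = [adj.copy()]
    for p_keep in keep_probs:
        upper = np.triu(rng.random(adj.shape) < p_keep, 1)  # symmetric Bernoulli mask
        adj = adj * (upper | upper.T)
        states.append(adj.copy())
    return states

# toy usage on a random undirected graph with 10 nodes
rng = np.random.default_rng(0)
A = np.triu(rng.integers(0, 2, (10, 10)), 1)
A = A | A.T
trajectory = edge_removal_forward(A, keep_probs=np.linspace(0.9, 0.0, 10))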
Improved Techniques for Maximum Likelihood Estimation for Diffusion ODEs
May 06, 2023
Kaiwen Zheng, Cheng Lu, Jianfei Chen, Jun Zhu
Diffusion models have exhibited excellent performance in various domains. The
probability flow ordinary differential equation (ODE) of diffusion models
(i.e., diffusion ODEs) is a particular case of continuous normalizing flows
(CNFs), which enables deterministic inference and exact likelihood evaluation.
However, the likelihood estimation results by diffusion ODEs are still far from
those of the state-of-the-art likelihood-based generative models. In this work,
we propose several improved techniques for maximum likelihood estimation for
diffusion ODEs, including both training and evaluation perspectives. For
training, we propose velocity parameterization and explore variance reduction
techniques for faster convergence. We also derive an error-bounded high-order
flow matching objective for finetuning, which improves the ODE likelihood and
smooths its trajectory. For evaluation, we propose a novel training-free
truncated-normal dequantization to fill the training-evaluation gap commonly
existing in diffusion ODEs. Building upon these techniques, we achieve
state-of-the-art likelihood estimation results on image datasets (2.56 on
CIFAR-10, 3.43/3.69 on ImageNet-32) without variational dequantization or data
augmentation. Code is available at \url{https://github.com/thu-ml/i-DODE}.
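For context on the velocity parameterization mentioned above, here is a generic v-prediction sketch under the common convention v = alpha_t * eps - sigma_t * x0 with alpha_t^2 + sigma_t^2 = 1; the paper's exact parameterization and notation may differ.
import torch

def velocity_target(x0, eps, alpha_t, sigma_t):
    # Generic v-parameterization target: v = alpha_t * eps - sigma_t * x0.
    return alpha_t * eps - sigma_t * x0

def x0_from_velocity(x_t, v, alpha_t, sigma_t):
    # Recover the data estimate from a predicted velocity.
    return alpha_t * x_t - sigma_t * v

# consistency check on random tensors
x0, eps = torch.randn(2, 3), torch.randn(2, 3)
alpha_t, sigma_t = 0.8, 0.6                      # satisfies alpha_t**2 + sigma_t**2 == 1
x_t = alpha_t * x0 + sigma_t * eps
v = velocity_target(x0, eps, alpha_t, sigma_t)
assert torch.allclose(x0_from_velocity(x_t, v, alpha_t, sigma_t), x0, atol=1e-6)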
DocDiff: Document Enhancement via Residual Diffusion Models
May 06, 2023
Zongyuan Yang, Baolin Liu, Yongping Xiong, Lan Yi, Guibin Wu, Xiaojun Tang, Ziqi Liu, Junjie Zhou, Xing Zhang
Removing degradation from document images not only improves their visual
quality and readability, but also enhances the performance of numerous
automated document analysis and recognition tasks. However, existing
regression-based methods optimized for pixel-level distortion reduction tend to
suffer from significant loss of high-frequency information, leading to
distorted and blurred text edges. To compensate for this major deficiency, we
propose DocDiff, the first diffusion-based framework specifically designed for
diverse challenging document enhancement problems, including document
deblurring, denoising, and removal of watermarks and seals. DocDiff consists of
two modules: the Coarse Predictor (CP), which is responsible for recovering the
primary low-frequency content, and the High-Frequency Residual Refinement (HRR)
module, which adopts the diffusion models to predict the residual
(high-frequency information, including text edges), between the ground-truth
and the CP-predicted image. DocDiff is a compact and computationally efficient
model that benefits from a well-designed network architecture, an optimized
training loss objective, and a deterministic sampling process with short time
steps. Extensive experiments demonstrate that DocDiff achieves state-of-the-art
(SOTA) performance on multiple benchmark datasets, and can significantly
enhance the readability and recognizability of degraded document images.
Furthermore, our proposed HRR module in pre-trained DocDiff is plug-and-play
and ready-to-use, with only 4.17M parameters. It greatly sharpens the text
edges generated by SOTA deblurring methods without additional joint training.
Available codes: https://github.com/Royalvice/DocDiff
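A minimal sketch of the two-stage inference flow described above (coarse prediction plus a generated high-frequency residual); `coarse_net` and `residual_sampler` are hypothetical stand-ins, not DocDiff's actual modules.
import torch

def enhance(degraded, coarse_net, residual_sampler):
    # Coarse predictor recovers low-frequency content; a diffusion sampler then
    # generates the residual (roughly gt - coarse at training time) conditioned
    # on that coarse output, and the two are summed.
    coarse = coarse_net(degraded)
    residual = residual_sampler(coarse)
    return coarse + residual

# toy usage with identity/zero stand-ins for the two modules
out = enhance(torch.rand(1, 1, 64, 64), lambda x: x, lambda c: torch.zeros_like(c))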
Conditional Diffusion Feature Refinement for Continuous Sign Language Recognition
May 05, 2023
Leming Guo, Wanli Xue, Qing Guo, Yuxi Zhou, Tiantian Yuan, Shengyong Chen
In this work, we are dedicated to leveraging the denoising diffusion models’
success and formulating feature refinement as the autoencoder-formed diffusion
process, which is a mask-and-predict scheme. The state-of-the-art CSLR
framework consists of a spatial module, a visual module, a sequence module, and
a sequence learning function. However, this framework has faced sequence module
overfitting caused by the objective function and small-scale available
benchmarks, resulting in insufficient model training. To overcome the
overfitting problem, some CSLR studies enforce the sequence module to learn
more visual temporal information or be guided by more informative supervision
to refine its representations. In this work, we propose a novel
autoencoder-formed conditional diffusion feature refinement~(ACDR) to refine
the sequence representations to equip desired properties by learning the
encoding-decoding optimization process in an end-to-end way. Specifically, for
the ACDR, a noising Encoder is proposed to progressively add noise equipped
with semantic conditions to the sequence representations. And a denoising
Decoder is proposed to progressively denoise the noisy sequence representations
with semantic conditions. Therefore, the sequence representations can be imbued
with the semantics of provided semantic conditions. Further, a semantic
constraint is employed to prevent the denoised sequence representations from
semantic corruption. Extensive experiments are conducted to validate the
effectiveness of our ACDR, benefiting state-of-the-art methods and achieving a
notable gain on three benchmarks.
Diffusion Explainer: Visual Explanation for Text-to-image Stable Diffusion
May 04, 2023
Seongmin Lee, Benjamin Hoover, Hendrik Strobelt, Zijie J. Wang, ShengYun Peng, Austin Wright, Kevin Li, Haekyu Park, Haoyang Yang, Duen Horng Chau
cs.CL, cs.AI, cs.HC, cs.LG
Diffusion-based generative models’ impressive ability to create convincing
images has captured global attention. However, their complex internal
structures and operations often make them difficult for non-experts to
understand. We present Diffusion Explainer, the first interactive visualization
tool that explains how Stable Diffusion transforms text prompts into images.
Diffusion Explainer tightly integrates a visual overview of Stable Diffusion’s
complex components with detailed explanations of their underlying operations,
enabling users to fluidly transition between multiple levels of abstraction
through animations and interactive elements. By comparing the evolutions of
image representations guided by two related text prompts over refinement
timesteps, users can discover the impact of prompts on image generation.
Diffusion Explainer runs locally in users’ web browsers without the need for
installation or specialized hardware, broadening public access to education
about modern AI techniques. Our open-source tool is available at:
https://poloclub.github.io/diffusion-explainer/. A video demo is available at
https://youtu.be/Zg4gxdIWDds.
Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model
May 04, 2023
Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang, Ying Tai, Yong Liu
Multimodal-driven talking face generation refers to animating a portrait with
the given pose, expression, and gaze transferred from the driving image and
video, or estimated from the text and audio. However, existing methods ignore
the potential of the text modality, and their generators mainly follow the
source-oriented feature rearrange paradigm coupled with unstable GAN
frameworks. In this work, we first represent the emotion in the text prompt,
which could inherit rich semantics from the CLIP, allowing flexible and
generalized emotion control. We further reorganize these tasks as the
target-oriented texture transfer and adopt the Diffusion Models. More
specifically, given a textured face as the source and the rendered face
projected from the desired 3DMM coefficients as the target, our proposed
Texture-Geometry-aware Diffusion Model (TGDM) decomposes the complex transfer
problem into a multi-conditional denoising process, where a Texture
Attention-based module accurately models the correspondences between appearance
and geometry cues contained in the source and target conditions and
incorporates extra implicit information for high-fidelity talking face
generation. Additionally, TGDM can
be gracefully tailored for face swapping. We derive a novel paradigm free of
unstable seesaw-style optimization, resulting in simple, stable, and effective
training and inference schemes. Extensive experiments demonstrate the
superiority of our method.
LayoutDM: Transformer-based Diffusion Model for Layout Generation
May 04, 2023
Shang Chai, Liansheng Zhuang, Fengying Yan
Automatic layout generation that can synthesize high-quality layouts is an
important tool for graphic design in many applications. Though existing methods
based on generative models such as Generative Adversarial Networks (GANs) and
Variational Auto-Encoders (VAEs) have progressed, they still leave much room
for improving the quality and diversity of the results. Inspired by the recent
success of diffusion models in generating high-quality images, this paper
explores their potential for conditional layout generation and proposes
Transformer-based Layout Diffusion Model (LayoutDM) by instantiating the
conditional denoising diffusion probabilistic model (DDPM) with a purely
transformer-based architecture. Instead of using convolutional neural networks,
a transformer-based conditional Layout Denoiser is proposed to learn the
reverse diffusion process to generate samples from noised layout data.
Benefiting from both the transformer and the DDPM, our LayoutDM offers desirable
properties such as high-quality generation, strong sample diversity, faithful
distribution coverage, and stationary training in comparison to GANs and VAEs.
Quantitative and qualitative experimental results show that our method
outperforms state-of-the-art generative models in terms of quality and
diversity.
Solving Inverse Problems with Score-Based Generative Priors learned from Noisy Data
May 02, 2023
Asad Aali, Marius Arvinte, Sidharth Kumar, Jonathan I. Tamir
We present SURE-Score: an approach for learning score-based generative models
using training samples corrupted by additive Gaussian noise. When a large
training set of clean samples is available, solving inverse problems via
score-based (diffusion) generative models trained on the underlying
fully-sampled data distribution has recently been shown to outperform
end-to-end supervised deep learning. In practice, such a large collection of
training data may be prohibitively expensive to acquire in the first place. In
this work, we present an approach for approximately learning a score-based
generative model of the clean distribution, from noisy training data. We
formulate and justify a novel loss function that leverages Stein’s unbiased
risk estimate to jointly denoise the data and learn the score function via
denoising score matching, while using only the noisy samples. We demonstrate
the generality of SURE-Score by learning priors and applying posterior sampling
to ill-posed inverse problems in two practical applications from different
domains: compressive wireless multiple-input multiple-output channel estimation
and accelerated 2D multi-coil magnetic resonance imaging reconstruction, where
we demonstrate competitive reconstruction performance when learning at
signal-to-noise ratio values of 0 and 10 dB, respectively.
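To illustrate the SURE ingredient mentioned above, here is a Monte-Carlo SURE estimate of denoising error computed from noisy data alone, with a single Rademacher probe for the divergence term; SURE-Score combines a term like this with denoising score matching, and the exact loss, constants, and callables here are assumptions.
import torch

def sure_loss(denoiser, y, sigma, delta=1e-3):
    # SURE estimate: ||y - f(y)||^2 / n - sigma^2 + (2 sigma^2 / n) * div f(y),
    # with the divergence approximated by a finite-difference Hutchinson probe.
    f_y = denoiser(y)
    b = torch.randint(0, 2, y.shape, device=y.device).float() * 2 - 1
    div = (b * (denoiser(y + delta * b) - f_y)).sum() / delta
    n = y.numel()
    return ((y - f_y) ** 2).sum() / n - sigma ** 2 + (2 * sigma ** 2 / n) * div

# toy usage with a linear shrinkage "denoiser"
y = torch.randn(1, 3, 16, 16)
print(sure_loss(lambda x: 0.9 * x, y, sigma=0.1))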
In-Context Learning Unlocked for Diffusion Models
May 01, 2023
Zhendong Wang, Yifan Jiang, Yadong Lu, Yelong Shen, Pengcheng He, Weizhu Chen, Zhangyang Wang, Mingyuan Zhou
We present Prompt Diffusion, a framework for enabling in-context learning in
diffusion-based generative models. Given a pair of task-specific example
images, such as depth from/to image and scribble from/to image, and a text
guidance, our model automatically understands the underlying task and performs
the same task on a new query image following the text guidance. To achieve
this, we propose a vision-language prompt that can model a wide range of
vision-language tasks and a diffusion model that takes it as input. The
diffusion model is trained jointly over six different tasks using these
prompts. The resulting Prompt Diffusion model is the first diffusion-based
vision-language foundation model capable of in-context learning. It
demonstrates high-quality in-context generation on the trained tasks and
generalizes effectively to new, unseen vision tasks with their respective
prompts. Our model also shows compelling text-guided image editing results. Our
framework aims to facilitate research into in-context learning for computer
vision. We share our code and pre-trained models at
https://github.com/Zhendong-Wang/Prompt-Diffusion.
Class-Balancing Diffusion Models
April 30, 2023
Yiming Qin, Huangjie Zheng, Jiangchao Yao, Mingyuan Zhou, Ya Zhang
Diffusion-based models have shown the merits of generating high-quality
visual data while preserving better diversity in recent studies. However, such
observation is only justified with curated data distribution, where the data
samples are nicely pre-processed to be uniformly distributed in terms of their
labels. In practice, a long-tailed data distribution appears more common and
how diffusion models perform on such class-imbalanced data remains unknown. In
this work, we first investigate this problem and observe significant
degradation in both diversity and fidelity when the diffusion model is trained
on datasets with class-imbalanced distributions. Especially in tail classes,
the generations largely lose diversity and we observe severe mode-collapse
issues. To tackle this problem, we start from the hypothesis that the data
distribution is not class-balanced, and propose Class-Balancing Diffusion
Models (CBDM) that are trained with a distribution adjustment regularizer as a
solution. Experiments show that images generated by CBDM exhibit higher
diversity and quality in both quantitative and qualitative evaluations. We
benchmark the generation results on the CIFAR100/CIFAR100LT datasets and show
outstanding performance on the downstream recognition task.
Cycle-guided Denoising Diffusion Probability Model for 3D Cross-modality MRI Synthesis
April 28, 2023
Shaoyan Pan, Chih-Wei Chang, Junbo Peng, Jiahan Zhang, Richard L. J. Qiu, Tonghe Wang, Justin Roper, Tian Liu, Hui Mao, Xiaofeng Yang
This study aims to develop a novel Cycle-guided Denoising Diffusion
Probability Model (CG-DDPM) for cross-modality MRI synthesis. The CG-DDPM
deploys two DDPMs that condition each other to generate synthetic images from
two different MRI pulse sequences. The two DDPMs exchange random latent noise
in the reverse processes, which helps to regularize both DDPMs and generate
matching images in two modalities. This improves image-to-image translation
accuracy. We evaluated the CG-DDPM quantitatively using mean absolute error
(MAE), multi-scale structural similarity index measure (MSSIM), and peak
signal-to-noise ratio (PSNR), as well as the network synthesis consistency, on
the BraTS2020 dataset. Our proposed method showed high accuracy and reliable
consistency for MRI synthesis. In addition, we compared the CG-DDPM with
several other state-of-the-art networks and demonstrated statistically
significant improvements in the image quality of synthetic MRIs. The proposed
method enhances the capability of current multimodal MRI synthesis approaches,
which could contribute to more accurate diagnosis and better treatment planning
for patients by synthesizing additional MRI modalities.
Motion-Conditioned Diffusion Model for Controllable Video Synthesis
April 27, 2023
Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, Ming-Hsuan Yang
Recent advancements in diffusion models have greatly improved the quality and
diversity of synthesized content. To harness the expressive power of diffusion
models, researchers have explored various controllable mechanisms that allow
users to intuitively guide the content synthesis process. Although the latest
efforts have primarily focused on video synthesis, there has been a lack of
effective methods for controlling and describing desired content and motion. In
response to this gap, we introduce MCDiff, a conditional diffusion model that
generates a video from a starting image frame and a set of strokes, which allow
users to specify the intended content and dynamics for synthesis. To tackle the
ambiguity of sparse motion inputs and achieve better synthesis quality, MCDiff
first utilizes a flow completion model to predict the dense video motion based
on the semantic understanding of the video frame and the sparse motion control.
Then, the diffusion model synthesizes high-quality future frames to form the
output video. We qualitatively and quantitatively show that MCDiff achieves the
state-of-the-art visual quality in stroke-guided controllable video synthesis.
Additional experiments on MPII Human Pose further exhibit the capability of our
model on diverse content and motion synthesis.
DiffuseExpand: Expanding dataset for 2D medical image segmentation using diffusion models
April 26, 2023
Shitong Shao, Xiaohan Yuan, Zhen Huang, Ziming Qiu, Shuai Wang, Kevin Zhou
Dataset expansion can effectively alleviate the problem of data scarcity for
medical image segmentation, due to privacy concerns and labeling difficulties.
However, existing expansion algorithms still face great challenges due to their
inability to guarantee the diversity of synthesized images with paired
segmentation masks. In recent years, Diffusion Probabilistic Models (DPMs) have
shown powerful image synthesis performance, even better than Generative
Adversarial Networks. Based on this insight, we propose an approach called
DiffuseExpand for expanding datasets for 2D medical image segmentation using
DPM, which first samples a variety of masks from Gaussian noise to ensure the
diversity, and then synthesizes images to ensure the alignment of images and
masks. After that, DiffuseExpand chooses high-quality samples to further
enhance the effectiveness of data expansion. Our comparison and ablation
experiments on COVID-19 and CGMH Pelvis datasets demonstrate the effectiveness
of DiffuseExpand. Our code is released at
https://github.com/shaoshitong/DiffuseExpand.
Score-based Generative Modeling Through Backward Stochastic Differential Equations: Inversion and Generation
April 26, 2023
Zihao Wang
The proposed BSDE-based diffusion model represents a novel approach to
diffusion modeling, which extends the application of stochastic differential
equations (SDEs) in machine learning. Unlike traditional SDE-based diffusion
models, our model can determine the initial conditions necessary to reach a
desired terminal distribution by adapting an existing score function. We
demonstrate the theoretical guarantees of the model, the benefits of using
Lipschitz networks for score matching, and its potential applications in
various areas such as diffusion inversion, conditional diffusion, and
uncertainty quantification. Our work represents a contribution to the field of
score-based generative learning and offers a promising direction for solving
real-world problems.
Single-View Height Estimation with Conditional Diffusion Probabilistic Models
April 26, 2023
Isaac Corley, Peyman Najafirad
Digital Surface Models (DSM) offer a wealth of height information for
understanding the Earth’s surface as well as monitoring the existence or change
in natural and man-made structures. Classical height estimation requires
multi-view geospatial imagery or LiDAR point clouds which can be expensive to
acquire. Single-view height estimation using neural network based models shows
promise; however, it can struggle with reconstructing high-resolution features.
The latest advancements in diffusion models for high resolution image synthesis
and editing have yet to be utilized for remote sensing imagery, particularly
height estimation. Our approach involves training a generative diffusion model
to learn the joint distribution of optical and DSM images across both domains
as a Markov chain. This is accomplished by minimizing a denoising score
matching objective while being conditioned on the source image to generate
realistic high resolution 3D surfaces. In this paper we experiment with
conditional denoising diffusion probabilistic models (DDPM) for height
estimation from a single remotely sensed image and show promising results on
the Vaihingen benchmark dataset.
Latent diffusion models for generative precipitation nowcasting with accurate uncertainty quantification
April 25, 2023
Jussi Leinonen, Ulrich Hamann, Daniele Nerini, Urs Germann, Gabriele Franch
physics.ao-ph, cs.LG, eess.IV, I.2.10; J.2
Diffusion models have been widely adopted in image generation, producing
higher-quality and more diverse samples than generative adversarial networks
(GANs). We introduce a latent diffusion model (LDM) for precipitation
nowcasting - short-term forecasting based on the latest observational data. The
LDM is more stable and requires less computation to train than GANs, albeit
with more computationally expensive generation. We benchmark it against the
GAN-based Deep Generative Models of Rainfall (DGMR) and a statistical model,
PySTEPS. The LDM produces more accurate precipitation predictions, while the
comparisons are more mixed when predicting whether the precipitation exceeds
predefined thresholds. The clearest advantage of the LDM is that it generates
more diverse predictions than DGMR or PySTEPS. Rank distribution tests indicate
that the distribution of samples from the LDM accurately reflects the
uncertainty of the predictions. Thus, LDMs are promising for any applications
where uncertainty quantification is important, such as weather and climate.
April 25, 2023
Zezhou Zhang, Chuanchuan Yang, Yifeng Qin, Hao Feng, Jiqiang Feng, Hongbin Li
Conventional meta-atom designs rely heavily on researchers’ prior knowledge
and trial-and-error searches using full-wave simulations, resulting in
time-consuming and inefficient processes. Inverse design methods based on
optimization algorithms, such as evolutionary algorithms, and topological
optimizations, have been introduced to design metamaterials. However, none of
these algorithms are general enough to fulfill multi-objective tasks. Recently,
deep learning methods represented by Generative Adversarial Networks (GANs)
have been applied to inverse design of metamaterials, which can directly
generate high-degree-of-freedom meta-atoms based on S-parameter requirements.
However, the adversarial training process of GANs makes the network unstable
and results in high modeling costs. This paper proposes a novel metamaterial
inverse design method based on the diffusion probability theory. By learning
the Markov process that transforms the original structure into a Gaussian
distribution, the proposed method can gradually remove the noise starting from
the Gaussian distribution and generate new high-degree-of-freedom meta-atoms
that meet S-parameter conditions, which avoids the model instability introduced
by the adversarial training process of GANs and ensures more accurate and
high-quality generation results. Experiments demonstrate that our method is
superior to representative GAN-based methods in terms of model convergence
speed, generation accuracy, and quality.
Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models
April 25, 2023
Zhendong Wang, Yifan Jiang, Huangjie Zheng, Peihao Wang, Pengcheng He, Zhangyang Wang, Weizhu Chen, Mingyuan Zhou
Diffusion models are powerful, but they require a lot of time and data to
train. We propose Patch Diffusion, a generic patch-wise training framework, to
significantly reduce the training time costs while improving data efficiency,
which thus helps democratize diffusion model training to broader users. At the
core of our innovations is a new conditional score function at the patch level,
where the patch location in the original image is included as additional
coordinate channels, while the patch size is randomized and diversified
throughout training to encode the cross-region dependency at multiple scales.
Sampling with our method is as easy as in the original diffusion model. Through
Patch Diffusion, we could achieve $\mathbf{\ge 2\times}$ faster training, while
maintaining comparable or better generation quality. Patch Diffusion meanwhile
improves the performance of diffusion models trained on relatively small
datasets, e.g., as few as 5,000 images to train from scratch. We achieve
outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on
CelebA-64$\times$64, 1.93 on AFHQv2-Wild-64$\times$64, and 2.72 on
ImageNet-256$\times$256. We share our code and pre-trained models at
https://github.com/Zhendong-Wang/Patch-Diffusion.
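A sketch of the patch-level conditioning idea described above: crop a random patch and append its normalized image coordinates as extra channels. The coordinate normalization, patch-size randomization, and function name are illustrative assumptions, not the authors' exact input format.
import torch

def random_patch_with_coords(img, patch_size):
    # Crop a random patch from a CxHxW image and append its normalized (y, x)
    # pixel coordinates as two additional channels.
    _, h, w = img.shape
    top = torch.randint(0, h - patch_size + 1, (1,)).item()
    left = torch.randint(0, w - patch_size + 1, (1,)).item()
    patch = img[:, top:top + patch_size, left:left + patch_size]
    ys = torch.linspace(top, top + patch_size - 1, patch_size) / (h - 1)
    xs = torch.linspace(left, left + patch_size - 1, patch_size) / (w - 1)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.cat([patch, yy.unsqueeze(0), xx.unsqueeze(0)], dim=0)

# toy usage: randomize the patch size across calls, as the training scheme suggests
img = torch.rand(3, 64, 64)
conditioned_patch = random_patch_with_coords(img, patch_size=int(torch.randint(16, 65, (1,))))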
RenderDiffusion: Text Generation as Image Generation
April 25, 2023
Junyi Li, Wayne Xin Zhao, Jian-Yun Nie, Ji-Rong Wen
Diffusion models have become a new generative paradigm for text generation.
Considering the discrete categorical nature of text, in this paper, we propose
GlyphDiffusion, a novel diffusion approach for text generation via text-guided
image generation. Our key idea is to render the target text as a glyph image
containing visual language content. In this way, conditional text generation
can be cast as a glyph image generation task, and it is then natural to apply
continuous diffusion models to discrete texts. Specifically, we utilize a
cascaded architecture (i.e., a base and a super-resolution diffusion model) to
generate
high-fidelity glyph images, conditioned on the input text. Furthermore, we
design a text grounding module to transform and refine the visual language
content from generated glyph images into the final texts. In experiments over
four conditional text generation tasks and two classes of metrics (i.e., quality
and diversity), GlyphDiffusion can achieve comparable or even better results
than several baselines, including pretrained language models. Our model also
makes significant improvements compared to the recent diffusion model.
Variational Diffusion Auto-encoder: Deep Latent Variable Model with Unconditional Diffusion Prior
April 24, 2023
Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb
As a widely recognized approach to deep generative modeling, Variational
Auto-Encoders (VAEs) still face challenges with the quality of generated
images, often presenting noticeable blurriness. This issue stems from the
unrealistic assumption that approximates the conditional data distribution,
$p(\textbf{x} | \textbf{z})$, as an isotropic Gaussian. In this paper, we
propose a novel solution to address these issues. We illustrate how one can
extract a latent space from a pre-existing diffusion model by optimizing an
encoder to maximize the marginal data log-likelihood. Furthermore, we
demonstrate that a decoder can be analytically derived post encoder-training,
employing the Bayes rule for scores. This leads to a VAE-esque deep latent
variable model, which discards the need for Gaussian assumptions on
$p(\textbf{x} | \textbf{z})$ or the training of a separate decoder network. Our
method, which capitalizes on the strengths of pre-trained diffusion models and
equips them with latent spaces, results in a significant enhancement to the
performance of VAEs.
Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation
April 24, 2023
Zeyu Lu, Chengyue Wu, Xinyuan Chen, Yaohui Wang, Lei Bai, Yu Qiao, Xihui Liu
Diffusion models have attained impressive visual quality for image synthesis.
However, how to interpret and manipulate the latent space of diffusion models
has not been extensively explored. Prior work on diffusion autoencoders encodes the
semantic representations into a semantic latent code, which fails to reflect
the rich information of details and the intrinsic feature hierarchy. To
mitigate those limitations, we propose Hierarchical Diffusion Autoencoders
(HDAE) that exploit the fine-grained-to-abstract and low-level-to-high-level
feature hierarchy for the latent space of diffusion models. The hierarchical
latent space of HDAE inherently encodes different abstract levels of semantics
and provides more comprehensive semantic representations. In addition, we
propose a truncated-feature-based approach for disentangled image manipulation.
We demonstrate the effectiveness of our proposed approach with extensive
experiments and applications on image reconstruction, style mixing,
controllable interpolation, detail-preserving and disentangled image
manipulation, and multi-modal semantic image synthesis.
Score-Based Diffusion Models as Principled Priors for Inverse Imaging
April 23, 2023
Berthy T. Feng, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L. Bouman, William T. Freeman
Priors are essential for reconstructing images from noisy and/or incomplete
measurements. The choice of the prior determines both the quality and
uncertainty of recovered images. We propose turning score-based diffusion
models into principled image priors (“score-based priors”) for analyzing a
posterior of images given measurements. Previously, probabilistic priors were
limited to handcrafted regularizers and simple distributions. In this work, we
empirically validate the theoretically-proven probability function of a
score-based diffusion model. We show how to sample from resulting posteriors by
using this probability function for variational inference. Our results,
including experiments on denoising, deblurring, and interferometric imaging,
suggest that score-based priors enable principled inference with a
sophisticated, data-driven image prior.
DiffVoice: Text-to-Speech with Latent Diffusion
April 23, 2023
Zhijun Liu, Yiwei Guo, Kai Yu
eess.AS, cs.AI, cs.HC, cs.LG, cs.SD
In this work, we present DiffVoice, a novel text-to-speech model based on
latent diffusion. We propose to first encode speech signals into a phoneme-rate
latent representation with a variational autoencoder enhanced by adversarial
training, and then jointly model the duration and the latent representation
with a diffusion model. Subjective evaluations on LJSpeech and LibriTTS
datasets demonstrate that our method beats the best publicly available systems
in naturalness. By adopting recent generative inverse problem solving
algorithms for diffusion models, DiffVoice achieves the state-of-the-art
performance in text-based speech editing, and zero-shot adaptation.
On Accelerating Diffusion-Based Sampling Process via Improved Integration Approximation
April 22, 2023
Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn
A popular approach to sample a diffusion-based generative model is to solve
an ordinary differential equation (ODE). In existing samplers, the coefficients
of the ODE solvers are pre-determined by the ODE formulation, the reverse
discrete timesteps, and the employed ODE methods. In this paper, we consider
accelerating several popular ODE-based sampling processes (including EDM, DDIM,
and DPM-Solver) by optimizing certain coefficients via improved integration
approximation (IIA). We propose to minimize, for each time step, a mean squared
error (MSE) function with respect to the selected coefficients. The MSE is
constructed by applying the original ODE solver for a set of fine-grained
timesteps, which in principle provides a more accurate integration
approximation in predicting the next diffusion state. The proposed IIA
technique does not require any change of a pre-trained model, and only
introduces a very small computational overhead for solving a number of
quadratic optimization problems. Extensive experiments show that considerably
better FID scores can be achieved by using IIA-EDM, IIA-DDIM, and
IIA-DPM-Solver than the original counterparts when the neural function
evaluation (NFE) is small (i.e., less than 25).
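The per-step coefficient fitting described above can be reduced to a small least-squares problem; the sketch below is an illustrative reduction under that assumption, with the "solver terms" and reference state being hypothetical placeholders rather than the paper's actual ODE quantities.
import numpy as np

def fit_step_coefficients(basis_terms, fine_reference):
    # Least-squares fit of coefficients so that a linear combination of
    # coarse-step terms matches a fine-grained integration reference.
    B = np.stack([b.ravel() for b in basis_terms], axis=1)   # (dim, n_coeffs)
    r = fine_reference.ravel()
    coeffs, *_ = np.linalg.lstsq(B, r, rcond=None)
    return coeffs

# toy usage: two "solver terms" combined to match a reference next state
rng = np.random.default_rng(0)
terms = [rng.standard_normal((3, 8, 8)) for _ in range(2)]
reference = 0.7 * terms[0] + 0.3 * terms[1]
print(fit_step_coefficients(terms, reference))               # approximately [0.7, 0.3]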
Lookahead Diffusion Probabilistic Models for Refining Mean Estimation
April 22, 2023
Guoqiang Zhang, Niwa Kenta, W. Bastiaan Kleijn
We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the
correlation in the outputs of the deep neural networks (DNNs) over subsequent
timesteps in diffusion probabilistic models (DPMs) to refine the mean
estimation of the conditional Gaussian distributions in the backward process. A
typical DPM first obtains an estimate of the original data sample
$\boldsymbol{x}$ by feeding the most recent state $\boldsymbol{z}_i$ and index
$i$ into the DNN model and then computes the mean vector of the conditional
Gaussian distribution for $\boldsymbol{z}_{i-1}$. We propose to calculate a
more accurate estimate for $\boldsymbol{x}$ by performing extrapolation on the
two estimates of $\boldsymbol{x}$ that are obtained by feeding
$(\boldsymbol{z}_{i+1},i+1)$ and $(\boldsymbol{z}_{i},i)$ into the DNN model.
The extrapolation can be easily integrated into the backward process of
existing DPMs by introducing an additional connection over two consecutive
timesteps, and fine-tuning is not required. Extensive experiments showed that
plugging in the additional connection into DDPM, DDIM, DEIS, S-PNDM, and
high-order DPM-Solvers leads to a significant performance gain in terms of FID
score.
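The lookahead refinement above amounts to a simple extrapolation of two consecutive data estimates; a minimal sketch follows, where the weight gamma is a hypothetical placeholder rather than the paper's chosen value.
import torch

def lookahead_x0(x0_hat_i, x0_hat_next, gamma=0.1):
    # Refine the current estimate by extrapolating the two most recent
    # estimates of x0 (from z_i and z_{i+1}).
    return x0_hat_i + gamma * (x0_hat_i - x0_hat_next)

# toy usage with two consecutive estimates
a, b = torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8)
refined = lookahead_x0(a, b)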
Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations
April 21, 2023
Yu-Hui Chen, Raman Sarokin, Juhyun Lee, Jiuqiang Tang, Chuo-Ling Chang, Andrei Kulik, Matthias Grundmann
The rapid development and application of foundation models have
revolutionized the field of artificial intelligence. Large diffusion models
have gained significant attention for their ability to generate photorealistic
images and support various tasks. On-device deployment of these models provides
benefits such as lower server costs, offline functionality, and improved user
privacy. However, common large diffusion models have over 1 billion parameters
and pose challenges due to restricted computational and memory resources on
devices. We present a series of implementation optimizations for large
diffusion models that achieve the fastest reported inference latency to-date
(under 12 seconds for Stable Diffusion 1.4 without int8 quantization on Samsung
S23 Ultra for a 512x512 image with 20 iterations) on GPU-equipped mobile
devices. These enhancements broaden the applicability of generative AI and
improve the overall user experience across a wide range of devices.
BoDiffusion: Diffusing Sparse Observations for Full-Body Human Motion Synthesis
April 21, 2023
Angela Castillo, Maria Escobar, Guillaume Jeanneret, Albert Pumarola, Pablo Arbeláez, Ali Thabet, Artsiom Sanakoyeu
Mixed reality applications require tracking the user’s full-body motion to
enable an immersive experience. However, typical head-mounted devices can only
track head and hand movements, leading to a limited reconstruction of full-body
motion due to variability in lower body configurations. We propose BoDiffusion
– a generative diffusion model for motion synthesis to tackle this
under-constrained reconstruction problem. We present a time and space
conditioning scheme that allows BoDiffusion to leverage sparse tracking inputs
while generating smooth and realistic full-body motion sequences. To the best
of our knowledge, this is the first approach that uses the reverse diffusion
process to model full-body tracking as a conditional sequence generation task.
We conduct experiments on the large-scale motion-capture dataset AMASS and show
that our approach outperforms the state-of-the-art approaches by a significant
margin in terms of full-body motion realism and joint reconstruction error.
Improved Diffusion-based Image Colorization via Piggybacked Models
April 21, 2023
Hanyuan Liu, Jinbo Xing, Minshan Xie, Chengze Li, Tien-Tsin Wong
Image colorization has been attracting the research interests of the
community for decades. However, existing methods still struggle to provide
satisfactory colorized results given grayscale images due to a lack of
human-like global understanding of colors. Recently, large-scale Text-to-Image
(T2I) models have been exploited to transfer the semantic information from the
text prompts to the image domain, where text provides a global control for
semantic objects in the image. In this work, we introduce a colorization model
piggybacking on the existing powerful T2I diffusion model. Our key idea is to
exploit the color prior knowledge in the pre-trained T2I diffusion model for
realistic and diverse colorization. A diffusion guider is designed to
incorporate the pre-trained weights of the latent diffusion model to output a
latent color prior that conforms to the visual semantics of the grayscale
input. A lightness-aware VQVAE will then generate the colorized result with
pixel-perfect alignment to the given grayscale image. Our model can also
achieve conditional colorization with additional inputs (e.g. user hints and
texts). Extensive experiments show that our method achieves state-of-the-art
performance in terms of perceptual quality.
SILVR: Guided Diffusion for Molecule Generation
April 21, 2023
Nicholas T. Runcie, Antonia S. J. S. Mey
Computationally generating novel synthetically accessible compounds with high
affinity and low toxicity is a great challenge in drug design. Machine-learning
models beyond conventional pharmacophoric methods have shown promise in
generating novel small molecule compounds, but require significant tuning for a
specific protein target. Here, we introduce a method called selective iterative
latent variable refinement (SILVR) for conditioning an existing diffusion-based
equivariant generative model without retraining. The model allows the
generation of new molecules that fit into a binding site of a protein based on
fragment hits. We use the SARS-CoV-2 Main protease fragments from Diamond
X-Chem that form part of the COVID Moonshot project as a reference dataset for
conditioning the molecule generation. The SILVR rate controls the extent of
conditioning and we show that moderate SILVR rates make it possible to generate
new molecules of similar shape to the original fragments, meaning that the new
molecules fit the binding site without knowledge of the protein. We can also
merge up to 3 fragments into a new molecule without affecting the quality of
molecules generated by the underlying generative model. Our method is
generalizable to any protein target with known fragments and any
diffusion-based model for molecule generation.
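As a rough sketch of the SILVR-rate conditioning idea, the snippet below blends the current denoising latent with a noised copy of the fragment reference on plain tensors; the actual method operates on equivariant molecular representations, and all names and values here are assumptions.
import torch

def silvr_step_condition(latent, fragment_ref, noise_scale, silvr_rate):
    # Mix a noised copy of the reference fragment representation into the
    # current latent; the SILVR rate controls the strength of conditioning.
    noised_ref = fragment_ref + noise_scale * torch.randn_like(fragment_ref)
    return (1.0 - silvr_rate) * latent + silvr_rate * noised_ref

# toy usage: 16 "atoms" with 3D coordinates and a moderate SILVR rate
latent = torch.randn(16, 3)
fragments = torch.randn(16, 3)
conditioned = silvr_step_condition(latent, fragments, noise_scale=0.5, silvr_rate=0.01)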
Persistently Trained, Diffusion-assisted Energy-based Models
April 21, 2023
Xinwei Zhang, Zhiqiang Tan, Zhijian Ou
Maximum likelihood (ML) learning for energy-based models (EBMs) is
challenging, partly due to non-convergence of Markov chain Monte Carlo. Several
variations of ML learning have been proposed, but existing methods all fail to
achieve both post-training image generation and proper density estimation. We
propose to introduce diffusion data and learn a joint EBM, called diffusion
assisted-EBMs, through persistent training (i.e., using persistent contrastive
divergence) with an enhanced sampling algorithm to properly sample from
complex, multimodal distributions. We present results from a 2D illustrative
experiment and image experiments and demonstrate that, for the first time for
image data, persistently trained EBMs can {\it simultaneously} achieve long-run
stability, post-training image generation, and superior out-of-distribution
detection.
Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models
April 21, 2023
Jason J. Yu, Fereshteh Forghani, Konstantinos G. Derpanis, Marcus A. Brubaker
Novel view synthesis from a single input image is a challenging task, where
the goal is to generate a new view of a scene from a desired camera pose that
may be separated by a large motion. The highly uncertain nature of this
synthesis task due to unobserved elements within the scene (i.e. occlusion) and
outside the field-of-view makes the use of generative models appealing to
capture the variety of possible outputs. In this paper, we propose a novel
generative model capable of producing a sequence of photorealistic images
consistent with a specified camera trajectory, and a single starting image. Our
approach is centred on an autoregressive conditional diffusion-based model
capable of interpolating visible scene elements, and extrapolating unobserved
regions in a view, in a geometrically consistent manner. Conditioning is
limited to an image capturing a single camera view and the (relative) pose of
the new camera view. To measure the consistency over a sequence of generated
views, we introduce a new metric, the thresholded symmetric epipolar distance
(TSED), to measure the number of consistent frame pairs in a sequence. While
previous methods have been shown to produce high quality images and consistent
semantics across pairs of views, we show empirically with our metric that they
are often inconsistent with the desired camera poses. In contrast, we
demonstrate that our method produces both photorealistic and view-consistent
imagery.
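The main ingredient of the TSED metric described above is a thresholded symmetric epipolar distance over matched points; the sketch below computes that ingredient under the convention x2^T F x1 = 0, with the threshold and inlier ratio as placeholder values rather than the paper's protocol.
import numpy as np

def symmetric_epipolar_distance(F, pts1, pts2):
    # Symmetric epipolar distance for matched pixel coordinates (N x 2 arrays)
    # given a fundamental matrix F.
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])
    x2 = np.hstack([pts2, np.ones((len(pts2), 1))])
    l2 = x1 @ F.T                     # epipolar lines in the second image
    l1 = x2 @ F                       # epipolar lines in the first image
    d2 = np.abs(np.sum(l2 * x2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(l1 * x1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return d1 + d2

def frame_pair_consistent(F, pts1, pts2, tau=2.0, min_inlier_ratio=0.5):
    # A frame pair counts as consistent if enough matches fall under tau.
    return np.mean(symmetric_epipolar_distance(F, pts1, pts2) < tau) >= min_inlier_ratio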
Collaborative Diffusion for Multi-Modal Face Generation and Editing
April 20, 2023
Ziqi Huang, Kelvin C. K. Chan, Yuming Jiang, Ziwei Liu
Diffusion models arise as a powerful generative tool recently. Despite the
great progress, existing diffusion models mainly focus on uni-modal control,
i.e., the diffusion process is driven by only one modality of condition. To
further unleash the users’ creativity, it is desirable for the model to be
controllable by multiple modalities simultaneously, e.g., generating and
editing faces by describing the age (text-driven) while drawing the face shape
(mask-driven). In this work, we present Collaborative Diffusion, where
pre-trained uni-modal diffusion models collaborate to achieve multi-modal face
generation and editing without re-training. Our key insight is that diffusion
models driven by different modalities are inherently complementary regarding
the latent denoising steps, where bilateral connections can be established
upon. Specifically, we propose dynamic diffuser, a meta-network that adaptively
hallucinates multi-modal denoising steps by predicting the spatial-temporal
influence functions for each pre-trained uni-modal model. Collaborative
Diffusion not only collaborates generation capabilities from uni-modal
diffusion models, but also integrates multiple uni-modal manipulations to
perform multi-modal editing. Extensive qualitative and quantitative experiments
demonstrate the superiority of our framework in both image quality and
condition consistency.
A data augmentation perspective on diffusion models and retrieval
April 20, 2023
Max F. Burg, Florian Wenzel, Dominik Zietlow, Max Horn, Osama Makansi, Francesco Locatello, Chris Russell
Many approaches have been proposed to use diffusion models to augment
training datasets for downstream tasks, such as classification. However,
diffusion models are themselves trained on large datasets, often with noisy
annotations, and it remains an open question to which extent these models
contribute to downstream classification performance. In particular, it remains
unclear if they generalize enough to improve over directly using the additional
data of their pre-training process for augmentation. We systematically evaluate
a range of existing methods to generate images from diffusion models and study
new extensions to assess their benefit for data augmentation. Personalizing
diffusion models towards the target data outperforms simpler prompting
strategies. However, using the pre-training data of the diffusion model alone,
via a simple nearest-neighbor retrieval procedure, leads to even stronger
downstream performance. Our study explores the potential of diffusion models in
generating new training data, and surprisingly finds that these sophisticated
models are not yet able to beat a simple and strong image retrieval baseline on
simple downstream vision tasks.
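The retrieval baseline discussed above can be approximated by a simple nearest-neighbor lookup in a precomputed feature space; this minimal sketch assumes features are already extracted and uses cosine similarity, which may differ from the study's setup.
import numpy as np

def retrieval_augment(target_feats, pretrain_feats, k=5):
    # For each target-domain feature, return indices of its k most similar
    # pre-training examples under cosine similarity.
    t = target_feats / np.linalg.norm(target_feats, axis=1, keepdims=True)
    p = pretrain_feats / np.linalg.norm(pretrain_feats, axis=1, keepdims=True)
    sims = t @ p.T
    return np.argsort(-sims, axis=1)[:, :k]

# toy usage with random 128-d features
rng = np.random.default_rng(0)
idx = retrieval_augment(rng.standard_normal((10, 128)), rng.standard_normal((1000, 128)))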
NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models
April 19, 2023
Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler
Automatically generating high-quality real world 3D scenes is of enormous
interest for applications such as virtual reality and robotics simulation.
Towards this goal, we introduce NeuralField-LDM, a generative model capable of
synthesizing complex 3D environments. We leverage Latent Diffusion Models that
have been successfully utilized for efficient high-quality 2D content creation.
We first train a scene auto-encoder to express a set of image and pose pairs as
a neural field, represented as density and feature voxel grids that can be
projected to produce novel views of the scene. To further compress this
representation, we train a latent-autoencoder that maps the voxel grids to a
set of latent representations. A hierarchical diffusion model is then fit to
the latents to complete the scene generation pipeline. We achieve a substantial
improvement over existing state-of-the-art scene generation models.
Additionally, we show how NeuralField-LDM can be used for a variety of 3D
content creation applications, including conditional scene generation, scene
inpainting and scene style manipulation.
DiFaReli: Diffusion Face Relighting
April 19, 2023
Puntawat Ponglertnapakorn, Nontawat Tritrong, Supasorn Suwajanakorn
We present a novel approach to single-view face relighting in the wild.
Handling non-diffuse effects, such as global illumination or cast shadows, has
long been a challenge in face relighting. Prior work often assumes Lambertian
surfaces, simplified lighting models or involves estimating 3D shape, albedo,
or a shadow map. This estimation, however, is error-prone and requires many
training examples with lighting ground truth to generalize well. Our work
bypasses the need for accurate estimation of intrinsic components and can be
trained solely on 2D images without any light stage data, multi-view images, or
lighting ground truth. Our key idea is to leverage a conditional diffusion
implicit model (DDIM) for decoding a disentangled light encoding along with
other encodings related to 3D shape and facial identity inferred from
off-the-shelf estimators. We also propose a novel conditioning technique that
eases the modeling of the complex interaction between light and geometry by
using a rendered shading reference to spatially modulate the DDIM. We achieve
state-of-the-art performance on standard benchmark Multi-PIE and can
photorealistically relight in-the-wild images. Please visit our page:
https://diffusion-face-relighting.github.io
Denoising Diffusion Medical Models
April 19, 2023
Pham Ngoc Huy, Tran Minh Quan
In this study, we introduce a generative model that can synthesize a large
number of radiographical image/label pairs, and thus is asymptotically
favorable to downstream activities such as segmentation in bio-medical image
analysis. Denoising Diffusion Medical Model (DDMM), the proposed technique, can
create realistic X-ray images and associated segmentations on a small number of
annotated datasets as well as other massive unlabeled datasets with no
supervision. Radiograph/segmentation pairs are generated jointly by the DDMM
sampling process in probabilistic mode. As a result, a vanilla UNet that uses
this data augmentation for the segmentation task outperforms other similarly
data-centric approaches.
UPGPT: Universal Diffusion Model for Person Image Generation, Editing and Pose Transfer
April 18, 2023
Soon Yau Cheong, Armin Mustafa, Andrew Gilbert
Text-to-image models (T2I) such as StableDiffusion have been used to generate
high-quality images of people. However, due to the random nature of the
generation process, the generated person has a different appearance, e.g., pose,
face, and clothing, despite using the same text prompt. This appearance
inconsistency makes T2I models unsuitable for pose transfer. We address this by proposing a
multimodal diffusion model that accepts text, pose, and visual prompting. Our
model is the first unified method to perform all person image tasks -
generation, pose transfer, and mask-less editing. We also pioneer directly using
low-dimensional 3D body model parameters to demonstrate a new capability -
simultaneous pose and camera view interpolation while maintaining the person’s
appearance.
Two-stage Denoising Diffusion Model for Source Localization in Graph Inverse Problems
April 18, 2023
Bosong Huang, Weihao Yu, Ruzhong Xie, Jing Xiao, Jin Huang
Source localization is the inverse problem of graph information dissemination
and has broad practical applications.
However, the inherent intricacy and uncertainty in information dissemination
pose significant challenges, and the ill-posed nature of the source
localization problem further exacerbates these challenges. Recently, deep
generative models, particularly diffusion models inspired by classical
non-equilibrium thermodynamics, have made significant progress. While diffusion
models have proven to be powerful in solving inverse problems and producing
high-quality reconstructions, applying them directly to source localization
is infeasible for two reasons. Firstly, it is impossible to calculate the
posterior disseminated results on a large-scale network for iterative denoising
sampling, which would incur enormous computational costs. Secondly, in the
existing methods for this field, the training data themselves are ill-posed
(many-to-one); thus, simply transferring the diffusion model would only lead to
local optima.
To address these challenges, we propose a two-stage optimization framework,
the source localization denoising diffusion model (SL-Diff). In the coarse
stage, we devise the source proximity degrees as the supervised signals to
generate coarse-grained source predictions. This aims to efficiently initialize
the next stage, significantly reducing its convergence time and calibrating the
convergence process. Furthermore, the introduction of cascade temporal
information in this training method transforms the many-to-one mapping
relationship into a one-to-one relationship, perfectly addressing the ill-posed
problem. In the fine stage, we design a diffusion model for the graph inverse
problem that can quantify the uncertainty in the dissemination. The proposed
SL-Diff yields excellent prediction results within a reasonable sampling time
in extensive experiments.
Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
April 18, 2023
Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis
Latent Diffusion Models (LDMs) enable high-quality image synthesis while
avoiding excessive compute demands by training a diffusion model in a
compressed lower-dimensional latent space. Here, we apply the LDM paradigm to
high-resolution video generation, a particularly resource-intensive task. We
first pre-train an LDM on images only; then, we turn the image generator into a
video generator by introducing a temporal dimension to the latent space
diffusion model and fine-tuning on encoded image sequences, i.e., videos.
Similarly, we temporally align diffusion model upsamplers, turning them into
temporally consistent video super resolution models. We focus on two relevant
real-world applications: Simulation of in-the-wild driving data and creative
content creation with text-to-video modeling. In particular, we validate our
Video LDM on real driving videos of resolution 512 x 1024, achieving
state-of-the-art performance. Furthermore, our approach can easily leverage
off-the-shelf pre-trained image LDMs, as we only need to train a temporal
alignment model in that case. Doing so, we turn the publicly available,
state-of-the-art text-to-image LDM Stable Diffusion into an efficient and
expressive text-to-video model with resolution up to 1280 x 2048. We show that
the temporal layers trained in this way generalize to different fine-tuned
text-to-image LDMs. Utilizing this property, we show the first results for
personalized text-to-video generation, opening exciting directions for future
content creation. Project page:
https://research.nvidia.com/labs/toronto-ai/VideoLDM/
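To make the temporal-alignment idea concrete, the sketch below (Python/PyTorch) shows one way to interleave a newly initialized temporal layer with the frozen spatial blocks of an image latent-diffusion UNet; the module design, zero-initialized gate, and tensor shapes are illustrative assumptions, not the released Video LDM code.

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Temporal mixing layer to interleave with frozen spatial UNet blocks."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Zero-initialized gate so the frozen image model is unchanged at the
        # start of fine-tuning (an assumption mirroring common adapter practice).
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = x.shape
        # Attend over the frame axis independently at every spatial location.
        y = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y = self.norm(y)
        y, _ = self.attn(y, y, y, need_weights=False)
        y = y.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return x + self.alpha * y  # residual: starts as the identity

layer = TemporalAttention(channels=64)
latents = torch.randn(2, 8, 64, 16, 16)  # (batch, frames, channels, H, W)
print(layer(latents).shape)              # torch.Size([2, 8, 64, 16, 16])

In practice such a layer would be inserted after each spatial block, and only the temporal parameters (and the gate) would be trained on encoded video sequences.
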
Synthetic Data from Diffusion Models Improves ImageNet Classification
April 17, 2023
Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, David J. Fleet
cs.CV, cs.AI, cs.CL, cs.LG
Deep generative models are becoming increasingly powerful, now generating
diverse high fidelity photo-realistic samples given text prompts. Have they
reached the point where models of natural images can be used for generative
data augmentation, helping to improve challenging discriminative tasks? We show
that large-scale text-to-image diffusion models can be fine-tuned to produce
class conditional models with SOTA FID (1.76 at 256x256 resolution) and
Inception Score (239 at 256x256). The model also yields a new SOTA in
Classification Accuracy Scores (64.96 for 256x256 generative samples, improving
to 69.24 for 1024x1024 samples). Augmenting the ImageNet training set with
samples from the resulting models yields significant improvements in ImageNet
classification accuracy over strong ResNet and Vision Transformer baselines.
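The augmentation recipe itself is simple to reproduce in a training pipeline: combine the real training set with a directory of class-conditional samples from the fine-tuned generator. The paths and the real/synthetic sampling ratio below are placeholders, not values from the paper.

import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Placeholder paths: real ImageNet images plus class-conditional samples
# produced by the fine-tuned diffusion model.
real = datasets.ImageFolder("data/imagenet/train", transform=tfm)
synthetic = datasets.ImageFolder("data/imagenet_synthetic/train", transform=tfm)
mixed = ConcatDataset([real, synthetic])

# Down-weight synthetic images relative to real ones (ratio is an assumption).
weights = torch.cat([torch.full((len(real),), 1.0),
                     torch.full((len(synthetic),), 0.5)])
sampler = WeightedRandomSampler(weights, num_samples=len(real), replacement=True)
loader = DataLoader(mixed, batch_size=256, sampler=sampler, num_workers=8)
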
Refusion: Enabling Large-Size Realistic Image Restoration with Latent-Space Diffusion Models
April 17, 2023
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
This work aims to improve the applicability of diffusion models in realistic
image restoration. Specifically, we enhance the diffusion model in several
aspects such as network architecture, noise level, denoising steps, training
image size, and optimizer/scheduler. We show that tuning these hyperparameters
allows us to achieve better performance on both distortion and perceptual
scores. We also propose a U-Net based latent diffusion model which performs
diffusion in a low-resolution latent space while preserving high-resolution
information from the original input for the decoding process. Compared to the
previous latent-diffusion model which trains a VAE-GAN to compress the image,
our proposed U-Net compression strategy is significantly more stable and can
recover highly accurate images without relying on adversarial optimization.
Importantly, these modifications allow us to apply diffusion models to various
image restoration tasks, including real-world shadow removal, HR
non-homogeneous dehazing, stereo super-resolution, and bokeh effect
transformation. By simply replacing the datasets and slightly changing the
noise network, our model, named Refusion, is able to deal with large-size
images (e.g., 6000 x 4000 x 3 in HR dehazing) and produces good results on all
the above restoration problems. Our Refusion achieves the best perceptual
performance in the NTIRE 2023 Image Shadow Removal Challenge and wins 2nd place
overall.
Towards Controllable Diffusion Models via Reward-Guided Exploration
April 14, 2023
Hengtong Zhang, Tingyang Xu
By formulating data samples’ formation as a Markov denoising process,
diffusion models achieve state-of-the-art performances in a collection of
tasks. Recently, many variants of diffusion models have been proposed to enable
controlled sample generation. Most of these existing methods either formulate
the controlling information as an input (i.e., a conditional representation) for
the noise approximator, or introduce a pre-trained classifier in the test phase
to guide the Langevin dynamics towards the conditional goal. However, the former
line of methods only works when the controlling information can be formulated as
conditional representations, while the latter requires the pre-trained guidance
classifier to be differentiable. In this paper, we propose a novel framework
named RGDM (Reward-Guided Diffusion Model) that guides the training-phase of
diffusion models via reinforcement learning (RL). The proposed training
framework bridges the objective of weighted log-likelihood and maximum entropy
RL, which enables calculating policy gradients via samples from a pay-off
distribution proportional to exponential scaled rewards, rather than from
policies themselves. Such a framework alleviates the high gradient variances
and enables diffusion models to explore for highly rewarded samples in the
reverse process. Experiments on 3D shape and molecule generation tasks show
significant improvements over existing conditional diffusion models.
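As a rough illustration of training against a pay-off distribution proportional to exponentially scaled rewards, the sketch below re-weights per-sample denoising losses with a softmax over rewards. It is a simplified stand-in for the RGDM policy-gradient estimator, with all names and the temperature chosen for illustration.

import torch

def reward_weighted_loss(per_sample_loss: torch.Tensor,
                         rewards: torch.Tensor,
                         tau: float = 1.0) -> torch.Tensor:
    """per_sample_loss: (B,) denoising losses for generated samples.
    rewards: (B,) task rewards for the same samples."""
    # Self-normalized weights proportional to exponentially scaled rewards.
    weights = torch.softmax(rewards / tau, dim=0).detach()
    return (weights * per_sample_loss).sum()

# Dummy usage with random numbers in place of real losses and rewards.
losses = torch.rand(16, requires_grad=True)
rewards = torch.randn(16)
reward_weighted_loss(losses, rewards, tau=0.5).backward()
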
Delta Denoising Score
April 14, 2023
Amir Hertz, Kfir Aberman, Daniel Cohen-Or
We introduce Delta Denoising Score (DDS), a novel scoring function for
text-based image editing that guides minimal modifications of an input image
towards the content described in a target prompt. DDS leverages the rich
generative prior of text-to-image diffusion models and can be used as a loss
term in an optimization problem to steer an image towards a desired direction
dictated by a text. DDS utilizes the Score Distillation Sampling (SDS)
mechanism for the purpose of image editing. We show that using only SDS often
produces non-detailed and blurry outputs due to noisy gradients. To address
this issue, DDS uses a prompt that matches the input image to identify and
remove undesired erroneous directions of SDS. Our key premise is that SDS
should be zero when calculated on pairs of matched prompts and images, meaning
that if the score is non-zero, its gradients can be attributed to the erroneous
component of SDS. Our analysis demonstrates the competence of DDS for
text-based image-to-image translation. We further show that DDS can be used to train
an effective zero-shot image translation model. Experimental results indicate
that DDS outperforms existing methods in terms of stability and quality,
highlighting its potential for real-world applications in text-based image
editing.
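The core of the method can be summarized in a few lines: noise the edited image and the matched reference with the same noise and timestep, query the frozen denoiser with the target and the matching prompt respectively, and use the difference of the two noise predictions as the update direction. The sketch below assumes a hypothetical noise_pred(x_t, t, prompt_embedding) denoiser and a toy schedule.

import torch

def dds_step(x_edit, x_ref, tgt_emb, src_emb, noise_pred, alphas_bar, lr=0.1):
    t = torch.randint(50, 950, (1,)).item()   # random mid-range timestep
    a = alphas_bar[t]
    eps = torch.randn_like(x_edit)
    xt_edit = a.sqrt() * x_edit + (1 - a).sqrt() * eps
    xt_ref = a.sqrt() * x_ref + (1 - a).sqrt() * eps   # same noise, same t
    with torch.no_grad():
        # SDS direction on the edited image minus the erroneous direction
        # measured on the matched (reference image, reference prompt) pair.
        grad = noise_pred(xt_edit, t, tgt_emb) - noise_pred(xt_ref, t, src_emb)
    return (x_edit - lr * grad).detach()

# Toy usage with a stand-in denoiser so the sketch runs end to end.
noise_pred = lambda x, t, emb: 0.0 * x + emb.mean()   # placeholder network
alphas_bar = torch.linspace(0.999, 0.01, 1000)
x_ref = torch.randn(1, 4, 64, 64)
x_edit = x_ref.clone()
for _ in range(10):
    x_edit = dds_step(x_edit, x_ref, torch.randn(77, 768), torch.randn(77, 768),
                      noise_pred, alphas_bar)
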
Memory Efficient Diffusion Probabilistic Models via Patch-based Generation
April 14, 2023
Shinei Arakawa, Hideki Tsunashima, Daichi Horita, Keitaro Tanaka, Shigeo Morishima
Diffusion probabilistic models have been successful in generating
high-quality and diverse images. However, traditional models, whose input and
output are high-resolution images, suffer from excessive memory requirements,
making them less practical for edge devices. Previous approaches for generative
adversarial networks proposed a patch-based method that uses positional
encoding and global content information. Nevertheless, designing a patch-based
approach for diffusion probabilistic models is non-trivial. In this paper, we
present a diffusion probabilistic model that generates images on a
patch-by-patch basis. We propose two conditioning methods for a patch-based
generation. First, we propose position-wise conditioning using one-hot
representation to ensure patches are in proper positions. Second, we propose
Global Content Conditioning (GCC) to ensure patches have coherent content when
concatenated together. We evaluate our model qualitatively and quantitatively
on CelebA and LSUN bedroom datasets and demonstrate a moderate trade-off
between maximum memory consumption and generated image quality. Specifically,
when an entire image is divided into 2 x 2 patches, our proposed approach can
reduce the maximum memory consumption by half while maintaining comparable
image quality.
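The position-wise conditioning can be illustrated with a short sketch that splits an image into a 2 x 2 grid and concatenates a one-hot position map to each patch before it is fed to the denoiser; the global content conditioning branch is omitted and the shapes are assumptions.

import torch

def add_position_condition(image: torch.Tensor, grid: int = 2) -> torch.Tensor:
    """(B, C, H, W) -> (B * grid^2, C + grid^2, H / grid, W / grid)."""
    b, c, h, w = image.shape
    ph, pw = h // grid, w // grid
    conditioned = []
    for i in range(grid):
        for j in range(grid):
            patch = image[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            onehot = torch.zeros(b, grid * grid, ph, pw)
            onehot[:, i * grid + j] = 1.0   # one-hot position channels
            conditioned.append(torch.cat([patch, onehot], dim=1))
    return torch.cat(conditioned, dim=0)

x = torch.randn(4, 3, 64, 64)
print(add_position_condition(x).shape)   # torch.Size([16, 7, 32, 32])
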
Soundini: Sound-Guided Diffusion for Natural Video Editing
April 13, 2023
Seung Hyun Lee, Sieun Kim, Innfarn Yoo, Feng Yang, Donghyeon Cho, Youngseo Kim, Huiwen Chang, Jinkyu Kim, Sangpil Kim
We propose a method for adding sound-guided visual effects to specific
regions of videos with a zero-shot setting. Animating the appearance of the
visual effect is challenging because each frame of the edited video should have
visual changes while maintaining temporal consistency. Moreover, existing video
editing solutions focus on temporal consistency across frames, ignoring the
visual style variations over time, e.g., thunderstorm, wave, fire crackling. To
overcome this limitation, we utilize temporal sound features for the dynamic
style. Specifically, we guide denoising diffusion probabilistic models with an
audio latent representation in the audio-visual latent space. To the best of
our knowledge, our work is the first to explore sound-guided natural video
editing from various sound sources with sound-specialized properties, such as
intensity, timbre, and volume. Additionally, we design optical flow-based
guidance to generate temporally consistent video frames, capturing the
pixel-wise relationship between adjacent frames. Experimental results show that
our method outperforms existing video editing techniques, producing more
realistic visual effects that reflect the properties of sound. Please visit our
page: https://kuai-lab.github.io/soundini-gallery/.
Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction
April 13, 2023
Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, Hao Su
3D-aware image synthesis encompasses a variety of tasks, such as scene
generation and novel view synthesis from images. Despite numerous task-specific
methods, developing a comprehensive model remains challenging. In this paper,
we present SSDNeRF, a unified approach that employs an expressive diffusion
model to learn a generalizable prior of neural radiance fields (NeRF) from
multi-view images of diverse objects. Previous studies have used two-stage
approaches that rely on pretrained NeRFs as real data to train diffusion
models. In contrast, we propose a new single-stage training paradigm with an
end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent
diffusion model, enabling simultaneous 3D reconstruction and prior learning,
even from sparsely available views. At test time, we can directly sample the
diffusion prior for unconditional generation, or combine it with arbitrary
observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates
robust results comparable to or better than leading task-specific methods in
unconditional generation and single/sparse-view 3D reconstruction.
DiffusionRig: Learning Personalized Priors for Facial Appearance Editing
April 13, 2023
Zheng Ding, Xuaner Zhang, Zhihao Xia, Lars Jebe, Zhuowen Tu, Xiuming Zhang
We address the problem of learning person-specific facial priors from a small
number (e.g., 20) of portrait photos of the same person. This enables us to
edit this specific person’s facial appearance, such as expression and lighting,
while preserving their identity and high-frequency facial details. Key to our
approach, which we dub DiffusionRig, is a diffusion model conditioned on, or
“rigged by,” crude 3D face models estimated from single in-the-wild images by
an off-the-shelf estimator. On a high level, DiffusionRig learns to map
simplistic renderings of 3D face models to realistic photos of a given person.
Specifically, DiffusionRig is trained in two stages: It first learns generic
facial priors from a large-scale face dataset and then person-specific priors
from a small portrait photo collection of the person of interest. By learning
the CGI-to-photo mapping with such personalized priors, DiffusionRig can “rig”
the lighting, facial expression, head pose, etc. of a portrait photo,
conditioned only on coarse 3D models while preserving this person’s identity
and other high-frequency characteristics. Qualitative and quantitative
experiments show that DiffusionRig outperforms existing approaches in both
identity preservation and photorealism. Please see the project website:
https://diffusionrig.github.io for the supplemental material, video, code, and
data.
Learning Controllable 3D Diffusion Models from Single-view Images
April 13, 2023
Jiatao Gu, Qingzhe Gao, Shuangfei Zhai, Baoquan Chen, Lingjie Liu, Josh Susskind
Diffusion models have recently become the de-facto approach for generative
modeling in the 2D domain. However, extending diffusion models to 3D is
challenging due to the difficulties in acquiring 3D ground truth data for
training. On the other hand, 3D GANs that integrate implicit 3D representations
into GANs have shown remarkable 3D-aware generation when trained only on
single-view image datasets. However, 3D GANs do not provide straightforward
ways to precisely control image synthesis. To address these challenges, we
present Control3Diff, a 3D diffusion model that combines the strengths of
diffusion models and 3D GANs for versatile, controllable 3D-aware image
synthesis for single-view datasets. Control3Diff explicitly models the
underlying latent distribution (optionally conditioned on external inputs),
thus enabling direct control during the diffusion process. Moreover, our
approach is general and applicable to any type of controlling input, allowing
us to train it with the same diffusion objective without any auxiliary
supervision. We validate the efficacy of Control3Diff on standard image
generation benchmarks, including FFHQ, AFHQ, and ShapeNet, using various
conditioning inputs such as images, sketches, and text prompts. Please see the
project website (\url{https://jiataogu.me/control3diff}) for video comparisons.
DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning
April 13, 2023
Enze Xie, Lewei Yao, Han Shi, Zhili Liu, Daquan Zhou, Zhaoqiang Liu, Jiawei Li, Zhenguo Li
Diffusion models have proven to be highly effective in generating
high-quality images. However, adapting large pre-trained diffusion models to
new domains remains an open challenge, which is critical for real-world
applications. This paper proposes DiffFit, a parameter-efficient strategy to
fine-tune large pre-trained diffusion models that enables fast adaptation to new
domains. DiffFit is embarrassingly simple: it only fine-tunes the bias terms
and newly-added scaling factors in specific layers, yet this results in a
significant training speed-up and reduced model storage costs. Compared with
full fine-tuning, DiffFit achieves a 2$\times$ training speed-up and only needs
to store approximately 0.12\% of the total model parameters. An intuitive
theoretical analysis is provided to justify the efficacy of the scaling
factors for fast adaptation. On 8 downstream datasets, DiffFit achieves superior
or competitive performances compared to the full fine-tuning while being more
efficient. Remarkably, we show that DiffFit can adapt a pre-trained
low-resolution generative model to a high-resolution one by adding minimal
cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of
3.02 on ImageNet 512$\times$512 benchmark by fine-tuning only 25 epochs from a
public pre-trained ImageNet 256$\times$256 checkpoint while being 30$\times$
more training efficient than the closest competitor.
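A minimal sketch of this style of parameter-efficient fine-tuning is shown below: freeze the pre-trained network, re-enable gradients only for bias terms, and wrap selected layers with a learnable scale factor initialized to one. Which layers receive a scale factor, and the toy model used here, are assumptions for illustration.

import torch
import torch.nn as nn

class ScaledLinear(nn.Module):
    """Wraps a frozen linear layer with a trainable scalar gain (init 1.0)."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, x):
        return self.gamma * self.linear(x)

def make_bias_and_scale_trainable(model: nn.Module) -> nn.Module:
    for p in model.parameters():
        p.requires_grad = False                 # freeze everything ...
    for name, p in model.named_parameters():
        if name.endswith("bias"):
            p.requires_grad = True              # ... except bias terms
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, ScaledLinear(child))   # add scale factors
        else:
            make_bias_and_scale_trainable(child)
    return model

toy = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 8))
toy = make_bias_and_scale_trainable(toy)
print([n for n, p in toy.named_parameters() if p.requires_grad])
# -> biases of the frozen layers plus the new gamma parameters
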
An Edit Friendly DDPM Noise Space: Inversion and Manipulations
April 12, 2023
Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli
Denoising diffusion probabilistic models (DDPMs) employ a sequence of white
Gaussian noise samples to generate an image. In analogy with GANs, those noise
maps could be considered as the latent code associated with the generated
image. However, this native noise space does not possess a convenient
structure, and is thus challenging to work with in editing tasks. Here, we
propose an alternative latent noise space for DDPM that enables a wide range of
editing operations via simple means, and present an inversion method for
extracting these edit-friendly noise maps for any given image (real or
synthetically generated). As opposed to the native DDPM noise space, the
edit-friendly noise maps do not have a standard normal distribution and are not
statistically independent across timesteps. However, they allow perfect
reconstruction of any desired image, and simple transformations on them
translate into meaningful manipulations of the output image (e.g., shifting,
color edits). Moreover, in text-conditional models, fixing those noise maps
while changing the text prompt modifies semantics while retaining structure.
We illustrate how this property enables text-based editing of real images via
the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM
inversion). We also show how it can be used within existing diffusion-based
editing methods to improve their quality and diversity.
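The key mechanism, solving each DDPM sampling step for its noise map so that re-running the sampler with those maps reproduces the image exactly, can be sketched as follows. The way the auxiliary trajectory is constructed here (independent draws from q(x_t | x_0)) follows the spirit of the paper but is an assumption, and mu_theta is a stand-in for the model's posterior-mean predictor.

import torch

def extract_noise_maps(x0, mu_theta, sigmas, alphas_bar):
    """Return a starting latent x_T and noise maps z_T, ..., z_1 that make the
    DDPM sampler reproduce x0 exactly."""
    T = len(sigmas) - 1
    xs = [x0]
    for t in range(1, T + 1):   # independent noisy versions of the clean image
        a = alphas_bar[t]
        xs.append(a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0))
    zs = {}
    for t in range(T, 0, -1):   # solve x_{t-1} = mu_theta(x_t, t) + sigma_t z_t
        zs[t] = (xs[t - 1] - mu_theta(xs[t], t)) / sigmas[t]
    return xs[T], zs

def resample(xT, zs, mu_theta, sigmas):
    x = xT
    for t in range(len(sigmas) - 1, 0, -1):
        x = mu_theta(x, t) + sigmas[t] * zs[t]
    return x   # equals the original image by construction

# Toy check with a stand-in posterior-mean predictor.
mu_theta = lambda x, t: 0.9 * x
sigmas = torch.linspace(0.0, 1.0, 11)        # sigma_1, ..., sigma_10 > 0
alphas_bar = torch.linspace(0.999, 0.01, 11)
x0 = torch.randn(1, 3, 8, 8)
xT, zs = extract_noise_maps(x0, mu_theta, sigmas, alphas_bar)
print(torch.allclose(resample(xT, zs, mu_theta, sigmas), x0, atol=1e-4))  # True
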
E(3)xSO(3)-Equivariant Networks for Spherical Deconvolution in Diffusion MRI
April 12, 2023
Axel Elaldi, Guido Gerig, Neel Dey
We present Roto-Translation Equivariant Spherical Deconvolution (RT-ESD), an
$E(3)\times SO(3)$ equivariant framework for sparse deconvolution of volumes
where each voxel contains a spherical signal. Such 6D data naturally arises in
diffusion MRI (dMRI), a medical imaging modality widely used to measure
microstructure and structural connectivity. As each dMRI voxel is typically a
mixture of various overlapping structures, there is a need for blind
deconvolution to recover crossing anatomical structures such as white matter
tracts. Existing dMRI work takes either an iterative or deep learning approach
to sparse spherical deconvolution, yet it typically does not account for
relationships between neighboring measurements. This work constructs
equivariant deep learning layers which respect the symmetries of spatial
rotations, reflections, and translations, alongside the symmetries of voxelwise
spherical rotations. As a result, RT-ESD improves on previous work across
several tasks including fiber recovery on the DiSCo dataset,
deconvolution-derived partial volume estimation on real-world \textit{in vivo}
human brain dMRI, and improved downstream reconstruction of fiber tractograms
on the Tractometer dataset. Our implementation is available at
https://github.com/AxelElaldi/e3so3_conv
Continual Diffusion: Continual Customization of Text-to-Image Diffusion with C-LoRA
April 12, 2023
James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, Hongxia Jin
Recent works demonstrate a remarkable ability to customize text-to-image
diffusion models while only providing a few example images. What happens if you
try to customize such models using multiple, fine-grained concepts in a
sequential (i.e., continual) manner? In our work, we show that recent
state-of-the-art customization methods for text-to-image models suffer from catastrophic
forgetting when new concepts arrive sequentially. Specifically, when adding a
new concept, the ability to generate high-quality images of past, similar
concepts degrades. To circumvent this forgetting, we propose a new method,
C-LoRA, composed of a continually self-regularized low-rank adaptation in cross
attention layers of the popular Stable Diffusion model. Furthermore, we use
customization prompts that do not include the name of the customized object
(i.e., “person” for a human face dataset) and are initialized as completely
random embeddings. Importantly, our method induces only marginal additional
parameter costs and requires no storage of user data for replay. We show that
C-LoRA not only outperforms several baselines for our proposed setting of
text-to-image continual customization, which we refer to as Continual
Diffusion, but that we achieve a new state-of-the-art in the well-established
rehearsal-free continual learning setting for image classification. The high
achieving performance of C-LoRA in two separate domains positions it as a
compelling solution for a wide range of applications, and we believe it has
significant potential for practical impact.
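The building block, a low-rank adapter on a frozen cross-attention projection, and a schematic version of the self-regularization idea are sketched below; the penalty shown is a simplification of C-LoRA's actual loss and the layer sizes are illustrative.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear projection plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pre-trained weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

    def delta(self):
        # The low-rank weight update contributed by this adapter.
        return self.scale * self.up.weight @ self.down.weight

def forgetting_penalty(adapter: LoRALinear, past_delta: torch.Tensor):
    # Schematic self-regularization: discourage new updates where past tasks
    # made large changes (a simplification of C-LoRA's actual loss).
    return ((past_delta.abs() * adapter.delta()) ** 2).sum()

proj = LoRALinear(nn.Linear(768, 320), rank=4)
tokens = torch.randn(2, 77, 768)               # e.g. text-encoder states
print(proj(tokens).shape)                      # torch.Size([2, 77, 320])
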
DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion
April 12, 2023
Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman
We present DreamPose, a diffusion-based method for generating animated
fashion videos from still images. Given an image and a sequence of human body
poses, our method synthesizes a video containing both human and fabric motion.
To achieve this, we transform a pretrained text-to-image model (Stable
Diffusion) into a pose-and-image guided video synthesis model, using a novel
fine-tuning strategy, a set of architectural changes to support the added
conditioning signals, and techniques to encourage temporal consistency. We
fine-tune on a collection of fashion videos from the UBC Fashion dataset. We
evaluate our method on a variety of clothing styles and poses, and demonstrate
that our method produces state-of-the-art results on fashion video
animation. Video results are available on our project page.
SpectralDiff: Hyperspectral Image Classification with Spectral-Spatial Diffusion Models
April 12, 2023
Ning Chen, Jun Yue, Leyuan Fang, Shaobo Xia
Hyperspectral Image (HSI) classification is an important issue in remote
sensing field with extensive applications in earth science. In recent years, a
large number of deep learning-based HSI classification methods have been
proposed. However, existing methods have limited ability to handle
high-dimensional, highly redundant, and complex data, making it challenging to
capture the spectral-spatial distributions of data and relationships between
samples. To address this issue, we propose a generative framework for HSI
classification with diffusion models (SpectralDiff) that effectively mines the
distribution information of high-dimensional and highly redundant data by
iteratively denoising and explicitly constructing the data generation process,
thus better reflecting the relationships between samples. The framework
consists of a spectral-spatial diffusion module, and an attention-based
classification module. The spectral-spatial diffusion module adopts forward and
reverse spectral-spatial diffusion processes to achieve adaptive construction
of sample relationships without requiring prior knowledge of graphical
structure or neighborhood information. It captures spectral-spatial
distribution and contextual information of objects in HSI and mines
unsupervised spectral-spatial diffusion features within the reverse diffusion
process. Finally, these features are fed into the attention-based
classification module for per-pixel classification. The diffusion features can
facilitate cross-sample perception via reconstruction distribution, leading to
improved classification performance. Experiments on three public HSI datasets
demonstrate that the proposed method can achieve better performance than
state-of-the-art methods. For the sake of reproducibility, the source code of
SpectralDiff will be publicly available at
https://github.com/chenning0115/SpectralDiff.
Diffusion models with location-scale noise
April 12, 2023
Alexia Jolicoeur-Martineau, Kilian Fatras, Ke Li, Tal Kachman
cs.LG, cs.AI, cs.NA, math.NA
Diffusion Models (DMs) are powerful generative models that add Gaussian noise
to the data and learn to remove it. We wanted to determine which noise
distribution (Gaussian or non-Gaussian) led to better generated data in DMs.
Since DMs do not work by design with non-Gaussian noise, we built a framework
that allows reversing a diffusion process with non-Gaussian location-scale
noise. We use that framework to show that the Gaussian distribution performs
the best over a wide range of other distributions (Laplace, Uniform, t,
Generalized-Gaussian).
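For reference, a forward noising step with unit-variance Laplace noise, one of the location-scale alternatives compared above, looks as follows; the schedule value is a toy assumption and the reverse process is not shown.

import torch
from torch.distributions import Laplace

def forward_noise(x0: torch.Tensor, alpha_bar_t: float, dist: str = "laplace"):
    """One forward noising step x_t = sqrt(a) x_0 + sqrt(1 - a) eps."""
    if dist == "laplace":
        # Laplace(0, b) has variance 2 b^2, so b = 1/sqrt(2) gives unit variance.
        eps = Laplace(0.0, 1.0 / 2 ** 0.5).sample(x0.shape)
    else:
        eps = torch.randn_like(x0)               # the usual Gaussian choice
    return (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * eps, eps

x0 = torch.randn(4, 3, 32, 32)
xt, eps = forward_noise(x0, alpha_bar_t=0.5)
print(xt.shape, round(eps.std().item(), 3))      # noise std close to 1
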
InterGen: Diffusion-based Multi-human Motion Generation under Complex Interactions
April 12, 2023
Han Liang, Wenqian Zhang, Wenxuan Li, Jingyi Yu, Lan Xu
We have recently seen tremendous progress in diffusion-based approaches for
generating realistic human motions. Yet, they largely disregard multi-human
interactions. In this paper, we present InterGen, an effective diffusion-based
approach that incorporates human-to-human interactions into the motion
diffusion process, which enables layman users to customize high-quality
two-person interaction motions, with only text guidance. We first contribute a
multimodal dataset, named InterHuman. It consists of about 107M frames for
diverse two-person interactions, with accurate skeletal motions and 23,337
natural language descriptions. For the algorithm side, we carefully tailor the
motion diffusion model to our two-person interaction setting. To handle the
symmetry of human identities during interactions, we propose two cooperative
transformer-based denoisers that explicitly share weights, with a mutual
attention mechanism to further connect the two denoising processes. Then, we
propose a novel representation for motion input in our interaction diffusion
model, which explicitly formulates the global relations between the two
performers in the world frame. We further introduce two novel regularization
terms to encode spatial relations, equipped with a corresponding damping scheme
during the training of our interaction diffusion model. Extensive experiments
validate the effectiveness and generalizability of InterGen. Notably, it can
generate more diverse and compelling two-person motions than previous methods
and enables various downstream applications for human interactions.
CamDiff: Camouflage Image Augmentation via Diffusion Model
April 11, 2023
Xue-Jing Luo, Shuo Wang, Zongwei Wu, Christos Sakaridis, Yun Cheng, Deng-Ping Fan, Luc Van Gool
The burgeoning field of camouflaged object detection (COD) seeks to identify
objects that blend into their surroundings. Despite the impressive performance
of recent models, we have identified a limitation in their robustness, where
existing methods may misclassify salient objects as camouflaged ones, despite
these two characteristics being contradictory. This limitation may stem from
lacking multi-pattern training images, leading to less saliency robustness. To
address this issue, we introduce CamDiff, a novel approach inspired by
AI-Generated Content (AIGC) that overcomes the scarcity of multi-pattern
training images. Specifically, we leverage the latent diffusion model to
synthesize salient objects in camouflaged scenes, while using the zero-shot
image classification ability of the Contrastive Language-Image Pre-training
(CLIP) model to prevent synthesis failures and ensure the synthesized object
aligns with the input prompt. Consequently, the synthesized image retains its
original camouflage label while incorporating salient objects, yielding
camouflage samples with richer characteristics. The results of user studies
show that the salient objects in the scenes synthesized by our framework
attract the user’s attention more; thus, such samples pose a greater challenge
to the existing COD models. Our approach enables flexible editing and efficient
large-scale dataset generation at a low cost. It significantly enhances COD
baselines’ training and testing phases, emphasizing robustness across diverse
domains. Our newly-generated datasets and source code are available at
https://github.com/drlxj/CamDiff.
Diffusion Models for Constrained Domains
April 11, 2023
Nic Fishman, Leo Klarner, Valentin De Bortoli, Emile Mathieu, Michael Hutchinson
Denoising diffusion models are a recent class of generative models which
achieve state-of-the-art results in many domains such as unconditional image
generation and text-to-speech tasks. They consist of a noising process
destroying the data and a backward stage defined as the time-reversal of the
noising diffusion. Building on their success, diffusion models have recently
been extended to the Riemannian manifold setting. Yet, these Riemannian
diffusion models require geodesics to be defined for all times. While this
setting encompasses many important applications, it does not include manifolds
defined via a set of inequality constraints, which are ubiquitous in many
scientific domains such as robotics and protein design. In this work, we
introduce two methods to bridge this gap. First, we design a noising process
based on the logarithmic barrier metric induced by the inequality constraints.
Second, we introduce a noising process based on the reflected Brownian motion.
As existing diffusion model techniques cannot be applied in this setting, we
derive new tools to define such models in our framework. We empirically
demonstrate the applicability of our methods to a number of synthetic and
real-world tasks, including the constrained conformational modelling of protein
backbones and robotic arms.
Mask-conditioned latent diffusion for generating gastrointestinal polyp images
April 11, 2023
Roman Macháček, Leila Mozaffari, Zahra Sepasdar, Sravanthi Parasa, Pål Halvorsen, Michael A. Riegler, Vajira Thambawita
In order to take advantage of AI solutions in endoscopy diagnostics, we must
overcome the issue of limited annotations. These limitations are caused by the
high privacy concerns in the medical field and the requirement of getting aid
from experts for the time-consuming and costly medical data annotation process.
In computer vision, image synthesis has made a significant contribution in
recent years as a result of the progress of generative adversarial networks
(GANs) and diffusion probabilistic models (DPM). Novel DPMs have outperformed
GANs in text, image, and video generation tasks. Therefore, this study proposes
a conditional DPM framework to generate synthetic GI polyp images conditioned
on given generated segmentation masks. Our experimental results show that our
system can generate an unlimited number of high-fidelity synthetic polyp images
with the corresponding ground truth masks of polyps. To test the usefulness of
the generated data, we trained binary image segmentation models to study the
effect of using synthetic data. Results show that the best micro-imagewise IOU
of 0.7751 was achieved from DeepLabv3+ when the training data consists of both
real data and synthetic data. However, the results reflect that achieving good
segmentation performance with synthetic data heavily depends on model
architectures.
Generative modeling for time series via Schrödinger bridge
April 11, 2023
Mohamed Hamdouche, Pierre Henry-Labordere, Huyên Pham
math.OC, math.PR, q-fin.CP, stat.ML
We propose a novel generative model for time series based on the Schrödinger
bridge (SB) approach. This consists in the entropic interpolation via optimal
transport between a reference probability measure on path space and a target
measure consistent with the joint data distribution of the time series. The
solution is characterized by a stochastic differential equation on finite
horizon with a path-dependent drift function, hence respecting the temporal
dynamics of the time series distribution. We can estimate the drift function
from data samples either by kernel regression methods or with LSTM neural
networks, and the simulation of the SB diffusion yields new synthetic data
samples of the time series. The performance of our generative model is
evaluated through a series of numerical experiments. First, we test with a toy
autoregressive model, a GARCH Model, and the example of fractional Brownian
motion, and measure the accuracy of our algorithm with marginal and temporal
dependencies metrics. Next, we use our SB generated synthetic samples for the
application to deep hedging on real-data sets. Finally, we illustrate the SB
approach for generating sequence of images.
Binary Latent Diffusion
April 10, 2023
Ze Wang, Jiang Wang, Zicheng Liu, Qiang Qiu
In this paper, we show that a binary latent space can be explored for compact
yet expressive image representations. We model the bi-directional mappings
between an image and the corresponding latent binary representation by training
an auto-encoder with a Bernoulli encoding distribution. On the one hand, the
binary latent space provides a compact discrete image representation of which
the distribution can be modeled more efficiently than pixels or continuous
latent representations. On the other hand, we now represent each image patch as
a binary vector instead of an index into a learned codebook as in discrete image
representations with vector quantization. In this way, we obtain binary latent
representations that allow for better image quality and high-resolution image
representations without any multi-stage hierarchy in the latent space. In this
binary latent space, images can now be generated effectively using a binary
latent diffusion model tailored specifically for modeling the prior over the
binary image representations. We present both conditional and unconditional
image generation experiments with multiple datasets, and show that the proposed
method performs comparably to state-of-the-art methods while dramatically
improving the sampling efficiency to as few as 16 steps without using any
test-time acceleration. The proposed framework can also be seamlessly scaled to
$1024 \times 1024$ high-resolution image generation without resorting to latent
hierarchy or multi-stage refinements.
Ambiguous Medical Image Segmentation using Diffusion Models
April 10, 2023
Aimon Rahman, Jeya Maria Jose Valanarasu, Ilker Hacihaliloglu, Vishal M Patel
Collective insights from a group of experts have always proven to outperform
an individual’s best diagnostic for clinical tasks. For the task of medical
image segmentation, existing research on AI-based alternatives focuses more on
developing models that can imitate the best individual rather than harnessing
the power of expert groups. In this paper, we introduce a single diffusion
model-based approach that produces multiple plausible outputs by learning a
distribution over group insights. Our proposed model generates a distribution
of segmentation masks by leveraging the inherent stochastic sampling process of
diffusion using only minimal additional learning. We demonstrate on three
different medical image modalities (CT, ultrasound, and MRI) that our model is
capable of producing several possible variants while capturing the frequencies
of their occurrences. Comprehensive results show that our proposed approach
outperforms existing state-of-the-art ambiguous segmentation networks in terms
of accuracy while preserving naturally occurring variation. We also propose a
new metric to evaluate the diversity as well as the accuracy of segmentation
predictions that aligns with the interest of clinical practice of collective
insights.
Reflected Diffusion Models
April 10, 2023
Aaron Lou, Stefano Ermon
Score-based diffusion models learn to reverse a stochastic differential
equation that maps data to noise. However, for complex tasks, numerical error
can compound and result in highly unnatural samples. Previous work mitigates
this drift with thresholding, which projects to the natural data domain (such
as pixel space for images) after each diffusion step, but this leads to a
mismatch between the training and generative processes. To incorporate data
constraints in a principled manner, we present Reflected Diffusion Models,
which instead reverse a reflected stochastic differential equation evolving on
the support of the data. Our approach learns the perturbed score function
through a generalized score matching loss and extends key components of
standard diffusion models including diffusion guidance, likelihood-based
training, and ODE sampling. We also bridge the theoretical gap with
thresholding: such schemes are just discretizations of reflected SDEs. On
standard image benchmarks, our method is competitive with or surpasses the
state of the art without architectural modifications and, for classifier-free
guidance, our approach enables fast exact sampling with ODEs and produces more
faithful samples under high guidance weight.
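A toy simulation of a reflected SDE on the unit interval conveys the constraint-handling idea: take an Euler-Maruyama step, then fold the state back into the domain by mirror reflection. The drift and step sizes below are illustrative, not the learned reverse process.

import numpy as np

def reflect_to_unit_interval(x: np.ndarray) -> np.ndarray:
    # Mirror-fold trajectories back into [0, 1].
    x = np.mod(x, 2.0)
    return np.where(x > 1.0, 2.0 - x, x)

def simulate_reflected_sde(x0, drift, n_steps=1000, dt=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x + drift(x) * dt + np.sqrt(dt) * rng.standard_normal(x.shape)
        x = reflect_to_unit_interval(x)          # enforce the domain constraint
    return x

samples = simulate_reflected_sde(np.full(10_000, 0.5), drift=lambda x: 0.5 - x)
print(samples.min(), samples.max())              # stays inside [0, 1]
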
DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion
April 10, 2023
ZiHan Cao, ShiQi Cao, Xiao Wu, JunMing Hou, Ran Ran, Liang-Jian Deng
The denoising diffusion model, as a generative model, has recently received a lot of
attention in the field of image generation thanks to its powerful
generation capability. However, diffusion models have not yet received
sufficient research in the field of image fusion. In this article, we introduce
diffusion model to the image fusion field, treating the image fusion task as
image-to-image translation and designing two different conditional injection
modulation modules (i.e., style transfer modulation and wavelet modulation) to
inject coarse-grained style information and fine-grained high-frequency and
low-frequency information into the diffusion UNet, thereby generating fused
images. In addition, we also discuss residual learning and the selection
of training objectives for the diffusion model in the image fusion task.
Extensive quantitative and qualitative comparisons against benchmarks
demonstrate state-of-the-art results and
good generalization performance in image fusion tasks. Finally, it is hoped
that our method can inspire other works and gain insight into this field to
better apply the diffusion model to image fusion tasks. Code shall be released
for better reproducibility.
BerDiff: Conditional Bernoulli Diffusion Model for Medical Image Segmentation
April 10, 2023
Tao Chen, Chenhui Wang, Hongming Shan
Medical image segmentation is a challenging task with inherent ambiguity and
high uncertainty, attributed to factors such as unclear tumor boundaries and
multiple plausible annotations. The accuracy and diversity of segmentation
masks are both crucial for providing valuable references to radiologists in
clinical practice. While existing diffusion models have shown strong capacities
in various visual generation tasks, it is still challenging to deal with
discrete masks in segmentation. To achieve accurate and diverse medical image
segmentation masks, we propose a novel conditional Bernoulli Diffusion model
for medical image segmentation (BerDiff). Instead of using Gaussian noise,
we first propose to use Bernoulli noise as the diffusion kernel to enhance
the capacity of the diffusion model for binary segmentation tasks, resulting in
more accurate segmentation masks. Second, by leveraging the stochastic nature
of the diffusion model, our BerDiff randomly samples the initial Bernoulli
noise and intermediate latent variables multiple times to produce a range of
diverse segmentation masks, which can highlight salient regions of interest
that can serve as valuable references for radiologists. In addition, our
BerDiff can efficiently sample sub-sequences from the overall trajectory of the
reverse diffusion, thereby speeding up the segmentation process. Extensive
experimental results on two medical image segmentation datasets with different
modalities demonstrate that our BerDiff outperforms other recently published
state-of-the-art methods. Our results suggest diffusion models could serve as a
strong backbone for medical image segmentation.
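The Bernoulli forward kernel can be illustrated in a few lines: with probability 1 - alpha_bar_t a pixel of the binary mask is resampled from a fair coin, so the mask dissolves into pure Bernoulli(0.5) noise as t grows. The exact kernel used by BerDiff may differ in detail; this is the standard binary analogue.

import torch

def bernoulli_noise(x0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    """x0: binary mask in {0, 1}; returns a noisy binary sample x_t."""
    p_one = alpha_bar_t * x0 + (1.0 - alpha_bar_t) * 0.5
    return torch.bernoulli(p_one)

mask = (torch.rand(1, 1, 64, 64) > 0.7).float()
for a in (0.99, 0.5, 0.01):
    print(a, bernoulli_noise(mask, a).mean().item())  # drifts toward 0.5
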
Zero-shot CT Field-of-view Completion with Unconditional Generative Diffusion Prior
April 07, 2023
Kaiwen Xu, Aravind R. Krishnan, Thomas Z. Li, Yuankai Huo, Kim L. Sandler, Fabien Maldonado, Bennett A. Landman
Anatomically consistent field-of-view (FOV) completion to recover truncated
body sections has important applications in quantitative analyses of computed
tomography (CT) with limited FOV. Existing solutions based on conditional
generative models rely on the fidelity of synthetic truncation patterns at the
training phase, which limits the generalizability of the method
to potentially unknown types of truncation. In this study, we evaluate a
zero-shot method based on a pretrained unconditional generative diffusion
prior, where truncation patterns of arbitrary form can be specified at
inference phase. In evaluation on simulated chest CT slices with synthetic FOV
truncation, the method is capable of recovering anatomically consistent body
sections and correcting the subcutaneous adipose tissue measurement error caused
by FOV truncation. However, the correction accuracy is inferior to that of the conditionally
trained counterpart.
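The zero-shot use of an unconditional prior can be sketched as replacement-based inpainting: at every reverse step, the region inside the measured field of view is overwritten with a correspondingly noised copy of the measurement, so only the truncated region is synthesized. The helper names, schedule, and stand-in reverse step below are placeholders, not the paper's implementation.

import torch

@torch.no_grad()
def complete_fov(known, mask, reverse_step, alphas_bar, T=1000):
    """known: measured slice (zeros outside the FOV); mask: 1 inside the FOV."""
    x = torch.randn_like(known)
    for t in range(T - 1, -1, -1):
        a = alphas_bar[t]
        # Noised copy of the measurement at the current noise level.
        known_t = a.sqrt() * known + (1 - a).sqrt() * torch.randn_like(known)
        x = mask * known_t + (1 - mask) * x      # keep the measured region
        x = reverse_step(x, t)                   # one step of the prior's sampler
    return mask * known + (1 - mask) * x

# Toy usage with a stand-in reverse step so the sketch runs.
reverse_step = lambda x, t: 0.99 * x
alphas_bar = torch.linspace(0.9999, 1e-4, 1000)
ct_slice = torch.randn(1, 1, 64, 64)
fov_mask = torch.zeros_like(ct_slice)
fov_mask[..., 16:48] = 1.0
print(complete_fov(ct_slice * fov_mask, fov_mask, reverse_step, alphas_bar).shape)
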
ChiroDiff: Modelling chirographic data with Diffusion Models
April 07, 2023
Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song
Generative modelling over continuous-time geometric constructs, i.e. chirographic
data such as handwriting, sketches, drawings, etc., has been accomplished through
autoregressive distributions. Such strictly-ordered discrete factorization,
however, falls short of capturing key properties of chirographic data: it
fails to build a holistic understanding of the temporal concept due to one-way
visibility (causality). Consequently, temporal data has been modelled as
discrete token sequences of fixed sampling rate instead of capturing the true
underlying concept. In this paper, we introduce a powerful model-class namely
“Denoising Diffusion Probabilistic Models” or DDPMs for chirographic data that
specifically addresses these flaws. Our model named “ChiroDiff”, being
non-autoregressive, learns to capture holistic concepts and therefore remains
resilient, to a large extent, to higher temporal sampling rates. Moreover, we
show that many important downstream utilities (e.g. conditional sampling,
creative mixing) can be flexibly implemented using ChiroDiff. We further show
some unique use-cases like stochastic vectorization, de-noising/healing,
and abstraction are also possible with this model-class. We perform quantitative
and qualitative evaluations of our framework on relevant datasets and find it
to be better than or on par with competing approaches.
Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models
April 06, 2023
Guanhua Zhang, Jiabao Ji, Yang Zhang, Mo Yu, Tommi Jaakkola, Shiyu Chang
Image inpainting refers to the task of generating a complete, natural image
based on a partially revealed reference image. Recently, many research
interests have been focused on addressing this problem using fixed diffusion
models. These approaches typically directly replace the revealed region of the
intermediate or final generated images with that of the reference image or its
variants. However, since the unrevealed regions are not directly modified to
match the context, it results in incoherence between revealed and unrevealed
regions. To address the incoherence problem, a small number of methods
introduce a rigorous Bayesian framework, but they tend to introduce mismatches
between the generated and the reference images due to the approximation errors
in computing the posterior distributions. In this paper, we propose COPAINT,
which can coherently inpaint the whole image without introducing mismatches.
COPAINT also uses the Bayesian framework to jointly modify both revealed and
unrevealed regions, but approximates the posterior distribution in a way that
allows the errors to gradually drop to zero throughout the denoising steps,
thus strongly penalizing any mismatches with the reference image. Our
experiments verify that COPAINT can outperform the existing diffusion-based
methods under both objective and subjective metrics. The codes are available at
https://github.com/UCSB-NLP-Chang/CoPaint/.
Diffusion Models as Masked Autoencoders
April 06, 2023
Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, Christoph Feichtenhofer
There has been a longstanding belief that generation can facilitate a true
understanding of visual data. In line with this, we revisit generatively
pre-training visual representations in light of recent interest in denoising
diffusion models. While directly pre-training with diffusion models does not
produce strong representations, we condition diffusion models on masked input
and formulate diffusion models as masked autoencoders (DiffMAE). Our approach
is capable of (i) serving as a strong initialization for downstream recognition
tasks, (ii) conducting high-quality image inpainting, and (iii) being
effortlessly extended to video where it produces state-of-the-art
classification accuracy. We further perform a comprehensive study on the pros
and cons of design choices and build connections between diffusion models and
masked autoencoders.
Anomaly Detection via Gumbel Noise Score Matching
April 06, 2023
Ahsan Mahmood, Junier Oliva, Martin Styner
We propose Gumbel Noise Score Matching (GNSM), a novel unsupervised method to
detect anomalies in categorical data. GNSM accomplishes this by estimating the
scores, i.e., the gradients of log-likelihoods w.r.t. inputs, of continuously
relaxed categorical distributions. We test our method on a suite of anomaly
detection tabular datasets. GNSM achieves a consistently high performance
across all experiments. We further demonstrate the flexibility of GNSM by
applying it to image data where the model is tasked to detect poor segmentation
predictions. Images ranked anomalous by GNSM show clear segmentation failures,
with the outputs of GNSM strongly correlating with segmentation metrics
computed on ground-truth. We outline the score matching training objective
utilized by GNSM and provide an open-source implementation of our work.
DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model
April 06, 2023
Hoigi Seo, Hayeon Kim, Gwanghyun Kim, Se Young Chun
The increasing demand for high-quality 3D content creation has motivated the
development of automated methods for creating 3D object models from a single
image and/or from a text prompt. However, the reconstructed 3D objects using
state-of-the-art image-to-3D methods still exhibit low correspondence to the
given image and low multi-view consistency. Recent state-of-the-art text-to-3D
methods are also limited, yielding 3D samples with low diversity per prompt
with long synthesis time. To address these challenges, we propose DITTO-NeRF, a
novel pipeline to generate a high-quality 3D NeRF model from a text prompt or a
single image. Our DITTO-NeRF consists of constructing high-quality partial 3D
object for limited in-boundary (IB) angles using the given or text-generated 2D
image from the frontal view and then iteratively reconstructing the remaining
3D NeRF using inpainting latent diffusion model. We propose progressive 3D
object reconstruction schemes in terms of scales (low to high resolution),
angles (IB angles initially to outer-boundary (OB) later), and masks (object to
background boundary) in our DITTO-NeRF so that high-quality information on IB
can be propagated into OB. Our DITTO-NeRF outperforms state-of-the-art methods
in terms of fidelity and diversity qualitatively and quantitatively with much
faster training times than prior image/text-to-3D methods such as DreamFusion
and NeuralLift-360.
Zero-shot Medical Image Translation via Frequency-Guided Diffusion Models
April 05, 2023
Yunxiang Li, Hua-Chieh Shao, Xiao Liang, Liyuan Chen, Ruiqi Li, Steve Jiang, Jing Wang, You Zhang
Recently, the diffusion model has emerged as a superior generative model that
can produce high quality and realistic images. However, for medical image
translation, the existing diffusion models are deficient in accurately
retaining structural information since the structure details of source domain
images are lost during the forward diffusion process and cannot be fully
recovered through learned reverse diffusion, while the integrity of anatomical
structures is extremely important in medical images. For instance, errors in
image translation may distort, shift, or even remove structures and tumors,
leading to incorrect diagnosis and inadequate treatments. Training and
conditioning diffusion models using paired source and target images with
matching anatomy can help. However, such paired data are very difficult and
costly to obtain, and may also reduce the robustness of the developed model to
out-of-distribution testing data. We propose a frequency-guided diffusion model
(FGDM) that employs frequency-domain filters to guide the diffusion model for
structure-preserving image translation. Based on its design, FGDM allows
zero-shot learning, as it can be trained solely on the data from the target
domain, and used directly for source-to-target domain translation without any
exposure to the source-domain data during training. We evaluated it on three
cone-beam CT (CBCT)-to-CT translation tasks for different anatomical sites, and
a cross-institutional MR imaging translation task. FGDM outperformed the
state-of-the-art methods (GAN-based, VAE-based, and diffusion-based) in metrics
of Frechet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and
Structural Similarity Index Measure (SSIM), showing its significant advantages
in zero-shot medical image translation.
GenPhys: From Physical Processes to Generative Models
April 05, 2023
Ziming Liu, Di Luo, Yilun Xu, Tommi Jaakkola, Max Tegmark
cs.LG, cs.AI, physics.comp-ph, physics.data-an, quant-ph
Since diffusion models (DM) and the more recent Poisson flow generative
models (PFGM) are inspired by physical processes, it is reasonable to ask: Can
physical processes offer additional new generative models? We show that the
answer is yes. We introduce a general family, Generative Models from Physical
Processes (GenPhys), where we translate partial differential equations (PDEs)
describing physical processes to generative models. We show that generative
models can be constructed from s-generative PDEs (s for smooth). GenPhys
subsumes the two existing generative models (DM and PFGM) and even gives rise to
new families of generative models, e.g., “Yukawa Generative Models” inspired
by weak interactions. On the other hand, some physical processes by default
do not belong to the GenPhys family, e.g., the wave equation and the
Schrödinger equation, but could be made into the GenPhys family with some
modifications. Our goal with GenPhys is to explore and expand the design space
of generative models.
Generative Novel View Synthesis with 3D-Aware Diffusion Models
April 05, 2023
Eric R. Chan, Koki Nagano, Matthew A. Chan, Alexander W. Bergman, Jeong Joon Park, Axel Levy, Miika Aittala, Shalini De Mello, Tero Karras, Gordon Wetzstein
We present a diffusion-based model for 3D-aware generative novel view
synthesis from as few as a single input image. Our model samples from the
distribution of possible renderings consistent with the input and, even in the
presence of ambiguity, is capable of rendering diverse and plausible novel
views. To achieve this, our method makes use of existing 2D diffusion backbones
but, crucially, incorporates geometry priors in the form of a 3D feature
volume. This latent feature field captures the distribution over possible scene
representations and improves our method’s ability to generate view-consistent
novel renderings. In addition to generating novel views, our method has the
ability to autoregressively synthesize 3D-consistent sequences. We demonstrate
state-of-the-art results on synthetic renderings and room-scale scenes; we also
show compelling results for challenging, real-world objects.
EigenFold: Generative Protein Structure Prediction with Diffusion Models
April 05, 2023
Bowen Jing, Ezra Erives, Peter Pao-Huang, Gabriele Corso, Bonnie Berger, Tommi Jaakkola
q-bio.BM, cs.LG, physics.bio-ph
Protein structure prediction has reached revolutionary levels of accuracy on
single structures, yet distributional modeling paradigms are needed to capture
the conformational ensembles and flexibility that underlie biological function.
Towards this goal, we develop EigenFold, a diffusion generative modeling
framework for sampling a distribution of structures from a given protein
sequence. We define a diffusion process that models the structure as a system
of harmonic oscillators and which naturally induces a cascading-resolution
generative process along the eigenmodes of the system. On recent CAMEO targets,
EigenFold achieves a median TMScore of 0.84, while providing a more
comprehensive picture of model uncertainty via the ensemble of sampled
structures relative to existing methods. We then assess EigenFold’s ability to
model and predict conformational heterogeneity for fold-switching proteins and
ligand-induced conformational change. Code is available at
https://github.com/bjing2016/EigenFold.
A Diffusion-based Method for Multi-turn Compositional Image Generation
Multi-turn compositional image generation (M-CIG) is a challenging task that
aims to iteratively manipulate a reference image given a modification text.
While most of the existing methods for M-CIG are based on generative
adversarial networks (GANs), recent advances in image generation have
demonstrated the superiority of diffusion models over GANs. In this paper, we
propose a diffusion-based method for M-CIG named conditional denoising
diffusion with image compositional matching (CDD-ICM). We leverage CLIP as the
backbone of image and text encoders, and incorporate a gated fusion mechanism,
originally proposed for question answering, to compositionally fuse the
reference image and the modification text at each turn of M-CIG. We introduce a
conditioning scheme to generate the target image based on the fusion results.
To prioritize the semantic quality of the generated target image, we learn an
auxiliary image compositional match (ICM) objective, along with the conditional
denoising diffusion (CDD) objective in a multi-task learning framework.
Additionally, we also perform ICM guidance and classifier-free guidance to
improve performance. Experimental results show that CDD-ICM achieves
state-of-the-art results on two benchmark datasets for M-CIG, i.e., CoDraw and
i-CLEVR.
Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion
April 04, 2023
Davis Rempe, Zhengyi Luo, Xue Bin Peng, Ye Yuan, Kris Kitani, Karsten Kreis, Sanja Fidler, Or Litany
We introduce a method for generating realistic pedestrian trajectories and
full-body animations that can be controlled to meet user-defined goals. We draw
on recent advances in guided diffusion modeling to achieve test-time
controllability of trajectories, which is normally only associated with
rule-based systems. Our guided diffusion model allows users to constrain
trajectories through target waypoints, speed, and specified social groups while
accounting for the surrounding environment context. This trajectory diffusion
model is integrated with a novel physics-based humanoid controller to form a
closed-loop, full-body pedestrian animation system capable of placing large
crowds in a simulated environment with varying terrains. We further propose
utilizing the value function learned during RL training of the animation
controller to guide diffusion to produce trajectories better suited for
particular scenarios such as collision avoidance and traversing uneven terrain.
Video results are available on the project page at
https://nv-tlabs.github.io/trace-pace .
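To make the value-guided sampling idea concrete, here is a minimal sketch of one guided reverse step in the spirit of guided diffusion; it is not the authors' TRACE/PACE code, and `denoiser`, `value_fn`, and the schedule value `alpha_bar_t` are hypothetical placeholders for a trained trajectory denoiser, the controller's learned value function, and a noise schedule.

```python
import torch

def guided_denoise_step(x_t, t, denoiser, value_fn, alpha_bar_t, guide_scale=1.0):
    """One reverse-step sketch: estimate the clean trajectory, then nudge it
    toward higher value (e.g., fewer collisions) via the value gradient."""
    with torch.no_grad():
        eps = denoiser(x_t, t)                               # predicted noise
    x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps) / alpha_bar_t.sqrt()

    x0_hat = x0_hat.detach().requires_grad_(True)
    value = value_fn(x0_hat).sum()                           # higher is better
    grad = torch.autograd.grad(value, x0_hat)[0]
    return (x0_hat + guide_scale * grad).detach()            # gradient ascent on value

# toy usage with stand-in callables
denoiser = lambda x, t: torch.zeros_like(x)
value_fn = lambda x: -(x ** 2).sum(dim=(-1, -2))             # prefer trajectories near the origin
x_t = torch.randn(4, 20, 2)                                  # 4 trajectories of 20 2-D waypoints
x0_guided = guided_denoise_step(x_t, 10, denoiser, value_fn, torch.tensor(0.5))
```

In the paper the guidance signal comes from the RL value function of the physics-based controller; any differentiable scoring of trajectories can be plugged in the same way.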
Denoising Diffusion Probabilistic Models to Predict the Density of Molecular Clouds
April 04, 2023
Duo Xu, Jonathan C. Tan, Chia-Jung Hsu, Ye Zhu
astro-ph.GA, astro-ph.IM, cs.LG
We introduce the state-of-the-art deep learning Denoising Diffusion
Probabilistic Model (DDPM) as a method to infer the volume or number density of
giant molecular clouds (GMCs) from projected mass surface density maps. We
adopt magnetohydrodynamic simulations with different global magnetic field
strengths and large-scale dynamics, i.e., noncolliding and colliding GMCs. We
train a diffusion model on both mass surface density maps and their
corresponding mass-weighted number density maps from different viewing angles
for all the simulations. We compare the diffusion model performance with a more
traditional empirical two-component and three-component power-law fitting
method and with a previously developed neural network machine learning approach
(CASI-2D). We conclude that the diffusion model achieves an order of magnitude
improvement in the accuracy of predicting number density compared to the
other methods. We apply the diffusion method to some example astronomical
column density maps of Taurus and the Infrared Dark Clouds (IRDCs) G28.37+0.07
and G35.39-0.33 to produce maps of their mean volume densities.
A Survey on Graph Diffusion Models: Generative AI in Science for Molecule, Protein and Material
April 04, 2023
Mengchun Zhang, Maryam Qamar, Taegoo Kang, Yuna Jung, Chenshuang Zhang, Sung-Ho Bae, Chaoning Zhang
Diffusion models have become a new SOTA generative modeling method in various
fields, and multiple survey works already provide an overall overview. With the
number of articles on diffusion models increasing exponentially in the past few
years, there is an increasing need for surveys of diffusion models in specific
fields. In this work, we conduct a survey of graph diffusion models. Even
though our focus is to
cover the progress of diffusion models in graphs, we first briefly summarize
how other generative modeling methods are used for graphs. After that, we
introduce the mechanism of diffusion models in various forms, which facilitates
the discussion on the graph diffusion models. The applications of graph
diffusion models mainly fall into the category of AI-generated content (AIGC)
in science, for which we mainly focus on how graph diffusion models are
utilized for generating molecules and proteins but also cover other cases,
including materials design. Moreover, we discuss the issue of evaluating
diffusion models in the graph domain and the existing challenges.
ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model
April 03, 2023
Mingyuan Zhang, Xinying Guo, Liang Pan, Zhongang Cai, Fangzhou Hong, Huirong Li, Lei Yang, Ziwei Liu
3D human motion generation is crucial for the creative industry. Recent advances
rely on generative models with domain knowledge for text-driven motion
generation, leading to substantial progress in capturing common motions.
However, the performance on more diverse motions remains unsatisfactory. In
this work, we propose ReMoDiffuse, a diffusion-model-based motion generation
framework that integrates a retrieval mechanism to refine the denoising
process. ReMoDiffuse enhances the generalizability and diversity of text-driven
motion generation with three key designs: 1) Hybrid Retrieval finds appropriate
references from the database in terms of both semantic and kinematic
similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval
knowledge, adapting to the difference between retrieved samples and the target
motion sequence. 3) Condition Mixture better utilizes the retrieval database
during inference, overcoming the scale sensitivity in classifier-free guidance.
Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art
methods by balancing both text-motion consistency and motion quality,
especially for more diverse motion generation.
Diffusion Bridge Mixture Transports, Schrödinger Bridge Problems and Generative Modeling
April 03, 2023
Stefano Peluchetti
The dynamic Schrödinger bridge problem seeks a stochastic process that
defines a transport between two target probability measures, while optimally
satisfying the criteria of being closest, in terms of Kullback-Leibler
divergence, to a reference process. We propose a novel sampling-based iterative
algorithm, the iterated diffusion bridge mixture (IDBM) procedure, aimed at
solving the dynamic Schrödinger bridge problem. The IDBM procedure exhibits
the attractive property of realizing a valid transport between the target
probability measures at each iteration. We perform an initial theoretical
investigation of the IDBM procedure, establishing its convergence properties.
The theoretical findings are complemented by numerical experiments illustrating
the competitive performance of the IDBM procedure. Recent advancements in
generative modeling employ the time-reversal of a diffusion process to define a
generative process that approximately transports a simple distribution to the
data distribution. As an alternative, we propose utilizing the first iteration
of the IDBM procedure as an approximation-free method for realizing this
transport. This approach offers greater flexibility in selecting the generative
process dynamics and exhibits accelerated training and superior sample quality
over larger discretization intervals. In terms of implementation, the necessary
modifications are minimally intrusive, being limited to the training loss
definition.
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
April 03, 2023
Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong
We present DreamAvatar, a text-and-shape guided framework for generating
high-quality 3D human avatars with controllable poses. While encouraging
results have been reported by recent methods on text-guided 3D common object
generation, generating high-quality human avatars remains an open challenge due
to the complexity of the human body’s shape, pose, and appearance. We propose
DreamAvatar to tackle this challenge, which utilizes a trainable NeRF for
predicting density and color for 3D points and pretrained text-to-image
diffusion models for providing 2D self-supervision. Specifically, we leverage
the SMPL model to provide shape and pose guidance for the generation. We
introduce a dual-observation-space design that involves the joint optimization
of a canonical space and a posed space that are related by a learnable
deformation field. This facilitates the generation of more complete textures
and geometry faithful to the target pose. We also jointly optimize the losses
computed from the full body and from the zoomed-in 3D head to alleviate the
common multi-face “Janus” problem and improve facial details in the generated
avatars. Extensive evaluations demonstrate that DreamAvatar significantly
outperforms existing methods, establishing a new state-of-the-art for
text-and-shape guided 3D human avatar generation.
Textile Pattern Generation Using Diffusion Models
April 02, 2023
Halil Faruk Karagoz, Gulcin Baykal, Irem Arikan Eksi, Gozde Unal
The problem of text-guided image generation is a complex task in Computer
Vision, with various applications, including creating visually appealing
artwork and realistic product images. One popular solution widely used for this
task is the diffusion model, a generative model that generates images through
an iterative process. Although diffusion models have demonstrated promising
results for various image generation tasks, they do not always produce
satisfactory results when applied to more specific domains, such as the
generation of textile patterns based on text guidance. This study presents a
fine-tuned diffusion model specifically trained for textile pattern generation
by text guidance to address this issue. The study involves the collection of
various textile pattern images and their captioning with the help of another AI
model. The fine-tuned diffusion model is trained with this newly created
dataset, and its results are compared with the baseline models visually and
numerically. The results demonstrate that the proposed fine-tuned diffusion
model outperforms the baseline models in terms of pattern quality and
efficiency in textile pattern generation by text guidance. This study presents
a promising solution to the problem of text-guided textile pattern generation
and has the potential to simplify the design process within the textile
industry.
$\infty$-Diff: Infinite Resolution Diffusion with Subsampled Mollified States
March 31, 2023
Sam Bond-Taylor, Chris G. Willcocks
We introduce $\infty$-Diff, a generative diffusion model which directly
operates on infinite resolution data. By randomly sampling subsets of
coordinates during training and learning to denoise the content at those
coordinates, a continuous function is learned that allows sampling at arbitrary
resolutions. In contrast to other recent infinite resolution generative models,
our approach operates directly on the raw data, not requiring latent vector
compression for context, using hypernetworks, nor relying on discrete
components. As such, our approach achieves significantly higher sample quality,
as evidenced by lower FID scores, as well as being able to effectively scale to
higher resolutions than the training data while retaining detail.
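The training recipe the abstract describes, denoising the signal only at a random subset of coordinates, can be sketched as follows. This is an illustration under assumed simplifications (a toy MLP denoiser, a simple noising schedule), not the paper's architecture; `CoordDenoiser` and `training_step` are names introduced here.

```python
import torch
import torch.nn as nn

class CoordDenoiser(nn.Module):
    """Tiny MLP mapping (coordinate, noisy RGB, noise level) -> predicted noise."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 + 3 + 1, hidden), nn.SiLU(),
                                 nn.Linear(hidden, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, coords, noisy_vals, t):
        t = t.expand(coords.shape[0], 1)
        return self.net(torch.cat([coords, noisy_vals, t], dim=-1))

def training_step(model, image, n_points=1024):
    """image: (H, W, 3) tensor treated as a function on [0, 1]^2."""
    H, W, _ = image.shape
    rows = torch.randint(0, H, (n_points,))
    cols = torch.randint(0, W, (n_points,))
    coords = torch.stack([rows / (H - 1), cols / (W - 1)], dim=-1)
    vals = image[rows, cols]                                  # (n_points, 3)

    t = torch.rand(1)                                         # noise level in (0, 1)
    noise = torch.randn_like(vals)
    noisy = (1 - t).sqrt() * vals + t.sqrt() * noise          # simple variance-preserving mix
    return ((model(coords, noisy, t) - noise) ** 2).mean()

loss = training_step(CoordDenoiser(), torch.rand(64, 64, 3))
```

Because the learned function is queried per coordinate, sampling can in principle be evaluated on any grid, which is the property that enables arbitrary-resolution generation.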
A Closer Look at Parameter-Efficient Tuning in Diffusion Models
March 31, 2023
Chendong Xiang, Fan Bao, Chongxuan Li, Hang Su, Jun Zhu
Large-scale diffusion models like Stable Diffusion are powerful and find
various real-world applications while customizing such models by fine-tuning is
both memory and time inefficient. Motivated by the recent progress in natural
language processing, we investigate parameter-efficient tuning in large
diffusion models by inserting small learnable modules (termed adapters). In
particular, we decompose the design space of adapters into orthogonal factors:
the input position, the output position, and the function form, and we perform
Analysis of Variance (ANOVA), a classical statistical approach for analyzing
the correlation between discrete variables (design options) and continuous
variables (evaluation metrics). Our analysis suggests that the input position
of adapters is the critical factor influencing the performance of downstream
tasks. Then, we carefully study the choice of the input position, and we find
that putting the input position after the cross-attention block can lead to the
best performance, validated by additional visualization analyses. Finally, we
provide a recipe for parameter-efficient tuning in diffusion models, which is
comparable if not superior to the fully fine-tuned baseline (e.g., DreamBooth)
with only 0.75% extra parameters, across various customized tasks.
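The finding about the input position can be illustrated with a minimal bottleneck adapter whose input is taken right after the cross-attention sub-layer. The class names, dimensions, and the zero-initialized up-projection are assumptions for the sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter initialized to an identity residual (zero up-projection)."""
    def __init__(self, dim, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class BlockWithAdapter(nn.Module):
    """Wraps a frozen block; the adapter reads the output of the cross-attention
    sub-layer before it enters the feed-forward sub-layer."""
    def __init__(self, cross_attn, feed_forward, dim):
        super().__init__()
        self.cross_attn, self.ff = cross_attn, feed_forward
        self.adapter = Adapter(dim)            # the only trainable module

    def forward(self, x, context):
        x = x + self.cross_attn(x, context)
        x = self.adapter(x)
        return x + self.ff(x)

block = BlockWithAdapter(lambda x, c: x * 0, nn.Linear(320, 320), dim=320)
out = block(torch.randn(2, 77, 320), context=None)
```

Only the adapter parameters receive gradients during customization; the backbone stays frozen, which is what keeps the extra parameter count below one percent.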
Diffusion Action Segmentation
March 31, 2023
Daochang Liu, Qiyue Li, AnhDung Dinh, Tingting Jiang, Mubarak Shah, Chang Xu
Temporal action segmentation is crucial for understanding long-form videos.
Previous works on this task commonly adopt an iterative refinement paradigm by
using multi-stage models. We propose a novel framework via denoising diffusion
models, which nonetheless shares the same inherent spirit of such iterative
refinement. In this framework, action predictions are iteratively generated
from random noise with input video features as conditions. To enhance the
modeling of three striking characteristics of human actions, including the
position prior, the boundary ambiguity, and the relational dependency, we
devise a unified masking strategy for the conditioning inputs in our framework.
Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and
Breakfast, are performed and the proposed method achieves superior or
comparable results to state-of-the-art methods, showing the effectiveness of a
generative approach for action segmentation.
Reference-based Image Composition with Sketch via Structure-aware Diffusion Model
March 31, 2023
Kangyeol Kim, Sunghyun Park, Junsoo Lee, Jaegul Choo
Recent remarkable improvements in large-scale text-to-image generative models
have shown promising results in generating high-fidelity images. To further
enhance editability and enable fine-grained generation, we introduce a
multi-input-conditioned image composition model that incorporates a sketch as a
novel modal, alongside a reference image. Thanks to the edge-level
controllability using sketches, our method enables a user to edit or complete
an image sub-part with a desired structure (i.e., sketch) and content (i.e.,
reference image). Our framework fine-tunes a pre-trained diffusion model to
complete missing regions using the reference image while maintaining sketch
guidance. Albeit simple, this approach opens up broad opportunities for users
to obtain the images they need. Through extensive experiments, we
demonstrate that our proposed method offers unique use cases for image
manipulation, enabling user-driven modifications of arbitrary scenes.
Token Merging for Fast Stable Diffusion
March 30, 2023
Daniel Bolya, Judy Hoffman
The landscape of image generation has been forever changed by open vocabulary
diffusion models. However, at their core these models use transformers, which
makes generation slow. Better implementations to increase the throughput of
these transformers have emerged, but they still evaluate the entire model. In
this paper, we instead speed up diffusion models by exploiting natural
redundancy in generated images by merging redundant tokens. After making some
diffusion-specific improvements to Token Merging (ToMe), our ToMe for Stable
Diffusion can reduce the number of tokens in an existing Stable Diffusion model
by up to 60% while still producing high quality images without any extra
training. In the process, we speed up image generation by up to 2x and reduce
memory consumption by up to 5.6x. Furthermore, this speed-up stacks with
efficient implementations such as xFormers, minimally impacting quality while
being up to 5.4x faster for large images. Code is available at
https://github.com/dbolya/tomesd.
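As a rough illustration of the merging idea, the sketch below averages the r most redundant tokens into their nearest partners before an expensive layer and copies the result back afterwards. The real ToMe-for-SD uses bipartite soft matching and several diffusion-specific adjustments; this simplified version (which, for instance, ignores duplicate destination collisions) only conveys the mechanism.

```python
import torch
import torch.nn.functional as F

def merge_unmerge(tokens, r):
    """tokens: (N, C). Merge the r most redundant tokens into their nearest
    partners and return (merged_tokens, unmerge_fn)."""
    x = F.normalize(tokens, dim=-1)
    a, b = x[0::2], x[1::2]                        # alternate split into two sets
    sim = a @ b.T                                  # cosine similarity
    best_sim, best_dst = sim.max(dim=-1)           # best partner in b for each token in a
    src_idx = best_sim.topk(r).indices             # r most redundant tokens in a

    src, dst = tokens[0::2], tokens[1::2].clone()
    dst[best_dst[src_idx]] = 0.5 * (dst[best_dst[src_idx]] + src[src_idx])  # average in

    keep = torch.ones(src.shape[0], dtype=torch.bool)
    keep[src_idx] = False
    n_keep = src.shape[0] - r
    merged = torch.cat([src[keep], dst], dim=0)

    def unmerge(y):
        out_src = torch.empty_like(src)
        out_src[keep] = y[:n_keep]
        out_src[src_idx] = y[n_keep:][best_dst[src_idx]]   # copy back from merged partner
        out = torch.empty_like(tokens)
        out[0::2], out[1::2] = out_src, y[n_keep:]
        return out

    return merged, unmerge

x = torch.randn(16, 8)                             # 16 tokens, 8 channels
merged, unmerge = merge_unmerge(x, r=4)            # only 12 tokens go through attention
restored = unmerge(merged)                         # back to 16 tokens for the next layer
```

The compute saving comes from running attention and MLP layers on the reduced token set, while unmerging keeps the tensor shapes compatible with the rest of the frozen model.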
Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models
March 30, 2023
Eric Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, Humphrey Shi
The unlearning problem of deep learning models, once primarily an academic
concern, has become a prevalent issue in the industry. The significant advances
in text-to-image generation techniques have prompted global discussions on
privacy, copyright, and safety, as numerous unauthorized personal IDs, content,
artistic creations, and potentially harmful materials have been learned by
these models and later utilized to generate and distribute uncontrolled
content. To address this challenge, we propose \textbf{Forget-Me-Not}, an
efficient and low-cost solution designed to safely remove specified IDs,
objects, or styles from a well-configured text-to-image model in as little as
30 seconds, without impairing its ability to generate other content. Alongside
our method, we introduce the \textbf{Memorization Score (M-Score)} and
\textbf{ConceptBench} to measure the models’ capacity to generate general
concepts, grouped into three primary categories: ID, object, and style. Using
M-Score and ConceptBench, we demonstrate that Forget-Me-Not can effectively
eliminate targeted concepts while maintaining the model’s performance on other
concepts. Furthermore, Forget-Me-Not offers two practical extensions: a)
removal of potentially harmful or NSFW content, and b) enhancement of model
accuracy, inclusion and diversity through \textbf{concept correction and
disentanglement}. It can also be adapted as a lightweight model patch for
Stable Diffusion, allowing for concept manipulation and convenient
distribution. To encourage future research in this critical area and promote
the development of safe and inclusive generative models, we will open-source
our code and ConceptBench at
https://github.com/SHI-Labs/Forget-Me-Not.
DDP: Diffusion Model for Dense Visual Prediction
March 30, 2023
Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
We propose a simple, efficient, yet powerful framework for dense visual
predictions based on the conditional diffusion pipeline. Our approach follows a
“noise-to-map” generative paradigm for prediction by progressively removing
noise from a random Gaussian distribution, guided by the image. The method,
called DDP, efficiently extends the denoising diffusion process into the modern
perception pipeline. Without task-specific design and architecture
customization, DDP is easy to generalize to most dense prediction tasks, e.g.,
semantic segmentation and depth estimation. In addition, DDP shows attractive
properties such as dynamic inference and uncertainty awareness, in contrast to
previous single-step discriminative methods. We show top results on three
representative tasks with six diverse benchmarks, without tricks, DDP achieves
state-of-the-art or competitive performance on each task compared to the
specialist counterparts. For example, semantic segmentation (83.9 mIoU on
Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation
(0.05 REL on KITTI). We hope that our approach will serve as a solid baseline
and facilitate future research.
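A schematic version of the "noise-to-map" sampler described above is sketched below: start from Gaussian noise in map space and iteratively denoise it, conditioned on image features computed once. The functions `image_encoder` and `map_decoder` are hypothetical placeholders, and the deterministic DDIM-style update is a generic choice rather than DDP's exact sampler.

```python
import torch

@torch.no_grad()
def sample_dense_map(image, image_encoder, map_decoder, alphas_bar, num_classes):
    feats = image_encoder(image)                                 # conditioning, computed once
    B, _, H, W = image.shape
    y = torch.randn(B, num_classes, H, W)                        # noise in map space
    for t in reversed(range(len(alphas_bar))):
        a_bar = alphas_bar[t]
        eps = map_decoder(y, feats, t)                           # predicted noise in map space
        y0 = (y - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()       # current clean-map estimate
        if t > 0:
            a_prev = alphas_bar[t - 1]
            y = a_prev.sqrt() * y0 + (1 - a_prev).sqrt() * eps   # deterministic DDIM-style step
        else:
            y = y0
    return y.argmax(dim=1)                                       # per-pixel class labels

alphas_bar = torch.linspace(0.99, 0.01, steps=10)
seg = sample_dense_map(torch.rand(1, 3, 32, 32),
                       image_encoder=lambda im: im,
                       map_decoder=lambda y, f, t: torch.zeros_like(y),
                       alphas_bar=alphas_bar, num_classes=19)
```

Running fewer or more reverse steps at test time is what gives this family of methods their "dynamic inference" property, and the spread over multiple samples can serve as an uncertainty signal.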
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
March 30, 2023
Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Xingqian Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, Humphrey Shi
Generative image editing has recently witnessed extremely fast-paced growth.
Some works use high-level conditioning such as text, while others use low-level
conditioning. Nevertheless, most of them lack fine-grained control over the
properties of the different objects present in the image, i.e., object-level
image editing. In this work, we tackle the task by perceiving the images as an
amalgamation of various objects and aim to control the properties of each
object in a fine-grained manner. Out of these properties, we identify structure
and appearance as the most intuitive to understand and useful for editing
purposes. We propose \textbf{PAIR} Diffusion, a generic framework that can
enable a diffusion model to control the structure and appearance properties of
each object in the image. We show that having control over the properties of
each object in an image leads to comprehensive editing capabilities. Our
framework allows for various object-level editing operations on real images
such as reference image-based appearance editing, free-form shape editing,
adding objects, and variations. Thanks to our design, we do not require any
inversion step. Additionally, we propose multimodal classifier-free guidance
which enables editing images using both reference images and text when using
our approach with foundational diffusion models. We validate the above claims
by extensively evaluating our framework on both unconditional and foundational
diffusion models. Please refer to
https://vidit98.github.io/publication/conference-paper/pair_diff.html for code
and model release.
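The multimodal classifier-free guidance mentioned above can be written, in a generic compositional form, as a weighted combination of noise predictions under progressively richer conditioning. This is a common formulation rather than PAIR-Diffusion's exact implementation; the weights shown are illustrative defaults.

```python
import torch

def multimodal_cfg(eps_uncond, eps_img, eps_img_text, w_img=1.5, w_text=7.5):
    """eps_uncond: unconditional prediction; eps_img: conditioned on the
    reference image only; eps_img_text: conditioned on both image and text."""
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_text * (eps_img_text - eps_img))

eps = [torch.randn(1, 4, 64, 64) for _ in range(3)]   # placeholder latent-space predictions
guided = multimodal_cfg(*eps)
```

Adjusting the two weights independently trades off how strongly the edit follows the reference appearance versus the text instruction.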
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
March 30, 2023
Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li
Recently, diffusion models have achieved great success in image synthesis.
However, when it comes to the layout-to-image generation where an image often
has a complex scene of multiple objects, how to make strong control over both
the global layout map and each detailed object remains a challenging task. In
this paper, we propose a diffusion model named LayoutDiffusion that can obtain
higher generation quality and greater controllability than the previous works.
To overcome the difficult multimodal fusion of image and layout, we propose to
construct a structural image patch with region information and transform the
patched image into a special layout to fuse with the normal layout in a unified
form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention
(OaCA) are proposed to model the relationship among multiple objects and
designed to be object-aware and position-sensitive, allowing for precisely
controlling the spatially related information. Extensive experiments show that
our LayoutDiffusion outperforms the previous SOTA methods, improving FID and CAS
by a relative 46.35% and 26.70% on COCO-Stuff and by 44.29% and 41.82% on VG. Code is
available at https://github.com/ZGCTroy/LayoutDiffusion.
Discriminative Class Tokens for Text-to-Image Diffusion Models
March 30, 2023
Idan Schwartz, Vésteinn Snæbjarnarson, Hila Chefer, Ryan Cotterell, Serge Belongie, Lior Wolf, Sagie Benaim
Recent advances in text-to-image diffusion models have enabled the generation
of diverse and high-quality images. While impressive, the images often fall
short of depicting subtle details and are susceptible to errors due to
ambiguity in the input text. One way of alleviating these issues is to train
diffusion models on class-labeled datasets. This approach has two
disadvantages: (i) supervised datasets are generally small compared to
large-scale scraped text-image datasets on which text-to-image models are
trained, affecting the quality and diversity of the generated images, or (ii)
the input is a hard-coded label, as opposed to free-form text, limiting the
control over the generated images.
In this work, we propose a non-invasive fine-tuning technique that
capitalizes on the expressive potential of free-form text while achieving high
accuracy through discriminative signals from a pretrained classifier. This is
done by iteratively modifying the embedding of an added input token of a
text-to-image diffusion model, by steering generated images toward a given
target class according to a classifier. Our method is fast compared to prior
fine-tuning methods and does not require a collection of in-class images or
retraining of a noise-tolerant classifier. We evaluate our method extensively,
showing that the generated images are: (i) more accurate and of higher quality
than standard diffusion models, (ii) can be used to augment training data in a
low-resource setting, and (iii) reveal information about the data used to train
the guiding classifier. The code is available at
https://github.com/idansc/discriminative_class_tokens.
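The token-steering loop described in the abstract can be sketched as a small optimization over a single added embedding. The callables `generate_with_token` (a differentiable generation/decoding path) and `classifier` are assumptions standing in for the paper's pipeline.

```python
import torch
import torch.nn.functional as F

def optimize_class_token(token_emb, generate_with_token, classifier,
                         target_class, steps=50, lr=1e-2):
    """Iteratively update one token embedding so that images generated with it
    are classified as `target_class` by a pretrained classifier."""
    token_emb = token_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([token_emb], lr=lr)
    for _ in range(steps):
        images = generate_with_token(token_emb)        # must stay differentiable w.r.t. the token
        logits = classifier(images)
        target = torch.full((logits.shape[0],), target_class, dtype=torch.long)
        loss = F.cross_entropy(logits, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token_emb.detach()
```

Only the added token embedding is updated; the diffusion model and the classifier stay frozen, which is what makes the technique non-invasive and fast compared to full fine-tuning.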
DiffCollage: Parallel Generation of Large Content with Diffusion Models
March 30, 2023
Qinsheng Zhang, Jiaming Song, Xun Huang, Yongxin Chen, Ming-Yu Liu
We present DiffCollage, a compositional diffusion model that can generate
large content by leveraging diffusion models trained on generating pieces of
the large content. Our approach is based on a factor graph representation where
each factor node represents a portion of the content and a variable node
represents their overlap. This representation allows us to aggregate
intermediate outputs from diffusion models defined on individual nodes to
generate content of arbitrary size and shape in parallel without resorting to
an autoregressive generation procedure. We apply DiffCollage to various tasks,
including infinite image generation, panorama image generation, and
long-duration text-guided motion generation. Extensive experimental results
with a comparison to strong autoregressive baselines verify the effectiveness
of our approach.
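For a wide image split into overlapping fixed-size patches, the factor-graph aggregation amounts to adding the patch (factor-node) noise predictions and subtracting the overlap (variable-node) predictions. The sketch below assumes a hypothetical `denoiser` that accepts both widths; it illustrates the aggregation rule, not the full DiffCollage system.

```python
import torch

def collage_eps(x, t, denoiser, patch_w, overlap_w):
    """x: (B, C, H, W) wide canvas with W = k*patch_w - (k-1)*overlap_w."""
    _, _, _, W = x.shape
    eps = torch.zeros_like(x)
    stride = patch_w - overlap_w
    starts = list(range(0, W - patch_w + 1, stride))
    for s in starts:                                   # factor nodes: full patches
        eps[..., s:s + patch_w] += denoiser(x[..., s:s + patch_w], t)
    for s in starts[1:]:                               # variable nodes: overlaps
        eps[..., s:s + overlap_w] -= denoiser(x[..., s:s + overlap_w], t)
    return eps

# toy check: a constant-zero denoiser and a 3-patch-wide canvas
x = torch.randn(1, 3, 64, 64 * 3 - 2 * 16)
eps = collage_eps(x, t=10, denoiser=lambda z, t: torch.zeros_like(z),
                  patch_w=64, overlap_w=16)
```

Because every patch is denoised with the same model call, all pieces can be processed in one batch, which is what makes the generation parallel rather than autoregressive.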
HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion
March 29, 2023
Ziya Erkoç, Fangchang Ma, Qi Shan, Matthias Nießner, Angela Dai
Implicit neural fields, typically encoded by a multilayer perceptron (MLP)
that maps from coordinates (e.g., xyz) to signals (e.g., signed distances),
have shown remarkable promise as a high-fidelity and compact representation.
However, the lack of a regular and explicit grid structure also makes it
challenging to apply generative modeling directly on implicit neural fields in
order to synthesize new data. To this end, we propose HyperDiffusion, a novel
approach for unconditional generative modeling of implicit neural fields.
HyperDiffusion operates directly on MLP weights and generates new neural
implicit fields encoded by synthesized MLP parameters. Specifically, a
collection of MLPs is first optimized to faithfully represent individual data
samples. Subsequently, a diffusion process is trained in this MLP weight space
to model the underlying distribution of neural implicit fields. HyperDiffusion
enables diffusion modeling over an implicit, compact, and yet high-fidelity
representation of complex signals across 3D shapes and 4D mesh animations
within one single unified framework.
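The core idea, treating each fitted MLP's flattened weight vector as a data point for a diffusion model, can be sketched as follows. The per-sample MLP fitting stage and the transformer denoiser of the paper are omitted, and `weight_denoiser` is a placeholder.

```python
import torch
import torch.nn as nn

def flatten_mlp(mlp: nn.Module) -> torch.Tensor:
    return torch.cat([p.detach().flatten() for p in mlp.parameters()])

def weight_diffusion_loss(weight_denoiser, mlps, alphas_bar):
    """One training step of a DDPM-style objective over flattened MLP weights."""
    w = torch.stack([flatten_mlp(m) for m in mlps])              # (N, D) weight vectors
    t = torch.randint(0, len(alphas_bar), (w.shape[0],))
    a_bar = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(w)
    w_t = a_bar.sqrt() * w + (1 - a_bar).sqrt() * noise
    return ((weight_denoiser(w_t, t) - noise) ** 2).mean()

mlps = [nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1)) for _ in range(8)]
alphas_bar = torch.linspace(0.99, 0.01, steps=100)
loss = weight_diffusion_loss(lambda w_t, t: torch.zeros_like(w_t), mlps, alphas_bar)
```

Sampling then produces a new weight vector that is reshaped back into an MLP, whose implicit field can be queried to extract a shape or animation.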
Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos
March 29, 2023
Kun Su, Kaizhi Qian, Eli Shlizerman, Antonio Torralba, Chuang Gan
cs.CV, cs.LG, cs.SD, eess.AS
Modeling sounds emitted from physical object interactions is critical for
immersive perceptual experiences in real and virtual worlds. Traditional
methods of impact sound synthesis use physics simulation to obtain a set of
physics parameters that could represent and synthesize the sound. However, they
require fine details of both the object geometries and impact locations, which
are rarely available in the real world and cannot be applied to synthesize
impact sounds from common videos. On the other hand, existing video-driven deep
learning-based approaches could only capture the weak correspondence between
visual content and impact sounds since they lack physics knowledge. In this
work, we propose a physics-driven diffusion model that can synthesize
high-fidelity impact sound for a silent video clip. In addition to the video
content, we propose to use additional physics priors to guide the impact sound
synthesis procedure. The physics priors include both physics parameters that
are directly estimated from noisy real-world impact sound examples without
sophisticated setup and learned residual parameters that interpret the sound
environment via neural networks. We further implement a novel diffusion model
with specific training and inference strategies to combine physics priors and
visual information for impact sound synthesis. Experimental results show that
our model outperforms several existing systems in generating realistic impact
sounds. More importantly, the physics-based representations are fully
interpretable and transparent, thus enabling us to perform sound editing
flexibly.
Diffusion Schrödinger Bridge Matching
March 29, 2023
Yuyang Shi, Valentin De Bortoli, Andrew Campbell, Arnaud Doucet
Solving transport problems, i.e. finding a map transporting one given
distribution to another, has numerous applications in machine learning. Novel
mass transport methods motivated by generative modeling have recently been
proposed, e.g. Denoising Diffusion Models (DDMs) and Flow Matching Models
(FMMs) implement such a transport through a Stochastic Differential Equation
(SDE) or an Ordinary Differential Equation (ODE). However, while it is
desirable in many applications to approximate the deterministic dynamic Optimal
Transport (OT) map which admits attractive properties, DDMs and FMMs are not
guaranteed to provide transports close to the OT map. In contrast,
Schr"odinger bridges (SBs) compute stochastic dynamic mappings which recover
entropy-regularized versions of OT. Unfortunately, existing numerical methods
approximating SBs either scale poorly with dimension or accumulate errors
across iterations. In this work, we introduce Iterative Markovian Fitting
(IMF), a new methodology for solving SB problems, and Diffusion Schrödinger
Bridge Matching (DSBM), a novel numerical algorithm for computing IMF iterates.
DSBM significantly improves over previous SB numerics and recovers as
special/limiting cases various recent transport methods. We demonstrate the
performance of DSBM on a variety of problems.
4D Facial Expression Diffusion Model
March 29, 2023
Kaifeng Zou, Sylvain Faisan, Boyang Yu, Sébastien Valette, Hyewon Seo
Facial expression generation is one of the most challenging and long-sought
aspects of character animation, with many interesting applications. This
challenging task has traditionally relied heavily on digital craftspersons and
remains largely underexplored. In this paper, we introduce a generative framework
for generating 3D facial expression sequences (i.e. 4D faces) that can be
conditioned on different inputs to animate an arbitrary 3D face mesh. It is
composed of two tasks: (1) Learning the generative model that is trained over a
set of 3D landmark sequences, and (2) Generating 3D mesh sequences of an input
facial mesh driven by the generated landmark sequences. The generative model is
based on a Denoising Diffusion Probabilistic Model (DDPM), which has achieved
remarkable success in generative tasks of other domains. While it can be
trained unconditionally, its reverse process can still be conditioned by
various condition signals. This allows us to efficiently develop several
downstream tasks involving various conditional generation, by using expression
labels, text, partial sequences, or simply a facial geometry. To obtain the
full mesh deformation, we then develop a landmark-guided encoder-decoder to
apply the geometrical deformation embedded in landmarks on a given facial mesh.
Experiments show that our model has learned to generate realistic, high-quality
expressions solely from a dataset of relatively small size, improving over
state-of-the-art methods. Videos and qualitative comparisons with other
methods can be found at https://github.com/ZOUKaifeng/4DFM. Code and models
will be made available upon acceptance.
WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models
March 29, 2023
Konstantina Nikolaidou, George Retsinas, Vincent Christlein, Mathias Seuret, Giorgos Sfikas, Elisa Barney Smith, Hamam Mokayed, Marcus Liwicki
Text-to-Image synthesis is the task of generating an image according to a
specific text description. Generative Adversarial Networks have been considered
the standard method for image synthesis virtually since their introduction.
Denoising Diffusion Probabilistic Models are recently setting a new baseline,
with remarkable results in Text-to-Image synthesis, among other fields. Aside
from its usefulness per se, it can also be particularly relevant as a tool for data
augmentation to aid training models for other document image processing tasks.
In this work, we present a latent diffusion-based method for styled
text-to-text-content-image generation at the word level. Our proposed method is
able to generate realistic word image samples from different writer styles, by
using class index styles and text content prompts without the need of
adversarial training, writer recognition, or text recognition. We gauge system
performance with the Fréchet Inception Distance, writer recognition accuracy,
and writer retrieval. We show that the proposed model produces samples that are
aesthetically pleasing, help boost text recognition performance, and achieve
writer retrieval scores similar to those of real data. Code is available at:
https://github.com/koninik/WordStylist.
Your Diffusion Model is Secretly a Zero-Shot Classifier
March 28, 2023
Alexander C. Li, Mihir Prabhudesai, Shivam Duggal, Ellis Brown, Deepak Pathak
cs.LG, cs.AI, cs.CV, cs.NE, cs.RO
The recent wave of large-scale text-to-image diffusion models has
dramatically increased our text-based image generation abilities. These models
can generate realistic images for a staggering variety of prompts and exhibit
impressive compositional generalization abilities. Almost all use cases thus
far have solely focused on sampling; however, diffusion models can also provide
conditional density estimates, which are useful for tasks beyond image
generation. In this paper, we show that the density estimates from large-scale
text-to-image diffusion models like Stable Diffusion can be leveraged to
perform zero-shot classification without any additional training. Our
generative approach to classification, which we call Diffusion Classifier,
attains strong results on a variety of benchmarks and outperforms alternative
methods of extracting knowledge from diffusion models. Although a gap remains
between generative and discriminative approaches on zero-shot recognition
tasks, our diffusion-based approach has significantly stronger multimodal
compositional reasoning ability than competing discriminative approaches.
Finally, we use Diffusion Classifier to extract standard classifiers from
class-conditional diffusion models trained on ImageNet. Our models achieve
strong classification performance using only weak augmentations and exhibit
qualitatively better “effective robustness” to distribution shift. Overall, our
results are a step toward using generative over discriminative models for
downstream tasks. Results and visualizations at
https://diffusion-classifier.github.io/
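The classification rule can be sketched as a Monte Carlo comparison of per-class denoising errors: the class whose text conditioning best predicts the added noise wins. Here `eps_model`, `encode_prompt`, and `alphas_bar` are placeholders for a text-to-image diffusion model's components, and the variance-reduction tricks of the actual Diffusion Classifier (such as sharing noise samples across classes) are omitted.

```python
import torch

@torch.no_grad()
def diffusion_classify(x0, class_prompts, eps_model, encode_prompt,
                       alphas_bar, n_trials=32):
    """Return the index of the prompt with the lowest average noise-prediction
    error on noised versions of the input image x0."""
    errors = []
    for prompt in class_prompts:
        cond = encode_prompt(prompt)
        err = 0.0
        for _ in range(n_trials):
            t = torch.randint(0, len(alphas_bar), (1,))
            a_bar = alphas_bar[t]
            noise = torch.randn_like(x0)
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
            err += ((eps_model(x_t, t, cond) - noise) ** 2).mean().item()
        errors.append(err / n_trials)
    return min(range(len(class_prompts)), key=errors.__getitem__)
```

Because the error is a proxy for the conditional ELBO, lower error corresponds to higher conditional likelihood of the image under that class description.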
Visual Chain-of-Thought Diffusion Models
March 28, 2023
William Harvey, Frank Wood
Recent progress with conditional image diffusion models has been stunning,
and this holds true whether we are speaking about models conditioned on a text
description, a scene layout, or a sketch. Unconditional image diffusion models
are also improving but lag behind, as do diffusion models which are conditioned
on lower-dimensional features like class labels. We propose to close the gap
between conditional and unconditional models using a two-stage sampling
procedure. In the first stage we sample an embedding describing the semantic
content of the image. In the second stage we sample the image conditioned on
this embedding and then discard the embedding. Doing so lets us leverage the
power of conditional diffusion models on the unconditional generation task,
which we show improves FID by 25-50% compared to standard unconditional
generation.
DDMM-Synth: A Denoising Diffusion Model for Cross-modal Medical Image Synthesis with Sparse-view Measurement Embedding
March 28, 2023
Xiaoyue Li, Kai Shang, Gaoang Wang, Mark D. Butala
eess.IV, cs.CV, physics.med-ph
Reducing the radiation dose in computed tomography (CT) is important to
mitigate radiation-induced risks. One option is to employ a well-trained model
to compensate for incomplete information and map sparse-view measurements to
the CT reconstruction. However, reconstruction from sparsely sampled
measurements is insufficient to uniquely characterize an object in CT, and a
learned prior model may be inadequate for unencountered cases. Medical modal
translation from magnetic resonance imaging (MRI) to CT is an alternative but
may introduce incorrect information into the synthesized CT images in addition
to the fact that there exists no explicit transformation describing their
relationship. To address these issues, we propose a novel framework called the
denoising diffusion model for medical image synthesis (DDMM-Synth) to close the
performance gaps described above. This framework combines an MRI-guided
diffusion model with a new CT measurement embedding reverse sampling scheme.
Specifically, the null-space content of the one-step denoising result is
refined by the MRI-guided data distribution prior, and its range-space
component derived from an explicit operator matrix and the sparse-view CT
measurements is directly integrated into the inference stage. DDMM-Synth can
adjust the projection number of CT a posteriori for a particular clinical
application and its modified version can even improve the results significantly
for noisy cases. Our results show that DDMM-Synth outperforms other
state-of-the-art supervised-learning-based baselines under fair experimental
conditions.
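The range-/null-space split used for data consistency in this kind of sparse-view inverse problem can be made concrete with a small linear-algebra sketch: the measured (range-space) component is fixed by the data, and the null-space component is filled in by the model's denoised estimate. A dense matrix stands in for the CT projector here; this is a generic illustration, not the paper's exact operator or sampler.

```python
import torch

def data_consistent_update(x0_hat, y, A):
    """x0_hat: one-step denoised estimate (flattened, shape (D,)); y: measurements
    of shape (M,); A: (M, D) measurement matrix."""
    A_pinv = torch.linalg.pinv(A)
    range_part = A_pinv @ y                       # determined by the measurements
    null_part = x0_hat - A_pinv @ (A @ x0_hat)    # left to the generative prior
    return range_part + null_part

A = torch.randn(32, 256)                          # heavily under-determined operator
x_true = torch.randn(256)
y = A @ x_true
x0_hat = torch.randn(256)                         # pretend this came from the diffusion prior
x_dc = data_consistent_update(x0_hat, y, A)
assert torch.allclose(A @ x_dc, y, atol=1e-3)     # measurements are exactly re-matched
```

Applying this projection at every reverse step keeps the trajectory consistent with the sparse-view measurements while the diffusion prior shapes the unobserved component.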
DiffULD: Diffusive Universal Lesion Detection
March 28, 2023
Peiang Zhao, Han Li, Ruiyang Jin, S. Kevin Zhou
Universal Lesion Detection (ULD) in computed tomography (CT) plays an
essential role in computer-aided diagnosis. Promising ULD results have been
reported by anchor-based detection designs, but they have inherent drawbacks
due to the use of anchors: i) Insufficient training targets and ii)
Difficulties in anchor design. Diffusion probability models (DPM) have
demonstrated outstanding capabilities in many vision tasks. Many DPM-based
approaches achieve great success in natural image object detection without
using anchors. But they are still ineffective for ULD due to the insufficient
training targets. In this paper, we propose a novel ULD method, DiffULD, which
utilizes DPM for lesion detection. To tackle the negative effect triggered by
insufficient targets, we introduce a novel center-aligned bounding box padding
strategy that provides additional high-quality training targets yet avoids
significant performance deterioration. DiffULD is inherently advanced in
locating lesions with diverse sizes and shapes since it can predict with
arbitrary boxes. Experiments on the benchmark dataset DeepLesion show the
superiority of DiffULD when compared to state-of-the-art ULD approaches.
Anti-DreamBooth: Protecting users from personalized text-to-image synthesis
March 27, 2023
Thanh Van Le, Hao Phung, Thuan Hoang Nguyen, Quan Dao, Ngoc Tran, Anh Tran
Text-to-image diffusion models are nothing short of a revolution, allowing anyone,
even without design skills, to create realistic images from simple text inputs.
With powerful personalization tools like DreamBooth, they can generate images
of a specific person from just a few of their reference images.
However, when misused, such a powerful and convenient tool can produce fake
news or disturbing content targeting any individual victim, posing a severe
negative social impact. In this paper, we explore a defense system called
Anti-DreamBooth against such malicious use of DreamBooth. The system aims to
add subtle noise perturbation to each user’s image before publishing in order
to disrupt the generation quality of any DreamBooth model trained on these
perturbed images. We investigate a wide range of algorithms for perturbation
optimization and extensively evaluate them on two facial datasets over various
text-to-image model versions. Despite the complicated formulation of DreamBooth
and Diffusion-based text-to-image models, our methods effectively defend users
from the malicious use of those models. Their effectiveness withstands even
adverse conditions, such as model or prompt/term mismatching between training
and testing. Our code will be available at
https://github.com/VinAIResearch/Anti-DreamBooth.git.
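The protective-perturbation idea can be sketched as PGD-style ascent: maximize a surrogate diffusion/personalization loss within a small L-infinity ball around the user's photo. The callable `diffusion_loss` is a placeholder for such a surrogate objective; this generic sketch is not the authors' algorithm verbatim.

```python
import torch

def protect_image(image, diffusion_loss, eps=8 / 255, step=2 / 255, iters=50):
    """Find a bounded perturbation that makes the photo hard to learn from."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = diffusion_loss(image + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += step * grad.sign()                        # ascend the loss
            delta.clamp_(-eps, eps)                            # stay in the L-inf ball
            delta.copy_((image + delta).clamp(0, 1) - image)   # keep a valid pixel range
    return (image + delta).detach()

# toy usage with a stand-in surrogate loss
image = torch.rand(1, 3, 64, 64)
protected = protect_image(image, diffusion_loss=lambda x: (x ** 2).mean())
```

The perturbation budget controls the trade-off between how invisible the protection is and how much it degrades a DreamBooth model trained on the perturbed photos.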
Debiasing Scores and Prompts of 2D Diffusion for Robust Text-to-3D Generation
March 27, 2023
Susung Hong, Donghoon Ahn, Seungryong Kim
cs.CV, cs.CL, cs.GR, cs.LG
Existing score-distilling text-to-3D generation techniques, despite their
considerable promise, often encounter the view inconsistency problem. One of
the most notable issues is the Janus problem, where the most canonical view of
an object (e.g., face or head) appears in other views. In this work,
we explore existing frameworks for score-distilling text-to-3D generation and
identify the main causes of the view inconsistency problem – the embedded bias
of 2D diffusion models. Based on these findings, we propose two approaches to
debias the score-distillation frameworks for view-consistent text-to-3D
generation. Our first approach, called score debiasing, involves cutting off
the score estimated by 2D diffusion models and gradually increasing the
truncation value throughout the optimization process. Our second approach,
called prompt debiasing, identifies conflicting words between user prompts and
view prompts using a language model, and adjusts the discrepancy between view
prompts and the viewing direction of an object. Our experimental results show
that our methods improve the realism of the generated 3D objects by
significantly reducing artifacts and achieve a good trade-off between
faithfulness to the 2D diffusion models and 3D consistency with little
overhead. Our project page is available
at https://susunghong.github.io/Debiased-Score-Distillation-Sampling/.
Training-free Style Transfer Emerges from h-space in Diffusion models
March 27, 2023
Jaeseok Jeong, Mingi Kwon, Youngjung Uh
Diffusion models (DMs) synthesize high-quality images in various domains.
However, controlling their generative process is still hazy because the
intermediate variables in the process are not rigorously studied. Recently, the
bottleneck feature of the U-Net, namely $h$-space, is found to convey the
semantics of the resulting image. It enables StyleCLIP-like latent editing
within DMs. In this paper, we explore further usage of $h$-space beyond
attribute editing, and introduce a method to inject the content of one image
into another image by combining their features in the generative processes.
Briefly, given the original generative process of the other image, 1) we
gradually blend the bottleneck feature of the content with proper
normalization, and 2) we calibrate the skip connections to match the injected
content. Unlike custom-diffusion approaches, our method does not require
time-consuming optimization or fine-tuning. Instead, our method manipulates
intermediate features within a feed-forward generative process. Furthermore,
our method does not require supervision from external networks. The code is
available at https://curryjung.github.io/InjectFusion/
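One step of the content-injection procedure can be sketched as blending the content image's bottleneck feature into the current sample's bottleneck with a norm-preserving rescale. The handles `unet_down`, `unet_mid`, and `unet_up` are hypothetical entry points into a U-Net's stages, and the skip-connection calibration described in the abstract is simplified away here.

```python
import torch

def inject_content_step(x_t, t, h_content, gamma, unet_down, unet_mid, unet_up):
    """Blend a content bottleneck feature into the generative pass and return
    the noise prediction for this timestep."""
    h, skips = unet_down(x_t, t)
    h_mix = (1 - gamma) * h + gamma * h_content          # gradual content blending
    h_mix = h_mix * (h.norm() / (h_mix.norm() + 1e-8))   # simple norm-preserving calibration
    h_mix = unet_mid(h_mix, t)
    return unet_up(h_mix, skips, t)
```

Because the manipulation happens purely in the feed-forward pass, no optimization, fine-tuning, or external network is needed at inference time.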
Exploring Continual Learning of Diffusion Models
March 27, 2023
Michał Zając, Kamil Deja, Anna Kuzina, Jakub M. Tomczak, Tomasz Trzciński, Florian Shkurti, Piotr Miłoś
cs.LG, cs.AI, cs.CV, stat.ML
Diffusion models have achieved remarkable success in generating high-quality
images thanks to their novel training procedures applied to unprecedented
amounts of data. However, training a diffusion model from scratch is
computationally expensive. This highlights the need to investigate the
possibility of training these models iteratively, reusing computation while the
data distribution changes. In this study, we take the first step in this
direction and evaluate the continual learning (CL) properties of diffusion
models. We begin by benchmarking the most common CL methods applied to
Denoising Diffusion Probabilistic Models (DDPMs), where we note the strong
performance of the experience replay with the reduced rehearsal coefficient.
Furthermore, we provide insights into the dynamics of forgetting, which exhibit
diverse behavior across diffusion timesteps. We also uncover certain pitfalls
of using the bits-per-dimension metric for evaluating CL.
Text-to-Image Diffusion Models are Zero-Shot Classifiers
March 27, 2023
Kevin Clark, Priyank Jaini
The excellent generative capabilities of text-to-image diffusion models
suggest they learn informative representations of image-text data. However,
what knowledge their representations capture is not fully understood, and they
have not been thoroughly explored on downstream tasks. We investigate diffusion
models by proposing a method for evaluating them as zero-shot classifiers. The
key idea is using a diffusion model’s ability to denoise a noised image given a
text description of a label as a proxy for that label’s likelihood. We apply
our method to Stable Diffusion and Imagen, using it to probe fine-grained
aspects of the models’ knowledge and comparing them with CLIP’s zero-shot
abilities. They perform competitively with CLIP on a wide range of zero-shot
image classification datasets. Additionally, they achieve state-of-the-art
results on shape/texture bias tests and can successfully perform attribute
binding while CLIP cannot. Although generative pre-training is prevalent in
NLP, visual foundation models often use other methods such as contrastive
learning. Based on our findings, we argue that generative pre-training should
be explored as a compelling alternative for vision-language tasks.
Diffusion Denoised Smoothing for Certified and Adversarial Robust Out-Of-Distribution Detection
March 27, 2023
Nicola Franco, Daniel Korth, Jeanette Miriam Lorenz, Karsten Roscher, Stephan Guennemann
As the use of machine learning continues to expand, the importance of
ensuring its safety cannot be overstated. A key concern in this regard is the
ability to identify whether a given sample is from the training distribution,
or is an “Out-Of-Distribution” (OOD) sample. In addition, adversaries can
manipulate OOD samples in ways that lead a classifier to make a confident
prediction. In this study, we present a novel approach for certifying the
robustness of OOD detection within a $\ell_2$-norm around the input, regardless
of network architecture and without the need for specific components or
additional training. Further, we improve current techniques for detecting
adversarial attacks on OOD samples, while providing high levels of certified
and adversarial robustness on in-distribution samples. The average of all OOD
detection metrics on CIFAR10/100 shows an increase of $\sim 13 \% / 5\%$
relative to previous approaches.
Seer: Language Instructed Video Prediction with Latent Diffusion Models
March 27, 2023
Xianfan Gu, Chuan Wen, Jiaming Song, Yang Gao
Imagining the future trajectory is the key for robots to make sound planning
and successfully reach their goals. Therefore, text-conditioned video
prediction (TVP) is an essential task to facilitate general robot policy
learning, i.e., predicting future video frames with a given language
instruction and reference frames. It is a highly challenging task to ground
task-level goals specified by instructions and high-fidelity frames together,
requiring large-scale data and computation. To tackle this task and empower
robots with the ability to foresee the future, we propose a sample and
computation-efficient model, named \textbf{Seer}, by inflating the pretrained
text-to-image (T2I) stable diffusion models along the temporal axis. We inflate
the denoising U-Net and language conditioning model with two novel techniques,
Autoregressive Spatial-Temporal Attention and Frame Sequential Text Decomposer,
to propagate the rich prior knowledge in the pretrained T2I models across the
frames. With the well-designed architecture, Seer makes it possible to generate
high-fidelity, coherent, and instruction-aligned video frames by fine-tuning a
few layers on a small amount of data. The experimental results on Something
Something V2 (SSv2) and Bridgedata datasets demonstrate our superior video
prediction performance with around 210-hour training on 4 RTX 3090 GPUs:
decreasing the FVD of the current SOTA model from 290 to 200 on SSv2 and
achieving at least 70\% preference in the human evaluation.
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion
March 27, 2023
Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang
cs.CV, cs.AI, cs.LG, cs.MM
We propose a new formulation of temporal action detection (TAD) with
denoising diffusion, DiffTAD in short. Taking as input random temporal
proposals, it can yield action proposals accurately given an untrimmed long
video. This presents a generative modeling perspective, against previous
discriminative learning manners. This capability is achieved by first diffusing
the ground-truth proposals to random ones (i.e., the forward/noising process)
and then learning to reverse the noising process (i.e., the backward/denoising
process). Concretely, we establish the denoising process in the Transformer
decoder (e.g., DETR) by introducing a temporal location query design with
faster convergence in training. We further propose a cross-step selective
conditioning algorithm for inference acceleration. Extensive evaluations on
ActivityNet and THUMOS show that our DiffTAD achieves top performance compared
to previous art alternatives. The code will be made available at
https://github.com/sauradip/DiffusionTAD.
Conditional Score-Based Reconstructions for Multi-contrast MRI
March 26, 2023
Brett Levac, Ajil Jalal, Kannan Ramchandran, Jonathan I. Tamir
Magnetic resonance imaging (MRI) exam protocols consist of multiple
contrast-weighted images of the same anatomy to emphasize different tissue
properties. Due to the long acquisition times required to collect fully sampled
k-space measurements, it is common to only collect a fraction of k-space for
each scan and subsequently solve independent inverse problems for each image
contrast. Recently, there has been a push to further accelerate MRI exams using
data-driven priors, and generative models in particular, to regularize the
ill-posed inverse problem of image reconstruction. These methods have shown
promising improvements over classical methods. However, many of the approaches
neglect the additional information present in a clinical MRI exam like the
multi-contrast nature of the data and treat each scan as an independent
reconstruction. In this work we show that by learning a joint Bayesian prior
over multi-contrast data with a score-based generative model we are able to
leverage the underlying structure between random variables related to a given
imaging problem. This leads to an improvement in image reconstruction fidelity
over generative models that rely only on a marginal prior over the image
contrast of interest.
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents
March 26, 2023
Tenglong Ao, Zeyi Zhang, Libin Liu
The automatic generation of stylized co-speech gestures has recently received
increasing attention. Previous systems typically allow style control via
predefined text labels or example motion clips, which are often not flexible
enough to convey user intent accurately. In this work, we present
GestureDiffuCLIP, a neural network framework for synthesizing realistic,
stylized co-speech gestures with flexible style control. We leverage the power
of the large-scale Contrastive-Language-Image-Pre-training (CLIP) model and
present a novel CLIP-guided mechanism that extracts efficient style
representations from multiple input modalities, such as a piece of text, an
example motion clip, or a video. Our system learns a latent diffusion model to
generate high-quality gestures and infuses the CLIP representations of style
into the generator via an adaptive instance normalization (AdaIN) layer. We
further devise a gesture-transcript alignment mechanism that ensures a
semantically correct gesture generation based on contrastive learning. Our
system can also be extended to allow fine-grained style control of individual
body parts. We demonstrate an extensive set of examples showing the flexibility
and generalizability of our model to a variety of style descriptions. In a user
study, we show that our system outperforms the state-of-the-art approaches
regarding human likeness, appropriateness, and style correctness.
DiracDiffusion: Denoising and Incremental Reconstruction with Assured Data-Consistency
March 25, 2023
Zalan Fabian, Berk Tinaz, Mahdi Soltanolkotabi
eess.IV, cs.CV, cs.LG, I.2.6; I.4.4; I.4.5
Diffusion models have established new state of the art in a multitude of
computer vision tasks, including image restoration. Diffusion-based inverse
problem solvers generate reconstructions of exceptional visual quality from
heavily corrupted measurements. However, in what is widely known as the
perception-distortion trade-off, the price of perceptually appealing
reconstructions is often paid in declined distortion metrics, such as PSNR.
Distortion metrics measure faithfulness to the observation, a crucial
requirement in inverse problems. In this work, we propose a novel framework for
inverse problem solving, namely we assume that the observation comes from a
stochastic degradation process that gradually degrades and noises the original
clean image. We learn to reverse the degradation process in order to recover
the clean image. Our technique maintains consistency with the original
measurement throughout the reverse process, and allows for great flexibility in
trading off perceptual quality for improved distortion metrics and sampling
speedup via early-stopping. We demonstrate the efficiency of our method on
different high-resolution datasets and inverse problems, achieving great
improvements over other state-of-the-art diffusion-based methods with respect
to both perceptual and distortion metrics. Source code and pre-trained models
will be released soon.
DiffuScene: Scene Graph Denoising Diffusion Probabilistic Model for Generative Indoor Scene Synthesis
March 24, 2023
Jiapeng Tang, Yinyu Nie, Lev Markhasin, Angela Dai, Justus Thies, Matthias Nießner
We present DiffuScene for indoor 3D scene synthesis based on a novel scene
graph denoising diffusion probabilistic model, which generates 3D instance
properties stored in a fully-connected scene graph and then retrieves the most
similar object geometry for each graph node, i.e., object instance, which is
characterized as a concatenation of different attributes, including location,
size, orientation, semantic, and geometry features. Based on this scene graph,
we designed a diffusion model to determine the placements and types of 3D
instances. Our method can facilitate many downstream applications, including
scene completion, scene arrangement, and text-conditioned scene synthesis.
Experiments on the 3D-FRONT dataset show that our method can synthesize more
physically plausible and diverse indoor scenes than state-of-the-art methods.
Extensive ablation studies verify the effectiveness of our design choice in
scene diffusion models.
MindDiffuser: Controlled Image Reconstruction from Human Brain Activity with Semantic and Structural Diffusion
March 24, 2023
Yizhuo Lu, Changde Du, Dianpeng Wang, Huiguang He
Reconstructing visual stimuli from measured functional magnetic resonance
imaging (fMRI) has been a meaningful and challenging task. Previous studies
have successfully achieved reconstructions with structures similar to the
original images, such as the outlines and size of some natural images. However,
these reconstructions lack explicit semantic information and are difficult to
discern. In recent years, many studies have utilized multi-modal pre-trained
models with stronger generative capabilities to reconstruct images that are
semantically similar to the original ones. However, these images have
uncontrollable structural information such as position and orientation. To
address both of the aforementioned issues simultaneously, we propose a
two-stage image reconstruction model called MindDiffuser, utilizing Stable
Diffusion. In Stage 1, the VQ-VAE latent representations and the CLIP text
embeddings decoded from fMRI are put into the image-to-image process of Stable
Diffusion, which yields a preliminary image that contains semantic and
structural information. In Stage 2, we utilize the low-level CLIP visual
features decoded from fMRI as supervisory information, and continually adjust
the two features in Stage 1 through backpropagation to align the structural
information. The results of both qualitative and quantitative analyses
demonstrate that our proposed model has surpassed the current state-of-the-art
models in terms of reconstruction results on Natural Scenes Dataset (NSD).
Furthermore, the results of ablation experiments indicate that each component
of our model is effective for image reconstruction.
CoLa-Diff: Conditional Latent Diffusion Model for Multi-Modal MRI Synthesis
March 24, 2023
Lan Jiang, Ye Mao, Xi Chen, Xiangfeng Wang, Chao Li
eess.IV, cs.CV, I.3.3; I.4.10
MRI synthesis promises to mitigate the challenge of missing MRI modality in
clinical practice. Diffusion model has emerged as an effective technique for
image synthesis by modelling complex and variable data distributions. However,
most diffusion-based MRI synthesis models use a single modality. As they
operate in the original image domain, they are memory-intensive and less
feasible for multi-modal synthesis. Moreover, they often fail to preserve the
anatomical structure in MRI. Further, balancing the multiple conditions from
multi-modal MRI inputs is crucial for multi-modal synthesis. Here, we propose
the first diffusion-based multi-modality MRI synthesis model, namely
Conditioned Latent Diffusion Model (CoLa-Diff). To reduce memory consumption,
we design CoLa-Diff to operate in the latent space. We propose a novel network
architecture, similar to cooperative filtering, to address possible compression
artifacts and noise in the latent space. To better maintain the anatomical
structure, brain region masks are introduced as priors on density
distributions to guide the diffusion process. We further present auto-weight
adaptation to employ multi-modal information effectively. Our experiments
demonstrate that CoLa-Diff outperforms other state-of-the-art MRI synthesis
methods, promising to serve as an effective tool for multi-modal MRI synthesis.
DisC-Diff: Disentangled Conditional Diffusion Model for Multi-Contrast MRI Super-Resolution
March 24, 2023
Ye Mao, Lan Jiang, Xi Chen, Chao Li
Multi-contrast magnetic resonance imaging (MRI) is the most common management
tool used to characterize neurological disorders based on brain tissue
contrasts. However, acquiring high-resolution MRI scans is time-consuming and
infeasible under specific conditions. Hence, multi-contrast super-resolution
methods have been developed to improve the quality of low-resolution contrasts
by leveraging complementary information from multi-contrast MRI. Current deep
learning-based super-resolution methods have limitations in estimating
restoration uncertainty and avoiding mode collapse. Although the diffusion
model has emerged as a promising approach for image enhancement, capturing
complex interactions between multiple conditions introduced by multi-contrast
MRI super-resolution remains a challenge for clinical applications. In this
paper, we propose a disentangled conditional diffusion model, DisC-Diff, for
multi-contrast brain MRI super-resolution. It utilizes the sampling-based
generation and simple objective function of diffusion models to estimate
uncertainty in restorations effectively and ensure a stable optimization
process. Moreover, DisC-Diff leverages a disentangled multi-stream network to
fully exploit complementary information from multi-contrast MRI, improving
model interpretation under multiple conditions of multi-contrast inputs. We
validated the effectiveness of DisC-Diff on two datasets: the IXI dataset,
which contains 578 normal brains, and a clinical dataset with 316 pathological
brains. Our experimental results demonstrate that DisC-Diff outperforms other
state-of-the-art methods both quantitatively and visually.
Conditional Image-to-Video Generation with Latent Flow Diffusion Models
March 24, 2023
Haomiao Ni, Changhao Shi, Kai Li, Sharon X. Huang, Martin Renqiang Min
Conditional image-to-video (cI2V) generation aims to synthesize a new
plausible video starting from an image (e.g., a person’s face) and a condition
(e.g., an action class label like smile). The key challenge of the cI2V task
lies in the simultaneous generation of realistic spatial appearance and
temporal dynamics corresponding to the given image and condition. In this
paper, we propose an approach for cI2V using novel latent flow diffusion models
(LFDM) that synthesize an optical flow sequence in the latent space based on
the given condition to warp the given image. Compared to previous
direct-synthesis-based works, our proposed LFDM can better synthesize spatial
details and temporal motion by fully utilizing the spatial content of the given
image and warping it in the latent space according to the generated
temporally-coherent flow. The training of LFDM consists of two separate stages:
(1) an unsupervised learning stage to train a latent flow auto-encoder for
spatial content generation, including a flow predictor to estimate latent flow
between pairs of video frames, and (2) a conditional learning stage to train a
3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike
previous DMs operating in pixel space or latent feature space that couples
spatial and temporal information, the DM in our LFDM only needs to learn a
low-dimensional latent flow space for motion generation, thus being more
computationally efficient. We conduct comprehensive experiments on multiple
datasets, where LFDM consistently outperforms prior arts. Furthermore, we show
that LFDM can be easily adapted to new domains by simply finetuning the image
decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM.
End-to-End Diffusion Latent Optimization Improves Classifier Guidance
March 23, 2023
Bram Wallace, Akash Gokul, Stefano Ermon, Nikhil Naik
Classifier guidance – using the gradients of an image classifier to steer
the generations of a diffusion model – has the potential to dramatically
expand the creative control over image generation and editing. However,
currently classifier guidance requires either training new noise-aware models
to obtain accurate gradients or using a one-step denoising approximation of the
final generation, which leads to misaligned gradients and sub-optimal control.
We highlight this approximation’s shortcomings and propose a novel guidance
method: Direct Optimization of Diffusion Latents (DOODL), which enables
plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of
a pre-trained classifier on the true generated pixels, using an invertible
diffusion process to achieve memory-efficient backpropagation. Showcasing the
potential of more precise guidance, DOODL outperforms one-step classifier
guidance on computational and human evaluation metrics across different forms
of guidance: using CLIP guidance to improve generations of complex prompts from
DrawBench, using fine-grained visual classifiers to expand the vocabulary of
Stable Diffusion, enabling image-conditioned generation with a CLIP visual
encoder, and improving image aesthetics using an aesthetic scoring network.
Code at https://github.com/salesforce/DOODL.
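A minimal sketch of the latent-optimization idea behind DOODL, assuming a differentiable `generate` function (e.g., an invertible sampler) and a `guidance_loss` built from a pre-trained classifier; both names are illustrative placeholders rather than the released code.

```python
import torch

def optimize_latents(latent, generate, guidance_loss, steps=20, lr=0.05):
    """Generic sketch of end-to-end latent optimization for guidance.

    `generate` maps a diffusion latent to pixels and is assumed differentiable
    (e.g. via a memory-efficient invertible sampler); `guidance_loss` scores the
    generated image with a pre-trained classifier such as CLIP.
    """
    latent = latent.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = generate(latent)         # full sampling chain, kept differentiable
        loss = guidance_loss(image)      # e.g. negative CLIP similarity to a prompt
        loss.backward()
        opt.step()
        # Re-normalizing keeps the latent near the Gaussian shell; this detail is
        # an assumption for the sketch, not a quote from the paper.
        with torch.no_grad():
            latent.mul_(latent.numel() ** 0.5 / latent.norm())
    return latent.detach()
```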
Ablating Concepts in Text-to-Image Diffusion Models
March 23, 2023
Nupur Kumari, Bingliang Zhang, Sheng-Yu Wang, Eli Shechtman, Richard Zhang, Jun-Yan Zhu
Large-scale text-to-image diffusion models can generate high-fidelity images
with powerful compositional ability. However, these models are typically
trained on an enormous amount of Internet data, often containing copyrighted
material, licensed images, and personal photos. Furthermore, they have been
found to replicate the style of various living artists or memorize exact
training samples. How can we remove such copyrighted concepts or images without
retraining the model from scratch? To achieve this goal, we propose an
efficient method of ablating concepts in the pretrained model, i.e., preventing
the generation of a target concept. Our algorithm learns to match the image
distribution for a target style, instance, or text prompt we wish to ablate to
the distribution corresponding to an anchor concept. This prevents the model
from generating target concepts given its text condition. Extensive experiments
show that our method can successfully prevent the generation of the ablated
concept while preserving closely related concepts in the model.
Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators
March 23, 2023
Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi
Recent text-to-video generation approaches rely on computationally heavy
training and require large-scale video datasets. In this paper, we introduce a
new task of zero-shot text-to-video generation and propose a low-cost approach
(without any training or optimization) by leveraging the power of existing
text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable
for the video domain.
Our key modifications include (i) enriching the latent codes of the generated
frames with motion dynamics to keep the global scene and the background
consistent over time; and (ii) reprogramming frame-level self-attention using a new
cross-frame attention of each frame on the first frame, to preserve the
context, appearance, and identity of the foreground object.
Experiments show that this leads to low overhead, yet high-quality and
remarkably consistent video generation. Moreover, our approach is not limited
to text-to-video synthesis but is also applicable to other tasks such as
conditional and content-specialized video generation, and Video
Instruct-Pix2Pix, i.e., instruction-guided video editing.
As experiments show, our method performs comparably or sometimes better than
recent approaches, despite not being trained on additional video data. Our code
will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .
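The cross-frame attention modification in (ii) can be sketched as below: every frame attends to the keys and values of the first frame only. The tensor layout and the plain scaled-dot-product form are assumptions for illustration, not the repository's implementation.

```python
import torch

def cross_frame_attention(q, k, v):
    """Sketch of cross-frame attention: each frame's queries attend to the
    keys/values of the FIRST frame, preserving its appearance and identity.

    q, k, v: (frames, tokens, dim), a toy stand-in for the self-attention
    layers of a text-to-image UNet.
    """
    k0 = k[:1].expand_as(k)   # reuse frame-0 keys for all frames
    v0 = v[:1].expand_as(v)   # reuse frame-0 values for all frames
    attn = torch.softmax(q @ k0.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0
```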
Medical diffusion on a budget: textual inversion for medical image generation
March 23, 2023
Bram de Wilde, Anindo Saha, Richard P. G. ten Broek, Henkjan Huisman
Diffusion-based models for text-to-image generation have gained immense
popularity due to recent advancements in efficiency, accessibility, and
quality. Although it is becoming increasingly feasible to perform inference
with these systems using consumer-grade GPUs, training them from scratch still
requires access to large datasets and significant computational resources. In
the case of medical image generation, the availability of large, publicly
accessible datasets that include text reports is limited due to legal and
ethical concerns. While training a diffusion model on a private dataset may
address this issue, it is not always feasible for institutions lacking the
necessary computational resources. This work demonstrates that pre-trained
Stable Diffusion models, originally trained on natural images, can be adapted
to various medical imaging modalities by training text embeddings with textual
inversion. In this study, we conducted experiments using medical datasets
comprising only 100 samples from three medical modalities. Embeddings were
trained in a matter of hours, while still retaining diagnostic relevance in
image generation. Experiments were designed to achieve several objectives.
Firstly, we fine-tuned the training and inference processes of textual
inversion, revealing that larger embeddings and more examples are required.
Secondly, we validated our approach by demonstrating a 2% increase in the
diagnostic accuracy (AUC) for detecting prostate cancer on MRI, which is a
challenging multi-modal imaging modality, from 0.78 to 0.80. Thirdly, we
performed simulations by interpolating between healthy and diseased states,
combining multiple pathologies, and inpainting to show embedding flexibility
and control of disease appearance. Finally, the embeddings trained in this
study are small (less than 1 MB), which facilitates easy sharing of medical
data with reduced privacy concerns.
March 23, 2023
Ce Zheng, Xianpeng Liu, Mengyuan Liu, Tianfu Wu, Guo-Jun Qi, Chen Chen
cs.CV, cs.AI, cs.HC, cs.MM
Human mesh recovery (HMR) provides rich human body information for various
real-world applications. While image-based HMR methods have achieved impressive
results, they often struggle to recover humans in dynamic scenarios, leading to
temporal inconsistencies and non-smooth 3D motion predictions due to the
absence of human motion. In contrast, video-based approaches leverage temporal
information to mitigate this issue. In this paper, we present DiffMesh, an
innovative motion-aware Diffusion-like framework for video-based HMR. DiffMesh
establishes a bridge between diffusion models and human motion, efficiently
generating accurate and smooth output mesh sequences by incorporating human
motion within the forward process and reverse process in the diffusion model.
Extensive experiments are conducted on the widely used Human3.6M and 3DPW
datasets, which demonstrate the effectiveness
and efficiency of our DiffMesh. Visual comparisons in real-world scenarios
further highlight DiffMesh’s suitability for practical applications.
Audio Diffusion Model for Speech Synthesis: A Survey on Text To Speech and Speech Enhancement in Generative AI
March 23, 2023
Chenshuang Zhang, Chaoning Zhang, Sheng Zheng, Mengchun Zhang, Maryam Qamar, Sung-Ho Bae, In So Kweon
cs.SD, cs.AI, cs.LG, cs.MM, eess.AS
Generative AI has demonstrated impressive performance in various fields,
among which speech synthesis is an interesting direction. With the diffusion
model as the most popular generative model, numerous works have attempted two
active tasks: text to speech and speech enhancement. This work conducts a
survey on audio diffusion models, complementary to existing surveys
that either lack the recent progress of diffusion-based speech synthesis or
only give an overall picture of applying diffusion models across multiple fields.
Specifically, this work first briefly introduces the background of audio and
diffusion models. For the text-to-speech task, we divide the methods into
three categories based on the stage where the diffusion model is adopted: acoustic
model, vocoder, and end-to-end framework. Moreover, we categorize speech
enhancement tasks by whether certain signals are removed from or added to the input
speech. Comparisons of experimental results and discussions are also covered in
this survey.
MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
March 23, 2023
Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wenjing Yang
The advent of open-source AI communities has produced a cornucopia of
powerful text-guided diffusion models that are trained on various datasets.
However, few explorations have been conducted on ensembling such models to combine
their strengths. In this work, we propose a simple yet effective method called
Saliency-aware Noise Blending (SNB) that can empower the fused text-guided
diffusion models to achieve more controllable generation. Specifically, we
experimentally find that the responses of classifier-free guidance are highly
related to the saliency of generated images. Thus we propose to trust different
models in their areas of expertise by blending the predicted noises of two
diffusion models in a saliency-aware manner. SNB is training-free and can be
completed within a DDIM sampling process. Additionally, it can automatically
align the semantics of two noise spaces without requiring additional
annotations such as masks. Extensive experiments show the impressive
effectiveness of SNB in various applications. Project page is available at
https://magicfusion.github.io/.
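A small sketch of the saliency-aware noise blending step described above; how the per-model saliency maps are obtained, and the hard mask used here, are illustrative assumptions rather than the paper's exact procedure.

```python
import torch

def saliency_noise_blend(eps_a, eps_b, saliency_a, saliency_b):
    """Sketch of saliency-aware blending of two models' predicted noises.

    Each model is trusted more where its own saliency is higher; the blended
    noise is then fed to the usual DDIM update.
    """
    mask = (saliency_a >= saliency_b).float()     # 1 where model A is more salient
    return mask * eps_a + (1.0 - mask) * eps_b
```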
Enhancing Unsupervised Speech Recognition with Diffusion GANs
March 23, 2023
Xianchao Wu
cs.CL, cs.LG, cs.SD, eess.AS
We enhance the vanilla adversarial training method for unsupervised Automatic
Speech Recognition (ASR) by a diffusion-GAN. Our model (1) injects instance
noises of various intensities to the generator’s output and unlabeled reference
text which are sampled from pretrained phoneme language models with a length
constraint, (2) asks diffusion timestep-dependent discriminators to separate
them, and (3) back-propagates the gradients to update the generator.
Word/phoneme error rate comparisons with wav2vec-U on the Librispeech (3.1% for
test-clean and 5.6% for test-other), TIMIT, and MLS datasets show that our
enhancement strategies work effectively.
Cube-Based 3D Denoising Diffusion Probabilistic Model for Cone Beam Computed Tomography Reconstruction with Incomplete Data
March 22, 2023
Wenjun Xia, Chuang Niu, Wenxiang Cong, Ge Wang
eess.IV, cs.LG, eess.SP, physics.bio-ph
Deep learning (DL) has emerged as a new approach in the field of computed
tomography (CT) with many applications. A primary example is CT reconstruction
from incomplete data, such as sparse-view image reconstruction. However,
applying DL to sparse-view cone-beam CT (CBCT) remains challenging. Many models
learn the mapping from sparse-view CT images to the ground truth but often fail
to achieve satisfactory performance. Incorporating sinogram data and performing
dual-domain reconstruction improves image quality and suppresses artifacts, but
a straightforward 3D implementation requires keeping an entire 3D sinogram and
the many parameters of dual-domain networks in memory. This remains a major
challenge, limiting further research, development and applications. In this
paper, we propose a sub-volume-based 3D denoising diffusion probabilistic model
(DDPM) for CBCT image reconstruction from down-sampled data. Our DDPM network,
trained on data cubes extracted from paired fully sampled sinograms and
down-sampled sinograms, is employed to inpaint down-sampled sinograms. Our
method divides the entire sinogram into overlapping cubes and processes them in
parallel on multiple GPUs, successfully overcoming the memory limitation.
Experimental results demonstrate that our approach effectively suppresses
few-view artifacts while preserving textural details faithfully.
Diffuse-Denoise-Count: Accurate Crowd-Counting with Diffusion Models
March 22, 2023
Yasiru Ranasinghe, Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, Vishal M. Patel
Crowd counting is a key aspect of crowd analysis and has been typically
accomplished by estimating a crowd-density map and summing over the density
values. However, this approach suffers from background noise accumulation and
loss of density due to the use of broad Gaussian kernels to create the ground
truth density maps. This issue can be overcome by narrowing the Gaussian
kernel. However, existing approaches perform poorly when trained with such
ground truth density maps. To overcome this limitation, we propose using
conditional diffusion models to predict density maps, as diffusion models are
known to model complex distributions well and show high fidelity to training
data during crowd-density map generation. Furthermore, as the intermediate time
steps of the diffusion process are noisy, we incorporate a regression branch
for direct crowd estimation only during training to improve the feature
learning. In addition, owing to the stochastic nature of the diffusion model,
we produce multiple density maps to improve counting performance, in contrast
to existing crowd-counting pipelines. Further, instead of plain density
summation, we introduce contour detection followed by summation within contours
as the counting operation, which is more robust to background noise.
We conduct extensive experiments on public datasets to validate the
effectiveness of our method. Specifically, our novel crowd-counting pipeline
reduces the counting error by up to 6% on JHU-CROWD++ and up to
7% on UCF-QNRF.
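A small sketch of the contour-based counting operation, assuming OpenCV and a predicted density map; the threshold and the choice to sum density only inside detected contours are illustrative, not lifted from the paper's code.

```python
import cv2
import numpy as np

def count_from_density(density, thresh=0.05):
    """Sketch of counting via contour detection followed by summation.

    Instead of summing the whole (noisy) density map, density is summed only
    inside detected contours, which suppresses background noise accumulation.
    """
    binary = (density > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    total = 0.0
    for c in contours:
        mask = np.zeros_like(binary)
        cv2.drawContours(mask, [c], -1, 1, thickness=-1)   # fill the contour region
        total += float((density * mask).sum())             # sum density inside it
    return total
```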
Pix2Video: Video Editing using Image Diffusion
March 22, 2023
Duygu Ceylan, Chun-Hao Paul Huang, Niloy J. Mitra
Image diffusion models, trained on massive image collections, have emerged as
the most versatile image generator model in terms of quality and diversity.
They support inverting real images and conditional (e.g., text) generation,
making them attractive for high-quality image editing applications. We
investigate how to use such pre-trained image models for text-guided video
editing. The critical challenge is to achieve the target edits while still
preserving the content of the source video. Our method works in two simple
steps: first, we use a pre-trained structure-guided (e.g., depth) image
diffusion model to perform text-guided edits on an anchor frame; then, in the
key step, we progressively propagate the changes to the future frames via
self-attention feature injection to adapt the core denoising step of the
diffusion model. We then consolidate the changes by adjusting the latent code
for the frame before continuing the process. Our approach is training-free and
generalizes to a wide range of edits. We demonstrate the effectiveness of the
approach by extensive experimentation and compare it against four different
prior and parallel efforts (on ArXiv). We demonstrate that realistic
text-guided video edits are possible, without any compute-intensive
preprocessing or video-specific finetuning.
EDGI: Equivariant Diffusion for Planning with Embodied Agents
March 22, 2023
Johann Brehmer, Joey Bose, Pim de Haan, Taco Cohen
Embodied agents operate in a structured world, often solving tasks with
spatial, temporal, and permutation symmetries. Most algorithms for planning and
model-based reinforcement learning (MBRL) do not take this rich geometric
structure into account, leading to sample inefficiency and poor generalization.
We introduce the Equivariant Diffuser for Generating Interactions (EDGI), an
algorithm for MBRL and planning that is equivariant with respect to the product
of the spatial symmetry group SE(3), the discrete-time translation group Z, and
the object permutation group Sn. EDGI follows the Diffuser framework (Janner et
al., 2022) in treating both learning a world model and planning in it as a
conditional generative modeling problem, training a diffusion model on an
offline trajectory dataset. We introduce a new SE(3)xZxSn-equivariant diffusion
model that supports multiple representations. We integrate this model in a
planning loop, where conditioning and classifier guidance let us softly break
the symmetry for specific tasks as needed. On object manipulation and
navigation tasks, EDGI is substantially more sample efficient and generalizes
better across the symmetry group than non-equivariant models.
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
March 22, 2023
Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang, Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu, Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion
architecture for eXtremely Long video generation. Most current work generates
long videos segment by segment sequentially, which normally leads to the gap
between training on short videos and inferring long videos, and the sequential
generation is inefficient. Instead, our approach adopts a "coarse-to-fine"
process, in which the video can be generated in parallel at the same
granularity. A global diffusion model is applied to generate the keyframes
across the entire time range, and then local diffusion models recursively fill
in the content between nearby frames. This simple yet effective strategy allows
us to directly train on long videos (3376 frames) to reduce the
training-inference gap, and makes it possible to generate all segments in
parallel. To evaluate our model, we build FlintstonesHD dataset, a new
benchmark for long video generation. Experiments show that our model not only
generates high-quality long videos with both global and local coherence, but
also decreases the average inference time from 7.55 min to 26 s (by 94.26%) at
the same hardware setting when generating 1024 frames. The homepage link is
https://msra-nuwa.azurewebsites.net/
LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation
March 22, 2023
Koutilya Pnvr, Bharat Singh, Pallabi Ghosh, Behjat Siddiquie, David Jacobs
Large-scale pre-training tasks like image classification, captioning, or
self-supervised techniques do not incentivize learning the semantic boundaries
of objects. However, recent generative foundation models built using text-based
latent diffusion techniques may learn semantic boundaries. This is because they
have to synthesize intricate details about all objects in an image based on a
text description. Therefore, we present a technique for segmenting real and
AI-generated images using latent diffusion models (LDMs) trained on
internet-scale datasets. First, we show that the latent space of LDMs (z-space)
is a better input representation compared to other feature representations like
RGB images or CLIP encodings for text-based image segmentation. By training the
segmentation models on the latent z-space, which creates a compressed
representation across several domains like different forms of art, cartoons,
illustrations, and photographs, we are also able to bridge the domain gap
between real and AI-generated images. We show that the internal features of
LDMs contain rich semantic information and present a technique in the form of
LD-ZNet to further boost the performance of text-based segmentation. Overall,
we show up to 6% improvement over standard baselines for text-to-image
segmentation on natural images. For AI-generated imagery, we show close to 20%
improvement compared to state-of-the-art techniques. The project is available
at https://koutilya-pnvr.github.io/LD-ZNet/.
Distribution Aligned Diffusion and Prototype-guided network for Unsupervised Domain Adaptive Segmentation
March 22, 2023
Haipeng Zhou, Lei Zhu, Yuyin Zhou
The Diffusion Probabilistic Model (DPM) has emerged as a highly effective
generative model in the field of computer vision. Its intermediate latent
vectors offer rich semantic information, making it an attractive option for
various downstream tasks such as segmentation and detection. In order to
explore its potential further, we have taken a step forward and considered a
more complex scenario in the medical image domain, specifically, under an
unsupervised adaptation condition. To this end, we propose a Diffusion-based
and Prototype-guided network (DP-Net) for unsupervised domain adaptive
segmentation. Concretely, our DP-Net consists of two stages: 1) Distribution
Aligned Diffusion (DADiff), which involves training a domain discriminator to
minimize the difference between the intermediate features generated by the DPM,
thereby aligning the inter-domain distribution; and 2) Prototype-guided
Consistency Learning (PCL), which utilizes feature centroids as prototypes and
applies a prototype-guided loss to ensure that the segmentor learns consistent
content from both source and target domains. Our approach is evaluated on
fundus datasets through a series of experiments, which demonstrate that the
performance of the proposed method is reliable and outperforms state-of-the-art
methods. Our work presents a promising direction for using DPM in complex
medical image scenarios, opening up new possibilities for further research in
medical imaging.
Compositional 3D Scene Generation using Locally Conditioned Diffusion
March 21, 2023
Ryan Po, Gordon Wetzstein
Designing complex 3D scenes has been a tedious, manual process requiring
domain expertise. Emerging text-to-3D generative models show great promise for
making this task more intuitive, but existing approaches are limited to
object-level generation. We introduce locally conditioned diffusion as
an approach to compositional scene diffusion, providing control over semantic
parts using text prompts and bounding boxes while ensuring seamless transitions
between these parts. We demonstrate a score distillation sampling–based
text-to-3D synthesis pipeline that enables compositional 3D scene generation at
a higher fidelity than relevant baselines.
Affordance Diffusion: Synthesizing Hand-Object Interactions
March 21, 2023
Yufei Ye, Xueting Li, Abhinav Gupta, Shalini De Mello, Stan Birchfield, Jiaming Song, Shubham Tulsiani, Sifei Liu
Recent successes in image synthesis are powered by large-scale diffusion
models. However, most methods are currently limited to either text- or
image-conditioned generation for synthesizing an entire image, texture transfer
or inserting objects into a user-specified region. In contrast, in this work we
focus on synthesizing complex interactions (i.e., an articulated hand) with a
given object. Given an RGB image of an object, we aim to hallucinate plausible
images of a human hand interacting with it. We propose a two-step generative
approach: a LayoutNet that samples an articulation-agnostic
hand-object-interaction layout, and a ContentNet that synthesizes images of a
hand grasping the object given the predicted layout. Both are built on top of a
large-scale pretrained diffusion model to make use of its latent
representation. Compared to baselines, the proposed method is shown to
generalize better to novel objects and perform surprisingly well on
out-of-distribution in-the-wild scenes of portable-sized objects. The resulting
system allows us to predict descriptive affordance information, such as hand
articulation and approaching orientation. Project page:
https://judyye.github.io/affordiffusion-www
Semantic Latent Space Regression of Diffusion Autoencoders for Vertebral Fracture Grading
March 21, 2023
Matthias Keicher, Matan Atad, David Schinz, Alexandra S. Gersing, Sarah C. Foreman, Sophia S. Goller, Juergen Weissinger, Jon Rischewski, Anna-Sophia Dietrich, Benedikt Wiestler, Jan S. Kirschke, Nassir Navab
Vertebral fractures are a consequence of osteoporosis, with significant
health implications for affected patients. Unfortunately, grading their
severity using CT exams is hard and subjective, motivating automated grading
methods. However, current approaches are hindered by imbalance and scarcity of
data and a lack of interpretability. To address these challenges, this paper
proposes a novel approach that leverages unlabelled data to train a generative
Diffusion Autoencoder (DAE) model as an unsupervised feature extractor. We
model fracture grading as a continuous regression, which is more reflective of
the smooth progression of fractures. Specifically, we use a binary, supervised
fracture classifier to construct a hyperplane in the DAE’s latent space. We
then regress the severity of the fracture as a function of the distance to this
hyperplane, calibrating the results to the Genant scale. Importantly, the
generative nature of our method allows us to visualize different grades of a
given vertebra, providing interpretability and insight into the features that
contribute to automated grading.
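A compact sketch of the latent-space regression step described above: a binary fracture classifier defines a hyperplane, and the signed distance of a latent to that hyperplane serves as a continuous severity score (to be calibrated to the Genant scale). scikit-learn's logistic regression is used here purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def severity_from_latents(latents, binary_labels, new_latents):
    """Sketch: signed distance to a classifier hyperplane as a severity score.

    latents: (N, D) DAE latent vectors with binary fracture labels;
    new_latents: (M, D) latents to grade. Inputs are assumed pre-extracted.
    """
    clf = LogisticRegression(max_iter=1000).fit(latents, binary_labels)
    w, b = clf.coef_[0], clf.intercept_[0]
    return (new_latents @ w + b) / np.linalg.norm(w)   # signed distance per sample
```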
CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion
March 21, 2023
Geonmo Gu, Sanghyuk Chun, Wonjae Kim, HeeJae Jun, Yoohoon Kang, Sangdoo Yun
This paper proposes a novel diffusion-based model, CompoDiff, for solving
Composed Image Retrieval (CIR) with latent diffusion and presents a newly
created dataset, named SynthTriplets18M, of 18 million reference images,
conditions, and corresponding target image triplets to train the model.
CompoDiff and SynthTriplets18M tackle the shortages of the previous CIR
approaches, such as poor generalizability due to the small dataset scale and
the limited types of conditions. CompoDiff not only achieves a new zero-shot
state-of-the-art on four CIR benchmarks, including FashionIQ, CIRR, CIRCO, and
GeneCIS, but also enables more versatile and controllable CIR by accepting
various conditions, such as negative text and image mask conditions, and by
allowing control over the relative importance of multiple queries and the
trade-off between inference speed and performance, which are unavailable in
existing CIR methods. The code and dataset are available at
https://github.com/navervision/CompoDiff
LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models
March 21, 2023
Junyi Zhang, Jiaqi Guo, Shizhao Sun, Jian-Guang Lou, Dongmei Zhang
Creating graphic layouts is a fundamental step in graphic designs. In this
work, we present a novel generative model named LayoutDiffusion for automatic
layout generation. As layout is typically represented as a sequence of discrete
tokens, LayoutDiffusion models layout generation as a discrete denoising
diffusion process. It learns to reverse a mild forward process, in which
layouts become increasingly chaotic with the growth of forward steps and
layouts in the neighboring steps do not differ too much. Designing such a mild
forward process is however very challenging as layout has both categorical
attributes and ordinal attributes. To tackle the challenge, we summarize three
critical factors for achieving a mild forward process for the layout, i.e.,
legality, coordinate proximity and type disruption. Based on the factors, we
propose a block-wise transition matrix coupled with a piece-wise linear noise
schedule. Experiments on RICO and PubLayNet datasets show that LayoutDiffusion
outperforms state-of-the-art approaches significantly. Moreover, it enables two
conditional layout generation tasks in a plug-and-play manner without
re-training and achieves better performance than existing methods.
Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation
March 21, 2023
Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Zhao Wang, Kai Han, Shanshe Wang, Siwei Ma, Wen Gao
In this paper, a novel Diffusion-based 3D Pose estimation (D3DP) method with
Joint-wise reProjection-based Multi-hypothesis Aggregation (JPMA) is proposed
for probabilistic 3D human pose estimation. On the one hand, D3DP generates
multiple possible 3D pose hypotheses for a single 2D observation. It gradually
diffuses the ground truth 3D poses to a random distribution, and learns a
denoiser conditioned on 2D keypoints to recover the uncontaminated 3D poses.
The proposed D3DP is compatible with existing 3D pose estimators and allows
users to balance efficiency and accuracy during inference through two
customizable parameters. On the other hand, JPMA is proposed to assemble
multiple hypotheses generated by D3DP into a single 3D pose for practical use.
It reprojects 3D pose hypotheses to the 2D camera plane, selects the best
hypothesis joint-by-joint based on the reprojection errors, and combines the
selected joints into the final pose. The proposed JPMA conducts aggregation at
the joint level and makes use of the 2D prior information, both of which have
been overlooked by previous approaches. Extensive experiments on Human3.6M and
MPI-INF-3DHP datasets show that our method outperforms the state-of-the-art
deterministic and probabilistic approaches by 1.5% and 8.9%, respectively. Code
is available at https://github.com/paTRICK-swk/D3DP.
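The joint-wise reprojection-based aggregation (JPMA) can be sketched as below; `project` stands in for the camera projection and the array shapes are assumptions for illustration.

```python
import numpy as np

def jpma(hypotheses_3d, keypoints_2d, project):
    """Sketch of joint-wise reprojection-based multi-hypothesis aggregation.

    hypotheses_3d: (H, J, 3) candidate 3D poses; keypoints_2d: (J, 2) detections;
    project: callable mapping a (J, 3) pose to (J, 2) image coordinates.
    For every joint, keep the hypothesis whose reprojection is closest to the
    2D detection, then assemble the selected joints into one final pose.
    """
    reproj = np.stack([project(h) for h in hypotheses_3d])          # (H, J, 2)
    err = np.linalg.norm(reproj - keypoints_2d[None], axis=-1)      # (H, J)
    best = err.argmin(axis=0)                                       # best hypothesis per joint
    return hypotheses_3d[best, np.arange(hypotheses_3d.shape[1])]   # (J, 3) final pose
```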
NASDM: Nuclei-Aware Semantic Histopathology Image Generation Using Diffusion Models
March 20, 2023
Aman Shrivastava, P. Thomas Fletcher
In recent years, computational pathology has seen tremendous progress driven
by deep learning methods in segmentation and classification tasks aiding
prognostic and diagnostic settings. Nuclei segmentation, for instance, is an
important task for diagnosing different cancers. However, training deep
learning models for nuclei segmentation requires large amounts of annotated
data, which is expensive to collect and label. This necessitates explorations
into generative modeling of histopathological images. In this work, we use
recent advances in conditional diffusion modeling to formulate a
first-of-its-kind nuclei-aware semantic tissue generation framework (NASDM)
which can synthesize realistic tissue samples given a semantic instance mask of
up to six different nuclei types, enabling pixel-perfect nuclei localization in
generated samples. These synthetic images are useful in applications in
pathology pedagogy, validation of models, and supplementation of existing
nuclei segmentation datasets. We demonstrate that NASDM is able to synthesize
histopathology images of the colon with superior quality and semantic
controllability compared to existing generative methods.
Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration
March 20, 2023
Mauricio Delbracio, Peyman Milanfar
Inversion by Direct Iteration (InDI) is a new formulation for supervised
image restoration that avoids the so-called "regression to the mean" effect
and produces more realistic and detailed images than existing regression-based
methods. It does this by gradually improving image quality in small steps,
similar to generative denoising diffusion models. Image restoration is an
ill-posed problem where multiple high-quality images are plausible
reconstructions of a given low-quality input. Therefore, the output of a
single-step regression model is typically an aggregate of all possible
explanations, and thus lacks detail and realism. The main advantage of InDI
is that it does not try to predict the clean target image in a single step but
instead gradually improves the image in small steps, resulting in better
perceptual quality. While generative denoising diffusion models also work in
small steps, our formulation is distinct in that it does not require knowledge
of any analytic form of the degradation process. Instead, we directly learn an
iterative restoration process from low-quality and high-quality paired
examples. InDI can be applied to virtually any image degradation, given paired
training data. In conditional denoising diffusion image restoration the
denoising network generates the restored image by repeatedly denoising an
initial image of pure noise, conditioned on the degraded input. Contrary to
conditional denoising formulations, InDI directly proceeds by iteratively
restoring the input low-quality image, producing high-quality results on a
variety of image restoration tasks, including motion and out-of-focus
deblurring, super-resolution, compression artifact removal, and denoising.
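A minimal sketch of the small-step restoration idea, assuming a learned `restorer(x, t)` that predicts the clean image at intermediate degradation level t; the uniform schedule is a simplification for illustration.

```python
import torch

@torch.no_grad()
def indi_sample(y, restorer, steps=20):
    """Sketch of iterative restoration in small steps (InDI-style).

    Rather than predicting the clean image in one shot, the current estimate is
    moved a small fraction of the way toward the network's prediction at each step.
    """
    x = y.clone()                          # start from the low-quality input itself
    for i in range(steps, 0, -1):
        t = i / steps
        x_hat = restorer(x, t)             # current guess of the clean image
        x = x + (x_hat - x) / i            # take only a small step toward it
    return x
```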
Text2Tex: Text-driven Texture Synthesis via Diffusion Models
March 20, 2023
Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, Sergey Tulyakov, Matthias Nießner
We present Text2Tex, a novel method for generating high-quality textures for
3D meshes from the given text prompts. Our method incorporates inpainting into
a pre-trained depth-aware image diffusion model to progressively synthesize
high resolution partial textures from multiple viewpoints. To avoid
accumulating inconsistent and stretched artifacts across views, we dynamically
segment the rendered view into a generation mask, which represents the
generation status of each visible texel. This partitioned view representation
guides the depth-aware inpainting model to generate and update partial textures
for the corresponding regions. Furthermore, we propose an automatic view
sequence generation scheme to determine the next best view for updating the
partial texture. Extensive experiments demonstrate that our method
significantly outperforms the existing text-driven approaches and GAN-based
methods.
SVDiff: Compact Parameter Space for Diffusion Fine-Tuning
March 20, 2023
Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, Feng Yang
Diffusion models have achieved remarkable success in text-to-image
generation, enabling the creation of high-quality images from text prompts or
other modalities. However, existing methods for customizing these models are
limited in handling multiple personalized subjects and risk overfitting.
Moreover, their large number of parameters is inefficient for model storage. In
this paper, we propose a novel approach to address these limitations in
existing text-to-image diffusion models for personalization. Our method
involves fine-tuning the singular values of the weight matrices, leading to a
compact and efficient parameter space that reduces the risk of overfitting and
language drifting. We also propose a Cut-Mix-Unmix data-augmentation technique
to enhance the quality of multi-subject image generation and a simple
text-based image editing framework. Our proposed SVDiff method has a
significantly smaller model size compared to existing methods (approximately
2,200 times fewer parameters compared with vanilla DreamBooth), making it more
practical for real-world applications.
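The core parameter-space idea, fine-tuning only shifts of the singular values of a frozen weight matrix, can be sketched with a toy linear layer as below; this is an illustrative stand-in, not the released SVDiff code.

```python
import torch
import torch.nn as nn

class SVDLinear(nn.Module):
    """Sketch: train only per-singular-value shifts of a frozen weight.

    The pretrained weight W = U diag(s) V^T is kept fixed; only the small vector
    `delta` is trained, giving a very compact parameter space.
    """
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        u, s, vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("u", u)
        self.register_buffer("s", s)
        self.register_buffer("vh", vh)
        self.delta = nn.Parameter(torch.zeros_like(s))   # the only trainable tensor

    def forward(self, x):
        s = torch.relu(self.s + self.delta)              # keep singular values non-negative
        weight = self.u @ torch.diag(s) @ self.vh        # reassembled weight
        return x @ weight.t()
```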
Cascaded Latent Diffusion Models for High-Resolution Chest X-ray Synthesis
March 20, 2023
Tobias Weber, Michael Ingrisch, Bernd Bischl, David Rügamer
While recent advances in large-scale foundational models show promising
results, their application to the medical domain has not yet been explored in
detail. In this paper, we progress into the realms of large-scale modeling in
medical synthesis by proposing Cheff - a foundational cascaded latent diffusion
model, which generates highly-realistic chest radiographs providing
state-of-the-art quality on a 1-megapixel scale. We further propose MaCheX,
which is a unified interface for public chest datasets and forms the largest
open collection of chest X-rays up to date. With Cheff conditioned on
radiological reports, we further guide the synthesis process over text prompts
and unveil the research area of report-to-chest-X-ray generation.
AnimeDiffusion: Anime Face Line Drawing Colorization via Diffusion Models
March 20, 2023
Yu Cao, Xiangqiao Meng, P. Y. Mok, Xueting Liu, Tong-Yee Lee, Ping Li
Manually colorizing anime line drawing images is time-consuming and tedious
work, yet it is an essential stage in the cartoon animation creation
pipeline. Reference-based line drawing colorization is a challenging task that
relies on the precise cross-domain long-range dependency modelling between the
line drawing and reference image. Existing learning methods still utilize
generative adversarial networks (GANs) as one key module of their model
architecture. In this paper, we propose a novel method called AnimeDiffusion
using diffusion models that performs anime face line drawing colorization
automatically. To the best of our knowledge, this is the first diffusion model
tailored for anime content creation. To reduce the heavy training cost of
diffusion models, we design a hybrid training strategy: first pre-training a
diffusion model with classifier-free guidance and then fine-tuning it with
image reconstruction guidance. We find that with a few iterations of
fine-tuning, the model shows strong colorization performance, as illustrated in
Fig. 1. For training AnimeDiffusion, we construct an anime face line drawing
colorization benchmark dataset, which contains 31,696 training images and 579
test images. We hope this dataset fills the gap left by the lack of a
high-resolution anime face dataset for evaluating colorization methods. Through
multiple quantitative metrics evaluated on our dataset and a user study, we
demonstrate AnimeDiffusion outperforms state-of-the-art GANs-based models for
anime face line drawing colorization. We also collaborate with professional
artists to test and apply our AnimeDiffusion for their creation work. We
release our code on https://github.com/xq-meng/AnimeDiffusion.
Positional Diffusion: Ordering Unordered Sets with Diffusion Probabilistic Models
March 20, 2023
Francesco Giuliari, Gianluca Scarpellini, Stuart James, Yiming Wang, Alessio Del Bue
Positional reasoning is the process of ordering unsorted parts contained in a
set into a consistent structure. We present Positional Diffusion, a
plug-and-play graph formulation with Diffusion Probabilistic Models to address
positional reasoning. We use the forward process to map elements’ positions in
a set to random positions in a continuous space. Positional Diffusion learns to
reverse the noising process and recover the original positions through an
Attention-based Graph Neural Network. We conduct extensive experiments with
benchmark datasets including two puzzle datasets, three sentence ordering
datasets, and one visual storytelling dataset, demonstrating that our method
outperforms long-standing research on puzzle solving by up to +18% compared to
the second-best deep learning method, and performs on par against the
state-of-the-art methods on sentence ordering and visual storytelling. Our work
highlights the suitability of diffusion models for ordering problems and
proposes a novel formulation and method for solving various ordering tasks.
Project website at https://iit-pavis.github.io/Positional_Diffusion/
Leapfrog Diffusion Model for Stochastic Trajectory Prediction
March 20, 2023
Weibo Mao, Chenxin Xu, Qi Zhu, Siheng Chen, Yanfeng Wang
To model the indeterminacy of human behaviors, stochastic trajectory
prediction requires a sophisticated multi-modal distribution of future
trajectories. Emerging diffusion models have revealed their tremendous
representation capacities in numerous generation tasks, showing potential for
stochastic trajectory prediction. However, their high computational cost
prevents diffusion models from real-time prediction, since a large number of
denoising steps are required to ensure sufficient representation ability. To resolve this
dilemma, we present LEapfrog Diffusion model (LED), a novel diffusion-based
trajectory prediction model, which provides real-time, precise, and diverse
predictions. The core of the proposed LED is to leverage a trainable leapfrog
initializer to directly learn an expressive multi-modal distribution of future
trajectories, which skips a large number of denoising steps, significantly
accelerating inference speed. Moreover, the leapfrog initializer is trained to
appropriately allocate correlated samples to provide a diversity of predicted
future trajectories, significantly improving prediction performances. Extensive
experiments on four real-world datasets, including NBA/NFL/SDD/ETH-UCY, show
that LED consistently improves performance and achieves 23.7%/21.9% ADE/FDE
improvement on NFL. The proposed LED also speeds up the inference
19.3/30.8/24.3/25.1 times compared to the standard diffusion model on
NBA/NFL/SDD/ETH-UCY, satisfying real-time inference needs. Code is available at
https://github.com/MediaBrain-SJTU/LED.
Object-Centric Slot Diffusion
March 20, 2023
Jindong Jiang, Fei Deng, Gautam Singh, Sungjin Ahn
The recent success of transformer-based image generative models in
object-centric learning highlights the importance of powerful image generators
for handling complex scenes. However, despite the high expressiveness of
diffusion models in image generation, their integration into object-centric
learning remains largely unexplored. In this paper, we explore
the feasibility and potential of integrating diffusion models into
object-centric learning and investigate the pros and cons of this approach. We
introduce Latent Slot Diffusion (LSD), a novel model that serves dual purposes:
it is the first object-centric learning model to replace conventional slot
decoders with a latent diffusion model conditioned on object slots, and it is
also the first unsupervised compositional conditional diffusion model that
operates without the need for supervised annotations like text. Through
experiments on various object-centric tasks, including the first application of
the FFHQ dataset in this field, we demonstrate that LSD significantly
outperforms state-of-the-art transformer-based decoders, particularly in more
complex scenes, and exhibits superior unsupervised compositional generation
quality. In addition, we conduct a preliminary investigation into the
integration of pre-trained diffusion models in LSD and demonstrate its
effectiveness in real-world image segmentation and generation. Project page is
available at https://latentslotdiffusion.github.io
DiffMIC: Dual-Guidance Diffusion Network for Medical Image Classification
March 19, 2023
Yijun Yang, Huazhu Fu, Angelica I. Aviles-Rivero, Carola-Bibiane Schönlieb, Lei Zhu
Diffusion Probabilistic Models have recently shown remarkable performance in
generative image modeling, attracting significant attention in the computer
vision community. However, while a substantial amount of diffusion-based
research has focused on generative tasks, few studies have applied diffusion
models to general medical image classification. In this paper, we propose the
first diffusion-based model (named DiffMIC) to address general medical image
classification by eliminating unexpected noise and perturbations in medical
images and robustly capturing semantic representation. To achieve this goal, we
devise a dual conditional guidance strategy that conditions each diffusion step
with multiple granularities to improve step-wise regional attention.
Furthermore, we propose learning the mutual information in each granularity by
enforcing Maximum-Mean Discrepancy regularization during the diffusion forward
process. We evaluate the effectiveness of our DiffMIC on three medical
classification tasks with different image modalities, including placental
maturity grading on ultrasound images, skin lesion classification using
dermatoscopic images, and diabetic retinopathy grading using fundus images. Our
experimental results demonstrate that DiffMIC outperforms state-of-the-art
methods by a significant margin, indicating the universality and effectiveness
of the proposed model. Our code will be publicly available at
https://github.com/scott-yjyang/DiffMIC.
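A small sketch of a Maximum-Mean-Discrepancy penalty of the kind the abstract mentions, with an RBF kernel; the bandwidth and the features it is applied to are assumptions for illustration.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Sketch of a squared-MMD penalty with an RBF kernel.

    x, y: (N, D) and (M, D) batches of features whose distributions should be
    pulled together (e.g. features from two guidance granularities).
    """
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```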
Diff-UNet: A Diffusion Embedded Network for Volumetric Segmentation
March 18, 2023
Zhaohu Xing, Liang Wan, Huazhu Fu, Guang Yang, Lei Zhu
In recent years, Denoising Diffusion Models have demonstrated remarkable
success in generating semantically valuable pixel-wise representations for
image generative modeling. In this study, we propose a novel end-to-end
framework, called Diff-UNet, for medical volumetric segmentation. Our approach
integrates the diffusion model into a standard U-shaped architecture to extract
semantic information from the input volume effectively, resulting in excellent
pixel-level representations for medical volumetric segmentation. To enhance the
robustness of the diffusion model’s prediction results, we also introduce a
Step-Uncertainty based Fusion (SUF) module during inference to combine the
outputs of the diffusion models at each step. We evaluate our method on three
datasets, including multimodal brain tumors in MRI, liver tumors, and
multi-organ CT volumes, and demonstrate that Diff-UNet outperforms other
state-of-the-art methods significantly. Our experimental results also indicate
the universality and effectiveness of the proposed model. The proposed
framework has the potential to facilitate the accurate diagnosis and treatment
of medical conditions by enabling more precise segmentation of anatomical
structures. The codes of Diff-UNet are available at
https://github.com/ge-xing/Diff-UNet
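The step-wise fusion idea can be sketched as an uncertainty-weighted average of per-step predictions; the entropy-based weighting below is an illustrative choice, and the paper's SUF module may differ.

```python
import torch

def fuse_step_predictions(step_logits):
    """Sketch of uncertainty-weighted fusion of per-step segmentation outputs.

    step_logits: (T, C, H, W[, D]) logits collected over T reverse-diffusion
    steps. Each step is weighted by the inverse of its mean predictive entropy,
    so more confident steps contribute more to the fused probability map.
    """
    probs = step_logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)    # (T, H, W[, D])
    weights = 1.0 / (entropy.flatten(1).mean(dim=1) + 1e-8)        # one weight per step
    weights = weights / weights.sum()
    shape = (-1,) + (1,) * (step_logits.dim() - 1)
    return (weights.view(shape) * probs).sum(dim=0)                # fused probabilities
```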
A Recipe for Watermarking Diffusion Models
March 17, 2023
Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Ngai-Man Cheung, Min Lin
Diffusion models (DMs) have demonstrated advantageous potential on generative
tasks. Widespread interest exists in incorporating DMs into downstream
applications, such as producing or editing photorealistic images. However,
practical deployment and unprecedented power of DMs raise legal issues,
including copyright protection and monitoring of generated content. In this
regard, watermarking has been a proven solution for copyright protection and
content monitoring, but it is underexplored in the DMs literature.
Specifically, DMs generate samples from longer tracks and may have newly
designed multimodal structures, necessitating the modification of conventional
watermarking pipelines. To this end, we conduct comprehensive analyses and
derive a recipe for efficiently watermarking state-of-the-art DMs (e.g., Stable
Diffusion), via training from scratch or finetuning. Our recipe is
straightforward but involves empirically ablated implementation details,
providing a foundation for future research on watermarking DMs. The code is
available at https://github.com/yunqing-me/WatermarkDM.
DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
March 17, 2023
Peng Jin, Hao Li, Zesen Cheng, Kehan Li, Xiangyang Ji, Chang Liu, Li Yuan, Jie Chen
Existing text-video retrieval solutions are, in essence, discriminative models
focused on maximizing the conditional likelihood, i.e., p(candidates|query).
While straightforward, this de facto paradigm overlooks the underlying data
distribution p(query), which makes it challenging to identify
out-of-distribution data. To address this limitation, we creatively tackle this
task from a generative viewpoint and model the correlation between the text and
the video as their joint probability p(candidates,query). This is accomplished
through a diffusion-based text-video retrieval framework (DiffusionRet), which
models the retrieval task as a process of gradually generating joint
distribution from noise. During training, DiffusionRet is optimized from both
the generation and discrimination perspectives, with the generator being
optimized by generation loss and the feature extractor trained with contrastive
loss. In this way, DiffusionRet cleverly leverages the strengths of both
generative and discriminative methods. Extensive experiments on five commonly
used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD,
ActivityNet Captions, and DiDeMo, with superior performances, justify the
efficacy of our method. More encouragingly, without any modification,
DiffusionRet even performs well in out-domain retrieval settings. We believe
this work brings fundamental insights into the related fields. Code is
available at https://github.com/jpthu17/DiffusionRet.
FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model
March 17, 2023
Jiwen Yu, Yinhuai Wang, Chen Zhao, Bernard Ghanem, Jian Zhang
Recently, conditional diffusion models have gained popularity in numerous
applications due to their exceptional generation ability. However, many
existing methods require training: they need to train a time-dependent
classifier or a condition-dependent score estimator, which increases the cost
of constructing conditional diffusion models and is inconvenient to transfer
across different conditions. Some current works aim to overcome this limitation
by proposing training-free solutions, but most can only be applied to a
specific category of tasks and not to more general conditions. In this work, we
propose a training-Free conditional Diffusion Model (FreeDoM) used for various
conditions. Specifically, we leverage off-the-shelf pre-trained networks, such
as a face detection model, to construct time-independent energy functions,
which guide the generation process without requiring training. Furthermore,
because the construction of the energy function is very flexible and adaptable
to various conditions, our proposed FreeDoM has a broader range of applications
than existing training-free methods. FreeDoM is advantageous in its simplicity,
effectiveness, and low cost. Experiments demonstrate that FreeDoM is effective
for various conditions and suitable for diffusion models of diverse data
domains, including image and latent code domains.
DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery
March 17, 2023
Chaofan Ma, Yuhuan Yang, Chen Ju, Fei Zhang, Jinxiang Liu, Yu Wang, Ya Zhang, Yanfeng Wang
Learning from a large corpus of data, pre-trained models have achieved
impressive progress nowadays. As popular generative pre-training, diffusion
models capture both low-level visual knowledge and high-level semantic
relations. In this paper, we propose to exploit such knowledgeable diffusion
models for mainstream discriminative tasks, i.e., unsupervised object
discovery: saliency segmentation and object localization. However, challenges
exist, as there is a structural difference between generative and
discriminative models, which limits the direct use. Besides, the lack of
explicitly labeled data significantly limits performance in unsupervised
settings. To tackle these issues, we introduce DiffusionSeg, a novel
synthesis-exploitation framework with a two-stage strategy. To alleviate
data insufficiency, we synthesize abundant images, and propose a novel
training-free AttentionCut to obtain masks in the first synthesis stage. In the
second exploitation stage, to bridge the structural gap, we use an inversion
technique to map the given image back to diffusion features. These features
can be directly used by downstream architectures. Extensive experiments and
ablation studies demonstrate the superiority of adapting diffusion for
unsupervised object discovery.
Denoising Diffusion Autoencoders are Unified Self-supervised Learners
March 17, 2023
Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang
Inspired by recent advances in diffusion models, which are reminiscent of
denoising autoencoders, we investigate whether they can acquire discriminative
representations for classification via generative pre-training. This paper
shows that the networks in diffusion models, namely denoising diffusion
autoencoders (DDAE), are unified self-supervised learners: by pre-training on
unconditional image generation, DDAE has already learned strongly linearly
separable representations within its intermediate layers without
auxiliary encoders, thus making diffusion pre-training emerge as a general
approach for generative-and-discriminative dual learning. To validate this, we
conduct linear probe and fine-tuning evaluations. Our diffusion-based approach
achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and
Tiny-ImageNet, respectively, and is comparable to contrastive learning and
masked autoencoders for the first time. Transfer learning from ImageNet also
confirms the suitability of DDAE for Vision Transformers, suggesting the
potential to scale DDAEs as unified foundation models. Code is available at
github.com/FutureXiang/ddae.
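A minimal sketch of the linear-probe evaluation described above, assuming a pretrained noise-prediction U-Net whose intermediate activation can be read with a forward hook; the layer name, probing timestep, and model call signature are assumptions, not the released code's API.

import torch
import torch.nn as nn

def ddae_features(unet, x, t, a_bar, layer_name="mid_block"):
    # Pool an intermediate activation of a frozen denoising network as a feature.
    feats = {}
    layer = dict(unet.named_modules())[layer_name]
    hook = layer.register_forward_hook(lambda m, i, o: feats.setdefault("h", o))
    with torch.no_grad():
        # lightly noise the input to the probing timestep, then run the denoiser
        x_t = a_bar[t].sqrt() * x + (1 - a_bar[t]).sqrt() * torch.randn_like(x)
        unet(x_t, t)                       # assumed call signature
    hook.remove()
    return feats["h"].mean(dim=(2, 3))     # global average pooling -> (B, C)

# Linear probe on the frozen features (sketch):
# probe = nn.Linear(feature_dim, num_classes)
# logits = probe(ddae_features(unet, images, t=probe_timestep, a_bar=a_bar))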
Diffusing the Optimal Topology: A Generative Optimization Approach
March 17, 2023
Giorgio Giannone, Faez Ahmed
Topology Optimization seeks to find the best design that satisfies a set of
constraints while maximizing system performance. Traditional iterative
optimization methods like SIMP can be computationally expensive and get stuck
in local minima, limiting their applicability to complex or large-scale
problems. Learning-based approaches have been developed to accelerate the
topology optimization process, but these methods can generate designs with
floating material and low performance when challenged with out-of-distribution
constraint configurations. Recently, deep generative models, such as Generative
Adversarial Networks and Diffusion Models, conditioned on constraints and
physics fields have shown promise, but they require extensive pre-processing
and surrogate models for improving performance. To address these issues, we
propose a Generative Optimization method that integrates classic optimization
like SIMP as a refining mechanism for the topology generated by a deep
generative model. We also remove the need for conditioning on physical fields
using a computationally inexpensive approximation inspired by classic ODE
solutions and reduce the number of steps needed to generate a feasible and
performant topology. Our method allows us to efficiently generate good
topologies and explicitly guide them to regions with high manufacturability and
high performance, without the need for external auxiliary models or additional
labeled data. We believe that our method can lead to significant advancements
in the design and optimization of structures in engineering applications, and
can be applied to a broader spectrum of performance-aware engineering design
problems.
SUD2: Supervision by Denoising Diffusion Models for Image Reconstruction
March 16, 2023
Matthew A. Chan, Sean I. Young, Christopher A. Metzler
Many imaging inverse problems, such as image-dependent in-painting and
dehazing, are challenging because their forward
models are unknown or depend on unknown latent parameters. While one can solve
such problems by training a neural network with vast quantities of paired
training data, such paired training data is often unavailable. In this paper,
we propose a generalized framework for training image reconstruction networks
when paired training data is scarce. In particular, we demonstrate the ability
of image denoising algorithms and, by extension, denoising diffusion models to
supervise network training in the absence of paired training data.
Denoising Diffusion Post-Processing for Low-Light Image Enhancement
March 16, 2023
Savvas Panagiotou, Anna S. Bosman
Low-light image enhancement (LLIE) techniques attempt to increase the
visibility of images captured in low-light scenarios. However, as a result of
enhancement, a variety of image degradations such as noise and color bias are
revealed. Furthermore, each particular LLIE approach may introduce a different
form of flaw within its enhanced results. To combat these image degradations,
post-processing denoisers have widely been used, which often yield oversmoothed
results lacking detail. We propose using a diffusion model as a post-processing
approach, and we introduce Low-light Post-processing Diffusion Model (LPDM) in
order to model the conditional distribution between under-exposed and
normally-exposed images. We apply LPDM in a manner which avoids the
computationally expensive generative reverse process of typical diffusion
models, and post-process images in one pass through LPDM. Extensive experiments
demonstrate that our approach outperforms competing post-processing denoisers
by increasing the perceptual quality of enhanced low-light images on a variety
of challenging low-light datasets. Source code is available at
https://github.com/savvaki/LPDM.
DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion
March 16, 2023
Maham Tanveer, Yizhi Wang, Ali Mahdavi-Amiri, Hao Zhang
We introduce a novel method to automatically generate artistic typography
by stylizing one or more letter fonts to visually convey the semantics of an
input word, while ensuring that the output remains readable. To address an
assortment of challenges with our task at hand including conflicting goals
(artistic stylization vs. legibility), lack of ground truth, and immense search
space, our approach utilizes large language models to bridge texts and visual
images for stylization and build an unsupervised generative model with a
diffusion model backbone. Specifically, we employ the denoising generator in
Latent Diffusion Model (LDM), with the key addition of a CNN-based
discriminator to adapt the input style onto the input text. The discriminator
uses rasterized images of a given letter/word font as real samples and output
of the denoising generator as fake samples. Our model is coined DS-Fusion for
discriminated and stylized diffusion. We showcase the quality and versatility
of our method through numerous examples, qualitative and quantitative
evaluation, as well as ablation studies. User studies comparing to strong
baselines including CLIPDraw and DALL-E 2, as well as artist-crafted
typographies, demonstrate strong performance of DS-Fusion.
Efficient Diffusion Training via Min-SNR Weighting Strategy
March 16, 2023
Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, Baining Guo
Denoising diffusion models have been a mainstream approach for image
generation, however, training these models often suffers from slow convergence.
In this paper, we discovered that the slow convergence is partly due to
conflicting optimization directions between timesteps. To address this issue,
we treat the diffusion training as a multi-task learning problem, and introduce
a simple yet effective approach referred to as Min-SNR-$\gamma$. This method
adapts loss weights of timesteps based on clamped signal-to-noise ratios, which
effectively balances the conflicts among timesteps. Our results demonstrate a
significant improvement in converging speed, 3.4$\times$ faster than previous
weighting strategies. It is also more effective, achieving a new record FID
score of 2.06 on the ImageNet $256\times256$ benchmark using smaller
architectures than those employed in previous state-of-the-art methods. The code is
available at https://github.com/TiankaiHang/Min-SNR-Diffusion-Training.
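A short sketch of the clamped-SNR loss weighting described above, where SNR(t) = a_bar_t / (1 - a_bar_t); the two branches for different prediction targets are illustrative assumptions rather than the paper's exact formulation.

import torch

def min_snr_weights(alphas_cumprod, gamma=5.0, pred_type="eps"):
    # Clamping the SNR at gamma down-weights the low-noise timesteps that
    # otherwise dominate training and conflict with the remaining timesteps.
    snr = alphas_cumprod / (1.0 - alphas_cumprod)
    clamped = torch.clamp(snr, max=gamma)
    if pred_type == "eps":    # noise-prediction parameterization
        return clamped / snr
    return clamped            # x0-prediction parameterization

# usage with per-sample timesteps t (sketch):
# loss = (min_snr_weights(a_bar)[t] * per_sample_mse).mean()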
Diffusion-HPC: Generating Synthetic Images with Realistic Humans
March 16, 2023
Zhenzhen Weng, Laura Bravo-Sánchez, Serena Yeung-Levy
Recent text-to-image generative models have exhibited remarkable abilities in
generating high-fidelity and photo-realistic images. However, despite the
visually impressive results, these models often struggle to preserve plausible
human structure in the generations. Due to this reason, while generative models
have shown promising results in aiding downstream image recognition tasks by
generating large volumes of synthetic data, they are not suitable for improving
downstream human pose perception and understanding. In this work, we propose a
Diffusion model with Human Pose Correction (Diffusion-HPC), a text-conditioned
method that generates photo-realistic images with plausible posed humans by
injecting prior knowledge about human body structure. Our generated images are
accompanied by 3D meshes that serve as ground truths for improving Human Mesh
Recovery tasks, where a shortage of 3D training data has long been an issue.
Furthermore, we show that Diffusion-HPC effectively improves the realism of
human generations under varying conditioning strategies.
DiffIR: Efficient Diffusion Model for Image Restoration
March 16, 2023
Bin Xia, Yulun Zhang, Shiyin Wang, Yitong Wang, Xinglong Wu, Yapeng Tian, Wenming Yang, Luc Van Gool
Diffusion model (DM) has achieved SOTA performance by modeling the image
synthesis process into a sequential application of a denoising network.
However, different from image synthesis, image restoration (IR) has a strong
constraint to generate results in accordance with ground-truth. Thus, for IR,
traditional DMs that run massive iterations on a large model to estimate whole
images or feature maps are inefficient. To address this issue, we propose an
efficient DM for IR (DiffIR), which consists of a compact IR prior extraction
network (CPEN), dynamic IR transformer (DIRformer), and denoising network.
Specifically, DiffIR has two training stages: pretraining and training DM. In
pretraining, we input ground-truth images into CPEN$_{S1}$ to capture a compact
IR prior representation (IPR) to guide DIRformer. In the second stage, we train
the DM to directly estimate the same IPR as the pretrained CPEN$_{S1}$ using only
LQ images. We observe that since the IPR is only a compact vector, DiffIR can
use fewer iterations than traditional DM to obtain accurate estimations and
generate more stable and realistic results. Since the iterations are few, our
DiffIR can adopt a joint optimization of CPEN$_{S2}$, DIRformer, and denoising
network, which can further reduce the estimation error influence. We conduct
extensive experiments on several IR tasks and achieve SOTA performance while
consuming less computational costs. Code is available at
\url{https://github.com/Zj-BinXia/DiffIR}.
DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars
March 16, 2023
David Svitov, Dmitrii Gudkov, Renat Bashirov, Victor Lempitsky
We present DINAR, an approach for creating realistic rigged fullbody avatars
from single RGB images. Similarly to previous works, our method uses neural
textures combined with the SMPL-X body model to achieve photo-realistic quality
of avatars while keeping them easy to animate and fast to infer. To restore the
texture, we use a latent diffusion model and show how such a model can be trained
in the neural texture space. The use of the diffusion model allows us to
realistically reconstruct large unseen regions such as the back of a person
given the frontal view. The models in our pipeline are trained using 2D images
and videos only. In the experiments, our approach achieves state-of-the-art
rendering quality and good generalization to new poses and viewpoints. In
particular, the approach improves state-of-the-art on the SnapshotPeople public
benchmark.
Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation
March 16, 2023
Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, Jiaying Liu
Language-guided image generation has achieved great success nowadays by using
diffusion models. However, text alone can be too coarse to describe
highly specific subjects such as a particular dog or a certain car, which makes
pure text-to-image generation insufficiently accurate to satisfy user requirements.
In this work, we present a novel Unified Multi-Modal Latent Diffusion
(UMM-Diffusion) which takes joint texts and images containing specified
subjects as input sequences and generates customized images with the subjects.
To be more specific, both input texts and images are encoded into one unified
multi-modal latent space, in which the input images are learned to be projected
to pseudo word embedding and can be further combined with text to guide image
generation. Besides, to eliminate the irrelevant parts of the input images such
as background or illumination, we propose a novel sampling technique of
diffusion models used by the image generator which fuses the results guided by
multi-modal input and pure text input. By leveraging the large-scale
pre-trained text-to-image generator and the designed image encoder, our method
is able to generate high-quality images with complex semantics from both
aspects of input texts and images.
DIRE for Diffusion-Generated Image Detection
March 16, 2023
Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, Houqiang Li
Diffusion models have shown remarkable success in visual synthesis, but have
also raised concerns about potential abuse for malicious purposes. In this
paper, we seek to build a detector for telling apart real images from
diffusion-generated images. We find that existing detectors struggle to detect
images generated by diffusion models, even if we include generated images from
a specific diffusion model in their training data. To address this issue, we
propose a novel image representation called DIffusion Reconstruction Error
(DIRE), which measures the error between an input image and its reconstruction
counterpart by a pre-trained diffusion model. We observe that
diffusion-generated images can be approximately reconstructed by a diffusion
model while real images cannot, which suggests that DIRE can serve as a
bridge to distinguish generated and real images. DIRE provides an effective way
to detect images generated by most diffusion models, and it is general for
detecting generated images from unseen diffusion models and robust to various
perturbations. Furthermore, we establish a comprehensive diffusion-generated
benchmark including images generated by eight diffusion models to evaluate the
performance of diffusion-generated image detectors. Extensive experiments on
our collected benchmark demonstrate that DIRE exhibits superiority over
previous generated-image detectors. The code and dataset are available at
https://github.com/ZhendongWang6/DIRE.
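A rough sketch of the reconstruction-error representation described above, assuming a deterministic DDIM inversion followed by DDIM reconstruction with the same pretrained noise predictor; the timestep schedule and model interface are assumptions, not the released implementation.

import torch

@torch.no_grad()
def dire(x, eps_model, a_bar, timesteps):
    # timesteps: increasing list of steps, e.g. [0, 50, 100, ...] (assumed schedule).
    def ddim(z, pairs):
        for t, t_next in pairs:
            eps = eps_model(z, t)
            x0 = (z - (1 - a_bar[t]).sqrt() * eps) / a_bar[t].sqrt()
            z = a_bar[t_next].sqrt() * x0 + (1 - a_bar[t_next]).sqrt() * eps
        return z
    fwd = list(zip(timesteps[:-1], timesteps[1:]))   # image -> noise (inversion)
    bwd = [(b, a) for a, b in reversed(fwd)]         # noise -> image (reconstruction)
    x_rec = ddim(ddim(x, fwd), bwd)
    # per-pixel reconstruction error, used as input to a binary real/fake detector
    return (x - x_rec).abs()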
Stochastic Segmentation with Conditional Categorical Diffusion Models
March 15, 2023
Lukas Zbinden, Lars Doorenbos, Theodoros Pissas, Adrian Thomas Huber, Raphael Sznitman, Pablo Márquez-Neila
Semantic segmentation has made significant progress in recent years thanks to
deep neural networks, but the common objective of generating a single
segmentation output that accurately matches the image’s content may not be
suitable for safety-critical domains such as medical diagnostics and autonomous
driving. Instead, multiple possible correct segmentation maps may be required
to reflect the true distribution of annotation maps. In this context,
stochastic semantic segmentation methods must learn to predict conditional
distributions of labels given the image, but this is challenging due to the
typically multimodal distributions, high-dimensional output spaces, and limited
annotation data. To address these challenges, we propose a conditional
categorical diffusion model (CCDM) for semantic segmentation based on Denoising
Diffusion Probabilistic Models. Our model is conditioned on the input image,
enabling it to generate multiple segmentation label maps that account for the
aleatoric uncertainty arising from divergent ground truth annotations. Our
experimental results show that CCDM achieves state-of-the-art performance on
LIDC, a stochastic semantic segmentation dataset, and outperforms established
baselines on the classical segmentation dataset Cityscapes.
Class-Guided Image-to-Image Diffusion: Cell Painting from Brightfield Images with Class Labels
March 15, 2023
Jan Oscar Cross-Zamirski, Praveen Anand, Guy Williams, Elizabeth Mouchet, Yinhai Wang, Carola-Bibiane Schönlieb
Image-to-image reconstruction problems with free or inexpensive metadata in
the form of class labels appear often in biological and medical image domains.
Existing text-guided or style-transfer image-to-image approaches do not
translate to datasets where additional information is provided as discrete
classes. We introduce and implement a model which combines image-to-image and
class-guided denoising diffusion probabilistic models. We train our model on a
real-world dataset of microscopy images used for drug discovery, with and
without incorporating metadata labels. By exploring the properties of
image-to-image diffusion with relevant labels, we show that class-guided
image-to-image diffusion can improve the meaningful content of the
reconstructed images and outperform the unguided model in useful downstream
tasks.
Stochastic Interpolants: A Unifying Framework for Flows and Diffusions
March 15, 2023
Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden
cs.LG, cond-mat.dis-nn, math.PR
A class of generative models that unifies flow-based and diffusion-based
methods is introduced. These models extend the framework proposed in Albergo &
Vanden-Eijnden (2023), enabling the use of a broad class of continuous-time
stochastic processes called "stochastic interpolants" to bridge any two
arbitrary probability density functions exactly in finite time. These
interpolants are built by combining data from the two prescribed densities with
an additional latent variable that shapes the bridge in a flexible way. The
time-dependent probability density function of the stochastic interpolant is
shown to satisfy a first-order transport equation as well as a family of
forward and backward Fokker-Planck equations with tunable diffusion
coefficient. Upon consideration of the time evolution of an individual sample,
this viewpoint immediately leads to both deterministic and stochastic
generative models based on probability flow equations or stochastic
differential equations with an adjustable level of noise. The drift
coefficients entering these models are time-dependent velocity fields
characterized as the unique minimizers of simple quadratic objective functions,
one of which is a new objective for the score of the interpolant density. We
show that minimization of these quadratic objectives leads to control of the
likelihood for generative models built upon stochastic dynamics, while
likelihood control for deterministic dynamics is more stringent. We also
discuss connections with other methods such as score-based diffusion models,
stochastic localization processes, probabilistic denoising techniques, and
rectifying flows. In addition, we demonstrate that stochastic interpolants
recover the Schrödinger bridge between the two target densities when
explicitly optimizing over the interpolant. Finally, algorithmic aspects are
discussed and the approach is illustrated on numerical examples.
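For orientation, a schematic form of the construction consistent with the abstract; the precise boundary conditions and objectives are given in the paper.

$x_t = \alpha(t)\,x_0 + \beta(t)\,x_1 + \gamma(t)\,z, \qquad z \sim \mathcal{N}(0, I),$
$\alpha(0) = \beta(1) = 1, \quad \alpha(1) = \beta(0) = 0, \quad \gamma(0) = \gamma(1) = 0.$

The time-dependent density $\rho_t$ of $x_t$ satisfies a transport equation with a velocity given by a conditional expectation,
$\partial_t \rho_t + \nabla \cdot (b_t \rho_t) = 0, \qquad b_t(x) = \mathbb{E}\big[\dot\alpha(t)\,x_0 + \dot\beta(t)\,x_1 + \dot\gamma(t)\,z \,\big|\, x_t = x\big],$
and $b_t$ can be characterized as the minimizer of a simple quadratic objective,
$\mathcal{L}[\hat b] = \int_0^1 \mathbb{E}\,\big\| \hat b(t, x_t) - \dot\alpha(t)\,x_0 - \dot\beta(t)\,x_1 - \dot\gamma(t)\,z \big\|^2 \, dt.$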
DiffusionAD: Denoising Diffusion for Anomaly Detection
March 15, 2023
Hui Zhang, Zheng Wang, Zuxuan Wu, Yu-Gang Jiang
Anomaly detection has garnered extensive applications in real industrial
manufacturing due to its remarkable effectiveness and efficiency. However,
previous generative-based models have been limited by suboptimal reconstruction
quality, hampering their overall performance. A fundamental enhancement lies in
our reformulation of the reconstruction process using a diffusion model into a
noise-to-norm paradigm. Here, anomalous regions are perturbed with Gaussian
noise and reconstructed as normal, overcoming the limitations of previous
models by facilitating anomaly-free restoration. Additionally, we propose a
rapid one-step denoising paradigm, significantly faster than the traditional
iterative denoising in diffusion models. Furthermore, the introduction of the
norm-guided paradigm elevates the accuracy and fidelity of reconstructions. The
segmentation sub-network predicts pixel-level anomaly scores using the input
image and its anomaly-free restoration. Comprehensive evaluations on four
standard and challenging benchmarks reveal that DiffusionAD outperforms current
state-of-the-art approaches, demonstrating the effectiveness and broad
applicability of the proposed pipeline.
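A minimal sketch of the one-step noise-to-norm restoration described above, assuming a standard noise-prediction model; the norm-guided paradigm and the segmentation sub-network itself are not reproduced here.

import torch

@torch.no_grad()
def one_step_restore(x, t, eps_model, a_bar):
    # Perturb the input to step t, predict the noise once, and jump directly to
    # the clean estimate, which serves as the anomaly-free restoration.
    a = a_bar[t]
    x_t = a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)
    eps = eps_model(x_t, t)
    return (x_t - (1 - a).sqrt() * eps) / a.sqrt()

# The segmentation sub-network then scores anomalies from the (input, restoration) pair:
# anomaly_map = seg_net(torch.cat([x, one_step_restore(x, t, eps_model, a_bar)], dim=1))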
ResDiff: Combining CNN and Diffusion Model for Image Super-Resolution
March 15, 2023
Shuyao Shang, Zhengyang Shan, Guangxing Liu, Jinglin Zhang
Adapting the Diffusion Probabilistic Model (DPM) for direct image
super-resolution is wasteful, given that a simple Convolutional Neural Network
(CNN) can recover the main low-frequency content. Therefore, we present
ResDiff, a novel Diffusion Probabilistic Model based on Residual structure for
Single Image Super-Resolution (SISR). ResDiff utilizes a combination of a CNN,
which restores primary low-frequency components, and a DPM, which predicts the
residual between the ground-truth image and the CNN-predicted image. In
contrast to the common diffusion-based methods that directly use LR images to
guide the noise towards HR space, ResDiff utilizes the CNN’s initial prediction
to direct the noise towards the residual space between HR space and
CNN-predicted space, which not only accelerates the generation process but also
acquires superior sample quality. Additionally, a frequency-domain-based loss
function for CNN is introduced to facilitate its restoration, and a
frequency-domain guided diffusion is designed for DPM on behalf of predicting
high-frequency details. The extensive experiments on multiple benchmark
datasets demonstrate that ResDiff outperforms previous diffusion-based methods
in terms of shorter model convergence time, superior generation quality, and
more diverse samples.
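A sketch of the residual formulation described above, assuming a pretrained low-frequency CNN and a conditional noise predictor; the frequency-domain loss and frequency-guided diffusion are omitted, and the eps_model conditioning interface is an assumption.

import torch
import torch.nn.functional as F

def resdiff_training_step(hr, lr, cnn, eps_model, a_bar):
    # The CNN restores the low-frequency content; the diffusion model only has to
    # learn the residual between the ground truth and the CNN prediction.
    with torch.no_grad():
        base = cnn(lr)                       # coarse prediction in HR space
    residual = hr - base                     # target of the diffusion model
    t = torch.randint(0, len(a_bar), (hr.shape[0],), device=hr.device)
    noise = torch.randn_like(residual)
    a = a_bar[t].view(-1, 1, 1, 1)
    noisy = a.sqrt() * residual + (1 - a).sqrt() * noise
    pred = eps_model(noisy, t, cond=base)    # condition on the CNN output (assumed API)
    return F.mse_loss(pred, noise)

# At inference, the super-resolved image is base + sampled_residual.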
Speech Signal Improvement Using Causal Generative Diffusion Models
March 15, 2023
Julius Richter, Simon Welker, Jean-Marie Lemercier, Bunlong Lay, Tal Peer, Timo Gerkmann
In this paper, we present a causal speech signal improvement system that is
designed to handle different types of distortions. The method is based on a
generative diffusion model which has been shown to work well in scenarios with
missing data and non-linear corruptions. To guarantee causal processing, we
modify the network architecture of our previous work and replace global
normalization with causal adaptive gain control. We generate diverse training
data containing a broad range of distortions. This work was performed in the
context of an “ICASSP Signal Processing Grand Challenge” and submitted to the
non-real-time track of the “Speech Signal Improvement Challenge 2023”, where it
was ranked fifth.
Generating symbolic music using diffusion models
March 15, 2023
Lilac Atassi
Denoising Diffusion Probabilistic models have emerged as simple yet very
powerful generative models. Unlike other generative models, diffusion models do
not suffer from mode collapse or require a discriminator to generate
high-quality samples. In this paper, a diffusion model that uses a binomial
prior distribution to generate piano rolls is proposed. The paper also proposes
an efficient method to train the model and generate samples. The generated
music has coherence at time scales up to the length of the training piano roll
segments. The paper demonstrates how this model is conditioned on the input and
can be used to harmonize a given melody, complete an incomplete piano roll, or
generate a variation of a given piece. The code is publicly shared to encourage
the use and development of the method by the community.
DiffBEV: Conditional Diffusion Model for Bird’s Eye View Perception
March 15, 2023
Jiayu Zou, Zheng Zhu, Yun Ye, Xingang Wang
BEV perception is of great importance in the field of autonomous driving,
serving as the cornerstone of planning, controlling, and motion prediction. The
quality of the BEV feature highly affects the performance of BEV perception.
However, taking the noises in camera parameters and LiDAR scans into
consideration, we usually obtain BEV representation with harmful noises.
Diffusion models naturally have the ability to denoise noisy samples to the
ideal data, which motivates us to utilize the diffusion model to get a better
BEV representation. In this work, we propose an end-to-end framework, named
DiffBEV, to exploit the potential of the diffusion model to generate a more
comprehensive BEV representation. To the best of our knowledge, we are the
first to apply a diffusion model to BEV perception. In practice, we design three
types of conditions to guide the training of the diffusion model which denoises
the coarse samples and refines the semantic feature in a progressive way.
What’s more, a cross-attention module is leveraged to fuse the context of BEV
feature and the semantic content of conditional diffusion model. DiffBEV
achieves a 25.9% mIoU on the nuScenes dataset, which is 6.2% higher than the
best-performing existing approach. Quantitative and qualitative results on
multiple benchmarks demonstrate the effectiveness of DiffBEV in BEV semantic
segmentation and 3D object detection tasks. The code will be available soon.
Decomposed Diffusion Models for High-Quality Video Generation
March 15, 2023
Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan
A diffusion probabilistic model (DPM), which constructs a forward diffusion
process by gradually adding noise to data points and learns the reverse
denoising process to generate new samples, has been shown to handle complex
data distributions. Despite its recent success in image synthesis, applying DPMs
to video generation is still challenging due to high-dimensional data spaces.
Previous methods usually adopt a standard diffusion process, where frames in
the same video clip are destroyed with independent noises, ignoring the content
redundancy and temporal correlation. This work presents a decomposed diffusion
process via resolving the per-frame noise into a base noise that is shared
among all frames and a residual noise that varies along the time axis. The
denoising pipeline employs two jointly-learned networks to match the noise
decomposition accordingly. Experiments on various datasets confirm that our
approach, termed VideoFusion, surpasses both GAN-based and diffusion-based
alternatives in high-quality video generation. We further show that our
decomposed formulation can benefit from pre-trained image diffusion models and
readily supports text-conditioned video creation.
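A small sketch of the shared-base/residual noise decomposition described above; the mixing coefficient lam is illustrative and is chosen so that each per-frame noise remains marginally standard Gaussian.

import torch

def decomposed_video_noise(batch, frames, shape, lam=0.5, device="cpu"):
    base = torch.randn(batch, 1, *shape, device=device)        # shared across all frames
    resid = torch.randn(batch, frames, *shape, device=device)  # varies along the time axis
    # lam * 1 + (1 - lam) * 1 = 1, so each frame's noise is still N(0, I) marginally
    return (lam ** 0.5) * base + ((1 - lam) ** 0.5) * resid

# Per the abstract, two jointly learned networks would predict the base and
# residual components of this noise separately during denoising.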
Diffusion Models for Contrast Harmonization of Magnetic Resonance Images
March 14, 2023
Alicia Durrer, Julia Wolleb, Florentin Bieder, Tim Sinnecker, Matthias Weigel, Robin Sandkühler, Cristina Granziera, Özgür Yaldizli, Philippe C. Cattin
Magnetic resonance (MR) images from multiple sources often show differences
in image contrast related to acquisition settings or the scanner type used. For
long-term studies, longitudinal comparability is essential but can be impaired
by these contrast differences, leading to biased results when using automated
evaluation tools. This study presents a diffusion model-based approach for
contrast harmonization. We use a data set consisting of scans of 18 Multiple
Sclerosis patients and 22 healthy controls. Each subject was scanned in two MR
scanners of different magnetic field strengths (1.5 T and 3 T), resulting in a
paired data set that shows scanner-inherent differences. We map images from the
source contrast to the target contrast for both directions, from 3 T to 1.5 T
and from 1.5 T to 3 T. As we only want to change the contrast, not the
anatomical information, our method uses the original image to guide the
image-to-image translation process by adding structural information. The aim is
that the mapped scans display increased comparability with scans of the target
contrast for downstream tasks. We evaluate this method for the task of
segmentation of cerebrospinal fluid, grey matter and white matter. Our method
achieves good and consistent results for both directions of the mapping.
LayoutDM: Discrete Diffusion Model for Controllable Layout Generation
March 14, 2023
Naoto Inoue, Kotaro Kikuchi, Edgar Simo-Serra, Mayu Otani, Kota Yamaguchi
Controllable layout generation aims at synthesizing plausible arrangement of
element bounding boxes with optional constraints, such as type or position of a
specific element. In this work, we try to solve a broad range of layout
generation tasks in a single model that is based on discrete state-space
diffusion models. Our model, named LayoutDM, naturally handles the structured
layout data in the discrete representation and learns to progressively infer a
noiseless layout from the initial input, where we model the layout corruption
process by modality-wise discrete diffusion. For conditional generation, we
propose to inject layout constraints in the form of masking or logit adjustment
during inference. We show in the experiments that our LayoutDM successfully
generates high-quality layouts and outperforms both task-specific and
task-agnostic baselines on several layout tasks.
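A sketch of the masking/logit-adjustment idea for conditional inference described above; the tensor layout (batch, elements, vocabulary) and argument names are assumptions rather than LayoutDM's actual interface.

import torch

def apply_layout_constraints(logits, fixed_tokens=None, soft_prior=None, weight=1.0):
    # fixed_tokens: LongTensor of shape (B, L), -1 for unconstrained positions and
    # otherwise the required token id (hard masking).
    # soft_prior: optional log-prior of shape (B, L, V) added to the logits
    # (logit adjustment). All names here are illustrative.
    if soft_prior is not None:
        logits = logits + weight * soft_prior
    if fixed_tokens is not None:
        constrained = fixed_tokens >= 0
        forced = torch.full_like(logits, float("-inf"))
        forced.scatter_(-1, fixed_tokens.clamp(min=0).unsqueeze(-1), 0.0)
        logits = torch.where(constrained.unsqueeze(-1), forced, logits)
    return logits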
MeshDiffusion: Score-based Generative 3D Mesh Modeling
March 14, 2023
Zhen Liu, Yao Feng, Michael J. Black, Derek Nowrouzezahrai, Liam Paull, Weiyang Liu
cs.GR, cs.AI, cs.CV, cs.LG
We consider the task of generating realistic 3D shapes, which is useful for a
variety of applications such as automatic scene generation and physical
simulation. Compared to other 3D representations like voxels and point clouds,
meshes are more desirable in practice, because (1) they enable easy and
arbitrary manipulation of shapes for relighting and simulation, and (2) they
can fully leverage the power of modern graphics pipelines which are mostly
optimized for meshes. Previous scalable methods for generating meshes typically
rely on sub-optimal post-processing, and they tend to produce overly-smooth or
noisy surfaces without fine-grained geometric details. To overcome these
shortcomings, we take advantage of the graph structure of meshes and use a
simple yet very effective generative modeling method to generate 3D meshes.
Specifically, we represent meshes with deformable tetrahedral grids, and then
train a diffusion model on this direct parametrization. We demonstrate the
effectiveness of our model on multiple generative tasks.
Point Cloud Diffusion Models for Automatic Implant Generation
March 14, 2023
Paul Friedrich, Julia Wolleb, Florentin Bieder, Florian M. Thieringer, Philippe C. Cattin
Advances in 3D printing of biocompatible materials make patient-specific
implants increasingly popular. The design of these implants is, however, still
a tedious and largely manual process. Existing approaches to automate implant
generation are mainly based on 3D U-Net architectures on downsampled or
patch-wise data, which can result in a loss of detail or contextual
information. Following the recent success of Diffusion Probabilistic Models, we
propose a novel approach for implant generation based on a combination of 3D
point cloud diffusion models and voxelization networks. Due to the stochastic
sampling process in our diffusion model, we can propose an ensemble of
different implants per defect, from which the physicians can choose the most
suitable one. We evaluate our method on the SkullBreak and SkullFix datasets,
generating high-quality implants and achieving competitive evaluation scores.
Text-to-image Diffusion Model in Generative AI: A Survey
March 14, 2023
Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, In So Kweon
This survey reviews text-to-image diffusion models in the context that
diffusion models have emerged to be popular for a wide range of generative
tasks. As a self-contained work, this survey starts with a brief introduction
of how a basic diffusion model works for image synthesis, followed by how
condition or guidance improves learning. Based on that, we present a review of
state-of-the-art methods on text-conditioned image synthesis, i.e.,
text-to-image. We further summarize applications beyond text-to-image
generation: text-guided creative generation and text-guided image editing.
Beyond the progress made so far, we discuss existing challenges and promising
future directions.
Generalised Scale-Space Properties for Probabilistic Diffusion Models
March 14, 2023
Pascal Peter
Probabilistic diffusion models enjoy increasing popularity in the deep
learning community. They generate convincing samples from a learned
distribution of input images with a wide field of practical applications.
Originally, these approaches were motivated from drift-diffusion processes, but
these origins find less attention in recent, practice-oriented publications. We
investigate probabilistic diffusion models from the viewpoint of scale-space
research and show that they fulfil generalised scale-space properties on
evolving probability distributions. Moreover, we discuss similarities and
differences between interpretations of the physical core concept of
drift-diffusion in the deep learning and model-based world. To this end, we
examine relations of probabilistic diffusion to osmosis filters.
DiffuseRoll: Multi-track multi-category music generation based on diffusion model
March 14, 2023
Hongfei Wang
Recent advancements in generative models have shown remarkable progress in
music generation. However, most existing methods focus on generating monophonic
or homophonic music, while the generation of polyphonic and multi-track music
with rich attributes is still a challenging task. In this paper, we propose a
novel approach for multi-track, multi-attribute symphonic music generation
using the diffusion model. Specifically, we generate piano-roll representations
with a diffusion model and map them to MIDI format for output. To capture rich
attribute information, we introduce a color coding scheme to encode note
sequences into color and position information that represents pitch, velocity,
and instrument. This scheme enables a seamless mapping between discrete music
sequences and continuous images. We also propose a post-processing method to
optimize the generated scores for better performance. Experimental results show
that our method outperforms state-of-the-art methods in generating polyphonic
music with rich attribute information.
Erasing Concepts from Diffusion Models
March 13, 2023
Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, David Bau
Motivated by recent advancements in text-to-image diffusion, we study erasure
of specific concepts from the model’s weights. While Stable Diffusion has shown
promise in producing explicit or realistic artwork, it has raised concerns
regarding its potential for misuse. We propose a fine-tuning method that can
erase a visual concept from a pre-trained diffusion model, given only the name
of the style and using negative guidance as a teacher. We benchmark our method
against previous approaches that remove sexually explicit content and
demonstrate its effectiveness, performing on par with Safe Latent Diffusion and
censored training. To evaluate artistic style removal, we conduct experiments
erasing five modern artists from the network and conduct a user study to assess
the human perception of the removed styles. Unlike previous methods, our
approach can remove concepts from a diffusion model permanently rather than
modifying the output at the inference time, so it cannot be circumvented even
if a user has access to model weights. Our code, data, and results are
available at https://erasing.baulab.info/
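A compact sketch of fine-tuning with negative guidance as the teacher, as described above: the frozen original model produces a target that pushes the named concept away, and the edited copy is regressed onto it. The conditioning interface and guidance scale eta are assumptions.

import torch
import torch.nn.functional as F

def erase_concept_loss(x_t, t, concept, eps_frozen, eps_finetune, eta=1.0):
    # Teacher target from the frozen model: the unconditional prediction, guided
    # away from the concept-conditional prediction (negative guidance).
    with torch.no_grad():
        e_uncond = eps_frozen(x_t, t, cond=None)
        e_cond = eps_frozen(x_t, t, cond=concept)
        target = e_uncond - eta * (e_cond - e_uncond)
    # The fine-tuned model learns to produce this guided-away prediction whenever
    # it is conditioned on the concept, baking the erasure into the weights.
    pred = eps_finetune(x_t, t, cond=concept)
    return F.mse_loss(pred, target)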
Synthesizing Realistic Image Restoration Training Pairs: A Diffusion Approach
March 13, 2023
Tao Yang, Peiran Ren, Xuansong Xie, Lei Zhang
In supervised image restoration tasks, one key issue is how to obtain the
aligned high-quality (HQ) and low-quality (LQ) training image pairs.
Unfortunately, such HQ-LQ training pairs are hard to capture in practice, and
hard to synthesize due to the complex unknown degradation in the wild. While
several sophisticated degradation models have been manually designed to
synthesize LQ images from their HQ counterparts, the distribution gap between
the synthesized and real-world LQ images remains large. We propose a new
approach to synthesizing realistic image restoration training pairs using the
emerging denoising diffusion probabilistic model (DDPM).
First, we train a DDPM, which could convert a noisy input into the desired LQ
image, with a large amount of collected LQ images, which define the target data
distribution. Then, for a given HQ image, we synthesize an initial LQ image by
using an off-the-shelf degradation model, and iteratively add proper Gaussian
noises to it. Finally, we denoise the noisy LQ image using the pre-trained DDPM
to obtain the final LQ image, which falls into the target distribution of
real-world LQ images. Thanks to the strong capability of DDPM in distribution
approximation, the synthesized HQ-LQ image pairs can be used to train robust
models for real-world image restoration tasks, such as blind face image
restoration and blind image super-resolution. Experiments demonstrated the
superiority of our proposed approach to existing degradation models. Code and
data will be released.
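A minimal sketch of the pipeline described above: degrade an HQ image with an off-the-shelf model, partially diffuse the result, and then denoise it with a DDPM trained on real low-quality images so it lands in the real-LQ distribution. The intermediate step t_star and the reverse_step helper are assumptions.

import torch

@torch.no_grad()
def realistic_lq(lq_init, eps_model, a_bar, t_star, reverse_step):
    # lq_init: LQ image produced by an off-the-shelf degradation model.
    # reverse_step: standard ancestral/DDPM update using eps_model (assumed helper).
    a = a_bar[t_star]
    x = a.sqrt() * lq_init + (1 - a).sqrt() * torch.randn_like(lq_init)  # partial forward diffusion
    for t in range(t_star, 0, -1):
        x = reverse_step(x, t, eps_model)   # denoise toward the real-LQ distribution
    return x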
DR2: Diffusion-based Robust Degradation Remover for Blind Face Restoration
March 13, 2023
Zhixin Wang, Xiaoyun Zhang, Ziying Zhang, Huangjie Zheng, Mingyuan Zhou, Ya Zhang, Yanfeng Wang
Blind face restoration usually synthesizes degraded low-quality data with a
pre-defined degradation model for training, while more complex cases could
happen in the real world. This gap between the assumed and actual degradation
hurts the restoration performance where artifacts are often observed in the
output. However, it is expensive and infeasible to include every type of
degradation to cover real-world cases in the training data. To tackle this
robustness issue, we propose Diffusion-based Robust Degradation Remover (DR2)
to first transform the degraded image to a coarse but degradation-invariant
prediction, then employ an enhancement module to restore the coarse prediction
to a high-quality image. By leveraging a well-performing denoising diffusion
probabilistic model, our DR2 diffuses input images to a noisy status where
various types of degradation give way to Gaussian noise, and then captures
semantic information through iterative denoising steps. As a result, DR2 is
robust against common degradation (e.g. blur, resize, noise and compression)
and compatible with different designs of enhancement modules. Experiments in
various settings show that our framework outperforms state-of-the-art methods
on heavily degraded synthetic and real-world datasets.
DDS2M: Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration
March 12, 2023
Yuchun Miao, Lefei Zhang, Liangpei Zhang, Dacheng Tao
Diffusion models have recently received a surge of interest due to their
impressive performance for image restoration, especially in terms of noise
robustness. However, existing diffusion-based methods are trained on a large
amount of training data and perform very well in-distribution, but can be quite
susceptible to distribution shift. This is especially inappropriate for
data-starved hyperspectral image (HSI) restoration. To tackle this problem,
this work puts forth a self-supervised diffusion model for HSI restoration,
namely Denoising Diffusion Spatio-Spectral Model (\texttt{DDS2M}), which works
by inferring the parameters of the proposed Variational Spatio-Spectral Module
(VS2M) during the reverse diffusion process, solely using the degraded HSI
without any extra training data. In VS2M, a variational inference-based loss
function is customized to enable the untrained spatial and spectral networks to
learn the posterior distribution, which serves as the transitions of the
sampling chain to help reverse the diffusion process. Benefiting from its
self-supervised nature and the diffusion process, \texttt{DDS2M} enjoys
stronger generalization ability to various HSIs compared to existing
diffusion-based methods and superior robustness to noise compared to existing
HSI restoration methods. Extensive experiments on HSI denoising, noisy HSI
completion and super-resolution on a variety of HSIs demonstrate
\texttt{DDS2M}’s superiority over the existing task-specific state-of-the-arts.
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
March 12, 2023
Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu
This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit
all distributions relevant to a set of multi-modal data in one model. Our key
insight is that learning diffusion models for marginal, conditional, and joint
distributions can be unified as predicting the noise in the perturbed data,
where the perturbation levels (i.e. timesteps) can be different for different
modalities. Inspired by the unified view, UniDiffuser learns all distributions
simultaneously with a minimal modification to the original diffusion model: it
perturbs data in all modalities instead of a single modality, inputs individual
timesteps in different modalities, and predicts the noise of all modalities
instead of a single modality. UniDiffuser is parameterized by a transformer for
diffusion models to handle input types of different modalities. Implemented on
large-scale paired image-text data, UniDiffuser is able to perform image, text,
text-to-image, image-to-text, and image-text pair generation by setting proper
timesteps without additional overhead. In particular, UniDiffuser is able to
produce perceptually realistic samples in all tasks and its quantitative
results (e.g., the FID and CLIP score) are not only superior to existing
general-purpose models but also comparable to bespoke models (e.g., Stable
Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image
generation).
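A sketch of the per-modality-timestep training objective described above, assuming image and text latents and a joint transformer backbone; the model interface is illustrative. Setting one modality's timestep to 0 at inference recovers conditional generation, and setting it to the maximum recovers unconditional/marginal generation.

import torch

def unidiffuser_step(z_img, z_txt, model, a_bar):
    B = z_img.shape[0]
    # independent timesteps per modality
    t_img = torch.randint(0, len(a_bar), (B,), device=z_img.device)
    t_txt = torch.randint(0, len(a_bar), (B,), device=z_img.device)
    n_img, n_txt = torch.randn_like(z_img), torch.randn_like(z_txt)
    a_i = a_bar[t_img].view(-1, *[1] * (z_img.dim() - 1))
    a_t = a_bar[t_txt].view(-1, *[1] * (z_txt.dim() - 1))
    x_img = a_i.sqrt() * z_img + (1 - a_i).sqrt() * n_img   # perturb both modalities
    x_txt = a_t.sqrt() * z_txt + (1 - a_t).sqrt() * n_txt
    p_img, p_txt = model(x_img, x_txt, t_img, t_txt)        # joint transformer (assumed API)
    # predict the noise of all modalities instead of a single one
    return ((p_img - n_img) ** 2).mean() + ((p_txt - n_txt) ** 2).mean()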
PARASOL: Parametric Style Control for Diffusion Image Synthesis
March 11, 2023
Gemma Canet Tarrés, Dan Ruta, Tu Bui, John Collomosse
We propose PARASOL, a multi-modal synthesis model that enables disentangled,
parametric control of the visual style of the image by jointly conditioning
synthesis on both content and a fine-grained visual style embedding. We train a
latent diffusion model (LDM) using specific losses for each modality and adapt
the classifier-free guidance for encouraging disentangled control over
independent content and style modalities at inference time. We leverage
auxiliary semantic and style-based search to create training triplets for
supervision of the LDM, ensuring complementarity of content and style cues.
PARASOL shows promise for enabling nuanced control over visual style in
diffusion models for image creation and stylization, as well as generative
search where text-based search results may be adapted to more closely match
user intent by interpolating both content and style descriptors.
Importance of Aligning Training Strategy with Evaluation for Diffusion Models in 3D Multiclass Segmentation
March 10, 2023
Yunguan Fu, Yiwen Li, Shaheer U. Saeed, Matthew J. Clarkson, Yipeng Hu
Recently, denoising diffusion probabilistic models (DDPM) have been applied
to image segmentation by generating segmentation masks conditioned on images,
while the applications were mainly limited to 2D networks without exploiting
potential benefits from the 3D formulation. In this work, we studied the
DDPM-based segmentation model for 3D multiclass segmentation on two large
multiclass data sets (prostate MR and abdominal CT). We observed that the
difference between training and test methods led to inferior performance for
existing DDPM methods. To mitigate the inconsistency, we proposed a recycling
method which generated corrupted masks based on the model’s prediction at a
previous time step instead of using ground truth. The proposed method achieved
statistically significantly improved performance compared to existing DDPMs,
independent of a number of other techniques for reducing train-test
discrepancy, including performing mask prediction, using Dice loss, and
reducing the number of diffusion time steps during training. The performance of
diffusion models was also competitive and visually similar to
non-diffusion-based U-net, within the same compute budget. The JAX-based
diffusion framework has been released at
https://github.com/mathpluscode/ImgX-DiffSeg.
Score-Based Generative Models for Medical Image Segmentation using Signed Distance Functions
March 10, 2023
Lea Bogensperger, Dominik Narnhofer, Filip Ilic, Thomas Pock
Medical image segmentation is a crucial task that relies on the ability to
accurately identify and isolate regions of interest in medical images. Thereby,
generative approaches allow to capture the statistical properties of
segmentation masks that are dependent on the respective structures. In this
work we propose a conditional score-based generative modeling framework to
represent the signed distance function (SDF) leading to an implicit
distribution of segmentation masks. The advantage of leveraging the SDF is a
more natural distortion when compared to that of binary masks. By learning the
score function of the conditional distribution of SDFs we can accurately sample
from the distribution of segmentation masks, allowing for the evaluation of
statistical quantities. Thus, this probabilistic representation allows for the
generation of uncertainty maps represented by the variance, which can aid in
further analysis and enhance the predictive robustness. We qualitatively and
quantitatively illustrate competitive performance of the proposed method on a
public nuclei and gland segmentation data set, highlighting its potential
utility in medical image segmentation applications.
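A small sketch of converting a binary mask to a signed distance function (and back), which is the representation the score-based model operates on; the clipping and normalization constants are assumptions.

import numpy as np
from scipy.ndimage import distance_transform_edt

def mask_to_sdf(mask, clip=20.0):
    # Negative inside the object, positive outside; clipped and normalized so the
    # SDF lies in [-1, 1], a convenient target for a generative model.
    mask = mask.astype(bool)
    sdf = distance_transform_edt(~mask) - distance_transform_edt(mask)
    return np.clip(sdf, -clip, clip) / clip

def sdf_to_mask(sdf):
    # Recover a segmentation mask by thresholding at the zero level set.
    return (sdf <= 0).astype(np.uint8)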
TrojDiff: Trojan Attacks on Diffusion Models with Diverse Targets
March 10, 2023
Weixin Chen, Dawn Song, Bo Li
Diffusion models have achieved great success in a range of tasks, such as
image synthesis and molecule design. As such successes hinge on large-scale
training data collected from diverse sources, the trustworthiness of these
collected data is hard to control or audit. In this work, we aim to explore the
vulnerabilities of diffusion models under potential training data manipulations
and try to answer: How hard is it to perform Trojan attacks on well-trained
diffusion models? What are the adversarial targets that such Trojan attacks can
achieve? To answer these questions, we propose an effective Trojan attack
against diffusion models, TrojDiff, which optimizes the Trojan diffusion and
generative processes during training. In particular, we design novel
transitions during the Trojan diffusion process to diffuse adversarial targets
into a biased Gaussian distribution and propose a new parameterization of the
Trojan generative process that leads to an effective training objective for the
attack. In addition, we consider three types of adversarial targets: the
Trojaned diffusion models will always output instances belonging to a certain
class from the in-domain distribution (In-D2D attack), out-of-domain
distribution (Out-D2D-attack), and one specific instance (D2I attack). We
evaluate TrojDiff on CIFAR-10 and CelebA datasets against both DDPM and DDIM
diffusion models. We show that TrojDiff always achieves high attack performance
under different adversarial targets using different types of triggers, while
the performance in benign environments is preserved. The code is available at
https://github.com/chenweixin107/TrojDiff.
Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition
March 10, 2023
Hyungjin Chung, Suhyeon Lee, Jong Chul Ye
cs.LG, cs.AI, cs.CV, stat.ML
Krylov subspace, which is generated by multiplying a given vector by the
matrix of a linear transformation and its successive powers, has been
extensively studied in classical optimization literature to design algorithms
that converge quickly for large linear inverse problems. For example, the
conjugate gradient method (CG), one of the most popular Krylov subspace
methods, is based on the idea of minimizing the residual error in the Krylov
subspace. However, with the recent advancement of high-performance diffusion
solvers for inverse problems, it is not clear how classical wisdom can be
synergistically combined with modern diffusion models. In this study, we
propose a novel and efficient diffusion sampling strategy that synergistically
combines diffusion sampling and Krylov subspace methods. Specifically, we
prove that if the tangent space at a denoised sample by Tweedie’s formula forms
a Krylov subspace, then the CG initialized with the denoised data ensures the
data consistency update to remain in the tangent space. This negates the need
to compute the manifold-constrained gradient (MCG), leading to a more efficient
diffusion sampling method. Our method is applicable regardless of the
parametrization and setting (i.e., VE, VP). Notably, we achieve
state-of-the-art reconstruction quality on challenging real-world medical
inverse imaging problems, including multi-coil MRI reconstruction and 3D CT
reconstruction. Moreover, our proposed method achieves more than 80 times
faster inference time than the previous state-of-the-art method.
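A schematic of the conjugate-gradient data-consistency step described above, applied to the Tweedie-denoised estimate inside a sampling loop; apply_A, AT, y, and eps_model are assumed interfaces, and the surrounding sampler is omitted.

import torch

def cg(apply_A, b, x0, n_iter=5):
    # A few conjugate-gradient iterations for the normal equations, where apply_A
    # is a symmetric positive semi-definite operator (here A^T A).
    x = x0.clone()
    r = b - apply_A(x)
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(n_iter):
        Ap = apply_A(p)
        alpha = rs / (p * Ap).sum().clamp_min(1e-12)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = (r * r).sum()
        p = r + (rs_new / rs.clamp_min(1e-12)) * p
        rs = rs_new
    return x

# Inside a sampling loop (sketch): denoise via Tweedie, then run CG initialized at
# the denoised estimate so the data-consistency update stays near the clean manifold.
# x0_hat = (x_t - (1 - a_bar[t]).sqrt() * eps_model(x_t, t)) / a_bar[t].sqrt()
# x0_hat = cg(lambda v: AT(A(v)), AT(y), x0_hat, n_iter=5)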
March 10, 2023
Amir Sadikov, Xinlei Pan, Hannah Choi, Lanya T. Cai, Pratik Mukherjee
Diffusion MRI is a non-invasive, in-vivo biomedical imaging method for
mapping tissue microstructure. Applications include structural connectivity
imaging of the human brain and detecting microstructural neural changes.
However, acquiring high signal-to-noise ratio dMRI datasets with high angular
and spatial resolution requires prohibitively long scan times, limiting usage
in many important clinical settings, especially for children, the elderly, and
in acute neurological disorders that may require conscious sedation or general
anesthesia. We employ a Swin UNEt Transformers model, trained on augmented
Human Connectome Project data and conditioned on registered T1 scans, to
perform generalized denoising of dMRI. We also qualitatively demonstrate
super-resolution with artificially downsampled HCP data in normal adult
volunteers. Remarkably, Swin UNETR can be fine-tuned for an out-of-domain
dataset with a single example scan, as we demonstrate on dMRI of children with
neurodevelopmental disorders and of adults with acute evolving traumatic brain
injury, each cohort scanned on different models of scanners with different
imaging protocols at different sites. We exceed current state-of-the-art
denoising methods in accuracy and test-retest reliability of rapid diffusion
tensor imaging requiring only 90 seconds of scan time. Applied to tissue
microstructural modeling of dMRI, Swin UNETR denoising achieves dramatic
improvements over the state-of-the-art for test-retest reliability of
intracellular volume fraction and free water fraction measurements and can
remove heavy-tailed noise, improving biophysical modeling fidelity. Swin UNETR
enables rapid diffusion MRI with unprecedented accuracy and reliability,
especially for probing biological tissues for scientific and clinical
applications. The code and model are publicly available at
https://github.com/ucsfncl/dmri-swin.
EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models
March 10, 2023
Hongyi Yuan, Songchi Zhou, Sheng Yu
Electronic health records (EHR) contain vast biomedical knowledge and are
rich resources for developing precise medicine systems. However, due to privacy
concerns, limited high-quality EHR data are accessible to researchers,
hindering the advancement of methodologies. Recent research has explored
using generative modelling methods to synthesize realistic EHR data, and most
proposed methods are based on the generative adversarial network (GAN) and its
variants for EHR synthesis. Although GAN-style methods achieved
state-of-the-art performance in generating high-quality EHR data, such methods
are hard to train and prone to mode collapse. Diffusion models are recently
proposed generative modelling methods and set cutting-edge performance in image
generation. The performance of diffusion models in realistic EHR synthesis is
rarely explored. In this work, we explore whether the superior performance of
diffusion models can translate to the domain of EHR synthesis and propose a
novel EHR synthesis method named EHRDiff. Through comprehensive experiments,
EHRDiff achieves new state-of-the-art performance for the quality of synthetic
EHR data while better protecting private information in the real training EHRs.
PC-JeDi: Diffusion for Particle Cloud Generation in High Energy Physics
March 09, 2023
Matthew Leigh, Debajyoti Sengupta, Guillaume Quétant, John Andrew Raine, Knut Zoch, Tobias Golling
In this paper, we present a new method to efficiently generate jets in High
Energy Physics called PC-JeDi. This method utilises score-based diffusion
models in conjunction with transformers which are well suited to the task of
generating jets as particle clouds due to their permutation equivariance.
PC-JeDi achieves competitive performance with current state-of-the-art methods
across several metrics that evaluate the quality of the generated jets.
Although slower than other models, due to the large number of forward passes
required by diffusion models, it is still substantially faster than traditional
detailed simulation. Furthermore, PC-JeDi uses conditional generation to
produce jets with a desired mass and transverse momentum for two different
particles, top quarks and gluons.
Brain-Diffuser: Natural scene reconstruction from fMRI signals using generative latent diffusion
March 09, 2023
Furkan Ozcelik, Rufin VanRullen
In neural decoding research, one of the most intriguing topics is the
reconstruction of perceived natural images based on fMRI signals. Previous
studies have succeeded in re-creating different aspects of the visuals, such as
low-level properties (shape, texture, layout) or high-level features (category
of objects, descriptive semantics of scenes) but have typically failed to
reconstruct these properties together for complex scene images. Generative AI
has recently made a leap forward with latent diffusion models capable of
generating high-complexity images. Here, we investigate how to take advantage
of this innovative technology for brain decoding. We present a two-stage scene
reconstruction framework called "Brain-Diffuser". In the first stage,
starting from fMRI signals, we reconstruct images that capture low-level
properties and overall layout using a VDVAE (Very Deep Variational Autoencoder)
model. In the second stage, we use the image-to-image framework of a latent
diffusion model (Versatile Diffusion) conditioned on predicted multimodal (text
and visual) features, to generate final reconstructed images. On the publicly
available Natural Scenes Dataset benchmark, our method outperforms previous
models both qualitatively and quantitatively. When applied to synthetic fMRI
patterns generated from individual ROI (region-of-interest) masks, our trained
model creates compelling "ROI-optimal" scenes consistent with neuroscientific
knowledge. Thus, the proposed methodology can have an impact on both applied
(e.g. brain-computer interface) and fundamental neuroscience.
MaskDiff: Modeling Mask Distribution with Diffusion Probabilistic Model for Few-Shot Instance Segmentation
March 09, 2023
Minh-Quan Le, Tam V. Nguyen, Trung-Nghia Le, Thanh-Toan Do, Minh N. Do, Minh-Triet Tran
Few-shot instance segmentation extends the few-shot learning paradigm to the
instance segmentation task, which tries to segment instance objects from a
query image with a few annotated examples of novel categories. Conventional
approaches have attempted to address the task via prototype learning, known as
point estimation. However, this mechanism is susceptible to noise and suffers
from bias due to a significant scarcity of data. To overcome the disadvantages
of the point estimation mechanism, we propose a novel approach, dubbed
MaskDiff, which models the underlying conditional distribution of a binary
mask, which is conditioned on an object region and $K$-shot information.
Inspired by augmentation approaches that perturb data with Gaussian noise for
populating low data density regions, we model the mask distribution with a
diffusion probabilistic model. In addition, we propose to utilize
classifier-free guided mask sampling to integrate category information into the
binary mask generation process. Without bells and whistles, our proposed method
consistently outperforms state-of-the-art methods on both base and novel
classes of the COCO dataset while simultaneously being more stable than
existing methods.
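As a minimal illustration of the classifier-free guided mask sampling mentioned above, the sketch below blends a conditional and an unconditional noise prediction with a guidance weight; `eps_model`, its conditioning inputs, and the thresholding step are illustrative assumptions rather than MaskDiff's actual implementation.

```python
import numpy as np

def eps_model(x_t, t, region_feat, category=None):
    """Hypothetical noise predictor; category=None is the unconditional branch."""
    bias = 0.0 if category is None else 0.1 * category
    return 0.05 * x_t + 0.01 * region_feat + bias

def cfg_eps(x_t, t, region_feat, category, guidance_w=2.0):
    """Classifier-free guidance: blend conditional and unconditional predictions."""
    eps_c = eps_model(x_t, t, region_feat, category)
    eps_u = eps_model(x_t, t, region_feat, None)
    return eps_u + guidance_w * (eps_c - eps_u)

# one toy denoising estimate on a 32x32 mask logit map
rng = np.random.default_rng(0)
x_t = rng.normal(size=(32, 32))
region_feat = rng.normal(size=(32, 32))
eps_hat = cfg_eps(x_t, t=500, region_feat=region_feat, category=3)
alpha_bar = 0.3
x0_hat = (x_t - np.sqrt(1 - alpha_bar) * eps_hat) / np.sqrt(alpha_bar)
binary_mask = (x0_hat > 0).astype(np.uint8)   # threshold logits into a mask
```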
StyleDiff: Attribute Comparison Between Unlabeled Datasets in Latent Disentangled Space
March 09, 2023
Keisuke Kawano, Takuro Kutsuna, Ryoko Tokuhisa, Akihiro Nakamura, Yasushi Esaki
One major challenge in machine learning applications is coping with
mismatches between the datasets used in the development and those obtained in
real-world applications. These mismatches may lead to inaccurate predictions
and errors, resulting in poor product quality and unreliable systems. In this
study, we propose StyleDiff to inform developers of the differences between the
two datasets for the steady development of machine learning systems. Using
disentangled image spaces obtained from recently proposed generative models,
StyleDiff compares the two datasets by focusing on attributes in the images and
provides an easy-to-understand analysis of the differences between the
datasets. The proposed StyleDiff performs in $O(dN \log N)$, where $N$ is the
size of the datasets and $d$ is the number of attributes, enabling the
application to large datasets. We demonstrate that StyleDiff accurately detects
differences between datasets and presents them in an understandable format
using, for example, driving scenes datasets.
DiffusionDepth: Diffusion Denoising Approach for Monocular Depth Estimation
March 09, 2023
Yiqun Duan, Xianda Guo, Zheng Zhu
Monocular depth estimation is a challenging task that predicts the pixel-wise
depth from a single 2D image. Current methods typically model this problem as a
regression or classification task. We propose DiffusionDepth, a new approach
that reformulates monocular depth estimation as a denoising diffusion process.
It learns an iterative denoising process to 'denoise' a random depth distribution
into a depth map with the guidance of monocular visual conditions. The process
is performed in the latent space encoded by a dedicated depth encoder and
decoder. Instead of diffusing ground truth (GT) depth, the model learns to
reverse the process of diffusing the refined depth of itself into random depth
distribution. This self-diffusion formulation overcomes the difficulty of
applying generative models to sparse GT depth scenarios. The proposed approach
benefits this task by refining depth estimation step by step, which is superior
for generating accurate and highly detailed depth maps. Experimental results on
KITTI and NYU-Depth-V2 datasets suggest that a simple yet efficient diffusion
approach could reach state-of-the-art performance in both indoor and outdoor
scenarios with acceptable inference time.
Multilevel Diffusion: Infinite Dimensional Score-Based Diffusion Models for Image Generation
March 08, 2023
Paul Hagemann, Sophie Mildenberger, Lars Ruthotto, Gabriele Steidl, Nicole Tianjiao Yang
cs.LG, cs.CV, math.PR, stat.ML, 60H10, 65D18
Score-based diffusion models (SBDM) have recently emerged as state-of-the-art
approaches for image generation. Existing SBDMs are typically formulated in a
finite-dimensional setting, where images are considered as tensors of finite
size. This paper develops SBDMs in the infinite-dimensional setting, that is,
we model the training data as functions supported on a rectangular domain.
Besides the quest for generating images at ever higher resolution, our primary
motivation is to create a well-posed infinite-dimensional learning problem so
that we can discretize it consistently on multiple resolution levels. We
thereby intend to obtain diffusion models that generalize across different
resolution levels and improve the efficiency of the training process. We
demonstrate how to overcome two shortcomings of current SBDM approaches in the
infinite-dimensional setting. First, we modify the forward process to ensure
that the latent distribution is well-defined in the infinite-dimensional
setting using the notion of trace class operators. We derive the reverse
processes for finite approximations. Second, we illustrate that approximating
the score function with an operator network is beneficial for multilevel
training. After deriving the convergence of the discretization and the
approximation of multilevel training, we implement an infinite-dimensional SBDM
approach and show the first promising results on MNIST and Fashion-MNIST,
underlining our developed theory.
Diffusing Gaussian Mixtures for Generating Categorical Data
March 08, 2023
Florence Regol, Mark Coates
Learning a categorical distribution comes with its own set of challenges. A
successful approach taken by state-of-the-art works is to cast the problem in a
continuous domain to take advantage of the impressive performance of the
generative models for continuous data. Amongst them are the recently emerging
diffusion probabilistic models, which have the observed advantage of generating
high-quality samples. Recent advances for categorical generative models have
focused on log likelihood improvements. In this work, we propose a generative
model for categorical data based on diffusion models with a focus on
high-quality sample generation, and propose sampled-based evaluation methods.
The efficacy of our method stems from performing diffusion in the continuous
domain while having its parameterization informed by the structure of the
categorical nature of the target distribution. Our method of evaluation
highlights the capabilities and limitations of different generative models for
generating categorical data, and includes experiments on synthetic and
real-world protein datasets.
Learning Enhancement From Degradation: A Diffusion Model For Fundus Image Enhancement
March 08, 2023
Puijin Cheng, Li Lin, Yijin Huang, Huaqing He, Wenhan Luo, Xiaoying Tang
The quality of a fundus image can be compromised by numerous factors, many of
which are difficult to model appropriately and mathematically. In this
paper, we introduce a novel diffusion model based framework, named Learning
Enhancement from Degradation (LED), for enhancing fundus images. Specifically,
we first adopt a data-driven degradation framework to learn degradation
mappings from unpaired high-quality to low-quality images. We then apply a
conditional diffusion model to learn the inverse enhancement process in a
paired manner. The proposed LED is able to output enhancement results that
maintain clinically important features with better clarity. Moreover, in the
inference phase, LED can be easily and effectively integrated with any existing
fundus image enhancement framework. We evaluate the proposed LED on several
downstream tasks with respect to various clinically-relevant metrics,
successfully demonstrating its superiority over existing state-of-the-art
methods both quantitatively and qualitatively. The source code is available at
https://github.com/QtacierP/LED.
Diffusion in the Dark: A Diffusion Model for Low-Light Text Recognition
March 07, 2023
Cindy M. Nguyen, Eric R. Chan, Alexander W. Bergman, Gordon Wetzstein
Capturing images is a key part of automation for high-level tasks such as
scene text recognition. Low-light conditions pose a challenge for high-level
perception stacks, which are often optimized on well-lit, artifact-free images.
Reconstruction methods for low-light images can produce well-lit counterparts,
but typically at the cost of high-frequency details critical for downstream
tasks. We propose Diffusion in the Dark (DiD), a diffusion model for low-light
image reconstruction for text recognition. DiD provides reconstructions that are
qualitatively competitive with those of state-of-the-art (SOTA) methods, while
preserving high-frequency details even in extremely noisy, dark conditions. We
demonstrate that DiD, without any task-specific optimization, can outperform
SOTA low-light methods in low-light text recognition on real images, bolstering
the potential of diffusion models to solve ill-posed inverse problems.
TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation
March 07, 2023
David Berthelot, Arnaud Autef, Jierui Lin, Dian Ang Yap, Shuangfei Zhai, Siyuan Hu, Daniel Zheng, Walter Talbott, Eric Gu
Denoising Diffusion models have demonstrated their proficiency for generative
sampling. However, generating good samples often requires many iterations.
Consequently, techniques such as binary time-distillation (BTD) have been
proposed to reduce the number of network calls for a fixed architecture. In
this paper, we introduce TRAnsitive Closure Time-distillation (TRACT), a new
method that extends BTD. For single-step diffusion, TRACT improves FID by up to
2.4x on the same architecture, and achieves new single-step Denoising Diffusion
Implicit Models (DDIM) state-of-the-art FID (7.4 for ImageNet64, 3.8 for
CIFAR10). Finally, we tease apart the method through extended ablations. The
PyTorch implementation will be released soon.
Patched Diffusion Models for Unsupervised Anomaly Detection in Brain MRI
March 07, 2023
Finn Behrendt, Debayan Bhattacharya, Julia Krüger, Roland Opfer, Alexander Schlaefer
The use of supervised deep learning techniques to detect pathologies in brain
MRI scans can be challenging due to the diversity of brain anatomy and the need
for annotated data sets. An alternative approach is to use unsupervised anomaly
detection, which only requires sample-level labels of healthy brains to create
a reference representation. This reference representation can then be compared
to unhealthy brain anatomy in a pixel-wise manner to identify abnormalities. To
accomplish this, generative models are needed to create anatomically consistent
MRI scans of healthy brains. While recent diffusion models have shown promise
in this task, accurately generating the complex structure of the human brain
remains a challenge. In this paper, we propose a method that reformulates the
generation task of diffusion models as a patch-based estimation of healthy
brain anatomy, using spatial context to guide and improve reconstruction. We
evaluate our approach on data of tumors and multiple sclerosis lesions and
demonstrate a relative improvement of 25.1% compared to existing baselines.
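A minimal sketch of the patch-based idea, assuming a hypothetical `reconstruct_patch` routine that stands in for the diffusion model's context-guided reconstruction of healthy anatomy: anomalies are then scored pixel-wise by the reconstruction error.

```python
import numpy as np

def reconstruct_patch(noisy_patch, context):
    """Hypothetical stand-in for the diffusion model's patch reconstruction:
    it should return a 'healthy-looking' version of the patch given its context."""
    return 0.5 * noisy_patch + 0.5 * context.mean()

def anomaly_map(scan, patch=32):
    """Slide over the scan, reconstruct each patch, score |input - reconstruction|."""
    h, w = scan.shape
    out = np.zeros_like(scan)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tile = scan[i:i + patch, j:j + patch]
            context = scan[max(0, i - patch):i + 2 * patch,
                           max(0, j - patch):j + 2 * patch]
            recon = reconstruct_patch(tile, context)
            out[i:i + patch, j:j + patch] = np.abs(tile - recon)
    return out

scan = np.random.default_rng(0).normal(size=(128, 128))
print(anomaly_map(scan).shape)  # pixel-wise anomaly scores
```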
3D Equivariant Diffusion for Target-Aware Molecule Generation and Affinity Prediction
March 06, 2023
Jiaqi Guan, Wesley Wei Qian, Xingang Peng, Yufeng Su, Jian Peng, Jianzhu Ma
Rich data and powerful machine learning models allow us to design drugs for a
specific protein target \textit{in silico}. Recently, the inclusion of 3D
structures during targeted drug design shows superior performance to other
target-free models as the atomic interaction in the 3D space is explicitly
modeled. However, current 3D target-aware models either rely on the voxelized
atom densities or the autoregressive sampling process, which are not
equivariant to rotation or easily violate geometric constraints resulting in
unrealistic structures. In this work, we develop a 3D equivariant diffusion
model to solve the above challenges. To achieve target-aware molecule design,
our method learns a joint generative process of both continuous atom
coordinates and categorical atom types with a SE(3)-equivariant network.
Moreover, we show that our model can serve as an unsupervised feature extractor
to estimate the binding affinity under proper parameterization, which provides
an effective way for drug screening. To evaluate our model, we propose a
comprehensive framework to evaluate the quality of sampled molecules from
different dimensions. Empirical studies show our model could generate molecules
with more realistic 3D structures and better affinities towards the protein
targets, and improve binding affinity ranking and prediction without
retraining.
Restoration-Degradation Beyond Linear Diffusions: A Non-Asymptotic Analysis For DDIM-Type Samplers
March 06, 2023
Sitan Chen, Giannis Daras, Alexandros G. Dimakis
cs.LG, math.ST, stat.ML, stat.TH
We develop a framework for non-asymptotic analysis of deterministic samplers
used for diffusion generative modeling. Several recent works have analyzed
stochastic samplers using tools like Girsanov’s theorem and a chain rule
variant of the interpolation argument. Unfortunately, these techniques give
vacuous bounds when applied to deterministic samplers. We give a new
operational interpretation for deterministic sampling by showing that one step
along the probability flow ODE can be expressed as two steps: 1) a restoration
step that runs gradient ascent on the conditional log-likelihood at some
infinitesimally previous time, and 2) a degradation step that runs the forward
process using noise pointing back towards the current iterate. This perspective
allows us to extend denoising diffusion implicit models to general, non-linear
forward processes. We then develop the first polynomial convergence bounds for
these samplers under mild conditions on the data distribution.
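The following toy sketch illustrates the two-step reading of a deterministic DDIM update described above: a restoration step that forms a denoised estimate from the predicted noise, followed by a degradation step that re-applies the forward process at the earlier time with the noise direction pointing back toward the current iterate. The noise predictor and variance schedule are placeholders, and the sketch covers only the standard linear (variance-preserving) case, not the paper's general non-linear forward processes.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical noise predictor epsilon_theta(x_t, t)."""
    return 0.1 * x_t

def ddim_step(x_t, t, t_prev, alpha_bar):
    a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
    eps = eps_model(x_t, t)
    # 1) restoration: move toward an estimate of the clean data
    x0_hat = (x_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    # 2) degradation: rerun the forward process to time t_prev, with the
    #    noise direction chosen to point back toward the current iterate
    return np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))
x = np.random.default_rng(0).normal(size=(8, 8))
for t in range(T - 1, 0, -50):
    x = ddim_step(x, t, max(t - 50, 0), alpha_bar)
```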
EEG Synthetic Data Generation Using Probabilistic Diffusion Models
March 06, 2023
Giulio Tosato, Cesare M. Dalbagno, Francesco Fumagalli
eess.SP, cs.AI, cs.LG, q-bio.NC
Electroencephalography (EEG) plays a significant role in the Brain Computer
Interface (BCI) domain, due to its non-invasive nature, low cost, and ease of
use, making it a highly desirable option for widespread adoption by the general
public. This technology is commonly used in conjunction with deep learning
techniques, the success of which is largely dependent on the quality and
quantity of data used for training. To address the challenge of obtaining
sufficient EEG data from individual participants while minimizing user effort
and maintaining accuracy, this study proposes an advanced methodology for data
augmentation: generating synthetic EEG data using denoising diffusion
probabilistic models. The synthetic data are generated from electrode-frequency
distribution maps (EFDMs) of emotionally labeled EEG recordings. To assess the
validity of the synthetic data generated, both a qualitative and a quantitative
comparison with real EEG data were successfully conducted. This study opens up
the possibility of an open-source, accessible, and versatile toolbox
that can process and generate data in both time and frequency dimensions,
regardless of the number of channels involved. Finally, the proposed
methodology has potential implications for the broader field of neuroscience
research by enabling the creation of large, publicly available synthetic EEG
datasets without privacy concerns.
Diffusion Models Generate Images Like Painters: an Analytical Theory of Outline First, Details Later
March 04, 2023
Binxu Wang, John J. Vastola
cs.CV, cs.AI, cs.GR, cs.NE, F.2.2; I.3.3; I.2.10; I.2.6
How do diffusion generative models convert pure noise into meaningful images?
We argue that generation involves first committing to an outline, and then to
finer and finer details. The corresponding reverse diffusion process can be
modeled by dynamics on a (time-dependent) high-dimensional landscape full of
Gaussian-like modes, which makes the following predictions: (i) individual
trajectories tend to be very low-dimensional; (ii) scene elements that vary
more within training data tend to emerge earlier; and (iii) early perturbations
substantially change image content more often than late perturbations. We show
that the behavior of a variety of trained unconditional and conditional
diffusion models like Stable Diffusion is consistent with these predictions.
Finally, we use our theory to search for the latent image manifold of diffusion
models, and propose a new way to generate interpretable image variations. Our
viewpoint suggests generation by GANs and diffusion models have unexpected
similarities.
Unleashing Text-to-Image Diffusion Models for Visual Perception
March 03, 2023
Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu
Diffusion models (DMs) have become the new trend of generative models and
have demonstrated a powerful ability for conditional synthesis. Among those,
text-to-image diffusion models pre-trained on large-scale image-text pairs are
highly controllable by customizable prompts. Unlike the unconditional
generative models that focus on low-level attributes and details, text-to-image
diffusion models contain more high-level knowledge thanks to the
vision-language pre-training. In this paper, we propose VPD (Visual Perception
with a pre-trained Diffusion model), a new framework that exploits the semantic
information of a pre-trained text-to-image diffusion model in visual perception
tasks. Instead of using the pre-trained denoising autoencoder in a
diffusion-based pipeline, we simply use it as a backbone and aim to study how
to take full advantage of the learned knowledge. Specifically, we prompt the
denoising decoder with proper textual inputs and refine the text features with
an adapter, leading to a better alignment to the pre-trained stage and making
the visual contents interact with the text prompts. We also propose to utilize
the cross-attention maps between the visual features and the text features to
provide explicit guidance. Compared with other pre-training methods, we show
that vision-language pre-trained diffusion models can be faster adapted to
downstream visual perception tasks using the proposed VPD. Extensive
experiments on semantic segmentation, referring image segmentation and depth
estimation demonstrate the effectiveness of our method. Notably, VPD attains
0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring
image segmentation, establishing new records on these two benchmarks. Code is
available at https://github.com/wl-zhao/VPD
Diffusion Models are Minimax Optimal Distribution Estimators
March 03, 2023
Kazusato Oko, Shunta Akiyama, Taiji Suzuki
While efficient distribution learning is no doubt behind the groundbreaking
success of diffusion modeling, its theoretical guarantees are quite limited. In
this paper, we provide the first rigorous analysis on approximation and
generalization abilities of diffusion modeling for well-known function spaces.
The highlight of this paper is that when the true density function belongs to
the Besov space and the empirical score matching loss is properly minimized,
the generated data distribution achieves the nearly minimax optimal estimation
rates in the total variation distance and in the Wasserstein distance of order
one. Furthermore, we extend our theory to demonstrate how diffusion models
adapt to low-dimensional data distributions. We expect these results advance
theoretical understandings of diffusion modeling and its ability to generate
verisimilar outputs.
Deep Momentum Multi-Marginal Schrödinger Bridge
March 03, 2023
Tianrong Chen, Guan-Horng Liu, Molei Tao, Evangelos A. Theodorou
It is a crucial challenge to reconstruct population dynamics using unlabeled
samples from distributions at coarse time intervals. Recent approaches such as
flow-based models or Schrödinger Bridge (SB) models have demonstrated
appealing performance, yet the inferred sample trajectories either fail to
account for the underlying stochasticity or are computationally expensive. To
address this, we propose Deep Momentum Multi-Marginal Schrödinger Bridge
(DMSB), a novel computational framework that learns the
smooth measure-valued spline for stochastic systems that satisfy position
marginal constraints across time. By tailoring the celebrated Bregman Iteration
and extending the Iteration Proportional Fitting to phase space, we manage to
handle high-dimensional multi-marginal trajectory inference tasks efficiently.
Our algorithm outperforms baselines significantly, as evidenced by experiments
for synthetic datasets and a real-world single-cell RNA sequence dataset.
Additionally, the proposed approach can reasonably reconstruct the evolution of
velocity distribution, from position snapshots only, when there is a ground
truth velocity that is nevertheless inaccessible.
Generative Diffusions in Augmented Spaces: A Complete Recipe
March 03, 2023
Kushagra Pandey, Stephan Mandt
Score-based Generative Models (SGMs) have demonstrated exceptional synthesis
outcomes across various tasks. However, the current design landscape of the
forward diffusion process remains largely untapped and often relies on physical
heuristics or simplifying assumptions. Utilizing insights from the development
of scalable Bayesian posterior samplers, we present a complete recipe for
formulating forward processes in SGMs, ensuring convergence to the desired
target distribution. Our approach reveals that several existing SGMs can be
seen as specific manifestations of our framework. Building upon this method, we
introduce Phase Space Langevin Diffusion (PSLD), which relies on score-based
modeling within an augmented space enriched by auxiliary variables akin to
physical phase space. Empirical results exhibit the superior sample quality and
improved speed-quality trade-off of PSLD compared to various competing
approaches on established image synthesis benchmarks. Remarkably, PSLD achieves
sample quality akin to state-of-the-art SGMs (FID: 2.10 for unconditional
CIFAR-10 generation). Lastly, we demonstrate the applicability of PSLD in
conditional synthesis using pre-trained score networks, offering an appealing
alternative as an SGM backbone for future advancements. Code and model
checkpoints can be accessed at \url{https://github.com/mandt-lab/PSLD}.
Consistency Models
March 02, 2023
Yang Song, Prafulla Dhariwal, Mark Chen, Ilya Sutskever
Diffusion models have significantly advanced the fields of image, audio, and
video generation, but they depend on an iterative sampling process that causes
slow generation. To overcome this limitation, we propose consistency models, a
new family of models that generate high quality samples by directly mapping
noise to data. They support fast one-step generation by design, while still
allowing multistep sampling to trade compute for sample quality. They also
support zero-shot data editing, such as image inpainting, colorization, and
super-resolution, without requiring explicit training on these tasks.
Consistency models can be trained either by distilling pre-trained diffusion
models, or as standalone generative models altogether. Through extensive
experiments, we demonstrate that they outperform existing distillation
techniques for diffusion models in one- and few-step sampling, achieving the
new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for
one-step generation. When trained in isolation, consistency models become a new
family of generative models that can outperform existing one-step,
non-adversarial generative models on standard benchmarks such as CIFAR-10,
ImageNet 64x64 and LSUN 256x256.
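A minimal sketch of consistency-model sampling under the abstract's description: a consistency function maps any noisy input directly to a data estimate in one step, and multistep sampling alternates re-noising at decreasing noise levels with reapplying the function. `consistency_fn`, the noise levels, and the re-noising rule shown are illustrative stand-ins, not the trained model.

```python
import numpy as np

def consistency_fn(x_t, sigma):
    """Hypothetical trained consistency function f(x, sigma) -> data estimate."""
    return x_t / (1.0 + sigma)          # placeholder; the real model is a neural net

def sample(shape, sigmas, rng):
    """One-step sample from pure noise, then optional multistep refinement."""
    x = consistency_fn(rng.normal(size=shape) * sigmas[0], sigmas[0])  # one step
    for sigma in sigmas[1:]:
        # re-noise the current estimate to level sigma, then map back to data
        x_sigma = x + np.sqrt(max(sigma**2 - sigmas[-1]**2, 0.0)) * rng.normal(size=shape)
        x = consistency_fn(x_sigma, sigma)
    return x

rng = np.random.default_rng(0)
img = sample((3, 32, 32), sigmas=[80.0, 24.0, 5.0, 0.5], rng=rng)
print(img.shape)
```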
Human Motion Diffusion as a Generative Prior
March 02, 2023
Yonatan Shafir, Guy Tevet, Roy Kapon, Amit H. Bermano
Recent work has demonstrated the significant potential of denoising diffusion
models for generating human motion, including text-to-motion capabilities.
However, these methods are restricted by the paucity of annotated motion data,
a focus on single-person motions, and a lack of detailed control. In this
paper, we introduce three forms of composition based on diffusion priors:
sequential, parallel, and model composition. Using sequential composition, we
tackle the challenge of long sequence generation. We introduce DoubleTake, an
inference-time method with which we generate long animations consisting of
sequences of prompted intervals and their transitions, using a prior trained
only for short clips. Using parallel composition, we show promising steps
toward two-person generation. Beginning with two fixed priors as well as a few
two-person training examples, we learn a slim communication block, ComMDM, to
coordinate interaction between the two resulting motions. Lastly, using model
composition, we first train individual priors to complete motions that realize
a prescribed motion for a given joint. We then introduce DiffusionBlending, an
interpolation mechanism to effectively blend several such models to enable
flexible and efficient fine-grained joint and trajectory-level control and
editing. We evaluate the composition methods using an off-the-shelf motion
diffusion model, and further compare the results to dedicated models trained
for these specific tasks.
Defending against Adversarial Audio via Diffusion Model
March 02, 2023
Shutong Wu, Jiongxiao Wang, Wei Ping, Weili Nie, Chaowei Xiao
cs.SD, cs.CR, cs.LG, eess.AS
Deep learning models have been widely used in commercial acoustic systems in
recent years. However, adversarial audio examples can cause abnormal behaviors
for those acoustic systems, while being hard for humans to perceive. Various
methods, such as transformation-based defenses and adversarial training, have
been proposed to protect acoustic systems from adversarial attacks, but they
are less effective against adaptive attacks. Furthermore, directly applying the
methods from the image domain can lead to suboptimal results because of the
unique properties of audio data. In this paper, we propose an adversarial
purification-based defense pipeline, AudioPure, for acoustic systems via
off-the-shelf diffusion models. Taking advantage of the strong generation
ability of diffusion models, AudioPure first adds a small amount of noise to
the adversarial audio and then runs the reverse sampling step to purify the
noisy audio and recover clean audio. AudioPure is a plug-and-play method that
can be directly applied to any pretrained classifier without any fine-tuning or
re-training. We conduct extensive experiments on speech command recognition
task to evaluate the robustness of AudioPure. Our method is effective against
diverse adversarial attacks (e.g., $\mathcal{L}_2$- or
$\mathcal{L}_\infty$-norm). It outperforms the existing methods under both
strong adaptive white-box and black-box attacks bounded by $\mathcal{L}_2$- or
$\mathcal{L}_\infty$-norm (up to +20\% in robust accuracy). Besides, we also
evaluate the certified robustness for perturbations bounded by
$\mathcal{L}_2$-norm via randomized smoothing. Our pipeline achieves a higher
certified accuracy than baselines.
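A minimal sketch of the purification step described above: partially diffuse the (possibly adversarial) waveform forward to a small step n*, then run reverse DDPM sampling back to a clean estimate before classification. The noise predictor, classifier, and schedule are toy stand-ins, not the pretrained models used by AudioPure.

```python
import numpy as np

def eps_model(x_t, t):
    """Hypothetical pretrained noise predictor for audio waveforms."""
    return 0.05 * x_t

def classify(audio):
    """Hypothetical pretrained speech-command classifier."""
    return int(audio.mean() > 0)

def purify(adv_audio, n_star, betas, rng):
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)
    # forward: partially diffuse the (possibly adversarial) input to step n*
    x = (np.sqrt(abar[n_star]) * adv_audio
         + np.sqrt(1 - abar[n_star]) * rng.normal(size=adv_audio.shape))
    # reverse: DDPM ancestral steps back to t = 0
    for t in range(n_star, 0, -1):
        eps = eps_model(x, t)
        mean = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        x = mean + (np.sqrt(betas[t]) * rng.normal(size=x.shape) if t > 1 else 0.0)
    return x

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 200)
adv = rng.normal(size=16000)                 # 1 s of adversarial audio (toy)
label = classify(purify(adv, n_star=10, betas=betas, rng=rng))
```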
Understanding the Diffusion Objective as a Weighted Integral of ELBOs
March 01, 2023
Diederik P. Kingma, Ruiqi Gao
To achieve the highest perceptual quality, state-of-the-art diffusion models
are optimized with objectives that typically look very different from the
maximum likelihood and the Evidence Lower Bound (ELBO) objectives. In this
work, we reveal that diffusion model objectives are actually closely related to
the ELBO.
Specifically, we show that all commonly used diffusion model objectives
equate to a weighted integral of ELBOs over different noise levels, where the
weighting depends on the specific objective used. Under the condition of
monotonic weighting, the connection is even closer: the diffusion objective
then equals the ELBO, combined with simple data augmentation, namely Gaussian
noise perturbation. We show that this condition holds for a number of
state-of-the-art diffusion models.
In experiments, we explore new monotonic weightings and demonstrate their
effectiveness, achieving state-of-the-art FID scores on the high-resolution
ImageNet benchmark.
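The toy computation below illustrates the weighted-integral view described above: a diffusion training loss is a noise-prediction error averaged over noise levels with a level-dependent weighting w(.). The specific weighting functions and the log-SNR grid here are illustrative choices, not the paper's exact objectives.

```python
import numpy as np

def eps_model(x_t, logsnr):
    """Hypothetical noise predictor."""
    return 0.1 * x_t

def weighted_diffusion_loss(x0, logsnr_grid, weight_fn, rng):
    """Monte Carlo estimate of E_t[ w(t) * ||eps - eps_hat||^2 ], i.e. a
    weighted average of per-noise-level (ELBO-style) denoising terms."""
    total = 0.0
    for logsnr in logsnr_grid:
        alpha = np.sqrt(1.0 / (1.0 + np.exp(-logsnr)))   # variance-preserving param.
        sigma = np.sqrt(1.0 - alpha**2)
        eps = rng.normal(size=x0.shape)
        x_t = alpha * x0 + sigma * eps
        err = np.mean((eps - eps_model(x_t, logsnr)) ** 2)
        total += weight_fn(logsnr) * err
    return total / len(logsnr_grid)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16, 16))
grid = np.linspace(-10, 10, 64)
uniform_w = lambda s: 1.0                         # plain epsilon-prediction loss
monotonic_w = lambda s: 1.0 / (1.0 + np.exp(s))   # an example monotonic weighting
print(weighted_diffusion_loss(x0, grid, uniform_w, rng),
      weighted_diffusion_loss(x0, grid, monotonic_w, rng))
```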
Continuous-Time Functional Diffusion Processes
March 01, 2023
Giulio Franzese, Giulio Corallo, Simone Rossi, Markus Heinonen, Maurizio Filippone, Pietro Michiardi
We introduce Functional Diffusion Processes (FDPs), which generalize
score-based diffusion models to infinite-dimensional function spaces. FDPs
require a new mathematical framework to describe the forward and backward
dynamics, and several extensions to derive practical training objectives. These
include infinite-dimensional versions of Girsanov theorem, in order to be able
to compute an ELBO, and of the sampling theorem, in order to guarantee that
functional evaluations in a countable set of points are equivalent to
infinite-dimensional functions. We use FDPs to build a new breed of generative
models in function spaces, which do not require specialized network
architectures, and that can work with any kind of continuous data. Our results
on real data show that FDPs achieve high-quality image generation, using a
simple MLP architecture with orders of magnitude fewer parameters than existing
diffusion models.
Unlimited-Size Diffusion Restoration
March 01, 2023
Yinhuai Wang, Jiwen Yu, Runyi Yu, Jian Zhang
Recently, using diffusion models for zero-shot image restoration (IR) has
become a new hot paradigm. This type of method only needs to use the
pre-trained off-the-shelf diffusion models, without any finetuning, and can
directly handle various IR tasks. The upper limit of the restoration
performance depends on the pre-trained diffusion models, which are in rapid
evolution. However, current methods only discuss how to deal with fixed-size
images, but dealing with images of arbitrary sizes is very important for
practical applications. This paper focuses on how to use those diffusion-based
zero-shot IR methods to deal with any size while maintaining the excellent
characteristics of zero-shot methods. A simple way to handle arbitrary sizes is
to divide the image into fixed-size patches and solve each patch independently.
But this may
yield significant artifacts since it neither considers the global semantics of
all patches nor the local information of adjacent patches. Inspired by the
Range-Null space Decomposition, we propose the Mask-Shift Restoration to
address local incoherence and propose the Hierarchical Restoration to alleviate
out-of-domain issues. Our simple, parameter-free approaches can be used not
only for image restoration but also for image generation of unlimited sizes,
with the potential to be a general tool for diffusion models. Code:
https://github.com/wyhuai/DDNM/tree/main/hq_demo
Collage Diffusion
March 01, 2023
Vishnu Sarukkai, Linden Li, Arden Ma, Christopher Ré, Kayvon Fatahalian
We seek to give users precise control over diffusion-based image generation
by modeling complex scenes as sequences of layers, which define the desired
spatial arrangement and visual attributes of objects in the scene. Collage
Diffusion harmonizes the input layers to make objects fit together – the key
challenge involves minimizing changes in the positions and key visual
attributes of the input layers while allowing other attributes to change in the
harmonization process. We ensure that objects are generated in the correct
locations by modifying text-image cross-attention with the layers’ alpha masks.
We preserve key visual attributes of input layers by learning specialized text
representations per layer and by extending ControlNet to operate on layers.
Layer input allows users to control the extent of image harmonization on a
per-object basis, and users can even iteratively edit individual objects in
generated images while keeping other objects fixed. By leveraging the rich
information present in layer input, Collage Diffusion generates globally
harmonized images that maintain desired object characteristics better than
prior approaches.
Diffusion Probabilistic Fields
March 01, 2023
Peiye Zhuang, Samira Abnar, Jiatao Gu, Alex Schwing, Joshua M. Susskind, Miguel Ángel Bautista
Diffusion probabilistic models have quickly become a major approach for
generative modeling of images, 3D geometry, video and other domains. However,
to adapt diffusion generative modeling to these domains the denoising network
needs to be carefully designed for each domain independently, oftentimes under
the assumption that data lives in a Euclidean grid. In this paper we introduce
Diffusion Probabilistic Fields (DPF), a diffusion model that can learn
distributions over continuous functions defined over metric spaces, commonly
known as fields. We extend the formulation of diffusion probabilistic models to
deal with this field parametrization in an explicit way, enabling us to define
an end-to-end learning algorithm that side-steps the requirement of
representing fields with latent vectors as in previous approaches (Dupont et
al., 2022a; Du et al., 2021). We empirically show that, while using the same
denoising network, DPF effectively deals with different modalities like 2D
images and 3D geometry, in addition to modeling distributions over fields
defined on non-Euclidean metric spaces.
Monocular Depth Estimation using Diffusion Models
February 28, 2023
Saurabh Saxena, Abhishek Kar, Mohammad Norouzi, David J. Fleet
We formulate monocular depth estimation using denoising diffusion models,
inspired by their recent successes in high fidelity image generation. To that
end, we introduce innovations to address problems arising due to noisy,
incomplete depth maps in training data, including step-unrolled denoising
diffusion, an $L_1$ loss, and depth infilling during training. To cope with the
limited availability of data for supervised training, we leverage pre-training
on self-supervised image-to-image translation tasks. Despite the simplicity of
the approach, with a generic loss and architecture, our DepthGen model achieves
SOTA performance on the indoor NYU dataset, and near SOTA results on the
outdoor KITTI dataset. Further, with a multimodal posterior, DepthGen naturally
represents depth ambiguity (e.g., from transparent surfaces), and its zero-shot
performance, combined with depth imputation, enables a simple but effective
text-to-3D pipeline. Project page: https://depth-gen.github.io
Reducing the Prior Mismatch of Stochastic Differential Equations for Diffusion-based Speech Enhancement
February 28, 2023
Bunlong Lay, Simon Welker, Julius Richter, Timo Gerkmann
Recently, score-based generative models have been successfully employed for
the task of speech enhancement. A stochastic differential equation is used to
model the iterative forward process, where at each step environmental noise and
white Gaussian noise are added to the clean speech signal. While in the limit
the mean of the forward process ends at the noisy mixture, in practice it stops
earlier and thus only at an approximation of the noisy mixture. This results in
a discrepancy between the terminating distribution of the forward process and
the prior used for solving the reverse process at inference. In this paper, we
address this discrepancy and propose a forward process based on a Brownian
bridge. We show that such a process leads to a reduction of the mismatch
compared to previous diffusion processes. More importantly, we show that our
approach improves in objective metrics over the baseline process with only half
of the iteration steps and having one hyperparameter less to tune.
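A small sketch of the Brownian-bridge idea under simple assumptions: the forward process is pinned so that its mean interpolates from the clean signal to the noisy mixture and its variance vanishes at both endpoints, so the terminal distribution matches the prior used at inference. The schedule and noise scale are illustrative.

```python
import numpy as np

def brownian_bridge_marginal(x_clean, y_noisy, t, T, sigma=0.5, rng=None):
    """Sample the forward-process state at time t of a Brownian bridge pinned
    to the clean speech at t=0 and to the noisy mixture at t=T."""
    if rng is None:
        rng = np.random.default_rng()
    w = t / T
    mean = (1.0 - w) * x_clean + w * y_noisy      # reaches y_noisy exactly at t=T
    var = sigma**2 * t * (T - t) / T              # bridge variance, zero at both ends
    return mean + np.sqrt(var) * rng.normal(size=x_clean.shape)

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)
noisy = clean + 0.3 * rng.normal(size=16000)      # environmental noise added
x_T = brownian_bridge_marginal(clean, noisy, t=1.0, T=1.0, rng=rng)
assert np.allclose(x_T, noisy)                    # no prior mismatch at the end
```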
Can We Use Diffusion Probabilistic Models for 3D Motion Prediction?
February 28, 2023
Hyemin Ahn, Esteve Valls Mascaro, Dongheui Lee
Following the success of the recent diffusion probabilistic model, its
effectiveness in image generation has been actively studied. In this paper,
our objective is to evaluate the potential of
diffusion probabilistic models for 3D human motion-related tasks. To this end,
this paper presents a study of employing diffusion probabilistic models to
predict future 3D human motion(s) from the previously observed motion. Based on
the Human 3.6M and HumanEva-I datasets, our results show that diffusion
probabilistic models are competitive for both single (deterministic) and
multiple (stochastic) 3D motion prediction tasks, after finishing a single
training process. In addition, we find out that diffusion probabilistic models
can offer an attractive compromise, since they can strike the right balance
between the likelihood and diversity of the predicted future motions. Our code
is publicly available on the project website:
https://sites.google.com/view/diffusion-motion-prediction.
Towards Enhanced Controllability of Diffusion Models
February 28, 2023
Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, Ajinkya Kale
Denoising Diffusion models have shown remarkable capabilities in generating
realistic, high-quality and diverse images. However, the extent of
controllability during generation is underexplored. Inspired by techniques
based on GAN latent space for image manipulation, we train a diffusion model
conditioned on two latent codes, a spatial content mask and a flattened style
embedding. We rely on the inductive bias of the progressive denoising process
of diffusion models to encode pose/layout information in the spatial structure
mask and semantic/style information in the style code. We propose two generic
sampling techniques for improving controllability. We extend composable
diffusion models to allow for some dependence between conditional inputs, to
improve the quality of generations while also providing control over the amount
of guidance from each latent code and their joint distribution. We also propose
timestep dependent weight scheduling for content and style latents to further
improve the translations. We observe better controllability compared to
existing methods and show that without explicit training objectives, diffusion
models can be used for effective image manipulation and image translation.
Differentially Private Diffusion Models Generate Useful Synthetic Images
February 27, 2023
Sahra Ghalebikesabi, Leonard Berrada, Sven Gowal, Ira Ktena, Robert Stanforth, Jamie Hayes, Soham De, Samuel L. Smith, Olivia Wiles, Borja Balle
cs.LG, cs.CR, cs.CV, stat.ML
The ability to generate privacy-preserving synthetic versions of sensitive
image datasets could unlock numerous ML applications currently constrained by
data availability. Due to their astonishing image generation quality, diffusion
models are a prime candidate for generating high-quality synthetic data.
However, recent studies have found that, by default, the outputs of some
diffusion models do not preserve training data privacy. By privately
fine-tuning ImageNet pre-trained diffusion models with more than 80M
parameters, we obtain SOTA results on CIFAR-10 and Camelyon17 in terms of both
FID and the accuracy of downstream classifiers trained on synthetic data. We
decrease the SOTA FID on CIFAR-10 from 26.2 to 9.8, and increase the accuracy
from 51.0% to 88.0%. On synthetic data from Camelyon17, we achieve a downstream
accuracy of 91.1% which is close to the SOTA of 96.5% when training on the real
data. We leverage the ability of generative models to create infinite amounts
of data to maximise the downstream prediction performance, and further show how
to use synthetic data for hyperparameter tuning. Our results demonstrate that
diffusion models fine-tuned with differential privacy can produce useful and
provably private synthetic data, even in applications with significant
distribution shift between the pre-training and fine-tuning distributions.
Denoising Diffusion Samplers
February 27, 2023
Francisco Vargas, Will Grathwohl, Arnaud Doucet
Denoising diffusion models are a popular class of generative models providing
state-of-the-art results in many domains. One gradually adds noise to data
using a diffusion to transform the data distribution into a Gaussian
distribution. Samples from the generative model are then obtained by simulating
an approximation of the time-reversal of this diffusion initialized by Gaussian
samples. Practically, the intractable score terms appearing in the
time-reversed process are approximated using score matching techniques. We
explore here a similar idea to sample approximately from unnormalized
probability density functions and estimate their normalizing constants. We
consider a process where the target density diffuses towards a Gaussian.
Denoising Diffusion Samplers (DDS) are obtained by approximating the
corresponding time-reversal. While score matching is not applicable in this
context, we can leverage many of the ideas introduced in generative modeling
for Monte Carlo sampling. Existing theoretical results from denoising diffusion
models also provide theoretical guarantees for DDS. We discuss the connections
between DDS, optimal control and Schrödinger bridges and finally demonstrate
DDS experimentally on a variety of challenging sampling tasks.
Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech
February 27, 2023
Jiyoung Lee, Joon Son Chung, Soo-Whan Chung
cs.LG, cs.CV, cs.SD, eess.AS
The goal of this work is zero-shot text-to-speech synthesis, with speaking
styles and voices learnt from facial characteristics. Inspired by the natural
fact that people can imagine the voice of someone when they look at his or her
face, we introduce a face-styled diffusion text-to-speech (TTS) model within a
unified framework learnt from visible attributes, called Face-TTS. This is the
first time that face images are used as a condition to train a TTS model.
We jointly train cross-model biometrics and TTS models to preserve speaker
identity between face images and generated speech segments. We also propose a
speaker feature binding loss to enforce the similarity of the generated and the
ground truth speech segments in speaker embedding space. Since the biometric
information is extracted directly from the face image, our method does not
require extra fine-tuning steps to generate speech from unseen and unheard
speakers. We train and evaluate the model on the LRS3 dataset, an in-the-wild
audio-visual corpus containing background noise and diverse speaking styles.
The project page is https://facetts.github.io.
February 26, 2023
Yifan Jiang, Han Chen, Hanseok Ko
Recently, skeleton-based human action recognition has become a hot research
topic because the compact representation of human skeletons brings fresh
momentum to this research domain. As a result, researchers have begun to notice
the importance of using RGB or other sensors to analyze human action by
extracting skeleton information.
Leveraging the rapid development of deep learning (DL), a significant number of
skeleton-based human action approaches have been presented with fine-designed
DL structures recently. However, a well-trained DL model always demands
high-quality and sufficient data, which is hard to obtain without costing high
expenses and human labor. In this paper, we introduce a novel data augmentation
method for skeleton-based action recognition tasks, which can effectively
generate high-quality and diverse sequential actions. In order to obtain
natural and realistic action sequences, we propose denoising diffusion
probabilistic models (DDPMs) that can generate a series of synthetic action
sequences, and their generation process is precisely guided by a
spatial-temporal transformer (ST-Trans). Experimental results show that our
method outperforms the state-of-the-art (SOTA) motion generation approaches on
different naturality and diversity metrics. It proves that its high-quality
synthetic data can also be effectively deployed to existing action recognition
models with significant performance improvement.
Diffusion Model-Augmented Behavioral Cloning
February 26, 2023
Hsiang-Chun Wang, Shang-Fu Chen, Ming-Hao Hsu, Chun-Mao Lai, Shao-Hua Sun
Imitation learning addresses the challenge of learning by observing an
expert’s demonstrations without access to reward signals from environments.
Most existing imitation learning methods that do not require interacting with
environments either model the expert distribution as the conditional
probability p(a|s) (e.g., behavioral cloning, BC) or the joint probability p(s,
a). Despite its simplicity, modeling the conditional probability with BC
usually struggles with generalization. While modeling the joint probability can
lead to improved generalization performance, the inference procedure is often
time-consuming and the model can suffer from manifold overfitting. This work
proposes an imitation learning framework that benefits from modeling both the
conditional and joint probability of the expert distribution. Our proposed
diffusion model-augmented behavioral cloning (DBC) employs a diffusion model
trained to model expert behaviors and learns a policy to optimize both the BC
loss (conditional) and our proposed diffusion model loss (joint). DBC
outperforms baselines in various continuous control tasks in navigation, robot
arm manipulation, dexterous manipulation, and locomotion. We design additional
experiments to verify the limitations of modeling either the conditional
probability or the joint probability of the expert distribution as well as
compare different generative models. Ablation studies justify the effectiveness
of our design choices.
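A minimal sketch of combining the two objectives described above: a standard behavioral-cloning term on p(a|s) plus a diffusion-model term that scores the joint (state, action) produced by the policy. The networks, the noising rule, and the weighting are placeholders; in the described method the diffusion model is trained on expert behaviors beforehand and kept fixed.

```python
import torch
import torch.nn as nn

state_dim, act_dim = 8, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
# Stand-in for a diffusion model over expert (state, action) pairs: it predicts
# the noise added to a noisy (s, a) vector at diffusion "time" t. In the real
# method this network would be pretrained on expert data and frozen.
eps_net = nn.Sequential(nn.Linear(state_dim + act_dim + 1, 64), nn.ReLU(),
                        nn.Linear(64, state_dim + act_dim))

def dbc_loss(states, expert_actions, lam=0.5):
    # conditional term: standard behavioral cloning on p(a|s)
    bc = ((policy(states) - expert_actions) ** 2).mean()
    # joint term: denoising error of the diffusion model on (s, pi(s)),
    # used here as a proxy for how well the joint model explains the policy
    sa = torch.cat([states, policy(states)], dim=-1)
    t = torch.rand(states.shape[0], 1)
    noise = torch.randn_like(sa)
    noisy_sa = torch.sqrt(1 - t) * sa + torch.sqrt(t) * noise
    diff = ((eps_net(torch.cat([noisy_sa, t], dim=-1)) - noise) ** 2).mean()
    return bc + lam * diff

states = torch.randn(32, state_dim)
expert_actions = torch.randn(32, act_dim)
loss = dbc_loss(states, expert_actions)
loss.backward()
```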
Directed Diffusion: Direct Control of Object Placement through Attention Guidance
February 25, 2023
Wan-Duo Kurt Ma, J. P. Lewis, Avisek Lahiri, Thomas Leung, W. Bastiaan Kleijn
Text-guided diffusion models such as DALLE-2, Imagen, eDiff-I, and Stable
Diffusion are able to generate an effectively endless variety of images given
only a short text prompt describing the desired image content. In many cases
the images are of very high quality. However, these models often struggle to
compose scenes containing several key objects such as characters in specified
positional relationships. The missing capability to "direct" the placement of
characters and objects both within and across images is crucial in
storytelling, as recognized in the literature on film and animation theory. In
this work, we take a particularly straightforward approach to providing the
needed direction. Drawing on the observation that the cross-attention maps for
prompt words reflect the spatial layout of objects denoted by those words, we
introduce an optimization objective that produces "activation" at desired
positions in these cross-attention maps. The resulting approach is a step
toward generalizing the applicability of text-guided diffusion models beyond
single images to collections of related images, as in storybooks. Directed
Diffusion provides easy high-level positional control over multiple objects,
while making use of an existing pre-trained model and maintaining a coherent
blend between the positioned objects and the background. Moreover, it requires
only a few lines to implement.
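As a toy version of the attention-guidance idea, the snippet below optimizes an objective that rewards cross-attention mass for a prompt token inside a desired region; in the real method this steering acts on the cross-attention maps of a pretrained text-to-image model during the early denoising steps, whereas here a free-standing attention map is optimized directly for illustration.

```python
import torch

def placement_loss(cross_attn, region_mask):
    """cross_attn: (H*W,) softmax-normalized attention map for one prompt token.
    region_mask: (H*W,) binary mask marking where the object should appear.
    The loss rewards attention mass inside the region."""
    inside = (cross_attn * region_mask).sum()
    return -(inside / (cross_attn.sum() + 1e-8))

# toy 16x16 attention map optimized directly (stand-in for editing attention
# inside a pretrained text-to-image diffusion model)
logits = torch.zeros(256, requires_grad=True)
mask = torch.zeros(256)
mask[:64] = 1.0                                   # want the object in the top rows
opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = placement_loss(torch.softmax(logits, dim=0), mask)
    loss.backward()
    opt.step()
print(float((torch.softmax(logits, 0) * mask).sum()))  # mass inside the region
```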
Denoising diffusion algorithm for inverse design of microstructures with fine-tuned nonlinear material properties
February 24, 2023
Nikolaos N. Vlassis, WaiChing Sun
In this paper, we introduce a denoising diffusion algorithm to discover
microstructures with nonlinear fine-tuned properties. Denoising diffusion
probabilistic models are generative models that use diffusion-based dynamics to
gradually denoise images and generate realistic synthetic samples. By learning
the reverse of a Markov diffusion process, we design an artificial intelligence
to efficiently manipulate the topology of microstructures to generate a massive
number of prototypes that exhibit constitutive responses sufficiently close to
designated nonlinear constitutive responses. To identify the subset of
microstructures with sufficiently precise fine-tuned properties, a
convolutional neural network surrogate is trained to replace high-fidelity
finite element simulations to filter out prototypes outside the admissible
range. The results of this study indicate that the denoising diffusion process
is capable of creating microstructures of fine-tuned nonlinear material
properties within the latent space of the training data. More importantly, the
resulting algorithm can be easily extended to incorporate additional
topological and geometric modifications by introducing high-dimensional
structures embedded in the latent space. The algorithm is tested on the
open-source mechanical MNIST data set. Consequently, this algorithm is not only
capable of performing inverse design of nonlinear effective media but also
learns the nonlinear structure-property map to quantitatively understand the
multiscale interplay among the geometry and topology and their effective
macroscopic properties.
To the Noise and Back: Diffusion for Shared Autonomy
February 23, 2023
Takuma Yoneda, Luzhe Sun, Ge Yang, Bradly Stadie, Matthew Walter
Shared autonomy is an operational concept in which a user and an autonomous
agent collaboratively control a robotic system. It provides a number of
advantages over the extremes of full-teleoperation and full-autonomy in many
settings. Traditional approaches to shared autonomy rely on knowledge of the
environment dynamics, a discrete space of user goals that is known a priori, or
knowledge of the user’s policy – assumptions that are unrealistic in many
domains. Recent works relax some of these assumptions by formulating shared
autonomy with model-free deep reinforcement learning (RL). In particular, they
no longer need knowledge of the goal space (e.g., that the goals are discrete
or constrained) or environment dynamics. However, they need knowledge of a
task-specific reward function to train the policy. Unfortunately, such reward
specification can be a difficult and brittle process. On top of that, the
formulations inherently rely on human-in-the-loop training, which requires
preparing a policy that mimics users’ behavior. In this
paper, we present a new approach to shared autonomy that employs a modulation
of the forward and reverse diffusion process of diffusion models. Our approach
does not assume known environment dynamics or the space of user goals, and in
contrast to previous work, it does not require any reward feedback, nor does it
require access to the user’s policy during training. Instead, our framework
learns a distribution over a space of desired behaviors. It then employs a
diffusion model to translate the user’s actions to a sample from this
distribution. Crucially, we show that it is possible to carry out this process
in a manner that preserves the user’s control authority. We evaluate our
framework on a series of challenging continuous control tasks, and analyze its
ability to effectively correct user actions while maintaining their autonomy.
DiffusioNeRF: Regularizing Neural Radiance Fields with Denoising Diffusion Models
February 23, 2023
Jamie Wynn, Daniyar Turmukhambetov
Under good conditions, Neural Radiance Fields (NeRFs) have shown impressive
results on novel view synthesis tasks. NeRFs learn a scene’s color and density
fields by minimizing the photometric discrepancy between training views and
differentiable renderings of the scene. Once trained from a sufficient set of
views, NeRFs can generate novel views from arbitrary camera positions. However,
the scene geometry and color fields are severely under-constrained, which can
lead to artifacts, especially when trained with few input views.
To alleviate this problem we learn a prior over scene geometry and color,
using a denoising diffusion model (DDM). Our DDM is trained on RGBD patches of
the synthetic Hypersim dataset and can be used to predict the gradient of the
logarithm of a joint probability distribution of color and depth patches. We
show that these gradients of logarithms of RGBD patch priors serve to
regularize geometry and color of a scene. During NeRF training, random RGBD
patches are rendered and the estimated gradient of the log-likelihood is
backpropagated to the color and density fields. Evaluations on LLFF, the most
relevant dataset, show that our learned prior achieves improved quality in the
reconstructed geometry and improved generalization to novel views. Evaluations
on DTU show improved reconstruction quality among NeRF methods.
Modeling Molecular Structures with Intrinsic Diffusion Models
February 23, 2023
Gabriele Corso
Since its foundations, more than one hundred years ago, the field of
structural biology has strived to understand and analyze the properties of
molecules and their interactions by studying the structure that they take in 3D
space. However, a fundamental challenge with this approach has been the dynamic
nature of these particles, which forces us to model not a single but a whole
distribution of structures for every molecular system. This thesis proposes
Intrinsic Diffusion Modeling, a novel approach to this problem based on
combining diffusion generative models with scientific knowledge about the
flexibility of biological complexes. The knowledge of these degrees of freedom
is translated into the definition of a manifold over which the diffusion
process is defined. This manifold significantly reduces the dimensionality and
increases the smoothness of the generation space allowing for significantly
faster and more accurate generative processes. We demonstrate the effectiveness
of this approach on two fundamental tasks at the basis of computational
chemistry and biology: molecular conformer generation and molecular docking. In
both tasks, we construct the first deep learning method to outperform
traditional computational approaches achieving an unprecedented level of
accuracy for scalable programs.
Aligned Diffusion Schrodinger Bridges
February 22, 2023
Vignesh Ram Somnath, Matteo Pariset, Ya-Ping Hsieh, Maria Rodriguez Martinez, Andreas Krause, Charlotte Bunne
Diffusion Schrödinger bridges (DSB) have recently emerged as a powerful
framework for recovering stochastic dynamics via their marginal observations at
different time points. Despite numerous successful applications, existing
algorithms for solving DSBs have so far failed to utilize the structure of
aligned data, which naturally arises in many biological phenomena. In this
paper, we propose a novel algorithmic framework that, for the first time,
solves DSBs while respecting the data alignment. Our approach hinges on a
combination of two decades-old ideas: the classical Schrödinger bridge theory
and Doob’s $h$-transform. Compared to prior methods, our approach leads to a
simpler training procedure with lower variance, which we further augment with
principled regularization schemes. This ultimately leads to sizeable
improvements across experiments on synthetic and real data, including the tasks
of rigid protein docking and temporal evolution of cellular differentiation
processes.
On Calibrating Diffusion Probabilistic Models
February 21, 2023
Tianyu Pang, Cheng Lu, Chao Du, Min Lin, Shuicheng Yan, Zhijie Deng
Recently, diffusion probabilistic models (DPMs) have achieved promising
results in diverse generative tasks. A typical DPM framework includes a forward
process that gradually diffuses the data distribution and a reverse process
that recovers the data distribution from time-dependent data scores. In this
work, we observe that the stochastic reverse process of data scores is a
martingale, from which concentration bounds and the optional stopping theorem
for data scores can be derived. Then, we discover a simple way for calibrating
an arbitrary pretrained DPM, with which the score matching loss can be reduced
and the lower bounds of model likelihood can consequently be increased. We
provide general calibration guidelines under various model parametrizations.
Our calibration method is performed only once and the resulting models can be
used repeatedly for sampling. We conduct experiments on multiple datasets to
empirically validate our proposal. Our code is at
https://github.com/thudzj/Calibrated-DPMs.
PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
February 21, 2023
Luke Melas-Kyriazi, Christian Rupprecht, Andrea Vedaldi
Reconstructing the 3D shape of an object from a single RGB image is a
long-standing and highly challenging problem in computer vision. In this paper,
we propose a novel method for single-image 3D reconstruction which generates a
sparse point cloud via a conditional denoising diffusion process. Our method
takes as input a single RGB image along with its camera pose and gradually
denoises a set of 3D points, whose positions are initially sampled randomly
from a three-dimensional Gaussian distribution, into the shape of an object.
The key to our method is a geometrically-consistent conditioning process which
we call projection conditioning: at each step in the diffusion process, we
project local image features onto the partially-denoised point cloud from the
given camera pose. This projection conditioning process enables us to generate
high-resolution sparse geometries that are well-aligned with the input image,
and can additionally be used to predict point colors after shape
reconstruction. Moreover, due to the probabilistic nature of the diffusion
process, our method is naturally capable of generating multiple different
shapes consistent with a single input image. In contrast to prior work, our
approach not only performs well on synthetic benchmarks, but also gives large
qualitative improvements on complex real-world data.
Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels
February 21, 2023
Zebin You, Yong Zhong, Fan Bao, Jiacheng Sun, Chongxuan Li, Jun Zhu
In an effort to further advance semi-supervised generative and classification
tasks, we propose a simple yet effective training strategy called dual pseudo
training (DPT), built upon strong semi-supervised learners and diffusion
models. DPT operates in three stages: training a classifier on partially
labeled data to predict pseudo-labels; training a conditional generative model
using these pseudo-labels to generate pseudo images; and retraining the
classifier with a mix of real and pseudo images. Empirically, DPT consistently
achieves SOTA performance on semi-supervised generation and classification
across various settings. In particular, with one or two labels per class, DPT
achieves a Fréchet Inception Distance (FID) score of 3.08 or 2.52 on ImageNet
256x256. Besides, DPT outperforms competitive semi-supervised baselines
substantially on ImageNet classification tasks, achieving top-1 accuracies of
59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0) with one, two, or five labels per
class, respectively. Notably, our results demonstrate that diffusion can
generate realistic images with only a few labels (e.g., <0.1%) and generative
augmentation remains viable for semi-supervised classification. Our code is
available at https://github.com/ML-GSAI/DPT.
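As a reading aid, the three DPT stages could be orchestrated roughly as follows; train_classifier, train_cond_generator, and the predict/sample methods are caller-supplied placeholders, not part of the released code.

def dual_pseudo_training(train_classifier, train_cond_generator,
                         labeled_set, unlabeled_set, classes, n_per_class):
    # Stage 1: semi-supervised classifier predicts pseudo-labels for unlabeled data.
    clf = train_classifier(labeled_set, unlabeled_set)
    pseudo = [(x, clf.predict(x)) for x in unlabeled_set]
    # Stage 2: conditional generative model trained with real and pseudo labels.
    gen = train_cond_generator(list(labeled_set) + pseudo)
    synthetic = [(gen.sample(c), c) for c in classes for _ in range(n_per_class)]
    # Stage 3: retrain the classifier on the mix of real and generated images.
    clf = train_classifier(list(labeled_set) + synthetic, unlabeled_set)
    return clf, gen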
Diffusion Probabilistic Models for Graph-Structured Prediction
February 21, 2023
Hyosoon Jang, Seonghyun Park, Sangwoo Mo, Sungsoo Ahn
This paper studies structured node classification on graphs, where the
predictions should consider dependencies between the node labels. In
particular, we focus on solving the problem for partially labeled graphs where
it is essential to incorporate the information in the known label for
predicting the unknown labels. To address this issue, we propose a novel
framework leveraging the diffusion probabilistic model for structured node
classification (DPM-SNC). At the heart of our framework is the extraordinary
capability of DPM-SNC to (a) learn a joint distribution over the labels with an
expressive reverse diffusion process and (b) make predictions conditioned on
the known labels utilizing manifold-constrained sampling. Since the DPMs lack
training algorithms for partially labeled data, we design a novel training
algorithm to apply DPMs, maximizing a new variational lower bound. We also
theoretically analyze how DPMs benefit node classification by enhancing the
expressive power of GNNs; this analysis builds on AGG-WL, a test strictly more
powerful than the classic 1-WL test. We extensively verify the superiority of
our DPM-SNC in diverse scenarios, which include not only the transductive
setting on partially labeled graphs but also the inductive setting and
unlabeled graphs.
Learning Gradually Non-convex Image Priors Using Score Matching
February 21, 2023
Erich Kobler, Thomas Pock
cs.LG, cs.CV, I.2.6; I.4.10
In this paper, we propose a unified framework of denoising score-based models
in the context of graduated non-convex energy minimization. We show that for
sufficiently large noise variance, the associated negative log density – the
energy – becomes convex. Consequently, denoising score-based models
essentially follow a graduated non-convexity heuristic. We apply this framework
to learning generalized Fields of Experts image priors that approximate the
joint density of noisy images and their associated variances. These priors can
be easily incorporated into existing optimization algorithms for solving
inverse problems and naturally implement a fast and robust graduated
non-convexity mechanism.
Infinite-Dimensional Diffusion Models for Function Spaces
February 20, 2023
Jakiw Pidstrigach, Youssef Marzouk, Sebastian Reich, Sven Wang
stat.ML, cs.LG, math.PR, 68T99, 60Hxx
Diffusion models have had a profound impact on many application areas,
including those where data are intrinsically infinite-dimensional, such as
images or time series. The standard approach is first to discretize and then to
apply diffusion models to the discretized data. While such approaches are
practically appealing, the performance of the resulting algorithms typically
deteriorates as discretization parameters are refined. In this paper, we
instead directly formulate diffusion-based generative models in infinite
dimensions and apply them to the generative modeling of functions. We prove
that our formulations are well posed in the infinite-dimensional setting and
provide dimension-independent distance bounds from the sample to the target
measure. Using our theory, we also develop guidelines for the design of
infinite-dimensional diffusion models. For image distributions, these
guidelines are in line with the canonical choices currently made for diffusion
models. For other distributions, however, we can improve upon these canonical
choices, which we show both theoretically and empirically, by applying the
algorithms to data distributions on manifolds and inspired by Bayesian inverse
problems or simulation-based inference.
NerfDiff: Single-image View Synthesis with NeRF-guided Distillation from 3D-aware Diffusion
February 20, 2023
Jiatao Gu, Alex Trevithick, Kai-En Lin, Josh Susskind, Christian Theobalt, Lingjie Liu, Ravi Ramamoorthi
Novel view synthesis from a single image requires inferring occluded regions
of objects and scenes whilst simultaneously maintaining semantic and physical
consistency with the input. Existing approaches condition neural radiance
fields (NeRF) on local image features, projecting points to the input image
plane, and aggregating 2D features to perform volume rendering. However, under
severe occlusion, this projection fails to resolve uncertainty, resulting in
blurry renderings that lack details. In this work, we propose NerfDiff, which
addresses this issue by distilling the knowledge of a 3D-aware conditional
diffusion model (CDM) into NeRF through synthesizing and refining a set of
virtual views at test time. We further propose a novel NeRF-guided distillation
algorithm that simultaneously generates 3D consistent virtual views from the
CDM samples, and finetunes the NeRF based on the improved virtual views. Our
approach significantly outperforms existing NeRF-based and geometry-free
approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.
DINOISER: Diffused Conditional Sequence Learning by Manipulating Noises
February 20, 2023
Jiasheng Ye, Zaixiang Zheng, Yu Bao, Lihua Qian, Mingxuan Wang
While diffusion models have achieved great success in generating continuous
signals such as images and audio, it remains elusive for diffusion models in
learning discrete sequence data like natural languages. Although recent
advances circumvent this challenge of discreteness by embedding discrete tokens
as continuous surrogates, they still fall short of satisfactory generation
quality. To understand this, we first dive deep into the denoising training
protocol of diffusion-based sequence generative models and identify three severe
problems: 1) failure to learn, 2) lack of scalability, and 3) neglect of source
conditions. We argue that these problems stem from discreteness that is not fully
eliminated in the embedding space, and that the noise scale is the decisive
factor. In this paper, we
introduce DINOISER to facilitate diffusion models for sequence generation by
manipulating noises. We propose to adaptively determine the range of sampled
noise scales for counter-discreteness training; and encourage the proposed
diffused sequence learner to leverage source conditions with amplified noise
scales during inference. Experiments show that DINOISER enables consistent
improvement over the baselines of previous diffusion-based sequence generative
models on several conditional sequence modeling benchmarks thanks to both
effective training and inference strategies. Analyses further verify that
DINOISER can make better use of source conditions to govern its generative
process.
Restoration based Generative Models
February 20, 2023
Jaemoo Choi, Yesom Park, Myungjoo Kang
Denoising diffusion models (DDMs) have recently attracted increasing
attention by showing impressive synthesis quality. DDMs are built on a
diffusion process that pushes data to the noise distribution and the models
learn to denoise. In this paper, we establish the interpretation of DDMs in
terms of image restoration (IR). Integrating IR literature allows us to use an
alternative objective and diverse forward processes, not confining to the
diffusion process. By imposing prior knowledge on the loss function grounded on
MAP-based estimation, we eliminate the need for the expensive sampling of DDMs.
Also, we propose a multi-scale training, which improves the performance
compared to the diffusion process, by taking advantage of the flexibility of
the forward process. Experimental results demonstrate that our model improves
the quality and efficiency of both training and inference. Furthermore, we show
the applicability of our model to inverse problems. We believe that our
framework paves the way for designing a new type of flexible general generative
model.
Consistent Diffusion Models: Mitigating Sampling Drift by Learning to be Consistent
February 17, 2023
Giannis Daras, Yuval Dagan, Alexandros G. Dimakis, Constantinos Daskalakis
cs.LG, cs.AI, cs.CV, cs.IT, math.IT
Imperfect score-matching leads to a shift between the training and the
sampling distribution of diffusion models. Due to the recursive nature of the
generation process, errors in previous steps yield sampling iterates that drift
away from the training distribution. Yet, the standard training objective via
Denoising Score Matching (DSM) is only designed to optimize over non-drifted
data. To train on drifted data, we propose to enforce a consistency
property which states that predictions of the model on its own generated data
are consistent across time. Theoretically, we show that if the score is learned
perfectly on some non-drifted points (via DSM) and if the consistency property
is enforced everywhere, then the score is learned accurately everywhere.
Empirically we show that our novel training objective yields state-of-the-art
results for conditional and unconditional generation in CIFAR-10 and baseline
improvements in AFHQ and FFHQ. We open-source our code and models:
https://github.com/giannisdaras/cdm
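One plausible way to instantiate such a consistency penalty in PyTorch is sketched below: take one deterministic (DDIM-style) reverse step with the model's own prediction, then ask the resulting x0-estimate to agree with the earlier one. This is an illustration of the idea under our own assumptions, not the authors' exact objective; see their repository for the real implementation.

import torch

def consistency_loss(eps_theta, x_t, t, t_prev, alphas_cumprod):
    """x_t: (B, C, H, W); t > t_prev are integer timesteps; alphas_cumprod: 1-D tensor."""
    b = x_t.shape[0]
    a_t = alphas_cumprod[t]
    a_p = alphas_cumprod[t_prev]
    tt = torch.full((b,), t, device=x_t.device, dtype=torch.long)
    tp = torch.full((b,), t_prev, device=x_t.device, dtype=torch.long)

    eps_t = eps_theta(x_t, tt)
    x0_t = (x_t - (1 - a_t).sqrt() * eps_t) / a_t.sqrt()
    # One deterministic (DDIM-like) reverse step using the model's own prediction,
    # i.e. a point the sampler would actually visit (possibly drifted).
    x_prev = a_p.sqrt() * x0_t + (1 - a_p).sqrt() * eps_t
    eps_p = eps_theta(x_prev, tp)
    x0_p = (x_prev - (1 - a_p).sqrt() * eps_p) / a_p.sqrt()
    # The prediction on the model-generated iterate should match the earlier one.
    return ((x0_p - x0_t.detach()) ** 2).mean()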
Text-driven Visual Synthesis with Latent Diffusion Prior
February 16, 2023
Ting-Hsuan Liao, Songwei Ge, Yiran Xu, Yao-Chih Lee, Badour AlBahar, Jia-Bin Huang
There has been tremendous progress in large-scale text-to-image synthesis
driven by diffusion models enabling versatile downstream applications such as
3D object synthesis from texts, image editing, and customized generation. We
present a generic approach using latent diffusion models as powerful image
priors for various visual synthesis tasks. Existing methods that utilize such
priors fail to use these models’ full capabilities. To improve this, our core
ideas are 1) a feature matching loss between features from different layers of
the decoder to provide detailed guidance and 2) a KL divergence loss to
regularize the predicted latent features and stabilize the training. We
demonstrate the efficacy of our approach on three different applications,
text-to-3D, StyleGAN adaptation, and layered image editing. Extensive results
show our method compares favorably against baselines.
T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models
February 16, 2023
Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie
cs.CV, cs.AI, cs.LG, cs.MM
The incredible generative ability of large-scale text-to-image (T2I) models
has demonstrated strong power of learning complex structures and meaningful
semantics. However, relying solely on text prompts cannot fully take advantage
of the knowledge learned by the model, especially when flexible and accurate
controlling (e.g., color and structure) is needed. In this paper, we aim to
“dig out” the capabilities that T2I models have implicitly learned, and then
explicitly use them to control the generation more granularly. Specifically, we
propose to learn simple and lightweight T2I-Adapters to align internal
knowledge in T2I models with external control signals, while freezing the
original large T2I models. In this way, we can train various adapters according
to different conditions, achieving rich control and editing effects in the
color and structure of the generation results. Further, the proposed
T2I-Adapters have attractive properties of practical value, such as
composability and generalization ability. Extensive experiments demonstrate
that our T2I-Adapter has promising generation quality and a wide range of
applications.
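To make the "lightweight adapter on a frozen backbone" idea concrete, here is a toy PyTorch adapter that maps a conditioning image (e.g., an edge map) to multi-scale features meant to be added to the frozen U-Net encoder's features. The channel widths and block design are placeholders of ours, not the published T2I-Adapter architecture.

import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Toy adapter: conditioning image -> one feature map per U-Net resolution."""
    def __init__(self, cond_channels=3, widths=(64, 128, 256)):
        super().__init__()
        blocks, c_in = [], cond_channels
        for c_out in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                nn.SiLU(),
                nn.Conv2d(c_out, c_out, 3, padding=1),
            ))
            c_in = c_out
        self.blocks = nn.ModuleList(blocks)

    def forward(self, cond):
        feats, h = [], cond
        for block in self.blocks:
            h = block(h)
            feats.append(h)   # added to the frozen backbone's features at this scale
        return feats

# Only the adapter's parameters would be optimized; the large T2I model stays frozen.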
Explicit Diffusion of Gaussian Mixture Model Based Image Priors
February 16, 2023
Martin Zach, Thomas Pock, Erich Kobler, Antonin Chambolle
In this work we tackle the problem of estimating the density $f_X$ of a
random variable $X$ by successive smoothing, such that the smoothed random
variable $Y$ fulfills $(\partial_t - \Delta_1)f_Y(\,\cdot\,, t) = 0$,
$f_Y(\,\cdot\,, 0) = f_X$. With a focus on image processing, we propose a
product/fields of experts model with Gaussian mixture experts that admits an
analytic expression for $f_Y (\,\cdot\,, t)$ under an orthogonality constraint
on the filters. This construction naturally allows the model to be trained
simultaneously over the entire diffusion horizon using empirical Bayes. We show
preliminary results on image denoising where our model leads to competitive
results while being tractable, interpretable, and having only a small number of
learnable parameters. As a byproduct, our model can be used for reliable noise
estimation, allowing blind denoising of images corrupted by heteroscedastic
noise.
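A one-dimensional special case (our own worked example, simpler than the filter-based experts in the paper) shows why Gaussian mixture experts keep the smoothed density analytic: the heat equation above is solved by convolution with a Gaussian of variance $2t$, so a mixture stays a mixture with inflated variances,

$$ f_X(x) = \sum_i w_i\,\mathcal{N}(x;\mu_i,\sigma_i^2) \;\Longrightarrow\; f_Y(x,t) = \bigl(f_X * \mathcal{N}(0,2t)\bigr)(x) = \sum_i w_i\,\mathcal{N}\bigl(x;\mu_i,\sigma_i^2 + 2t\bigr). $$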
Exploring the Representation Manifolds of Stable Diffusion Through the Lens of Intrinsic Dimension
February 16, 2023
Henry Kvinge, Davis Brown, Charles Godfrey
Prompting has become an important mechanism by which users can more
effectively interact with many flavors of foundation model. Indeed, the last
several years have shown that well-honed prompts can sometimes unlock emergent
capabilities within such models. While there has been a substantial amount of
empirical exploration of prompting within the community, relatively few works
have studied prompting at a mathematical level. In this work we aim to take a
first step towards understanding basic geometric properties induced by prompts
in Stable Diffusion, focusing on the intrinsic dimension of internal
representations within the model. We find that choice of prompt has a
substantial impact on the intrinsic dimension of representations at both layers
of the model which we explored, but that the nature of this impact depends on
the layer being considered. For example, in certain bottleneck layers of the
model, intrinsic dimension of representations is correlated with prompt
perplexity (measured using a surrogate model), while this correlation is not
apparent in the latent layers. Our evidence suggests that intrinsic dimension
could be a useful tool for future studies of the impact of different prompts on
text-to-image models.
LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation
February 16, 2023
Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, Mu Li
Layout-to-image generation refers to the task of synthesizing photo-realistic
images based on semantic layouts. In this paper, we propose LayoutDiffuse that
adapts a foundational diffusion model pretrained on large-scale image or
text-image datasets for layout-to-image generation. By adopting a novel neural
adaptor based on layout attention and task-aware prompts, our method trains
efficiently, generates images with both high perceptual quality and layout
alignment, and needs less data. Experiments on three datasets show that our
method significantly outperforms 10 other generative models based on GANs,
VQ-VAE, and diffusion models.
MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation
February 16, 2023
Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel
Recent advances in text-to-image generation with diffusion models present
transformative capabilities in image quality. However, user controllability of
the generated image and fast adaptation to new tasks remain open challenges,
currently addressed mostly by costly and lengthy re-training, fine-tuning, or
ad-hoc adaptations to specific image generation tasks. In this
work, we present MultiDiffusion, a unified framework that enables versatile and
controllable image generation, using a pre-trained text-to-image diffusion
model, without any further training or finetuning. At the center of our
approach is a new generation process, based on an optimization task that binds
together multiple diffusion generation processes with a shared set of
parameters or constraints. We show that MultiDiffusion can be readily applied
to generate high quality and diverse images that adhere to user-provided
controls, such as desired aspect ratio (e.g., panorama), and spatial guiding
signals, ranging from tight segmentation masks to bounding boxes. Project
webpage: https://multidiffusion.github.io
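As we understand it, for simple overlapping-window setups (e.g., panoramas) the per-step binding reduces to averaging the per-window denoiser outputs. The sketch below shows only that averaging step; the window size, stride, and the denoise_step callable are our placeholders, not the authors' code.

import torch

def fused_denoise_step(denoise_step, z_t, t, window=64, stride=32):
    """One MultiDiffusion-style step on a wide latent z_t of shape (B, C, H, W),
    assuming W >= window: run the ordinary denoiser on sliding windows and
    average the predictions where windows overlap."""
    _, _, _, W = z_t.shape
    starts = list(range(0, W - window + 1, stride))
    if starts[-1] != W - window:          # make sure the right edge is covered
        starts.append(W - window)
    out = torch.zeros_like(z_t)
    count = torch.zeros_like(z_t)
    for s in starts:
        pred = denoise_step(z_t[..., s:s + window], t)   # one reverse step per crop
        out[..., s:s + window] += pred
        count[..., s:s + window] += 1
    return out / count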
PRedItOR: Text Guided Image Editing with Diffusion Prior
February 15, 2023
Hareesh Ravi, Sachin Kelkar, Midhun Harikumar, Ajinkya Kale
Diffusion models have shown remarkable capabilities in generating high
quality and creative images conditioned on text. An interesting application of
such models is structure preserving text guided image editing. Existing
approaches rely on text conditioned diffusion models such as Stable Diffusion
or Imagen and require compute intensive optimization of text embeddings or
fine-tuning the model weights for text guided image editing. We explore text
guided image editing with a Hybrid Diffusion Model (HDM) architecture similar
to DALLE-2. Our architecture consists of a diffusion prior model that generates
CLIP image embedding conditioned on a text prompt and a custom Latent Diffusion
Model trained to generate images conditioned on CLIP image embedding. We
discover that the diffusion prior model can be used to perform text guided
conceptual edits on the CLIP image embedding space without any finetuning or
optimization. We combine this with structure preserving edits on the image
decoder using existing approaches such as reverse DDIM to perform text guided
image editing. Our approach, PRedItOR, does not require additional inputs,
fine-tuning, optimization or objectives and shows on par or better results than
baselines qualitatively and quantitatively. We provide further analysis and
understanding of the diffusion prior model and believe this opens up new
possibilities in diffusion models research.
Denoising Diffusion Probabilistic Models for Robust Image Super-Resolution in the Wild
February 15, 2023
Hshmat Sahak, Daniel Watson, Chitwan Saharia, David Fleet
Diffusion models have shown promising results on single-image
super-resolution and other image-to-image translation tasks. Despite this
success, they have not outperformed state-of-the-art GAN models on the more
challenging blind super-resolution task, where the input images are out of
distribution, with unknown degradations. This paper introduces SR3+, a
diffusion-based model for blind super-resolution, establishing a new
state-of-the-art. To this end, we advocate self-supervised training with a
combination of composite, parameterized degradations and noise-conditioning
augmentation during training and testing. With
these innovations, a large-scale convolutional architecture, and large-scale
datasets, SR3+ greatly outperforms SR3. It outperforms Real-ESRGAN when trained
on the same data, with a DRealSR FID score of 36.82 vs. 37.22, which further
improves to an FID of 32.37 with larger models, and further still with larger
training sets.
Video Probabilistic Diffusion Models in Projected Latent Space
February 15, 2023
Sihyun Yu, Kihyuk Sohn, Subin Kim, Jinwoo Shin
Despite the remarkable progress in deep generative models, synthesizing
high-resolution and temporally coherent videos still remains a challenge due to
their high-dimensionality and complex temporal dynamics along with large
spatial variations. Recent works on diffusion models have shown their potential
to solve this challenge, yet they suffer from severe computation- and
memory-inefficiency that limit the scalability. To handle this issue, we
propose a novel generative model for videos, coined projected latent video
diffusion models (PVDM), a probabilistic diffusion model which learns a video
distribution in a low-dimensional latent space and thus can be efficiently
trained with high-resolution videos under limited resources. Specifically, PVDM
is composed of two components: (a) an autoencoder that projects a given video
as 2D-shaped latent vectors that factorize the complex cubic structure of video
pixels and (b) a diffusion model architecture specialized for our new
factorized latent space and the training/sampling procedure to synthesize
videos of arbitrary length with a single model. Experiments on popular video
generation datasets demonstrate the superiority of PVDM compared with previous
video synthesis methods; e.g., PVDM obtains an FVD score of 639.7 on the
UCF-101 long-video (128 frames) generation benchmark, improving on the prior
state-of-the-art score of 1773.4.
Score-based Diffusion Models in Function Space
February 14, 2023
Jae Hyun Lim, Nikola B. Kovachki, Ricardo Baptista, Christopher Beckham, Kamyar Azizzadenesheli, Jean Kossaifi, Vikram Voleti, Jiaming Song, Karsten Kreis, Jan Kautz, Christopher Pal, Arash Vahdat, Anima Anandkumar
cs.LG, math.FA, stat.ML, 46B09 (Primary), 60J22 (Secondary), I.2.6; J.2
Diffusion models have recently emerged as a powerful framework for generative
modeling. They consist of a forward process that perturbs input data with
Gaussian white noise and a reverse process that learns a score function to
generate samples by denoising. Despite their tremendous success, they are
mostly formulated on finite-dimensional spaces, e.g. Euclidean, limiting their
applications to many domains where the data has a functional form such as in
scientific computing and 3D geometric data analysis. In this work, we introduce
a mathematically rigorous framework called Denoising Diffusion Operators (DDOs)
for training diffusion models in function space. In DDOs, the forward process
perturbs input functions gradually using a Gaussian process. The generative
process is formulated by integrating a function-valued Langevin dynamic. Our
approach requires an appropriate notion of the score for the perturbed data
distribution, which we obtain by generalizing denoising score matching to
function spaces that can be infinite-dimensional. We show that the
corresponding discretized algorithm generates accurate samples at a fixed cost
that is independent of the data resolution. We theoretically and numerically
verify the applicability of our approach on a set of problems, including
generating solutions to the Navier-Stokes equation viewed as the push-forward
distribution of forcings from a Gaussian Random Field (GRF).
CDPMSR: Conditional Diffusion Probabilistic Models for Single Image Super-Resolution
February 14, 2023
Axi Niu, Kang Zhang, Trung X. Pham, Jinqiu Sun, Yu Zhu, In So Kweon, Yanning Zhang
Diffusion probabilistic models (DPM) have been widely adopted in
image-to-image translation to generate high-quality images. Prior attempts at
applying the DPM to image super-resolution (SR) have shown that iteratively
refining pure Gaussian noise with a conditional image, using a U-Net trained to
denoise at various noise levels, can yield a satisfactory high-resolution image
for the low-resolution one. To further improve the performance and simplify
current DPM-based super-resolution methods, we propose a simple but non-trivial
DPM-based super-resolution post-processing framework, i.e., cDPMSR. After
applying a pre-trained SR model to the to-be-tested LR image to provide the
conditional input, we adapt the standard DPM to conduct
conditional image generation and perform super-resolution through a
deterministic iterative denoising process. Our method surpasses prior attempts
on both qualitative and quantitative results and can generate more
photo-realistic counterparts for the low-resolution images with various
benchmark datasets including Set5, Set14, Urban100, BSD100, and Manga109. Code
will be published upon acceptance.
Raising the Cost of Malicious AI-Powered Image Editing
February 13, 2023
Hadi Salman, Alaa Khaddaj, Guillaume Leclerc, Andrew Ilyas, Aleksander Madry
We present an approach to mitigating the risks of malicious image editing
posed by large diffusion models. The key idea is to immunize images so as to
make them resistant to manipulation by these models. This immunization relies
on injection of imperceptible adversarial perturbations designed to disrupt the
operation of the targeted diffusion models, forcing them to generate
unrealistic images. We provide two methods for crafting such perturbations, and
then demonstrate their efficacy. Finally, we discuss a policy component
necessary to make our approach fully effective and practical – one that
involves the organizations developing diffusion models, rather than individual
users, to implement (and support) the immunization process.
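One of the two perturbation-crafting strategies resembles a projected-gradient attack on the image encoder of a latent diffusion model; a simplified sketch of that flavor follows. This is our own rendering, with the encoder callable, budget, step size, and iteration count as assumptions, not the authors' exact procedure.

import torch

def immunize(encoder, x, target_latent, eps=8/255, step=1/255, iters=200):
    """PGD-style search for an imperceptible perturbation (||delta||_inf <= eps) that
    pulls encoder(x + delta) toward a target latent, so later diffusion-based edits
    of the immunized image degrade. `encoder` stands in for the LDM image encoder."""
    delta = torch.zeros_like(x)
    for _ in range(iters):
        delta.requires_grad_(True)
        loss = ((encoder(x + delta) - target_latent) ** 2).mean()
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta = delta - step * grad.sign()        # move toward the target latent
            delta = delta.clamp(-eps, eps)            # respect the perturbation budget
            delta = (x + delta).clamp(0, 1) - x       # keep pixel values valid
    return (x + delta).detach()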
A Reparameterized Discrete Diffusion Model for Text Generation
February 13, 2023
Preconditioned Score-based Generative Models
February 13, 2023
Hengyuan Ma, Li Zhang, Xiatian Zhu, Jianfeng Feng
Score-based generative models (SGMs) have recently emerged as a promising
class of generative models. However, a fundamental limitation is that their
sampling process is slow due to the need for many (e.g., $2000$) iterations of
sequential computations. An intuitive acceleration method is to reduce the
sampling iterations, which however causes severe performance degradation. We
attribute this problem to the ill-conditioning of the Langevin dynamics and the
reverse diffusion in the sampling process. Under this insight, we propose a
model-agnostic preconditioned diffusion sampling (PDS) method that
leverages matrix preconditioning to alleviate the aforementioned problem. PDS
alters the sampling process of a vanilla SGM at marginal extra computation
cost, and without model retraining. Theoretically, we prove that PDS preserves
the output distribution of the SGM, with no risk of inducing systematic bias
into the original sampling process. We further theoretically reveal a relation
between the parameters of PDS and the sampling iterations, easing the parameter
estimation under varying sampling iterations. Extensive experiments on various
image datasets with a variety of resolutions and diversity validate that our
PDS consistently accelerates off-the-shelf SGMs whilst maintaining the
synthesis quality. In particular, PDS can accelerate by up to $29\times$ on
more challenging high resolution (1024$\times$1024) image generation. Compared
with the latest generative models (e.g., CLD-SGM, DDIM, and Analytic-DDIM), PDS
achieves the best sampling quality on CIFAR-10, with an FID score of 1.99. Our
code is made publicly available to foster further research:
https://github.com/fudan-zvg/PDS.
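For intuition only, the generic shape of a preconditioned Langevin update with a diagonal preconditioner is sketched below; PDS constructs its preconditioner in a structured (e.g., frequency) space, which is omitted here, so this is a simplified stand-in rather than the authors' method.

import torch

def preconditioned_langevin_step(score_fn, x, sigma, M_diag, step_size):
    """One preconditioned Langevin update:
        x <- x + (step/2) * M * score(x, sigma) + sqrt(step) * sqrt(M) * z
    M_diag (> 0, broadcastable to x) is a user-chosen diagonal preconditioner;
    a constant M preserves the stationary distribution of the dynamics."""
    z = torch.randn_like(x)
    score = score_fn(x, sigma)
    return x + 0.5 * step_size * M_diag * score + step_size ** 0.5 * M_diag.sqrt() * z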
February 13, 2023
Zhiye Guo, Jian Liu, Yanli Wang, Mengrui Chen, Duolin Wang, Dong Xu, Jianlin Cheng
cs.LG, cs.AI, q-bio.QM, I.2.1; J.3
Denoising diffusion models have emerged as one of the most powerful
generative models in recent years. They have achieved remarkable success in
many fields, such as computer vision, natural language processing (NLP), and
bioinformatics. Although there are a few excellent reviews on diffusion models
and their applications in computer vision and NLP, there is a lack of an
overview of their applications in bioinformatics. This review aims to provide a
rather thorough overview of the applications of diffusion models in
bioinformatics to aid their further development in bioinformatics and
computational biology. We start with an introduction of the key concepts and
theoretical foundations of three cornerstone diffusion modeling frameworks
(denoising diffusion probabilistic models, noise-conditioned scoring networks,
and stochastic differential equations), followed by a comprehensive description
of diffusion models employed in the different domains of bioinformatics,
including cryo-EM data enhancement, single-cell data analysis, protein design
and generation, drug and small molecule design, and protein-ligand interaction.
The review is concluded with a summary of the potential new development and
applications of diffusion models in bioinformatics.
Single Motion Diffusion
February 12, 2023
Sigal Raab, Inbal Leibovitch, Guy Tevet, Moab Arar, Amit H. Bermano, Daniel Cohen-Or
Synthesizing realistic animations of humans, animals, and even imaginary
creatures, has long been a goal for artists and computer graphics
professionals. Compared to the imaging domain, which is rich with large
available datasets, the number of data instances for the motion domain is
limited, particularly for the animation of animals and exotic creatures (e.g.,
dragons), which have unique skeletons and motion patterns. In this work, we
present a Single Motion Diffusion Model, dubbed SinMDM, a model designed to
learn the internal motifs of a single motion sequence with arbitrary topology
and synthesize motions of arbitrary length that are faithful to them. We
harness the power of diffusion models and present a denoising network
explicitly designed for the task of learning from a single input motion. SinMDM
is designed to be a lightweight architecture, which avoids overfitting by using
a shallow network with local attention layers that narrow the receptive field
and encourage motion diversity. SinMDM can be applied in various contexts,
including spatial and temporal in-betweening, motion expansion, style transfer,
and crowd animation. Our results show that SinMDM outperforms existing methods
both in quality and time-space efficiency. Moreover, while current approaches
require additional training for different applications, our work facilitates
these applications at inference time. Our code and trained models are available
at https://sinmdm.github.io/SinMDM-page.
I$^2$SB: Image-to-Image Schrödinger Bridge
February 12, 2023
Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A. Theodorou, Weili Nie, Anima Anandkumar
We propose Image-to-Image Schrödinger Bridge (I$^2$SB), a new class of
conditional diffusion models that directly learn the nonlinear diffusion
processes between two given distributions. These diffusion bridges are
particularly useful for image restoration, as the degraded images are
structurally informative priors for reconstructing the clean images. I$^2$SB
belongs to a tractable class of Schrödinger bridges, the nonlinear extension of
score-based models, whose marginal distributions can be computed
analytically given boundary pairs. This results in a simulation-free framework
for nonlinear diffusions, where the I$^2$SB training becomes scalable by
adopting practical techniques used in standard diffusion models. We validate
I$^2$SB in solving various image restoration tasks, including inpainting,
super-resolution, deblurring, and JPEG restoration on ImageNet 256x256 and show
that I$^2$SB surpasses standard conditional diffusion models with more
interpretable generative processes. Moreover, I$^2$SB matches the performance
of inverse methods that additionally require the knowledge of the corruption
operators. Our work opens up new algorithmic opportunities for developing
efficient nonlinear diffusion models on a large scale. Project page and
codes: https://i2sb.github.io/
Adding Conditional Control to Text-to-Image Diffusion Models
February 10, 2023
Lvmin Zhang, Anyi Rao, Maneesh Agrawala
cs.CV, cs.AI, cs.GR, cs.HC, cs.MM
We present ControlNet, a neural network architecture to add spatial
conditioning controls to large, pretrained text-to-image diffusion models.
ControlNet locks the production-ready large diffusion models, and reuses their
deep and robust encoding layers pretrained with billions of images as a strong
backbone to learn a diverse set of conditional controls. The neural
architecture is connected with “zero convolutions” (zero-initialized
convolution layers) that progressively grow the parameters from zero and ensure
that no harmful noise could affect the finetuning. We test various conditioning
controls, e.g., edges, depth, segmentation, human pose, etc., with Stable
Diffusion, using single or multiple conditions, with or without prompts. We
show that the training of ControlNets is robust with small (<50k) and large
(>1m) datasets. Extensive results show that ControlNet may facilitate wider
applications to control image diffusion models.
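The "zero convolution" itself is simple enough to show directly: a convolution whose weights and bias are initialized to exactly zero, so the control branch contributes nothing at the start of finetuning. A minimal PyTorch rendering follows; the kernel size and the surrounding wiring are our assumptions.

import torch.nn as nn

def zero_conv(channels_in, channels_out):
    """A 1x1 convolution whose weights and bias start at exactly zero, so the
    control branch initially adds nothing and cannot inject harmful noise."""
    conv = nn.Conv2d(channels_in, channels_out, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# In a ControlNet-style setup, a trainable copy of an encoder block processes the
# conditioning signal and its output passes through zero_conv(...) before being
# added to the frozen backbone's features.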
Star-Shaped Denoising Diffusion Probabilistic Models
February 10, 2023
Andrey Okhotin, Dmitry Molchanov, Vladimir Arkhipkin, Grigory Bartosh, Viktor Ohanesian, Aibek Alanov, Dmitry Vetrov
Denoising Diffusion Probabilistic Models (DDPMs) provide the foundation for
the recent breakthroughs in generative modeling. Their Markovian structure
makes it difficult to define DDPMs with distributions other than Gaussian or
discrete. In this paper, we introduce Star-Shaped DDPM (SS-DDPM). Its
star-shaped diffusion process allows us to bypass the need to define the
transition probabilities or compute posteriors. We establish duality between
star-shaped and specific Markovian diffusions for the exponential family of
distributions and derive efficient algorithms for training and sampling from
SS-DDPMs. In the case of Gaussian distributions, SS-DDPM is equivalent to DDPM.
However, SS-DDPMs provide a simple recipe for designing diffusion models with
distributions such as Beta, von Mises–Fisher, Dirichlet,
Wishart and others, which can be especially useful when data lies on a
constrained manifold. We evaluate the model in different settings and find it
competitive even on image data, where Beta SS-DDPM achieves results comparable
to a Gaussian DDPM. Our implementation is available at
https://github.com/andrey-okhotin/star-shaped .
Example-Based Sampling with Diffusion Models
February 10, 2023
Bastien Doignies, Nicolas Bonneel, David Coeurjolly, Julie Digne, Loïs Paulin, Jean-Claude Iehl, Victor Ostromoukhov
cs.GR, cs.CV, cs.LG, I.3.7; I.5.1; I.6.8; G.1.4
Much effort has been put into developing samplers with specific properties,
such as producing blue noise, low-discrepancy, lattice or Poisson disk samples.
These samplers can be slow if they rely on optimization processes, may depend on
a wide range of numerical methods, and are not always differentiable. The success
of recent diffusion models for image generation suggests that these models
could be appropriate for learning how to generate point sets from examples.
However, their convolutional nature makes these methods impractical for dealing
with scattered data such as point sets. We propose a generic way to produce 2-d
point sets imitating existing samplers from observed point sets using a
diffusion model. We address the problem of convolutional layers by leveraging
neighborhood information from an optimal transport matching to a uniform grid,
which allows us to benefit from fast convolutions on grids and to support the
example-based learning of non-uniform sampling patterns. We demonstrate how the
differentiability of our approach can be used to optimize point sets to enforce
properties.
UniPC: A Unified Predictor-Corrector Framework for Fast Sampling of Diffusion Models
February 09, 2023
Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, Jiwen Lu
Diffusion probabilistic models (DPMs) have demonstrated a very promising
ability in high-resolution image synthesis. However, sampling from a
pre-trained DPM is time-consuming due to the multiple evaluations of the
denoising network, making it more and more important to accelerate the sampling
of DPMs. Despite recent progress in designing fast samplers, existing methods
still cannot generate satisfying images in many applications where fewer steps
(e.g., $<$10) are favored. In this paper, we develop a unified corrector (UniC)
that can be applied after any existing DPM sampler to increase the order of
accuracy without extra model evaluations, and derive a unified predictor (UniP)
that supports arbitrary order as a byproduct. Combining UniP and UniC, we
propose a unified predictor-corrector framework called UniPC for the fast
sampling of DPMs, which has a unified analytical form for any order and can
significantly improve the sampling quality over previous methods, especially in
extremely few steps. We evaluate our methods through extensive experiments
including both unconditional and conditional sampling using pixel-space and
latent-space DPMs. Our UniPC can achieve 3.87 FID on CIFAR10 (unconditional)
and 7.51 FID on ImageNet 256$\times$256 (conditional) with only 10 function
evaluations. Code is available at https://github.com/wl-zhao/UniPC.
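Besides the official repository, the diffusers library has shipped a UniPC scheduler; a minimal usage sketch follows, with the model id and step count chosen for illustration (API details may differ across library versions).

# Minimal usage sketch; assumes a recent `diffusers` install providing
# UniPCMultistepScheduler and a CUDA device.
import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe("a watercolor painting of a lighthouse",
             num_inference_steps=10).images[0]   # UniPC targets the few-step regime
image.save("lighthouse.png")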
Better Diffusion Models Further Improve Adversarial Training
February 09, 2023
Zekai Wang, Tianyu Pang, Chao Du, Min Lin, Weiwei Liu, Shuicheng Yan
cs.CV, cs.AI, cs.CR, cs.LG
It has been recognized that the data generated by the denoising diffusion
probabilistic model (DDPM) improves adversarial training. After two years of
rapid development in diffusion models, a question naturally arises: can better
diffusion models further improve adversarial training? This paper gives an
affirmative answer by employing the most recent diffusion model which has
higher efficiency ($\sim 20$ sampling steps) and image quality (lower FID
score) compared with DDPM. Our adversarially trained models achieve
state-of-the-art performance on RobustBench using only generated data (no
external datasets). Under the $\ell_\infty$-norm threat model with
$\epsilon=8/255$, our models achieve $70.69\%$ and $42.67\%$ robust accuracy on
CIFAR-10 and CIFAR-100, respectively, i.e. improving upon previous
state-of-the-art models by $+4.58\%$ and $+8.03\%$. Under the $\ell_2$-norm
threat model with $\epsilon=128/255$, our models achieve $84.86\%$ on CIFAR-10
($+4.44\%$). These results also beat previous works that use external data. We
also provide compelling results on the SVHN and TinyImageNet datasets. Our code
is available at https://github.com/wzekai99/DM-Improves-AT.
Geometry of Score Based Generative Models
February 09, 2023
Sandesh Ghimire, Jinyang Liu, Armand Comas, Davin Hill, Aria Masoomi, Octavia Camps, Jennifer Dy
In this work, we look at Score-based generative models (also called diffusion
generative models) from a geometric perspective. From this new viewpoint, we
prove that both the forward process of adding noise and the backward process of
generating from noise are Wasserstein gradient flows in the space of probability
measures; we are the first to prove this connection. Our understanding of
score-based (and diffusion) generative models has matured and become more
complete by drawing ideas from fields such as Bayesian inference, control
theory, stochastic differential equations, and Schrödinger bridges. However,
many open questions and challenges remain. One problem, for example, is how to
decrease the sampling time. We demonstrate that the geometric perspective
enables us to answer many of these questions and provides new interpretations
of some known results. Furthermore, it enables us to devise an
intuitive geometric solution to the problem of faster sampling. By augmenting
traditional score-based generative models with a projection step, we show that
we can generate high-quality images with significantly fewer sampling steps.
MedDiff: Generating Electronic Health Records using Accelerated Denoising Diffusion Model
February 08, 2023
Huan He, Shifan Zhao, Yuanzhe Xi, Joyce C Ho
Due to patient privacy protection concerns, machine learning research in
healthcare has been undeniably slower and limited than in other application
domains. High-quality, realistic, synthetic electronic health records (EHRs)
can be leveraged to accelerate methodological developments for research
purposes while mitigating privacy concerns associated with data sharing. The
current state-of-the-art model for synthetic EHR generation is generative
adversarial networks, which are notoriously difficult to train and can suffer
from mode collapse. Denoising Diffusion Probabilistic Models, a class of
generative models inspired by statistical thermodynamics, have recently been
shown to generate high-quality synthetic samples in certain domains. It is
unknown whether these can generalize to generation of large-scale,
high-dimensional EHRs. In this paper, we present a novel generative model based
on diffusion models that is the first successful application of this model class
to electronic health records. Our model incorporates a mechanism to perform class-conditional
sampling to preserve label information. We also introduce a new sampling
strategy to accelerate the inference speed. We empirically show that our model
outperforms existing state-of-the-art synthetic EHR generation methods.
Geometry-Complete Diffusion for 3D Molecule Generation
February 08, 2023
Alex Morehead, Jianlin Cheng
cs.LG, cs.AI, q-bio.BM, q-bio.QM, stat.ML, I.2.1; J.3
Denoising diffusion probabilistic models (DDPMs) have recently taken the
field of generative modeling by storm, pioneering new state-of-the-art results
in disciplines such as computer vision and computational biology for diverse
tasks ranging from text-guided image generation to structure-guided protein
design. Along this latter line of research, methods have recently been proposed
for generating 3D molecules using equivariant graph neural networks (GNNs)
within a DDPM framework. However, such methods are unable to learn important
geometric and physical properties of 3D molecules during molecular graph
generation, as they adopt molecule-agnostic and non-geometric GNNs as their 3D
graph denoising networks, which negatively impacts their ability to effectively
scale to datasets of large 3D molecules. In this work, we address these gaps by
introducing the Geometry-Complete Diffusion Model (GCDM) for 3D molecule
generation, which outperforms existing 3D molecular diffusion models by
significant margins across conditional and unconditional settings for the QM9
dataset as well as for the larger GEOM-Drugs dataset. Importantly, we
demonstrate that the geometry-complete denoising process GCDM learns for 3D
molecule generation allows the model to generate realistic and stable large
molecules at the scale of GEOM-Drugs, whereas previous methods fail to do so
with the features they learn. Additionally, we show that GCDM’s geometric
features can effectively be repurposed to directly optimize the geometry and
chemical composition of existing 3D molecules for specific molecular
properties, demonstrating new, real-world versatility of molecular diffusion
models. Our source code, data, and reproducibility instructions are freely
available at https://github.com/BioinfoMachineLearning/Bio-Diffusion.
Q-Diffusion: Quantizing Diffusion Models
February 08, 2023
Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, Kurt Keutzer
Diffusion models have achieved great success in image synthesis through
iterative noise estimation using deep neural networks. However, the slow
inference, high memory consumption, and computation intensity of the noise
estimation model hinder the efficient adoption of diffusion models. Although
post-training quantization (PTQ) is considered a go-to compression method for
other tasks, it does not work out-of-the-box on diffusion models. We propose a
novel PTQ method specifically tailored towards the unique multi-timestep
pipeline and model architecture of the diffusion models, which compresses the
noise estimation network to accelerate the generation process. We identify the
key difficulty of diffusion model quantization as the changing output
distributions of noise estimation networks over multiple time steps and the
bimodal activation distribution of the shortcut layers within the noise
estimation network. We tackle these challenges with timestep-aware calibration
and split shortcut quantization in this work. Experimental results show that
our proposed method is able to quantize full-precision unconditional diffusion
models into 4-bit while maintaining comparable performance (small FID change of
at most 2.34 compared to >100 for traditional PTQ) in a training-free manner.
Our approach can also be applied to text-guided image generation, where we can
run stable diffusion in 4-bit weights with high generation quality for the
first time.
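As background, the PTQ baseline that such methods start from is plain uniform fake-quantization of weights; a minimal sketch is below. Q-Diffusion's actual contributions (timestep-aware activation calibration and split shortcut quantization) sit on top of this and are not shown here.

import torch

def fake_quantize_weight(w, num_bits=4):
    """Symmetric uniform fake-quantization of a weight tensor: round to a signed
    num_bits grid, then map back to floats. A generic PTQ baseline, not Q-Diffusion."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    if scale == 0:
        return w
    w_q = (w / scale).round().clamp(-qmax - 1, qmax)
    return w_q * scale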
PFGM++: Unlocking the Potential of Physics-Inspired Generative Models
February 08, 2023
Yilun Xu, Ziming Liu, Yonglong Tian, Shangyuan Tong, Max Tegmark, Tommi Jaakkola
We introduce a new family of physics-inspired generative models termed PFGM++
that unifies diffusion models and Poisson Flow Generative Models (PFGM). These
models realize generative trajectories for $N$ dimensional data by embedding
paths in $N{+}D$ dimensional space while still controlling the progression with
a simple scalar norm of the $D$ additional variables. The new models reduce to
PFGM when $D{=}1$ and to diffusion models when $D{\to}\infty$. The flexibility
of choosing $D$ allows us to trade off robustness against rigidity as
increasing $D$ results in more concentrated coupling between the data and the
additional variable norms. We dispense with the biased large batch field
targets used in PFGM and instead provide an unbiased perturbation-based
objective similar to diffusion models. To explore different choices of $D$, we
provide a direct alignment method for transferring well-tuned hyperparameters
from diffusion models ($D{\to} \infty$) to any finite $D$ values. Our
experiments show that models with finite $D$ can be superior to previous
state-of-the-art diffusion models on CIFAR-10/FFHQ $64{\times}64$ datasets,
with FID scores of $1.91/2.43$ when $D{=}2048/128$. In class-conditional
setting, $D{=}2048$ yields current state-of-the-art FID of $1.74$ on CIFAR-10.
In addition, we demonstrate that models with smaller $D$ exhibit improved
robustness against modeling errors. Code is available at
https://github.com/Newbeeer/pfgmpp
Noise2Music: Text-conditioned Music Generation with Diffusion Models
February 08, 2023
Qingqing Huang, Daniel S. Park, Tao Wang, Timo I. Denk, Andy Ly, Nanxin Chen, Zhengdong Zhang, Zhishuai Zhang, Jiahui Yu, Christian Frank, Jesse Engel, Quoc V. Le, William Chan, Zhifeng Chen, Wei Han
We introduce Noise2Music, where a series of diffusion models is trained to
generate high-quality 30-second music clips from text prompts. Two types of
diffusion models, a generator model, which generates an intermediate
representation conditioned on text, and a cascader model, which generates
high-fidelity audio conditioned on the intermediate representation and possibly
the text, are trained and utilized in succession to generate high-fidelity
music. We explore two options for the intermediate representation, one using a
spectrogram and the other using audio with lower fidelity. We find that the
generated audio is not only able to faithfully reflect key elements of the text
prompt such as genre, tempo, instruments, mood, and era, but goes beyond to
ground fine-grained semantics of the prompt. Pretrained large language models
play a key role in this story – they are used to generate paired text for the
audio of the training set and to extract embeddings of the text prompts
ingested by the diffusion models.
Generated examples: https://google-research.github.io/noise2music
Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models
February 08, 2023
Hyeonho Jeong, Gihyun Kwon, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Recent advancements in large scale text-to-image models have opened new
possibilities for guiding the creation of images through human-devised natural
language. However, while prior literature has primarily focused on the
generation of individual images, it is essential to consider the capability of
these models to ensure coherency within a sequence of images to fulfill the
demands of real-world applications such as storytelling. To address this, here
we present a novel neural pipeline for generating a coherent storybook from the
plain text of a story. Specifically, we leverage a combination of a pre-trained
Large Language Model and a text-guided Latent Diffusion Model to generate
coherent images. While previous story synthesis frameworks typically require a
large-scale text-to-image model trained on expensive image-caption pairs to
maintain the coherency, we employ simple textual inversion techniques along
with detector-based semantic image editing which allows zero-shot generation of
the coherent storybook. Experimental results show that our proposed method
outperforms state-of-the-art image editing baselines.
February 07, 2023
Xianghao Kong, Rob Brekelmans, Greg Ver Steeg
Denoising diffusion models have spurred significant gains in density modeling
and image generation, precipitating an industrial revolution in text-guided AI
art generation. We introduce a new mathematical foundation for diffusion models
inspired by classic results in information theory that connect Information with
Minimum Mean Square Error regression, the so-called I-MMSE relations. We
generalize the I-MMSE relations to exactly relate the data distribution to an
optimal denoising regression problem, leading to an elegant refinement of
existing diffusion bounds. This new insight leads to several improvements for
probability distribution estimation, including theoretical justification for
diffusion model ensembling. Remarkably, our framework shows how continuous and
discrete probabilities can be learned with the same regression objective,
avoiding domain-specific generative models used in variational methods. Code to
reproduce experiments is provided at http://github.com/kxh001/ITdiffusion and
simplified demonstration code is at
http://github.com/gregversteeg/InfoDiffusionSimple.
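For reference, the classical I-MMSE relation the abstract builds on (Guo, Shamai, and Verdú) states that, for a Gaussian channel at signal-to-noise ratio $\gamma$,

$$ \frac{\mathrm{d}}{\mathrm{d}\gamma}\, I\bigl(X;\ \sqrt{\gamma}\,X + N\bigr) = \tfrac{1}{2}\,\mathrm{mmse}(\gamma), \qquad \mathrm{mmse}(\gamma) = \mathbb{E}\bigl[\bigl\|X - \mathbb{E}[X \mid \sqrt{\gamma}\,X + N]\bigr\|^2\bigr], \quad N \sim \mathcal{N}(0, I). $$

The paper's contribution, as described above, is to generalize this relation so that the data distribution itself is tied exactly to an optimal denoising regression problem.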
February 07, 2023
Jacopo Teneggi, Matthew Tivnan, J. Webster Stayman, Jeremias Sulam
Score-based generative models, informally referred to as diffusion models,
continue to grow in popularity across several important domains and tasks.
While they provide high-quality and diverse samples from empirical
distributions, important questions remain on the reliability and
trustworthiness of these sampling procedures for their responsible use in
critical scenarios. Conformal prediction is a modern tool to construct
finite-sample, distribution-free uncertainty guarantees for any black-box
predictor. In this work, we focus on image-to-image regression tasks and we
present a generalization of the Risk-Controlling Prediction Sets (RCPS)
procedure, which we term $K$-RCPS, and which allows us to $(i)$ provide entrywise
calibrated intervals for future samples of any diffusion model, and $(ii)$
control a certain notion of risk with respect to a ground truth image with
minimal mean interval length. Differently from existing conformal risk control
procedures, ours relies on a novel convex optimization approach that allows for
multidimensional risk control while provably minimizing the mean interval
length. We illustrate our approach on two real-world image denoising problems:
on natural images of faces as well as on computed tomography (CT) scans of the
abdomen, demonstrating state-of-the-art performance.
GraphGUIDE: interpretable and controllable conditional graph generation with discrete Bernoulli diffusion
February 07, 2023
Alex M. Tseng, Nathaniel Diamant, Tommaso Biancalani, Gabriele Scalia
Diffusion models achieve state-of-the-art performance in generating realistic
objects and have been successfully applied to images, text, and videos. Recent
work has shown that diffusion can also be defined on graphs, including graph
representations of drug-like molecules. Unfortunately, it remains difficult to
perform conditional generation on graphs in a way which is interpretable and
controllable. In this work, we propose GraphGUIDE, a novel framework for graph
generation using diffusion models, where edges in the graph are flipped or set
at each discrete time step. We demonstrate GraphGUIDE on several graph
datasets, and show that it enables full control over the conditional generation
of arbitrary structural properties without relying on predefined labels. Our
framework for graph diffusion can have a large impact on the interpretable
conditional generation of graphs, including the generation of drug-like
molecules with desired properties in a way which is informed by experimental
evidence.
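A minimal rendering of the edge-flip idea on a dense adjacency matrix is shown below; this is a toy forward-noising step of ours, not the authors' full diffusion kernel.

import torch

def flip_edges(adj, flip_prob):
    """One discrete forward-noising step on an undirected graph: adj is a float
    0/1 adjacency matrix, and each upper-triangular edge indicator is flipped
    independently with probability flip_prob, keeping the matrix symmetric."""
    mask = torch.bernoulli(torch.full_like(adj, float(flip_prob))).bool()
    mask = torch.triu(mask, diagonal=1)          # decide each undirected edge once
    mask = mask | mask.transpose(-1, -2)         # mirror to keep symmetry
    return torch.where(mask, 1 - adj, adj)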
Effective Data Augmentation With Diffusion Models
February 07, 2023
Brandon Trabucco, Kyle Doherty, Max Gurinas, Ruslan Salakhutdinov
Data augmentation is one of the most prevalent tools in deep learning,
underpinning many recent advances, including those from classification,
generative models, and representation learning. The standard approach to data
augmentation combines simple transformations like rotations and flips to
generate new images from existing ones. However, these new images lack
diversity along key semantic axes present in the data. Current augmentations
cannot alter the high-level semantic attributes, such as animal species present
in a scene, to enhance the diversity of data. We address the lack of diversity
in data augmentation with image-to-image transformations parameterized by
pre-trained text-to-image diffusion models. Our method edits images to change
their semantics using an off-the-shelf diffusion model, and generalizes to
novel visual concepts from a few labelled examples. We evaluate our approach on
few-shot image classification tasks, and on a real-world weed recognition task,
and observe an improvement in accuracy in tested domains.
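Generic image-to-image editing with an off-the-shelf text-to-image model already gives a feel for this kind of semantic augmentation; the sketch below uses the diffusers img2img pipeline as a stand-in. The model id, file names, prompt, and strength are illustrative, and the paper's actual method goes beyond this.

import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("weed_001.jpg").convert("RGB").resize((512, 512))  # hypothetical file
augmented = pipe(prompt="a photo of a weed in a field",   # class-level prompt
                 image=init,
                 strength=0.4,          # how far to move away from the original image
                 guidance_scale=7.5).images
augmented[0].save("weed_001_aug.png")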
Long Horizon Temperature Scaling
February 07, 2023
Andy Shih, Dorsa Sadigh, Stefano Ermon
Temperature scaling is a popular technique for tuning the sharpness of a
model distribution. It is used extensively for sampling likely generations and
calibrating model uncertainty, and even features as a controllable parameter to
many large language models in deployment. However, autoregressive models rely
on myopic temperature scaling that greedily optimizes the next token. To
address this, we propose Long Horizon Temperature Scaling (LHTS), a novel
approach for sampling from temperature-scaled joint distributions. LHTS is
compatible with all likelihood-based models, and optimizes for the long horizon
likelihood of samples. We derive a temperature-dependent LHTS objective, and
show that finetuning a model on a range of temperatures produces a single model
capable of generation with a controllable long horizon temperature parameter.
We experiment with LHTS on image diffusion models and character/language
autoregressive models, demonstrating advantages over myopic temperature scaling
in likelihood and sample quality, and showing improvements in accuracy on a
multiple choice analogy task by $10\%$.
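The distinction the abstract draws can be written out directly (our notation): myopic temperature scaling tempers each next-token conditional, whereas LHTS targets the tempered joint distribution,

$$ p_\tau^{\text{myopic}}(x_t \mid x_{<t}) \;\propto\; p(x_t \mid x_{<t})^{1/\tau} \qquad\text{vs.}\qquad p_\tau^{\text{long horizon}}(x_{1:T}) \;\propto\; p(x_{1:T})^{1/\tau}. $$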
Fair Diffusion: Instructing Text-to-Image Generation Models on Fairness
February 07, 2023
Felix Friedrich, Manuel Brack, Lukas Struppek, Dominik Hintersdorf, Patrick Schramowski, Sasha Luccioni, Kristian Kersting
cs.LG, cs.AI, cs.CV, cs.CY, cs.HC
Generative AI models have recently achieved astonishing results in quality
and are consequently employed in a fast-growing number of applications.
However, since they are highly data-driven, relying on billion-sized datasets
randomly scraped from the internet, they also suffer from degenerated and
biased human behavior, as we demonstrate. In fact, they may even reinforce such
biases. To not only uncover but also combat these undesired effects, we present
a novel strategy, called Fair Diffusion, to attenuate biases after the
deployment of generative text-to-image models. Specifically, we demonstrate
shifting a bias, based on human instructions, in any direction yielding
arbitrarily new proportions for, e.g., identity groups. As our empirical
evaluation demonstrates, this introduced control enables instructing generative
image models on fairness, requiring neither data filtering nor additional
training.
Riemannian Flow Matching on General Geometries
February 07, 2023
Ricky T. Q. Chen, Yaron Lipman
We propose Riemannian Flow Matching (RFM), a simple yet powerful framework
for training continuous normalizing flows on manifolds. Existing methods for
generative modeling on manifolds either require expensive simulation, are
inherently unable to scale to high dimensions, or use approximations for
limiting quantities that result in biased training objectives. Riemannian Flow
Matching bypasses these limitations and offers several advantages over previous
approaches: it is simulation-free on simple geometries, does not require
divergence computation, and computes its target vector field in closed-form.
The key ingredient behind RFM is the construction of a relatively simple
premetric for defining target vector fields, which encompasses the existing
Euclidean case. To extend to general geometries, we rely on the use of spectral
decompositions to efficiently compute premetrics on the fly. Our method
achieves state-of-the-art performance on real-world non-Euclidean datasets, and
we demonstrate tractable training on general geometries, including triangular
meshes with highly non-trivial curvature and boundaries.
Graph Generation with Destination-Driven Diffusion Mixture
February 07, 2023
Jaehyeong Jo, Dongki Kim, Sung Ju Hwang
Generation of graphs is a major challenge for real-world tasks that require
understanding the complex nature of their non-Euclidean structures. Although
diffusion models have achieved notable success in graph generation recently,
they are ill-suited for modeling the structural information of graphs since
learning to denoise the noisy samples does not explicitly capture the graph
topology. To tackle this limitation, we propose a novel generative framework
that models the topology of graphs by predicting the destination of the
diffusion process, which is the original graph that has the correct topology
information, as a weighted mean of data. Specifically, we design the generative
process as a mixture of diffusion processes conditioned on the endpoint in the
data distribution, which drives the process toward the predicted destination,
resulting in rapid convergence. We introduce new simulation-free training
objectives for predicting the destination, and further discuss the advantages
of our framework that can explicitly model the graph topology and exploit the
inductive bias of the data. Through extensive experimental validation on
general graph and 2D/3D molecule generation tasks, we show that our method
outperforms previous generative models, generating graphs with correct topology
with both continuous (e.g. 3D coordinates) and discrete (e.g. atom types)
features.
DDM2: Self-Supervised Diffusion MRI Denoising with Generative Diffusion Models
February 06, 2023
Tiange Xiang, Mahmut Yurt, Ali B Syed, Kawin Setsompop, Akshay Chaudhari
Magnetic resonance imaging (MRI) is a common and life-saving medical imaging
technique. However, acquiring high signal-to-noise ratio MRI scans requires
long scan times, resulting in increased costs and patient discomfort, and
decreased throughput. Thus, there is great interest in denoising MRI scans,
especially for the subtype of diffusion MRI scans that are severely
SNR-limited. While most prior MRI denoising methods are supervised in nature,
acquiring supervised training datasets for the multitude of anatomies, MRI
scanners, and scan parameters proves impractical. Here, we propose Denoising
Diffusion Models for Denoising Diffusion MRI (DDM$^2$), a self-supervised
denoising method for MRI denoising using diffusion denoising generative models.
Our three-stage framework integrates statistic-based denoising theory into
diffusion models and performs denoising through conditional generation. During
inference, we represent input noisy measurements as a sample from an
intermediate posterior distribution within the diffusion Markov chain. We
conduct experiments on 4 real-world in-vivo diffusion MRI datasets and show
that our DDM$^2$ demonstrates superior denoising performances ascertained with
clinically-relevant visual qualitative and quantitative metrics.
Structure and Content-Guided Video Synthesis with Diffusion Models
February 06, 2023
Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, Anastasis Germanidis
Text-guided generative diffusion models unlock powerful image creation and
editing tools. While these have been extended to video generation, current
approaches that edit the content of existing footage while retaining structure
require expensive re-training for every input or rely on error-prone
propagation of image edits across frames. In this work, we present a structure
and content-guided video diffusion model that edits videos based on visual or
textual descriptions of the desired output. Conflicts between user-provided
content edits and structure representations occur due to insufficient
disentanglement between the two aspects. As a solution, we show that training
on monocular depth estimates with varying levels of detail provides control
over structure and content fidelity. Our model is trained jointly on images and
videos which also exposes explicit control of temporal consistency through a
novel guidance method. Our experiments demonstrate a wide variety of successes:
fine-grained control over output characteristics, customization based on a few
reference images, and a strong user preference for our model's results.
Generative Diffusion Models on Graphs: Methods and Applications
February 06, 2023
Chengyi Liu, Wenqi Fan, Yunqing Liu, Jiatong Li, Hang Li, Hui Liu, Jiliang Tang, Qing Li
Diffusion models, as a novel generative paradigm, have achieved remarkable
success in various image generation tasks such as image inpainting,
image-to-text translation, and video generation. Graph generation is a crucial
computational task on graphs with numerous real-world applications. It aims to
learn the distribution of given graphs and then generate new graphs. Given the
great success of diffusion models in image generation, increasing efforts have
been made to leverage these techniques to advance graph generation in recent
years. In this paper, we first provide a comprehensive overview of generative
diffusion models on graphs. In particular, we review representative algorithms
for three variants of graph diffusion models, i.e., Score Matching with
Langevin Dynamics (SMLD), Denoising Diffusion Probabilistic Model (DDPM), and
Score-based Generative Model (SGM). Then, we summarize the major applications
of generative diffusion models on graphs with a specific focus on molecule and
protein modeling. Finally, we discuss promising directions in generative
diffusion models on graph-structured data. For this survey, we also created a
GitHub project website by collecting the supporting resources for generative
diffusion models on graphs, at the link:
https://github.com/ChengyiLIU-cs/Generative-Diffusion-Models-on-Graphs
Mixture of Diffusers for scene composition and high resolution image generation
February 05, 2023
Álvaro Barbero Jiménez
cs.CV, cs.AI, cs.LG, I.2.6
Diffusion methods have proven very effective at generating images conditioned
on a text prompt. However, although the quality of the generated images is
unprecedented, these methods struggle when trying to generate specific image
compositions. In this paper we present Mixture of
Diffusers, an algorithm that builds over existing diffusion models to provide a
more detailed control over composition. By harmonizing several diffusion
processes acting on different regions of a canvas, it allows generating larger
images, where the location of each object and style is controlled by a separate
diffusion process.
Diffusion Model for Generative Image Denoising
February 05, 2023
Yutong Xie, Minne Yuan, Bin Dong, Quanzheng Li
In supervised learning for image denoising, paired clean and noisy images are
usually collected or synthesised to train a denoising model, with an L2 norm
loss or another distance function as the training objective. This often leads
to over-smoothed results with fewer image details. In this paper, we regard the
denoising task as a problem of estimating the posterior distribution of clean
images conditioned on noisy images, and apply the idea of diffusion models to
realize generative image denoising. According to the noise model of the
denoising task, we redefine the diffusion process so that it differs from the
original one; sampling from the posterior distribution then becomes a reverse
process of dozens of steps starting from the noisy image. We consider three
types of noise models, Gaussian, Gamma, and Poisson, and, with theoretical
guarantees, derive a unified strategy for model training. Our
method is verified through experiments on three types of noise models and
achieves excellent performance.
Eliminating Prior Bias for Semantic Image Editing via Dual-Cycle Diffusion
February 05, 2023
Zuopeng Yang, Tianshu Chu, Xin Lin, Erdun Gao, Daqing Liu, Jie Yang, Chaoyue Wang
The recent success of text-to-image generation diffusion models has also
revolutionized semantic image editing, enabling the manipulation of images
based on query/target texts. Despite these advancements, a significant
challenge lies in the potential introduction of contextual prior bias in
pre-trained models during image editing, e.g., making unexpected modifications
to inappropriate regions. To address this issue, we present a novel approach
called Dual-Cycle Diffusion, which generates an unbiased mask to guide image
editing. The proposed model incorporates a Bias Elimination Cycle that consists
of both a forward path and an inverted path, each featuring a Structural
Consistency Cycle to ensure the preservation of image content during the
editing process. The forward path utilizes the pre-trained model to produce the
edited image, while the inverted path converts the result back to the source
image. The unbiased mask is generated by comparing differences between the
processed source image and the edited image to ensure that both conform to the
same distribution. Our experiments demonstrate the effectiveness of the
proposed method, as it significantly improves the D-CLIP score from 0.272 to
0.283. The code will be available at
https://github.com/JohnDreamer/DualCycleDiffsion.
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories
February 05, 2023
Zijian Zhang, Zhou Zhao, Jun Yu, Qi Tian
Diffusion models have recently exhibited remarkable abilities to synthesize
striking image samples since the introduction of denoising diffusion
probabilistic models (DDPMs). Their key idea is to disrupt images into noise
through a fixed forward process and learn its reverse process to generate
samples from noise in a denoising way. For conditional DDPMs, most existing
practices relate conditions only to the reverse process and fit it to the
reversal of unconditional forward process. We find this will limit the
condition modeling and generation in a small time window. In this paper, we
propose a novel and flexible conditional diffusion model by introducing
conditions into the forward process. We utilize extra latent space to allocate
an exclusive diffusion trajectory for each condition based on some shifting
rules, which disperses condition modeling across all timesteps and improves the
learning capacity of the model. We formulate our method, which we call
\textbf{ShiftDDPMs}, and provide a unified point of view on existing related
methods. Extensive qualitative and quantitative experiments on image synthesis
demonstrate the feasibility and effectiveness of ShiftDDPMs.
ReDi: Efficient Learning-Free Diffusion Inference via Trajectory Retrieval
February 05, 2023
Kexun Zhang, Xianjun Yang, William Yang Wang, Lei Li
Diffusion models show promising generation capability for a variety of data.
Despite their high generation quality, the inference for diffusion models is
still time-consuming due to the numerous sampling iterations required. To
accelerate the inference, we propose ReDi, a simple yet learning-free
Retrieval-based Diffusion sampling framework. From a precomputed knowledge
base, ReDi retrieves a trajectory similar to the partially generated trajectory
at an early stage of generation, skips a large portion of intermediate steps,
and continues sampling from a later step in the retrieved trajectory. We
theoretically prove that the generation performance of ReDi is guaranteed. Our
experiments demonstrate that ReDi improves the model inference efficiency by 2x
speedup. Furthermore, ReDi is able to generalize well in zero-shot cross-domain
image generation such as image stylization.
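A minimal sketch of the retrieval-and-skip idea is given below, assuming a hypothetical sampler interface with full_trajectory and step methods and a plain L2 nearest-neighbour lookup; the paper's actual key construction and distance metric may differ.

# Hedged sketch of retrieval-based trajectory skipping in the spirit of ReDi.
import torch

def build_knowledge_base(sampler, priors, k_early, k_late):
    """Store (early-state key, late-state value) pairs from full trajectories."""
    keys, values = [], []
    for x_T in priors:
        traj = sampler.full_trajectory(x_T)       # assumed helper: list of x_t
        keys.append(traj[k_early].flatten())
        values.append(traj[k_late])
    return torch.stack(keys), values

def redi_sample(sampler, x_T, keys, values, k_early, k_late):
    """Run a few early steps, retrieve the closest stored trajectory,
    jump to its late state, and finish sampling from there."""
    x = x_T
    for t in range(k_early):
        x = sampler.step(x, t)                    # assumed one-step update
    dists = torch.cdist(x.flatten()[None], keys)  # L2 distance to stored keys
    nearest = dists.argmin().item()
    x = values[nearest]                           # skip the intermediate steps
    for t in range(k_late, sampler.num_steps):
        x = sampler.step(x, t)
    return x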
Design Booster: A Text-Guided Diffusion Model for Image Translation with Spatial Layout Preservation
February 05, 2023
Shiqi Sun, Shancheng Fang, Qian He, Wei Liu
Diffusion models are able to generate photorealistic images in arbitrary
scenes. However, when applying diffusion models to image translation, there
exists a trade-off between maintaining spatial structure and high-quality
content. Besides, existing methods are mainly based on test-time optimization
or fine-tuning model for each input image, which are extremely time-consuming
for practical applications. To address these issues, we propose a new approach
for flexible image translation by learning a layout-aware image condition
together with a text condition. Specifically, our method co-encodes images and
text into a new domain during the training phase. In the inference stage, we
can choose images/text or both as the conditions for each time step, which
gives users more flexible control over layout and content. Experimental
comparisons of our method with state-of-the-art methods demonstrate that our
model performs best in both style image translation and semantic image
translation while requiring the shortest time.
SE(3) diffusion model with application to protein backbone generation
February 05, 2023
Jason Yim, Brian L. Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, Tommi Jaakkola
The design of novel protein structures remains a challenge in protein
engineering for applications across biomedicine and chemistry. In this line of
work, a diffusion model over rigid bodies in 3D (referred to as frames) has
shown success in generating novel, functional protein backbones that have not
been observed in nature. However, there exists no principled methodological
framework for diffusion on SE(3), the space of orientation-preserving rigid
motions in $\mathbb{R}^3$, that operates on frames and confers the group invariance. We
address these shortcomings by developing theoretical foundations of SE(3)
invariant diffusion models on multiple frames followed by a novel framework,
FrameDiff, for learning the SE(3) equivariant score over multiple frames. We
apply FrameDiff on monomer backbone generation and find it can generate
designable monomers up to 500 amino acids without relying on a pretrained
protein structure prediction network that has been integral to previous
methods. We find our samples are capable of generalizing beyond any known
protein structure.
Divide and Compose with Score Based Generative Models
February 05, 2023
Sandesh Ghimire, Armand Comas, Davin Hill, Aria Masoomi, Octavia Camps, Jennifer Dy
While score based generative models, or diffusion models, have found success
in image synthesis, they are often coupled with text data or image label to be
able to manipulate and conditionally generate images. Even though manipulation
of images by changing the text prompt is possible, our understanding of the
text embedding and our ability to modify it to edit images is quite limited.
Towards having more control over image manipulation and conditional
generation, we propose to learn image components in an unsupervised manner so
that we can compose those components to generate and manipulate images in an
informed manner. Taking inspiration from energy-based models, we
interpret different score components as the gradient of different energy
functions. We show how score based learning allows us to learn interesting
components and we can visualize them through generation. We also show how this
novel decomposition allows us to compose, generate and modify images in
interesting ways akin to dreaming. We make our code available at
https://github.com/sandeshgh/Score-based-disentanglement
Multi-Source Diffusion Models for Simultaneous Music Generation and Separation
February 04, 2023
Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, Emanuele Rodolà
In this work, we define a diffusion-based generative model capable of both
music synthesis and source separation by learning the score of the joint
probability density of sources sharing a context. Alongside the classic total
inference tasks (i.e., generating a mixture, separating the sources), we also
introduce and experiment on the partial generation task of source imputation,
where we generate a subset of the sources given the others (e.g., play a piano
track that goes well with the drums). Additionally, we introduce a novel
inference method for the separation task based on Dirac likelihood functions.
We train our model on Slakh2100, a standard dataset for musical source
separation, provide qualitative results in the generation settings, and
showcase competitive quantitative results in the source separation setting. Our
method is the first example of a single model that can handle both generation
and separation tasks, thus representing a step toward general audio models.
AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
February 03, 2023
Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, Ping Luo
Diffusion models have demonstrated their powerful generative capability in
many tasks, with great potential to serve as a paradigm for offline
reinforcement learning. However, the quality of the diffusion model is limited
by the insufficient diversity of training data, which hinders the performance
of planning and the generalizability to new tasks. This paper introduces
AdaptDiffuser, an evolutionary planning method with diffusion that can
self-evolve to improve the diffusion model, and hence the planner, not only for
seen tasks but also for adaptation to unseen tasks. AdaptDiffuser enables the
generation of rich synthetic expert data for goal-conditioned tasks using
guidance from reward gradients. It then selects high-quality data via a
discriminator to finetune the diffusion model, which improves the
generalization ability to unseen tasks. Empirical experiments on two benchmark
environments and two carefully designed unseen tasks in KUKA industrial robot
arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For
example, AdaptDiffuser not only outperforms the previous art Diffuser by 20.8%
on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks,
e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data.
More visualization results and demo videos can be found on our project page.
MorDIFF: Recognition Vulnerability and Attack Detectability of Face Morphing Attacks Created by Diffusion Autoencoders
February 03, 2023
Naser Damer, Meiling Fang, Patrick Siebke, Jan Niklas Kolf, Marco Huber, Fadi Boutros
Investigating new methods of creating face morphing attacks is essential to
foresee novel attacks and help mitigate them. Creating morphing attacks is
commonly either performed on the image-level or on the representation-level.
The representation-level morphing has been performed so far based on generative
adversarial networks (GAN) where the encoded images are interpolated in the
latent space to produce a morphed image based on the interpolated vector. Such
a process was constrained by the limited reconstruction fidelity of GAN
architectures. Recent advances in the diffusion autoencoder models have
overcome the GAN limitations, leading to high reconstruction fidelity. This
theoretically makes them a perfect candidate to perform representation-level
face morphing. This work investigates using diffusion autoencoders to create
face morphing attacks by comparing them to a wide range of image-level and
representation-level morphs. Our vulnerability analyses on four
state-of-the-art face recognition models have shown that such models are highly
vulnerable to the created attacks, the MorDIFF, especially when compared to
existing representation-level morphs. Detailed detectability analyses are also
performed on the MorDIFF, showing that they are as challenging to detect as
other morphing attacks created on the image- or representation-level. Data and
morphing script are made public: https://github.com/naserdamer/MorDIFF.
Learning End-to-End Channel Coding with Diffusion Models
February 03, 2023
Muah Kim, Rick Fritschek, Rafael F. Schaefer
It is a known problem that deep-learning-based end-to-end (E2E) channel
coding systems depend on a known and differentiable channel model, due to the
learning process and based on the gradient-descent optimization methods. This
places the challenge to approximate or generate the channel or its derivative
from samples generated by pilot signaling in real-world scenarios. Currently,
there are two prevalent methods to solve this problem. One is to generate the
channel via a generative adversarial network (GAN), and the other is to, in
essence, approximate the gradient via reinforcement learning methods. Other
methods include using score-based methods, variational autoencoders, or
mutual-information-based methods. In this paper, we focus on generative models
and, in particular, on a new promising method called diffusion models, which
have shown a higher quality of generation in image-based tasks. We will show
that diffusion models can be used in wireless E2E scenarios and that they work
as well as Wasserstein GANs while having a more stable training procedure and a
better generalization ability in testing.
Dreamix: Video Diffusion Models are General Video Editors
February 02, 2023
Eyal Molad, Eliahu Horwitz, Dani Valevski, Alex Rav Acha, Yossi Matias, Yael Pritch, Yaniv Leviathan, Yedid Hoshen
Text-driven image and video diffusion models have recently achieved
unprecedented generation realism. While diffusion models have been successfully
applied for image editing, very few works have done so for video editing. We
present the first diffusion-based method that is able to perform text-based
motion and appearance editing of general videos. Our approach uses a video
diffusion model to combine, at inference time, the low-resolution
spatio-temporal information from the original video with new, high resolution
information that it synthesized to align with the guiding text prompt. As
obtaining high fidelity to the original video requires retaining some of its
high-resolution information, we add a preliminary stage of finetuning the model
on the original video, significantly boosting fidelity. We propose to improve
motion editability by a new, mixed objective that jointly finetunes with full
temporal attention and with temporal attention masking. We further introduce a
new framework for image animation. We first transform the image into a coarse
video by simple image processing operations such as replication and perspective
geometric projections, and then use our general video editor to animate it. As
a further application, we can use our method for subject-driven video
generation. Extensive qualitative and numerical experiments showcase the
remarkable editing ability of our method and establish its superior performance
compared to baseline methods.
Are Diffusion Models Vulnerable to Membership Inference Attacks?
February 02, 2023
Jinhao Duan, Fei Kong, Shiqi Wang, Xiaoshuang Shi, Kaidi Xu
cs.CV, cs.AI, cs.CR, cs.LG
Diffusion-based generative models have shown great potential for image
synthesis, but there is a lack of research on the security and privacy risks
they may pose. In this paper, we investigate the vulnerability of diffusion
models to Membership Inference Attacks (MIAs), a common privacy concern. Our
results indicate that existing MIAs designed for GANs or VAEs are largely
ineffective on diffusion models, either due to inapplicable scenarios (e.g.,
requiring the discriminator of GANs) or inappropriate assumptions (e.g., closer
distances between synthetic samples and member samples). To address this gap,
we propose Step-wise Error Comparing Membership Inference (SecMI), a
query-based MIA that infers memberships by assessing the matching of forward
process posterior estimation at each timestep. SecMI follows the common
overfitting assumption in MIA where member samples normally have smaller
estimation errors, compared with hold-out samples. We consider both the
standard diffusion models, e.g., DDPM, and the text-to-image diffusion models,
e.g., Latent Diffusion Models and Stable Diffusion. Experimental results
demonstrate that our method precisely infers membership with high confidence
in both scenarios across multiple datasets.
Code is available at https://github.com/jinhaoduan/SecMI.
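The step-wise error intuition can be sketched as follows for a standard epsilon-prediction diffusion model; this is a simplified proxy statistic rather than the exact SecMI estimator, and the threshold would have to be calibrated on known hold-out data.

# Hedged sketch: members tend to have lower per-timestep estimation error.
import torch

@torch.no_grad()
def stepwise_error(model, x0, alphas_cumprod, timesteps):
    """Average noise-prediction error of x0 over a set of timesteps.
    model(x_t, t) is an assumed epsilon-prediction interface."""
    errs = []
    for t in timesteps:
        ab = alphas_cumprod[t]
        eps = torch.randn_like(x0)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps    # forward diffusion
        eps_hat = model(x_t, torch.tensor([t]))
        errs.append(((eps_hat - eps) ** 2).mean().item())
    return sum(errs) / len(errs)

def infer_membership(model, x0, alphas_cumprod, timesteps, threshold):
    """Flag x0 as a likely training member if its error falls below the threshold."""
    return stepwise_error(model, x0, alphas_cumprod, timesteps) < threshold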
A Theoretical Justification for Image Inpainting using Denoising Diffusion Probabilistic Models
February 02, 2023
Litu Rout, Advait Parulekar, Constantine Caramanis, Sanjay Shakkottai
stat.ML, cs.AI, cs.LG, math.ST, stat.TH
We provide a theoretical justification for sample recovery using diffusion
based image inpainting in a linear model setting. While most inpainting
algorithms require retraining with each new mask, we prove that diffusion based
inpainting generalizes well to unseen masks without retraining. We analyze a
recently proposed popular diffusion based inpainting algorithm called RePaint
(Lugmayr et al., 2022), and show that it has a bias due to misalignment that
hampers sample recovery even in a two-state diffusion process. Motivated by our
analysis, we propose a modified RePaint algorithm we call RePaint$^+$ that
provably recovers the underlying true sample and enjoys a linear rate of
convergence. It achieves this by rectifying the misalignment error present in
drift and dispersion of the reverse process. To the best of our knowledge, this
is the first linear convergence result for a diffusion based image inpainting
algorithm.
Stable Target Field for Reduced Variance Score Estimation in Diffusion Models
February 01, 2023
Yilun Xu, Shangyuan Tong, Tommi Jaakkola
Diffusion models generate samples by reversing a fixed forward diffusion
process. Despite already providing impressive empirical results, these
diffusion model algorithms can be further improved by reducing the variance of
the training targets in their denoising score-matching objective. We argue that
the source of such variance lies in the handling of intermediate noise-variance
scales, where multiple modes in the data affect the direction of reverse paths.
We propose to remedy the problem by incorporating a reference batch which we
use to calculate weighted conditional scores as more stable training targets.
We show that the procedure indeed helps in the challenging intermediate regime
by reducing (the trace of) the covariance of training targets. The new stable
targets can be seen as trading bias for reduced variance, where the bias
vanishes with increasing reference batch size. Empirically, we show that the
new objective improves the image quality, stability, and training speed of
various popular diffusion models across datasets with both general ODE and SDE
solvers. When used in combination with EDM, our method yields a current SOTA
FID of 1.90 with 35 network evaluations on the unconditional CIFAR-10
generation task. The code is available at https://github.com/Newbeeer/stf
Two for One: Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics
February 01, 2023
Marloes Arts, Victor Garcia Satorras, Chin-Wei Huang, Daniel Zuegner, Marco Federici, Cecilia Clementi, Frank Noé, Robert Pinsler, Rianne van den Berg
Coarse-grained (CG) molecular dynamics enables the study of biological
processes at temporal and spatial scales that would be intractable at an
atomistic resolution. However, accurately learning a CG force field remains a
challenge. In this work, we leverage connections between score-based generative
models, force fields and molecular dynamics to learn a CG force field without
requiring any force inputs during training. Specifically, we train a diffusion
generative model on protein structures from molecular dynamics simulations, and
we show that its score function approximates a force field that can directly be
used to simulate CG molecular dynamics. While having a vastly simplified
training setup compared to previous work, we demonstrate that our approach
leads to improved performance across several small- to medium-sized protein
simulations, reproducing the CG equilibrium distribution, and preserving
dynamics of all-atom simulations such as protein folding events.
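The connection can be sketched as overdamped Langevin dynamics driven by the learned score at a small noise level, as below; the score-model interface, step size, and noise scale are illustrative assumptions rather than the paper's simulation settings.

# Hedged sketch: a learned score acting as an effective coarse-grained force field.
import torch

@torch.no_grad()
def langevin_cg_md(score_model, x0, n_steps=10_000, step=1e-4, sigma_small=0.05):
    """Overdamped Langevin simulation driven by the learned score.
    x0: initial CG coordinates, shape [n_beads, 3]
    score_model(x, sigma) is assumed to return an estimate of grad log p_sigma(x)."""
    x = x0.clone()
    traj = [x.clone()]
    for _ in range(n_steps):
        force = score_model(x, sigma_small)      # score plays the role of a force field
        x = x + step * force + (2 * step) ** 0.5 * torch.randn_like(x)
        traj.append(x.clone())
    return torch.stack(traj)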
Conditional Flow Matching: Simulation-Free Dynamic Optimal Transport
February 01, 2023
Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, Yoshua Bengio
Continuous normalizing flows (CNFs) are an attractive generative modeling
technique, but they have been held back by limitations in their
simulation-based maximum likelihood training. We introduce the generalized
conditional flow matching (CFM) technique, a family of simulation-free training
objectives for CNFs. CFM features a stable regression objective like that used
to train the stochastic flow in diffusion models but enjoys the efficient
inference of deterministic flow models. In contrast to both diffusion models
and prior CNF training algorithms, CFM does not require the source distribution
to be Gaussian or require evaluation of its density. A variant of our objective
is optimal transport CFM (OT-CFM), which creates simpler flows that are more
stable to train and lead to faster inference, as evaluated in our experiments.
Furthermore, OT-CFM is the first method to compute dynamic OT in a
simulation-free way. Training CNFs with CFM improves results on a variety of
conditional and unconditional generation tasks, such as inferring single cell
dynamics, unsupervised image translation, and Schrödinger bridge inference.
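The basic CFM objective with straight-line conditional paths and independent pairing can be written in a few lines, as sketched below; OT-CFM additionally re-pairs each minibatch with an optimal-transport coupling before computing the same loss.

# Hedged sketch of a conditional flow matching loss with straight-line paths.
import torch

def cfm_loss(v_theta, x0, x1, sigma_min=1e-4):
    """x0: source-batch samples, x1: data-batch samples, both [B, D].
    v_theta(x, t) is the learned vector field (assumed interface)."""
    B = x0.shape[0]
    t = torch.rand(B, 1, device=x0.device)
    mu_t = (1 - t) * x0 + t * x1                 # conditional path from x0 to x1
    x_t = mu_t + sigma_min * torch.randn_like(x0)
    u_t = x1 - x0                                # target conditional velocity
    return ((v_theta(x_t, t) - u_t) ** 2).mean()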
Diffusion Models for High-Resolution Solar Forecasts
February 01, 2023
Yusuke Hatanaka, Yannik Glaser, Geoff Galgon, Giuseppe Torri, Peter Sadowski
Forecasting future weather and climate is inherently difficult. Machine
learning offers new approaches to increase the accuracy and computational
efficiency of forecasts, but current methods are unable to accurately model
uncertainty in high-dimensional predictions. Score-based diffusion models offer
a new approach to modeling probability distributions over many dependent
variables, and in this work, we demonstrate how they provide probabilistic
forecasts of weather and climate variables at unprecedented resolution, speed,
and accuracy. We apply the technique to day-ahead solar irradiance forecasts by
generating many samples from a diffusion model trained to super-resolve
coarse-resolution numerical weather predictions to high-resolution weather
satellite observations.
Salient Conditional Diffusion for Defending Against Backdoor Attacks
January 31, 2023
Brandon B. May, N. Joseph Tatro, Dylan Walker, Piyush Kumar, Nathan Shnidman
We propose a novel algorithm, Salient Conditional Diffusion (Sancdifi), a
state-of-the-art defense against backdoor attacks. Sancdifi uses a denoising
diffusion probabilistic model (DDPM) to degrade an image with noise and then
recover said image using the learned reverse diffusion. Critically, we compute
saliency map-based masks to condition our diffusion, allowing for stronger
diffusion on the most salient pixels by the DDPM. As a result, Sancdifi is
highly effective at diffusing out triggers in data poisoned by backdoor
attacks. At the same time, it reliably recovers salient features when applied
to clean data. This performance is achieved without requiring access to the
model parameters of the Trojan network, meaning Sancdifi operates as a
black-box defense.
Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models
January 31, 2023
Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, Daniel Cohen-Or
cs.CV, cs.CL, cs.GR, cs.LG
Recent text-to-image generative models have demonstrated an unparalleled
ability to generate diverse and creative imagery guided by a target text
prompt. While revolutionary, current state-of-the-art diffusion models may
still fail in generating images that fully convey the semantics in the given
text prompt. We analyze the publicly available Stable Diffusion model and
assess the existence of catastrophic neglect, where the model fails to generate
one or more of the subjects from the input prompt. Moreover, we find that in
some cases the model also fails to correctly bind attributes (e.g., colors) to
their corresponding subjects. To help mitigate these failure cases, we
introduce the concept of Generative Semantic Nursing (GSN), where we seek to
intervene in the generative process on the fly during inference time to improve
the faithfulness of the generated images. Using an attention-based formulation
of GSN, dubbed Attend-and-Excite, we guide the model to refine the
cross-attention units to attend to all subject tokens in the text prompt and
strengthen - or excite - their activations, encouraging the model to generate
all subjects described in the text prompt. We compare our approach to
alternative approaches and demonstrate that it conveys the desired concepts
more faithfully across a range of text prompts.
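A stripped-down sketch of the latent update behind this kind of semantic guidance is given below; the cross_attention_maps helper and the step size are assumptions standing in for the released implementation's attention hooks and schedules.

# Hedged sketch of attention-based latent refinement during sampling.
import torch

def excite_latent(latent, t, unet_with_attn, prompt_emb, subject_token_ids,
                  step_size=20.0):
    """Strengthen the weakest subject token's cross-attention by a gradient step."""
    latent = latent.detach().requires_grad_(True)
    # Assumed (differentiable) helper returning per-token spatial attention
    # maps of shape [n_tokens, H, W] for the current noise prediction.
    attn = unet_with_attn.cross_attention_maps(latent, t, prompt_emb)
    per_token_max = torch.stack([attn[i].max() for i in subject_token_ids])
    loss = (1.0 - per_token_max).max()      # focus on the most neglected subject
    grad, = torch.autograd.grad(loss, latent)
    return (latent - step_size * grad).detach()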
January 31, 2023
Zihao Wang, Yingyu Yang, Maxime Sermesant, Hervé Delingette, Ona Wu
cs.CV, cs.AI, cs.LG, stat.ML
Cross-modality data translation has attracted great interest in image
computing. Deep generative models (\textit{e.g.}, GANs) have shown performance
improvements in tackling these problems. Nevertheless, as a fundamental
challenge in image translation, the problem of Zero-shot-Learning
Cross-Modality Data Translation with fidelity remains unanswered. This paper
proposes a new unsupervised zero-shot-learning method named Mutual Information
guided Diffusion cross-modality data translation Model (MIDiffusion), which
learns to translate the unseen source data to the target domain. The
MIDiffusion leverages a score-matching-based generative model, which learns the
prior knowledge in the target domain. We propose a differentiable
local-wise-MI-Layer ($LMI$) for conditioning the iterative denoising sampling.
The $LMI$ captures the identical cross-modality features in the statistical
domain for the diffusion guidance; thus, our method does not require retraining
when the source domain is changed, as it does not rely on any direct mapping
between the source and target domains. This advantage is critical for applying
cross-modality data translation methods in practice, as a sufficient amount of
source-domain data is not always available for supervised training. We
empirically show the advanced performance of MIDiffusion in comparison with an
influential group of generative models, including adversarial-based and other
score-matching-based models.
DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models
January 31, 2023
Tao Yang, Yuwang Wang, Yan Lv, Nanning Zheng
Targeting to understand the underlying explainable factors behind
observations and modeling the conditional generation process on these factors,
we connect disentangled representation learning to Diffusion Probabilistic
Models (DPMs) to take advantage of the remarkable modeling ability of DPMs. We
propose a new task, disentanglement of (DPMs): given a pre-trained DPM, without
any annotations of the factors, the task is to automatically discover the
inherent factors behind the observations and disentangle the gradient fields of
DPM into sub-gradient fields, each conditioned on the representation of each
discovered factor. With disentangled DPMs, those inherent factors can be
automatically discovered, explicitly represented, and clearly injected into the
diffusion process via the sub-gradient fields. To tackle this task, we devise
an unsupervised approach named DisDiff, achieving disentangled representation
learning in the framework of DPMs. Extensive experiments on synthetic and
real-world datasets demonstrate the effectiveness of DisDiff.
Transport with Support: Data-Conditional Diffusion Bridges
January 31, 2023
Ella Tamir, Martin Trapp, Arno Solin
The dynamic Schrödinger bridge problem provides an appealing setting for
solving constrained time-series data generation tasks posed as optimal
transport problems. It consists of learning non-linear diffusion processes
using efficient iterative solvers. Recent works have demonstrated
state-of-the-art results (e.g., in modelling single-cell embryo RNA sequences or
sampling from complex posteriors) but are limited to learning bridges with only
initial and terminal constraints. Our work extends this paradigm by proposing
the Iterative Smoothing Bridge (ISB). We integrate Bayesian filtering and
optimal control into learning the diffusion process, enabling the generation of
constrained stochastic processes governed by sparse observations at
intermediate stages and terminal constraints. We assess the effectiveness of
our method on synthetic and real-world data generation tasks and we show that
the ISB generalises well to high-dimensional data, is computationally
efficient, and provides accurate estimates of the marginals at intermediate and
terminal times.
DiffSTG: Probabilistic Spatio-Temporal Graph Forecasting with Denoising Diffusion Models
January 31, 2023
Haomin Wen, Youfang Lin, Yutong Xia, Huaiyu Wan, Qingsong Wen, Roger Zimmermann, Yuxuan Liang
Spatio-temporal graph neural networks (STGNN) have emerged as the dominant
model for spatio-temporal graph (STG) forecasting. Despite their success, they
fail to model intrinsic uncertainties within STG data, which cripples their
practicality in downstream tasks for decision-making. To this end, this paper
focuses on probabilistic STG forecasting, which is challenging due to the
difficulty in modeling uncertainties and complex ST dependencies. In this
study, we present the first attempt to generalize the popular denoising
diffusion probabilistic models to STGs, leading to a novel non-autoregressive
framework called DiffSTG, along with the first denoising network UGnet for STG
in the framework. Our approach combines the spatio-temporal learning
capabilities of STGNNs with the uncertainty measurements of diffusion models.
Extensive experiments validate that DiffSTG reduces the Continuous Ranked
Probability Score (CRPS) by 4%-14%, and Root Mean Squared Error (RMSE) by 2%-7%
over existing methods on three real-world datasets.
Learning Data Representations with Joint Diffusion Models
January 31, 2023
Kamil Deja, Tomasz Trzcinski, Jakub M. Tomczak
Joint machine learning models that allow synthesizing and classifying data
often offer uneven performance between those tasks or are unstable to train. In
this work, we depart from a set of empirical observations that indicate the
usefulness of internal representations built by contemporary deep
diffusion-based generative models not only for generating but also predicting.
We then propose to extend the vanilla diffusion model with a classifier that
allows for stable joint end-to-end training with shared parameterization
between those objectives. The resulting joint diffusion model outperforms
recent state-of-the-art hybrid methods in terms of both classification and
generation quality on all evaluated benchmarks. On top of our joint training
approach, we present how we can directly benefit from shared generative and
discriminative representations by introducing a method for visual
counterfactual explanations.
Optimizing DDPM Sampling with Shortcut Fine-Tuning
January 31, 2023
Ying Fan, Kangwook Lee
In this study, we propose Shortcut Fine-Tuning (SFT), a new approach for
addressing the challenge of fast sampling of pretrained Denoising Diffusion
Probabilistic Models (DDPMs). SFT advocates for the fine-tuning of DDPM
samplers through the direct minimization of Integral Probability Metrics (IPM),
instead of learning the backward diffusion process. This enables samplers to
discover an alternative and more efficient sampling shortcut, deviating from
the backward diffusion process. Inspired by a control perspective, we propose a
new algorithm SFT-PG: Shortcut Fine-Tuning with Policy Gradient, and prove that
under certain assumptions, gradient descent of diffusion models with respect to
IPM is equivalent to performing policy gradient. To the best of our knowledge,
this is
the first attempt to utilize reinforcement learning (RL) methods to train
diffusion models. Through empirical evaluation, we demonstrate that our
fine-tuning method can further enhance existing fast DDPM samplers, resulting
in sample quality comparable to or even surpassing that of the full-step model
across various datasets.
ArchiSound: Audio Generation with Diffusion
January 30, 2023
Flavio Schneider
The recent surge in popularity of diffusion models for image generation has
brought new attention to the potential of these models in other areas of media
generation. One area that has yet to be fully explored is the application of
diffusion models to audio generation. Audio generation requires an
understanding of multiple aspects, such as the temporal dimension, long term
structure, multiple layers of overlapping sounds, and the nuances that only
trained listeners can detect. In this work, we investigate the potential of
diffusion models for audio generation. We propose a set of models to tackle
multiple aspects, including a new method for text-conditional latent audio
diffusion with stacked 1D U-Nets, that can generate multiple minutes of music
from a textual description. For each model, we make an effort to maintain
reasonable inference speed, targeting real-time on a single consumer GPU. In
addition to trained models, we provide a collection of open source libraries
with the hope of simplifying future work in the field. Samples can be found at
https://bit.ly/audio-diffusion. Codes are at
https://github.com/archinetai/audio-diffusion-pytorch.
Extracting Training Data from Diffusion Models
January 30, 2023
Nicholas Carlini, Jamie Hayes, Milad Nasr, Matthew Jagielski, Vikash Sehwag, Florian Tramèr, Borja Balle, Daphne Ippolito, Eric Wallace
Image diffusion models such as DALL-E 2, Imagen, and Stable Diffusion have
attracted significant attention due to their ability to generate high-quality
synthetic images. In this work, we show that diffusion models memorize
individual images from their training data and emit them at generation time.
With a generate-and-filter pipeline, we extract over a thousand training
examples from state-of-the-art models, ranging from photographs of individual
people to trademarked company logos. We also train hundreds of diffusion models
in various settings to analyze how different modeling and data decisions affect
privacy. Overall, our results show that diffusion models are much less private
than prior generative models such as GANs, and that mitigating these
vulnerabilities may require new advances in privacy-preserving training.
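A rough sketch of one ingredient of such a generate-and-filter pipeline is given below: prompts whose independent generations collapse onto near-identical images are flagged as candidate memorizations. The distance metric and thresholds are illustrative, not the paper's.

# Hedged sketch of a generate-and-filter memorization probe.
import itertools
import torch

def near_duplicate_fraction(images, threshold=0.05):
    """images: [N, C, H, W] in [0, 1]; fraction of generation pairs whose
    relative L2 distance falls below the threshold."""
    flat = images.flatten(1)
    n, close = flat.shape[0], 0
    for i, j in itertools.combinations(range(n), 2):
        if torch.dist(flat[i], flat[j]) / flat[i].norm() < threshold:
            close += 1
    return close / (n * (n - 1) / 2)

def probe_prompt(generate_fn, prompt, n=16, dup_frac=0.3):
    """generate_fn(prompt, n) -> [n, C, H, W] (assumed interface); flag the
    prompt if its independent generations collapse onto near-identical images."""
    imgs = generate_fn(prompt, n)
    return near_duplicate_fraction(imgs) > dup_frac, imgs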
ERA-Solver: Error-Robust Adams Solver for Fast Sampling of…
January 30, 2023
Shengmeng Li, Luping Liu, Zenghao Chai, Runnan Li, Xu Tan
Though denoising diffusion probabilistic models (DDPMs) have achieved
remarkable generation results, the low sampling efficiency of DDPMs still
limits further applications. Since DDPMs can be formulated as diffusion
ordinary differential equations (ODEs), various fast sampling methods can be
derived from solving diffusion ODEs. However, we notice that previous sampling
methods with a fixed analytical form are not robust to the error in the noise
estimated from pretrained diffusion models. In this work, we construct an
error-robust Adams solver (ERA-Solver), which utilizes the implicit Adams
numerical method that consists of a predictor and a corrector. Different from
the traditional predictor based on explicit Adams methods, we leverage a
Lagrange interpolation function as the predictor, which is further enhanced
with an error-robust strategy to adaptively select the Lagrange bases with
lower error in the estimated noise. Experiments on CIFAR-10, LSUN-Church, and
LSUN-Bedroom datasets demonstrate that our proposed ERA-Solver achieves 5.14,
9.42, and 9.69 Fréchet Inception Distance (FID) for image generation, with only
10 network evaluations.
GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration
January 30, 2023
Naoki Murata, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon
cs.LG, cs.AI, cs.CV, cs.SD, eess.AS
Pre-trained diffusion models have been successfully used as priors in a
variety of linear inverse problems, where the goal is to reconstruct a signal
from noisy linear measurements. However, existing approaches require knowledge
of the linear operator. In this paper, we propose GibbsDDRM, an extension of
Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the
linear measurement operator is unknown. GibbsDDRM constructs a joint
distribution of the data, measurements, and linear operator by using a
pre-trained diffusion model for the data prior, and it solves the problem by
posterior sampling with an efficient variant of a Gibbs sampler. The proposed
method is problem-agnostic, meaning that a pre-trained diffusion model can be
applied to various inverse problems without fine-tuning. In experiments, it
achieved high performance on both blind image deblurring and vocal
dereverberation tasks, despite the use of simple generic priors for the
underlying linear operators.
Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
January 30, 2023
Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, Zhou Zhao
cs.SD, cs.LG, cs.MM, eess.AS
Large-scale multimodal generative modeling has created milestones in
text-to-image and text-to-video generation. Its application to audio still lags
behind for two main reasons: the lack of large-scale datasets with high-quality
text-audio pairs, and the complexity of modeling long continuous audio data. In
this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that
addresses these gaps by 1) introducing pseudo prompt enhancement with a
distill-then-reprogram approach, which alleviates data scarcity with orders of
magnitude more concept compositions by using language-free audio; 2) leveraging
spectrogram autoencoder to predict the self-supervised audio representation
instead of waveforms. Together with robust contrastive language-audio
pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art
results in both objective and subjective benchmark evaluation. Moreover, we
present its controllability and generalization for X-to-Audio with “No Modality
Left Behind”, for the first time unlocking the ability to generate
high-definition, high-fidelity audios given a user-defined modality input.
Audio samples are available at https://Text-to-Audio.github.io
AudioLDM: Text-to-Audio Generation with Latent Diffusion Models
January 29, 2023
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, Mark D. Plumbley
cs.SD, cs.AI, cs.MM, eess.AS, eess.SP
Text-to-audio (TTA) system has recently gained attention for its ability to
synthesize general audio based on text descriptions. However, previous studies
in TTA have limited generation quality with high computational costs. In this
study, we propose AudioLDM, a TTA system that is built on a latent space to
learn the continuous audio representations from contrastive language-audio
pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs
with audio embedding while providing text embedding as a condition during
sampling. By learning the latent representations of audio signals and their
compositions without modeling the cross-modal relationship, AudioLDM is
advantageous in both generation quality and computational efficiency. Trained
on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA
performance measured by both objective and subjective metrics (e.g., frechet
distance). Moreover, AudioLDM is the first TTA system that enables various
text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion.
Our implementation and demos are available at https://audioldm.github.io.
SEGA: Instructing Diffusion using Semantic Dimensions
January 28, 2023
Manuel Brack, Felix Friedrich, Dominik Hintersdorf, Lukas Struppek, Patrick Schramowski, Kristian Kersting
Text-to-image diffusion models have recently received a lot of interest for
their astonishing ability to produce high-fidelity images from text only.
However, achieving one-shot generation that aligns with the user’s intent is
nearly impossible, yet small changes to the input prompt often result in very
different images. This leaves the user with little semantic control. To put the
user in control, we show how to interact with the diffusion process to flexibly
steer it along semantic directions. This semantic guidance (SEGA) generalizes
to any generative architecture using classifier-free guidance. More
importantly, it allows for subtle and extensive edits, changes in composition
and style, as well as optimizing the overall artistic conception. We
demonstrate SEGA’s effectiveness on both latent and pixel-based diffusion
models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of
tasks, thus providing strong evidence for its versatility, flexibility, and
improvements over existing methods.
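In its simplest form, this kind of semantic guidance adds an extra edit direction on top of classifier-free guidance, roughly as sketched below; SEGA's element-wise thresholding and warm-up scheduling are omitted, and the scales are illustrative.

# Hedged, simplified sketch of adding a semantic edit direction to guidance.
import torch

def guided_noise(eps_uncond, eps_text, eps_edit, s_cfg=7.5, s_edit=5.0, sign=+1):
    """eps_*: U-Net noise predictions under empty, text, and edit prompts.
    sign = +1 pushes the sample toward the edit concept, -1 away from it."""
    cfg_term = s_cfg * (eps_text - eps_uncond)        # classifier-free guidance
    edit_dir = eps_edit - eps_uncond                  # semantic edit direction
    return eps_uncond + cfg_term + sign * s_edit * edit_dir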
Minimizing Trajectory Curvature of ODE-based Generative Models
January 27, 2023
Sangyun Lee, Beomsu Kim, Jong Chul Ye
cs.LG, cs.AI, cs.CV, stat.ML
Recent ODE/SDE-based generative models, such as diffusion models, rectified
flows, and flow matching, define a generative process as a time reversal of a
fixed forward process. Even though these models show impressive performance on
large-scale datasets, numerical simulation requires multiple evaluations of a
neural network, leading to a slow sampling speed. We attribute the reason to
the high curvature of the learned generative trajectories, as it is directly
related to the truncation error of a numerical solver. Based on the
relationship between the forward process and the curvature, here we present an
efficient method of training the forward process to minimize the curvature of
generative trajectories without any ODE/SDE simulation. Experiments show that
our method achieves a lower curvature than previous models and, therefore,
decreased sampling costs while maintaining competitive performance. Code is
available at https://github.com/sangyun884/fast-ode.
Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion
January 27, 2023
Flavio Schneider, Ojasv Kamal, Zhijing Jin, Bernhard Schölkopf
cs.CL, cs.LG, cs.SD, eess.AS
Recent years have seen the rapid development of large generative models for
text; however, much less research has explored the connection between text and
another “language” of communication – music. Music, much like text, can convey
emotions, stories, and ideas, and has its own unique structure and syntax. In
our work, we bridge text and music via a text-to-music generation model that is
highly efficient, expressive, and can handle long-term structure. Specifically,
we develop Moûsai, a cascading two-stage latent diffusion model that can
generate multiple minutes of high-quality stereo music at 48kHz from textual
descriptions. Moreover, our model features high efficiency, which enables
real-time inference on a single consumer GPU with a reasonable speed. Through
experiments and property analyses, we show our model’s competence over a
variety of criteria compared with existing music generation models. Lastly, to
promote the open-source culture, we provide a collection of open-source
libraries with the hope of facilitating future work in the field. We
open-source the following: Codes:
https://github.com/archinetai/audio-diffusion-pytorch; music samples for this
paper: http://bit.ly/44ozWDH; all music samples for all models:
https://bit.ly/audio-diffusion.
Input Perturbation Reduces Exposure Bias in Diffusion Models
January 27, 2023
Mang Ning, Enver Sangineto, Angelo Porrello, Simone Calderara, Rita Cucchiara
Denoising Diffusion Probabilistic Models have shown an impressive generation
quality, although their long sampling chain leads to high computational costs.
In this paper, we observe that a long sampling chain also leads to an error
accumulation phenomenon, which is similar to the exposure bias problem in
autoregressive text generation. Specifically, we note that there is a
discrepancy between training and testing, since the former is conditioned on
the ground truth samples, while the latter is conditioned on the previously
generated results. To alleviate this problem, we propose a very simple but
effective training regularization, consisting in perturbing the ground truth
samples to simulate the inference time prediction errors. We empirically show
that, without affecting the recall and precision, the proposed input
perturbation leads to a significant improvement in the sample quality while
reducing both the training and the inference times. For instance, on CelebA
64$\times$64, we achieve a new state-of-the-art FID score of 1.27, while saving
37.5% of the training time. The code is publicly available at
https://github.com/forever208/DDPM-IP
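The regularization can be sketched in a few lines for an epsilon-prediction DDPM: the network receives a slightly perturbed noisy input but is still trained against the original noise. The exact perturbation form and the value of gamma are assumptions.

# Hedged sketch of input perturbation as a training-time regularizer.
import torch

def ddpm_ip_loss(model, x0, alphas_cumprod, gamma=0.1):
    """x0: clean batch [B, ...]; model(x_t, t) is an assumed epsilon predictor."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    ab = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)
    xi = torch.randn_like(x0)                        # extra perturbation noise
    x_t_perturbed = ab.sqrt() * x0 + (1 - ab).sqrt() * (eps + gamma * xi)
    eps_hat = model(x_t_perturbed, t)
    return ((eps_hat - eps) ** 2).mean()             # regress the original noise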
Image Restoration with Mean-Reverting Stochastic Differential Equations
January 27, 2023
Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön
This paper presents a stochastic differential equation (SDE) approach for
general-purpose image restoration. The key construction consists in a
mean-reverting SDE that transforms a high-quality image into a degraded
counterpart as a mean state with fixed Gaussian noise. Then, by simulating the
corresponding reverse-time SDE, we are able to restore the origin of the
low-quality image without relying on any task-specific prior knowledge.
Crucially, the proposed mean-reverting SDE has a closed-form solution, allowing
us to compute the ground truth time-dependent score and learn it with a neural
network. Moreover, we propose a maximum likelihood objective to learn an
optimal reverse trajectory that stabilizes the training and improves the
restoration results. The experiments show that our proposed method achieves
highly competitive performance in quantitative comparisons on image deraining,
deblurring, and denoising, setting a new state-of-the-art on two deraining
datasets. Finally, the general applicability of our approach is further
demonstrated via qualitative results on image super-resolution, inpainting, and
dehazing. Code is available at
https://github.com/Algolzw/image-restoration-sde.
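The forward degradation process can be sketched as an Euler-Maruyama simulation of the mean-reverting SDE dx = theta (mu - x) dt + sigma dW, as below; the paper works with closed-form marginals, so this explicit simulation is only for illustration, and the coefficients are placeholders.

# Hedged sketch: a clean image drifts toward its degraded counterpart with noise.
import torch

def forward_mean_reverting(x0, mu, theta=1.0, sigma=0.5, n_steps=100, T=1.0):
    """x0: clean image, mu: degraded counterpart (same shape)."""
    dt = T / n_steps
    x = x0.clone()
    for _ in range(n_steps):
        drift = theta * (mu - x)                    # pull toward the degraded mean
        x = x + drift * dt + sigma * (dt ** 0.5) * torch.randn_like(x)
    return x                                        # noisy, degraded state at time T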
A denoising diffusion model for fluid flow prediction
January 27, 2023
Gefan Yang, Stefan Sommer
We propose a novel denoising diffusion generative model for predicting
nonlinear fluid fields named FluidDiff. By performing a diffusion process, the
model is able to learn a complex representation of the high-dimensional dynamic
system, and then Langevin sampling is used to generate predictions for the flow
state under specified initial conditions. The model is trained with finite,
discrete fluid simulation data. We demonstrate that our model has the capacity
to model the distribution of simulated training data and that it gives accurate
predictions on the test data. Without encoded prior knowledge of the underlying
physical system, it shares competitive performance with other deep learning
models for fluid prediction, which is promising for the investigation of new
computational fluid dynamics methods.
Accelerating Guided Diffusion Sampling with Splitting Numerical Methods
January 27, 2023
Suttisak Wizadwongsa, Supasorn Suwajanakorn
Guided diffusion is a technique for conditioning the output of a diffusion
model at sampling time without retraining the network for each specific task.
One drawback of diffusion models, however, is their slow sampling process.
Recent techniques can accelerate unguided sampling by applying high-order
numerical methods to the sampling process when viewed as differential
equations. In contrast, we discover that the same techniques do not work
for guided sampling, and little has been explored about accelerating it. This
paper explores the culprit of this problem and provides a solution based on
operator splitting methods, motivated by our key finding that classical
high-order numerical methods are unsuitable for the conditional function. Our
proposed method can re-utilize the high-order methods for guided sampling and
can generate images with the same quality as a 250-step DDIM baseline using
32-58% less sampling time on ImageNet256. We also demonstrate usage on a wide
variety of conditional generation tasks, such as text-to-image generation,
colorization, inpainting, and super-resolution.
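A rough sketch of the splitting idea under simple assumptions: one guided step is split into an unguided sub-step handled by any high-order solver, followed by a cheap first-order (Euler) update for the conditioning gradient. The function names and the Lie-Trotter ordering are placeholders; the paper analyzes several concrete splitting schemes.

```python
def split_guided_step(x, t, dt, diffusion_step, guidance_grad, scale=1.0):
    """One guided sampling step via operator splitting (illustrative sketch).

    diffusion_step: high-order solver update for the unguided ODE/SDE term.
    guidance_grad:  gradient of the log conditional term (e.g. classifier guidance).
    """
    x = diffusion_step(x, t, dt)                 # accurate solver, no guidance
    x = x + dt * scale * guidance_grad(x, t)     # simple Euler step for the condition
    return x
```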
PLay: Parametrically Conditioned Layout Generation using Latent Diffusion
January 27, 2023
Chin-Yi Cheng, Forrest Huang, Gang Li, Yang Li
Layout design is an important task in various design fields, including user
interface, document, and graphic design. As this task requires tedious manual
effort by designers, prior works have attempted to automate this process using
generative models, but commonly fell short of providing intuitive user controls
and achieving design objectives. In this paper, we build a conditional latent
diffusion model, PLay, that generates parametrically conditioned layouts in
vector graphic space from user-specified guidelines, which are commonly used by
designers for representing their design intents in current practices. Our
method outperforms prior works across three datasets on metrics including FID
and FD-VG, and in a user study. Moreover, it brings a novel and interactive
experience to professional layout design processes.
Diffusion Denoising for Low-Dose-CT Model
January 27, 2023
Runyi Li
Low-dose Computed Tomography (LDCT) reconstruction is an important task in
medical image analysis. Recent years have seen many deep learning based
methods that have proved effective in this area. However, these methods mostly
follow a supervised architecture, which requires paired CT images at full dose and
quarter dose, and the solution is highly dependent on specific measurements. In
this work, we introduce the Denoising Diffusion LDCT Model, dubbed DDLM, which
generates noise-free CT images using conditioned sampling. DDLM uses a pretrained
model and requires no training or tuning, so our approach is unsupervised.
Experiments on LDCT images show that DDLM achieves comparable performance with
less inference time, surpassing other state-of-the-art methods and proving both
accurate and efficient. Implementation code will be made public soon.
3DShape2VecSet: A 3D Shape Representation for Neural Fields and Generative Diffusion Models
January 26, 2023
Biao Zhang, Jiapeng Tang, Matthias Niessner, Peter Wonka
We introduce 3DShape2VecSet, a novel shape representation for neural fields
designed for generative diffusion models. Our shape representation can encode
3D shapes given as surface models or point clouds, and represents them as
neural fields. The concept of neural fields has previously been combined with a
global latent vector, a regular grid of latent vectors, or an irregular grid of
latent vectors. Our new representation encodes neural fields on top of a set of
vectors. We draw from multiple concepts, such as the radial basis function
representation and the cross attention and self-attention function, to design a
learnable representation that is especially suitable for processing with
transformers. Our results show improved performance in 3D shape encoding and 3D
shape generative modeling tasks. We demonstrate a wide variety of generative
applications: unconditioned generation, category-conditioned generation,
text-conditioned generation, point-cloud completion, and image-conditioned
generation.
simple diffusion: End-to-end diffusion for high resolution images
January 26, 2023
Emiel Hoogeboom, Jonathan Heek, Tim Salimans
Currently, applying diffusion models in pixel space of high resolution images
is difficult. Instead, existing approaches focus on diffusion in lower
dimensional spaces (latent diffusion), or have multiple super-resolution levels
of generation referred to as cascades. The downside is that these approaches
add additional complexity to the diffusion framework.
This paper aims to improve denoising diffusion for high resolution images
while keeping the model as simple as possible. The paper is centered around the
research question: How can one train a standard denoising diffusion model on
high resolution images, and still obtain performance comparable to these
alternate approaches?
The four main findings are: 1) the noise schedule should be adjusted for high
resolution images, 2) it is sufficient to scale only a particular part of the
architecture, 3) dropout should be added at specific locations in the
architecture, and 4) downsampling is an effective strategy to avoid high
resolution feature maps. Combining these simple yet effective techniques, we
achieve state-of-the-art on image generation among diffusion models without
sampling modifiers on ImageNet.
Dual Diffusion Architecture for Fisheye Image Rectification: Synthetic-to-Real Generalization
January 26, 2023
Shangrong Yang, Chunyu Lin, Kang Liao, Yao Zhao
Fisheye image rectification has a long-term unresolved issue with
synthetic-to-real generalization. In most previous works, the model trained on
the synthetic images obtains unsatisfactory performance on the real-world
fisheye image. To this end, we propose a Dual Diffusion Architecture (DDA) for
the fisheye rectification with a better generalization ability. The proposed
DDA is simultaneously trained with paired synthetic fisheye images and
unlabeled real fisheye images. By gradually introducing noises, the synthetic
and real fisheye images can eventually develop into a consistent noise
distribution, improving the generalization and achieving unlabeled real fisheye
correction. The original image serves as the prior guidance in existing DDPMs
(Denoising Diffusion Probabilistic Models). However, the non-negligible
indeterminate relationship between the prior condition and the target affects
the generation performance. Especially in the rectification task, the radial
distortion can cause significant artifacts. Therefore, we provide an
unsupervised one-pass network that produces a plausible new condition to
strengthen guidance. This network can be regarded as an alternative scheme for
quickly producing reliable results without iterative inference. Compared with the
state-of-the-art methods, our approach achieves superior performance in both
synthetic and real fisheye image correction.
On the Importance of Noise Scheduling for Diffusion Models
January 26, 2023
Ting Chen
cs.CV, cs.GR, cs.LG, cs.MM
We empirically study the effect of noise scheduling strategies for denoising
diffusion generative models. There are three findings: (1) the noise scheduling
is crucial for the performance, and the optimal one depends on the task (e.g.,
image sizes), (2) when increasing the image size, the optimal noise scheduling
shifts towards a noisier one (due to increased redundancy in pixels), and (3)
simply scaling the input data by a factor of $b$ while keeping the noise
schedule function fixed (equivalent to shifting the logSNR by $\log b$) is a
good strategy across image sizes. This simple recipe, when combined with
recently proposed Recurrent Interface Network (RIN), yields state-of-the-art
pixel-based diffusion models for high-resolution images on ImageNet, enabling
single-stage, end-to-end generation of diverse and high-fidelity images at
1024$\times$1024 resolution (without upsampling/cascades).
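A minimal sketch of the "scale the input, keep the schedule" recipe, under assumptions: the clean image is multiplied by a constant factor $b$ before the usual noising, which lowers the effective signal-to-noise ratio at every step when $b<1$. The variance-preserving form, the `gamma_fn` interface, and the final re-normalization are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def noised_input_with_scaling(x0, t, gamma_fn, b=0.7):
    """Forward noising with input scaling (sketch); gamma_fn(t) in [0, 1] is
    the unchanged noise schedule, b is the constant input scaling factor."""
    g = gamma_fn(t).view(-1, 1, 1, 1)
    x_t = g.sqrt() * (b * x0) + (1 - g).sqrt() * torch.randn_like(x0)
    # keep the network input at roughly unit variance
    return x_t / x_t.flatten(1).std(dim=1).view(-1, 1, 1, 1)
```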
Separate And Diffuse: Using a Pretrained Diffusion Model
January 25, 2023
Shahar Lutati, Eliya Nachmani, Lior Wolf
The problem of speech separation, also known as the cocktail party problem,
refers to the task of isolating a single speech signal from a mixture of speech
signals. Previous work on source separation derived an upper bound for the
source separation task in the domain of human speech. This bound is derived for
deterministic models. Recent advancements in generative models challenge this
bound. We show how the upper bound can be generalized to the case of random
generative models. Applying a diffusion model Vocoder that was pretrained to
model single-speaker voices on the output of a deterministic separation model
leads to state-of-the-art separation results. It is shown that this requires
one to combine the output of the separation model with that of the diffusion
model. In our method, a linear combination is performed, in the frequency
domain, using weights that are inferred by a learned model. We show
state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple
benchmarks. In particular, for two speakers, our method is able to surpass what
was previously considered the upper performance bound.
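A hedged sketch of the frequency-domain combination: take STFTs of the deterministic separator's output and of the diffusion vocoder's output, blend them with weights predicted by a small learned model, and invert back to a waveform. The STFT parameters and the single weight tensor `alpha` are illustrative assumptions.

```python
import torch

def combine_in_frequency(sep_wav, voc_wav, alpha, n_fft=1024, hop=256):
    """Blend separator and vocoder outputs in the STFT domain (sketch).
    `alpha` broadcasts over the spectrogram; in practice it would be
    produced per time-frequency bin by a learned weighting model."""
    win = torch.hann_window(n_fft, device=sep_wav.device)
    S = torch.stft(sep_wav, n_fft, hop, window=win, return_complex=True)
    V = torch.stft(voc_wav, n_fft, hop, window=win, return_complex=True)
    mixed = alpha * S + (1 - alpha) * V
    return torch.istft(mixed, n_fft, hop, window=win, length=sep_wav.shape[-1])
```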
On the Mathematics of Diffusion Models
January 25, 2023
David McAllester
This paper gives direct derivations of the differential equations and
likelihood formulas of diffusion models assuming only knowledge of Gaussian
distributions. A VAE analysis derives both forward and backward stochastic
differential equations (SDEs) as well as non-variational integral expressions
for likelihood formulas. A score-matching analysis derives the reverse
diffusion ordinary differential equation (ODE) and a family of
reverse-diffusion SDEs parameterized by noise level. The paper presents the
mathematics directly with attributions saved for a final section.
Imitating Human Behaviour with Diffusion Models
January 25, 2023
Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, Sam Devlin
Diffusion models have emerged as powerful generative models in the
text-to-image domain. This paper studies their application as
observation-to-action models for imitating human behaviour in sequential
environments. Human behaviour is stochastic and multimodal, with structured
correlations between action dimensions. Meanwhile, standard modelling choices
in behaviour cloning are limited in their expressiveness and may introduce bias
into the cloned policy. We begin by pointing out the limitations of these
choices. We then propose that diffusion models are an excellent fit for
imitating human behaviour, since they learn an expressive distribution over the
joint action space. We introduce several innovations to make diffusion models
suitable for sequential environments; designing suitable architectures,
investigating the role of guidance, and developing reliable sampling
strategies. Experimentally, diffusion models closely match human demonstrations
in a simulated robotic control task and a modern 3D gaming environment.
Score Matching via Differentiable Physics
January 24, 2023
Benjamin J. Holzschuh, Simona Vegetti, Nils Thuerey
We propose to solve inverse problems involving the temporal evolution of
physics systems by leveraging recent advances from diffusion models. Our method
moves the system’s current state backward in time step by step by combining an
approximate inverse physics simulator and a learned correction function. A
central insight of our work is that training the learned correction with a
single-step loss is equivalent to a score matching objective, while recursively
predicting longer parts of the trajectory during training relates to maximum
likelihood training of a corresponding probability flow. We highlight the
advantages of our algorithm compared to standard denoising score matching and
implicit score matching, as well as fully learned baselines for a wide range of
inverse physics problems. The resulting inverse solver has excellent accuracy
and temporal stability and, in contrast to other learned inverse solvers,
allows for sampling the posterior of the solutions.
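The single-step objective can be pictured with a short sketch, assuming an approximate inverse simulator and a learned correction network with placeholder signatures: one backward step is their sum, regressed against the true previous state.

```python
import torch
import torch.nn.functional as F

def single_step_loss(correction_net, approx_inverse_sim, x_t, x_prev, t):
    """Single-step training loss (sketch): inverse physics step plus learned
    correction, compared to the true earlier state. The paper shows this is
    equivalent to a score matching objective; signatures here are assumed."""
    x_prev_hat = approx_inverse_sim(x_t, t) + correction_net(x_t, t)
    return F.mse_loss(x_prev_hat, x_prev)
```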
Bipartite Graph Diffusion Model for Human Interaction Generation
January 24, 2023
Baptiste Chopin, Hao Tang, Mohamed Daoudi
The generation of natural human motion interactions is a hot topic in
computer vision and computer animation. It is a challenging task due to the
diversity of possible human motion interactions. Diffusion models, which have
already shown remarkable generative capabilities in other domains, are a good
candidate for this task. In this paper, we introduce a novel bipartite graph
diffusion method (BiGraphDiff) to generate human motion interactions between
two persons. Specifically, bipartite node sets are constructed to model the
inherent geometric constraints between skeleton nodes during interactions. The
interaction graph diffusion model is transformer-based, combining some
state-of-the-art motion methods. We show that the proposed method achieves new
state-of-the-art results on leading benchmarks for the human interaction
generation task.
DiffMotion: Speech-Driven Gesture Synthesis Using Denoising Diffusion Model
January 24, 2023
Fan Zhang, Naye Ji, Fuxing Gao, Yongping Li
cs.GR, cs.CV, cs.HC, cs.SD, eess.AS, eess.IV
Speech-driven gesture synthesis is a field of growing interest in virtual
human creation. However, a critical challenge is the inherent intricate
one-to-many mapping between speech and gestures. Previous studies have explored
and achieved significant progress with generative models. Nevertheless, most
synthesized gestures still appear far less natural than human motion. This paper
presents DiffMotion, a novel speech-driven gesture synthesis architecture based on
diffusion models. The model comprises an autoregressive temporal encoder and a
denoising diffusion probabilistic module. The encoder extracts the temporal
context of the speech input and historical gestures. The diffusion module
learns a parameterized Markov chain to gradually convert a simple distribution
into a complex distribution and generates the gestures according to the
accompanied speech. Compared with baselines, objective and subjective
evaluations confirm that our approach can produce natural and diverse
gesticulation and demonstrate the benefits of diffusion-based models on
speech-driven gesture synthesis.
Membership Inference of Diffusion Models
January 24, 2023
Hailong Hu, Jun Pang
Recent years have witnessed the tremendous success of diffusion models in
data synthesis. However, when diffusion models are applied to sensitive data,
they also give rise to severe privacy concerns. In this paper, we
systematically present the first study about membership inference attacks
against diffusion models, which aims to infer whether a sample was used to
train the model. Two attack methods are proposed, namely loss-based and
likelihood-based attacks. Our attack methods are evaluated on several
state-of-the-art diffusion models, over different datasets in relation to
privacy-sensitive data. Extensive experimental evaluations show that our
attacks can achieve remarkable performance. Furthermore, we exhaustively
investigate various factors which can affect attack performance. Finally, we
also evaluate the performance of our attack methods on diffusion models trained
with differential privacy.
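As a rough illustration of the loss-based attack, one can score each sample by its denoising error at a few fixed timesteps and threshold it; samples seen during training tend to score lower. The timesteps and the aggregation below are assumptions, not the paper's exact protocol.

```python
import torch

@torch.no_grad()
def ddpm_loss_score(model, x, alphas_cumprod, t_eval=(10, 50, 100, 200)):
    """Per-sample denoising error at a few fixed timesteps (sketch); lower
    scores suggest the sample was more likely part of the training set."""
    scores = []
    for t_int in t_eval:
        t = torch.full((x.shape[0],), t_int, device=x.device, dtype=torch.long)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(x)
        x_t = a_bar.sqrt() * x + (1 - a_bar).sqrt() * eps
        err = (model(x_t, t) - eps).pow(2).flatten(1).mean(dim=1)
        scores.append(err)
    return torch.stack(scores).mean(dim=0)   # threshold this to decide membership
```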
January 23, 2023
Qitian Wu, Chenxiao Yang, Wentao Zhao, Yixuan He, David Wipf, Junchi Yan
Real-world data generation often involves complex inter-dependencies among
instances, violating the IID-data hypothesis of standard learning paradigms and
posing a challenge for uncovering the geometric structures for learning desired
instance representations. To this end, we introduce an energy constrained
diffusion model which encodes a batch of instances from a dataset into
evolutionary states that progressively incorporate other instances’ information
by their interactions. The diffusion process is constrained by descent criteria
w.r.t. a principled energy function that characterizes the global consistency
of instance representations over latent structures. We provide rigorous theory
that implies closed-form optimal estimates for the pairwise diffusion strength
among arbitrary instance pairs, which gives rise to a new class of neural
encoders, dubbed as DIFFormer (diffusion-based Transformers), with two
instantiations: a simple version with linear complexity for prohibitive
instance numbers, and an advanced version for learning complex structures.
Experiments highlight the wide applicability of our model as a general-purpose
encoder backbone with superior performance in various tasks, such as node
classification on large graphs, semi-supervised image/text classification, and
spatial-temporal dynamics prediction.
RainDiffusion: When Unsupervised Learning Meets Diffusion
January 23, 2023
Mingqiang Wei, Yiyang Shen, Yongzhen Wang, Haoran Xie, Jing Qin, Fu Lee Wang
Recent diffusion models show great potential in generative modeling tasks.
This motivates us to raise an intriguing question - What will happen when
unsupervised learning meets diffusion models for real-world image deraining?
Before answering it, we observe two major obstacles of diffusion models in
real-world image deraining: the need for paired training data and the limited
utilization of multi-scale rain patterns. To overcome the obstacles, we propose
RainDiffusion, the first real-world image deraining paradigm based on diffusion
models. RainDiffusion is a non-adversarial training paradigm that introduces
stable training on unpaired real-world data, rather than weakly adversarial
training, setting a new standard for real-world image deraining. It
consists of two cooperative branches: Non-diffusive Translation Branch (NTB)
and Diffusive Translation Branch (DTB). NTB exploits a cycle-consistent
architecture to bypass the difficulty in unpaired training of regular diffusion
models by generating initial clean/rainy image pairs. Given initial image
pairs, DTB leverages multi-scale diffusion models to progressively refine the
desired output via diffusive generative and multi-scale priors, to obtain a
better generalization capacity of real-world image deraining. Extensive
experiments confirm the superiority of our RainDiffusion over eight
un/semi-supervised methods and show its competitive advantages over seven
fully-supervised ones.
DiffSDS: A language diffusion model for protein backbone inpainting under geometric conditions and constraints
January 22, 2023
Zhangyang Gao, Cheng Tan, Stan Z. Li
Have you ever been troubled by the complexity and computational cost of SE(3)
protein structure modeling and been amazed by the simplicity and power of
language modeling? Recent work has shown promise in simplifying protein
structures as sequences of protein angles; therefore, language models could be
used for unconstrained protein backbone generation. Unfortunately, such
simplification is unsuitable for the constrained protein inpainting problem,
where the model needs to recover masked structures conditioned on unmasked
ones, as it dramatically increases the computing cost of geometric constraints.
To overcome this dilemma, we suggest inserting a hidden \textbf{a}tomic
\textbf{d}irection \textbf{s}pace (\textbf{ADS}) upon the language model,
converting invariant backbone angles into equivalent direction vectors and
preserving the simplicity, called the Seq2Direct encoder ($\text{Enc}_{s2d}$).
Geometric constraints can be efficiently imposed on the newly introduced
direction space. A Direct2Seq decoder ($\text{Dec}_{d2s}$) with mathematical
guarantees is also introduced to develop an \textbf{SDS}
($\text{Enc}_{s2d}$+$\text{Dec}_{d2s}$) model. We apply the SDS model as the
denoising neural network during the conditional diffusion process, resulting in
a constrained generative model, \textbf{DiffSDS}. Extensive experiments show
that the plug-and-play ADS could transform the language model into a strong
structural model without loss of simplicity. More importantly, the proposed
DiffSDS outperforms previous strong baselines by a large margin on the task of
protein inpainting.
Removing Structured Noise with Diffusion Models
January 20, 2023
Tristan S. W. Stevens, Hans van Gorp, Faik C. Meral, Junseob Shin, Jason Yu, Jean-Luc Robert, Ruud J. G. van Sloun
Solving ill-posed inverse problems requires careful formulation of prior
beliefs over the signals of interest and an accurate description of their
manifestation into noisy measurements. Handcrafted signal priors based on e.g.
sparsity are increasingly replaced by data-driven deep generative models, and
several groups have recently shown that state-of-the-art score-based diffusion
models yield particularly strong performance and flexibility. In this paper, we
show that the powerful paradigm of posterior sampling with diffusion models can
be extended to include rich, structured, noise models. To that end, we propose
a joint conditional reverse diffusion process with learned scores for the noise
and signal-generating distribution. We demonstrate strong performance gains
across various inverse problems with structured noise, outperforming
competitive baselines that use normalizing flows and adversarial networks. This
opens up new opportunities and relevant practical applications of diffusion
modeling for inverse problems in the context of non-Gaussian measurement
models.
DiffusionCT: Latent Diffusion Model for CT Image Standardization
January 20, 2023
Md Selim, Jie Zhang, Michael A. Brooks, Ge Wang, Jin Chen
Computed tomography (CT) is one of the modalities for effective lung cancer
screening, diagnosis, treatment, and prognosis. The features extracted from CT
images are now used to quantify spatial and temporal variations in tumors.
However, CT images obtained from various scanners with customized acquisition
protocols may introduce considerable variations in texture features, even for
the same patient. This presents a fundamental challenge to downstream studies
that require consistent and reliable feature analysis. Existing CT image
harmonization models rely on GAN-based supervised or semi-supervised learning,
with limited performance. This work addresses the issue of CT image
harmonization using a new diffusion-based model, named DiffusionCT, to
standardize CT images acquired from different vendors and protocols.
DiffusionCT operates in the latent space by mapping a latent non-standard
distribution into a standard one. DiffusionCT incorporates an Unet-based
encoder-decoder, augmented by a diffusion model integrated into the bottleneck
part. The model is designed in two training phases. The encoder-decoder is
first trained, without embedding the diffusion model, to learn the latent
representation of the input data. The latent diffusion model is then trained in
the next training phase while fixing the encoder-decoder. Finally, the decoder
synthesizes a standardized image with the transformed latent representation.
The experimental results demonstrate a significant improvement in the
performance of the standardization task using DiffusionCT.
Regular Time-series Generation using SGM
January 20, 2023
Haksoo Lim, Minjung Kim, Sewon Park, Noseong Park
Score-based generative models (SGMs) are generative models that are currently in
the spotlight. Time-series data occur frequently in daily life, e.g., stock
prices and climate records, and time-series forecasting and classification are
popular research topics in machine learning. SGMs are also known to outperform
other generative models. We therefore apply SGMs to synthesize time-series data
by learning conditional score functions. We propose a conditional score network
for the time-series generation domain and derive the connection between the
score matching and denoising score matching losses in this setting. Finally, we
achieve state-of-the-art results on real-world datasets in terms of sampling
diversity and quality.
Diffusion-based Conditional ECG Generation with Structured State Space Models
January 19, 2023
Juan Miguel Lopez Alcaraz, Nils Strodthoff
Synthetic data generation is a promising solution to address privacy issues
with the distribution of sensitive health data. Recently, diffusion models have
set new standards for generative models for different data modalities. Also
very recently, structured state space models emerged as a powerful modeling
paradigm to capture long-term dependencies in time series. We put forward
SSSD-ECG, as the combination of these two technologies, for the generation of
synthetic 12-lead electrocardiograms conditioned on more than 70 ECG
statements. Due to a lack of reliable baselines, we also propose conditional
variants of two state-of-the-art unconditional generative models. We thoroughly
evaluate the quality of the generated samples, by evaluating pretrained
classifiers on the generated data and by evaluating the performance of a
classifier trained only on synthetic data, where SSSD-ECG clearly outperforms
its GAN-based competitors. We demonstrate the soundness of our approach through
further experiments, including conditional class interpolation and a clinical
Turing test demonstrating the high quality of the SSSD-ECG samples across a
wide range of conditions.
Dif-Fusion: Towards High Color Fidelity in Infrared and Visible Image Fusion with Diffusion Models
January 19, 2023
Jun Yue, Leyuan Fang, Shaobo Xia, Yue Deng, Jiayi Ma
Color plays an important role in human visual perception, reflecting the
spectrum of objects. However, the existing infrared and visible image fusion
methods rarely explore how to handle multi-spectral/channel data directly and
achieve high color fidelity. This paper addresses the above issue by proposing
a novel method with diffusion models, termed as Dif-Fusion, to generate the
distribution of the multi-channel input data, which increases the ability of
multi-source information aggregation and the fidelity of colors. Specifically,
instead of converting multi-channel images into single-channel data as in existing
fusion methods, we create the multi-channel data distribution with a denoising
network in a latent space via forward and reverse diffusion processes. Then, we
use the denoising network to extract the multi-channel diffusion features
with both visible and infrared information. Finally, we feed the multi-channel
diffusion features to the multi-channel fusion module to directly generate the
three-channel fused image. To retain the texture and intensity information, we
propose multi-channel gradient loss and intensity loss. Along with the current
evaluation metrics for measuring texture and intensity fidelity, we introduce a
new evaluation metric to quantify color fidelity. Extensive experiments
indicate that our method is more effective than other state-of-the-art image
fusion methods, especially in color fidelity.
Fast Inference in Denoising Diffusion Models via MMD Finetuning
January 19, 2023
Emanuele Aiello, Diego Valsesia, Enrico Magli
Denoising Diffusion Models (DDMs) have become a popular tool for generating
high-quality samples from complex data distributions. These models are able to
capture sophisticated patterns and structures in the data, and can generate
samples that are highly diverse and representative of the underlying
distribution. However, one of the main limitations of diffusion models is the
complexity of sample generation, since a large number of inference timesteps is
required to faithfully capture the data distribution. In this paper, we present
MMD-DDM, a novel method for fast sampling of diffusion models. Our approach is
based on the idea of using the Maximum Mean Discrepancy (MMD) to finetune the
learned distribution with a given budget of timesteps. This allows the
finetuned model to significantly improve the speed-quality trade-off, by
substantially increasing fidelity in inference regimes with few steps or,
equivalently, by reducing the required number of steps to reach a target
fidelity, thus paving the way for a more practical adoption of diffusion models
in a wide range of applications. We evaluate our approach on unconditional
image generation with extensive experiments across the CIFAR-10, CelebA,
ImageNet and LSUN-Church datasets. Our findings show that the proposed method
is able to produce high-quality samples in a fraction of the time required by
widely-used diffusion models, and outperforms state-of-the-art techniques for
accelerated sampling. Code is available at:
https://github.com/diegovalsesia/MMD-DDM.
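The finetuning signal is a kernel MMD between features of images generated with the small timestep budget and features of real images. A simple (biased) RBF-kernel estimate is sketched below; the feature extractor, kernel choice, and bandwidth are illustrative assumptions.

```python
import torch

def rbf_mmd2(feat_gen, feat_real, bandwidth=10.0):
    """Squared MMD with an RBF kernel between generated and real feature
    batches (simple biased estimate). In MMD-DDM-style finetuning this
    quantity would be minimized w.r.t. the diffusion model's parameters."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * bandwidth ** 2))
    return k(feat_gen, feat_gen).mean() + k(feat_real, feat_real).mean() \
        - 2 * k(feat_gen, feat_real).mean()
```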
Understanding the diffusion models by conditional expectations
January 19, 2023
Yubin Lu, Zhongjian Wang, Guillaume Bal
cs.LG, cs.NA, math.NA, stat.ML
This paper provides several mathematical analyses of the diffusion model in
machine learning. The drift term of the backwards sampling process is
represented as a conditional expectation involving the data distribution and
the forward diffusion. The training process aims to find such a drift function
by minimizing the mean-squared residue related to the conditional expectation.
Using small-time approximations of the Green’s function of the forward
diffusion, we show that the analytical mean drift function in DDPM and the
score function in SGM asymptotically blow up in the final stages of the
sampling process for singular data distributions such as those concentrated on
lower-dimensional manifolds, and are therefore difficult to approximate by a
network. To overcome this difficulty, we derive a new target function and
associated loss, which remains bounded even for singular data distributions. We
illustrate the theoretical findings with several numerical examples.
January 19, 2023
Junde Wu, Wei Ji, Huazhu Fu, Min Xu, Yueming Jin, Yanwu Xu
The Diffusion Probabilistic Model (DPM) has recently gained popularity in the
field of computer vision, thanks to its image generation applications, such as
Imagen, Latent Diffusion Models, and Stable Diffusion, which have demonstrated
impressive capabilities and sparked much discussion within the community.
Recent investigations have further unveiled the utility of DPM in the domain of
medical image analysis, as underscored by the commendable performance exhibited
by the medical image segmentation model across various tasks. Although these
models were originally underpinned by a UNet architecture, there exists a
potential avenue for enhancing their performance through the integration of
vision transformer mechanisms. However, we discovered that simply combining
these two models resulted in subpar performance. To effectively integrate these
two cutting-edge techniques for medical image segmentation, we propose a
novel Transformer-based Diffusion framework, called MedSegDiff-V2. We verify
its effectiveness on 20 medical image segmentation tasks with different image
modalities. Through comprehensive evaluation, our approach demonstrates
superiority over prior state-of-the-art (SOTA) methodologies. Code is released
at https://github.com/KidsWithTokens/MedSegDiff
Denoising Diffusion Probabilistic Models as a Defense against Adversarial Attacks
January 17, 2023
Lars Lien Ankile, Anna Midgley, Sebastian Weisshaar
Neural Networks are infamously sensitive to small perturbations in their
inputs, making them vulnerable to adversarial attacks. This project evaluates
the performance of Denoising Diffusion Probabilistic Models (DDPM) as a
purification technique to defend against adversarial attacks. This works by
adding noise to an adversarial example before removing it through the reverse
process of the diffusion model. We evaluate the approach on the PatchCamelyon
data set for histopathologic scans of lymph node sections and find an
improvement of the robust accuracy by up to 88\% of the original model’s
accuracy, constituting a considerable improvement over the vanilla model and
our baselines. The project code is located at
https://github.com/ankile/Adversarial-Diffusion.
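A minimal sketch of the purification loop, assuming a generic reverse sampler: diffuse the (possibly adversarial) input forward to an intermediate timestep, then denoise it back before classifying. The choice of `t_star` and the `sampler` interface are assumptions, not the project's tuned settings.

```python
import torch

@torch.no_grad()
def purify(model, x_adv, alphas_cumprod, sampler, t_star=100):
    """Diffusion purification (sketch): forward-noise to t_star, then run any
    reverse sampler (DDPM ancestral, DDIM, ...) back to t = 0."""
    a_bar = alphas_cumprod[t_star]
    x_t = a_bar.sqrt() * x_adv + (1 - a_bar).sqrt() * torch.randn_like(x_adv)
    return sampler(model, x_t, t_start=t_star)   # hypothetical sampler interface
```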
Diffusion-based Generation, Optimization, and Planning in 3D Scenes
January 15, 2023
Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, Song-Chun Zhu
We introduce SceneDiffuser, a conditional generative model for 3D scene
understanding. SceneDiffuser provides a unified model for solving
scene-conditioned generation, optimization, and planning. In contrast to prior
works, SceneDiffuser is intrinsically scene-aware, physics-based, and
goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly
formulates the scene-aware generation, physics-based optimization, and
goal-oriented planning via a diffusion-based denoising process in a fully
differentiable fashion. Such a design alleviates the discrepancies among
different modules and the posterior collapse of previous scene-conditioned
generative models. We evaluate SceneDiffuser with various 3D scene
understanding tasks, including human pose and motion generation, dexterous
grasp generation, path planning for 3D navigation, and motion planning for
robot arms. The results show significant improvements compared with previous
models, demonstrating the tremendous potential of SceneDiffuser for the broad
community of 3D scene understanding.
Neural Image Compression with a Diffusion-Based Decoder
January 13, 2023
Noor Fathima Ghouse, Jens Petersen, Auke Wiggers, Tianlin Xu, Guillaume Sautière
Diffusion probabilistic models have recently achieved remarkable success in
generating high quality image and video data. In this work, we build on this
class of generative models and introduce a method for lossy compression of high
resolution images. The resulting codec, which we call the Diffusion-based Residual
Augmentation Codec (DIRAC), is the first neural codec to allow smooth traversal
of the rate-distortion-perception tradeoff at test time, while obtaining
competitive performance with GAN-based methods in perceptual quality.
Furthermore, while sampling from diffusion probabilistic models is notoriously
expensive, we show that in the compression setting the number of steps can be
drastically reduced.
Guiding Text-to-Image Diffusion Model Towards Grounded Generation
January 12, 2023
Ziyi Li, Qinye Zhou, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie
The goal of this paper is to extract the visual-language correspondence from
a pre-trained text-to-image diffusion model, in the form of segmentation map,
i.e., simultaneously generating images and segmentation masks for the
corresponding visual entities described in the text prompt. We make the
following contributions: (i) we pair the existing Stable Diffusion model with a
novel grounding module, that can be trained to align the visual and textual
embedding space of the diffusion model with only a small number of object
categories; (ii) we establish an automatic pipeline for constructing a dataset,
that consists of {image, segmentation mask, text prompt} triplets, to train the
proposed grounding module; (iii) we evaluate the performance of open-vocabulary
grounding on images generated from the text-to-image diffusion model and show
that the module can well segment the objects of categories beyond seen ones at
training time; (iv) we adopt the augmented diffusion model to build a synthetic
semantic segmentation dataset, and show that, training a standard segmentation
model on such dataset demonstrates competitive performance on the zero-shot
segmentation (ZS3) benchmark, which opens up new opportunities for adopting the
powerful diffusion model for discriminative tasks.
Thompson Sampling with Diffusion Generative Prior
January 12, 2023
Yu-Guan Hsieh, Shiva Prasad Kasiviswanathan, Branislav Kveton, Patrick Blöbaum
In this work, we initiate the idea of using denoising diffusion models to
learn priors for online decision making problems. Our special focus is on the
meta-learning for bandit framework, with the goal of learning a strategy that
performs well across bandit tasks of a same class. To this end, we train a
diffusion model that learns the underlying task distribution and combine
Thompson sampling with the learned prior to deal with new tasks at test time.
Our posterior sampling algorithm is designed to carefully balance between the
learned prior and the noisy observations that come from the learner’s
interaction with the environment. To capture realistic bandit scenarios, we
also propose a novel diffusion model training procedure that trains even from
incomplete and/or noisy data, which could be of independent interest. Finally,
our extensive experimental evaluations clearly demonstrate the potential of the
proposed approach.
Diffusion-based Data Augmentation for Skin Disease Classification
January 12, 2023
Mohamed Akrout, Bálint Gyepesi, Péter Holló, Adrienn Poór, Blága Kincső, Stephen Solis, Katrina Cirone, Jeremy Kawahara, Dekker Slade, Latif Abid, Máté Kovács, István Fazekas
Despite continued advancement in recent years, deep neural networks still
rely on large amounts of training data to avoid overfitting. However, labeled
training data for real-world applications such as healthcare is limited and
difficult to access given longstanding privacy concerns and strict data sharing
policies. By manipulating image datasets in the pixel or feature space,
existing data augmentation techniques represent one of the effective ways to
improve the quantity and diversity of training data. Here, we look to advance
augmentation techniques by building upon the emerging success of text-to-image
diffusion probabilistic models in augmenting the training samples of our
macroscopic skin disease dataset. We do so by enabling fine-grained control of
the image generation process via input text prompts. We demonstrate that this
generative data augmentation approach successfully maintains a similar
classification accuracy of the visual classifier even when trained on a fully
synthetic skin disease dataset. Similar to recent applications of generative
models, our study suggests that diffusion models are indeed effective in
generating high-quality skin images that do not sacrifice the classifier
performance, and can improve the augmentation of training datasets after
curation.
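As a hedged sketch of prompt-driven augmentation with an off-the-shelf text-to-image pipeline (the paper's exact model, prompts, and fine-tuning procedure are not reproduced here; the checkpoint name and condition labels are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

conditions = ["melanoma", "basal cell carcinoma", "psoriasis"]   # placeholder labels
synthetic = []
for name in conditions:
    prompt = f"dermatology photograph of {name} on human skin, clinical close-up"
    images = pipe(prompt, num_images_per_prompt=4).images        # list of PIL images
    synthetic += [(img, name) for img in images]
# `synthetic` can then be mixed into (or, as in the study, fully replace) the
# classifier's training set.
```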
Diffusion Models For Stronger Face Morphing Attacks
January 10, 2023
Zander Blasingame, Chen Liu
Face morphing attacks seek to deceive a Face Recognition (FR) system by
presenting a morphed image consisting of the biometric qualities from two
different identities with the aim of triggering a false acceptance with one of
the two identities, thereby presenting a significant threat to biometric
systems. The success of a morphing attack is dependent on the ability of the
morphed image to represent the biometric characteristics of both identities
that were used to create the image. We present a novel morphing attack that
uses a Diffusion-based architecture to improve the visual fidelity of the image
and the ability of the morphing attack to represent characteristics from both
identities. We demonstrate the effectiveness of the proposed attack by
evaluating its visual fidelity via the Frechet Inception Distance (FID). Also,
extensive experiments are conducted to measure the vulnerability of FR systems
to the proposed attack. The ability of a morphing attack detector to detect the
proposed attack is measured and compared against two state-of-the-art GAN-based
morphing attacks along with two Landmark-based attacks. Additionally, a novel
metric to measure the relative strength between different morphing attacks is
introduced and evaluated.
Speech Driven Video Editing via an Audio-Conditioned Diffusion Model
January 10, 2023
Dan Bigioi, Shubhajit Basak, Michał Stypułkowski, Maciej Zięba, Hugh Jordan, Rachel McDonnell, Peter Corcoran
cs.CV, cs.LG, cs.SD, eess.AS
Taking inspiration from recent developments in visual generative tasks using
diffusion models, we propose a method for end-to-end speech-driven video
editing using a denoising diffusion model. Given a video of a talking person,
and a separate auditory speech recording, the lip and jaw motions are
re-synchronized without relying on intermediate structural representations such
as facial landmarks or a 3D face model. We show this is possible by
conditioning a denoising diffusion model on audio mel spectral features to
generate synchronised facial motion. Proof of concept results are demonstrated
on both single-speaker and multi-speaker video editing, providing a baseline
model on the CREMA-D audiovisual data set. To the best of our knowledge, this
is the first work to demonstrate and validate the feasibility of applying
end-to-end denoising diffusion models to the task of audio-driven video
editing.
DiffTalk: Crafting Diffusion Models for Generalized Talking Head Synthesis
January 10, 2023
Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, Jiwen Lu
Talking head synthesis is a promising approach for the video production
industry. Recently, much effort in this research area has been devoted to
improving the generation quality or enhancing model generalization. However,
there are few works able to address both issues simultaneously, which is
essential for practical applications. To this end, in this paper, we turn
attention to the emerging powerful Latent Diffusion Models, and model the
Talking head generation as an audio-driven temporally coherent denoising
process (DiffTalk). More specifically, instead of employing audio signals as
the single driving factor, we investigate the control mechanism of the talking
face, and incorporate reference face images and landmarks as conditions for
personality-aware generalized synthesis. In this way, the proposed DiffTalk is
capable of producing high-quality talking head videos in synchronization with
the source audio, and more importantly, it can be naturally generalized across
different identities without any further fine-tuning. Additionally, our
DiffTalk can be gracefully tailored for higher-resolution synthesis with
negligible extra computational cost. Extensive experiments show that the
proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking
head videos for generalized novel identities. For more video results, please
refer to \url{https://sstzal.github.io/DiffTalk/}.
Image Denoising: The Deep Learning Revolution and Beyond – A Survey Paper –
January 09, 2023
Michael Elad, Bahjat Kawar, Gregory Vaksman
Image denoising (removal of additive white Gaussian noise from an image) is
one of the oldest and most studied problems in image processing. An extensive
work over several decades has led to thousands of papers on this subject, and
to many well-performing algorithms for this task. Indeed, 10 years ago, these
achievements led some researchers to suspect that “Denoising is Dead”, in
the sense that all that can be achieved in this domain has already been
obtained. However, this turned out to be far from the truth, with the
penetration of deep learning (DL) into image processing. The era of DL brought
a revolution to image denoising, both by taking the lead in today’s ability for
noise removal in images, and by broadening the scope of denoising problems
being treated. Our paper starts by describing this evolution, highlighting in
particular the tension and synergy that exist between classical approaches and
modern DL-based alternatives in the design of image denoisers.
The recent transitions in the field of image denoising go far beyond the
ability to design better denoisers. In the second part of this paper, we focus on
recently discovered abilities and prospects of image denoisers. We expose the
possibility of using denoisers to serve other problems, such as regularizing
general inverse problems and serving as the prime engine in diffusion-based
image synthesis. We also unveil the idea that denoising and other inverse
problems might not have a unique solution as common algorithms would have us
believe. Instead, we describe constructive ways to produce randomized and
diverse high quality results for inverse problems, all fueled by the progress
that DL brought to image denoising.
This survey paper aims to provide a broad view of the history of image
denoising and closely related topics. Our aim is to give a better context to
recent discoveries, and to the influence of DL in our domain.
Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement
January 08, 2023
Yan Li, Xinjiang Lu, Yaqing Wang, Dejing Dou
Time series forecasting has been a widely explored task of great importance
in many applications. However, it is common that real-world time series data
are recorded in a short time period, which results in a big gap between the
deep model and the limited and noisy time series. In this work, we propose to
address the time series forecasting problem with generative modeling and
propose a bidirectional variational auto-encoder (BVAE) equipped with
diffusion, denoise, and disentanglement, namely D3VAE. Specifically, a coupled
diffusion probabilistic model is proposed to augment the time series data
without increasing the aleatoric uncertainty and implement a more tractable
inference process with BVAE. To ensure the generated series move toward the
true target, we further propose to adapt and integrate the multiscale denoising
score matching into the diffusion process for time series forecasting. In
addition, to enhance the interpretability and stability of the prediction, we
treat the latent variable in a multivariate manner and disentangle them on top
of minimizing total correlation. Extensive experiments on synthetic and
real-world data show that D3VAE outperforms competitive algorithms with
remarkable margins. Our implementation is available at
https://github.com/PaddlePaddle/PaddleSpatial/tree/main/research/D3VAE.
Annealed Score-Based Diffusion Model for MR Motion Artifact Reduction
January 08, 2023
Gyutaek Oh, Jeong Eun Lee, Jong Chul Ye
Motion artifact reduction is one of the important research topics in MR
imaging, as the motion artifact degrades image quality and makes diagnosis
difficult. Recently, many deep learning approaches have been studied for motion
artifact reduction. Unfortunately, most existing models are trained in a
supervised manner, requiring paired motion-corrupted and motion-free images, or
are based on a strict motion-corruption model, which limits their use for
real-world situations. To address this issue, here we present an annealed
score-based diffusion model for MRI motion artifact reduction. Specifically, we
train a score-based model using only motion-free images, and then motion
artifacts are removed by applying forward and reverse diffusion processes
repeatedly to gradually impose a low-frequency data consistency. Experimental
results verify that the proposed method successfully reduces both simulated and
in vivo motion artifacts, outperforming the state-of-the-art deep learning
methods.
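One way to picture the low-frequency data consistency step is sketched below, under assumptions: at each annealing iteration, the central band of the current estimate's k-space is replaced with that of the motion-corrupted measurement. The band size, weighting, and scheduling are placeholders, not the paper's tuned procedure.

```python
import torch

def low_freq_consistency(x, x_measured, keep_frac=0.1):
    """Replace the central (low-frequency) band of the current estimate's
    k-space with that of the measurement (sketch; band size is assumed)."""
    k_est = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    k_meas = torch.fft.fftshift(torch.fft.fft2(x_measured), dim=(-2, -1))
    h, w = x.shape[-2:]
    ch, cw = int(h * keep_frac / 2), int(w * keep_frac / 2)
    mask = torch.zeros(k_est.shape, dtype=torch.bool, device=x.device)
    mask[..., h // 2 - ch:h // 2 + ch, w // 2 - cw:w // 2 + cw] = True
    k_mixed = torch.where(mask, k_meas, k_est)
    return torch.fft.ifft2(torch.fft.ifftshift(k_mixed, dim=(-2, -1))).real
```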
Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation
January 06, 2023
Michał Stypułkowski, Konstantinos Vougioukas, Sen He, Maciej Zięba, Stavros Petridis, Maja Pantic
Talking face generation has historically struggled to produce head movements
and natural facial expressions without guidance from additional reference
videos. Recent developments in diffusion-based generative models allow for more
realistic and stable data synthesis and their performance on image and video
generation has surpassed that of other generative models. In this work, we
present an autoregressive diffusion model that requires only one identity image
and audio sequence to generate a video of a realistic talking human head. Our
solution is capable of hallucinating head movements, facial expressions, such
as blinks, and preserving a given background. We evaluate our model on two
different datasets, achieving state-of-the-art results on both of them.
Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets
January 02, 2023
Dennis Eschweiler, Rüveyda Yilmaz, Matisse Baumann, Ina Laube, Rijo Roy, Abin Jose, Daniel Brückner, Johannes Stegmaier
Recent advances in computer vision have led to significant progress in the
generation of realistic image data, with denoising diffusion probabilistic
models proving to be a particularly effective method. In this study, we
demonstrate that diffusion models can effectively generate fully-annotated
microscopy image data sets through an unsupervised and intuitive approach,
using rough sketches of desired structures as the starting point. The proposed
pipeline helps to reduce the reliance on manual annotations when training deep
learning-based segmentation approaches and enables the segmentation of diverse
datasets without the need for human annotations. This approach holds great
promise in streamlining the data generation process and enabling a more
efficient and scalable training of segmentation models, as we show in the
example of different practical experiments involving various organisms and cell
types.
Diffusion Probabilistic Models for Scene-Scale 3D Categorical Data
January 02, 2023
Jumin Lee, Woobin Im, Sebin Lee, Sung-Eui Yoon
In this paper, we learn a diffusion model to generate 3D data on a
scene-scale. Specifically, our model crafts a 3D scene consisting of multiple
objects, while recent diffusion research has focused on a single object. To
realize our goal, we represent a scene with discrete class labels, i.e.,
categorical distribution, to assign multiple objects into semantic categories.
Thus, we extend discrete diffusion models to learn scene-scale categorical
distributions. In addition, we validate that a latent diffusion model can
reduce computation costs for training and deployment. To the best of our
knowledge, our work is the first to apply discrete and latent diffusion for 3D
categorical data on a scene-scale. We further propose to perform semantic scene
completion (SSC) by learning a conditional distribution using our diffusion
model, where the condition is a partial observation in a sparse point cloud. In
experiments, we empirically show that our diffusion models not only generate
reasonable scenes, but also perform the scene completion task better than a
discriminative model. Our code and models are available at
https://github.com/zoomin-lee/scene-scale-diffusion
Conditional Diffusion Based on Discrete Graph Structures for Molecular Graph Generation
January 01, 2023
Han Huang, Leilei Sun, Bowen Du, Weifeng Lv
Learning the underlying distribution of molecular graphs and generating
high-fidelity samples is a fundamental research problem in drug discovery and
material science. However, accurately modeling distribution and rapidly
generating novel molecular graphs remain crucial and challenging goals. To
accomplish these goals, we propose a novel Conditional Diffusion model based on
discrete Graph Structures (CDGS) for molecular graph generation. Specifically,
we construct a forward graph diffusion process on both graph structures and
inherent features through stochastic differential equations (SDE) and derive
discrete graph structures as the condition for reverse generative processes. We
present a specialized hybrid graph noise prediction model that extracts the
global context and the local node-edge dependency from intermediate graph
states. We further utilize ordinary differential equation (ODE) solvers for
efficient graph sampling, based on the semi-linear structure of the probability
flow ODE. Experiments on diverse datasets validate the effectiveness of our
framework. Particularly, the proposed method still generates high-quality
molecular graphs in a limited number of steps. Our code is provided in
https://github.com/GRAPH-0/CDGS.
Diffusion Model based Semi-supervised Learning on Brain Hemorrhage Images for Efficient Midline Shift Quantification
January 01, 2023
Shizhan Gong, Cheng Chen, Yuqi Gong, Nga Yan Chan, Wenao Ma, Calvin Hoi-Kwan Mak, Jill Abrigo, Qi Dou
Brain midline shift (MLS) is one of the most critical factors to be
considered for clinical diagnosis and treatment decision-making for
intracranial hemorrhage. Existing computational methods on MLS quantification
not only require intensive labeling in millimeter-level measurement but also
suffer from poor performance due to their dependence on specific landmarks or
simplified anatomical assumptions. In this paper, we propose a novel
semi-supervised framework to accurately measure the scale of MLS from head CT
scans. We formulate the MLS measurement task as a deformation estimation
problem and solve it using a few MLS slices with sparse labels. Meanwhile, with
the help of diffusion models, we are able to use a great number of unlabeled
MLS data and 2793 non-MLS cases for representation learning and regularization.
The extracted representation reflects how the image is different from a non-MLS
image and regularization serves an important role in the sparse-to-dense
refinement of the deformation field. Our experiment on a real clinical brain
hemorrhage dataset has achieved state-of-the-art performance and can generate
interpretable deformation fields.
ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech
December 30, 2022
Zehua Chen, Yihan Wu, Yichong Leng, Jiawei Chen, Haohe Liu, Xu Tan, Yang Cui, Ke Wang, Lei He, Sheng Zhao, Jiang Bian, Danilo Mandic
eess.AS, cs.CL, cs.LG, cs.SD, eess.SP
Denoising Diffusion Probabilistic Models (DDPMs) are emerging in
text-to-speech (TTS) synthesis because of their strong capability of generating
high-fidelity samples. However, their iterative refinement process in
high-dimensional data space results in slow inference speed, which restricts
their application in real-time systems. Previous works have explored speeding
up by minimizing the number of inference steps but at the cost of sample
quality. In this work, to improve the inference speed for DDPM-based TTS model
while achieving high sample quality, we propose ResGrad, a lightweight
diffusion model which learns to refine the output spectrogram of an existing
TTS model (e.g., FastSpeech 2) by predicting the residual between the model
output and the corresponding ground-truth speech. ResGrad has several
advantages: 1) Compared with other acceleration methods for DDPMs, which need to
synthesize speech from scratch, ResGrad reduces the complexity of the task by
changing the generation target from the ground-truth mel-spectrogram to the
residual, resulting in a more lightweight model and thus a smaller real-time
factor. 2) ResGrad is employed in the inference process of the existing TTS
model in a plug-and-play way, without re-training this model. We verify ResGrad
on the single-speaker dataset LJSpeech and two more challenging datasets with
multiple speakers (LibriTTS) and high sampling rate (VCTK). Experimental
results show that in comparison with other speed-up methods of DDPMs: 1)
ResGrad achieves better sample quality with the same inference speed measured
by real-time factor; 2) with similar speech quality, ResGrad synthesizes speech
faster than baseline methods by more than 10 times. Audio samples are available
at https://resgrad1.github.io/.
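The residual-prediction idea above fits in a short sketch. The following is a hypothetical PyTorch-style training step, not the authors' code: a lightweight denoiser is trained to predict the noise added to the residual between the ground-truth mel-spectrogram and the output of a frozen first-stage TTS model, so that at inference only the small residual has to be generated and then added back to the coarse spectrogram before vocoding. The names residual_net and tts_model and the tensor shapes are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def resgrad_style_step(residual_net, tts_model, text, gt_mel, alphas_bar):
        # Freeze the first-stage TTS model; its output only serves as conditioning.
        with torch.no_grad():
            coarse_mel = tts_model(text)               # coarse spectrogram [B, n_mels, T]
        residual = gt_mel - coarse_mel                 # the generation target is the residual
        t = torch.randint(0, len(alphas_bar), (residual.shape[0],))
        a_bar = alphas_bar[t].view(-1, 1, 1)
        noise = torch.randn_like(residual)
        noisy_residual = a_bar.sqrt() * residual + (1 - a_bar).sqrt() * noise
        # The denoiser is conditioned on the coarse spectrogram it must refine.
        pred_noise = residual_net(noisy_residual, t, cond=coarse_mel)
        return F.mse_loss(pred_noise, noise)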
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models
December 28, 2022
Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, Shenghua Gao
Recent CLIP-guided 3D optimization methods, such as DreamFields and
PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D
synthesis. However, due to scratch training and random initialization without
prior knowledge, these methods often fail to generate accurate and faithful 3D
structures that conform to the input text. In this paper, we make the first
attempt to introduce explicit 3D shape priors into the CLIP-guided 3D
optimization process. Specifically, we first generate a high-quality 3D shape
from the input text in the text-to-shape stage as a 3D shape prior. We then use
it as the initialization of a neural radiance field and optimize it with the
full prompt. To address the challenging text-to-shape generation task, we
present a simple yet effective approach that directly bridges the text and
image modalities with a powerful text-to-image diffusion model. To narrow the
style domain gap between the images synthesized by the text-to-image diffusion
model and shape renderings used to train the image-to-shape generator, we
further propose to jointly optimize a learnable text prompt and fine-tune the
text-to-image diffusion model for rendering-style image generation. Our method,
Dream3D, is capable of generating imaginative 3D content with superior visual
quality and shape accuracy compared to state-of-the-art methods.
December 28, 2022
He Cao, Jianan Wang, Tianhe Ren, Xianbiao Qi, Yihao Chen, Yuan Yao, Lei Zhang
Score-based diffusion models have captured widespread attention and fueled
fast progress in recent vision generative tasks. In this paper, we focus on the
diffusion model backbone, which has been largely neglected so far. We
systematically explore vision Transformers as diffusion learners for various
generative tasks. With our improvements, the performance of a vanilla ViT-based
backbone (IU-ViT) is boosted to be on par with that of traditional U-Net-based methods.
We further provide a hypothesis on the implication of disentangling the
generative backbone as an encoder-decoder structure and show proof-of-concept
experiments verifying the effectiveness of a stronger encoder for generative
tasks with ASymmetriC ENcoder Decoder (ASCEND). Our improvements achieve
competitive results on CIFAR-10, CelebA, LSUN, CUB Bird and large-resolution
text-to-image tasks. To the best of our knowledge, we are the first to
successfully train a single diffusion model on text-to-image task beyond 64x64
resolution. We hope this will motivate people to rethink the modeling choices
and the training pipelines for diffusion-based generative models.
December 27, 2022
Princy Chahal
We present an end-to-end Transformer-based Latent Diffusion model for image
synthesis. On the ImageNet class-conditioned generation task, we show that a
Transformer-based Latent Diffusion model achieves an FID of 14.1, comparable
to the 13.1 FID of a U-Net-based architecture. In addition to demonstrating the
application of Transformer models to diffusion-based image synthesis, this
architectural simplification allows easy fusion and modeling of text and
image data. The multi-head attention mechanism of Transformers enables
simplified interaction between image and text features, which removes the
need for a cross-attention mechanism in U-Net-based diffusion models.
DiffFace: Diffusion-based Face Swapping with Facial Guidance
December 27, 2022
Kihong Kim, Yunho Kim, Seokju Cho, Junyoung Seo, Jisu Nam, Kychul Lee, Seungryong Kim, KwangHee Lee
In this paper, we propose a diffusion-based face swapping framework for the
first time, called DiffFace, composed of training ID conditional DDPM, sampling
with facial guidance, and a target-preserving blending. Specifically, in the
training process, the ID conditional DDPM is trained to generate face images
with the desired identity. In the sampling process, we use the off-the-shelf
facial expert models to make the model transfer source identity while
preserving target attributes faithfully. During this process, to preserve the
background of the target image and obtain the desired face swapping result, we
additionally propose a target-preserving blending strategy. It helps our model
to keep the attributes of the target face from noise while transferring the
source facial identity. In addition, without any re-training, our model can
flexibly apply additional facial guidance and adaptively control the
ID-attributes trade-off to achieve the desired results. To the best of our
knowledge, this is the first approach that applies a diffusion model to the
face swapping task. By taking advantage of the diffusion model, DiffFace offers
benefits over previous GAN-based approaches such as training stability, high
fidelity, sample diversity,
and controllability. Extensive experiments show that our DiffFace is comparable
or superior to the state-of-the-art methods on several standard face swapping
benchmarks.
Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models
December 26, 2022
Zijian Zhang, Zhou Zhao, Zhijie Lin
Diffusion Probabilistic Models (DPMs) have shown a powerful capacity of
generating high-quality image samples. Recently, diffusion autoencoders
(Diff-AE) have been proposed to explore DPMs for representation learning via
autoencoding. Their key idea is to jointly train an encoder for discovering
meaningful representations from images and a conditional DPM as the decoder for
reconstructing images. Considering that training DPMs from scratch will take a
long time and there have existed numerous pre-trained DPMs, we propose
\textbf{P}re-trained \textbf{D}PM \textbf{A}uto\textbf{E}ncoding
(\textbf{PDAE}), a general method that adapts existing pre-trained DPMs into
decoders for image reconstruction, with better training efficiency and
performance than Diff-AE. Specifically, we find that the reason pre-trained
DPMs fail to reconstruct an image from its latent variables is the information
loss of the forward process, which causes a gap between their predicted
posterior mean and the true one. From this perspective, the classifier-guided
sampling method can be explained as computing an extra mean shift to fill the
gap, reconstructing the lost class information in samples. This implies that
the gap corresponds to the information lost from the image, and we can
reconstruct the image by filling the gap. Drawing inspiration from this, we
employ a trainable model to predict a mean shift from the encoded
representation and train it to fill as much of the gap as possible; in this
way, the encoder is forced to learn as much information as possible from images
to help the filling. By reusing part of the network of pre-trained DPMs and redesigning
the weighting scheme of diffusion loss, PDAE can learn meaningful
representations from images efficiently. Extensive experiments demonstrate the
effectiveness, efficiency and flexibility of PDAE.
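The mean-shift view described above can be made concrete with a short schematic (the notation below is ours, not the paper's). Standard classifier guidance shifts the predicted posterior mean by a scaled classifier gradient, and PDAE-style autoencoding replaces that shift with a trainable predictor conditioned on the learned representation:

\[
\tilde{\mu}_\theta(x_t, t \mid y) = \mu_\theta(x_t, t) + s\,\Sigma_\theta(x_t, t)\,\nabla_{x_t}\log p_\phi(y \mid x_t),
\qquad
\tilde{\mu}_\theta(x_t, t \mid z) \approx \mu_\theta(x_t, t) + G_\psi(x_t, t, z), \quad z = E_\varphi(x_0).
\]

Training $G_\psi$ (and the encoder $E_\varphi$) to close the gap to the true posterior mean forces $z$ to carry the information destroyed by the forward process.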
Your diffusion model secretly knows the dimension of the data manifold
December 23, 2022
Jan Stanczuk, Georgios Batzolis, Teo Deveney, Carola-Bibiane Schönlieb
In this work, we propose a novel framework for estimating the dimension of
the data manifold using a trained diffusion model. A diffusion model
approximates the score function i.e. the gradient of the log density of a
noise-corrupted version of the target distribution for varying levels of
corruption. We prove that, if the data concentrates around a manifold embedded
in the high-dimensional ambient space, then as the level of corruption
decreases, the score function points towards the manifold, as this direction
becomes the direction of maximal likelihood increase. Therefore, for small
levels of corruption, the diffusion model provides us with access to an
approximation of the normal bundle of the data manifold. This allows us to
estimate the dimension of the tangent space, thus, the intrinsic dimension of
the data manifold. To the best of our knowledge, our method is the first
estimator of the data manifold dimension based on diffusion models and it
outperforms well established statistical estimators in controlled experiments
on both Euclidean and image data.
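An estimator in this spirit is easy to sketch (our illustration under the stated assumptions, not the authors' released code; score_fn stands for a hypothetical trained score model). Perturb a data point at a small noise level, collect the score vectors of many perturbations, and read off the rank of their span with an SVD: the scores approximately span the normal space, so the intrinsic dimension is the ambient dimension minus that rank.

    import numpy as np

    def estimate_intrinsic_dim(score_fn, x0, sigma=0.01, n_samples=512, tol=0.1):
        # At small noise levels the score points (approximately) along the normal
        # space of the data manifold at x0, so its span reveals the co-dimension.
        d = x0.size
        scores = np.empty((n_samples, d))
        for i in range(n_samples):
            x = x0 + sigma * np.random.randn(d)        # small Gaussian perturbation
            scores[i] = score_fn(x, sigma).ravel()     # approx. gradient of log density
        s = np.linalg.svd(scores - scores.mean(0), compute_uv=False)
        normal_rank = int((s > tol * s[0]).sum())      # directions the scores occupy
        return d - normal_rank                         # estimated intrinsic dimension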
Scalable Adaptive Computation for Iterative Generation
December 22, 2022
Allan Jabri, David Fleet, Ting Chen
Natural data is redundant, yet predominant architectures tile computation
uniformly across their input and output space. We propose the Recurrent
Interface Networks (RINs), an attention-based architecture that decouples its
core computation from the dimensionality of the data, enabling adaptive
computation for more scalable generation of high-dimensional data. RINs focus
the bulk of computation (i.e. global self-attention) on a set of latent tokens,
using cross-attention to read and write (i.e. route) information between latent
and data tokens. Stacking RIN blocks allows bottom-up (data to latent) and
top-down (latent to data) feedback, leading to deeper and more expressive
routing. While this routing introduces challenges, this is less problematic in
recurrent computation settings where the task (and routing problem) changes
gradually, such as iterative generation with diffusion models. We show how to
leverage recurrence by conditioning the latent tokens at each forward pass of
the reverse diffusion process with those from prior computation, i.e. latent
self-conditioning. RINs yield state-of-the-art pixel diffusion models for image
and video generation, scaling to 1024x1024 images without cascades or guidance,
while being domain-agnostic and up to 10x more efficient than 2D and 3D U-Nets.
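The read/compute/write pattern can be hedged into a tiny PyTorch-style block (our simplification; layer sizes, names, and the plain use of nn.MultiheadAttention are assumptions): latent tokens read from data tokens with cross-attention, run the heavy self-attention among themselves, then write back to the data tokens. Latent self-conditioning would simply initialize the latents with those produced at the previous reverse-diffusion step.

    import torch
    import torch.nn as nn

    class RINBlockSketch(nn.Module):
        # Minimal read-compute-write block in the spirit of Recurrent Interface Networks.
        def __init__(self, dim, heads=8):
            super().__init__()
            self.read = nn.MultiheadAttention(dim, heads, batch_first=True)     # latents attend to data
            self.compute = nn.MultiheadAttention(dim, heads, batch_first=True)  # latent self-attention
            self.write = nn.MultiheadAttention(dim, heads, batch_first=True)    # data attends to latents
            self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                     nn.GELU(), nn.Linear(dim * 4, dim))

        def forward(self, latents, data):
            latents = latents + self.read(latents, data, data)[0]           # read (cross-attention)
            latents = latents + self.compute(latents, latents, latents)[0]  # bulk computation
            latents = latents + self.mlp(latents)
            data = data + self.write(data, latents, latents)[0]             # write back (route)
            return latents, data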
StoRM: A Diffusion-based Stochastic Regeneration Model for Speech Enhancement and Dereverberation
December 22, 2022
Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann
Diffusion models have shown a great ability at bridging the performance gap
between predictive and generative approaches for speech enhancement. We have
shown that they may even outperform their predictive counterparts for
non-additive corruption types or when they are evaluated on mismatched
conditions. However, diffusion models suffer from a high computational burden,
mainly because they require running a neural network for each reverse diffusion step,
whereas predictive approaches only require one pass. As diffusion models are
generative approaches, they may also produce vocalizing and breathing artifacts
in adverse conditions. In comparison, in such difficult scenarios, predictive
models typically do not produce such artifacts but tend to distort the target
speech instead, thereby degrading the speech quality. In this work, we present
a stochastic regeneration approach where an estimate given by a predictive
model is provided as a guide for further diffusion. We show that the proposed
approach uses the predictive model to remove the vocalizing and breathing
artifacts while producing very high quality samples thanks to the diffusion
model, even in adverse conditions. We further show that this approach enables
the use of lighter sampling schemes with fewer diffusion steps without sacrificing
quality, thus lifting the computational burden by an order of magnitude. Source
code and audio examples are available online (https://uhh.de/inf-sp-storm).
GENIE: Large Scale Pre-training for Text Generation with Diffusion Model
December 22, 2022
Zhenghao Lin, Yeyun Gong, Yelong Shen, Tong Wu, Zhihao Fan, Chen Lin, Nan Duan, Weizhu Chen
In this paper, we introduce a novel dIffusion language modEl pre-training
framework for text generation, which we call GENIE. GENIE is a large-scale
pretrained diffusion language model that consists of an encoder and a
diffusion-based decoder, which can generate text by gradually transforming a
random noise sequence into a coherent text sequence. To pre-train GENIE on a
large-scale language corpus, we design a new continuous paragraph denoise
objective, which encourages the diffusion-decoder to reconstruct a clean text
paragraph from a corrupted version, while preserving the semantic and syntactic
coherence. We evaluate GENIE on four downstream text generation benchmarks,
namely XSum, CNN/DailyMail, Gigaword, and CommonGen. Our experimental results
show that GENIE achieves comparable performance with the state-of-the-art
autoregressive models on these benchmarks, and generates more diverse text
samples. The code and models of GENIE are available at
https://github.com/microsoft/ProphetNet/tree/master/GENIE.
Character-Aware Models Improve Visual Text Rendering
December 20, 2022
Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, Noah Constant
Current image generation models struggle to reliably produce well-formed
visual text. In this paper, we investigate a key contributing factor: popular
text-to-image models lack character-level input features, making it much harder
to predict a word’s visual makeup as a series of glyphs. To quantify this
effect, we conduct a series of experiments comparing character-aware vs.
character-blind text encoders. In the text-only domain, we find that
character-aware models provide large gains on a novel spelling task
(WikiSpell). Applying our learnings to the visual domain, we train a suite of
image generation models, and show that character-aware variants outperform
their character-blind counterparts across a range of novel text rendering tasks
(our DrawText benchmark). Our models set a much higher state-of-the-art on
visual spelling, with 30+ point accuracy gains over competitors on rare words,
despite training on far fewer examples.
Scalable Diffusion Models with Transformers
December 19, 2022
William Peebles, Saining Xie
We explore a new class of diffusion models based on the transformer
architecture. We train latent diffusion models of images, replacing the
commonly-used U-Net backbone with a transformer that operates on latent
patches. We analyze the scalability of our Diffusion Transformers (DiTs)
through the lens of forward pass complexity as measured by Gflops. We find that
DiTs with higher Gflops – through increased transformer depth/width or
increased number of input tokens – consistently have lower FID. In addition to
possessing good scalability properties, our largest DiT-XL/2 models outperform
all prior diffusion models on the class-conditional ImageNet 512x512 and
256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.
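The recipe in the abstract, i.e., running a plain transformer over latent patches, fits in a few lines. Below is a toy sketch under our own assumptions (patch size, layer counts, and the unpatchify step are illustrative; the actual DiT additionally injects timestep and class conditioning, e.g., via adaptive layer norm).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LatentPatchTransformerSketch(nn.Module):
        # Toy DiT-style denoiser: patchify a latent, run a transformer, unpatchify.
        def __init__(self, latent_channels=4, patch=2, dim=512, depth=8, heads=8):
            super().__init__()
            self.patch = patch
            self.embed = nn.Conv2d(latent_channels, dim, kernel_size=patch, stride=patch)
            layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            self.blocks = nn.TransformerEncoder(layer, depth)
            self.unembed = nn.Linear(dim, latent_channels * patch * patch)

        def forward(self, z_t):                                   # z_t: noisy latent [B, C, H, W]
            tokens = self.embed(z_t).flatten(2).transpose(1, 2)   # [B, N, dim] patch tokens
            tokens = self.blocks(tokens)                          # global self-attention
            out = self.unembed(tokens)                            # per-patch prediction
            B, C, H, W = z_t.shape
            out = out.transpose(1, 2).reshape(B, -1, H // self.patch, W // self.patch)
            return F.pixel_shuffle(out, self.patch)               # back to latent shape [B, C, H, W]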
Latent Diffusion for Language Generation
December 19, 2022
Justin Lovelace, Varsha Kishore, Chao Wan, Eliot Shekhtman, Kilian Q. Weinberger
Diffusion models have achieved great success in modeling continuous data
modalities such as images, audio, and video, but have seen limited use in
discrete domains such as language. Recent attempts to adapt diffusion to
language have presented diffusion as an alternative to existing pretrained
language models. We view diffusion and existing language models as
complementary. We demonstrate that encoder-decoder language models can be
utilized to efficiently learn high-quality language autoencoders. We then
demonstrate that continuous diffusion models can be learned in the latent space
of the language autoencoder, enabling us to sample continuous latent
representations that can be decoded into natural language with the pretrained
decoder. We validate the effectiveness of our approach for unconditional,
class-conditional, and sequence-to-sequence language generation. We demonstrate
across multiple diverse data sets that our latent language diffusion models are
significantly more effective than previous diffusion language models.
Difformer: Empowering Diffusion Model on Embedding Space for Text Generation
December 19, 2022
Zhujin Gao, Junliang Guo, Xu Tan, Yongxin Zhu, Fang Zhang, Jiang Bian, Linli Xu
Diffusion models have achieved state-of-the-art synthesis quality on both
visual and audio tasks, and recent works further adapt them to textual data by
diffusing on the embedding space. In this paper, we systematically study and
analyze the challenges that distinguish the embedding space from the continuous
data space, which have not been carefully explored. Firstly, the data distribution is
learnable for embeddings, which may lead to the collapse of the loss function.
Secondly, as the norm of embeddings varies between popular and rare words,
adding the same noise scale will lead to sub-optimal results. In addition, we
find the normal level of noise causes insufficient training of the model. To
address the above challenges, we propose Difformer, an embedding diffusion
model based on Transformer, which consists of three essential modules including
an additional anchor loss function, a layer normalization module for
embeddings, and a noise factor to the Gaussian noise. Experiments on two
seminal text generation tasks including machine translation and text
summarization show the superiority of Difformer over compared embedding
diffusion baselines.
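The three modules can be combined into a small loss sketch (names, the exact form of the anchor loss, and where the noise factor enters are our assumptions for illustration, not the paper's definition):

    import torch
    import torch.nn.functional as F

    def difformer_style_loss(model, embed, tokens, alphas_bar, noise_factor=2.0):
        e = embed(tokens)                                 # token embeddings [B, L, D]
        e0 = F.layer_norm(e, e.shape[-1:])                # (ii) embedding layer normalization
        t = torch.randint(0, len(alphas_bar), (e0.shape[0],), device=e0.device)
        a_bar = alphas_bar[t].view(-1, 1, 1)
        noise = noise_factor * torch.randn_like(e0)       # (iii) noise factor on the Gaussian noise
        e_t = a_bar.sqrt() * e0 + (1 - a_bar).sqrt() * noise
        e0_pred = model(e_t, t)                           # Transformer predicts the clean embedding
        diffusion_loss = F.mse_loss(e0_pred, e0)
        logits = e0_pred @ embed.weight.t()               # (i) anchor loss: tie predictions to real tokens
        anchor_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
        return diffusion_loss + anchor_loss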
Diffusing Surrogate Dreams of Video Scenes to Predict Video Memorability
December 19, 2022
Lorin Sweeney, Graham Healy, Alan F. Smeaton
As part of the MediaEval 2022 Predicting Video Memorability task we explore
the relationship between visual memorability, the visual representation that
characterises it, and the underlying concept portrayed by that visual
representation. We achieve state-of-the-art memorability prediction performance
with a model trained and tested exclusively on surrogate dream images,
elevating concepts to the status of a cornerstone memorability feature, and
finding strong evidence to suggest that the intrinsic memorability of visual
content can be distilled to its underlying concept or meaning, irrespective of
its specific visual representation.
Speed up the inference of diffusion models via shortcut MCMC sampling
December 18, 2022
Gang Chen
cs.CV, cs.LG, 68T10, I.2.6
Diffusion probabilistic models have recently achieved high-quality image
synthesis. However, one pain point is their notoriously slow inference, which
gradually refines images over thousands of steps and is time-consuming compared
to other generative models. In this paper, we present a shortcut MCMC sampling
algorithm that balances training and inference while preserving the quality of
the generated data. In particular, we add a global fidelity constraint with
shortcut MCMC sampling to combat the local fitting of diffusion models. We
report initial experiments with very promising results. Our implementation is
available at https://github.com//vividitytech/diffusion-mcmc.git.
Point-E: A System for Generating 3D Point Clouds from Complex Prompts
December 16, 2022
Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, Mark Chen
While recent work on text-conditional 3D object generation has shown
promising results, the state-of-the-art methods typically require multiple
GPU-hours to produce a single sample. This is in stark contrast to
state-of-the-art generative image models, which produce samples in a number of
seconds or minutes. In this paper, we explore an alternative method for 3D
object generation which produces 3D models in only 1-2 minutes on a single GPU.
Our method first generates a single synthetic view using a text-to-image
diffusion model, and then produces a 3D point cloud using a second diffusion
model which conditions on the generated image. While our method still falls
short of the state-of-the-art in terms of sample quality, it is one to two
orders of magnitude faster to sample from, offering a practical trade-off for
some use cases. We release our pre-trained point cloud diffusion models, as
well as evaluation code and models, at https://github.com/openai/point-e.
Diffusion Probabilistic Models beat GANs on Medical Images
December 14, 2022
Gustav Müller-Franzes, Jan Moritz Niehues, Firas Khader, Soroosh Tayebi Arasteh, Christoph Haarburger, Christiane Kuhl, Tianci Wang, Tianyu Han, Sven Nebelung, Jakob Nikolas Kather, Daniel Truhn
The success of Deep Learning applications critically depends on the quality
and scale of the underlying training data. Generative adversarial networks
(GANs) can generate arbitrarily large datasets, but diversity and fidelity are
limited, which has recently been addressed by denoising diffusion probabilistic
models (DDPMs) whose superiority has been demonstrated on natural images. In
this study, we propose Medfusion, a conditional latent DDPM for medical images.
We compare our DDPM-based model against GAN-based models, which constitute the
current state-of-the-art in the medical domain. Medfusion was trained and
compared with (i) StyleGan-3 on n=101,442 images from the AIROGS challenge
dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on
n=191,027 from the CheXpert dataset to generate radiographs with and without
cardiomegaly and (iii) wGAN on n=19,557 images from the CRCMS dataset to
generate histopathological images with and without microsatellite stability. In
the AIROGS, CRMCS, and CheXpert datasets, Medfusion achieved lower (=better)
FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus
84.31). Also, fidelity (precision) and diversity (recall) were higher (=better)
for Medfusion in all three datasets. Our study shows that DDPMs are a superior
alternative to GANs for image synthesis in the medical domain.
SPIRiT-Diffusion: SPIRiT-driven Score-Based Generative Modeling for Vessel Wall imaging
December 14, 2022
Chentao Cao, Zhuo-Xu Cui, Jing Cheng, Sen Jia, Hairong Zheng, Dong Liang, Yanjie Zhu
Diffusion models are currently the most advanced methods in image generation
and have been successfully applied to MRI reconstruction. However, existing
methods do not consider the characteristics of multi-coil acquisition of MRI
data. Therefore, we propose a new diffusion model, called SPIRiT-Diffusion, based on the
SPIRiT iterative reconstruction algorithm. Specifically, SPIRiT-Diffusion
characterizes the prior distribution of coil-by-coil images by score matching
and characterizes the k-space redundant prior between coils based on
self-consistency. With sufficient prior constraint utilized, we achieve
superior reconstruction results on the joint Intracranial and Carotid Vessel
Wall imaging dataset.
DifFace: Blind Face Restoration with Diffused Error Contraction
December 13, 2022
Zongsheng Yue, Chen Change Loy
While deep learning-based methods for blind face restoration have achieved
unprecedented success, they still suffer from two major limitations. First,
most of them deteriorate when facing complex degradations out of their training
data. Second, these methods require multiple constraints, e.g., fidelity,
perceptual, and adversarial losses, which require laborious hyper-parameter
tuning to stabilize and balance their influences. In this work, we propose a
novel method named DifFace that is capable of coping with unseen and complex
degradations more gracefully without complicated loss designs. The key of our
method is to establish a posterior distribution from the observed low-quality
(LQ) image to its high-quality (HQ) counterpart. In particular, we design a
transition distribution from the LQ image to the intermediate state of a
pre-trained diffusion model and then gradually transmit from this intermediate
state to the HQ target by recursively applying a pre-trained diffusion model.
The transition distribution only relies on a restoration backbone that is
trained with $L_2$ loss on some synthetic data, which favorably avoids the
cumbersome training process in existing methods. Moreover, the transition
distribution can contract the error of the restoration backbone and thus makes
our method more robust to unknown degradations. Comprehensive experiments show
that DifFace is superior to current state-of-the-art methods, especially in
cases with severe degradations. Code and model are available at
https://github.com/zsyOAOA/DifFace.
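The transition to an intermediate diffusion state reads naturally as a two-stage sampler; a hedged sketch follows (restorer, the diffusion object, its alphas_cumprod buffer, and p_sample are assumed names for illustration, not the released API):

    import torch

    @torch.no_grad()
    def difface_style_restore(lq_image, restorer, diffusion, start_step):
        # Map the LQ input to a rough HQ estimate with a simple L2-trained backbone,
        # diffuse that estimate to an intermediate timestep, then let the pre-trained
        # diffusion model carry it the rest of the way to the HQ target.
        x0_est = restorer(lq_image)
        a_bar = diffusion.alphas_cumprod[start_step]
        noise = torch.randn_like(x0_est)
        x_t = a_bar.sqrt() * x0_est + (1 - a_bar).sqrt() * noise
        for t in range(start_step, -1, -1):               # standard reverse diffusion from there
            x_t = diffusion.p_sample(x_t, t)
        return x_t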
HS-Diffusion: Learning a Semantic-Guided Diffusion Model for Head Swapping
December 13, 2022
Qinghe Wang, Lijie Liu, Miao Hua, Pengfei Zhu, Wangmeng Zuo, Qinghua Hu, Huchuan Lu, Bing Cao
Image-based head swapping task aims to stitch a source head to another source
body flawlessly. This seldom-studied task faces two major challenges: 1)
Preserving the head and body from various sources while generating a seamless
transition region. 2) No paired head swapping dataset and benchmark so far. In
this paper, we propose a semantic-mixing diffusion model for head swapping
(HS-Diffusion) which consists of a latent diffusion model (LDM) and a semantic
layout generator. We blend the semantic layouts of source head and source body,
and then inpaint the transition region by the semantic layout generator,
achieving a coarse-grained head swapping. Semantic-mixing LDM can further
implement a fine-grained head swapping with the inpainted layout as condition
by a progressive fusion process, while preserving head and body with
high-quality reconstruction. To this end, we propose a semantic calibration
strategy for natural inpainting and a neck alignment for geometric realism.
Importantly, we construct a new image-based head swapping benchmark and design
two tailored metrics (Mask-FID and Focal-FID). Extensive experiments
demonstrate the superiority of our framework. The code will be available at:
https://github.com/qinghew/HS-Diffusion.
Score-based Generative Modeling Secretly Minimizes the Wasserstein Distance
December 13, 2022
Dohyun Kwon, Ying Fan, Kangwook Lee
cs.LG, cs.AI, cs.NA, math.NA
Score-based generative models are shown to achieve remarkable empirical
performances in various applications such as image generation and audio
synthesis. However, a theoretical understanding of score-based diffusion models
is still incomplete. Recently, Song et al. showed that the training objective
of score-based generative models is equivalent to minimizing the
Kullback-Leibler divergence of the generated distribution from the data
distribution. In this work, we show that score-based models also minimize the
Wasserstein distance between them under suitable assumptions on the model.
Specifically, we prove that the Wasserstein distance is upper bounded by the
square root of the objective function up to multiplicative constants and a
fixed constant offset. Our proof is based on a novel application of the theory
of optimal transport, which can be of independent interest to the community. Our
numerical experiments support our findings. By analyzing our upper bounds, we
provide a few techniques to obtain tighter upper bounds.
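In the notation of the abstract, the claimed bound can be written schematically as

\[
W_2\bigl(p_{\mathrm{data}},\, p_\theta\bigr) \;\le\; C\,\sqrt{\mathcal{J}(\theta)} \;+\; E,
\]

where $\mathcal{J}(\theta)$ is the score-matching training objective, $C$ is a multiplicative constant determined by the model assumptions, and $E$ is a fixed constant offset. This is our paraphrase of the stated result rather than its precise statement, but it conveys why driving the usual training loss down also controls the Wasserstein distance between the generated and data distributions.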
The Stable Artist: Steering Semantics in Diffusion Latent Space
December 12, 2022
Manuel Brack, Patrick Schramowski, Felix Friedrich, Dominik Hintersdorf, Kristian Kersting
Large, text-conditioned generative diffusion models have recently gained a
lot of attention for their impressive performance in generating high-fidelity
images from text alone. However, achieving high-quality results is almost
unfeasible in a one-shot fashion. On the contrary, text-guided image generation
involves the user making many slight changes to inputs in order to iteratively
carve out the envisioned image. However, slight changes to the input prompt
often lead to entirely different images being generated, and thus the control
of the artist is limited in its granularity. To provide flexibility, we present
the Stable Artist, an image editing approach enabling fine-grained control of
the image generation process. The main component is semantic guidance (SEGA)
which steers the diffusion process along variable numbers of semantic
directions. This allows for subtle edits to images, changes in composition and
style, as well as optimization of the overall artistic conception. Furthermore,
SEGA enables probing of latent spaces to gain insights into the representation
of concepts learned by the model, even complex ones such as ‘carbon emission’.
We demonstrate the Stable Artist on several tasks, showcasing high-quality
image editing and composition.
Diff-Font: Diffusion Model for Robust One-Shot Font Generation
December 12, 2022
Haibin He, Xinyuan Chen, Chaoyue Wang, Juhua Liu, Bo Du, Dacheng Tao, Yu Qiao
Font generation is a difficult and time-consuming task, especially in those
languages using ideograms that have complicated structures with a large number
of characters, such as Chinese. To solve this problem, few-shot font generation
and even one-shot font generation have attracted a lot of attention. However,
most existing font generation methods may still suffer from (i) large
cross-font gap challenge; (ii) subtle cross-font variation problem; and (iii)
incorrect generation of complicated characters. In this paper, we propose a
novel one-shot font generation method based on a diffusion model, named
Diff-Font, which can be stably trained on large datasets. The proposed model
aims to generate the entire font library by giving only one sample as the
reference. Specifically, a large stroke-wise dataset is constructed, and a
stroke-wise diffusion model is proposed to preserve the structure and the
completion of each generated character. To our best knowledge, the proposed
Diff-Font is the first work that developed diffusion models to handle the font
generation task. The well-trained Diff-Font is not only robust to font gap and
font variation, but also achieves promising performance on difficult character
generation. Compared to previous font generation methods, our model reaches
state-of-the-art performance both qualitatively and quantitatively.
DiffAlign : Few-shot learning using diffusion based synthesis and alignment
December 11, 2022
Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, Rama Chellappa
Visual recognition in a low-data regime is challenging and often prone to
overfitting. To mitigate this issue, several data augmentation strategies have
been proposed. However, standard transformations, e.g., rotation, cropping, and
flipping provide limited semantic variations. To this end, we propose Cap2Aug,
an image-to-image diffusion model-based data augmentation strategy using image
captions as text prompts. We generate captions from the limited training images
and using these captions edit the training images using an image-to-image
stable diffusion model to generate semantically meaningful augmentations. This
strategy generates augmented versions of images similar to the training images
yet provides semantic diversity across the samples. We show that the variations
within the class can be captured by the captions and then translated to
generate diverse samples using the image-to-image diffusion model guided by the
captions. However, naive learning on synthetic images is not adequate due to
the domain gap between real and synthetic images. Thus, we employ a maximum
mean discrepancy (MMD) loss to align the synthetic images to the real images
for minimizing the domain gap. We evaluate our method on few-shot and long-tail
classification tasks and obtain performance improvements over state-of-the-art,
especially in the low-data regimes.
How to Backdoor Diffusion Models?
December 11, 2022
Sheng-Yen Chou, Pin-Yu Chen, Tsung-Yi Ho
Diffusion models are state-of-the-art deep learning empowered generative
models that are trained based on the principle of learning forward and reverse
diffusion processes via progressive noise-addition and denoising. To gain a
better understanding of the limitations and potential risks, this paper
presents the first study on the robustness of diffusion models against backdoor
attacks. Specifically, we propose BadDiffusion, a novel attack framework that
engineers compromised diffusion processes during model training for backdoor
implantation. At the inference stage, the backdoored diffusion model will
behave just like an untampered generator for regular data inputs, while falsely
generating some targeted outcome designed by the bad actor upon receiving the
implanted trigger signal. Such a critical risk can be dreadful for downstream
tasks and applications built upon the problematic model. Our extensive
experiments on various backdoor attack settings show that BadDiffusion can
consistently lead to compromised diffusion models with high utility and target
specificity. Even worse, BadDiffusion can be made cost-effective by simply
finetuning a clean pre-trained diffusion model to implant backdoors. We also
explore some possible countermeasures for risk mitigation. Our results call
attention to potential risks and possible misuse of diffusion models. Our code
is available at https://github.com/IBM/BadDiffusion.
ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal
December 09, 2022
Lanqing Guo, Chong Wang, Wenhan Yang, Siyu Huang, Yufei Wang, Hanspeter Pfister, Bihan Wen
Recent deep learning methods have achieved promising results in image shadow
removal. However, their restored images still suffer from unsatisfactory
boundary artifacts, due to the lack of degradation prior embedding and the
deficiency in modeling capacity. Our work addresses these issues by proposing a
unified diffusion framework that integrates both the image and degradation
priors for highly effective shadow removal. In detail, we first propose a
shadow degradation model, which inspires us to build a novel unrolling
diffusion model, dubbed ShadowDiffusion. It remarkably improves the model’s
capacity in shadow removal via progressively refining the desired output with
both degradation prior and diffusive generative prior, which by nature can
serve as a new strong baseline for image restoration. Furthermore,
ShadowDiffusion progressively refines the estimated shadow mask as an auxiliary
task of the diffusion generator, which leads to more accurate and robust
shadow-free image generation. We conduct extensive experiments on three popular
public datasets, including ISTD, ISTD+, and SRD, to validate our method’s
effectiveness. Compared to the state-of-the-art methods, our model achieves a
significant improvement in terms of PSNR, increasing from 31.69 dB to 34.73 dB
on the SRD dataset.
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis
December 08, 2022
Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, Christian Theobalt
Conventional methods for human motion synthesis are either deterministic or
struggle with the trade-off between motion diversity and motion quality. In
response to these limitations, we introduce MoFusion, i.e., a new
denoising-diffusion-based framework for high-quality conditional human motion
synthesis that can generate long, temporally plausible, and semantically
accurate motions based on a range of conditioning contexts (such as music and
text). We also present ways to introduce well-known kinematic losses for motion
plausibility within the motion diffusion framework through our scheduled
weighting strategy. The learned latent space can be used for several
interactive motion editing applications – like inbetweening, seed
conditioning, and text-based editing – thus, providing crucial abilities for
virtual character animation and robotics. Through comprehensive quantitative
evaluations and a perceptual user study, we demonstrate the effectiveness of
MoFusion compared to the state of the art on established benchmarks in the
literature. We urge the reader to watch our supplementary video and visit
https://vcai.mpi-inf.mpg.de/projects/MoFusion.
SINE: SINgle Image Editing with Text-to-Image Diffusion Models
December 08, 2022
Zhixing Zhang, Ligong Han, Arnab Ghosh, Dimitris Metaxas, Jian Ren
Recent works on diffusion models have demonstrated a strong capability for
conditioning image generation, e.g., text-guided image synthesis. Such success
inspires many efforts trying to use large-scale pre-trained diffusion models
for tackling a challenging problem–real image editing. Works conducted in this
area learn a unique textual token corresponding to several images containing
the same object. However, under many circumstances, only one image is
available, such as the painting of the Girl with a Pearl Earring. Using
existing works on fine-tuning the pre-trained diffusion models with a single
image causes severe overfitting issues. The information leakage from the
pre-trained diffusion models makes it impossible for the edited result to keep
the same content as the given image while creating new features described by
the language guidance. This
work aims to address the problem of single-image editing. We propose a novel
model-based guidance built upon the classifier-free guidance so that the
knowledge from the model trained on a single image can be distilled into the
pre-trained diffusion model, enabling content creation even with one given
image. Additionally, we propose a patch-based fine-tuning that can effectively
help the model generate images of arbitrary resolution. We provide extensive
experiments to validate the design choices of our approach and show promising
editing capabilities, including changing style, content addition, and object
manipulation. The code is available for research purposes at
https://github.com/zhang-zx/SINE.git.
Diffusion Guided Domain Adaptation of Image Generators
December 08, 2022
Kunpeng Song, Ligong Han, Bingchen Liu, Dimitris Metaxas, Ahmed Elgammal
Can a text-to-image diffusion model be used as a training objective for
adapting a GAN generator to another domain? In this paper, we show that the
classifier-free guidance can be leveraged as a critic and enable generators to
distill knowledge from large-scale text-to-image diffusion models. Generators
can be efficiently shifted into new domains indicated by text prompts without
access to groundtruth samples from target domains. We demonstrate the
effectiveness and controllability of our method through extensive experiments.
Although not trained to minimize CLIP loss, our model achieves equally high
CLIP scores and significantly lower FID than prior work on short prompts, and
outperforms the baseline qualitatively and quantitatively on long and
complicated prompts. To our best knowledge, the proposed method is the first
attempt at incorporating large-scale pre-trained diffusion models and
distillation sampling for text-driven image generator domain adaptation and
achieves a quality previously out of reach. Moreover, we extend our work to
3D-aware style-based generators and DreamBooth guidance.
Executing your Commands via Motion Diffusion in Latent Space
December 08, 2022
Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, Jingyi Yu, Gang Yu
We study a challenging task, conditional human motion generation, which
produces plausible human motion sequences according to various conditional
inputs, such as action classes or textual descriptors. Since human motions are
highly diverse and their distribution differs markedly from that of the
conditional modalities, such as textual descriptors in natural languages, it is
hard to learn a probabilistic mapping from the desired conditional modality to
the human motion sequences. Besides, the raw motion data from the motion
capture system might be redundant in sequences and contain noise; directly
modeling the joint distribution over the raw motion sequences and conditional
modalities would incur heavy computational overhead and might result in
artifacts introduced by the capture noise. To learn a better representation
of the various human motion sequences, we first design a powerful Variational
AutoEncoder (VAE) and arrive at a representative and low-dimensional latent
code for a human motion sequence. Then, instead of using a diffusion model to
establish the connections between the raw motion sequences and the conditional
inputs, we perform a diffusion process on the motion latent space. Our proposed
Motion Latent-based Diffusion model (MLD) could produce vivid motion sequences
conforming to the given conditional inputs and substantially reduce the
computational overhead in both the training and inference stages. Extensive
experiments on various human motion generation tasks demonstrate that our MLD
achieves significant improvements over state-of-the-art methods across a wide
range of human motion generation tasks, while being two orders of magnitude
faster than previous diffusion models operating on raw motion sequences.
Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models
December 07, 2022
Gowthami Somepalli, Vasu Singla, Micah Goldblum, Jonas Geiping, Tom Goldstein
Cutting-edge diffusion models produce images with high quality and
customizability, enabling them to be used for commercial art and graphic design
purposes. But do diffusion models create unique works of art, or are they
replicating content directly from their training sets? In this work, we study
image retrieval frameworks that enable us to compare generated images with
training samples and detect when content has been replicated. Applying our
frameworks to diffusion models trained on multiple datasets including Oxford
flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training
set size impact rates of content replication. We also identify cases where
diffusion models, including the popular Stable Diffusion model, blatantly copy
from their training data.
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
December 07, 2022
Zhentao Yu, Zixin Yin, Deyu Zhou, Duomin Wang, Finn Wong, Baoyuan Wang
cs.GR, cs.CV, cs.SD, eess.AS
In this paper, we introduce a simple and novel framework for one-shot
audio-driven talking head generation. Unlike prior works that require
additional driving sources for controlled synthesis in a deterministic manner,
we instead probabilistically sample all the holistic lip-irrelevant facial
motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the
input audio while still maintaining both the photo-realism of audio-lip
synchronization and the overall naturalness. This is achieved by our newly
proposed audio-to-visual diffusion prior trained on top of the mapping between
audio and disentangled non-lip facial representations. Thanks to the
probabilistic nature of the diffusion prior, one big advantage of our framework
is it can synthesize diverse facial motion sequences given the same audio clip,
which is quite user-friendly for many real applications. Through comprehensive
evaluations on public benchmarks, we conclude that (1) our diffusion prior
outperforms auto-regressive prior significantly on almost all the concerned
metrics; (2) our overall system is competitive with prior works in terms of
audio-lip synchronization but can effectively sample rich and natural-looking
lip-irrelevant facial motions while still semantically harmonized with the
audio input.
One Sample Diffusion Model in Projection Domain for Low-Dose CT Imaging
December 07, 2022
Bin Huang, Liu Zhang, Shiyu Lu, Boyu Lin, Weiwen Wu, Qiegen Liu
Low-dose computed tomography (CT) plays a significant role in reducing the
radiation risk in clinical applications. However, lowering the radiation dose
will significantly degrade the image quality. With the rapid development and
wide application of deep learning, it has brought new directions for the
development of low-dose CT imaging algorithms. Therefore, we propose a fully
unsupervised one sample diffusion model (OSDM) in the projection domain for
low-dose CT reconstruction. To extract sufficient prior information from a single sample,
the Hankel matrix formulation is employed. Besides, the penalized weighted
least-squares and total variation are introduced to achieve superior image
quality. Specifically, we first train a score-based generative model on one
sinogram by extracting a great number of tensors from the structural-Hankel
matrix as the network input to capture prior distribution. Then, at the
inference stage, the stochastic differential equation solver and data
consistency step are performed iteratively to obtain the sinogram data.
Finally, the final image is obtained through the filtered back-projection
algorithm. The reconstructed results approach the normal-dose counterparts,
proving that OSDM is a practical and effective model for reducing artifacts
and preserving image quality.
Proposal of a Score Based Approach to Sampling Using Monte Carlo Estimation of Score and Oracle Access to Target Density
December 06, 2022
Curtis McDonald, Andrew Barron
Score based approaches to sampling have shown much success as a generative
algorithm to produce new samples from a target density given a pool of initial
samples. In this work, we consider if we have no initial samples from the
target density, but rather $0^{th}$ and $1^{st}$ order oracle access to the log
likelihood. Such problems may arise in Bayesian posterior sampling, or in
approximate minimization of non-convex functions. Using this knowledge alone,
we propose a Monte Carlo method to estimate the score empirically as a
particular expectation of a random variable. Using this estimator, we can then
run a discrete version of the backward flow SDE to produce samples from the
target density. This approach has the benefit of not relying on a pool of
initial samples from the target density, and it does not rely on a neural
network or other black box model to estimate the score.
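The idea of estimating the score as an expectation admits a compact form for a Gaussian-smoothed target (our notation; the paper's estimator and its variance control are more involved). If $p_\sigma = p * \mathcal{N}(0, \sigma^2 I)$ with $p \propto e^{-f}$, then

\[
\nabla_x \log p_\sigma(x) = \frac{1}{\sigma^2}\Bigl(\mathbb{E}_{Y\sim q_x}[Y] - x\Bigr),
\qquad
q_x(y) \propto \exp\!\Bigl(-f(y) - \tfrac{\lVert x-y\rVert^2}{2\sigma^2}\Bigr),
\]

so a Monte Carlo average over samples from $q_x$, which only requires oracle access to $f$ and $\nabla f$, yields an empirical score that can be plugged into a discretized backward SDE.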
Diffusion-SDF: Text-to-Shape via Voxelized Diffusion
December 06, 2022
Muheng Li, Yueqi Duan, Jie Zhou, Jiwen Lu
cs.CV, cs.AI, cs.GR, cs.LG
With the rising industrial attention to 3D virtual modeling technology,
generating novel 3D content based on specified conditions (e.g. text) has
become a hot issue. In this paper, we propose a new generative 3D modeling
framework called Diffusion-SDF for the challenging task of text-to-shape
synthesis. Previous approaches lack flexibility in both 3D data representation
and shape generation, thereby failing to generate highly diversified 3D shapes
conforming to the given text descriptions. To address this, we propose an SDF
autoencoder together with the Voxelized Diffusion model to learn and generate
representations for voxelized signed distance fields (SDFs) of 3D shapes.
Specifically, we design a novel UinU-Net architecture that implants a
local-focused inner network inside the standard U-Net architecture, which
enables better reconstruction of patch-independent SDF representations. We
extend our approach to further text-to-shape tasks including text-conditioned
shape completion and manipulation. Experimental results show that Diffusion-SDF
generates both higher quality and more diversified 3D shapes that conform well
to given text descriptions when compared to previous approaches. Code is
available at: https://github.com/ttlmh/Diffusion-SDF
NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors
December 06, 2022
Congyue Deng, Chiyu "Max'' Jiang, Charles R. Qi, Xinchen Yan, Yin Zhou, Leonidas Guibas, Dragomir Anguelov
2D-to-3D reconstruction is an ill-posed problem, yet humans are good at
solving this problem due to their prior knowledge of the 3D world developed
over years. Driven by this observation, we propose NeRDi, a single-view NeRF
synthesis framework with general image priors from 2D diffusion models.
Formulating single-view reconstruction as an image-conditioned 3D generation
problem, we optimize the NeRF representations by minimizing a diffusion loss on
its arbitrary view renderings with a pretrained image diffusion model under the
input-view constraint. We leverage off-the-shelf vision-language models and
introduce a two-section language guidance as conditioning inputs to the
diffusion model. This is especially helpful for improving multiview content
coherence as it narrows down the general image prior conditioned on the
semantic and visual features of the single-view input image. Additionally, we
introduce a geometric loss based on estimated depth maps to regularize the
underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset
show that our method can synthesize novel views with higher quality even
compared to existing methods trained on this dataset. We also demonstrate our
generalizability in zero-shot NeRF synthesis for in-the-wild images.
SparseFusion: Distilling View-conditioned Diffusion for 3D Reconstruction
December 01, 2022
Zhizhuo Zhou, Shubham Tulsiani
We propose SparseFusion, a sparse view 3D reconstruction approach that
unifies recent advances in neural rendering and probabilistic image generation.
Existing approaches typically build on neural rendering with re-projected
features but fail to generate unseen regions or handle uncertainty under large
viewpoint changes. Alternate methods treat this as a (probabilistic) 2D
synthesis task, and while they can generate plausible 2D images, they do not
infer a consistent underlying 3D. However, we find that this trade-off between
3D consistency and probabilistic image generation does not need to exist. In
fact, we show that geometric consistency and generative inference can be
complementary in a mode-seeking behavior. By distilling a 3D consistent scene
representation from a view-conditioned latent diffusion model, we are able to
recover a plausible 3D representation whose renderings are both accurate and
realistic. We evaluate our approach across 51 categories in the CO3D dataset
and show that it outperforms existing methods, in both distortion and
perception metrics, for sparse-view novel view synthesis.
Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation
December 01, 2022
Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A. Yeh, Greg Shakhnarovich
A diffusion model learns to predict a vector field of gradients. We propose
to apply the chain rule to the learned gradients and back-propagate the score of a
diffusion model through the Jacobian of a differentiable renderer, which we
instantiate to be a voxel radiance field. This setup aggregates 2D scores at
multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D
model for 3D data generation. We identify a technical challenge of distribution
mismatch that arises in this application, and propose a novel estimation
mechanism to resolve it. We run our algorithm on several off-the-shelf
diffusion image generative models, including the recently released Stable
Diffusion trained on the large-scale LAION dataset.
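The chain-rule aggregation described above has a compact form (again in our notation, not the paper's exact statement). With a differentiable renderer $x_\pi = g(\theta, \pi)$ producing an image at camera $\pi$ from 3D parameters $\theta$, and a pretrained 2D score $s_\phi$, the lifted 3D update is

\[
\nabla_\theta \log \tilde{p}(\theta) \;\approx\; \mathbb{E}_{\pi}\!\left[\Bigl(\tfrac{\partial g(\theta,\pi)}{\partial \theta}\Bigr)^{\!\top} s_\phi\bigl(g(\theta,\pi)\bigr)\right],
\]

i.e., the 2D score evaluated on each rendering is back-propagated through the renderer's Jacobian and averaged over camera viewpoints; the distribution mismatch between renderings and the diffusion model's training images is what the paper's estimation mechanism is designed to correct.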
VIDM: Video Implicit Diffusion Models
December 01, 2022
Kangfu Mei, Vishal M. Patel
Diffusion models have emerged as a powerful generative method for
synthesizing high-quality and diverse set of images. In this paper, we propose
a video generation method based on diffusion models, where the effects of
motion are modeled in an implicit condition manner, i.e. one can sample
plausible video motions according to the latent feature of frames. We improve
the quality of the generated videos by proposing multiple strategies such as
sampling space truncation, robustness penalty, and positional group
normalization. Various experiments are conducted on datasets consisting of
videos with different resolutions and different numbers of frames. Results show
that the proposed method outperforms the state-of-the-art generative
adversarial network-based methods by a significant margin in terms of FVD
scores as well as perceptible visual quality.
Shape-Guided Diffusion with Inside-Outside Attention
December 01, 2022
Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
When manipulating an object, existing text-to-image diffusion models often
ignore the shape of the object and generate content that is incorrectly scaled,
cut off, or replaced with background content. We propose a training-free
method, Shape-Guided Diffusion, that modifies pretrained diffusion models to be
sensitive to shape input specified by a user or automatically inferred from
text. We use a novel Inside-Outside Attention mechanism during the inversion
and generation process to apply this shape constraint to the cross- and
self-attention maps. Our mechanism designates which spatial region is the
object (inside) vs. background (outside) then associates edits specified by
text prompts to the correct region. We demonstrate the efficacy of our method
on the shape-guided editing task, where the model must replace an object
according to a text prompt and object mask. We curate a new ShapePrompts
benchmark derived from MS-COCO and achieve SOTA results in shape faithfulness
without a degradation in text alignment or image realism according to both
automatic metrics and annotator ratings. Our data and code will be made
available at https://shape-guided-diffusion.github.io.
Denoising Diffusion for Sampling SAT Solutions
November 30, 2022
Karlis Freivalds, Sergejs Kozlovics
Generating diverse solutions to the Boolean Satisfiability Problem (SAT) is a
hard computational problem with practical applications for testing and
functional verification of software and hardware designs. We explore how to
generate such solutions using Denoising Diffusion coupled with a Graph Neural
Network to implement the denoising function. We find that the obtained accuracy
is similar to the currently best purely neural method and the produced SAT
solutions are highly diverse, even if the system is trained with non-random
solutions from a standard solver.
Multiresolution Textual Inversion
November 30, 2022
Giannis Daras, Alexandros G. Dimakis
We extend Textual Inversion to learn pseudo-words that represent a concept at
different resolutions. This allows us to generate images that use the concept
with different levels of detail and also to manipulate different resolutions
using language. Once learned, the user can generate images at different levels
of agreement to the original concept; “A photo of $S^{(0)}$” produces the exact
object while the prompt “A photo of $S^{(0.8)}$” only matches the rough outlines
and colors. Our framework allows us to generate images that use different
resolutions of an image (e.g. details, textures, styles) as separate
pseudo-words that can be composed in various ways. We open-source our code at
the following URL: https://github.com/giannisdaras/multires_textual_inversion
High-Fidelity Guided Image Synthesis with Latent Diffusion Models
November 30, 2022
Jaskirat Singh, Stephen Gould, Liang Zheng
cs.CV, cs.AI, cs.LG, stat.ML
Controllable image synthesis with user scribbles has gained huge public
interest with the recent advent of text-conditioned latent diffusion models.
The user scribbles control the color composition while the text prompt provides
control over the overall image semantics. However, we note that prior works in
this direction suffer from an intrinsic domain shift problem, wherein the
generated outputs often lack details and resemble simplistic representations of
the target domain. In this paper, we propose a novel guided image synthesis
framework, which addresses this problem by modeling the output image as the
solution of a constrained optimization problem. We show that while computing an
exact solution to the optimization is infeasible, an approximation of it can
be achieved with just a single pass of the reverse diffusion
process. Additionally, we show that by simply defining a cross-attention based
correspondence between the input text tokens and the user stroke-painting, the
user is also able to control the semantics of different painted regions without
requiring any conditional training or finetuning. Human user study results show
that the proposed approach outperforms the previous state-of-the-art by over
85.32% on the overall user satisfaction scores. Project page for our paper is
available at https://1jsingh.github.io/gradop.
DiffPose: Toward More Reliable 3D Pose Estimation
November 30, 2022
Jia Gong, Lin Geng Foo, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, Jun Liu
Monocular 3D human pose estimation is quite challenging due to the inherent
ambiguity and occlusion, which often lead to high uncertainty and
indeterminacy. On the other hand, diffusion models have recently emerged as an
effective tool for generating high-quality images from noise. Inspired by their
capability, we explore a novel pose estimation framework (DiffPose) that
formulates 3D pose estimation as a reverse diffusion process. We incorporate
novel designs into our DiffPose to facilitate the diffusion process for 3D pose
estimation: a pose-specific initialization of pose uncertainty distributions, a
Gaussian Mixture Model-based forward diffusion process, and a
context-conditioned reverse diffusion process. Our proposed DiffPose
significantly outperforms existing methods on the widely used pose estimation
benchmarks Human3.6M and MPI-INF-3DHP. Project page:
https://gongjia0208.github.io/Diffpose/.
Score-based Continuous-time Discrete Diffusion Models
November 30, 2022
Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, Hanjun Dai
Score-based modeling through stochastic differential equations (SDEs) has
provided a new perspective on diffusion models, and demonstrated superior
performance on continuous data. However, the gradient of the log-likelihood
function, i.e., the score function, is not properly defined for discrete
spaces. This makes it non-trivial to adapt score-based modeling to categorical
data. In this paper, we extend diffusion models to
discrete variables by introducing a stochastic jump process where the reverse
process denoises via a continuous-time Markov chain. This formulation admits an
analytical simulation during backward sampling. To learn the reverse process,
we extend score matching to general categorical data and show that an unbiased
estimator can be obtained via simple matching of the conditional marginal
distributions. We demonstrate the effectiveness of the proposed method on a set
of synthetic and real-world music and image benchmarks.
3D Neural Field Generation using Triplane Diffusion
November 30, 2022
J. Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, Gordon Wetzstein
Diffusion models have emerged as the state-of-the-art for image generation,
among other tasks. Here, we present an efficient diffusion-based model for
3D-aware generation of neural fields. Our approach pre-processes training data,
such as ShapeNet meshes, by converting them to continuous occupancy fields and
factoring them into a set of axis-aligned triplane feature representations.
Thus, our 3D training scenes are all represented by 2D feature planes, and we
can directly train existing 2D diffusion models on these representations to
generate 3D neural fields with high quality and diversity, outperforming
alternative approaches to 3D-aware generation. Our approach requires essential
modifications to existing triplane factorization pipelines to make the
resulting features easy to learn for the diffusion model. We demonstrate
state-of-the-art results on 3D generation on several object classes from
ShapeNet.
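As a rough illustration of the triplane representation described above, the sketch below (interfaces and channel sizes are assumptions, not the authors' code) queries an occupancy value at a 3D point by projecting it onto three axis-aligned feature planes and bilinearly sampling each.
```python
import torch
import torch.nn.functional as F

def query_occupancy(planes, points, mlp):
    """Minimal sketch of triplane feature lookup (assumed, not the paper's code).
    planes: tensor [3, C, H, W], axis-aligned feature planes (xy, xz, yz).
    points: tensor [N, 3] with coordinates in [-1, 1].
    mlp:    any callable mapping [N, C] features to [N, 1] occupancy logits."""
    proj = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    feats = 0.0
    for plane, uv in zip(planes, proj):
        grid = uv.view(1, -1, 1, 2)                        # [1, N, 1, 2]
        sampled = F.grid_sample(plane.unsqueeze(0), grid,  # [1, C, N, 1]
                                align_corners=True)
        feats = feats + sampled[0, :, :, 0].t()            # [N, C]
    return mlp(feats)                                      # [N, 1] logits

planes = torch.randn(3, 32, 64, 64)
points = torch.rand(1024, 3) * 2 - 1
mlp = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 1))
occ = query_occupancy(planes, points, mlp)
print(occ.shape)  # torch.Size([1024, 1])
```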
SinDDM: A Single Image Denoising Diffusion Model
November 29, 2022
Vladimir Kulikov, Shahar Yadin, Matan Kleiner, Tomer Michaeli
Denoising diffusion models (DDMs) have led to staggering performance leaps in
image generation, editing and restoration. However, existing DDMs use very
large datasets for training. Here, we introduce a framework for training a DDM
on a single image. Our method, which we coin SinDDM, learns the internal
statistics of the training image by using a multi-scale diffusion process. To
drive the reverse diffusion process, we use a fully-convolutional light-weight
denoiser, which is conditioned on both the noise level and the scale. This
architecture allows generating samples of arbitrary dimensions, in a
coarse-to-fine manner. As we illustrate, SinDDM generates diverse high-quality
samples, and is applicable in a wide array of tasks, including style transfer
and harmonization. Furthermore, it can be easily guided by external
supervision. Particularly, we demonstrate text-guided generation from a single
image using a pre-trained CLIP model.
DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion models
November 29, 2022
Karl Holmquist, Bastian Wandt
Traditionally, monocular 3D human pose estimation employs a machine learning
model to predict the most likely 3D pose for a given input image. However, a
single image can be highly ambiguous and induces multiple plausible solutions
for the 2D-3D lifting step which results in overly confident 3D pose
predictors. To this end, we propose \emph{DiffPose}, a conditional diffusion
model, that predicts multiple hypotheses for a given input image. In comparison
to similar approaches, our diffusion model is straightforward and avoids
intensive hyperparameter tuning, complex network structures, mode collapse, and
unstable training. Moreover, we tackle a problem of the common two-step
approach that first estimates a distribution of 2D joint locations via
joint-wise heatmaps and consecutively approximates them based on first- or
second-moment statistics. Since such a simplification of the heatmaps removes
valid information about possibly correct, though labeled unlikely, joint
locations, we propose to represent the heatmaps as a set of 2D joint candidate
samples. To extract information about the original distribution from these
samples we introduce our \emph{embedding transformer} that conditions the
diffusion model. Experimentally, we show that DiffPose slightly improves upon
the state of the art for multi-hypothesis pose estimation for simple poses and
outperforms it by a large margin for highly ambiguous poses.
Wavelet Diffusion Models are fast and scalable Image Generators
November 29, 2022
Hao Phung, Quan Dao, Anh Tran
Diffusion models are rising as a powerful solution for high-fidelity image
generation, which exceeds GANs in quality in many circumstances. However, their
slow training and inference speed is a huge bottleneck, blocking them from
being used in real-time applications. A recent DiffusionGAN method
significantly decreases the models’ running time by reducing the number of
sampling steps from thousands to several, but their speed still largely lags
behind that of GAN counterparts. This paper aims to reduce the speed gap by
proposing a novel wavelet-based diffusion scheme. We extract low- and
high-frequency components from both image and feature levels via wavelet
decomposition and adaptively handle these components for faster processing
while maintaining good generation quality. Furthermore, we propose to use a
reconstruction term, which effectively boosts the model training convergence.
Experimental results on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets
prove our solution is a stepping-stone to offering real-time and high-fidelity
diffusion models. Our code and pre-trained checkpoints are available at
\url{https://github.com/VinAIResearch/WaveDiff.git}.
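For intuition about the wavelet decomposition this scheme relies on, here is a single-level 2D Haar transform that splits an image batch into one low-frequency and three high-frequency half-resolution bands; it is a generic building block, not the WaveDiff implementation.
```python
import torch

def haar_dwt2d(x):
    """One-level 2D Haar wavelet transform (illustrative sketch only).
    x: [B, C, H, W] with even H and W. Returns (low, (lh, hl, hh)),
    one low-frequency band and three high-frequency detail bands,
    each of spatial size H/2 x W/2."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    low = (a + b + c + d) / 2.0     # approximation (LL)
    lh = (a - b + c - d) / 2.0      # horizontal detail
    hl = (a + b - c - d) / 2.0      # vertical detail
    hh = (a - b - c + d) / 2.0      # diagonal detail
    return low, (lh, hl, hh)

x = torch.randn(2, 3, 64, 64)
low, highs = haar_dwt2d(x)
print(low.shape)  # torch.Size([2, 3, 32, 32]); a diffusion model can now
                  # operate on these half-resolution bands instead of x
```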
Dimensionality-Varying Diffusion Process
November 29, 2022
Han Zhang, Ruili Feng, Zhantao Yang, Lianghua Huang, Yu Liu, Yifei Zhang, Yujun Shen, Deli Zhao, Jingren Zhou, Fan Cheng
Diffusion models, which learn to reverse a signal destruction process to
generate new data, typically require the signal at each step to have the same
dimension. We argue that, considering the spatial redundancy in image signals,
there is no need to maintain a high dimensionality in the evolution process,
especially in the early generation phase. To this end, we make a theoretical
generalization of the forward diffusion process via signal decomposition.
Concretely, we manage to decompose an image into multiple orthogonal components
and control the attenuation of each component when perturbing the image. That
way, along with the noise strength increasing, we are able to diminish those
inconsequential components and thus use a lower-dimensional signal to represent
the source, barely losing information. Such a reformulation allows us to vary
dimensions in both training and inference of diffusion models. Extensive
experiments on a range of datasets suggest that our approach substantially
reduces the computational cost and achieves on-par or even better synthesis
performance compared to baseline methods. We also show that our strategy
facilitates high-resolution image synthesis and improves FID of diffusion model
trained on FFHQ at $1024\times1024$ resolution from 52.40 to 10.46. Code and
models will be made publicly available.
Refining Generative Process with Discriminator Guidance in Score-based Diffusion Models
November 28, 2022
Dongjun Kim, Yeongmin Kim, Se Jung Kwon, Wanmo Kang, Il-Chul Moon
The proposed method, Discriminator Guidance, aims to improve sample
generation of pre-trained diffusion models. The approach introduces a
discriminator that gives explicit supervision to a denoising sample path
whether it is realistic or not. Unlike GANs, our approach does not require
joint training of score and discriminator networks. Instead, we train the
discriminator after score training, making discriminator training stable and
fast to converge. In sample generation, we add an auxiliary term to the
pre-trained score to deceive the discriminator. This term corrects the model
score to the data score at the optimal discriminator, which implies that the
discriminator helps better score estimation in a complementary way. Using our
algorithm, we achieve state-of-the-art results on ImageNet 256x256 with FID 1.83
and recall 0.64, similar to the validation data’s FID (1.68) and recall (0.66).
We release the code at https://github.com/alsdudrla10/DG.
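The core correction term can be sketched in a few lines: add the gradient of the discriminator's estimated log density ratio to the frozen pre-trained score. The interfaces below are assumptions for illustration, not the released code.
```python
import torch

def discriminator_guided_score(score_model, discriminator, x, t, weight=1.0):
    """Hedged sketch of the discriminator-guidance idea: add the gradient of the
    discriminator's log density ratio estimate to a frozen pre-trained score.
    `score_model(x, t)` and `discriminator(x, t)` are assumed interfaces; the
    discriminator outputs the probability that x_t lies on a real sample path."""
    x = x.detach().requires_grad_(True)
    d = discriminator(x, t).clamp(1e-6, 1 - 1e-6)
    log_ratio = torch.log(d) - torch.log(1 - d)          # log D / (1 - D)
    grad = torch.autograd.grad(log_ratio.sum(), x)[0]    # correction term
    with torch.no_grad():
        base = score_model(x, t)
    return base + weight * grad

# Toy usage with stand-in networks:
score_model = lambda x, t: -x                        # score of a unit Gaussian
discriminator = lambda x, t: torch.sigmoid(x.mean(dim=(1, 2, 3)))
x = torch.randn(4, 3, 32, 32)
print(discriminator_guided_score(score_model, discriminator, x, t=0.5).shape)
```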
Post-training Quantization on Diffusion Models
November 28, 2022
Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, Yan Yan
Denoising diffusion (score-based) generative models have recently achieved
significant accomplishments in generating realistic and diverse data. These
approaches define a forward diffusion process for transforming data into noise
and a backward denoising process for sampling data from noise. Unfortunately,
the generation process of current denoising diffusion models is notoriously
slow due to the lengthy iterative noise estimations, which rely on cumbersome
neural networks. This prevents diffusion models from being widely deployed,
especially on edge devices. Previous works accelerate the generation process of
diffusion models (DMs) by finding shorter yet effective sampling trajectories.
However, they overlook the cost of noise estimation with a heavy network in
every iteration. In this work, we accelerate generation from the perspective of
compressing the noise estimation network. Due to the difficulty of retraining
DMs, we exclude mainstream training-aware compression paradigms and introduce
post-training quantization (PTQ) into DM acceleration. However, the output
distributions of noise estimation networks change with time-step, making
previous PTQ methods fail in DMs since they are designed for single-time step
scenarios. To devise a DM-specific PTQ method, we explore PTQ on DM in three
aspects: quantized operations, calibration dataset, and calibration metric. We
summarize and use several observations derived from all-inclusive
investigations to formulate our method, which especially targets the unique
multi-time-step structure of DMs. Experimentally, our method can directly
quantize full-precision DMs into 8-bit models while maintaining or even
improving their performance in a training-free manner. Importantly, our method
can serve as a plug-and-play module on other fast-sampling methods, e.g., DDIM.
The code is available at https://github.com/42Shawn/PTQ4DM .
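As background for the PTQ idea, the snippet below shows plain per-tensor 8-bit affine quantization of a weight tensor; the paper's contribution lies in the DM-specific calibration (time-step-aware data and metric), which is not reproduced here.
```python
import torch

def quantize_int8(w: torch.Tensor):
    """Plain per-tensor affine quantization to 8 bits (a generic PTQ building
    block, not the paper's calibration procedure). Returns the int8 tensor
    plus the (scale, zero_point) pair needed to dequantize."""
    qmin, qmax = -128, 127
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (qmax - qmin)
    zero_point = qmin - torch.round(w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), qmin, qmax).to(torch.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.to(torch.float32) - zero_point) * scale

w = torch.randn(256, 256)               # e.g. one weight matrix of the denoiser
q, s, z = quantize_int8(w)
err = (dequantize(q, s, z) - w).abs().mean()
print(q.dtype, float(err))              # torch.int8, small mean absolute error
```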
Continuous diffusion for categorical data
November 28, 2022
Sander Dieleman, Laurent Sartran, Arman Roshannai, Nikolay Savinov, Yaroslav Ganin, Pierre H. Richemond, Arnaud Doucet, Robin Strudel, Chris Dyer, Conor Durkan, Curtis Hawthorne, Rémi Leblond, Will Grathwohl, Jonas Adler
Diffusion models have quickly become the go-to paradigm for generative
modelling of perceptual signals (such as images and sound) through iterative
refinement. Their success hinges on the fact that the underlying physical
phenomena are continuous. For inherently discrete and categorical data such as
language, various diffusion-inspired alternatives have been proposed. However,
the continuous nature of diffusion models conveys many benefits, and in this
work we endeavour to preserve it. We propose CDCD, a framework for modelling
categorical data with diffusion models that are continuous both in time and
input space. We demonstrate its efficacy on several language modelling tasks.
DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models
November 28, 2022
Zhengfu He, Tianxiang Sun, Kuanning Wang, Xuanjing Huang, Xipeng Qiu
We present DiffusionBERT, a new generative masked language model based on
discrete diffusion models. Diffusion models and many pre-trained language
models have a shared training objective, i.e., denoising, making it possible to
combine the two powerful models and enjoy the best of both worlds. On the one
hand, diffusion models offer a promising training strategy that helps improve
the generation quality. On the other hand, pre-trained denoising language
models (e.g., BERT) can be used as a good initialization that accelerates
convergence. We explore training BERT to learn the reverse process of a
discrete diffusion process with an absorbing state and elucidate several
designs to improve it. First, we propose a new noise schedule for the forward
diffusion process that controls the degree of noise added at each step based on
the information of each token. Second, we investigate several designs of
incorporating the time step into BERT. Experiments on unconditional text
generation demonstrate that DiffusionBERT achieves significant improvement over
existing diffusion models for text (e.g., D3PM and Diffusion-LM) and previous
generative masked language models in terms of perplexity and BLEU score.
Diffusion Probabilistic Model Made Slim
November 27, 2022
Xingyi Yang, Daquan Zhou, Jiashi Feng, Xinchao Wang
Despite the recent visually-pleasing results achieved, the massive
computational cost has been a long-standing flaw for diffusion probabilistic
models (DPMs), which, in turn, greatly limits their applications on
resource-limited platforms. Prior methods toward efficient DPMs, however, have
largely focused on accelerating sampling while overlooking their huge
complexity and size. In this paper, we make a dedicated attempt to lighten the DPM
while striving to preserve its favourable performance. We start by training a
small-sized latent diffusion model (LDM) from scratch, but observe a
significant fidelity drop in the synthetic images. Through a thorough
assessment, we find that DPM is intrinsically biased against high-frequency
generation, and learns to recover different frequency components at different
time-steps. These properties make compact networks unable to represent
frequency dynamics with accurate high-frequency estimation. Towards this end,
we introduce a customized design for slim DPM, which we term as Spectral
Diffusion (SD), for light-weight image synthesis. SD incorporates wavelet
gating in its architecture to enable frequency-dynamic feature extraction at
every reverse step, and conducts spectrum-aware distillation to promote
high-frequency recovery by inversely weighting the objective based on spectrum
magnitudes. Experimental results demonstrate that SD achieves 8-18x
computational complexity reduction as compared to the latent diffusion models
on a series of conditional and unconditional image generation tasks while
retaining competitive image fidelity.
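One plausible reading of the spectrum-aware distillation objective is an inverse weighting of a frequency-domain distillation loss by the teacher's spectrum magnitude, sketched below with assumed tensor shapes; the exact loss used in the paper may differ.
```python
import torch

def spectrum_weighted_distillation(student_out, teacher_out, eps=1e-6):
    """Sketch of inverse spectrum-magnitude weighting (our reading of the
    'spectrum-aware distillation' idea, not the authors' exact loss).
    Frequency bins where the teacher has small magnitude (typically high
    frequencies) receive larger weight, pushing the student to recover them."""
    s_f = torch.fft.fft2(student_out)
    t_f = torch.fft.fft2(teacher_out)
    weight = 1.0 / (t_f.abs() + eps)
    weight = weight / weight.mean()                  # normalize the weights
    return (weight * (s_f - t_f).abs() ** 2).mean()

student = torch.randn(2, 3, 32, 32, requires_grad=True)
teacher = torch.randn(2, 3, 32, 32)
loss = spectrum_weighted_distillation(student, teacher)
loss.backward()
print(float(loss))
```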
Traditional Classification Neural Networks are Good Generators: They are Competitive with DDPMs and GANs
November 27, 2022
Guangrun Wang, Philip H. S. Torr
cs.CV, cs.AI, cs.LG, cs.MM, stat.ML
Classifiers and generators have long been separated. We break down this
separation and showcase that conventional neural network classifiers can
generate high-quality images of a large number of categories, being comparable
to the state-of-the-art generative models (e.g., DDPMs and GANs). We achieve
this by computing the partial derivative of the classification loss function
with respect to the input and using it to optimize the input into an image.
Since directly optimizing the input is widely known to resemble targeted
adversarial attacks, which cannot generate human-meaningful images, we propose
a mask-based stochastic reconstruction module to make the gradients
semantic-aware to synthesize plausible images. We further propose a
progressive-resolution technique to guarantee fidelity, which produces
photorealistic images. Furthermore, we introduce a distance metric loss and a
non-trivial distribution loss to ensure classification neural networks can
synthesize diverse and high-fidelity images. Using traditional neural network
classifiers, we can generate good-quality images of 256$\times$256 resolution
on ImageNet. Intriguingly, our method is also applicable to text-to-image
generation by regarding image-text foundation models as generalized
classifiers.
Proving that classifiers have learned the data distribution and are ready for
image generation has far-reaching implications, for classifiers are much easier
to train than generative models like DDPMs and GANs. We do not even need to
train classification models, since many pre-trained public ones are available
for download. This also holds great potential for the interpretability and
robustness of classifiers. Project page is at
\url{https://classifier-as-generator.github.io/}.
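The starting point the paper builds on can be sketched as plain gradient ascent on the input against a target-class logit; as the abstract notes, without the mask-based stochastic reconstruction and progressive-resolution components this alone tends to produce adversarial noise rather than meaningful images. The classifier interface below is a stand-in.
```python
import torch

def synthesize_from_classifier(classifier, target_class, shape=(1, 3, 64, 64),
                               steps=200, lr=0.05):
    """Bare-bones input optimization against a classifier (the baseline the
    paper builds on; its masking and progressive-resolution modules are not
    reproduced here)."""
    x = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = classifier(x)
        loss = -logits[:, target_class].sum()   # ascend the target-class logit
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(-1, 1)                     # keep pixels in a valid range
    return x.detach()

# Toy usage with a stand-in classifier:
classifier = torch.nn.Sequential(torch.nn.Flatten(),
                                 torch.nn.Linear(3 * 64 * 64, 10))
img = synthesize_from_classifier(classifier, target_class=3)
print(img.shape)
```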
BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction
November 25, 2022
German Barquero, Sergio Escalera, Cristina Palmero
Stochastic human motion prediction (HMP) has generally been tackled with
generative adversarial networks and variational autoencoders. Most prior works
aim at predicting highly diverse movements in terms of the skeleton joints’
dispersion. This has led to methods predicting fast and motion-divergent
movements, which are often unrealistic and incoherent with past motion. Such
methods also neglect contexts that need to anticipate diverse low-range
behaviors, or actions, with subtle joint displacements. To address these
issues, we present BeLFusion, a model that, for the first time, leverages
latent diffusion models in HMP to sample from a latent space where behavior is
disentangled from pose and motion. As a result, diversity is encouraged from a
behavioral perspective. Thanks to our behavior coupler’s ability to transfer
sampled behavior to ongoing motion, BeLFusion’s predictions display a variety
of behaviors that are significantly more realistic than the state of the art.
To support it, we introduce two metrics, the Area of the Cumulative Motion
Distribution, and the Average Pairwise Distance Error, which are correlated to
our definition of realism according to a qualitative study with 126
participants. Finally, we prove BeLFusion’s generalization power in a new
cross-dataset scenario for stochastic HMP.
Latent Space Diffusion Models of Cryo-EM Structures
November 25, 2022
Karsten Kreis, Tim Dockhorn, Zihao Li, Ellen Zhong
Cryo-electron microscopy (cryo-EM) is unique among tools in structural
biology in its ability to image large, dynamic protein complexes. Key to this
ability are image processing algorithms for heterogeneous cryo-EM
reconstruction, including recent deep learning-based approaches. The
state-of-the-art method cryoDRGN uses a Variational Autoencoder (VAE) framework
to learn a continuous distribution of protein structures from single particle
cryo-EM imaging data. While cryoDRGN can model complex structural motions, the
Gaussian prior distribution of the VAE fails to match the aggregate approximate
posterior, which prevents generative sampling of structures especially for
multi-modal distributions (e.g. compositional heterogeneity). Here, we train a
diffusion model as an expressive, learnable prior in the cryoDRGN framework.
Our approach learns a high-quality generative model over molecular
conformations directly from cryo-EM imaging data. We show the ability to sample
from the model on two synthetic and two real datasets, where samples accurately
follow the data distribution unlike samples from the VAE prior distribution. We
also demonstrate how the diffusion model prior can be leveraged for fast latent
space traversal and interpolation between states of interest. By learning an
accurate model of the data distribution, our method unlocks tools in generative
modeling, sampling, and distribution analysis for heterogeneous cryo-EM
ensembles.
3DDesigner: Towards Photorealistic 3D Object Generation and Editing with Text-guided Diffusion Models
November 25, 2022
Gang Li, Heliang Zheng, Chaoyue Wang, Chang Li, Changwen Zheng, Dacheng Tao
Text-guided diffusion models have shown superior performance in image/video
generation and editing, while few explorations have been performed in 3D
scenarios. In this paper, we discuss three fundamental and interesting problems
on this topic. First, we equip text-guided diffusion models to achieve
3D-consistent generation. Specifically, we integrate a NeRF-like neural field
to generate low-resolution coarse results for a given camera view. Such results
can provide 3D priors as condition information for the following diffusion
process. During denoising diffusion, we further enhance the 3D consistency by
modeling cross-view correspondences with a novel two-stream (corresponding to
two different views) asynchronous diffusion process. Second, we study 3D local
editing and propose a two-step solution that can generate 360-degree
manipulated results by editing an object from a single view. Step 1, we propose
to perform 2D local editing by blending the predicted noises. Step 2, we
conduct a noise-to-text inversion process that maps 2D blended noises into the
view-independent text embedding space. Once the corresponding text embedding is
obtained, 360-degree images can be generated. Last but not least, we extend our
model to perform one-shot novel view synthesis by fine-tuning on a single
image, firstly showing the potential of leveraging text guidance for novel view
synthesis. Extensive experiments and various applications show the prowess of
our 3DDesigner. The project page is available at
https://3ddesigner-diffusion.github.io/.
DiffusionSDF: Conditional Generative Modeling of Signed Distance Functions
November 24, 2022
Gene Chou, Yuval Bahat, Felix Heide
Probabilistic diffusion models have achieved state-of-the-art results for
image synthesis, inpainting, and text-to-image tasks. However, they are still
in the early stages of generating complex 3D shapes. This work proposes
Diffusion-SDF, a generative model for shape completion, single-view
reconstruction, and reconstruction of real-scanned point clouds. We use neural
signed distance functions (SDFs) as our 3D representation to parameterize the
geometry of various signals (e.g., point clouds, 2D images) through neural
networks. Neural SDFs are implicit functions and diffusing them amounts to
learning the reversal of their neural network weights, which we solve using a
custom modulation module. Extensive experiments show that our method is capable
of both realistic unconditional generation and conditional generation from
partial inputs. This work expands the domain of diffusion models from learning
2D, explicit representations, to 3D, implicit representations.
Sketch-Guided Text-to-Image Diffusion Models
November 24, 2022
Andrey Voynov, Kfir Aberman, Daniel Cohen-Or
Text-to-Image models have introduced a remarkable leap in the evolution of
machine learning, demonstrating high-quality synthesis of images from a given
text-prompt. However, these powerful pretrained models still lack control
handles that can guide spatial properties of the synthesized images. In this
work, we introduce a universal approach to guide a pretrained text-to-image
diffusion model, with a spatial map from another domain (e.g., sketch) during
inference time. Unlike previous works, our method does not require training a
dedicated model or a specialized encoder for the task. Our key idea is to train
a Latent Guidance Predictor (LGP) - a small, per-pixel, Multi-Layer Perceptron
(MLP) that maps latent features of noisy images to spatial maps, where the deep
features are extracted from the core Denoising Diffusion Probabilistic Model
(DDPM) network. The LGP is trained only on a few thousand images and
constitutes a differentiable guiding map predictor, over which the loss is
computed and propagated back to push the intermediate images to agree with the
spatial map. The per-pixel training offers flexibility and locality which
allows the technique to perform well on out-of-domain sketches, including
free-hand style drawings. We take a particular focus on the sketch-to-image
translation task, revealing a robust and expressive way to generate images that
follow the guidance of a sketch of arbitrary style or domain. Project page:
sketch-guided-diffusion.github.io
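A per-pixel MLP such as the LGP can be written as a stack of 1x1 convolutions; the sketch below (channel sizes and the guidance loss are assumptions) shows how a predicted edge map could be compared against a target sketch to obtain a gradient for steering sampling.
```python
import torch
import torch.nn as nn

class LatentGuidancePredictor(nn.Module):
    """Per-pixel MLP implemented with 1x1 convolutions, mapping deep denoiser
    features to a spatial edge map. Channel sizes are assumptions."""
    def __init__(self, in_channels=512, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, kernel_size=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=1))

    def forward(self, feats):
        return self.net(feats)            # [B, 1, H, W] predicted edge map

# Guidance-step sketch: pull intermediate features toward a target sketch.
lgp = LatentGuidancePredictor()
feats = torch.randn(1, 512, 64, 64, requires_grad=True)  # denoiser features
target_sketch = (torch.rand(1, 1, 64, 64) > 0.9).float()
loss = ((lgp(feats) - target_sketch) ** 2).mean()
grad = torch.autograd.grad(loss, feats)[0]  # used to nudge the sampling path
print(grad.shape)
```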
Fast Sampling of Diffusion Models via Operator Learning
November 24, 2022
Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, Anima Anandkumar
Diffusion models have found widespread adoption in various areas. However,
their sampling process is slow because it requires hundreds to thousands of
network evaluations to emulate a continuous process defined by differential
equations. In this work, we use neural operators, an efficient method to solve
the probability flow differential equations, to accelerate the sampling process
of diffusion models. Compared to other fast sampling methods that have a
sequential nature, we are the first to propose a parallel decoding method that
generates images with only one model forward pass. We propose diffusion model
sampling with neural operator (DSNO) that maps the initial condition, i.e.,
Gaussian distribution, to the continuous-time solution trajectory of the
reverse diffusion process. To model the temporal correlations along the
trajectory, we introduce temporal convolution layers that are parameterized in
the Fourier space into the given diffusion model backbone. We show our method
achieves state-of-the-art FID of 3.78 for CIFAR-10 and 7.83 for ImageNet-64 in
the one-model-evaluation setting.
Shifted Diffusion for Text-to-image Generation
November 24, 2022
Yufan Zhou, Bingchen Liu, Yizhe Zhu, Xiao Yang, Changyou Chen, Jinhui Xu
We present Corgi, a novel method for text-to-image generation. Corgi is based
on our proposed shifted diffusion model, which achieves better image embedding
generation from input text. Unlike the baseline diffusion model used in DALL-E
2, our method seamlessly encodes prior knowledge of the pre-trained CLIP model
in its diffusion process by designing a new initialization distribution and a
new transition step of the diffusion. Compared to the strong DALL-E 2 baseline,
our method performs better in generating image embedding from the text in terms
of both efficiency and effectiveness, resulting in better text-to-image
generation. Extensive large-scale experiments are conducted and evaluated in
terms of both quantitative measures and human evaluation, indicating a stronger
generation ability of our method compared to existing ones. Furthermore, our
model enables semi-supervised and language-free training for text-to-image
generation, where only part or none of the images in the training dataset have
an associated caption. Trained with only 1.7% of the images being captioned,
our semi-supervised model obtains FID results comparable to DALL-E 2 on
zero-shot text-to-image generation evaluated on MS-COCO. Corgi also achieves
new state-of-the-art results across different datasets on downstream
language-free text-to-image generation tasks, outperforming the previous
method, Lafite, by a large margin.
Improving dermatology classifiers across populations using images generated by large diffusion models
November 23, 2022
Luke W. Sagers, James A. Diao, Matthew Groh, Pranav Rajpurkar, Adewole S. Adamson, Arjun K. Manrai
Dermatological classification algorithms developed without sufficiently
diverse training data may generalize poorly across populations. While
intentional data collection and annotation offer the best means for improving
representation, new computational approaches for generating training data may
also aid in mitigating the effects of sampling bias. In this paper, we show
that DALL$\cdot$E 2, a large-scale text-to-image diffusion model, can produce
photorealistic images of skin disease across skin types. Using the Fitzpatrick
17k dataset as a benchmark, we demonstrate that augmenting training data with
DALL$\cdot$E 2-generated synthetic images improves classification of skin
disease overall and especially for underrepresented groups.
HouseDiffusion: Vector Floorplan Generation via a Diffusion Model with Discrete and Continuous Denoising
November 23, 2022
Mohammad Amin Shabani, Sepidehsadat Hosseini, Yasutaka Furukawa
The paper presents a novel approach for vector-floorplan generation via a
diffusion model, which denoises 2D coordinates of room/door corners with two
inference objectives: 1) a single-step noise as the continuous quantity to
precisely invert the continuous forward process; and 2) the final 2D coordinate
as the discrete quantity to establish geometric incident relationships such as
parallelism, orthogonality, and corner-sharing. Our task is graph-conditioned
floorplan generation, a common workflow in floorplan design. We represent a
floorplan as 1D polygonal loops, each of which corresponds to a room or a door.
Our diffusion model employs a Transformer architecture at the core, which
controls the attention masks based on the input graph-constraint and directly
generates vector-graphics floorplans via a discrete and continuous denoising
process. We have evaluated our approach on the RPLAN dataset. The proposed
approach improves upon the state-of-the-art in all metrics by significant
margins, while being capable of generating non-Manhattan
structures and controlling the exact number of corners per room. A project
website with supplementary video and document is here
https://aminshabani.github.io/housediffusion.
Paint by Example: Exemplar-based Image Editing with Diffusion Models
November 23, 2022
Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen
Language-guided image editing has achieved great success recently. In this
paper, for the first time, we investigate exemplar-guided image editing for
more precise control. We achieve this goal by leveraging self-supervised
training to disentangle and re-organize the source image and the exemplar.
However, the naive approach will cause obvious fusing artifacts. We carefully
analyze it and propose an information bottleneck and strong augmentations to
avoid the trivial solution of directly copying and pasting the exemplar image.
Meanwhile, to ensure the controllability of the editing process, we design an
arbitrary shape mask for the exemplar image and leverage the classifier-free
guidance to increase the similarity to the exemplar image. The whole framework
involves a single forward of the diffusion model without any iterative
optimization. We demonstrate that our method achieves an impressive performance
and enables controllable editing on in-the-wild images with high fidelity.
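The classifier-free guidance combination mentioned above is standard and compact; the sketch below conditions on an exemplar embedding, with all interfaces assumed for illustration rather than taken from the paper's code.
```python
import torch

def classifier_free_guidance(eps_model, x_t, t, exemplar_emb, null_emb, scale=5.0):
    """Standard classifier-free guidance combination, here conditioned on an
    exemplar embedding (interfaces are assumptions). A larger `scale` pushes
    the sample closer to the exemplar's appearance."""
    eps_cond = eps_model(x_t, t, exemplar_emb)
    eps_uncond = eps_model(x_t, t, null_emb)
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Toy usage with a stand-in noise predictor:
eps_model = lambda x, t, c: x * 0.1 + c.mean()
x_t = torch.randn(1, 4, 64, 64)
eps = classifier_free_guidance(eps_model, x_t, t=10,
                               exemplar_emb=torch.randn(1, 768),
                               null_emb=torch.zeros(1, 768))
print(eps.shape)
```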
Tetrahedral Diffusion Models for 3D Shape Generation
November 23, 2022
Nikolai Kalischek, Torben Peters, Jan D. Wegner, Konrad Schindler
Probabilistic denoising diffusion models (DDMs) have set a new standard for
2D image generation. Extending DDMs for 3D content creation is an active field
of research. Here, we propose TetraDiffusion, a diffusion model that operates
on a tetrahedral partitioning of 3D space to enable efficient, high-resolution
3D shape generation. Our model introduces operators for convolution and
transpose convolution that act directly on the tetrahedral partition, and
seamlessly includes additional attributes such as color. Remarkably,
TetraDiffusion enables rapid sampling of detailed 3D objects in nearly
real-time with unprecedented resolution. It is also adaptable to generating 3D
shapes conditioned on 2D images. Compared to existing 3D mesh diffusion
techniques, our method is up to 200 times faster in inference speed, works on
standard consumer hardware, and delivers superior results.
Inversion-Based Creativity Transfer with Diffusion Models
November 23, 2022
Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, Changsheng Xu
The artistic style within a painting is the means of expression, which
includes not only the painting material, colors, and brushstrokes, but also the
high-level attributes including semantic elements, object shapes, etc. Previous
arbitrary example-guided artistic image generation methods often fail to
control shape changes or convey elements. The pre-trained text-to-image
synthesis diffusion probabilistic models have achieved remarkable quality, but
they often require extensive textual descriptions to accurately portray
attributes of a particular painting. We believe that the uniqueness of an
artwork lies precisely in the fact that it cannot be adequately explained with
normal language. Our key idea is to learn artistic style directly from a single
painting and then guide the synthesis without providing complex textual
descriptions. Specifically, we assume style as a learnable textual description
of a painting. We propose an inversion-based style transfer method (InST),
which can efficiently and accurately learn the key information of an image,
thus capturing and transferring the artistic style of a painting. We
demonstrate the quality and efficiency of our method on numerous paintings of
various artists and styles. Code and models are available at
https://github.com/zyxElsa/InST.
Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation
November 22, 2022
Narek Tumanyan, Michal Geyer, Shai Bagon, Tali Dekel
Large-scale text-to-image generative models have been a revolutionary
breakthrough in the evolution of generative AI, allowing us to synthesize
diverse images that convey highly complex visual concepts. However, a pivotal
challenge in leveraging such models for real-world content creation tasks is
providing users with control over the generated content. In this paper, we
present a new framework that takes text-to-image synthesis to the realm of
image-to-image translation – given a guidance image and a target text prompt,
our method harnesses the power of a pre-trained text-to-image diffusion model
to generate a new image that complies with the target text, while preserving
the semantic layout of the source image. Specifically, we observe and
empirically demonstrate that fine-grained control over the generated structure
can be achieved by manipulating spatial features and their self-attention
inside the model. This results in a simple and effective approach, where
features extracted from the guidance image are directly injected into the
generation process of the target image, requiring no training or fine-tuning
and applicable for both real or generated guidance images. We demonstrate
high-quality results on versatile text-guided image translation tasks,
including translating sketches, rough drawings and animations into realistic
images, changing of the class and appearance of objects in a given image, and
modifications of global qualities such as lighting and color.
Person Image Synthesis via Denoising Diffusion Model
November 22, 2022
Ankan Kumar Bhunia, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer, Jorma Laaksonen, Mubarak Shah, Fahad Shahbaz Khan
The pose-guided person image generation task requires synthesizing
photorealistic images of humans in arbitrary poses. The existing approaches use
generative adversarial networks that do not necessarily maintain realistic
textures or need dense correspondences that struggle to handle complex
deformations and severe occlusions. In this work, we show how denoising
diffusion models can be applied for high-fidelity person image synthesis with
strong sample diversity and enhanced mode coverage of the learnt data
distribution. Our proposed Person Image Diffusion Model (PIDM) disintegrates
the complex transfer problem into a series of simpler forward-backward
denoising steps. This helps in learning plausible source-to-target
transformation trajectories that result in faithful textures and undistorted
appearance details. We introduce a ‘texture diffusion module’ based on
cross-attention to accurately model the correspondences between appearance and
pose information available in source and target images. Further, we propose
‘disentangled classifier-free guidance’ to ensure close resemblance between the
conditional inputs and the synthesized output in terms of both pose and
appearance information. Our extensive results on two large-scale benchmarks and
a user study demonstrate the photorealism of our proposed approach under
challenging scenarios. We also show how our generated images can help in
downstream tasks. Our code and models will be publicly released.
EDICT: Exact Diffusion Inversion via Coupled Transformations
November 22, 2022
Bram Wallace, Akash Gokul, Nikhil Naik
Finding an initial noise vector that produces an input image when fed into
the diffusion process (known as inversion) is an important problem in denoising
diffusion models (DDMs), with applications for real image editing. The
state-of-the-art approach for real image editing with inversion uses denoising
diffusion implicit models (DDIMs) to deterministically noise the image to the
intermediate state along the path that the denoising would follow given the
original conditioning. However, DDIM inversion for real images is unstable as
it relies on local linearization assumptions, which result in the propagation
of errors, leading to incorrect image reconstruction and loss of content. To
alleviate these problems, we propose Exact Diffusion Inversion via Coupled
Transformations (EDICT), an inversion method that draws inspiration from affine
coupling layers. EDICT enables mathematically exact inversion of real and
model-generated images by maintaining two coupled noise vectors which are used
to invert each other in an alternating fashion. Using Stable Diffusion, a
state-of-the-art latent diffusion model, we demonstrate that EDICT successfully
reconstructs real images with high fidelity. On complex image datasets like
MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the
mean square error of reconstruction by a factor of two. Using noise vectors
inverted from real images, EDICT enables a wide range of image edits–from
local and global semantic edits to image stylization–while maintaining
fidelity to the original image structure. EDICT requires no model
training/finetuning, prompt tuning, or extra data and can be combined with any
pretrained DDM. Code is available at https://github.com/salesforce/EDICT.
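The key property, exact invertibility of a coupled pair of updates, can be illustrated with a simplified affine-coupling-style step; this is not EDICT's exact update rule, only a demonstration of why alternating updates of two coupled variables can be inverted algebraically without linearization.
```python
import torch

def coupled_step(x, y, f, a, b):
    """One exactly invertible coupled update in the spirit of affine coupling
    (a simplified sketch, not EDICT's exact update rule). `f` stands in for a
    denoiser evaluation; a and b are scalar schedule coefficients."""
    x_new = a * x + b * f(y)      # x is updated using only y ...
    y_new = a * y + b * f(x_new)  # ... then y using the already-updated x
    return x_new, y_new

def coupled_step_inverse(x_new, y_new, f, a, b):
    """Algebraic inverse of coupled_step; no approximation is needed."""
    y = (y_new - b * f(x_new)) / a
    x = (x_new - b * f(y)) / a
    return x, y

f = lambda z: torch.tanh(z)                 # stand-in for the denoiser
x0, y0 = torch.randn(4, 8), torch.randn(4, 8)
x1, y1 = coupled_step(x0, y0, f, a=0.98, b=0.1)
xr, yr = coupled_step_inverse(x1, y1, f, a=0.98, b=0.1)
print(torch.allclose(xr, x0, atol=1e-6), torch.allclose(yr, y0, atol=1e-6))
```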
SinDiffusion: Learning a Diffusion Model from a Single Natural Image
November 22, 2022
Weilun Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, Houqiang Li
We present SinDiffusion, leveraging denoising diffusion models to capture
internal distribution of patches from a single natural image. SinDiffusion
significantly improves the quality and diversity of generated samples compared
with existing GAN-based approaches. It is based on two core designs. First,
SinDiffusion is trained with a single model at a single scale instead of
multiple models with progressive growing of scales which serves as the default
setting in prior work. This avoids the accumulation of errors, which cause
characteristic artifacts in generated results. Second, we identify that a
patch-level receptive field of the diffusion network is crucial and effective
for capturing the image’s patch statistics, therefore we redesign the network
structure of the diffusion model. Coupling these two designs enables us to
generate photorealistic and diverse images from a single image. Furthermore,
SinDiffusion can be applied to various applications, e.g., text-guided image
generation and image outpainting, due to the inherent capability of diffusion
models. Extensive experiments on a wide range of images demonstrate the
superiority of our proposed method for modeling the patch distribution.
Can denoising diffusion probabilistic models generate realistic astrophysical fields?
November 22, 2022
Nayantara Mudur, Douglas P. Finkbeiner
astro-ph.CO, astro-ph.GA, astro-ph.IM, cs.LG
Score-based generative models have emerged as alternatives to generative
adversarial networks (GANs) and normalizing flows for tasks involving learning
and sampling from complex image distributions. In this work we investigate the
ability of these models to generate fields in two astrophysical contexts: dark
matter mass density fields from cosmological simulations and images of
interstellar dust. We examine the fidelity of the sampled cosmological fields
relative to the true fields using three different metrics, and identify
potential issues to address. We demonstrate a proof-of-concept application of
the model trained on dust in denoising dust images. To our knowledge, this is
the first application of this class of models to the interstellar medium.
DOLCE: A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction
November 22, 2022
Jiaming Liu, Rushil Anirudh, Jayaraman J. Thiagarajan, Stewart He, K. Aditya Mohan, Ulugbek S. Kamilov, Hyojin Kim
Limited-Angle Computed Tomography (LACT) is a non-destructive evaluation
technique used in a variety of applications ranging from security to medicine.
The limited angle coverage in LACT is often a dominant source of severe
artifacts in the reconstructed images, making it a challenging inverse problem.
We present DOLCE, a new deep model-based framework for LACT that uses a
conditional diffusion model as an image prior. Diffusion models are a recent
class of deep generative models that are relatively easy to train due to their
implementation as image denoisers. DOLCE can form high-quality images from
severely under-sampled data by integrating data-consistency updates with the
sampling updates of a diffusion model, which is conditioned on the transformed
limited-angle data. We show through extensive experimentation on several
challenging real LACT datasets that the same pre-trained DOLCE model achieves
SOTA performance on drastically different types of images. Additionally, we
show that, unlike standard LACT reconstruction methods, DOLCE naturally enables
the quantification of the reconstruction uncertainty by generating multiple
samples consistent with the measured data.
DiffDreamer: Consistent Single-view Perpetual View Generation with Conditional Diffusion Models
November 22, 2022
Shengqu Cai, Eric Ryan Chan, Songyou Peng, Mohamad Shahbazi, Anton Obukhov, Luc Van Gool, Gordon Wetzstein
Scene extrapolation – the idea of generating novel views by flying into a
given image – is a promising, yet challenging task. For each predicted frame,
a joint inpainting and 3D refinement problem has to be solved, which is
ill-posed and includes a high level of ambiguity. Moreover, training data for
long-range scenes is difficult to obtain and usually lacks sufficient views to
infer accurate camera poses. We introduce DiffDreamer, an unsupervised
framework capable of synthesizing novel views depicting a long camera
trajectory while training solely on internet-collected images of nature scenes.
Utilizing the stochastic nature of the guided denoising steps, we train the
diffusion models to refine projected RGBD images but condition the denoising
steps on multiple past and future frames for inference. We demonstrate that
image-conditioned diffusion models can effectively perform long-range scene
extrapolation while preserving consistency significantly better than prior
GAN-based methods. DiffDreamer is a powerful and efficient solution for scene
extrapolation, producing impressive results despite limited supervision.
Project page: https://primecai.github.io/diffdreamer.
SinFusion: Training Diffusion Models on a Single Image or Video
November 21, 2022
Yaniv Nikankin, Niv Haim, Michal Irani
Diffusion models exhibited tremendous progress in image and video generation,
exceeding GANs in quality and diversity. However, they are usually trained on
very large datasets and are not naturally adapted to manipulate a given input
image or video. In this paper we show how this can be resolved by training a
diffusion model on a single input image or video. Our image/video-specific
diffusion model (SinFusion) learns the appearance and dynamics of the single
image or video, while utilizing the conditioning capabilities of diffusion
models. It can solve a wide array of image/video-specific manipulation tasks.
In particular, our model can learn from few frames the motion and dynamics of a
single input video. It can then generate diverse new video samples of the same
dynamic scene, extrapolate short videos into long ones (both forward and
backward in time) and perform video upsampling. Most of these tasks are not
realizable by current video-specific generation methods.
VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models
November 21, 2022
Ajay Jain, Amber Xie, Pieter Abbeel
cs.CV, cs.AI, cs.GR, cs.LG
Diffusion models have shown impressive results in text-to-image synthesis.
Using massive datasets of captioned images, diffusion models learn to generate
raster images of highly diverse objects and scenes. However, designers
frequently use vector representations of images like Scalable Vector Graphics
(SVGs) for digital icons or art. Vector graphics can be scaled to any size, and
are compact. We show that a text-conditioned diffusion model trained on pixel
representations of images can be used to generate SVG-exportable vector
graphics. We do so without access to large datasets of captioned SVGs. By
optimizing a differentiable vector graphics rasterizer, our method,
VectorFusion, distills abstract semantic knowledge out of a pretrained
diffusion model. Inspired by recent text-to-3D work, we learn an SVG consistent
with a caption using Score Distillation Sampling. To accelerate generation and
improve fidelity, VectorFusion also initializes from an image sample.
Experiments show greater quality than prior work, and demonstrate a range of
styles including pixel art and sketches. See our project webpage at
https://ajayj.com/vectorfusion .
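Score Distillation Sampling, which VectorFusion adapts to vector graphics, can be sketched as follows; the rasterizer and noise-prediction interfaces are stand-ins and the coefficients are simplified, so this is an illustration of the mechanism rather than the paper's implementation.
```python
import torch

def sds_update(params, render, eps_model, t, alpha_bar_t, guidance_w=1.0):
    """One Score Distillation Sampling step (sketch; interfaces are assumed).
    `render(params)` is a differentiable rasterizer producing an image from
    vector-graphics parameters; `eps_model(x_t, t)` is a pre-trained diffusion
    model's noise prediction. The gradient w(t) * (eps_hat - eps) flows only
    through the rasterizer; the diffusion model is never backpropagated."""
    img = render(params)
    eps = torch.randn_like(img)
    x_t = alpha_bar_t.sqrt() * img + (1 - alpha_bar_t).sqrt() * eps
    with torch.no_grad():
        eps_hat = eps_model(x_t, t)
    grad = guidance_w * (eps_hat - eps)
    # Equivalent to differentiating 0.5 * ||img - (img - grad).detach()||^2.
    loss = (grad.detach() * img).sum()
    loss.backward()

# Toy usage with stand-ins:
params = torch.randn(16, requires_grad=True)           # "SVG" parameters
render = lambda p: torch.tanh(p).view(1, 1, 4, 4)      # fake rasterizer
eps_model = lambda x, t: torch.zeros_like(x)           # fake denoiser
sds_update(params, render, eps_model, t=0.5, alpha_bar_t=torch.tensor(0.5))
print(params.grad.shape)
```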
Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection
November 21, 2022
Luping Liu, Yi Ren, Xize Cheng, Rongjie Huang, Chongxuan Li, Zhou Zhao
cs.CV, cs.AI, cs.LG, stat.ML
Out-of-distribution (OOD) detection is a crucial task for ensuring the
reliability and safety of deep learning. Currently, discriminator models
outperform other methods in this regard. However, the feature extraction
process used by discriminator models suffers from the loss of critical
information, leaving room for bad cases and malicious attacks. In this paper,
we introduce a new perceptron bias assumption that suggests discriminator
models are more sensitive to certain features of the input, leading to the
overconfidence problem. To address this issue, we propose a novel framework
that combines discriminator and generation models and integrates diffusion
models (DMs) into OOD detection. We demonstrate that the diffusion denoising
process (DDP) of DMs serves as a novel form of asymmetric interpolation, which
is well-suited to enhance the input and mitigate the overconfidence problem.
The discriminator model features of OOD data exhibit sharp changes under DDP,
and we utilize the norm of this change as the indicator score. Our experiments
on CIFAR10, CIFAR100, and ImageNet show that our method outperforms SOTA
approaches. Notably, for the challenging InD ImageNet and OOD species datasets,
our method achieves an AUROC of 85.7, surpassing the previous SOTA method’s
score of 77.4. Our implementation is available at
\url{https://github.com/luping-liu/DiffOOD}.
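The indicator score itself is simple to express: denoise the input and measure how far the discriminator model's features move. The callables below are stand-ins for a real feature extractor and a diffusion noise-and-denoise pass.
```python
import torch

def ddp_indicator_score(feature_extractor, denoise, x):
    """Sketch of the indicator described above: run a (partial) diffusion
    denoising pass over the input and measure how much the discriminator
    model's features move. `feature_extractor` and `denoise` are assumed
    callables; larger scores suggest out-of-distribution inputs."""
    with torch.no_grad():
        f_before = feature_extractor(x)
        f_after = feature_extractor(denoise(x))
    return (f_after - f_before).flatten(1).norm(dim=1)   # one score per image

# Toy usage with stand-ins:
feature_extractor = torch.nn.Sequential(torch.nn.Flatten(),
                                        torch.nn.Linear(3 * 32 * 32, 64))
denoise = lambda x: 0.9 * x                 # placeholder for noise-and-denoise
x = torch.randn(8, 3, 32, 32)
print(ddp_indicator_score(feature_extractor, denoise, x).shape)  # [8]
```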
Investigating Prompt Engineering in Diffusion Models
November 21, 2022
Sam Witteveen, Martin Andrews
With the spreading use of Text2Img diffusion models such as DALL-E 2,
Imagen, Midjourney, and Stable Diffusion, one challenge that artists face is
selecting the right prompts to achieve the desired artistic output. We present
techniques for measuring the effect that specific words and phrases in prompts
have, and (in the Appendix) present guidance on the selection of prompts to
produce desired effects.
Diffusion-Based Scene Graph to Image Generation with Masked Contrastive Pre-Training
November 21, 2022
Ling Yang, Zhilin Huang, Yang Song, Shenda Hong, Guohao Li, Wentao Zhang, Bin Cui, Bernard Ghanem, Ming-Hsuan Yang
Generating images from graph-structured inputs, such as scene graphs, is
uniquely challenging due to the difficulty of aligning nodes and connections in
graphs with objects and their relations in images. Most existing methods
address this challenge by using scene layouts, which are image-like
representations of scene graphs designed to capture the coarse structures of
scene images. Because scene layouts are manually crafted, the alignment with
images may not be fully optimized, causing suboptimal compliance between the
generated images and the original scene graphs. To tackle this issue, we
propose to learn scene graph embeddings by directly optimizing their alignment
with images. Specifically, we pre-train an encoder to extract both global and
local information from scene graphs that are predictive of the corresponding
images, relying on two loss functions: masked autoencoding loss and contrastive
loss. The former trains embeddings by reconstructing randomly masked image
regions, while the latter trains embeddings to discriminate between compliant
and non-compliant images according to the scene graph. Given these embeddings,
we build a latent diffusion model to generate images from scene graphs. The
resulting method, called SGDiff, allows for the semantic manipulation of
generated images by modifying scene graph nodes and connections. On the Visual
Genome and COCO-Stuff datasets, we demonstrate that SGDiff outperforms
state-of-the-art methods, as measured by both the Inception Score and Fréchet
Inception Distance (FID) metrics. We will release our source code and trained
models at https://github.com/YangLing0818/SGDiff.
MagicVideo: Efficient Video Generation With Latent Diffusion Models
November 20, 2022
Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, Jiashi Feng
We present an efficient text-to-video generation framework based on latent
diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips
that are concordant with the given text descriptions. Due to a novel and
efficient 3D U-Net design and modeling video distributions in a low-dimensional
space, MagicVideo can synthesize video clips with 256x256 spatial resolution on
a single GPU card, which takes around 64x fewer computations than the Video
Diffusion Models (VDM) in terms of FLOPs. Specifically, unlike existing works
that directly train video models in the RGB space, we use a pre-trained VAE to
map video clips into a low-dimensional latent space and learn the distribution
of videos’ latent codes via a diffusion model. Besides, we introduce two new
designs to adapt the U-Net denoiser trained on image tasks to video data: a
frame-wise lightweight adaptor for the image-to-video distribution adjustment
and a directed temporal attention module to capture temporal dependencies
across frames. Thus, we can exploit the informative weights of convolution
operators from a text-to-image model for accelerating video training. To
ameliorate the pixel dithering in the generated videos, we also propose a novel
VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive
experiments and demonstrate that MagicVideo can generate high-quality video
clips with either realistic or imaginary content. Refer to
\url{https://magicvideo.github.io/#} for more examples.
IC3D: Image-Conditioned 3D Diffusion for Shape Generation
November 20, 2022
Cristian Sbrolli, Paolo Cudrano, Matteo Frosi, Matteo Matteucci
In recent years, Denoising Diffusion Probabilistic Models (DDPMs) have
demonstrated exceptional performance in various 2D generative tasks. Following
this success, DDPMs have been extended to 3D shape generation, surpassing
previous methodologies in this domain. While many of these models are
unconditional, some have explored the potential of using guidance from
different modalities. In particular, image guidance for 3D generation has been
explored through the utilization of CLIP embeddings. However, these embeddings
are designed to align images and text, and do not necessarily capture the
specific details needed for shape generation. To address this limitation and
enhance image-guided 3D DDPMs with augmented 3D understanding, we introduce
CISP (Contrastive Image-Shape Pre-training), obtaining a well-structured
image-shape joint embedding space. Building upon CISP, we then introduce IC3D,
a DDPM that harnesses CISP’s guidance for 3D shape generation from single-view
images. This generative diffusion model outperforms existing benchmarks in both
quality and diversity of generated 3D shapes. Moreover, despite IC3D’s
generative nature, its generated shapes are preferred by human evaluators over
a competitive single-view 3D reconstruction model. These properties contribute
to a coherent embedding space, enabling latent interpolation and conditioned
generation also from out-of-distribution images. We find IC3D able to generate
coherent and diverse completions also when presented with occluded views,
rendering it applicable in controlled real-world scenarios.
Diffusion Model Based Posterior Sampling for Noisy Linear Inverse Problems
November 20, 2022
Xiangming Meng, Yoshiyuki Kabashima
cs.LG, cs.CV, cs.IT, math.IT, stat.ML
We consider the ubiquitous linear inverse problems with additive Gaussian
noise and propose an unsupervised sampling approach called diffusion model
based posterior sampling (DMPS) to reconstruct the unknown signal from noisy
linear measurements. Specifically, using one diffusion model (DM) as an
implicit prior, the fundamental difficulty in performing posterior sampling is
that the noise-perturbed likelihood score, i.e., gradient of an annealed
likelihood function, is intractable. To circumvent this problem, we introduce a
simple yet effective closed-form approximation of it using an uninformative
prior assumption. Extensive experiments are conducted on a variety of noisy
linear inverse problems such as noisy super-resolution, denoising, deblurring,
and colorization. In all tasks, the proposed DMPS demonstrates highly
competitive or even better performance while being 3 times
faster than the state-of-the-art competitor diffusion posterior sampling (DPS).
The code to reproduce the results is available at
https://github.com/mengxiangming/dmps.
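For a linear measurement model, an uninformative-prior style approximation of the noise-perturbed likelihood score has a simple closed form; the sketch below uses our own variance-inflation coefficients, which may differ from the exact DMPS expression, and stand-in data.
```python
import torch

def approx_likelihood_score(A, y, x_t, sigma, sigma_t):
    """Closed-form sketch of the noise-perturbed likelihood score for a linear
    measurement y = A x + n, n ~ N(0, sigma^2 I). Treating the perturbation of
    x_t as extra Gaussian measurement noise (an uninformative-prior style
    approximation; the paper's exact coefficients may differ) gives
        grad_x log p_t(y | x_t) ~= A^T (y - A x_t) / (sigma^2 + sigma_t^2)."""
    residual = y - x_t @ A.t()
    return (residual @ A) / (sigma ** 2 + sigma_t ** 2)

# Toy usage: a 100-dim signal observed through 40 noisy linear measurements.
torch.manual_seed(0)
A = torch.randn(40, 100) / 10
x_true = torch.randn(1, 100)
y = x_true @ A.t() + 0.05 * torch.randn(1, 40)
score = approx_likelihood_score(A, y, torch.randn(1, 100), sigma=0.05, sigma_t=0.5)
print(score.shape)  # torch.Size([1, 100]); added to the DM prior score each step
```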
NVDiff: Graph Generation through the Diffusion of Node Vectors
November 19, 2022
Xiaohui Chen, Yukun Li, Aonan Zhang, Li-Ping Liu
Learning to generate graphs is challenging as a graph is a set of pairwise
connected, unordered nodes encoding complex combinatorial structures. Recently,
several works have proposed graph generative models based on normalizing flows
or score-based diffusion models. However, these models need to generate nodes
and edges in parallel from the same process, whose dimensionality is
unnecessarily high. We propose NVDiff, which takes the VGAE structure and uses
a score-based generative model (SGM) as a flexible prior to sample node
vectors. By modeling only node vectors in the latent space, NVDiff
significantly reduces the dimension of the diffusion process and thus improves
sampling speed. Built on the NVDiff framework, we introduce an attention-based
score network capable of capturing both local and global contexts of graphs.
Experiments indicate that NVDiff significantly reduces computations and can
model much larger graphs than competing methods. At the same time, it achieves
superior or competitive performances over various datasets compared to previous
methods.
Parallel Diffusion Models of Operator and Image for Blind Inverse Problems
November 19, 2022
Hyungjin Chung, Jeongsol Kim, Sehui Kim, Jong Chul Ye
Diffusion model-based inverse problem solvers have demonstrated
state-of-the-art performance in cases where the forward operator is known (i.e.
non-blind). However, the applicability of the method to blind inverse problems
has yet to be explored. In this work, we show that we can indeed solve a family
of blind inverse problems by constructing another diffusion prior for the
forward operator. Specifically, parallel reverse diffusion guided by gradients
from the intermediate stages enables joint optimization of both the forward
operator parameters as well as the image, such that both are jointly estimated
at the end of the parallel reverse diffusion procedure. We show the efficacy of
our method on two representative tasks – blind deblurring, and imaging through
turbulence – and show that our method yields state-of-the-art performance,
while also being flexible enough to apply to general blind inverse problems
whose functional forms are known.
Magic3D: High-Resolution Text-to-3D Content Creation
November 18, 2022
Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, Tsung-Yi Lin
DreamFusion has recently demonstrated the utility of a pre-trained
text-to-image diffusion model to optimize Neural Radiance Fields (NeRF),
achieving remarkable text-to-3D synthesis results. However, the method has two
inherent limitations: (a) extremely slow optimization of NeRF and (b)
low-resolution image space supervision on NeRF, leading to low-quality 3D
models with a long processing time. In this paper, we address these limitations
by utilizing a two-stage optimization framework. First, we obtain a coarse
model using a low-resolution diffusion prior and accelerate with a sparse 3D
hash grid structure. Using the coarse representation as the initialization, we
further optimize a textured 3D mesh model with an efficient differentiable
renderer interacting with a high-resolution latent diffusion model. Our method,
dubbed Magic3D, can create high quality 3D mesh models in 40 minutes, which is
2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also
achieving higher resolution. User studies show that 61.7% of raters prefer our
approach over DreamFusion. Together with the image-conditioned generation
capabilities, we provide users with new ways to control 3D synthesis, opening
up new avenues to various creative applications.
A Structure-Guided Diffusion Model for Large-Hole Diverse Image Completion
November 18, 2022
Daichi Horita, Jiaolong Yang, Dong Chen, Yuki Koyama, Kiyoharu Aizawa, Nicu Sebe
Image completion techniques have made significant progress in filling missing
regions (i.e., holes) in images. However, large-hole completion remains
challenging due to limited structural information. In this paper, we address
this problem by integrating explicit structural guidance into diffusion-based
image completion, forming our structure-guided diffusion model (SGDM). It
consists of two cascaded diffusion probabilistic models: structure and texture
generators. The structure generator generates an edge image representing
plausible structures within the holes, which is then used for guiding the
texture generation process. To train both generators jointly, we devise a novel
strategy that leverages optimal Bayesian denoising, which denoises the output
of the structure generator in a single step and thus allows backpropagation.
Our diffusion-based approach enables a diversity of plausible completions,
while the editable edges allow for editing parts of an image. Our experiments
on natural scene (Places) and face (CelebA-HQ) datasets demonstrate that our
method achieves a superior or comparable visual quality compared to
state-of-the-art approaches. The code is available for research purposes at
https://github.com/UdonDa/Structure_Guided_Diffusion_Model.
Patch-Based Denoising Diffusion Probabilistic Model for Sparse-View CT Reconstruction
November 18, 2022
Wenjun Xia, Wenxiang Cong, Ge Wang
eess.IV, cs.LG, eess.SP, physics.med-ph
Sparse-view computed tomography (CT) can greatly reduce radiation dose but
suffers from severe image artifacts. Recently, deep learning-based methods for
sparse-view CT reconstruction have attracted major attention.
However, neural networks often have a limited ability to remove the artifacts
when they only work in the image domain. Deep learning-based sinogram
processing can achieve a better anti-artifact performance, but it inevitably
requires feature maps of the whole image in a video memory, which makes
handling large-scale or three-dimensional (3D) images rather challenging. In
this paper, we propose a patch-based denoising diffusion probabilistic model
(DDPM) for sparse-view CT reconstruction. A DDPM network based on patches
extracted from fully sampled projection data is trained and then used to
inpaint down-sampled projection data. The network does not require paired
fully sampled and down-sampled data, enabling unsupervised learning. Since the
data processing is patch-based, the deep learning workflow can be distributed
in parallel, overcoming the memory problem of large-scale data. Our experiments
show that the proposed method can effectively suppress few-view artifacts while
faithfully preserving textural details.
Invariant Learning via Diffusion Dreamed Distribution Shifts
November 18, 2022
Priyatham Kattakinda, Alexander Levine, Soheil Feizi
Though the background is an important signal for image classification,
over-reliance on it can lead to incorrect predictions when spurious correlations
between foreground and background are broken at test time. Training on a
dataset where these correlations are unbiased would lead to more robust models.
In this paper, we propose such a dataset called Diffusion Dreamed Distribution
Shifts (D3S). D3S consists of synthetic images generated through
StableDiffusion using text prompts and image guides obtained by pasting a
sample foreground image onto a background template image. Using this scalable
approach we generate 120K images of objects from all 1000 ImageNet classes in
10 diverse backgrounds. Due to the incredible photorealism of the diffusion
model, our images are much closer to natural images than previous synthetic
datasets. D3S contains a validation set of more than 17K images whose labels
are human-verified in an MTurk study. Using the validation set, we evaluate
several popular DNN image classifiers and find that the classification
performance of models generally suffers on our background diverse images. Next,
we leverage the foreground & background labels in D3S to learn a foreground
(background) representation that is invariant to changes in background
(foreground) by penalizing the mutual information between the foreground
(background) features and the background (foreground) labels. Linear
classifiers trained on these features to predict foreground (background) from
foreground (background) have high accuracies at 82.9% (93.8%), while
classifiers that predict these labels from the background and foreground have
much lower accuracies of 2.4% and 45.6%, respectively. This suggests that our
foreground and background features are well disentangled. We further test the
efficacy of these representations by training classifiers on a task with strong
spurious correlations.
RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation
November 17, 2022
Titas Anciukevicius, Zexiang Xu, Matthew Fisher, Paul Henderson, Hakan Bilen, Niloy J. Mitra, Paul Guerrero
Diffusion models currently achieve state-of-the-art performance for both
conditional and unconditional image generation. However, so far, image
diffusion models do not support tasks required for 3D understanding, such as
view-consistent 3D generation or single-view object reconstruction. In this
paper, we present RenderDiffusion, the first diffusion model for 3D generation
and inference, trained using only monocular 2D supervision. Central to our
method is a novel image denoising architecture that generates and renders an
intermediate three-dimensional representation of a scene in each denoising
step. This enforces a strong inductive structure within the diffusion process,
providing a 3D consistent representation while only requiring 2D supervision.
The resulting 3D representation can be rendered from any view. We evaluate
RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive
performance for generation of 3D scenes and inference of 3D scenes from 2D
images. Additionally, our diffusion-based approach allows us to use 2D
inpainting to edit 3D scenes.
Conffusion: Confidence Intervals for Diffusion Models
November 17, 2022
Eliahu Horwitz, Yedid Hoshen
Diffusion models have become the go-to method for many generative tasks,
particularly for image-to-image generation tasks such as super-resolution and
inpainting. Current diffusion-based methods do not provide statistical
guarantees regarding the generated results, often preventing their use in
high-stakes situations. To bridge this gap, we construct a confidence interval
around each generated pixel such that the true value of the pixel is guaranteed
to fall within the interval with a probability set by the user. Since diffusion
models parametrize the data distribution, a straightforward way of constructing
such intervals is by drawing multiple samples and calculating their bounds.
However, this method has several drawbacks: i) slow sampling speed; ii)
suboptimal bounds; and iii) the need to train a diffusion model per task. To
mitigate these shortcomings we propose Conffusion, wherein we fine-tune a
pre-trained diffusion model to predict interval bounds in a single forward
pass. We show that Conffusion outperforms the baseline method while being three
orders of magnitude faster.
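One standard way to train bound-predicting heads in a single forward pass is a quantile (pinball) loss, sketched below in PyTorch. Whether the paper uses exactly this loss and these quantile levels is an assumption, and the formal coverage guarantee additionally requires a calibration step not shown here.

    import torch

    def pinball_loss(pred, target, quantile):
        """Quantile (pinball) loss; minimizing it drives `pred` toward the
        requested quantile of `target`."""
        diff = target - pred
        return torch.maximum(quantile * diff, (quantile - 1) * diff).mean()

    def interval_loss(lower, upper, target, alpha=0.1):
        """Train a pair of bound heads to output a (1 - alpha) interval per
        pixel, e.g. the 5% and 95% quantiles for alpha = 0.1."""
        return pinball_loss(lower, target, alpha / 2) + \
               pinball_loss(upper, target, 1 - alpha / 2)

    # toy usage: bounds and ground truth for a batch of 8 single-channel images
    lower = torch.rand(8, 1, 16, 16)
    upper = lower + torch.rand(8, 1, 16, 16)
    target = torch.rand(8, 1, 16, 16)
    print(interval_loss(lower, upper, target).item())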
Null-text Inversion for Editing Real Images using Guided Diffusion Models
November 17, 2022
Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, Daniel Cohen-Or
Recent text-guided diffusion models provide powerful image generation
capabilities. Currently, a massive effort is given to enable the modification
of these images using text only as means to offer intuitive and versatile
editing. To edit a real image using these state-of-the-art tools, one must
first invert the image with a meaningful text prompt into the pretrained
model’s domain. In this paper, we introduce an accurate inversion technique and
thus facilitate an intuitive text-based modification of the image. Our proposed
inversion consists of two novel key components: (i) Pivotal inversion for
diffusion models. While current methods aim at mapping random noise samples to
a single input image, we use a single pivotal noise vector for each timestamp
and optimize around it. We demonstrate that a direct inversion is inadequate on
its own, but does provide a good anchor for our optimization. (ii) NULL-text
optimization, where we only modify the unconditional textual embedding that is
used for classifier-free guidance, rather than the input text embedding. This
allows for keeping both the model weights and the conditional embedding intact
and hence enables applying prompt-based editing while avoiding the cumbersome
tuning of the model’s weights. Our Null-text inversion, based on the publicly
available Stable Diffusion model, is extensively evaluated on a variety of
images and prompt editing, showing high-fidelity editing of real images.
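A toy PyTorch sketch of the null-text optimization loop described above: only the unconditional embedding used for classifier-free guidance is tuned so that the guided DDIM step from one pivotal latent reproduces the next. `eps_model`, `predict_eps`, the latent dimension, and the schedule values are stand-ins, not the Stable Diffusion API.

    import torch

    # toy stand-ins: a noise predictor conditioned on a text embedding, and a
    # deterministic DDIM update rule
    D = 16
    eps_model = torch.nn.Linear(2 * D, D)

    def predict_eps(z, emb):
        return eps_model(torch.cat([z, emb], dim=-1))

    def ddim_step(z, eps, a_t, a_prev):
        z0 = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # predicted clean latent
        return a_prev.sqrt() * z0 + (1 - a_prev).sqrt() * eps

    def optimize_null_embedding(z_t, z_prev_pivot, cond_emb, null_emb,
                                a_t, a_prev, guidance=7.5, iters=10, lr=1e-2):
        """Tune only the unconditional ("null") embedding so that the guided
        DDIM step from the pivotal latent z_t lands on the pivotal z_{t-1};
        model weights and the conditional embedding stay untouched."""
        null_emb = null_emb.clone().requires_grad_(True)
        opt = torch.optim.Adam([null_emb], lr=lr)
        for _ in range(iters):
            eps_u = predict_eps(z_t, null_emb)
            eps_c = predict_eps(z_t, cond_emb)
            eps = eps_u + guidance * (eps_c - eps_u)       # classifier-free guidance
            loss = ((ddim_step(z_t, eps, a_t, a_prev) - z_prev_pivot) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        return null_emb.detach()

    # toy usage with random latents and schedule values
    z_t, z_prev = torch.randn(1, D), torch.randn(1, D)
    cond, null = torch.randn(1, D), torch.zeros(1, D)
    tuned = optimize_null_embedding(z_t, z_prev, cond, null,
                                    torch.tensor(0.5), torch.tensor(0.7))
    print(tuned.shape)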
DiffusionDet: Diffusion Model for Object Detection
November 17, 2022
Shoufa Chen, Peize Sun, Yibing Song, Ping Luo
We propose DiffusionDet, a new framework that formulates object detection as
a denoising diffusion process from noisy boxes to object boxes. During the
training stage, object boxes diffuse from ground-truth boxes to random
distribution, and the model learns to reverse this noising process. In
inference, the model refines a set of randomly generated boxes to the output
results in a progressive way. Our work possesses an appealing property of
flexibility, which enables a dynamic number of boxes and iterative
evaluation. The extensive experiments on the standard benchmarks show that
DiffusionDet achieves favorable performance compared to previous
well-established detectors. For example, DiffusionDet achieves 5.3 AP and 4.8
AP gains when evaluated with more boxes and iteration steps, under a zero-shot
transfer setting from COCO to CrowdHuman. Our code is available at
https://github.com/ShoufaChen/DiffusionDet.
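A simplified NumPy sketch of the training-time box corruption described above: ground-truth boxes are padded to a fixed proposal count, mapped to a signal range, and diffused toward random boxes. The signal-scaling constant and the cosine schedule are assumptions, not the paper's exact settings.

    import numpy as np

    def noisy_boxes(gt_boxes, num_proposals=300, t=0.5, scale=2.0, rng=None):
        """Diffuse ground-truth boxes toward random boxes for training
        (simplified; boxes are in normalized cx, cy, w, h format)."""
        rng = rng or np.random.default_rng()
        # pad the ground-truth set up to a fixed number of proposals
        pad = rng.uniform(size=(num_proposals - len(gt_boxes), 4))
        boxes = np.concatenate([gt_boxes, pad], axis=0)
        x0 = (boxes * 2.0 - 1.0) * scale               # map to the signal range
        alpha_bar = np.cos(t * np.pi / 2) ** 2         # cosine noise schedule
        x_t = np.sqrt(alpha_bar) * x0 \
            + np.sqrt(1 - alpha_bar) * rng.normal(size=x0.shape)
        x_t = np.clip(x_t, -scale, scale)              # keep boxes in range
        return (x_t / scale + 1.0) / 2.0               # back to [0, 1] coordinates

    gt = np.array([[0.5, 0.5, 0.2, 0.3]])              # one normalized box
    print(noisy_boxes(gt, num_proposals=4, t=0.3).shape)   # (4, 4)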
Listen, denoise, action! Audio-driven motion synthesis with diffusion models
November 17, 2022
Simon Alexanderson, Rajmund Nagy, Jonas Beskow, Gustav Eje Henter
cs.LG, cs.CV, cs.GR, cs.HC, cs.SD, eess.AS, 68T07, G.3; I.2.6; I.3.7; J.5
Diffusion models have experienced a surge of interest as highly expressive
yet efficiently trainable probabilistic models. We show that these models are
an excellent fit for synthesising human motion that co-occurs with audio, e.g.,
dancing and co-speech gesticulation, since motion is complex and highly
ambiguous given audio, calling for a probabilistic description. Specifically,
we adapt the DiffWave architecture to model 3D pose sequences, putting
Conformers in place of dilated convolutions for improved modelling power. We
also demonstrate control over motion style, using classifier-free guidance to
adjust the strength of the stylistic expression. Experiments on gesture and
dance generation confirm that the proposed method achieves top-of-the-line
motion quality, with distinctive styles whose expression can be made more or
less pronounced. We also synthesise path-driven locomotion using the same model
architecture. Finally, we generalise the guidance procedure to obtain
product-of-expert ensembles of diffusion models and demonstrate how these may
be used for, e.g., style interpolation, a contribution we believe is of
independent interest. See
https://www.speech.kth.se/research/listen-denoise-action/ for video examples,
data, and code.
EmoDiff: Intensity Controllable Emotional Text-to-Speech with Soft-Label Guidance
November 17, 2022
Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
eess.AS, cs.AI, cs.HC, cs.LG, cs.SD
Although current neural text-to-speech (TTS) models are able to generate
high-quality speech, intensity controllable emotional TTS is still a
challenging task. Most existing methods need external optimizations for
intensity calculation, leading to suboptimal results or degraded quality. In
this paper, we propose EmoDiff, a diffusion-based TTS model where emotion
intensity can be manipulated by a proposed soft-label guidance technique
derived from classifier guidance. Specifically, instead of being guided with a
one-hot vector for the specified emotion, EmoDiff is guided with a soft label
where the value of the specified emotion and \textit{Neutral} is set to
$\alpha$ and $1-\alpha$ respectively. The $\alpha$ here represents the emotion
intensity and can be chosen from 0 to 1. Our experiments show that EmoDiff can
precisely control the emotion intensity while maintaining high voice quality.
Moreover, diverse speech with specified emotion intensity can be generated by
sampling in the reverse denoising process.
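A small PyTorch sketch of the soft-label guidance term: the gradient of alpha * log p(emotion | x_t) + (1 - alpha) * log p(Neutral | x_t) with respect to the noisy sample, which would be added (with a guidance scale) to the reverse denoising update. The linear classifier and feature sizes below are toy placeholders.

    import torch

    def soft_label_guidance_grad(x_t, classifier, emo_idx, neutral_idx, alpha):
        """Gradient of the soft-label log-probability objective w.r.t. x_t;
        `classifier` is any differentiable emotion classifier on noisy inputs."""
        x = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x), dim=-1)
        objective = (alpha * log_probs[:, emo_idx]
                     + (1 - alpha) * log_probs[:, neutral_idx]).sum()
        return torch.autograd.grad(objective, x)[0]

    # toy usage: a linear classifier over 4 emotions and random "noisy" frames
    clf = torch.nn.Linear(80, 4)
    x_t = torch.randn(2, 80)
    grad = soft_label_guidance_grad(x_t, clf, emo_idx=1, neutral_idx=0, alpha=0.6)
    print(grad.shape)   # added (scaled) to the denoiser update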
Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models
November 17, 2022
Minki Kang, Dongchan Min, Sung Ju Hwang
There has been significant progress in Text-To-Speech (TTS) synthesis
technology in recent years, thanks to the advancement in neural generative
modeling. However, existing methods on any-speaker adaptive TTS have achieved
unsatisfactory performance, due to their suboptimal accuracy in mimicking the
target speakers’ styles. In this work, we present Grad-StyleSpeech, an
any-speaker adaptive TTS framework based on a diffusion model that can
generate highly natural speech with extremely high similarity to a target
speaker’s voice, given a few seconds of reference speech. Grad-StyleSpeech
significantly outperforms recent speaker-adaptive TTS baselines on English
benchmarks. Audio samples are available at
https://nardien.github.io/grad-stylespeech-demo.
Fast Graph Generative Model via Spectral Diffusion
November 16, 2022
Tianze Luo, Zhanfeng Mo, Sinno Jialin Pan
Generating graph-structured data is a challenging problem, which requires
learning the underlying distribution of graphs. Various models such as graph
VAE, graph GANs, and graph diffusion models have been proposed to generate
meaningful and reliable graphs, among which the diffusion models have achieved
state-of-the-art performance. In this paper, we argue that running full-rank
diffusion SDEs on the whole graph adjacency matrix space hinders diffusion
models from learning graph topology generation, and hence significantly
deteriorates the quality of generated graph data. To address this limitation,
we propose an efficient yet effective Graph Spectral Diffusion Model (GSDM),
which is driven by low-rank diffusion SDEs on the graph spectrum space. Our
spectral diffusion model is further proven to enjoy a substantially stronger
theoretical guarantee than standard diffusion models. Extensive experiments
across various datasets demonstrate that our proposed GSDM is the SOTA model,
exhibiting both significantly higher generation quality and much lower
computational cost than the baselines.
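A simplified NumPy illustration of diffusing in a low-rank spectral space rather than on the full adjacency matrix: only the forward noising is shown, the eigenvectors are kept fixed, and the schedule is an assumption rather than the paper's construction.

    import numpy as np

    def spectral_forward_noise(adj, k=8, t=0.5, rng=None):
        """Add diffusion noise to the top-k eigenvalues of a symmetric
        adjacency matrix and reconstruct a low-rank noisy adjacency."""
        rng = rng or np.random.default_rng()
        vals, vecs = np.linalg.eigh(adj)               # symmetric adjacency
        idx = np.argsort(-np.abs(vals))[:k]            # keep the top-k spectrum
        vals_k, vecs_k = vals[idx], vecs[:, idx]
        alpha_bar = np.exp(-t)                         # toy noise schedule
        noisy_vals = np.sqrt(alpha_bar) * vals_k \
            + np.sqrt(1 - alpha_bar) * rng.normal(size=vals_k.shape)
        return vecs_k @ np.diag(noisy_vals) @ vecs_k.T # low-rank reconstruction

    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0],
                  [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
    print(spectral_forward_noise(A, k=2, t=0.3).shape)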
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
November 15, 2022
Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi
Recent advances in diffusion models have set an impressive milestone in many
generation tasks, and trending works such as DALL-E2, Imagen, and Stable
Diffusion have attracted great interest. Despite the rapid landscape changes,
recent new approaches focus on extensions and performance rather than capacity,
thus requiring separate models for separate tasks. In this work, we expand the
existing single-flow diffusion pipeline into a multi-task multimodal network,
dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image,
image-to-text, and variations in one unified model. The pipeline design of VD
instantiates a unified multi-flow diffusion framework, consisting of sharable
and swappable layer modules that enable the crossmodal generality beyond images
and text. Through extensive experiments, we demonstrate that VD successfully
achieves the following: a) VD outperforms the baseline approaches and handles
all its base tasks with competitive quality; b) VD enables novel extensions
such as disentanglement of style and semantics, dual- and multi-context
blending, etc.; c) The success of our multi-flow multimodal framework over
images and text may inspire further diffusion-based universal AI research. Our
code and models are open-sourced at
https://github.com/SHI-Labs/Versatile-Diffusion.
ShadowDiffusion: Diffusion-based Shadow Removal using Classifier-driven Attention and Structure Preservation
November 15, 2022
Yeying Jin, Wenhan Yang, Wei Ye, Yuan Yuan, Robby T. Tan
Removing soft and self shadows that lack clear boundaries from a single image
is still challenging. Self shadows are shadows that are cast on the object
itself. Most existing methods rely on binary shadow masks, without considering
the ambiguous boundaries of soft and self shadows. In this paper, we present
DeS3, a method that removes hard, soft and self shadows based on adaptive
attention and ViT similarity. Our novel ViT similarity loss utilizes features
extracted from a pre-trained Vision Transformer. This loss helps guide the
reverse sampling towards recovering scene structures. Our adaptive attention is
able to differentiate shadow regions from the underlying objects, as well as
shadow regions from the object casting the shadow. This capability enables DeS3
to better recover the structures of objects even when they are partially
occluded by shadows. Different from existing methods that rely on constraints
during the training phase, we incorporate the ViT similarity during the
sampling stage. Our method outperforms state-of-the-art methods on the SRD,
AISTD, LRSS, USR and UIUC datasets, removing hard, soft, and self shadows
robustly. Specifically, our method outperforms the SOTA method by 16% in
whole-image RMSE on the LRSS dataset.
CaDM: Codec-aware Diffusion Modeling for Neural-enhanced Video Streaming
November 15, 2022
Qihua Zhou, Ruibin Li, Song Guo, Peiran Dong, Yi Liu, Jingcai Guo, Zhenda Xu
eess.IV, cs.CV, cs.LG, cs.MM
Recent years have witnessed the dramatic growth of Internet video traffic,
where the video bitstreams are often compressed and delivered in low quality to
fit the streamer’s uplink bandwidth. To alleviate this quality degradation,
Neural-enhanced Video Streaming (NVS) has emerged, showing great prospects for
recovering low-quality videos, mostly by deploying neural
super-resolution (SR) on the media server. Despite its benefit, we reveal that
current mainstream works with SR enhancement have not achieved the desired
rate-distortion trade-off between bitrate saving and quality restoration, due
to: (1) overemphasizing the enhancement on the decoder side while omitting the
co-design of encoder, (2) limited generative capacity to recover high-fidelity
perceptual details, and (3) optimizing the compression-and-restoration pipeline
from the resolution perspective solely, without considering color bit-depth.
Aiming to overcome these limitations, we are the first to exploit
encoder-decoder (i.e., codec) synergy by leveraging the inherent
visual-generative property of diffusion models. Specifically, we present
Codec-aware Diffusion Modeling (CaDM), a novel NVS paradigm that significantly
reduces streaming delivery bitrates while offering considerably higher
restoration capacity than existing methods. First, CaDM improves the encoder’s compression
efficiency by simultaneously reducing resolution and color bit-depth of video
frames. Second, CaDM empowers the decoder with high-quality enhancement by
making the denoising diffusion restoration aware of encoder’s resolution-color
conditions. Evaluation on public cloud services with OpenMMLab benchmarks shows
that CaDM effectively saves 5.12x to 21.44x in bitrate under common
video standards and achieves much better recovery quality (e.g., FID of 0.61)
over state-of-the-art neural-enhancing methods.
Direct Inversion: Optimization-Free Text-Driven Real Image Editing with Diffusion Models
November 15, 2022
Adham Elarabawy, Harish Kamath, Samuel Denton
With the rise of large, publicly-available text-to-image diffusion models,
text-guided real image editing has garnered much research attention recently.
Existing methods tend to either rely on some form of per-instance or per-task
fine-tuning and optimization, require multiple novel views, or they inherently
entangle preservation of real image identity, semantic coherence, and
faithfulness to text guidance. In this paper, we propose an optimization-free
and zero fine-tuning framework that applies complex and non-rigid edits to a
single real image via a text prompt, avoiding all the pitfalls described above.
Using widely-available generic pre-trained text-to-image diffusion models, we
demonstrate the ability to modulate pose, scene, background, style, color, and
even racial identity in an extremely flexible manner through a single target
text detailing the desired edit. Furthermore, our method, which we name
$\textit{Direct Inversion}$, proposes multiple intuitively configurable
hyperparameters to allow for a wide range of types and extents of real image
edits. We prove our method’s efficacy in producing high-quality, diverse,
semantically coherent, and faithful real image edits through applying it on a
variety of inputs for a multitude of tasks. We also formalize our method in
well-established theory, detail future experiments for further improvement, and
compare against state-of-the-art attempts.
Diffusion Models for Medical Image Analysis: A Comprehensive Survey
November 14, 2022
Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, Dorit Merhof
Denoising diffusion models, a class of generative models, have garnered
immense interest lately in various deep-learning problems. A diffusion
probabilistic model defines a forward diffusion stage where the input data is
gradually perturbed over several steps by adding Gaussian noise and then learns
to reverse the diffusion process to retrieve the desired noise-free data from
noisy data samples. Diffusion models are widely appreciated for their strong
mode coverage and quality of the generated samples despite their known
computational burdens. Capitalizing on the advances in computer vision, the
field of medical imaging has also observed a growing interest in diffusion
models. To help researchers navigate this profusion, this survey intends to
provide a comprehensive overview of diffusion models in the discipline of
medical image analysis. Specifically, we introduce the solid theoretical
foundation and fundamental concepts behind diffusion models and the three
generic diffusion modelling frameworks: diffusion probabilistic models,
noise-conditioned score networks, and stochastic differential equations. Then,
we provide a systematic taxonomy of diffusion models in the medical domain and
propose a multi-perspective categorization based on their application, imaging
modality, organ of interest, and algorithms. To this end, we cover extensive
applications of diffusion models in the medical domain. Furthermore, we
emphasize the practical use case of some selected approaches, and then we
discuss the limitations of the diffusion models in the medical domain and
propose several directions to fulfill the demands of this field. Finally, we
gather the overviewed studies with their available open-source implementations
at
https://github.com/amirhossein-kz/Awesome-Diffusion-Models-in-Medical-Imaging.
Extreme Generative Image Compression by Learning Text Embedding from Diffusion Models
November 14, 2022
Zhihong Pan, Xin Zhou, Hao Tian
Transferring large amounts of high-resolution images over limited bandwidth is
an important but very challenging task. Compressing images at extremely low
bitrates (<0.1 bpp) has been studied, but it often results in low-quality images
with heavy artifacts due to the strong constraint on the number of bits available
for the compressed data. It is often said that a picture is worth a thousand
words; on the other hand, language is very powerful at capturing the essence
of an image in a short description. With the recent success of diffusion
models for text-to-image generation, we propose a generative image compression
method that demonstrates the potential of saving an image as a short text
embedding, which in turn can be used to generate high-fidelity images that are
perceptually equivalent to the original. For a given image, its
corresponding text embedding is learned using the same optimization process as
the text-to-image diffusion model itself, using a learnable text embedding as
input after bypassing the original transformer. The optimization is applied
together with a learned compression model to achieve extreme compression at
bitrates below 0.1 bpp. In experiments measured by a comprehensive set
of image quality metrics, our method outperforms the other state-of-the-art
deep learning methods in terms of both perceptual quality and diversity.
Arbitrary Style Guidance for Enhanced Diffusion-Based Text-to-Image Generation
November 14, 2022
Zhihong Pan, Xin Zhou, Hao Tian
Diffusion-based text-to-image generation models like GLIDE and DALLE-2 have
gained wide success recently for their superior performance in turning complex
text inputs into images of high quality and wide diversity. In particular, they
are proven to be very powerful in creating graphic arts of various formats and
styles. Although current models support specifying style formats like oil
painting or pencil drawing, fine-grained style features like color
distributions and brush strokes are hard to specify as they are randomly picked
from a conditional distribution based on the given text input. Here we propose
a novel style guidance method to support generating images using arbitrary
style guided by a reference image. The generation method does not require a
separate style transfer model to generate desired styles while maintaining
image quality in generated content as controlled by the text input.
Additionally, the guidance method can be applied without a style reference,
denoted as self style guidance, to generate images of more diverse styles.
Comprehensive experiments prove that the proposed method remains robust and
effective in a wide range of conditions, including diverse graphic art forms,
image content types and diffusion models.
Denoising Diffusion Models for Out-of-Distribution Detection
November 14, 2022
Mark S. Graham, Walter H. L. Pinaya, Petru-Daniel Tudosiu, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
Out-of-distribution detection is crucial to the safe deployment of machine
learning systems. Currently, unsupervised out-of-distribution detection is
dominated by generative-based approaches that make use of estimates of the
likelihood or other measurements from a generative model. Reconstruction-based
methods offer an alternative approach, in which a measure of reconstruction
error is used to determine if a sample is out-of-distribution. However,
reconstruction-based approaches are less favoured, as they require careful
tuning of the model’s information bottleneck - such as the size of the latent
dimension - to produce good results. In this work, we exploit the view of
denoising diffusion probabilistic models (DDPM) as denoising autoencoders where
the bottleneck is controlled externally, by means of the amount of noise
applied. We propose to use DDPMs to reconstruct an input that has been noised
to a range of noise levels, and use the resulting multi-dimensional
reconstruction error to classify out-of-distribution inputs. We validate our
approach both on standard computer-vision datasets and on higher dimension
medical datasets. Our approach outperforms not only reconstruction-based
methods, but also state-of-the-art generative-based approaches. Code is
available at https://github.com/marksgraham/ddpm-ood.
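A minimal NumPy sketch of the scoring procedure: noise the input to several levels, reconstruct each with the trained DDPM, and use the vector of reconstruction errors as the OOD feature. The `reconstruct` callable and the cosine schedule below stand in for the trained reverse process.

    import numpy as np

    def multiscale_reconstruction_errors(x, reconstruct, noise_levels, rng=None):
        """Noise the input to several levels, reconstruct each, and return the
        per-level reconstruction errors used as the OOD feature vector.
        `reconstruct(x_noisy, t)` stands in for running the trained reverse
        process from noise level t back to t = 0."""
        rng = rng or np.random.default_rng()
        errors = []
        for t in noise_levels:
            alpha_bar = np.cos(t * np.pi / 2) ** 2     # cosine schedule
            x_noisy = np.sqrt(alpha_bar) * x \
                + np.sqrt(1 - alpha_bar) * rng.normal(size=x.shape)
            x_rec = reconstruct(x_noisy, t)
            errors.append(float(np.mean((x - x_rec) ** 2)))
        return np.array(errors)   # compared against errors of in-distribution data

    # toy usage with an identity "reconstructor" in place of a trained DDPM
    x = np.random.default_rng(0).normal(size=(32, 32))
    print(multiscale_reconstruction_errors(x, lambda z, t: z, [0.1, 0.3, 0.5, 0.7]))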
Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures
November 14, 2022
Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, Daniel Cohen-Or
Text-guided image generation has progressed rapidly in recent years,
inspiring major breakthroughs in text-guided shape generation. Recently, it has
been shown that using score distillation, one can successfully text-guide a
NeRF model to generate a 3D object. We adapt the score distillation to the
publicly available, and computationally efficient, Latent Diffusion Models,
which apply the entire diffusion process in a compact latent space of a
pretrained autoencoder. As NeRFs operate in image space, a naive solution for
guiding them with latent score distillation would require encoding to the
latent space at each guidance step. Instead, we propose to bring the NeRF to
the latent space, resulting in a Latent-NeRF. Analyzing our Latent-NeRF, we
show that while Text-to-3D models can generate impressive results, they are
inherently unconstrained and may lack the ability to guide or enforce a
specific 3D structure. To assist and direct the 3D generation, we propose to
guide our Latent-NeRF using a Sketch-Shape: an abstract geometry that defines
the coarse structure of the desired object. Then, we present means to integrate
such a constraint directly into a Latent-NeRF. This unique combination of text
and shape guidance allows for increased control over the generation process. We
also show that latent score distillation can be successfully applied directly
on 3D meshes. This allows for generating high-quality textures on a given
geometry. Our experiments validate the power of our different forms of guidance
and the efficiency of using latent rendering. Implementation is available at
https://github.com/eladrich/latent-nerf
Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces
November 14, 2022
Dominic Rampas, Pablo Pernias, Marc Aubreville
Recent advancements in the domain of text-to-image synthesis have culminated
in a multitude of enhancements pertaining to quality, fidelity, and diversity.
Contemporary techniques enable the generation of highly intricate visuals which
rapidly approach near-photorealistic quality. Nevertheless, as progress is
achieved, the complexity of these methodologies increases, consequently
intensifying the comprehension barrier between individuals within the field and
those external to it.
In an endeavor to mitigate this disparity, we propose a streamlined approach
for text-to-image generation, which encompasses both the training paradigm and
the sampling process. Despite its remarkable simplicity, our method yields
aesthetically pleasing images with few sampling iterations, allows for
intriguing ways for conditioning the model, and imparts advantages absent in
state-of-the-art techniques. To demonstrate the efficacy of this approach in
achieving outcomes comparable to existing works, we have trained a one-billion
parameter text-conditional model, which we refer to as “Paella”. In the
interest of fostering future exploration in this field, we have made our source
code and models publicly accessible for the research community.
November 13, 2022
Yongkang Li, Ming Zhang
cs.CL, cs.AI, cs.IR, cs.LG
With the development of deep neural language models, great progress has been
made in information extraction recently. However, deep learning models often
overfit on noisy data points, leading to poor performance. In this work, we
examine the role of information entropy in the overfitting process and draw a
key insight that overfitting is a process of overconfidence and entropy
decreasing. Motivated by such properties, we propose a simple yet effective
co-regularization joint-training framework TIER-A, Aggregation Joint-training
Framework with Temperature Calibration and Information Entropy Regularization.
Our framework consists of several neural models with identical structures.
These models are jointly trained and we avoid overfitting by introducing
temperature and information entropy regularization. Extensive experiments on
two widely-used but noisy datasets, TACRED and CoNLL03, demonstrate the
correctness of our assumption and the effectiveness of our framework.
DriftRec: Adapting diffusion models to blind image restoration tasks
November 12, 2022
Simon Welker, Henry N. Chapman, Timo Gerkmann
In this work, we utilize the high-fidelity generation abilities of diffusion
models to solve blind JPEG restoration at high compression levels. We propose
an elegant modification of the forward stochastic differential equation of
diffusion models to adapt them to this restoration task and name our method
DriftRec. Comparing DriftRec against an $L_2$ regression baseline with the same
network architecture and two state-of-the-art techniques for JPEG restoration,
we show that our approach can escape the tendency of other methods to generate
blurry images, and recovers the distribution of clean images significantly more
faithfully. For this, only a dataset of clean/corrupted image pairs and no
knowledge about the corruption operation is required, enabling wider
applicability to other restoration tasks. In contrast to other conditional and
unconditional diffusion models, we utilize the idea that the distributions of
clean and corrupted images are much closer to each other than each is to the
usual Gaussian prior of the reverse process in diffusion models. Our approach
therefore requires only low levels of added noise, and needs comparatively few
sampling steps even without further optimizations. We show that DriftRec
naturally generalizes to realistic and difficult scenarios such as unaligned
double JPEG compression and blind restoration of JPEGs found online, without
having encountered such examples during training.
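A NumPy sketch of the kind of modified forward SDE described above, whose mean drifts toward the corrupted image rather than toward pure noise. The Ornstein-Uhlenbeck form and all constants are illustrative assumptions, not the paper's exact parameterization.

    import numpy as np

    def forward_drift_to_corrupted(x0, y, theta=3.0, sigma=0.05,
                                   n_steps=100, rng=None):
        """Euler-Maruyama simulation of a forward SDE whose mean drifts toward
        the corrupted image y instead of toward pure Gaussian noise:
            dx = theta * (y - x) dt + sigma dW."""
        rng = rng or np.random.default_rng()
        dt = 1.0 / n_steps
        x = x0.copy()
        for _ in range(n_steps):
            x += theta * (y - x) * dt \
                + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
        return x   # close to y plus a small amount of noise

    rng = np.random.default_rng(0)
    clean = rng.uniform(size=(16, 16))
    corrupted = np.clip(clean + 0.1 * rng.normal(size=clean.shape), 0, 1)
    print(np.abs(forward_drift_to_corrupted(clean, corrupted) - corrupted).mean())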
HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation
November 11, 2022
Kaiduo Zhang, Muyi Sun, Jianxin Sun, Binghao Zhao, Kunbo Zhang, Zhenan Sun, Tieniu Tan
Text-driven person image generation is an emerging and challenging task in
cross-modality image generation. Controllable person image generation promotes
a wide range of applications such as digital human interaction and virtual
try-on. However, previous methods mostly employ single-modality information as
the prior condition (e.g. pose-guided person image generation), or utilize the
preset words for text-driven human synthesis. Introducing a sentence composed
of free words with an editable semantic pose map to describe person appearance
is a more user-friendly approach. In this paper, we propose HumanDiffusion, a
coarse-to-fine alignment diffusion framework, for text-driven person image
generation. Specifically, two collaborative modules are proposed, the Stylized
Memory Retrieval (SMR) module for fine-grained feature distillation in data
processing and the Multi-scale Cross-modality Alignment (MCA) module for
coarse-to-fine feature alignment in diffusion. These two modules guarantee the
alignment quality of the text and image, from image-level to feature-level,
from low-resolution to high-resolution. As a result, HumanDiffusion realizes
open-vocabulary person image generation with desired semantic poses. Extensive
experiments conducted on DeepFashion demonstrate the superiority of our method
compared with previous approaches. Moreover, better results could be obtained
for complicated person images with various details and uncommon poses.
StructDiffusion: Object-Centric Diffusion for Semantic Rearrangement of Novel Objects
November 08, 2022
Weiyu Liu, Yilun Du, Tucker Hermans, Sonia Chernova, Chris Paxton
cs.RO, cs.AI, cs.CL, cs.CV, cs.LG
Robots operating in human environments must be able to rearrange objects into
semantically-meaningful configurations, even if these objects are previously
unseen. In this work, we focus on the problem of building physically-valid
structures without step-by-step instructions. We propose StructDiffusion, which
combines a diffusion model and an object-centric transformer to construct
structures given partial-view point clouds and high-level language goals, such
as “set the table”. Our method can perform multiple challenging
language-conditioned multi-step 3D planning tasks using one model.
StructDiffusion even improves the success rate of assembling physically-valid
structures out of unseen objects by 16% on average over an existing multi-modal
transformer model trained on specific structures. We show experiments on
held-out objects in both simulation and on real-world rearrangement tasks.
Importantly, we show how integrating both a diffusion model and a
collision-discriminator model allows for improved generalization over other
methods when rearranging previously-unseen objects. For videos and additional
results, see our website: https://structdiffusion.github.io/.
DiffPhase: Generative Diffusion-based STFT Phase Retrieval
November 08, 2022
Tal Peer, Simon Welker, Timo Gerkmann
eess.AS, cs.LG, cs.SD, eess.SP
Diffusion probabilistic models have been recently used in a variety of tasks,
including speech enhancement and synthesis. As a generative approach, diffusion
models have been shown to be especially suitable for imputation problems, where
missing data is generated based on existing data. Phase retrieval is inherently
an imputation problem, where phase information has to be generated based on the
given magnitude. In this work we build upon previous work in the speech domain,
adapting a speech enhancement diffusion model specifically for STFT phase
retrieval. Evaluation using speech quality and intelligibility metrics shows
the diffusion approach is well-suited to the phase retrieval task, with
performance surpassing both classical and modern methods.
Self-conditioned Embedding Diffusion for Text Generation
November 08, 2022
Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, Rémi Leblond
Can continuous diffusion models bring the same performance breakthrough on
natural language they did for image generation? To circumvent the discrete
nature of text data, we can simply project tokens in a continuous space of
embeddings, as is standard in language modeling. We propose Self-conditioned
Embedding Diffusion, a continuous diffusion mechanism that operates on token
embeddings and allows learning flexible and scalable diffusion models for both
conditional and unconditional text generation. Through qualitative and
quantitative evaluation, we show that our text diffusion models generate
samples comparable with those produced by standard autoregressive language
models - while being in theory more efficient on accelerator hardware at
inference time. Our work paves the way for scaling up diffusion models for
text, similarly to autoregressive models, and for improving performance with
recent refinements to continuous diffusion.
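A minimal NumPy sketch of the continuous relaxation underlying text diffusion on embeddings: tokens are mapped to embeddings, diffused with Gaussian noise, and decoded by nearest-neighbour lookup. The learned denoiser and the self-conditioning mechanism are omitted, and the schedule is an assumption.

    import numpy as np

    def embed_noise_decode(token_ids, emb_table, t=0.1, rng=None):
        """Map tokens to continuous embeddings, apply forward diffusion noise,
        and decode back by nearest-neighbour lookup in the embedding table."""
        rng = rng or np.random.default_rng()
        x0 = emb_table[token_ids]                      # (seq_len, dim)
        alpha_bar = np.cos(t * np.pi / 2) ** 2
        x_t = np.sqrt(alpha_bar) * x0 \
            + np.sqrt(1 - alpha_bar) * rng.normal(size=x0.shape)
        dists = ((x_t[:, None, :] - emb_table[None, :, :]) ** 2).sum(-1)
        return dists.argmin(axis=1)                    # nearest embedding per position

    rng = np.random.default_rng(0)
    table = rng.normal(size=(100, 32))                 # toy vocabulary of 100 tokens
    print(embed_noise_decode(np.array([3, 17, 42, 7]), table))   # recovers the tokens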
Unsupervised vocal dereverberation with diffusion-based generative models
November 08, 2022
Koichi Saito, Naoki Murata, Toshimitsu Uesaka, Chieh-Hsin Lai, Yuhta Takida, Takao Fukui, Yuki Mitsufuji
Removing reverb from reverberant music is a necessary technique to clean up
audio for downstream music manipulations. Reverberation in music falls into two
categories, natural reverb and artificial reverb. Artificial reverb has a
wider diversity than natural reverb due to its various parameter setups and
reverberation types. However, recent supervised dereverberation methods may
fail because they rely on sufficiently diverse and numerous pairs of
reverberant observations and retrieved data for training in order to be
generalizable to unseen observations during inference. To resolve these
problems, we propose an unsupervised method that can remove a general kind of
artificial reverb for music without requiring pairs of data for training. The
proposed method is based on diffusion models, where it initializes the unknown
reverberation operator with a conventional signal processing technique and
simultaneously refines the estimate with the help of diffusion models. We show
through objective and perceptual evaluations that our method outperforms the
current leading vocal dereverberation benchmarks.
From Denoising Diffusions to Denoising Markov Models
November 07, 2022
Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet
Denoising diffusions are state-of-the-art generative models exhibiting
remarkable empirical performance. They work by diffusing the data distribution
into a Gaussian distribution and then learning to reverse this noising process
to obtain synthetic datapoints. The denoising diffusion relies on
approximations of the logarithmic derivatives of the noised data densities
using score matching. Such models can also be used to perform approximate
posterior simulation when one can only sample from the prior and likelihood. We
propose a unifying framework generalising this approach to a wide class of
spaces and leading to an original extension of score matching. We illustrate
the resulting models on various applications.
Medical Diffusion – Denoising Diffusion Probabilistic Models for 3D Medical Image Generation
November 07, 2022
Firas Khader, Gustav Mueller-Franzes, Soroosh Tayebi Arasteh, Tianyu Han, Christoph Haarburger, Maximilian Schulze-Hagen, Philipp Schad, Sandy Engelhardt, Bettina Baessler, Sebastian Foersch, Johannes Stegmaier, Christiane Kuhl, Sven Nebelung, Jakob Nikolas Kather, Daniel Truhn
Recent advances in computer vision have shown promising results in image
generation. Diffusion probabilistic models in particular have generated
realistic images from textual input, as demonstrated by DALL-E 2, Imagen and
Stable Diffusion. However, their use in medicine, where image data typically
comprises three-dimensional volumes, has not been systematically evaluated.
Synthetic images may play a crucial role in privacy preserving artificial
intelligence and can also be used to augment small datasets. Here we show that
diffusion probabilistic models can synthesize high quality medical imaging
data, which we show for Magnetic Resonance Images (MRI) and Computed Tomography
(CT) images. We provide quantitative measurements of their performance through
a reader study with two medical experts who rated the quality of the
synthesized images in three categories: Realistic image appearance, anatomical
correctness and consistency between slices. Furthermore, we demonstrate that
synthetic images can be used in a self-supervised pre-training and improve the
performance of breast segmentation models when data is scarce (Dice score 0.91
without vs. 0.95 with synthetic data). The code is publicly available on
GitHub: https://github.com/FirasGit/medicaldiffusion.
Modeling Temporal Data as Continuous Functions with Process Diffusion
November 04, 2022
Marin Biloš, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Stephan Günnemann
Temporal data such as time series can be viewed as discretized measurements
of the underlying function. To build a generative model for such data we have
to model the stochastic process that governs it. We propose a solution by
defining the denoising diffusion model in the function space which also allows
us to naturally handle irregularly-sampled observations. The forward process
gradually adds noise to functions, preserving their continuity, while the
learned reverse process removes the noise and returns functions as new samples.
To this end, we define suitable noise sources and introduce novel denoising and
score-matching models. We show how our method can be used for multivariate
probabilistic forecasting and imputation, and how our model can be interpreted
as a neural process.
Cold Diffusion for Speech Enhancement
November 04, 2022
Hao Yen, François G. Germain, Gordon Wichern, Jonathan Le Roux
Diffusion models have recently shown promising results for difficult
enhancement tasks such as the conditional and unconditional restoration of
natural images and audio signals. In this work, we explore the possibility of
leveraging a recently proposed advanced iterative diffusion model, namely cold
diffusion, to recover clean speech signals from noisy signals. The unique
mathematical properties of the sampling process from cold diffusion could be
utilized to restore high-quality samples from arbitrary degradations. Based on
these properties, we propose an improved training algorithm and objective to
help the model generalize better during the sampling process. We verify our
proposed framework by investigating two model architectures. Experimental
results on benchmark speech enhancement dataset VoiceBank-DEMAND demonstrate
the strong performance of the proposed approach compared to representative
discriminative models and diffusion-based enhancement models.
Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models
November 04, 2022
Lukas Struppek, Dominik Hintersdorf, Kristian Kersting
cs.LG, cs.AI, cs.CR, cs.CV
While text-to-image synthesis currently enjoys great popularity among
researchers and the general public, the security of these models has been
neglected so far. Many text-guided image generation models rely on pre-trained
text encoders from external sources, and their users trust that the retrieved
models will behave as promised. Unfortunately, this might not be the case. We
introduce backdoor attacks against text-guided generative models and
demonstrate that their text encoders pose a major tampering risk. Our attacks
only slightly alter an encoder so that no suspicious model behavior is apparent
for image generations with clean prompts. By then inserting a single character
trigger into the prompt, e.g., a non-Latin character or emoji, the adversary
can trigger the model to either generate images with pre-defined attributes or
images following a hidden, potentially malicious description. We empirically
demonstrate the high effectiveness of our attacks on Stable Diffusion and
highlight that the injection process of a single backdoor takes less than two
minutes. Besides phrasing our approach solely as an attack, it can also force
an encoder to forget phrases related to certain concepts, such as nudity or
violence, and help to make image generation safer.
Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration
November 04, 2022
Jean-Marie Lemercier, Julius Richter, Simon Welker, Timo Gerkmann
Diffusion-based generative models have had a high impact on the computer
vision and speech processing communities these past years. Besides data
generation tasks, they have also been employed for data restoration tasks like
speech enhancement and dereverberation. While discriminative models have
traditionally been argued to be more powerful e.g. for speech enhancement,
generative diffusion approaches have recently been shown to narrow this
performance gap considerably. In this paper, we systematically compare the
performance of generative diffusion models and discriminative approaches on
different speech restoration tasks. For this, we extend our prior contributions
on diffusion-based speech enhancement in the complex time-frequency domain to
the task of bandwidth extension. We then compare it to a discriminatively
trained neural network with the same network architecture on three restoration
tasks, namely speech denoising, dereverberation and bandwidth extension. We
observe that the generative approach performs globally better than its
discriminative counterpart on all tasks, with the strongest benefit for
non-additive distortion models, like in dereverberation and bandwidth
extension. Code and audio examples can be found online at
https://uhh.de/inf-sp-sgmsemultitask
Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models
November 03, 2022
Muyang Li, Ji Lin, Chenlin Meng, Stefano Ermon, Song Han, Jun-Yan Zhu
During image editing, existing deep generative models tend to re-synthesize
the entire output from scratch, including the unedited regions. This leads to a
significant waste of computation, especially for minor editing operations. In
this work, we present Spatially Sparse Inference (SSI), a general-purpose
technique that selectively performs computation for edited regions and
accelerates various generative models, including both conditional GANs and
diffusion models. Our key observation is that users tend to edit the input
image gradually. This motivates us to cache and reuse the feature maps of the
original image. Given an edited image, we sparsely apply the convolutional
filters to the edited regions while reusing the cached features for the
unedited areas. Based on our algorithm, we further propose Sparse Incremental
Generative Engine (SIGE) to convert the computation reduction to latency
reduction on off-the-shelf hardware. With about $1\%$-area edits, SIGE
accelerates DDPM by $3.0\times$ on NVIDIA RTX 3090 and $4.6\times$ on Apple M1
Pro GPU, Stable Diffusion by $7.2\times$ on 3090, and GauGAN by $5.6\times$ on
3090 and $5.2\times$ on M1 Pro GPU. Compared to our conference version, we
extend SIGE to accommodate attention layers and apply it to Stable Diffusion.
Additionally, we offer support for Apple M1 Pro GPU and include more results
with large and sequential edits.
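A toy, single-layer NumPy/SciPy analogue of caching-based sparse inference: recompute a convolution only around the edited pixels and reuse cached activations elsewhere. SIGE itself operates on the feature maps of full GAN and diffusion networks with dedicated gather-scatter kernels; this sketch only illustrates the locality argument.

    import numpy as np
    from scipy.signal import correlate2d

    def sparse_conv_update(cached_out, x_edited, x_orig, kernel):
        """Recompute a 2-D 'same' correlation only around the edited pixels and
        reuse the cached activations everywhere else."""
        r = kernel.shape[0] // 2
        changed = np.argwhere(x_edited != x_orig)
        if changed.size == 0:
            return cached_out.copy()
        h, w = x_edited.shape
        # outputs within radius r of any changed input pixel must be refreshed
        (i0, j0), (i1, j1) = changed.min(0), changed.max(0)
        i0, j0 = max(i0 - r, 0), max(j0 - r, 0)
        i1, j1 = min(i1 + r, h - 1), min(j1 + r, w - 1)
        # recompute from an input crop that adds another r pixels of context
        ci0, cj0 = max(i0 - r, 0), max(j0 - r, 0)
        ci1, cj1 = min(i1 + r, h - 1), min(j1 + r, w - 1)
        crop_out = correlate2d(x_edited[ci0:ci1 + 1, cj0:cj1 + 1], kernel, mode="same")
        out = cached_out.copy()
        out[i0:i1 + 1, j0:j1 + 1] = crop_out[i0 - ci0:i1 - ci0 + 1,
                                             j0 - cj0:j1 - cj0 + 1]
        return out

    rng = np.random.default_rng(0)
    img = rng.uniform(size=(64, 64))
    k = np.ones((3, 3)) / 9.0
    cache = correlate2d(img, k, mode="same")
    edited = img.copy()
    edited[20:24, 30:34] += 0.5                        # a small local edit
    print(np.allclose(sparse_conv_update(cache, edited, img, k),
                      correlate2d(edited, k, mode="same")))   # True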
Evaluating a Synthetic Image Dataset Generated with Stable Diffusion
November 03, 2022
Andreas Stöckl
We generate synthetic images with the “Stable Diffusion” image generation
model using the Wordnet taxonomy and the definitions of concepts it contains.
This synthetic image database can be used as training data for data
augmentation in machine learning applications, and it is used to investigate
the capabilities of the Stable Diffusion model.
Analyses show that Stable Diffusion can produce correct images for a large
number of concepts, but also exhibits a large variety of different representations. The
results show differences depending on the test concepts considered and problems
with very specific concepts. These evaluations were performed using a vision
transformer model for image classification.
Convergence in KL Divergence of the Inexact Langevin Algorithm with Application to Score-based Generative Models
November 02, 2022
Kaylee Yingxi Yang, Andre Wibisono
We study the Inexact Langevin Dynamics (ILD), Inexact Langevin Algorithm
(ILA), and Score-based Generative Modeling (SGM) when utilizing estimated score
functions for sampling. Our focus lies in establishing stable biased
convergence guarantees in terms of the Kullback-Leibler (KL) divergence. To
achieve these guarantees, we impose two key assumptions: 1) the target
distribution satisfies the log-Sobolev inequality (LSI), and 2) the score
estimator exhibits a bounded Moment Generating Function (MGF) error. Notably,
the MGF error assumption we adopt is more lenient compared to the $L^\infty$
error assumption used in existing literature. However, it is stronger than the
$L^2$ error assumption utilized in recent works, which often leads to unstable
bounds. We explore the question of how to obtain a provably accurate score
estimator that satisfies the MGF error assumption. Specifically, we demonstrate
that a simple estimator based on kernel density estimation fulfills the MGF
error assumption for sub-Gaussian target distribution, at the population level.
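For reference, the Inexact Langevin Algorithm iterate analyzed above is simply Langevin dynamics driven by an estimated score. A NumPy sketch with a deliberately perturbed (bounded-error) score follows; the step size and iteration counts are chosen arbitrarily.

    import numpy as np

    def inexact_langevin_algorithm(x0, score_estimate, step=1e-2, n_iters=2000,
                                   rng=None):
        """ILA iterate x_{k+1} = x_k + h * s_hat(x_k) + sqrt(2 h) * xi_k,
        where s_hat is an (inexact) estimate of the target score."""
        rng = rng or np.random.default_rng()
        x = np.array(x0, dtype=float)
        for _ in range(n_iters):
            x = x + step * score_estimate(x) \
                + np.sqrt(2 * step) * rng.normal(size=x.shape)
        return x

    # toy usage: sample from N(2, 1) with a slightly perturbed score estimate
    true_score = lambda x: -(x - 2.0)
    inexact = lambda x: true_score(x) + 0.01 * np.sin(x)   # bounded score error
    samples = np.array([inexact_langevin_algorithm(0.0, inexact)
                        for _ in range(200)])
    print(samples.mean(), samples.std())                   # roughly 2 and 1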
An optimal control perspective on diffusion-based generative modeling
November 02, 2022
Julius Berner, Lorenz Richter, Karen Ullrich
We establish a connection between stochastic optimal control and generative
models based on stochastic differential equations (SDEs), such as recently
developed diffusion probabilistic models. In particular, we derive a
Hamilton-Jacobi-Bellman equation that governs the evolution of the
log-densities of the underlying SDE marginals. This perspective allows us to
transfer methods from optimal control theory to generative modeling. First, we
show that the evidence lower bound is a direct consequence of the well-known
verification theorem from control theory. Further, we can formulate
diffusion-based generative modeling as a minimization of the Kullback-Leibler
divergence between suitable measures in path space. Finally, we develop a novel
diffusion-based method for sampling from unnormalized densities – a problem
frequently occurring in statistics and computational sciences. We demonstrate
that our time-reversed diffusion sampler (DIS) can outperform other
diffusion-based sampling approaches on multiple numerical examples.
eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers
November 02, 2022
Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, Tero Karras, Ming-Yu Liu
Large-scale diffusion-based generative models have led to breakthroughs in
text-conditioned high-resolution image synthesis. Starting from random noise,
such text-to-image diffusion models gradually synthesize images in an iterative
fashion while conditioning on text prompts. We find that their synthesis
behavior qualitatively changes throughout this process: Early in sampling,
generation strongly relies on the text prompt to generate text-aligned content,
while later, the text conditioning is almost entirely ignored. This suggests
that sharing model parameters throughout the entire generation process may not
be ideal. Therefore, in contrast to existing works, we propose to train an
ensemble of text-to-image diffusion models specialized for different synthesis
stages. To maintain training efficiency, we initially train a single model,
which is then split into specialized models that are trained for the specific
stages of the iterative generation process. Our ensemble of diffusion models,
called eDiff-I, results in improved text alignment while maintaining the same
inference computation cost and preserving high visual quality, outperforming
previous large-scale text-to-image diffusion models on the standard benchmark.
In addition, we train our model to exploit a variety of embeddings for
conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We
show that these different embeddings lead to different behaviors. Notably, the
CLIP image embedding allows an intuitive way of transferring the style of a
reference image to the target text-to-image output. Lastly, we show a technique
that enables eDiff-I’s “paint-with-words” capability. A user can select the
word in the input text and paint it on a canvas to control the output, which is
very handy for crafting the exact image the user has in mind. The project page is
available at https://deepimagination.cc/eDiff-I/
Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems
November 02, 2022
Kai Packhäuser, Lukas Folle, Florian Thamm, Andreas Maier
The availability of large-scale chest X-ray datasets is a requirement for
developing well-performing deep learning-based algorithms in thoracic
abnormality detection and classification. However, biometric identifiers in
chest radiographs hinder the public sharing of such data for research purposes
due to the risk of patient re-identification. To counteract this issue,
synthetic data generation offers a solution for anonymizing medical images.
This work employs a latent diffusion model to synthesize an anonymous chest
X-ray dataset of high-quality class-conditional images. We propose a
privacy-enhancing sampling strategy to ensure the non-transference of biometric
information during the image generation process. The quality of the generated
images and the feasibility of serving as exclusive training data are evaluated
on a thoracic abnormality classification task. Compared to a classifier trained
on real data, we achieve competitive results with a performance gap of only 3.5% in the area
under the receiver operating characteristic curve.
Entropic Neural Optimal Transport via Diffusion Processes
November 02, 2022
Nikita Gushchin, Alexander Kolesov, Alexander Korotin, Dmitry Vetrov, Evgeny Burnaev
We propose a novel neural algorithm for the fundamental problem of computing
the entropic optimal transport (EOT) plan between continuous probability
distributions which are accessible by samples. Our algorithm is based on the
saddle point reformulation of the dynamic version of EOT which is known as the
Schr"odinger Bridge problem. In contrast to the prior methods for large-scale
EOT, our algorithm is end-to-end and consists of a single learning step, has
fast inference procedure, and allows handling small values of the entropy
regularization coefficient which is of particular importance in some applied
problems. Empirically, we show the performance of the method on several
large-scale EOT tasks.
https://github.com/ngushchin/EntropicNeuralOptimalTransport
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
November 02, 2022
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu
Diffusion probabilistic models (DPMs) have achieved impressive success in
high-resolution image synthesis, especially in recent large-scale text-to-image
generation applications. An essential technique for improving the sample
quality of DPMs is guided sampling, which usually needs a large guidance scale
to obtain the best sample quality. The commonly-used fast sampler for guided
sampling is DDIM, a first-order diffusion ODE solver that generally needs 100
to 250 steps for high-quality samples. Although recent works propose dedicated
high-order solvers and achieve a further speedup for sampling without guidance,
their effectiveness for guided sampling has not been well-tested before. In
this work, we demonstrate that previous high-order fast samplers suffer from
instability issues, and they even become slower than DDIM when the guidance
scale grows large. To further speed up guided sampling, we propose
DPM-Solver++, a high-order solver for the guided sampling of DPMs. DPM-Solver++
solves the diffusion ODE with the data prediction model and adopts thresholding
methods to keep the solution matched to the training data distribution. We further
propose a multistep variant of DPM-Solver++ to address the instability issue by
reducing the effective step size. Experiments show that DPM-Solver++ can
generate high-quality samples within only 15 to 20 steps for guided sampling by
pixel-space and latent-space DPMs.
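To make the thresholding step concrete, here is a hedged sketch of dynamic thresholding applied to a data prediction: each sample is clipped to a per-sample percentile of its absolute values and rescaled into an assumed training range of [-1, 1]. The percentile 0.995 and the tensor shapes are illustrative choices, not necessarily those used by DPM-Solver++.

import numpy as np

def dynamic_threshold(x0_hat, percentile=0.995):
    # Per-sample threshold s: the chosen percentile of |x0_hat|, never below 1.
    flat = np.abs(x0_hat.reshape(x0_hat.shape[0], -1))
    s = np.maximum(np.quantile(flat, percentile, axis=1), 1.0)
    s = s.reshape(-1, *([1] * (x0_hat.ndim - 1)))
    # Clip to [-s, s], then rescale so the result stays within [-1, 1].
    return np.clip(x0_hat, -s, s) / s

x0_hat = 3.0 * np.random.randn(4, 3, 32, 32)   # an over-saturated data prediction
x0_thr = dynamic_threshold(x0_hat)
assert np.abs(x0_thr).max() <= 1.0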
Spot the fake lungs: Generating Synthetic Medical Images using Neural Diffusion Models
November 02, 2022
Hazrat Ali, Shafaq Murad, Zubair Shah
eess.IV, cs.AI, cs.CV, cs.LG
Generative models are becoming popular for the synthesis of medical images.
Recently, neural diffusion models have demonstrated the potential to generate
photo-realistic images of objects. However, their potential to generate medical
images has not yet been explored. In this work, we explore the synthesis of
medical images using neural diffusion models. First, we use a pre-trained
DALL·E 2 model to generate lung X-ray and CT images from an input text prompt.
Second, we train a Stable Diffusion model with 3165 X-ray images
and generate synthetic images. We evaluate the synthetic image data through a
qualitative analysis where two independent radiologists label randomly chosen
samples from the generated data as real, fake, or unsure. The results
demonstrate that images generated with the diffusion model can capture
characteristics that are otherwise very specific to certain medical conditions
in chest X-ray or CT images, and that careful tuning of the model is promising.
To the best of our knowledge, this is the first attempt to generate lung X-ray
and CT images
using neural diffusion models. This work aims to introduce a new dimension in
artificial intelligence for medical imaging. Given that this is a new topic,
the paper will serve as an introduction and motivation for the research
community to explore the potential of diffusion models for medical image
synthesis. We have released the synthetic images on
https://www.kaggle.com/datasets/hazrat/awesomelungs.
Concrete Score Matching: Generalized Score Matching for Discrete Data
November 02, 2022
Chenlin Meng, Kristy Choi, Jiaming Song, Stefano Ermon
Representing probability distributions by the gradient of their density
functions has proven effective in modeling a wide range of continuous data
modalities. However, this representation is not applicable in discrete domains
where the gradient is undefined. To this end, we propose an analogous score
function called the “Concrete score”, a generalization of the (Stein) score for
discrete settings. Given a predefined neighborhood structure, the Concrete
score of any input is defined by the rate of change of the probabilities with
respect to local directional changes of the input. This formulation allows us
to recover the (Stein) score in continuous domains when measuring such changes
by the Euclidean distance, while using the Manhattan distance leads to our
novel score function in discrete domains. Finally, we introduce a new framework
to learn such scores from samples called Concrete Score Matching (CSM), and
propose an efficient training objective to scale our approach to high
dimensions. Empirically, we demonstrate the efficacy of CSM on density
estimation tasks on a mixture of synthetic, tabular, and high-dimensional image
datasets, and demonstrate that it performs favorably relative to existing
baselines for modeling discrete data.
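For intuition, the sketch below evaluates the Concrete score exactly for a small categorical distribution, using a ring neighborhood over {0, ..., K-1}: the entry for each neighbor is the relative rate of change (p(neighbor) - p(x)) / p(x). The ring neighborhood and K = 8 are illustrative assumptions; a learned model would regress these quantities from samples rather than compute them from the true pmf.

import numpy as np

K = 8
logits = np.random.randn(K)
p = np.exp(logits) / np.exp(logits).sum()          # ground-truth pmf over K states

def concrete_score(p, x):
    # Rate of change of probability toward each neighbor of x (ring structure).
    neighbors = [(x - 1) % K, (x + 1) % K]
    return np.array([(p[n] - p[x]) / p[x] for n in neighbors])

scores = np.stack([concrete_score(p, x) for x in range(K)])
print(scores.shape)                                 # (K, 2): one entry per neighbor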
On the detection of synthetic images generated by diffusion models
November 01, 2022
Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, Luisa Verdoliva
Over the past decade, there has been tremendous progress in creating
synthetic media, mainly thanks to the development of powerful methods based on
generative adversarial networks (GAN). Very recently, methods based on
diffusion models (DM) have been gaining the spotlight. In addition to providing
an impressive level of photorealism, they enable the creation of text-based
visual content, opening up new and exciting opportunities in many different
application fields, from arts to video games. On the other hand, this property
is an additional asset in the hands of malicious users, who can generate and
distribute fake media perfectly adapted to their attacks, posing new challenges
to the media forensic community. With this work, we seek to understand how
difficult it is to distinguish synthetic images generated by diffusion models
from pristine ones and whether current state-of-the-art detectors are suitable
for the task. To this end, first we expose the forensics traces left by
diffusion models, then study how current detectors, developed for GAN-generated
images, perform on these new synthetic images, especially in challenging
social-networks scenarios involving image compression and resizing. Datasets
and code are available at github.com/grip-unina/DMimageDetection.
MedSegDiff: Medical Image Segmentation with Diffusion Probabilistic Model
November 01, 2022
Junde Wu, Rao Fu, Huihui Fang, Yu Zhang, Yehui Yang, Haoyi Xiong, Huiying Liu, Yanwu Xu
The diffusion probabilistic model (DPM) has recently become one of the hottest
topics in computer vision. Its image generation applications, such as Imagen,
Latent Diffusion Models, and Stable Diffusion, have shown impressive generation
capabilities and sparked extensive discussion in the community. Many recent
studies have also found it useful in other vision tasks, such as image
deblurring, super-resolution, and anomaly detection. Inspired by the success of
DPM, we propose the first DPM-based model for general medical image
segmentation, which we name MedSegDiff. In order to enhance the
step-wise regional attention in DPM for medical image segmentation, we
propose dynamic conditional encoding, which establishes the state-adaptive
conditions for each sampling step. We further propose a Feature Frequency Parser
(FF-Parser) to eliminate the negative effect of high-frequency noise components
in this process. We verify MedSegDiff on three medical segmentation tasks with
different image modalities, which are optic cup segmentation over fundus
images, brain tumor segmentation over MRI images and thyroid nodule
segmentation over ultrasound images. The experimental results show that
MedSegDiff outperforms state-of-the-art (SOTA) methods by a considerable
margin, indicating the generalization ability and effectiveness of the
proposed model. Our code is released at https://github.com/WuJunde/MedSegDiff.
DOLPH: Diffusion Models for Phase Retrieval
November 01, 2022
Shirin Shoushtari, Jiaming Liu, Ulugbek S. Kamilov
Phase retrieval refers to the problem of recovering an image from the
magnitudes of its complex-valued linear measurements. Since the problem is
ill-posed, the recovery requires prior knowledge on the unknown image. We
present DOLPH as a new deep model-based architecture for phase retrieval that
integrates an image prior specified using a diffusion model with a nonconvex
data-fidelity term for phase retrieval. Diffusion models are a recent class of
deep generative models that are relatively easy to train due to their
implementation as image denoisers. DOLPH reconstructs high-quality solutions by
alternating data-consistency updates with the sampling step of a diffusion
model. Our numerical results show the robustness of DOLPH to noise and its
ability to generate several candidate solutions given a set of measurements.
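A minimal sketch of the alternation described above, assuming a magnitude-only measurement model y = |Ax| with a random Gaussian A, a gradient step on the nonconvex data-fidelity term, and a toy shrinkage denoiser standing in for the diffusion-model sampling step; the step size and noise schedule are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, m = 64, 256
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = np.abs(A @ x_true)                            # magnitude-only measurements

def data_fidelity_grad(x):
    # Gradient of 0.5 * || |Ax| - y ||^2 with respect to x.
    Ax = A @ x
    return A.T @ ((np.abs(Ax) - y) * np.sign(Ax))

def toy_denoiser(x, sigma):
    return x / (1.0 + sigma ** 2)                  # placeholder for a learned denoiser

x = rng.standard_normal(n)
for sigma in np.linspace(1.0, 0.01, 100):          # coarse noise schedule
    x = x - 0.5 * data_fidelity_grad(x)            # data-consistency update
    x = toy_denoiser(x, sigma)                     # prior (diffusion) step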
SDMuse: Stochastic Differential Music Editing and Generation via Hybrid Representation
November 01, 2022
Chen Zhang, Yi Ren, Kejun Zhang, Shuicheng Yan
While deep generative models have empowered music generation, it remains a
challenging and under-explored problem to edit an existing musical piece at
fine granularity. In this paper, we propose SDMuse, a unified Stochastic
Differential Music editing and generation framework, which can not only compose
a whole musical piece from scratch, but also modify existing musical pieces in
many ways, such as combination, continuation, inpainting, and style
transfer. The proposed SDMuse follows a two-stage pipeline to achieve music
generation and editing on top of a hybrid representation including pianoroll
and MIDI-event. In particular, SDMuse first generates/edits pianoroll by
iteratively denoising through a stochastic differential equation (SDE) based on
a diffusion model generative prior, and then refines the generated pianoroll
and predicts MIDI-event tokens auto-regressively. We evaluate the generated
music of our method on the ailabs1k7 pop music dataset in terms of quality and
controllability on various music editing and generation tasks. Experimental
results demonstrate the effectiveness of our proposed stochastic differential
music editing and generation process, as well as the hybrid representations.
Accelerated Motion Correction for MRI using Score-Based Generative Models
November 01, 2022
Brett Levac, Sidharth Kumar, Ajil Jalal, Jonathan I. Tamir
Magnetic Resonance Imaging (MRI) is a powerful medical imaging modality, but
unfortunately suffers from long scan times which, aside from increasing
operational costs, can lead to image artifacts due to patient motion. Motion
during the acquisition leads to inconsistencies in measured data that manifest
as blurring and ghosting if unaccounted for in the image reconstruction
process. Various deep learning based reconstruction techniques have been
proposed which decrease scan time by reducing the number of measurements needed
for a high fidelity reconstructed image. Additionally, deep learning has been
used to correct motion using end-to-end techniques. This, however, increases
susceptibility to distribution shifts at test time (sampling pattern, motion
level). In this work we propose a framework for jointly reconstructing highly
sub-sampled MRI data while estimating patient motion using diffusion based
generative models. Our method does not make specific assumptions on the
sampling trajectory or motion pattern at training time and thus can be flexibly
applied to various types of measurement models and patient motion. We
demonstrate our framework on retrospectively accelerated 2D brain MRI corrupted
by rigid motion.
SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control
October 31, 2022
Xiaochuang Han, Sachin Kumar, Yulia Tsvetkov
Despite the growing success of diffusion models in continuous-valued domains
(e.g., images), similar efforts for discrete domains such as text have yet to
match the performance of autoregressive language models. In this work, we
present SSD-LM – a diffusion-based language model with two key design choices.
First, SSD-LM is semi-autoregressive, iteratively generating blocks of text,
allowing for flexible output length at decoding time while enabling local
bidirectional context updates. Second, it is simplex-based, performing
diffusion on the natural vocabulary space rather than a learned latent space,
allowing us to incorporate classifier guidance and modular control using
off-the-shelf classifiers without any adaptation. We evaluate SSD-LM on
unconstrained text generation benchmarks, and show that it matches or
outperforms strong autoregressive GPT-2 models across standard quality and
diversity metrics, while vastly outperforming diffusion-based baselines. On
controlled text generation, SSD-LM also outperforms competitive baselines, with
an extra advantage in modularity.
Guided Conditional Diffusion for Controllable Traffic Simulation
October 31, 2022
Ziyuan Zhong, Davis Rempe, Danfei Xu, Yuxiao Chen, Sushant Veer, Tong Che, Baishakhi Ray, Marco Pavone
cs.RO, cs.AI, cs.LG, stat.ML
Controllable and realistic traffic simulation is critical for developing and
verifying autonomous vehicles. Typical heuristic-based traffic models offer
flexible control to make vehicles follow specific trajectories and traffic
rules. On the other hand, data-driven approaches generate realistic and
human-like behaviors, improving transfer from simulated to real-world traffic.
However, to the best of our knowledge, no traffic model offers both
controllability and realism. In this work, we develop a conditional diffusion
model for controllable traffic generation (CTG) that allows users to control
desired properties of trajectories at test time (e.g., reach a goal or follow a
speed limit) while maintaining realism and physical feasibility through
enforced dynamics. The key technical idea is to leverage recent advances from
diffusion modeling and differentiable logic to guide generated trajectories to
meet rules defined using signal temporal logic (STL). We further extend
guidance to multi-agent settings and enable interaction-based rules like
collision avoidance. CTG is extensively evaluated on the nuScenes dataset for
diverse and composite rules, demonstrating improvement over strong baselines in
terms of the controllability-realism tradeoff.
Diffusion models for missing value imputation in tabular data
October 31, 2022
Shuhan Zheng, Nontawat Charoenphakdee
Missing value imputation in machine learning is the task of estimating the
missing values in the dataset accurately using available information. In this
task, several deep generative modeling methods have been proposed and
demonstrated their usefulness, e.g., generative adversarial imputation
networks. Recently, diffusion models have gained popularity because of their
effectiveness in the generative modeling task in images, texts, audio, etc. To
our knowledge, little attention has been paid to the investigation of the
effectiveness of diffusion models for missing value imputation in tabular data.
Based on recent development of diffusion models for time-series data
imputation, we propose a diffusion model approach called “Conditional
Score-based Diffusion Models for Tabular data” (TabCSDI). To effectively handle
categorical variables and numerical variables simultaneously, we investigate
three techniques: one-hot encoding, analog bits encoding, and feature
tokenization. Experimental results on benchmark datasets demonstrated the
effectiveness of TabCSDI compared with well-known existing methods, and also
emphasized the importance of the categorical embedding techniques.
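The sketch below illustrates two of the categorical encodings mentioned above, one-hot and analog bits, on a toy column with K = 4 categories; the feature-tokenization variant would additionally require a learned embedding table, so it is omitted. The toy data and the bit mapping {0, 1} -> {-1, 1} are illustrative assumptions.

import numpy as np

categories = np.array([0, 2, 1, 3, 2])             # toy categorical column, K = 4
K = 4

# One-hot encoding: one continuous dimension per category.
one_hot = np.eye(K)[categories]

# Analog bits: write each category index in binary, then map {0, 1} -> {-1, 1}
# so the bits live on a continuous scale that the diffusion model can noise.
n_bits = int(np.ceil(np.log2(K)))
bits = ((categories[:, None] >> np.arange(n_bits)) & 1).astype(float)
analog_bits = 2.0 * bits - 1.0

print(one_hot.shape, analog_bits.shape)            # (5, 4) (5, 2)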
DiffusER: Discrete Diffusion via Edit-based Reconstruction
October 30, 2022
Machel Reid, Vincent J. Hellendoorn, Graham Neubig
In text generation, models that generate text from scratch one token at a
time are currently the dominant paradigm. Despite being performant, these
models lack the ability to revise existing text, which limits their usability
in many practical scenarios. We look to address this, with DiffusER (Diffusion
via Edit-based Reconstruction), a new edit-based generative model for text
based on denoising diffusion models – a class of models that use a Markov
chain of denoising steps to incrementally generate data. DiffusER is not only a
strong generative model in general, rivalling autoregressive models on several
tasks spanning machine translation, summarization, and style transfer; it can
also perform other varieties of generation that standard autoregressive models
are not well-suited for. For instance, we demonstrate that DiffusER makes it
possible for a user to condition generation on a prototype, or an incomplete
sequence, and continue revising based on previous edit steps.
Conditioning and Sampling in Variational Diffusion Models for Speech Super-resolution
October 27, 2022
Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, Hao Tang
Recently, diffusion models (DMs) have been increasingly used in audio
processing tasks, including speech super-resolution (SR), which aims to restore
high-frequency content given low-resolution speech utterances. This is commonly
achieved by conditioning the noise-predictor network on the low-resolution
audio. In this paper, we propose a novel sampling algorithm that communicates
the information of the low-resolution audio via the reverse sampling process of
DMs. The proposed method can be a drop-in replacement for the vanilla sampling
process and can significantly improve the performance of the existing works.
Moreover, by coupling the proposed sampling method with an unconditional DM,
i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize
it to a wide range of SR setups. We also attain state-of-the-art results on the
VCTK Multi-Speaker benchmark with this novel formulation.
LAD: Language Augmented Diffusion for Reinforcement Learning
October 27, 2022
Edwin Zhang, Yujie Lu, William Wang, Amy Zhang
Training generalist agents is difficult across several axes, requiring us to
deal with high-dimensional inputs (space), long horizons (time), and multiple
and new tasks. Recent advances with architectures have allowed for improved
scaling along one or two of these dimensions, but are still prohibitive
computationally. In this paper, we propose to address all three axes by
leveraging Language to Control Diffusion models as a hierarchical planner
conditioned on language (LCD). We effectively and efficiently scale diffusion
models for planning in extended temporal, state, and task dimensions to tackle
long horizon control problems conditioned on natural language instructions. We
compare LCD with other state-of-the-art models on the CALVIN language robotics
benchmark and find that LCD outperforms other SOTA methods in multi-task
success rates while dramatically improving computational efficiency, with a
single-task success rate (SR) of 88.7% against the previous best of 82.6%. We
show that LCD can successfully leverage the unique strength of diffusion models
to produce coherent long range plans while addressing their weakness at
generating low-level details and control. We release our code and models at
https://github.com/ezhang7423/language-control-diffusion.
Solving Audio Inverse Problems with a Diffusion Model
October 27, 2022
Eloi Moliner, Jaakko Lehtinen, Vesa Välimäki
This paper presents CQT-Diff, a data-driven generative audio model that can,
once trained, be used for solving various different audio inverse problems in a
problem-agnostic setting. CQT-Diff is a neural diffusion model with an
architecture that is carefully constructed to exploit pitch-equivariant
symmetries in music. This is achieved by preconditioning the model with an
invertible Constant-Q Transform (CQT), whose logarithmically-spaced frequency
axis represents pitch equivariance as translation equivariance. The proposed
method is evaluated with objective and subjective metrics in three different
and varied tasks: audio bandwidth extension, inpainting, and declipping. The
results show that CQT-Diff outperforms the compared baselines and ablations in
audio bandwidth extension and, without retraining, delivers competitive
performance against modern baselines in audio inpainting and declipping. This
work represents the first diffusion-based general framework for solving inverse
problems in audio processing.
Accelerating Diffusion Models via Pre-segmentation Diffusion Sampling for Medical Image Segmentation
October 27, 2022
Xutao Guo, Yanwu Yang, Chenfei Ye, Shang Lu, Yang Xiang, Ting Ma
Based on the Denoising Diffusion Probabilistic Model (DDPM), medical image
segmentation can be described as a conditional image generation task, which
allows computing pixel-wise uncertainty maps of the segmentation and using an
implicit ensemble of segmentations to boost the segmentation performance.
However, DDPM requires many iterative denoising steps to generate segmentations
from Gaussian noise, resulting in extremely inefficient inference. To mitigate
the issue, we propose a principled acceleration strategy, called
pre-segmentation diffusion sampling DDPM (PD-DDPM), which is specially used for
medical image segmentation. The key idea is to obtain pre-segmentation results
based on a separately trained segmentation network, and construct noise
predictions (non-Gaussian distribution) according to the forward diffusion
rule. We can then start with noisy predictions and use fewer reverse steps to
generate segmentation results. Experiments show that PD-DDPM yields better
segmentation results over representative baseline methods even if the number of
reverse steps is significantly reduced. Moreover, PD-DDPM is orthogonal to
existing advanced segmentation models, which can be combined to further improve
the segmentation performance.
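A hedged sketch of the acceleration idea: noise a pre-segmentation to an intermediate step t_start using the forward rule x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, then run only the remaining reverse steps. The placeholder segmentation network, the stand-in reverse step, and the linear beta schedule are illustrative assumptions, not the paper's trained components.

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def pre_segmenter(image):
    return (image > image.mean()).astype(float)    # stand-in segmentation network

def reverse_step(x_t, t):
    return 0.999 * x_t                             # stand-in for one learned DDPM step

image = np.random.rand(1, 128, 128)
x0 = pre_segmenter(image)

t_start = 250                                      # far fewer than T reverse steps
eps = np.random.randn(*x0.shape)
x_t = np.sqrt(alphas_bar[t_start]) * x0 + np.sqrt(1.0 - alphas_bar[t_start]) * eps

for t in range(t_start, -1, -1):                   # only t_start + 1 reverse steps
    x_t = reverse_step(x_t, t)
segmentation = (x_t > 0.5).astype(float)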
DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models
October 26, 2022
Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, Duen Horng Chau
cs.CV, cs.AI, cs.HC, cs.LG
With recent advancements in diffusion models, users can generate high-quality
images by writing text prompts in natural language. However, generating images
with desired details requires proper prompts, and it is often unclear how a
model reacts to different prompts or what the best prompts are. To help
researchers tackle these critical challenges, we introduce DiffusionDB, the
first large-scale text-to-image prompt dataset totaling 6.5TB, containing 14
million images generated by Stable Diffusion, 1.8 million unique prompts, and
hyperparameters specified by real users. We analyze the syntactic and semantic
characteristics of prompts. We pinpoint specific hyperparameter values and
prompt styles that can lead to model errors and present evidence of potentially
harmful model usage, such as the generation of misinformation. The
unprecedented scale and diversity of this human-actuated dataset provide
exciting research opportunities in understanding the interplay between prompts
and generative models, detecting deepfakes, and designing human-AI interaction
tools to help users more easily use these models. DiffusionDB is publicly
available at: https://poloclub.github.io/diffusiondb.
Categorical SDEs with Simplex Diffusion
October 26, 2022
Pierre H. Richemond, Sander Dieleman, Arnaud Doucet
Diffusion models typically operate in the standard framework of generative
modelling by producing continuously-valued datapoints. To this end, they rely
on a progressive Gaussian smoothing of the original data distribution, which
admits an SDE interpretation involving increments of a standard Brownian
motion. However, some applications such as text generation or reinforcement
learning might naturally be better served by diffusing categorical-valued data,
i.e., lifting the diffusion to a space of probability distributions. To this
end, this short theoretical note proposes Simplex Diffusion, a means to
directly diffuse datapoints located on an n-dimensional probability simplex. We
show how this relates to the Dirichlet distribution on the simplex and how the
analogous SDE is realized thanks to a multi-dimensional Cox-Ingersoll-Ross
process (abbreviated as CIR), previously used in economics and mathematical
finance. Finally, we make remarks as to the numerical implementation of
trajectories of the CIR process, and discuss some limitations of our approach.
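For readers unfamiliar with the CIR process, the sketch below simulates d independent CIR coordinates with a full-truncation Euler scheme and normalizes the result onto the probability simplex. The parameters (a, b, sigma) and the normalization step are illustrative choices; the note itself works out the precise Dirichlet connection.

import numpy as np

rng = np.random.default_rng(0)
d, n_steps, dt = 5, 1000, 1e-3
a, b, sigma = 2.0, 1.0, 0.5                        # mean reversion, long-run mean, volatility

# CIR SDE per coordinate: dX = a (b - X) dt + sigma sqrt(X) dW.
x = np.full(d, b)
for _ in range(n_steps):
    x_pos = np.maximum(x, 0.0)                     # full truncation keeps sqrt well-defined
    dw = np.sqrt(dt) * rng.standard_normal(d)
    x = x + a * (b - x_pos) * dt + sigma * np.sqrt(x_pos) * dw

simplex_point = np.maximum(x, 0.0) / np.maximum(x, 0.0).sum()
print(simplex_point, simplex_point.sum())          # nonnegative and sums to 1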
Full-band General Audio Synthesis with Score-based Diffusion
October 26, 2022
Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, Joan Serrà
Recent works have shown the capability of deep generative models to tackle
general audio synthesis from a single label, producing a variety of impulsive,
tonal, and environmental sounds. Such models operate on band-limited signals
and, as a result of an autoregressive approach, are typically composed of
pre-trained latent encoders and/or several cascaded modules. In this work, we
propose a diffusion-based generative model for general audio synthesis, named
DAG, which deals with full-band signals end-to-end in the waveform domain.
Results show the superiority of DAG over existing label-conditioned generators
in terms of both quality and diversity. More specifically, when compared to the
state of the art, the band-limited and full-band versions of DAG achieve
relative improvements that go up to 40 and 65%, respectively. We believe DAG is
flexible enough to accommodate different conditioning schemas while providing
good quality synthesis.
Towards the Detection of Diffusion Model Deepfakes
October 26, 2022
Jonas Ricker, Simon Damm, Thorsten Holz, Asja Fischer
Diffusion models (DMs) have recently emerged as a promising method in image
synthesis. However, to date, only little attention has been paid to the
detection of DM-generated images, which is critical to prevent adverse impacts
on our society. In this work, we address this pressing challenge from two
different angles: First, we evaluate the performance of state-of-the-art
detectors, which are very effective against images generated by generative
adversarial networks (GANs), on a variety of DMs. Second, we analyze
DM-generated images in the frequency domain and study different factors that
influence the spectral properties of these images. Most importantly, we
demonstrate that GANs and DMs produce images with different characteristics,
which requires adaptation of existing classifiers to ensure reliable detection.
We are convinced that this work provides the foundation and starting point for
further research on effective detection of DM-generated images.
On the failure of variational score matching for VAE models
October 24, 2022
Li Kevin Wenliang
Score matching (SM) is a convenient method for training flexible
probabilistic models, which is often preferred over the traditional
maximum-likelihood (ML) approach. However, these models are less interpretable
than normalized models; as such, training robustness is in general difficult to
assess. We present a critical study of existing variational SM objectives,
showing catastrophic failure on a wide range of datasets and network
architectures. Our theoretical insights on the objectives emerge directly from
their equivalent autoencoding losses when optimizing variational autoencoder
(VAE) models. First, we show that in the Fisher autoencoder, SM produces far
worse models than maximum-likelihood, and approximate inference by Fisher
divergence can lead to low-density local optima. However, with important
modifications, this objective reduces to a regularized autoencoding loss that
resembles the evidence lower bound (ELBO). This analysis predicts that the
modified SM algorithm should behave very similarly to ELBO on Gaussian VAEs. We
then review two other Fisher-divergence-based objectives from the literature and show that
they reduce to uninterpretable autoencoding losses, likely leading to poor
performance. The experiments verify our theoretical predictions and suggest
that only ELBO and the baseline objective robustly produce expected results,
while previously proposed SM methods do not.
Structure-based Drug Design with Equivariant Diffusion Models
October 24, 2022
Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein, Bruno Correia
Structure-based drug design (SBDD) aims to design small-molecule ligands that
bind with high affinity and specificity to pre-determined protein targets. In
this paper, we formulate SBDD as a 3D-conditional generation problem and
present DiffSBDD, an SE(3)-equivariant 3D-conditional diffusion model that
generates novel ligands conditioned on protein pockets. Comprehensive in silico
experiments demonstrate the efficiency and effectiveness of DiffSBDD in
generating novel and diverse drug-like ligands with competitive docking scores.
We further explore the flexibility of the diffusion framework for a broader
range of tasks in drug design campaigns, such as off-the-shelf property
optimization and partial molecular design with inpainting.
MARS: Meta-Learning as Score Matching in the Function Space
October 24, 2022
Krunoslav Lehman Pavasovic, Jonas Rothfuss, Andreas Krause
Meta-learning aims to extract useful inductive biases from a set of related
datasets. In Bayesian meta-learning, this is typically achieved by constructing
a prior distribution over neural network parameters. However, specifying
families of computationally viable prior distributions over the
high-dimensional neural network parameters is difficult. As a result, existing
approaches resort to meta-learning restrictive diagonal Gaussian priors,
severely limiting their expressiveness and performance. To circumvent these
issues, we approach meta-learning through the lens of functional Bayesian
neural network inference, which views the prior as a stochastic process and
performs inference in the function space. Specifically, we view the
meta-training tasks as samples from the data-generating process and formalize
meta-learning as empirically estimating the law of this stochastic process. Our
approach can seamlessly acquire and represent complex prior knowledge by
meta-learning the score function of the data-generating process marginals
instead of parameter space priors. In a comprehensive benchmark, we demonstrate
that our method achieves state-of-the-art performance in terms of predictive
accuracy and substantial improvements in the quality of uncertainty estimates.
High-Resolution Image Editing via Multi-Stage Blended Diffusion
October 24, 2022
Johannes Ackermann, Minjun Li
Diffusion models have shown great results in image generation and in image
editing. However, current approaches are limited to low resolutions due to the
computational cost of training diffusion models for high-resolution generation.
We propose an approach that uses a pre-trained low-resolution diffusion model
to edit images in the megapixel range. We first use Blended Diffusion to edit
the image at a low resolution, and then upscale it in multiple stages, using a
super-resolution model and Blended Diffusion. Using our approach, we achieve
higher visual fidelity than by only applying off-the-shelf super-resolution
methods to the output of the diffusion model. We also obtain better global
consistency than directly using the diffusion model at a higher resolution.
Deep Equilibrium Approaches to Diffusion Models
October 23, 2022
Ashwini Pokle, Zhengyang Geng, Zico Kolter
Diffusion-based generative models are extremely effective in generating
high-quality images, with generated samples often surpassing the quality of
those produced by other models under several metrics. One distinguishing
feature of these models, however, is that they typically require long sampling
chains to produce high-fidelity images. This presents a challenge not only from
the lenses of sampling time, but also from the inherent difficulty in
backpropagating through these chains in order to accomplish tasks such as model
inversion, i.e. approximately finding latent states that generate known images.
In this paper, we look at diffusion models through a different perspective,
that of a (deep) equilibrium (DEQ) fixed point model. Specifically, we extend
the recent denoising diffusion implicit model (DDIM; Song et al. 2020), and
model the entire sampling chain as a joint, multivariate fixed point system.
This setup provides an elegant unification of diffusion and equilibrium models,
and shows benefits in 1) single image sampling, as it replaces the typical
fully-serial sampling process with a parallel one; and 2) model inversion, where we
can leverage fast gradients in the DEQ setting to much more quickly find the
noise that generates a given image. The approach is also orthogonal and thus
complementary to other methods used to reduce the sampling time, or improve
model inversion. We demonstrate our method’s strong performance across several
datasets, including CIFAR10, CelebA, and LSUN Bedrooms and Churches.
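A hedged sketch of the fixed-point view: keep all intermediate states of a chain x_k = F(x_{k-1}, t_k) and update them with parallel (Jacobi-style) sweeps instead of a serial rollout; after at most T sweeps the joint state equals the serial solution. The toy transition F stands in for a DDIM step, and the chain length and dimensions are arbitrary assumptions.

import numpy as np

T, dim = 20, 8
rng = np.random.default_rng(0)
x_T = rng.standard_normal(dim)                     # fixed initial noise

def F(x, t):
    return 0.9 * x + 0.1 * np.tanh(x) / (t + 1)    # stand-in for one DDIM update

# states[k] approximates the k-th state of the chain; states[0] = x_T is fixed.
states = np.tile(x_T, (T + 1, 1))
for _ in range(T):                                  # parallel (Jacobi) sweeps
    prev = states.copy()
    for k in range(1, T + 1):
        states[k] = F(prev[k - 1], T - k)           # every state updated simultaneously

# Serial reference rollout matches the joint fixed point.
x = x_T.copy()
for k in range(1, T + 1):
    x = F(x, T - k)
assert np.allclose(x, states[-1])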
Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model
October 22, 2022
Zhiyuan Ren, Zhihong Pan, Xin Zhou, Le Kang
We propose a simple and novel method for generating 3D human motion from
complex natural language sentences, which describe different velocity,
direction and composition of all kinds of actions. Different from existing
methods that use classical generative architecture, we apply the Denoising
Diffusion Probabilistic Model to this task, synthesizing diverse motion results
under the guidance of texts. The diffusion model converts white noise into
structured 3D motion by a Markov process with a series of denoising steps and
is efficiently trained by optimizing a variational lower bound. To achieve
text-conditioned motion synthesis, we use the classifier-free guidance strategy
to fuse the text embedding into the model during training. Our experiments
demonstrate that our model achieves competitive quantitative results on the
HumanML3D test set and can generate more visually natural and diverse examples. We
also show with experiments that our model is capable of zero-shot generation of
motions for unseen text guidance.
Score-based Denoising Diffusion with Non-Isotropic Gaussian Noise Models
October 21, 2022
Vikram Voleti, Christopher Pal, Adam Oberman
Generative models based on denoising diffusion techniques have led to an
unprecedented increase in the quality and diversity of imagery that is now
possible to create with neural generative models. However, most contemporary
state-of-the-art methods are derived from a standard isotropic Gaussian
formulation. In this work we examine the situation where non-isotropic Gaussian
distributions are used. We present the key mathematical derivations for
creating denoising diffusion models using an underlying non-isotropic Gaussian
noise model. We also provide initial experiments with the CIFAR-10 dataset to
help verify empirically that this more general modeling approach can also yield
high-quality samples.
Conditional Diffusion with Less Explicit Guidance via Model Predictive Control
October 21, 2022
Max W. Shen, Ehsan Hajiramezanali, Gabriele Scalia, Alex Tseng, Nathaniel Diamant, Tommaso Biancalani, Andreas Loukas
How much explicit guidance is necessary for conditional diffusion? We
consider the problem of conditional sampling using an unconditional diffusion
model and limited explicit guidance (e.g., a noised classifier, or a
conditional diffusion model) that is restricted to a small number of time
steps. We explore a model predictive control (MPC)-like approach to approximate
guidance by simulating unconditional diffusion forward, and backpropagating
explicit guidance feedback. MPC-approximated guides have high cosine similarity
to real guides, even over large simulation distances. Adding MPC steps improves
generative quality when explicit guidance is limited to five time steps.
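A minimal sketch of the MPC-style approximation, assuming a toy unconditional denoiser, a toy classifier acting as the explicit guide, and a short simulated rollout: the rollout is differentiated with respect to the current state to obtain an approximate guidance gradient. All modules, the rollout length, and the guidance scale are illustrative placeholders, not the paper's setup.

import torch

torch.manual_seed(0)
dim, horizon = 16, 5

denoiser = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.Tanh())
classifier = torch.nn.Linear(dim, 2)               # toy guide; class 1 is the target

x_t = torch.randn(1, dim, requires_grad=True)

x = x_t
for _ in range(horizon):                           # simulate unconditional diffusion forward
    x = x + 0.1 * denoiser(x)                      # stand-in for one reverse step

# Evaluate the explicit guide at the simulated endpoint and backpropagate to x_t.
log_p_target = torch.log_softmax(classifier(x), dim=-1)[:, 1].sum()
guide_grad = torch.autograd.grad(log_p_target, x_t)[0]

x_t_guided = (x_t + 0.5 * guide_grad).detach()     # apply the approximated guidance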
Multitask Brain Tumor Inpainting with Diffusion Models: A Methodological Report
October 21, 2022
Pouria Rouzrokh, Bardia Khosravi, Shahriar Faghani, Mana Moassefi, Sanaz Vahdati, Bradley J. Erickson
Despite the ever-increasing interest in applying deep learning (DL) models to
medical imaging, the typical scarcity and imbalance of medical datasets can
severely impact the performance of DL models. The generation of synthetic data
that might be freely shared without compromising patient privacy is a
well-known technique for addressing these difficulties. Inpainting algorithms
are a subset of DL generative models that can alter one or more regions of an
input image while matching its surrounding context and, in certain cases,
non-imaging input conditions. Although the majority of inpainting techniques
for medical imaging data use generative adversarial networks (GANs), the
performance of these algorithms is frequently suboptimal due to their limited
output variety, a problem that is already well-known for GANs. Denoising
diffusion probabilistic models (DDPMs) are a recently introduced family of
generative networks that can generate results of comparable quality to GANs,
but with diverse outputs. In this paper, we describe a DDPM to execute multiple
inpainting tasks on 2D axial slices of brain MRI with various sequences, and
present proof-of-concept examples of its performance in a variety of evaluation
scenarios. Our model and a public online interface to try our tool are
available at: https://github.com/Mayo-Radiology-Informatics-Lab/MBTI
Boomerang: Local sampling on image manifolds using diffusion models
October 21, 2022
Lorenzo Luzi, Ali Siahkoohi, Paul M Mayer, Josue Casco-Rodriguez, Richard Baraniuk
Diffusion models can be viewed as mapping points in a high-dimensional latent
space onto a low-dimensional learned manifold, typically an image manifold. The
intermediate values between the latent space and image manifold can be
interpreted as noisy images which are determined by the noise scheduling scheme
employed during pre-training. We exploit this interpretation to introduce
Boomerang, a local image manifold sampling approach using the dynamics of
diffusion models. We call it Boomerang because we first add noise to an input
image, moving it closer to the latent space, then bring it back to the image
space through diffusion dynamics. We use this method to generate images which
are similar, but nonidentical, to the original input images on the image
manifold. We are able to set how close the generated image is to the original
based on how much noise we add. Additionally, the generated images have a
degree of stochasticity, allowing us to locally sample as many times as we want
without repetition. We show three applications for which Boomerang can be used.
First, we provide a framework for constructing privacy-preserving datasets
having controllable degrees of anonymity. Second, we show how to use Boomerang
for data augmentation while staying on the image manifold. Third, we introduce
a framework for image super-resolution with 8x upsampling. Boomerang does not
require any modification to the training of diffusion models and can be used
with pretrained models on a single, inexpensive GPU.
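A rough sketch of the local sampling loop described above: noise the input image to a chosen intermediate step with the forward rule, then run the reverse dynamics back to the image manifold; repeating with fresh noise yields several similar but non-identical variants, and the chosen step controls how close they stay to the input. The stand-in reverse step and linear beta schedule are illustrative; in practice a pretrained diffusion model is used unchanged.

import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def reverse_step(x_t, t):
    return 0.999 * x_t                             # stand-in for a pretrained model step

image = np.random.rand(3, 64, 64)
t_boom = 300                                       # larger t => more variation, less fidelity

variants = []
for _ in range(4):                                 # local samples around the same image
    eps = np.random.randn(*image.shape)
    x_t = np.sqrt(alphas_bar[t_boom]) * image + np.sqrt(1.0 - alphas_bar[t_boom]) * eps
    for t in range(t_boom, -1, -1):
        x_t = reverse_step(x_t, t)
    variants.append(x_t)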
Diffusion Visual Counterfactual Explanations
October 21, 2022
Maximilian Augustin, Valentyn Boreiko, Francesco Croce, Matthias Hein
Visual Counterfactual Explanations (VCEs) are an important tool to understand
the decisions of an image classifier. They are ‘small’ but ‘realistic’ semantic
changes of the image that change the classifier’s decision. Current approaches for
the generation of VCEs are restricted to adversarially robust models and often
contain non-realistic artefacts, or are limited to image classification
problems with few classes. In this paper, we overcome this by generating
Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet
classifiers via a diffusion process. Two modifications to the diffusion process
are key for our DVCEs: first, an adaptive parameterization, whose
hyperparameters generalize across images and models, together with distance
regularization and late start of the diffusion process, allow us to generate
images with minimal semantic changes to the original ones but different
classification. Second, our cone regularization via an adversarially robust
model ensures that the diffusion process does not converge to trivial
non-semantic changes, but instead produces realistic images of the target class
which achieve high confidence by the classifier.
Graphically Structured Diffusion Models
October 20, 2022
Christian Weilbach, William Harvey, Frank Wood
We introduce a framework for automatically defining and learning deep
generative models with problem-specific structure. We tackle problem domains
that are more traditionally solved by algorithms such as sorting, constraint
satisfaction for Sudoku, and matrix factorization. Concretely, we train
diffusion models with an architecture tailored to the problem specification.
This problem specification should contain a graphical model describing
relationships between variables, and often benefits from explicit
representation of subcomputations. Permutation invariances can also be
exploited. Across a diverse set of experiments we improve the scaling
relationship between problem dimension and our model’s performance, in terms of
both training time and final accuracy. Our code can be found at
https://github.com/plai-group/gsdm.
Representation Learning with Diffusion Models
October 20, 2022
Jeremias Traub
Diffusion models (DMs) have achieved state-of-the-art results for image
synthesis tasks as well as density estimation. Applied in the latent space of a
powerful pretrained autoencoder (LDM), their immense computational requirements
can be significantly reduced without sacrificing sampling quality. However, DMs
and LDMs lack a semantically meaningful representation space as the diffusion
process gradually destroys information in the latent variables. We introduce a
framework for learning such representations with diffusion models (LRDM). To
that end, an LDM is conditioned on the representation extracted from the clean
image by a separate encoder. In particular, the DM and the representation
encoder are trained jointly in order to learn rich representations specific to
the generative denoising process. By introducing a tractable representation
prior, we can efficiently sample from the representation distribution for
unconditional image synthesis without training of any additional model. We
demonstrate that i) competitive image generation results can be achieved with
image-parameterized LDMs, ii) LRDMs are capable of learning semantically
meaningful representations, allowing for faithful image reconstructions and
semantic interpolations. Our implementation is available at
https://github.com/jeremiastraub/diffusion.
Differentially Private Diffusion Models
October 18, 2022
Tim Dockhorn, Tianshi Cao, Arash Vahdat, Karsten Kreis
While modern machine learning models rely on increasingly large training
datasets, data is often limited in privacy-sensitive domains. Generative models
trained with differential privacy (DP) on sensitive data can sidestep this
challenge, providing access to synthetic data instead. We build on the recent
success of diffusion models (DMs) and introduce Differentially Private
Diffusion Models (DPDMs), which enforce privacy using differentially private
stochastic gradient descent (DP-SGD). We investigate the DM parameterization
and the sampling algorithm, which turn out to be crucial ingredients in DPDMs,
and propose noise multiplicity, a powerful modification of DP-SGD tailored to
the training of DMs. We validate our novel DPDMs on image generation benchmarks
and achieve state-of-the-art performance in all experiments. Moreover, on
standard benchmarks, classifiers trained on DPDM-generated synthetic data
perform on par with task-specific DP-SGD-trained classifiers, which has not
been demonstrated before for DP generative models. Project page and code:
https://nv-tlabs.github.io/DPDM.
Improving Adversarial Robustness by Contrastive Guided Diffusion Process
October 18, 2022
Yidong Ouyang, Liyan Xie, Guang Cheng
Synthetic data generation has become an emerging tool to help improve the
adversarial robustness in classification tasks since robust learning requires a
significantly larger amount of training samples compared with standard
classification tasks. Among various deep generative models, the diffusion model
has been shown to produce high-quality synthetic images and has achieved good
performance in improving the adversarial robustness. However, diffusion-type
methods are typically slow in data generation as compared with other generative
models. Although different acceleration techniques have been proposed recently,
it is also of great importance to study how to improve the sample efficiency of
generated data for the downstream task. In this paper, we first analyze the
optimality condition of synthetic distribution for achieving non-trivial robust
accuracy. We show that enhancing the distinguishability among the generated
data is critical for improving adversarial robustness. Thus, we propose the
Contrastive-Guided Diffusion Process (Contrastive-DP), which adopts the
contrastive loss to guide the diffusion model in data generation. We verify our
theoretical results using simulations and demonstrate the good performance of
Contrastive-DP on image datasets.
Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation
October 18, 2022
Ruijun Li, Weihua Li, Yi Yang, Hanyu Wei, Jianhua Jiang, Quan Bai
cs.CV, cs.LG, 94A08, I.4.0
Recently, diffusion models have been proven to perform remarkably well in
text-to-image synthesis tasks in a number of studies, immediately presenting
new study opportunities for image generation. Google’s Imagen follows this
research trend and outperforms DALLE2 as the best model for text-to-image
generation. However, Imagen merely uses a T5 language model for text
processing, which cannot ensure learning the semantic information of the text.
Furthermore, the Efficient UNet leveraged by Imagen is not the best choice in
image processing. To address these issues, we propose the Swinv2-Imagen, a
novel text-to-image diffusion model based on a Hierarchical Visual Transformer
and a Scene Graph incorporating a semantic layout. In the proposed model, the
feature vectors of entities and relationships are extracted and involved in the
diffusion model, effectively improving the quality of generated images. On top
of that, we also introduce a Swin-Transformer-based UNet architecture, called
Swinv2-Unet, which can address the problems stemming from the CNN convolution
operations. Extensive experiments are conducted to evaluate the performance of
the proposed model by using three real-world datasets, i.e., MSCOCO, CUB and
MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen
model outperforms several popular state-of-the-art methods.
UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image
October 17, 2022
Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, Yaniv Leviathan
Text-driven image generation methods have shown impressive results recently,
allowing casual users to generate high quality images by providing textual
descriptions. However, similar capabilities for editing existing images are
still out of reach. Text-driven image editing methods usually need edit masks,
struggle with edits that require significant visual changes and cannot easily
keep specific details of the edited portion. In this paper we make the
observation that image-generation models can be converted to image-editing
models simply by fine-tuning them on a single image. We also show that
initializing the stochastic sampler with a noised version of the base image
before the sampling and interpolating relevant details from the base image
after sampling further increase the quality of the edit operation. Combining
these observations, we propose UniTune, a novel image editing method. UniTune
gets as input an arbitrary image and a textual edit description, and carries
out the edit while maintaining high fidelity to the input image. UniTune does
not require additional inputs, like masks or sketches, and can perform multiple
edits on the same image without retraining. We test our method using the Imagen
model in a range of different use cases. We demonstrate that it is broadly
applicable and can perform a surprisingly wide range of expressive editing
operations, including those requiring significant visual changes that were
previously impossible.
Imagic: Text-Based Real Image Editing with Diffusion Models
October 17, 2022
Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, Michal Irani
Text-conditioned image editing has recently attracted considerable interest.
However, most methods are currently either limited to specific editing types
(e.g., object overlay, style transfer), or apply to synthetically generated
images, or require multiple input images of a common object. In this paper we
demonstrate, for the very first time, the ability to apply complex (e.g.,
non-rigid) text-guided semantic edits to a single real image. For example, we
can change the posture and composition of one or multiple objects inside an
image, while preserving its original characteristics. Our method can make a
standing dog sit down or jump, cause a bird to spread its wings, etc. – each
within a single high-resolution natural image provided by the user. Contrary
to previous work, our proposed method requires only a single input image and a
target text (the desired edit). It operates on real images, and does not
require any additional inputs (such as image masks or additional views of the
object). Our method, which we call “Imagic”, leverages a pre-trained
text-to-image diffusion model for this task. It produces a text embedding that
aligns with both the input image and the target text, while fine-tuning the
diffusion model to capture the image-specific appearance. We demonstrate the
quality and versatility of our method on numerous inputs from various domains,
showcasing a plethora of high quality complex semantic image edits, all within
a single unified framework.
DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models
October 17, 2022
Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
Recently, diffusion models have emerged as a new paradigm for generative
models. Despite the success in domains using continuous signals such as vision
and audio, adapting diffusion models to natural language is under-explored due
to the discrete nature of texts, especially for conditional generation. We
tackle this challenge by proposing DiffuSeq: a diffusion model designed for
sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation
over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or
even better performance than six established baselines, including a
state-of-the-art model that is based on pre-trained language models. Apart from
quality, an intriguing property of DiffuSeq is its high diversity during
generation, which is desired in many Seq2Seq tasks. We further include a
theoretical analysis revealing the connection between DiffuSeq and
autoregressive/non-autoregressive models. Bringing together theoretical
analysis and empirical evidence, we demonstrate the great potential of
diffusion models in complex conditional language generation tasks. Code is
available at https://github.com/Shark-NLP/DiffuSeq.
DiffGAR: Model-Agnostic Restoration from Generative Artifacts Using Image-to-Image Diffusion Models
October 16, 2022
Yueqin Yin, Lianghua Huang, Yu Liu, Kaiqi Huang
Recent generative models show impressive results in photo-realistic image
generation. However, artifacts often inevitably appear in the generated
results, leading to downgraded user experience and reduced performance in
downstream tasks. This work aims to develop a plugin post-processing module for
diverse generative models, which can faithfully restore images from diverse
generative artifacts. This is challenging because: (1) Unlike traditional
degradation patterns, generative artifacts are non-linear and the
transformation function is highly complex. (2) There are no readily available
artifact-image pairs. (3) Different from model-specific anti-artifact methods,
a model-agnostic framework views the generator as a black-box machine and has
no access to the architecture details. In this work, we first design a group of
mechanisms to simulate generative artifacts of popular generators (i.e., GANs,
autoregressive models, and diffusion models), given real images. Second, we
implement the model-agnostic anti-artifact framework as an image-to-image
diffusion model, due to its advantage in generation quality and capacity.
Finally, we design a conditioning scheme for the diffusion model to enable both
blind and non-blind image restoration. A guidance parameter is also introduced
to allow for a trade-off between restoration accuracy and image quality.
Extensive experiments show that our method significantly outperforms previous
approaches on the proposed datasets and real-world artifact images.
TransFusion: Transcribing Speech with Multinomial Diffusion
October 14, 2022
Matthew Baas, Kevin Eloff, Herman Kamper
Diffusion models have shown exceptional scaling properties in the image
synthesis domain, and initial attempts have shown similar benefits for applying
diffusion to unconditional text synthesis. Denoising diffusion models attempt
to iteratively refine a sampled noise signal until it resembles a coherent
signal (such as an image or written sentence). In this work we aim to see
whether the benefits of diffusion models can also be realized for speech
recognition. To this end, we propose a new way to perform speech recognition
using a diffusion model conditioned on pretrained speech features.
Specifically, we propose TransFusion: a transcribing diffusion model which
iteratively denoises a random character sequence into coherent text
corresponding to the transcript of a conditioning utterance. We demonstrate
comparable performance to existing high-performing contrastive models on the
LibriSpeech speech recognition benchmark. To the best of our knowledge, we are
the first to apply denoising diffusion to speech recognition. We also propose
new techniques for effectively sampling and decoding multinomial diffusion
models. These are required because traditional methods of sampling from
acoustic models are not possible with our new discrete diffusion approach. Code
and trained models are available: https://github.com/RF5/transfusion-asr
Hierarchical Diffusion Models for Singing Voice Neural Vocoder
October 14, 2022
Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji
Recent progress in deep generative models has improved the quality of neural
vocoders in speech domain. However, generating a high-quality singing voice
remains challenging due to a wider variety of musical expressions in pitch,
loudness, and pronunciations. In this work, we propose a hierarchical diffusion
model for singing voice neural vocoders. The proposed method consists of
multiple diffusion models operating at different sampling rates; the model at
the lowest sampling rate focuses on generating accurate low-frequency
components such as pitch, and other models progressively generate the waveform
at higher sampling rates on the basis of the data at the lower sampling rate
and acoustic features. Experimental results show that the proposed method
produces high-quality singing voices for multiple singers, outperforming
state-of-the-art neural vocoders with a similar range of computational costs.
DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models
October 13, 2022
Zeyang Sha, Zheng Li, Ning Yu, Yang Zhang
Text-to-image generation models that generate images based on prompt
descriptions have attracted an increasing amount of attention during the past
few months. Despite their encouraging performance, these models raise concerns
about the misuse of their generated fake images. To tackle this problem, we
pioneer a systematic study on the detection and attribution of fake images
generated by text-to-image generation models. Concretely, we first build a
machine learning classifier to detect the fake images generated by various
text-to-image generation models. We then attribute these fake images to their
source models, such that model owners can be held responsible for their models’
misuse. We further investigate how prompts that generate fake images affect
detection and attribution. We conduct extensive experiments on four popular
text-to-image generation models, including DALL$\cdot$E 2, Stable Diffusion,
GLIDE, and Latent Diffusion, and two benchmark prompt-image datasets. Empirical
results show that (1) fake images generated by various models can be
distinguished from real ones, as there exists a common artifact shared by fake
images from different models; (2) fake images can be effectively attributed to
their source models, as different models leave unique fingerprints in their
generated images; (3) prompts with the "person" topic or a length between 25
and 75 enable models to generate fake images with higher authenticity. All
findings contribute to the community’s insight into the threats caused by
text-to-image generation models. We appeal to the community’s consideration of
the counterpart solutions, like ours, against the rapidly-evolving fake image
generation.
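A minimal, hypothetical sketch of the detection component described above: a binary real-vs-fake classifier trained on frozen image embeddings. The feature extractor, dimensions, and data are stand-ins; the paper additionally studies prompt-aware (hybrid) detectors and a multi-class attributor, which this sketch omits.

```python
import torch
import torch.nn as nn

# Minimal real-vs-fake detector operating on precomputed image embeddings
# (e.g. from any frozen vision encoder); feature extraction itself is omitted.
class FakeImageDetector(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, feats):
        return self.head(feats).squeeze(-1)   # logits: > 0 means "fake"

detector = FakeImageDetector()
opt = torch.optim.Adam(detector.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Toy batch: label 0 = real photograph, 1 = generated image.
feats = torch.randn(32, 512)
labels = torch.randint(0, 2, (32,)).float()
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(detector(feats), labels)
    loss.backward()
    opt.step()
```

Swapping the single-logit head for a K-way head over candidate generators would give the attribution variant.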
Action Matching: A Variational Method for Learning Stochastic Dynamics from Samples
October 13, 2022
Kirill Neklyudov, Rob Brekelmans, Daniel Severo, Alireza Makhzani
Learning the continuous dynamics of a system from snapshots of its temporal
marginals is a problem which appears throughout natural sciences and machine
learning, including in quantum systems, single-cell biological data, and
generative modeling. In these settings, we assume access to cross-sectional
samples that are uncorrelated over time, rather than full trajectories of
samples. In order to better understand the systems under observation, we would
like to learn a model of the underlying process that allows us to propagate
samples in time and thereby simulate entire individual trajectories. In this
work, we propose Action Matching, a method for learning a rich family of
dynamics using only independent samples from its time evolution. We derive a
tractable training objective, which does not rely on explicit assumptions about
the underlying dynamics and does not require back-propagation through
differential equations or optimal transport solvers. Inspired by connections
with optimal transport, we derive extensions of Action Matching to learn
stochastic differential equations and dynamics involving creation and
destruction of probability mass. Finally, we showcase applications of Action
Matching by achieving competitive performance in a diverse set of experiments
from biology, physics, and generative modeling.
Self-Guided Diffusion Models
October 12, 2022
Vincent Tao Hu, David W Zhang, Yuki M. Asano, Gertjan J. Burghouts, Cees G. M. Snoek
Diffusion models have demonstrated remarkable progress in image generation
quality, especially when guidance is used to control the generative process.
However, guidance requires a large amount of image-annotation pairs for
training and is thus dependent on their availability, correctness and
unbiasedness. In this paper, we eliminate the need for such annotation by
instead leveraging the flexibility of self-supervision signals to design a
framework for self-guided diffusion models. By leveraging a feature extraction
function and a self-annotation function, our method provides guidance signals
at various image granularities: from the level of holistic images to object
boxes and even segmentation masks. Our experiments on single-label and
multi-label image datasets demonstrate that self-labeled guidance always
outperforms diffusion models without guidance and may even surpass guidance
based on ground-truth labels, especially on unbalanced data. When equipped with
self-supervised box or mask proposals, our method further generates visually
diverse yet semantically consistent images, without the need for any class,
box, or segment label annotation. Self-guided diffusion is simple, flexible and
expected to profit from deployment at scale. Source code will be at:
https://taohu.me/sgdm/
Diffusion Models for Causal Discovery via Topological Ordering
October 12, 2022
Pedro Sanchez, Xiao Liu, Alison Q O'Neil, Sotirios A. Tsaftaris
Discovering causal relations from observational data becomes possible with
additional assumptions such as considering the functional relations to be
constrained as nonlinear with additive noise (ANM). Even with strong
assumptions, causal discovery involves an expensive search problem over the
space of directed acyclic graphs (DAGs). \emph{Topological ordering} approaches
reduce the optimisation space of causal discovery by searching over a
permutation rather than graph space. For ANMs, the \emph{Hessian} of the data
log-likelihood can be used for finding leaf nodes in a causal graph, allowing
its topological ordering. However, existing computational methods for obtaining
the Hessian still do not scale as the number of variables and the number of
samples increase. Therefore, inspired by recent innovations in diffusion
probabilistic models (DPMs), we propose \emph{DiffAN}, a topological
ordering algorithm that leverages DPMs for learning a Hessian function. We
introduce theory for updating the learned Hessian without re-training the
neural network, and we show that computing with a subset of samples gives an
accurate approximation of the ordering, which allows scaling to datasets with
more samples and variables. We show empirically that our method scales
exceptionally well to datasets with up to $500$ nodes and up to $10^5$ samples
while still performing on par over small datasets with state-of-the-art causal
discovery methods. Implementation is available at
https://github.com/vios-s/DiffAN .
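The sketch below illustrates the leaf-selection rule that score-based topological ordering of this kind relies on: for additive-noise models, the diagonal of the score's Jacobian is (near) constant across samples at leaf variables, so the coordinate with the smallest variance is removed as a leaf. The stand-in score_fn replaces the DPM-learned score, and DiffAN's Hessian-update and subsampling tricks are omitted.

```python
import torch

def hessian_diag(score_fn, x):
    # Diagonal of the Jacobian of the score, d s_i(x) / d x_i, per sample.
    x = x.clone().requires_grad_(True)
    s = score_fn(x)                                   # (n, d) estimated score
    diag = []
    for i in range(x.shape[1]):
        g = torch.autograd.grad(s[:, i].sum(), x, retain_graph=True)[0]
        diag.append(g[:, i])
    return torch.stack(diag, dim=1)                   # (n, d)

def pick_leaf(score_fn, x):
    # Select the coordinate whose Hessian-diagonal entry varies least across samples.
    hd = hessian_diag(score_fn, x)
    return hd.var(dim=0).argmin().item()

# Toy stand-in for a learned score model (in DiffAN this comes from a DPM).
# Note: this toy score is linear, so all variances are ~0 and the demo only
# shows the call pattern, not a meaningful ordering.
d = 5
W = torch.randn(d, d)
score_fn = lambda x: -(x @ (W.T @ W) / d)
x = torch.randn(200, d)
print("estimated leaf:", pick_leaf(score_fn, x))
```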
LION: Latent Point Diffusion Models for 3D Shape Generation
October 12, 2022
Xiaohui Zeng, Arash Vahdat, Francis Williams, Zan Gojcic, Or Litany, Sanja Fidler, Karsten Kreis
Denoising diffusion models (DDMs) have shown promising results in 3D point
cloud synthesis. To advance 3D DDMs and make them useful for digital artists,
we require (i) high generation quality, (ii) flexibility for manipulation and
applications such as conditional synthesis and shape interpolation, and (iii)
the ability to output smooth surfaces or meshes. To this end, we introduce the
hierarchical Latent Point Diffusion Model (LION) for 3D shape generation. LION
is set up as a variational autoencoder (VAE) with a hierarchical latent space
that combines a global shape latent representation with a point-structured
latent space. For generation, we train two hierarchical DDMs in these latent
spaces. The hierarchical VAE approach boosts performance compared to DDMs that
operate on point clouds directly, while the point-structured latents are still
ideally suited for DDM-based modeling. Experimentally, LION achieves
state-of-the-art generation performance on multiple ShapeNet benchmarks.
Furthermore, our VAE framework allows us to easily use LION for different
relevant tasks: LION excels at multimodal shape denoising and voxel-conditioned
synthesis, and it can be adapted for text- and image-driven 3D generation. We
also demonstrate shape autoencoding and latent shape interpolation, and we
augment LION with modern surface reconstruction techniques to generate smooth
3D meshes. We hope that LION provides a powerful tool for artists working with
3D shapes due to its high-quality generation, flexibility, and surface
reconstruction. Project page and code: https://nv-tlabs.github.io/LION.
Unifying Diffusion Models’ Latent Space, with Applications to CycleDiffusion and Guidance
October 11, 2022
Chen Henry Wu, Fernando De la Torre
Diffusion models have achieved unprecedented performance in generative
modeling. The commonly-adopted formulation of the latent code of diffusion
models is a sequence of gradually denoised samples, as opposed to the simpler
(e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper
provides an alternative, Gaussian formulation of the latent space of various
diffusion models, as well as an invertible DPM-Encoder that maps images into
the latent space. While our formulation is purely based on the definition of
diffusion models, we demonstrate several intriguing consequences. (1)
Empirically, we observe that a common latent space emerges from two diffusion
models trained independently on related domains. In light of this finding, we
propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image
translation. Furthermore, applying CycleDiffusion to text-to-image diffusion
models, we show that large-scale text-to-image diffusion models can be used as
zero-shot image-to-image editors. (2) One can guide pre-trained diffusion
models and GANs by controlling the latent codes in a unified, plug-and-play
formulation based on energy-based models. Using the CLIP model and a face
recognition model as guidance, we demonstrate that diffusion models have better
coverage of low-density sub-populations and individuals than GANs. The code is
publicly available at https://github.com/ChenWu98/cycle-diffusion.
GENIE: Higher-Order Denoising Diffusion Solvers
October 11, 2022
Tim Dockhorn, Arash Vahdat, Karsten Kreis
Denoising diffusion models (DDMs) have emerged as a powerful class of
generative models. A forward diffusion process slowly perturbs the data, while
a deep model learns to gradually denoise. Synthesis amounts to solving a
differential equation (DE) defined by the learnt model. Solving the DE requires
slow iterative solvers for high-quality generation. In this work, we propose
Higher-Order Denoising Diffusion Solvers (GENIE): Based on truncated Taylor
methods, we derive a novel higher-order solver that significantly accelerates
synthesis. Our solver relies on higher-order gradients of the perturbed data
distribution, that is, higher-order score functions. In practice, only
Jacobian-vector products (JVPs) are required and we propose to extract them
from the first-order score network via automatic differentiation. We then
distill the JVPs into a separate neural network that allows us to efficiently
compute the necessary higher-order terms for our novel sampler during
synthesis. We only need to train a small additional head on top of the
first-order score network. We validate GENIE on multiple image generation
benchmarks and demonstrate that GENIE outperforms all previous solvers. Unlike
recent methods that fundamentally alter the generation process in DDMs, our
GENIE solves the true generative DE and still enables applications such as
encoding and guided sampling. Project page and code:
https://nv-tlabs.github.io/GENIE.
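As a rough illustration of the truncated-Taylor idea GENIE builds on, the sketch below advances a probability-flow ODE dx/dt = f(x, t) to second order, obtaining the total derivative df/dt from a single JVP via torch.func.jvp (PyTorch 2.x assumed). GENIE instead distills this JVP into a small extra head, which is omitted here, and f is a toy drift rather than a trained score model.

```python
import torch
from torch.func import jvp

def taylor2_step(f, x, t, h):
    """One second-order truncated-Taylor step for dx/dt = f(x, t):
    x_{t+h} = x + h f + (h^2 / 2) df/dt, with the total derivative
    df/dt = d_t f + J_x f . f obtained from a single JVP along (f, 1)."""
    fx = f(x, t)
    _, df_dt = jvp(f, (x, t), (fx, torch.ones_like(t)))
    return x + h * fx + 0.5 * h ** 2 * df_dt

# Toy drift standing in for the learned probability-flow ODE of a DDM.
def f(x, t):
    return -x / (1.0 + t)

x = torch.randn(4, 3)
t = torch.tensor(1.0)
x_next = taylor2_step(f, x, t, h=-0.05)   # integrate backwards in time
print(x_next.shape)
```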
Equivariant 3D-Conditional Diffusion Models for Molecular Linker Design
October 11, 2022
Ilia Igashov, Hannes Stärk, Clément Vignac, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, Bruno Correia
Fragment-based drug discovery has been an effective paradigm in early-stage
drug development. An open challenge in this area is designing linkers between
disconnected molecular fragments of interest to obtain chemically-relevant
candidate drug molecules. In this work, we propose DiffLinker, an
E(3)-equivariant 3D-conditional diffusion model for molecular linker design.
Given a set of disconnected fragments, our model places missing atoms in
between and designs a molecule incorporating all the initial fragments. Unlike
previous approaches that are only able to connect pairs of molecular fragments,
our method can link an arbitrary number of fragments. Additionally, the model
automatically determines the number of atoms in the linker and its attachment
points to the input fragments. We demonstrate that DiffLinker outperforms other
methods on the standard datasets generating more diverse and
synthetically-accessible molecules. Besides, we experimentally test our method
in real-world applications, showing that it can successfully generate valid
linkers conditioned on target protein pockets.
DiffRoll: Diffusion-based Generative Music Transcription with Unsupervised Pretraining Capability
October 11, 2022
Kin Wai Cheuk, Ryosuke Sawata, Toshimitsu Uesaka, Naoki Murata, Naoya Takahashi, Shusuke Takahashi, Dorien Herremans, Yuki Mitsufuji
cs.SD, cs.AI, cs.LG, eess.AS
In this paper we propose a novel generative approach, DiffRoll, to tackle
automatic music transcription (AMT). Instead of treating AMT as a
discriminative task in which the model is trained to convert spectrograms into
piano rolls, we think of it as a conditional generative task where we train our
model to generate realistic looking piano rolls from pure Gaussian noise
conditioned on spectrograms. This new AMT formulation enables DiffRoll to
transcribe, generate and even inpaint music. Owing to its classifier-free
nature, DiffRoll can also be trained on unpaired datasets where only piano
rolls are available. Our experiments show that DiffRoll outperforms its
discriminative counterpart by 19 percentage points (ppt.) and our ablation
studies also indicate that it outperforms similar existing methods by 4.8 ppt.
Source code and demonstration are available https://sony.github.io/DiffRoll/.
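A minimal, hypothetical sketch of the classifier-free conditioning that enables training on unpaired piano rolls: the spectrogram condition is replaced by a null (zero) condition whenever it is missing, or at random with a small probability. The corruption schedule, model interface, and shapes are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

T = 1000
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, T), dim=0)

def training_step(model, piano_roll, spectrogram, p_uncond=0.1):
    """Denoising training step with classifier-free conditioning on spectrograms.
    Dropping the condition (for unpaired rolls, or at random with probability
    p_uncond) lets the same model double as an unconditional piano-roll prior."""
    b = piano_roll.shape[0]
    t = torch.randint(0, T, (b,))
    a = alpha_bar[t].view(b, 1, 1)
    noise = torch.randn_like(piano_roll)
    x_t = a.sqrt() * piano_roll + (1 - a).sqrt() * noise

    cond = spectrogram
    if cond is None or torch.rand(()) < p_uncond:
        cond = torch.zeros_like(piano_roll)     # "null" condition
    pred = model(x_t, cond, t)                  # here: predicts the clean roll
    return F.mse_loss(pred, piano_roll)

# Toy stand-ins with matching shapes (88 pitches x 128 frames).
model = lambda x, c, t: torch.tanh(x + c)
roll = torch.rand(8, 88, 128)
spec = torch.rand(8, 88, 128)
print(training_step(model, roll, spec).item())
```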
Markup-to-Image Diffusion Models with Scheduled Sampling
October 11, 2022
Yuntian Deng, Noriyuki Kojima, Alexander M. Rush
Building on recent advances in image generation, we present a fully
data-driven approach to rendering markup into images. The approach is based on
diffusion models, which parameterize the distribution of data using a sequence
of denoising operations on top of a Gaussian noise distribution. We view the
diffusion denoising process as a sequential decision making process, and show
that it exhibits compounding errors similar to exposure bias issues in
imitation learning problems. To mitigate these issues, we adapt the scheduled
sampling algorithm to diffusion training. We conduct experiments on four markup
datasets: mathematical formulas (LaTeX), table layouts (HTML), sheet music
(LilyPond), and molecular images (SMILES). These experiments each verify the
effectiveness of the diffusion process and the use of scheduled sampling to fix
generation issues. These results also show that the markup-to-image task
presents a useful controlled compositional setting for diagnosing and analyzing
generative image models.
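The sketch below shows one simple way to adapt scheduled sampling to epsilon-prediction diffusion training, as a hedged illustration rather than the paper's exact recipe: with some probability, the training input x_t is produced by re-noising the model's own x_0 estimate from step t+1, so the denoiser is exposed to inputs closer to those it encounters at sampling time.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)     # \bar{alpha}_t

def q_sample(x0, t, eps):
    a = alpha_bar[t].view(-1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps

def scheduled_sampling_step(model, x0, p_rollout=0.5):
    """With probability p_rollout, build x_t from the model's own x0 estimate at
    step t+1 instead of from the ground-truth sample (scheduled sampling)."""
    b = x0.shape[0]
    t = torch.randint(0, T - 1, (b,))
    eps = torch.randn_like(x0)
    if torch.rand(()) < p_rollout:
        with torch.no_grad():
            x_next = q_sample(x0, t + 1, torch.randn_like(x0))
            eps_hat = model(x_next, t + 1)
            a_next = alpha_bar[t + 1].view(-1, 1)
            x0_hat = (x_next - (1 - a_next).sqrt() * eps_hat) / a_next.sqrt()
        x_t = q_sample(x0_hat, t, eps)            # re-noise the model's own estimate
    else:
        x_t = q_sample(x0, t, eps)
    return F.mse_loss(model(x_t, t), eps)

model = lambda x, t: torch.zeros_like(x)          # toy stand-in for a denoiser
print(scheduled_sampling_step(model, torch.randn(16, 8)).item())
```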
f-DM: A Multi-stage Diffusion Model via Progressive Signal Transformation
October 10, 2022
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Miguel Angel Bautista, Josh Susskind
Diffusion models (DMs) have recently emerged as SoTA tools for generative
modeling in various domains. Standard DMs can be viewed as an instantiation of
hierarchical variational autoencoders (VAEs) where the latent variables are
inferred from input-centered Gaussian distributions with fixed scales and
variances. Unlike VAEs, this formulation limits DMs from changing the latent
spaces and learning abstract representations. In this work, we propose f-DM, a
generalized family of DMs which allows progressive signal transformation. More
precisely, we extend DMs to incorporate a set of (hand-designed or learned)
transformations, where the transformed input is the mean of each diffusion
step. We propose a generalized formulation and derive the corresponding
de-noising objective with a modified sampling algorithm. As a demonstration, we
apply f-DM in image generation tasks with a range of functions, including
down-sampling, blurring, and learned transformations based on the encoder of
pretrained VAEs. In addition, we identify the importance of adjusting the noise
levels whenever the signal is sub-sampled and propose a simple rescaling
recipe. f-DM can produce high-quality samples on standard image generation
benchmarks like FFHQ, AFHQ, LSUN, and ImageNet with better efficiency and
semantic interpretation.
What the DAAM: Interpreting Stable Diffusion Using Cross Attention
October 10, 2022
Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture
Large-scale diffusion neural networks represent a substantial milestone in
text-to-image generation, but they remain poorly understood, lacking
interpretability analyses. In this paper, we perform a text-image attribution
analysis on Stable Diffusion, a recently open-sourced model. To produce
pixel-level attribution maps, we upscale and aggregate cross-attention
word-pixel scores in the denoising subnetwork, naming our method DAAM. We
evaluate its correctness by testing its semantic segmentation ability on nouns,
as well as its generalized attribution quality on all parts of speech, rated by
humans. We then apply DAAM to study the role of syntax in the pixel space,
characterizing head–dependent heat map interaction patterns for ten common
dependency relations. Finally, we study several semantic phenomena using DAAM,
with a focus on feature entanglement, where we find that cohyponyms worsen
generation quality and descriptive adjectives attend too broadly. To our
knowledge, we are the first to interpret large diffusion models from a
visuolinguistic perspective, which enables future lines of research. Our code
is at https://github.com/castorini/daam.
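A small sketch of the aggregation step described above: per-layer, per-head cross-attention word-pixel maps are upscaled to a common resolution and summed into one heat map per token. The tensor layout and resolutions are illustrative assumptions; hooking the attention maps out of Stable Diffusion itself is omitted.

```python
import torch
import torch.nn.functional as F

def aggregate_attention(attn_maps, out_size=64):
    """Aggregate cross-attention scores into per-token heat maps (DAAM-style).

    attn_maps: list of tensors shaped (heads, h, w, tokens), one per
    cross-attention layer, at possibly different spatial resolutions.
    Returns: (tokens, out_size, out_size) heat maps.
    """
    acc = 0.0
    for a in attn_maps:
        heads, h, w, tokens = a.shape
        a = a.permute(0, 3, 1, 2).reshape(heads * tokens, 1, h, w)
        a = F.interpolate(a, size=(out_size, out_size),
                          mode="bicubic", align_corners=False)
        a = a.reshape(heads, tokens, out_size, out_size).sum(dim=0)
        acc = acc + a
    return acc

# Toy attention maps from three layers at different resolutions, 10 tokens each.
maps = [torch.rand(8, r, r, 10) for r in (8, 16, 32)]
heat = aggregate_attention(maps)
print(heat.shape)   # torch.Size([10, 64, 64])
```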
Sequential Neural Score Estimation: Likelihood-Free Inference with Conditional Score Based Diffusion Models
October 10, 2022
Louis Sharrock, Jack Simons, Song Liu, Mark Beaumont
We introduce Sequential Neural Posterior Score Estimation (SNPSE) and
Sequential Neural Likelihood Score Estimation (SNLSE), two new score-based
methods for Bayesian inference in simulator-based models. Our methods, inspired
by the success of score-based methods in generative modelling, leverage
conditional score-based diffusion models to generate samples from the posterior
distribution of interest. These models can be trained using one of two possible
objective functions, one of which approximates the score of the intractable
likelihood, while the other directly estimates the score of the posterior. We
embed these models into a sequential training procedure, which guides
simulations using the current approximation of the posterior at the observation
of interest, thereby reducing the simulation cost. We validate our methods, as
well as their amortised, non-sequential variants, on several numerical
examples, demonstrating comparable or superior performance to existing
state-of-the-art methods such as Sequential Neural Posterior Estimation (SNPE)
and Sequential Neural Likelihood Estimation (SNLE).
CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning
October 10, 2022
Shitong Xu
The image captioning task has been extensively researched in previous work.
However, few experiments focus on generating captions with a
non-autoregressive text decoder. Inspired by the recent success of the
denoising diffusion model on image synthesis tasks, we apply denoising
diffusion probabilistic models to text generation in image captioning tasks. We
show that our CLIP-Diffusion-LM is capable of generating image captions using
significantly fewer inference steps than autoregressive models. On the Flickr8k
dataset, the model achieves 0.1876 BLEU-4 score. By training on the combined
Flickr8k and Flickr30k dataset, our model achieves 0.2470 BLEU-4 score. Our
code is available at https://github.com/xu-shitong/diffusion-image-captioning.
Regularizing Score-based Models with Score Fokker-Planck Equations
October 09, 2022
Chieh-Hsin Lai, Yuhta Takida, Naoki Murata, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon
Score-based generative models (SGMs) learn a family of noise-conditional
score functions corresponding to the data density perturbed with increasingly
large amounts of noise. These perturbed data densities are linked together by
the Fokker-Planck equation (FPE), a partial differential equation (PDE)
governing the spatial-temporal evolution of a density undergoing a diffusion
process. In this work, we derive a corresponding equation called the score FPE
that characterizes the noise-conditional scores of the perturbed data densities
(i.e., their gradients). Surprisingly, despite the impressive empirical
performance, we observe that scores learned through denoising score matching
(DSM) fail to fulfill the underlying score FPE, which is an inherent
self-consistency property of the ground truth score. We prove that satisfying
the score FPE is desirable as it improves the likelihood and the degree of
conservativity. Hence, we propose to regularize the DSM objective to enforce
satisfaction of the score FPE, and we show the effectiveness of this approach
across various datasets.
STaSy: Score-based Tabular data Synthesis
October 08, 2022
Jayoung Kim, Chaejeong Lee, Noseong Park
Tabular data synthesis is a long-standing research topic in machine learning.
Many different methods have been proposed over the past decades, ranging from
statistical methods to deep generative methods. However, it has not always been
successful due to the complicated nature of real-world tabular data. In this
paper, we present a new model named Score-based Tabular data Synthesis (STaSy)
and its training strategy based on the paradigm of score-based generative
modeling. Despite the fact that score-based generative models have resolved
many issues in generative models, there still exists room for improvement in
tabular data synthesis. Our proposed training strategy includes a self-paced
learning technique and a fine-tuning strategy, which further increases the
sampling quality and diversity by stabilizing the denoising score matching
training. Furthermore, we also conduct rigorous experimental studies in terms
of the generative task trilemma: sampling quality, diversity, and time. In our
experiments with 15 benchmark tabular datasets and 7 baselines, our method
outperforms existing methods in terms of task-dependent evaluations and
diversity. Code is available at https://github.com/JayoungKim408/STaSy.
Efficient Diffusion Models for Vision: A Survey
October 07, 2022
Anwaar Ulhaq, Naveed Akhtar, Ganna Pogrebna
Diffusion Models (DMs) have demonstrated state-of-the-art performance in
content generation without requiring adversarial training. These models are
trained using a two-step process. First, a forward (diffusion) process
gradually adds noise to a datum (usually an image). Then, a backward (reverse
diffusion) process gradually removes the noise to turn it into a sample of the
target distribution being modelled. DMs are inspired by non-equilibrium
thermodynamics and have inherent high computational complexity. Due to the
frequent function evaluations and gradient calculations in high-dimensional
spaces, these models incur considerable computational overhead during both
training and inference stages. This can not only preclude the democratization
of diffusion-based modelling, but also hinder the adoption of diffusion models
in real-life applications. Moreover, the efficiency of computational models is
fast becoming a significant concern due to excessive energy consumption and
environmental impact. These factors have led to multiple
contributions in the literature that focus on devising computationally
efficient DMs. In this review, we present the most recent advances in diffusion
models for vision, specifically focusing on the important design aspects that
affect the computational efficiency of DMs. In particular, we emphasize the
recently proposed design choices that have led to more efficient DMs. Unlike
the other recent reviews, which discuss diffusion models from a broad
perspective, this survey is aimed at pushing this research direction forward by
highlighting the design strategies in the literature that are resulting in
practicable models for the broader research community. We also provide a future
outlook of diffusion models in vision from their computational efficiency
viewpoint.
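To make the inference cost discussed above concrete, here is a generic ancestral DDPM sampling loop (not tied to any particular surveyed method): each of the T steps requires one network evaluation, which is exactly the budget that efficient samplers, distillation, and lighter architectures aim to shrink. The noise schedule and the stand-in denoiser are illustrative.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(eps_model, shape):
    """Ancestral DDPM sampling: one network evaluation per step, T steps total."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = eps_model(x, t)
        coef = betas[t] / (1 - alpha_bar[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise          # simple fixed-variance choice
    return x

eps_model = lambda x, t: torch.zeros_like(x)        # stand-in for a trained denoiser
print(ddpm_sample(eps_model, (2, 3, 8, 8)).shape)
```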
On Distillation of Guided Diffusion Models
October 06, 2022
Chenlin Meng, Robin Rombach, Ruiqi Gao, Diederik P. Kingma, Stefano Ermon, Jonathan Ho, Tim Salimans
Classifier-free guided diffusion models have recently been shown to be highly
effective at high-resolution image generation, and they have been widely used
in large-scale diffusion frameworks including DALLE-2, Stable Diffusion and
Imagen. However, a downside of classifier-free guided diffusion models is that
they are computationally expensive at inference time since they require
evaluating two diffusion models, a class-conditional model and an unconditional
model, tens to hundreds of times. To deal with this limitation, we propose an
approach to distilling classifier-free guided diffusion models into models that
are fast to sample from: Given a pre-trained classifier-free guided model, we
first learn a single model to match the output of the combined conditional and
unconditional models, and then we progressively distill that model to a
diffusion model that requires much fewer sampling steps. For standard diffusion
models trained on the pixel-space, our approach is able to generate images
visually comparable to that of the original model using as few as 4 sampling
steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to
that of the original model while being up to 256 times faster to sample from.
For diffusion models trained on the latent-space (e.g., Stable Diffusion), our
approach is able to generate high-fidelity images using as few as 1 to 4
denoising steps, accelerating inference by at least 10-fold compared to
existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate
the effectiveness of our approach on text-guided image editing and inpainting,
where our distilled model is able to generate high-quality results using as few
as 2-4 denoising steps.
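A minimal sketch of the first distillation stage described above: a student with an extra guidance-weight input is regressed onto the teacher's classifier-free-guided output, so that sampling later needs a single forward pass per step instead of two. The model interfaces and guidance-weight range are placeholder assumptions; the second, progressive-distillation stage is omitted.

```python
import torch
import torch.nn.functional as F

def stage1_distill_loss(teacher, student, x_t, t, cond, null_cond):
    """Match the student's single forward pass to the teacher's guided prediction."""
    w = torch.empty(x_t.shape[0], 1).uniform_(0.0, 4.0)   # random guidance weight
    with torch.no_grad():
        e_cond = teacher(x_t, t, cond)
        e_uncond = teacher(x_t, t, null_cond)
        e_guided = (1.0 + w) * e_cond - w * e_uncond      # classifier-free guidance
    e_student = student(x_t, t, cond, w)                   # student takes w as input
    return F.mse_loss(e_student, e_guided)

# Toy stand-ins with matching shapes.
teacher = lambda x, t, c: x * 0.1 + c
student = lambda x, t, c, w: x * 0.1 + c * w
x_t, cond, null_cond = torch.randn(8, 16), torch.randn(8, 16), torch.zeros(8, 16)
print(stage1_distill_loss(teacher, student, x_t, torch.zeros(8), cond, null_cond).item())
```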
Novel View Synthesis with Diffusion Models
October 06, 2022
Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, Mohammad Norouzi
We present 3DiM, a diffusion model for 3D novel view synthesis, which is able
to translate a single input view into consistent and sharp completions across
many views. The core component of 3DiM is a pose-conditional image-to-image
diffusion model, which takes a source view and its pose as inputs, and
generates a novel view for a target pose as output. 3DiM can generate multiple
views that are 3D consistent using a novel technique called stochastic
conditioning. The output views are generated autoregressively, and during the
generation of each novel view, one selects a random conditioning view from the
set of available views at each denoising step. We demonstrate that stochastic
conditioning significantly improves the 3D consistency of a naive sampler for
an image-to-image diffusion model, which involves conditioning on a single
fixed view. We compare 3DiM to prior work on the SRN ShapeNet dataset,
demonstrating that 3DiM’s generated completions from a single view achieve much
higher fidelity, while being approximately 3D consistent. We also introduce a
new evaluation methodology, 3D consistency scoring, to measure the 3D
consistency of a generated object by training a neural field on the model’s
output views. 3DiM is geometry free, does not rely on hyper-networks or
test-time optimization for novel view synthesis, and allows a single model to
easily scale to a large number of scenes.
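A generic sketch of the stochastic-conditioning sampler described above: at every denoising step, the pose-conditional image-to-image denoiser is conditioned on a randomly drawn view from the set of available views. The denoise_step interface, shapes, and poses are stand-ins rather than 3DiM's implementation.

```python
import torch

@torch.no_grad()
def sample_novel_view(denoise_step, available_views, target_pose,
                      steps=256, shape=(3, 64, 64)):
    """Stochastic conditioning: re-draw the conditioning view at each step so the
    final sample is (approximately) consistent with all available views."""
    x = torch.randn(1, *shape)
    for t in reversed(range(steps)):
        idx = torch.randint(len(available_views), (1,)).item()
        src_img, src_pose = available_views[idx]
        x = denoise_step(x, t, src_img, src_pose, target_pose)
    return x

# Toy stand-ins: two "available" views with poses, and a dummy denoising step.
views = [(torch.rand(1, 3, 64, 64), torch.eye(4)) for _ in range(2)]
denoise_step = lambda x, t, img, pose_src, pose_tgt: 0.99 * x + 0.01 * img
novel = sample_novel_view(denoise_step, views, target_pose=torch.eye(4))
print(novel.shape)
```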
Flow Matching for Generative Modeling
October 06, 2022
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, Matt Le
We introduce a new paradigm for generative modeling built on Continuous
Normalizing Flows (CNFs), allowing us to train CNFs at unprecedented scale.
Specifically, we present the notion of Flow Matching (FM), a simulation-free
approach for training CNFs based on regressing vector fields of fixed
conditional probability paths. Flow Matching is compatible with a general
family of Gaussian probability paths for transforming between noise and data
samples – which subsumes existing diffusion paths as specific instances.
Interestingly, we find that employing FM with diffusion paths results in a more
robust and stable alternative for training diffusion models. Furthermore, Flow
Matching opens the door to training CNFs with other, non-diffusion probability
paths. An instance of particular interest is using Optimal Transport (OT)
displacement interpolation to define the conditional probability paths. These
paths are more efficient than diffusion paths, provide faster training and
sampling, and result in better generalization. Training CNFs using Flow
Matching on ImageNet leads to consistently better performance than alternative
diffusion-based methods in terms of both likelihood and sample quality, and
allows fast and reliable sample generation using off-the-shelf numerical ODE
solvers.
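A minimal conditional flow-matching objective with the simple linear path x_t = (1 - t) x0 + t x1 and target velocity x1 - x0 (x0 drawn from a standard Gaussian); the paper's general formulation also covers Gaussian and diffusion-style paths, which this sketch omits.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_model, x1):
    """Conditional flow matching with the linear path x_t = (1 - t) x0 + t x1,
    whose conditional target velocity is u_t = x1 - x0, with x0 ~ N(0, I)."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)
    x_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    return F.mse_loss(v_model(x_t, t), target)

# Toy velocity field and data batch.
v_model = lambda x, t: torch.zeros_like(x)
print(flow_matching_loss(v_model, torch.randn(32, 2)).item())
```

After training, samples are drawn by integrating dx/dt = v(x, t) from t = 0 to t = 1 with any off-the-shelf ODE solver.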
DALL-E-Bot: Introducing Web-Scale Diffusion Models to Robotics
October 05, 2022
Ivan Kapelyukh, Vitalis Vosylius, Edward Johns
We introduce the first work to explore web-scale diffusion models for
robotics. DALL-E-Bot enables a robot to rearrange objects in a scene, by first
inferring a text description of those objects, then generating an image
representing a natural, human-like arrangement of those objects, and finally
physically arranging the objects according to that goal image. We show that
this is possible zero-shot using DALL-E, without needing any further example
arrangements, data collection, or training. DALL-E-Bot is fully autonomous and
is not restricted to a pre-defined set of objects or scenes, thanks to DALL-E’s
web-scale pre-training. Encouraging real-world results, with both human studies
and objective metrics, show that integrating web-scale diffusion models into
robotics pipelines is a promising direction for scalable, unsupervised robot
learning.
clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP
October 05, 2022
Justin N. M. Pinkney, Chuan Li
We introduce a new method to efficiently create text-to-image models from a
pre-trained CLIP and StyleGAN. It enables text driven sampling with an existing
generative model without any external data or fine-tuning. This is achieved by
training a diffusion model conditioned on CLIP embeddings to sample latent
vectors of a pre-trained StyleGAN, which we call clip2latent. We leverage the
alignment between CLIP’s image and text embeddings to avoid the need for any
text labelled data for training the conditional diffusion model. We demonstrate
that clip2latent allows us to generate high-resolution (1024x1024 pixels)
images based on text prompts with fast sampling, high image quality, and low
training compute and data requirements. We also show that the use of the well
studied StyleGAN architecture, without further fine-tuning, allows us to
directly apply existing methods to control and modify the generated images
adding a further layer of control to our text-to-image pipeline.
Imagen Video: High Definition Video Generation with Diffusion Models
October 05, 2022
Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, Tim Salimans
We present Imagen Video, a text-conditional video generation system based on
a cascade of video diffusion models. Given a text prompt, Imagen Video
generates high definition videos using a base video generation model and a
sequence of interleaved spatial and temporal video super-resolution models. We
describe how we scale up the system as a high definition text-to-video model
including design decisions such as the choice of fully-convolutional temporal
and spatial super-resolution models at certain resolutions, and the choice of
the v-parameterization of diffusion models. In addition, we confirm and
transfer findings from previous work on diffusion-based image generation to the
video generation setting. Finally, we apply progressive distillation to our
video models with classifier-free guidance for fast, high quality sampling. We
find Imagen Video not only capable of generating videos of high fidelity, but
also having a high degree of controllability and world knowledge, including the
ability to generate diverse videos and text animations in various artistic
styles and with 3D object understanding. See
https://imagen.research.google/video/ for samples.
Progressive Denoising Model for Fine-Grained Text-to-Image Generation
October 05, 2022
Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang
Recently, Vector Quantized AutoRegressive (VQ-AR) models have shown
remarkable results in text-to-image synthesis by equally predicting discrete
image tokens from the top left to bottom right in the latent space. Although
the simple generative process surprisingly works well, is this the best way to
generate the image? For instance, human creation tends to proceed from the
outline to the fine details of an image, whereas VQ-AR models do not consider
any relative importance of image patches. In this paper, we present a progressive
model for high-fidelity text-to-image generation. The proposed method takes
effect by creating new image tokens from coarse to fine based on the existing
context in a parallel manner, and this procedure is recursively applied with
the proposed error revision mechanism until an image sequence is completed. The
resulting coarse-to-fine hierarchy makes the image generation process intuitive
and interpretable. Extensive experiments in MS COCO benchmark demonstrate that
the progressive model produces significantly better results compared with the
previous VQ-AR method in FID score across a wide variety of categories and
aspects. Moreover, the design of parallel generation in each step allows more
than $13\times$ inference acceleration with slight performance loss.
LDEdit: Towards Generalized Text Guided Image Manipulation via Latent Diffusion Models
October 05, 2022
Paramanand Chandramouli, Kanchana Vaishnavi Gandikota
Research in vision-language models has seen rapid developments off-late,
enabling natural language-based interfaces for image generation and
manipulation. Many existing text guided manipulation techniques are restricted
to specific classes of images, and often require fine-tuning to transfer to a
different style or domain. Nevertheless, generic image manipulation using a
single model with flexible text inputs is highly desirable. Recent work
addresses this task by guiding generative models trained on the generic image
datasets using pretrained vision-language encoders. While promising, this
approach requires expensive optimization for each input. In this work, we
propose an optimization-free method for the task of generic image manipulation
from text prompts. Our approach exploits recent Latent Diffusion Models (LDM)
for text to image generation to achieve zero-shot text guided manipulation. We
employ a deterministic forward diffusion in a lower dimensional latent space,
and the desired manipulation is achieved by simply providing the target text to
condition the reverse diffusion process. We refer to our approach as LDEdit. We
demonstrate the applicability of our method on semantic image manipulation and
artistic style transfer. Our method can accomplish image manipulation on
diverse domains and enables editing multiple attributes in a straightforward
fashion. Extensive experiments demonstrate the benefit of our approach over
competing baselines.
DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
October 04, 2022
Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola
q-bio.BM, cs.LG, physics.bio-ph
Predicting the binding structure of a small molecule ligand to a protein – a
task known as molecular docking – is critical to drug design. Recent deep
learning methods that treat docking as a regression problem have decreased
runtime compared to traditional search-based methods but have yet to offer
substantial improvements in accuracy. We instead frame molecular docking as a
generative modeling problem and develop DiffDock, a diffusion generative model
over the non-Euclidean manifold of ligand poses. To do so, we map this manifold
to the product space of the degrees of freedom (translational, rotational, and
torsional) involved in docking and develop an efficient diffusion process on
this space. Empirically, DiffDock obtains a 38% top-1 success rate (RMSD < 2Å) on
PDBBind, significantly outperforming the previous state-of-the-art of
traditional docking (23%) and deep learning (20%) methods. Moreover, while
previous methods are not able to dock on computationally folded structures
(maximum accuracy 10.4%), DiffDock maintains significantly higher precision
(21.7%). Finally, DiffDock has fast inference times and provides confidence
estimates with high selective accuracy.
Red-Teaming the Stable Diffusion Safety Filter
October 03, 2022
Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, Florian Tramèr
cs.AI, cs.CR, cs.CV, cs.CY, cs.LG
Stable Diffusion is a recent open-source image generation model comparable to
proprietary models such as DALLE, Imagen, or Parti. Stable Diffusion comes with
a safety filter that aims to prevent generating explicit images. Unfortunately,
the filter is obfuscated and poorly documented. This makes it hard for users to
prevent misuse in their applications, and to understand the filter’s
limitations and improve it. We first show that it is easy to generate
disturbing content that bypasses the safety filter. We then reverse-engineer
the filter and find that while it aims to prevent sexual content, it ignores
violence, gore, and other similarly disturbing content. Based on our analysis,
we argue safety measures in future model releases should strive to be fully
open and properly documented to stimulate security contributions from the
community.
Improving Sample Quality of Diffusion Models Using Self-Attention Guidance
October 03, 2022
Susung Hong, Gyuseong Lee, Wooseok Jang, Seungryong Kim
Denoising diffusion models (DDMs) have attracted attention for their
exceptional generation quality and diversity. This success is largely
attributed to the use of class- or text-conditional diffusion guidance methods,
such as classifier and classifier-free guidance. In this paper, we present a
more comprehensive perspective that goes beyond the traditional guidance
methods. From this generalized perspective, we introduce novel condition- and
training-free strategies to enhance the quality of generated images. As a
simple solution, blur guidance improves the suitability of intermediate samples
for their fine-scale information and structures, enabling diffusion models to
generate higher quality samples with a moderate guidance scale. Improving upon
this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps
of diffusion models to enhance their stability and efficacy. Specifically, SAG
adversarially blurs only the regions that diffusion models attend to at each
iteration and guides them accordingly. Our experimental results show that our
SAG improves the performance of various diffusion models, including ADM, IDDPM,
Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance
methods leads to further improvement.
Statistical Efficiency of Score Matching: The View from Isoperimetry
October 03, 2022
Frederic Koehler, Alexander Heckett, Andrej Risteski
cs.LG, math.ST, stat.ML, stat.TH
Deep generative models parametrized up to a normalizing constant (e.g.
energy-based models) are difficult to train by maximizing the likelihood of the
data because the likelihood and/or gradients thereof cannot be explicitly or
efficiently written down. Score matching is a training method, whereby instead
of fitting the likelihood $\log p(x)$ for the training data, we instead fit the
score function $\nabla_x \log p(x)$ – obviating the need to evaluate the
partition function. Though this estimator is known to be consistent, it is
unclear whether (and when) its statistical efficiency is comparable to that of
maximum likelihood – which is known to be (asymptotically) optimal. We
initiate this line of inquiry in this paper, and show a tight connection
between statistical efficiency of score matching and the isoperimetric
properties of the distribution being estimated – i.e. the Poincaré,
log-Sobolev and isoperimetric constant – quantities which govern the mixing
time of Markov processes like Langevin dynamics. Roughly, we show that the
score matching estimator is statistically comparable to the maximum likelihood
when the distribution has a small isoperimetric constant. Conversely, if the
distribution has a large isoperimetric constant – even for simple families of
distributions like exponential families with rich enough sufficient statistics
– score matching will be substantially less efficient than maximum likelihood.
We suitably formalize these results both in the finite sample regime, and in
the asymptotic regime. Finally, we identify a direct parallel in the discrete
setting, where we connect the statistical properties of pseudolikelihood
estimation with approximate tensorization of entropy and the Glauber dynamics.
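For reference, the implicit score matching objective discussed above, E[tr(grad s) + 0.5 ||s||^2], with the trace estimated by Hutchinson probes; the toy linear score_fn only illustrates the call pattern, and the paper's question is precisely when this estimator is statistically competitive with maximum likelihood.

```python
import torch

def score_matching_loss(score_fn, x, n_probes=1):
    """Implicit score matching, E[ tr(grad s) + 0.5 ||s||^2 ], with the trace
    estimated by Hutchinson probes (Rademacher vectors)."""
    x = x.clone().requires_grad_(True)
    s = score_fn(x)
    sq = 0.5 * (s ** 2).sum(dim=1)
    tr = 0.0
    for _ in range(n_probes):
        v = torch.randint(0, 2, x.shape).float() * 2 - 1            # +/- 1 probes
        jtv = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
        tr = tr + (jtv * v).sum(dim=1) / n_probes                   # v^T J v per sample
    return (sq + tr).mean()

# Toy score model to show the call pattern.
W = torch.randn(4, 4)
score_fn = lambda x: x @ W
print(score_matching_loss(score_fn, torch.randn(64, 4)).item())
```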
Generated Faces in the Wild: Quantitative Comparison of Stable Diffusion, Midjourney and DALL-E 2
October 02, 2022
Ali Borji
The field of image synthesis has made great strides in the last couple of
years. Recent models are capable of generating images with astonishing quality.
Fine-grained evaluation of these models on some interesting categories such as
faces is still missing. Here, we conduct a quantitative comparison of three
popular systems including Stable Diffusion, Midjourney, and DALL-E 2 in their
ability to generate photorealistic faces in the wild. We find that Stable
Diffusion generates better faces than the other systems, according to the FID
score. We also introduce a dataset of generated faces in the wild dubbed GFW,
including a total of 15,076 faces. Furthermore, we hope that our study spurs
follow-up research in assessing the generative models and improving them. Data
and code are publicly available.
OCD: Learning to Overfit with Conditional Diffusion Models
October 02, 2022
Shahar Lutati, Lior Wolf
We present a dynamic model in which the weights are conditioned on an input
sample x and are learned to match those that would be obtained by finetuning a
base model on x and its label y. This mapping between an input sample and
network weights is approximated by a denoising diffusion model. The diffusion
model we employ focuses on modifying a single layer of the base model and is
conditioned on the input, activations, and output of this layer. Since the
diffusion model is stochastic in nature, multiple initializations generate
different networks, forming an ensemble, which leads to further improvements.
Our experiments demonstrate the wide applicability of the method for image
classification, 3D reconstruction, tabular data, speech separation, and natural
language processing. Our code is available at
https://github.com/ShaharLutatiPersonal/OCD
Protein structure generation via folding diffusion
September 30, 2022
Kevin E. Wu, Kevin K. Yang, Rianne van den Berg, James Y. Zou, Alex X. Lu, Ava P. Amini
q-bio.BM, cs.AI, I.2.0; J.3
The ability to computationally generate novel yet physically foldable protein
structures could lead to new biological discoveries and new treatments
targeting yet incurable diseases. Despite recent advances in protein structure
prediction, directly generating diverse, novel protein structures from neural
networks remains difficult. In this work, we present a new diffusion-based
generative model that designs protein backbone structures via a procedure that
mirrors the native folding process. We describe protein backbone structure as a
series of consecutive angles capturing the relative orientation of the
constituent amino acid residues, and generate new structures by denoising from
a random, unfolded state towards a stable folded structure. Not only does this
mirror how proteins biologically twist into energetically favorable
conformations, but the inherent shift and rotational invariance of this
representation crucially alleviates the need for complex equivariant networks.
We train a denoising diffusion probabilistic model with a simple transformer
backbone and demonstrate that our resulting model unconditionally generates
highly realistic protein structures with complexity and structural patterns
akin to those of naturally-occurring proteins. As a useful resource, we release
the first open-source codebase and trained models for protein structure
diffusion.
Building Normalizing Flows with Stochastic Interpolants
September 30, 2022
Michael S. Albergo, Eric Vanden-Eijnden
A generative model based on a continuous-time normalizing flow between any
pair of base and target probability densities is proposed. The velocity field
of this flow is inferred from the probability current of a time-dependent
density that interpolates between the base and the target in finite time.
Unlike conventional normalizing flow inference methods based on the maximum
likelihood principle, which require costly backpropagation through ODE solvers,
our interpolant approach leads to a simple quadratic loss for the velocity
itself which is expressed in terms of expectations that are readily amenable to
empirical estimation. The flow can be used to generate samples from either the
base or target, and to estimate the likelihood at any time along the
interpolant. In addition, the flow can be optimized to minimize the path length
of the interpolant density, thereby paving the way for building optimal
transport maps. In situations where the base is a Gaussian density, we also
show that the velocity of our normalizing flow can be used to construct a
diffusion model to sample the target as well as estimate its score. However,
our approach shows that we can bypass this diffusion completely and work at the
level of the probability flow with greater simplicity, opening an avenue for
methods based solely on ordinary differential equations as an alternative to
those based on stochastic differential equations. Benchmarking on density
estimation tasks illustrates that the learned flow can match and surpass
conventional continuous flows at a fraction of the cost, and compares well with
diffusions on image generation on CIFAR-10 and ImageNet $32\times32$. The
method scales ab-initio ODE flows to previously unreachable image resolutions,
demonstrated up to $128\times128$.
TabDDPM: Modelling Tabular Data with Diffusion Models
September 30, 2022
Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko
Denoising diffusion probabilistic models are currently becoming the leading
paradigm of generative modeling for many important data modalities. Being the
most prevalent in the computer vision community, diffusion models have also
recently gained some attention in other domains, including speech, NLP, and
graph-like data. In this work, we investigate if the framework of diffusion
models can be advantageous for general tabular problems, where datapoints are
typically represented by vectors of heterogeneous features. The inherent
heterogeneity of tabular data makes it quite challenging for accurate modeling,
since the individual features can be of completely different nature, i.e., some
of them can be continuous and some of them can be discrete. To address such
data types, we introduce TabDDPM – a diffusion model that can be universally
applied to any tabular dataset and handles any type of feature. We extensively
evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority
over existing GAN/VAE alternatives, which is consistent with the advantage of
diffusion models in other fields. Additionally, we show that TabDDPM is
eligible for privacy-oriented setups, where the original datapoints cannot be
publicly shared.
Diffusion-based Image Translation using Disentangled Style and Content Representation
September 30, 2022
Gihyun Kwon, Jong Chul Ye
cs.CV, cs.AI, cs.LG, stat.ML
Diffusion-based image translation guided by semantic texts or a single target
image has enabled flexible style transfer which is not limited to the specific
domains. Unfortunately, due to the stochastic nature of diffusion models, it is
often difficult to maintain the original content of the image during the
reverse diffusion. To address this, here we present a novel diffusion-based
unsupervised image translation method using disentangled style and content
representation.
Specifically, inspired by the splicing Vision Transformer, we extract the
intermediate keys of the multi-head self-attention layers from a ViT model and
use them as the content preservation loss. Then, an image-guided style transfer is
performed by matching the [CLS] classification token from the denoised samples
and target image, whereas additional CLIP loss is used for the text-driven
style transfer. To further accelerate the semantic change during the reverse
diffusion, we also propose a novel semantic divergence loss and resampling
strategy. Our experimental results show that the proposed method outperforms
state-of-the-art baseline models in both text-guided and image-guided
translation tasks.
DreamFusion: Text-to-3D using 2D Diffusion
September 29, 2022
Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall
Recent breakthroughs in text-to-image synthesis have been driven by diffusion
models trained on billions of image-text pairs. Adapting this approach to 3D
synthesis would require large-scale datasets of labeled 3D data and efficient
architectures for denoising 3D data, neither of which currently exist. In this
work, we circumvent these limitations by using a pretrained 2D text-to-image
diffusion model to perform text-to-3D synthesis. We introduce a loss based on
probability density distillation that enables the use of a 2D diffusion model
as a prior for optimization of a parametric image generator. Using this loss in
a DeepDream-like procedure, we optimize a randomly-initialized 3D model (a
Neural Radiance Field, or NeRF) via gradient descent such that its 2D
renderings from random angles achieve a low loss. The resulting 3D model of the
given text can be viewed from any angle, relit by arbitrary illumination, or
composited into any 3D environment. Our approach requires no 3D training data
and no modifications to the image diffusion model, demonstrating the
effectiveness of pretrained image diffusion models as priors.
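A minimal sketch of the score-distillation gradient underlying the loss described above: noise a differentiable rendering, query a frozen text-conditioned denoiser, and push the weighted residual (eps_hat - eps) back into the renderer parameters while skipping the denoiser's Jacobian. The renderer here is a trivial stand-in (the image itself) rather than a NeRF, and the weighting and timestep range are common choices rather than necessarily the paper's.

```python
import torch

def sds_grad_step(render, params, eps_model, text_emb, alpha_bar):
    """One score-distillation step: grad_theta ~ w(t) (eps_hat - eps) dx/dtheta."""
    t = torch.randint(20, len(alpha_bar) - 20, (1,)).item()   # avoid extreme timesteps
    a = alpha_bar[t]
    x = render(params)                                        # differentiable rendering
    eps = torch.randn_like(x)
    x_t = a.sqrt() * x + (1 - a).sqrt() * eps
    with torch.no_grad():
        eps_hat = eps_model(x_t, t, text_emb)                 # frozen diffusion prior
    w = 1 - a                                                 # one common weighting choice
    # The denoiser Jacobian is skipped: (eps_hat - eps) is treated as a constant residual.
    grad = torch.autograd.grad(x, params, grad_outputs=w * (eps_hat - eps))[0]
    return grad

# Toy "renderer": the parameters are the image itself.
params = torch.randn(1, 3, 64, 64, requires_grad=True)
render = lambda p: torch.sigmoid(p)
eps_model = lambda x, t, c: torch.zeros_like(x)
alpha_bar = torch.cumprod(1 - torch.linspace(1e-4, 0.02, 1000), dim=0)
g = sds_grad_step(render, params, eps_model, text_emb=None, alpha_bar=alpha_bar)
params.data -= 0.1 * g                                        # plain gradient step
print(g.shape)
```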
Human Motion Diffusion Model
September 29, 2022
Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, Amit H. Bermano
Natural and expressive human motion generation is the holy grail of computer
animation. It is a challenging task, due to the diversity of possible motion,
human perceptual sensitivity to it, and the difficulty of accurately describing
it. Therefore, current generative solutions are either low-quality or limited
in expressiveness. Diffusion models, which have already shown remarkable
generative capabilities in other domains, are promising candidates for human
motion due to their many-to-many nature, but they tend to be resource hungry
and hard to control. In this paper, we introduce Motion Diffusion Model (MDM),
a carefully adapted classifier-free diffusion-based generative model for the
human motion domain. MDM is transformer-based, combining insights from motion
generation literature. A notable design-choice is the prediction of the sample,
rather than the noise, in each diffusion step. This facilitates the use of
established geometric losses on the locations and velocities of the motion,
such as the foot contact loss. As we demonstrate, MDM is a generic approach,
enabling different modes of conditioning, and different generation tasks. We
show that our model is trained with lightweight resources and yet achieves
state-of-the-art results on leading benchmarks for text-to-motion and
action-to-motion. https://guytevet.github.io/mdm-page/ .
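A hypothetical sketch of why predicting the sample simplifies geometric supervision: reconstruction, velocity, and foot-contact terms can be computed directly on the predicted motion. The shapes, foot-joint indices, and contact threshold are illustrative assumptions, not MDM's exact losses.

```python
import torch
import torch.nn.functional as F

def mdm_style_loss(x0_pred, x0, foot_joints=[7, 8, 10, 11],
                   lam_vel=1.0, lam_foot=1.0):
    """Sample-prediction loss plus geometric terms on the predicted motion.
    x0_pred, x0: (batch, frames, joints, 3) joint positions."""
    recon = F.mse_loss(x0_pred, x0)
    vel_pred = x0_pred[:, 1:] - x0_pred[:, :-1]
    vel_gt = x0[:, 1:] - x0[:, :-1]
    vel = F.mse_loss(vel_pred, vel_gt)
    # Foot-contact term: penalize predicted foot velocity on frames where the
    # ground-truth foot is (nearly) static.
    contact = (vel_gt[:, :, foot_joints].norm(dim=-1) < 1e-2).float()
    foot = (vel_pred[:, :, foot_joints].norm(dim=-1) * contact).mean()
    return recon + lam_vel * vel + lam_foot * foot

x0 = torch.randn(4, 60, 22, 3)
x0_pred = x0 + 0.01 * torch.randn_like(x0)
print(mdm_style_loss(x0_pred, x0).item())
```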
Denoising Diffusion Probabilistic Models for Styled Walking Synthesis
September 29, 2022
Edmund J. C. Findlay, Haozheng Zhang, Ziyi Chang, Hubert P. H. Shum
cs.CV, cs.AI, cs.GR, cs.LG
Generating realistic motions for digital humans is time-consuming for many
graphics applications. Data-driven motion synthesis approaches have seen solid
progress in recent years through deep generative models. These results offer
high-quality motions but typically suffer in motion style diversity. For the
first time, we propose a framework using the denoising diffusion probabilistic
model (DDPM) to synthesize styled human motions, integrating two tasks into one
pipeline with increased style diversity compared with traditional motion
synthesis methods. Experimental results show that our system can generate
high-quality and diverse walking motions.
Analyzing Diffusion as Serial Reproduction
September 29, 2022
Raja Marjieh, Ilia Sucholutsky, Thomas A. Langlois, Nori Jacoby, Thomas L. Griffiths
Diffusion models are a class of generative models that learn to synthesize
samples by inverting a diffusion process that gradually maps data into noise.
While these models have enjoyed great success recently, a full theoretical
understanding of their observed properties is still lacking, in particular,
their weak sensitivity to the choice of noise family and the role of adequate
scheduling of noise levels for good synthesis. By identifying a correspondence
between diffusion models and a well-known paradigm in cognitive science known
as serial reproduction, whereby human agents iteratively observe and reproduce
stimuli from memory, we show how the aforementioned properties of diffusion
models can be explained as a natural consequence of this correspondence. We
then complement our theoretical analysis with simulations that exhibit these
key features. Our work highlights how classic paradigms in cognitive science
can shed light on state-of-the-art machine learning problems.
Make-A-Video: Text-to-Video Generation without Text-Video Data
September 29, 2022
Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, Yaniv Taigman
We propose Make-A-Video – an approach for directly translating the
tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video
(T2V). Our intuition is simple: learn what the world looks like and how it is
described from paired text-image data, and learn how the world moves from
unsupervised video footage. Make-A-Video has three advantages: (1) it
accelerates training of the T2V model (it does not need to learn visual and
multimodal representations from scratch), (2) it does not require paired
text-video data, and (3) the generated videos inherit the vastness (diversity
in aesthetic, fantastical depictions, etc.) of today’s image generation models.
We design a simple yet effective way to build on T2I models with novel and
effective spatial-temporal modules. First, we decompose the full temporal U-Net
and attention tensors and approximate them in space and time. Second, we design
a spatial temporal pipeline to generate high resolution and frame rate videos
with a video decoder, interpolation model and two super resolution models that
can enable various applications besides T2V. In all aspects, spatial and
temporal resolution, faithfulness to text, and quality, Make-A-Video sets the
new state-of-the-art in text-to-video generation, as determined by both
qualitative and quantitative measures.
DiGress: Discrete Denoising diffusion for graph generation
September 29, 2022
Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, Pascal Frossard
This work introduces DiGress, a discrete denoising diffusion model for
generating graphs with categorical node and edge attributes. Our model utilizes
a discrete diffusion process that progressively edits graphs with noise,
through the process of adding or removing edges and changing the categories. A
graph transformer network is trained to revert this process, simplifying the
problem of distribution learning over graphs into a sequence of node and edge
classification tasks. We further improve sample quality by introducing a
Markovian noise model that preserves the marginal distribution of node and edge
types during diffusion, and by incorporating auxiliary graph-theoretic
features. A procedure for conditioning the generation on graph-level features
is also proposed. DiGress achieves state-of-the-art performance on molecular
and non-molecular datasets, with up to 3x validity improvement on a planar
graph dataset. It is also the first model to scale to the large GuacaMol
dataset containing 1.3M drug-like molecules without the use of
molecule-specific representations.
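As a concrete illustration of a marginal-preserving noise model, below is a hedged NumPy sketch of one categorical noising step with a transition matrix of the form Q_t = alpha_t * I + (1 - alpha_t) * 1 m^T, which leaves the class marginals m invariant. The marginals, schedule value, and shapes are illustrative assumptions, not the paper's exact parameterization.

```python
# Hedged sketch: one marginal-preserving categorical noising step (assumed setup).
import numpy as np

rng = np.random.default_rng(0)

m = np.array([0.5, 0.3, 0.2])           # assumed marginal over 3 node types
alpha_t = 0.8                            # amount of signal kept at step t
Q_t = alpha_t * np.eye(3) + (1 - alpha_t) * np.outer(np.ones(3), m)

x0 = np.eye(3)[rng.integers(0, 3, size=10)]   # one-hot types for 10 nodes
probs = x0 @ Q_t                               # per-node categorical distribution
xt = np.array([rng.choice(3, p=p / p.sum()) for p in probs])
print(xt)
```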
Creative Painting with Latent Diffusion Models
September 29, 2022
Xianchao Wu
cs.CV, cs.AI, cs.CL, cs.GR, cs.LG
Artistic painting has made significant progress in recent years.
Using an autoencoder to connect the original images with compressed latent
spaces and a cross-attention-enhanced U-Net as the backbone of diffusion,
latent diffusion models (LDMs) have achieved stable and high-fidelity image
generation. In this paper, we focus on enhancing the creative painting ability
of current LDMs in two directions: textual condition extension and model
retraining with the WikiArt dataset. Through textual condition extension, users’
input prompts are expanded with rich contextual knowledge for a deeper
understanding and interpretation of the prompts. The WikiArt dataset contains
80K famous artworks created over the last 400 years by more than 1,000 famous
artists in rich styles and genres. Through retraining, we are able to ask these
artists to draw novel and creative paintings on modern topics. Direct
comparisons with the original model show that creativity and artistry are
enriched.
Diffusion Posterior Sampling for General Noisy Inverse Problems
September 29, 2022
Hyungjin Chung, Jeongsol Kim, Michael T. Mccann, Marc L. Klasky, Jong Chul Ye
stat.ML, cs.AI, cs.CV, cs.LG
Diffusion models have been recently studied as powerful generative inverse
problem solvers, owing to their high quality reconstructions and the ease of
combining existing iterative solvers. However, most works focus on solving
simple linear inverse problems in noiseless settings, which significantly
under-represents the complexity of real-world problems. In this work, we extend
diffusion solvers to efficiently handle general noisy (non)linear inverse
problems via approximation of the posterior sampling. Interestingly, the
resulting posterior sampling scheme is a blended version of diffusion sampling
with the manifold constrained gradient without a strict measurement consistency
projection step, yielding a more desirable generative path in noisy settings
compared to the previous studies. Our method demonstrates that diffusion models
can incorporate various measurement noise statistics such as Gaussian and
Poisson, and also efficiently handle noisy nonlinear inverse problems such as
Fourier phase retrieval and non-uniform deblurring. Code available at
https://github.com/DPS2022/diffusion-posterior-sampling
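The blended update described above can be illustrated with a hedged sketch of a single sampling step: an ordinary ancestral DDPM step plus a gradient of the measurement residual taken through a Tweedie-style estimate of the clean image. The names `eps_model`, `forward_operator`, the schedule scalars, and the step size `zeta` are placeholders rather than the released implementation.

```python
# Hedged sketch of one posterior-sampling step (placeholders, not the paper's code).
import torch


def dps_step(x_t, y, t, eps_model, forward_operator,
             alpha_bar_t, alpha_t, sigma_t, zeta=1.0):
    x_t = x_t.detach().requires_grad_(True)

    # Predict the noise and form a Tweedie-style estimate of the clean image x0.
    eps = eps_model(x_t, t)
    x0_hat = (x_t - (1 - alpha_bar_t) ** 0.5 * eps) / alpha_bar_t ** 0.5

    # Unconditional (ancestral) DDPM mean and sample.
    mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5
    x_prev = mean + sigma_t * torch.randn_like(x_t)

    # Gradient of the measurement residual through the estimated clean image.
    residual = ((y - forward_operator(x0_hat)) ** 2).sum()
    grad = torch.autograd.grad(residual, x_t)[0]

    # Nudge the sample toward measurement consistency (no hard projection).
    return (x_prev - zeta * grad).detach()


# Tiny smoke test with stand-in components.
x_t = torch.randn(1, 3, 8, 8)
y = torch.randn(1, 3, 4, 4)
out = dps_step(x_t, y, 10,
               eps_model=lambda x, t: 0.1 * x,
               forward_operator=lambda x: x[:, :, ::2, ::2],
               alpha_bar_t=0.5, alpha_t=0.99, sigma_t=0.05)
print(out.shape)
```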
Denoising MCMC for Accelerating Diffusion-Based Generative Models
September 29, 2022
Beomsu Kim, Jong Chul Ye
Diffusion models are powerful generative models that simulate the reverse of
diffusion processes using score functions to synthesize data from noise. The
sampling process of diffusion models can be interpreted as solving the reverse
stochastic differential equation (SDE) or the ordinary differential equation
(ODE) of the diffusion process, which often requires up to thousands of
discretization steps to generate a single image. This has sparked a great
interest in developing efficient integration techniques for reverse-S/ODEs.
Here, we propose an orthogonal approach to accelerating score-based sampling:
Denoising MCMC (DMCMC). DMCMC first uses MCMC to produce samples in the product
space of data and variance (or diffusion time). Then, a reverse-S/ODE
integrator is used to denoise the MCMC samples. Since MCMC traverses close to
the data manifold, the computation cost of producing a clean sample for DMCMC
is much less than that of producing a clean sample from noise. To verify the
proposed concept, we show that Denoising Langevin Gibbs (DLG), an instance of
DMCMC, successfully accelerates all six reverse-S/ODE integrators considered in
this work on the tasks of CIFAR10 and CelebA-HQ-256 image generation. Notably,
combined with integrators of Karras et al. (2022) and pre-trained score models
of Song et al. (2021b), DLG achieves SOTA results. In the limited number of
score function evaluation (NFE) settings on CIFAR10, we have $3.86$ FID with
$\approx 10$ NFE and $2.63$ FID with $\approx 20$ NFE. On CelebA-HQ-256, we
have $6.99$ FID with $\approx 160$ NFE, which beats the current best record of
Kim et al. (2022) among score-based models, $7.16$ FID with $4000$ NFE. Code:
https://github.com/1202kbs/DMCMC
Diffusion Adversarial Representation Learning for Self-supervised Vessel Segmentation
September 29, 2022
Boah Kim, Yujin Oh, Jong Chul Ye
Vessel segmentation in medical images is one of the important tasks in the
diagnosis of vascular diseases and therapy planning. Although learning-based
segmentation approaches have been extensively studied, supervised methods
require a large amount of ground-truth labels, and confusing background
structures make it hard for neural networks to segment vessels in an
unsupervised manner. To address this, we introduce a novel diffusion adversarial
representation learning (DARL) model that leverages a denoising diffusion
probabilistic model with adversarial learning, and apply it to vessel
segmentation. In particular, for self-supervised vessel segmentation, DARL
learns the background signal using a diffusion module, which lets a generation
module effectively provide vessel representations. Also, by adversarial
learning based on the proposed switchable spatially-adaptive denormalization,
our model estimates synthetic fake vessel images as well as vessel segmentation
masks, which further makes the model capture vessel-relevant semantic
information. Once the proposed model is trained, the model generates
segmentation masks in a single step and can be applied to general vascular
structure segmentation of coronary angiography and retinal images. Experimental
results on various datasets show that our method significantly outperforms
existing unsupervised and self-supervised vessel segmentation methods.
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
September 29, 2022
Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen
Research on text-to-image generation has witnessed significant progress in
generating diverse and photo-realistic images, driven by diffusion and
auto-regressive models trained on large-scale image-text data. Though
state-of-the-art models can generate high-quality images of common entities,
they often have difficulty generating images of uncommon entities, such as
'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the
Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model
that uses retrieved information to produce high-fidelity and faithful images,
even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an
external multi-modal knowledge base to retrieve relevant (image, text) pairs
and uses them as references to generate the image. With this retrieval step,
Re-Imagen is augmented with the knowledge of high-level semantics and low-level
visual details of the mentioned entities, and thus improves its accuracy in
generating the entities’ visual appearances. We train Re-Imagen on a
constructed dataset containing (image, text, retrieval) triples to teach the
model to ground on both text prompt and retrieval. Furthermore, we develop a
new sampling strategy to interleave the classifier-free guidance for text and
retrieval conditions to balance the text and retrieval alignment. Re-Imagen
achieves significant gain on FID score over COCO and WikiImage. To further
evaluate the capabilities of the model, we introduce EntityDrawBench, a new
benchmark that evaluates image generation for diverse entities, from frequent
to rare, across multiple object categories including dogs, foods, landmarks,
birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen
can significantly improve the fidelity of generated images, especially on less
frequent entities.
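One plausible reading of the interleaved guidance strategy is sketched below: on alternating steps, the classifier-free guidance term contrasts the fully conditioned noise prediction against a prediction with either the text or the retrieval condition dropped. The alternation rule, the `eps_model` signature, and the guidance weights are assumptions, not the authors' exact scheme.

```python
# Hedged sketch of interleaved classifier-free guidance (assumed alternation rule).
def interleaved_cfg_eps(eps_model, x_t, t, text, retrieved,
                        w_text=7.5, w_retr=3.0):
    # t is the integer timestep; eps_model is a placeholder noise predictor.
    eps_full = eps_model(x_t, t, text=text, retrieved=retrieved)
    if t % 2 == 0:
        # Emphasize text alignment: contrast against dropping the text condition.
        eps_drop = eps_model(x_t, t, text=None, retrieved=retrieved)
        return eps_drop + w_text * (eps_full - eps_drop)
    # Emphasize retrieval alignment: contrast against dropping the retrieval.
    eps_drop = eps_model(x_t, t, text=text, retrieved=None)
    return eps_drop + w_retr * (eps_full - eps_drop)
```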
Score Modeling for Simulation-based Inference
September 28, 2022
Tomas Geffner, George Papamakarios, Andriy Mnih
Neural Posterior Estimation methods for simulation-based inference can be
ill-suited for dealing with posterior distributions obtained by conditioning on
multiple observations, as they tend to require a large number of simulator
calls to learn accurate approximations. In contrast, Neural Likelihood
Estimation methods can handle multiple observations at inference time after
learning from individual observations, but they rely on standard inference
methods, such as MCMC or variational inference, which come with certain
performance drawbacks. We introduce a new method based on conditional score
modeling that enjoys the benefits of both approaches. We model the scores of
the (diffused) posterior distributions induced by individual observations, and
introduce a way of combining the learned scores to approximately sample from
the target posterior distribution. Our approach is sample-efficient, can
naturally aggregate multiple observations at inference time, and avoids the
drawbacks of standard inference methods.
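For exact posteriors with conditionally independent observations, Bayes' rule gives the combination identity grad log p(theta | x_1..n) = sum_i grad log p(theta | x_i) - (n - 1) * grad log p(theta). The hedged sketch below only applies this motivating identity with learned per-observation scores; making the analogous combination sound for the diffused posteriors at time t is precisely what the paper's method addresses, and the function names are placeholders.

```python
# Hedged sketch: combine per-observation posterior scores (motivating identity only).
def combined_score(theta_t, t, observations, posterior_score, prior_score):
    # posterior_score(theta_t, t, x) ~ grad log p_t(theta | x); prior_score ~ grad log p_t(theta).
    scores = [posterior_score(theta_t, t, x) for x in observations]
    n = len(observations)
    return sum(scores) - (n - 1) * prior_score(theta_t, t)
```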
Spectral Diffusion Processes
September 28, 2022
Angus Phillips, Thomas Seror, Michael Hutchinson, Valentin De Bortoli, Arnaud Doucet, Emile Mathieu
Score-based generative modelling (SGM) has proven to be a very effective
method for modelling densities on finite-dimensional spaces. In this work we
propose to extend this methodology to learn generative models over functional
spaces. To do so, we represent functional data in spectral space to dissociate
the stochastic part of the processes from their space-time part. Using
dimensionality reduction techniques we then sample from their stochastic
component using finite dimensional SGM. We demonstrate our method’s
effectiveness for modelling various multimodal datasets.
Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion
September 27, 2022
Nisha Huang, Fan Tang, Weiming Dong, Changsheng Xu
Digital art synthesis is receiving increasing attention in the multimedia
community because it engages the public with art effectively. Current digital
art synthesis methods usually use single-modality inputs as guidance, thereby
limiting the expressiveness of the model and the diversity of generated
results. To solve this problem, we propose the multimodal guided artwork
diffusion (MGAD) model, which is a diffusion-based digital artwork generation
approach that utilizes multimodal prompts as guidance to control the
classifier-free diffusion model. Additionally, the contrastive language-image
pretraining (CLIP) model is used to unify text and image modalities. Extensive
experimental results on the quality and quantity of the generated digital art
paintings confirm the effectiveness of the combination of the diffusion model
and multimodal guidance. Code is available at
https://github.com/haha-lisa/MGAD-multimodal-guided-artwork-diffusion.
Learning to Learn with Generative Models of Neural Network Checkpoints
September 26, 2022
William Peebles, Ilija Radosavovic, Tim Brooks, Alexei A. Efros, Jitendra Malik
We explore a data-driven approach for learning to optimize neural networks.
We construct a dataset of neural network checkpoints and train a generative
model on the parameters. In particular, our model is a conditional diffusion
transformer that, given an initial input parameter vector and a prompted loss,
error, or return, predicts the distribution over parameter updates that achieve
the desired metric. At test time, it can optimize neural networks with unseen
parameters for downstream tasks in just one update. We find that our approach
successfully generates parameters for a wide range of loss prompts. Moreover,
it can sample multimodal parameter solutions and has favorable scaling
properties. We apply our method to different neural network architectures and
tasks in supervised and reinforcement learning.
Quasi-Conservative Score-based Generative Models
September 26, 2022
Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Chun-Yi Lee
Existing Score-Based Models (SBMs) can be categorized into constrained SBMs
(CSBMs) or unconstrained SBMs (USBMs) according to their parameterization
approaches. CSBMs model probability density functions as Boltzmann
distributions, and assign their predictions as the negative gradients of some
scalar-valued energy functions. On the other hand, USBMs employ flexible
architectures capable of directly estimating scores without the need to
explicitly model energy functions. In this paper, we demonstrate that the
architectural constraints of CSBMs may limit their modeling ability. In
addition, we show that USBMs’ inability to preserve the property of
conservativeness may lead to degraded performance in practice. To address the
above issues, we propose Quasi-Conservative Score-Based Models (QCSBMs) for
keeping the advantages of both CSBMs and USBMs. Our theoretical derivations
demonstrate that the training objective of QCSBMs can be efficiently integrated
into the training process by leveraging Hutchinson’s trace estimator. In
addition, our experimental results on the CIFAR-10, CIFAR-100, ImageNet, and
SVHN datasets validate the effectiveness of QCSBMs. Finally, we justify the
advantage of QCSBMs using an example of a one-layered autoencoder.
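The trace estimator mentioned above can be sketched generically: the trace of the Jacobian of a score network is approximated with random probe vectors and a single vector-Jacobian product, so no full Jacobian is ever formed. The exact QCSBM penalty is not reproduced here; `score_net` and the probe distribution are stand-ins.

```python
# Hedged sketch of Hutchinson's trace estimator for a score network's Jacobian.
import torch


def hutchinson_jacobian_trace(score_net, x, n_samples=1):
    x = x.detach().requires_grad_(True)
    s = score_net(x)                        # s has the same shape as x
    est = 0.0
    for _ in range(n_samples):
        v = torch.randn_like(x)             # Gaussian probe with E[v v^T] = I
        jt_v = torch.autograd.grad((s * v).sum(), x, create_graph=True)[0]
        est = est + (v * jt_v).sum()        # v^T J^T v; expectation is trace(J)
    return est / n_samples


# Tiny smoke test with a linear "score network".
x = torch.randn(4, 3)
lin = torch.nn.Linear(3, 3)
print(hutchinson_jacobian_trace(lambda z: lin(z), x))
```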
Convergence of score-based generative modeling for general data distributions
September 26, 2022
Holden Lee, Jianfeng Lu, Yixin Tan
cs.LG, math.PR, math.ST, stat.ML, stat.TH
Score-based generative modeling (SGM) has grown to be a hugely successful
method for learning to generate samples from complex data distributions such as
that of images and audio. It is based on evolving an SDE that transforms white
noise into a sample from the learned distribution, using estimates of the score
function, or gradient log-pdf. Previous convergence analyses for these methods
have suffered either from strong assumptions on the data distribution or
exponential dependencies, and hence fail to give efficient guarantees for the
multimodal and non-smooth distributions that arise in practice and for which
good empirical performance is observed. We consider a popular kind of SGM –
denoising diffusion models – and give polynomial convergence guarantees for
general data distributions, with no assumptions related to functional
inequalities or smoothness. Assuming $L^2$-accurate score estimates, we obtain
Wasserstein distance guarantees for any distribution of bounded support or
sufficiently decaying tails, as well as TV guarantees for distributions with
further smoothness assumptions.
Conversion Between CT and MRI Images Using Diffusion and Score-Matching Models
September 24, 2022
Qing Lyu, Ge Wang
eess.IV, cs.CV, physics.med-ph
MRI and CT are among the most widely used medical imaging modalities. It is
often necessary to acquire multi-modality images for diagnosis and treatment,
such as radiotherapy planning. However, multi-modality imaging is not only
costly but also introduces misalignment between MRI and CT images. To address
this challenge, computational conversion between MRI and CT images, especially
from MRI to CT, is a viable approach. In this paper, we propose to use an
emerging deep learning framework called diffusion and score-matching models in
this context. Specifically, we adapt denoising diffusion probabilistic and
score-matching models, use four different sampling strategies, and compare
their performance metrics with those of a convolutional neural network and a
generative adversarial network model. Our results show that the diffusion and
score-matching models generate better synthetic CT images than the CNN and GAN
models. Furthermore, we investigate the uncertainties associated with the
diffusion and score-matching networks using the Monte-Carlo method, and improve
the results by averaging their Monte-Carlo outputs. Our study suggests that
diffusion and score-matching models are powerful for generating high-quality
images conditioned on an image obtained with a complementary imaging modality,
are analytically rigorous with clear explainability, and are highly competitive
with CNNs and GANs for image synthesis.
Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions
September 22, 2022
Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, Anru R. Zhang
We provide theoretical convergence guarantees for score-based generative
models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which
constitute the backbone of large-scale real-world generative models such as
DALL$\cdot$E 2. Our main result is that, assuming accurate score estimates,
such SGMs can efficiently sample from essentially any realistic data
distribution. In contrast to prior works, our results (1) hold for an
$L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not
require restrictive functional inequality conditions that preclude substantial
non-log-concavity; (3) scale polynomially in all relevant problem parameters;
and (4) match state-of-the-art complexity guarantees for discretization of the
Langevin diffusion, provided that the score error is sufficiently small. We
view this as strong theoretical justification for the empirical success of
SGMs. We also examine SGMs based on the critically damped Langevin diffusion
(CLD). Contrary to conventional wisdom, we provide evidence that the use of the
CLD does not reduce the complexity of SGMs.
Poisson Flow Generative Models
September 22, 2022
Yilun Xu, Ziming Liu, Max Tegmark, Tommi Jaakkola
We propose a new “Poisson flow” generative model (PFGM) that maps a uniform
distribution on a high-dimensional hemisphere into any data distribution. We
interpret the data points as electrical charges on the $z=0$ hyperplane in a
space augmented with an additional dimension $z$, generating a high-dimensional
electric field (the gradient of the solution to the Poisson equation). We prove
that if these charges flow upward along electric field lines, their initial
distribution in the $z=0$ plane transforms into a distribution on the
hemisphere of radius $r$ that becomes uniform in the $r \to\infty$ limit. To
learn the bijective transformation, we estimate the normalized field in the
augmented space. For sampling, we devise a backward ODE that is anchored by the
physically meaningful additional dimension: the samples hit the unaugmented
data manifold when $z$ reaches zero. Experimentally, PFGM achieves current
state-of-the-art performance among the normalizing flow models on CIFAR-10,
with an Inception score of $9.68$ and a FID score of $2.35$. It also performs
on par with the state-of-the-art SDE approaches while offering $10\times $ to
$20 \times$ acceleration on image generation tasks. Additionally, PFGM appears
more tolerant of estimation errors on a weaker network architecture and robust
to the step size in the Euler method. The code is available at
https://github.com/Newbeeer/poisson_flow .
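A hedged Euler-integration sketch of a backward ODE anchored by the extra dimension, in the spirit of the sampling procedure described above: the augmented coordinate z plays the role of time and data is reached as z approaches zero. The `field_net` interface (predicting field components in the data dimensions and in z), the z schedule, and the step count are assumptions, not the released sampler.

```python
# Hedged sketch: Euler integration of an assumed backward ODE dx/dz = v_x / v_z.
import math
import torch


def sample_backward(field_net, x, z_max=40.0, z_min=1e-3, n_steps=100):
    # Log-spaced z schedule from a large anchor value down toward zero.
    zs = torch.exp(torch.linspace(math.log(z_max), math.log(z_min), n_steps + 1))
    for i in range(n_steps):
        z, z_next = zs[i], zs[i + 1]
        v_x, v_z = field_net(x, z)          # placeholder normalized-field components
        x = x + (v_x / v_z) * (z_next - z)  # Euler step; (z_next - z) is negative
    return x
```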
MIDMs: Matching Interleaved Diffusion Models for Exemplar-based Image Translation
September 22, 2022
Junyoung Seo, Gyuseong Lee, Seokju Cho, Jiyoung Lee, Seungryong Kim
We present a novel method for exemplar-based image translation, called
matching interleaved diffusion models (MIDMs). Most existing methods for this
task were formulated as GAN-based matching-then-generation framework. However,
in this framework, matching errors induced by the difficulty of semantic
matching across cross-domain, e.g., sketch and photo, can be easily propagated
to the generation step, which in turn leads to degenerated results. Motivated
by the recent success of diffusion models overcoming the shortcomings of GANs,
we incorporate the diffusion models to overcome these limitations.
Specifically, we formulate a diffusion-based matching-and-generation framework
that interleaves cross-domain matching and diffusion steps in the latent space
by iteratively feeding the intermediate warp into the noising process and
denoising it to generate a translated image. In addition, to improve the
reliability of the diffusion process, we design a confidence-aware process
using cycle-consistency to consider only confident regions during translation.
Experimental results show that our MIDMs generate more plausible images than
state-of-the-art methods.
Implementing and Experimenting with Diffusion Models for Text-to-Image Generation
September 22, 2022
Robin Zbinden
Taking advantage of the many recent advances in deep learning, text-to-image
generative models are currently attracting the attention of the general
public. Two of these models, DALL-E 2 and Imagen, have demonstrated that
highly photorealistic images could be generated from a simple textual
description of an image. Based on a novel approach for image generation called
diffusion models, text-to-image models enable the production of many different
types of high resolution images, where human imagination is the only limit.
However, these models require exceptionally large amounts of computational
resources to train, as well as handling huge datasets collected from the
internet. In addition, neither the codebase nor the models have been released.
This consequently prevents the AI community from experimenting with these
cutting-edge models, making the reproduction of their results complicated, if
not impossible.
In this thesis, we aim to contribute by firstly reviewing the different
approaches and techniques used by these models, and then by proposing our own
implementation of a text-to-image model. Closely based on DALL-E 2, we introduce
several slight modifications to tackle the high computational cost involved. We
thus have the opportunity to experiment in order to understand what these
models are capable of, especially in a low-resource regime. In particular, we
provide additional analyses, deeper than the ones performed by the authors
of DALL-E 2, including ablation studies.
Besides, diffusion models use so-called guidance methods to help the
generating process. We introduce a new guidance method which can be used in
conjunction with other guidance methods to improve image quality. Finally, the
images generated by our model are of reasonably good quality, without having to
sustain the significant training costs of state-of-the-art text-to-image
models.
Mandarin Singing Voice Synthesis with Denoising Diffusion Probabilistic Wasserstein GAN
September 21, 2022
Yin-Ping Cho, Yu Tsao, Hsin-Min Wang, Yi-Wen Liu
Singing voice synthesis (SVS) is the computer production of a human-like
singing voice from given musical scores. To accomplish end-to-end SVS
effectively and efficiently, this work adopts the acoustic model-neural vocoder
architecture established for high-quality speech and singing voice synthesis.
Specifically, this work aims to pursue a higher level of expressiveness in
synthesized voices by combining the diffusion denoising probabilistic model
(DDPM) and \emph{Wasserstein} generative adversarial network (WGAN) to
construct the backbone of the acoustic model. On top of the proposed acoustic
model, a HiFi-GAN neural vocoder is adopted with integrated fine-tuning to
ensure optimal synthesis quality for the resulting end-to-end SVS system. This
end-to-end system was evaluated with the multi-singer Mpop600 Mandarin singing
voice dataset. In the experiments, the proposed system exhibits improvements
over previous landmark counterparts in terms of musical expressiveness and
high-frequency acoustic details. Moreover, the adversarial acoustic model
converged stably without the need to enforce reconstruction objectives,
indicating the convergence stability of the proposed DDPM and WGAN combined
architecture over alternative GAN-based SVS systems.
Deep Generalized Schrödinger Bridge
September 20, 2022
Guan-Horng Liu, Tianrong Chen, Oswin So, Evangelos A. Theodorou
stat.ML, cs.GT, cs.LG, math.OC
Mean-Field Game (MFG) serves as a crucial mathematical framework in modeling
the collective behavior of individual agents interacting stochastically with a
large population. In this work, we aim at solving a challenging class of MFGs
in which the differentiability of these interacting preferences may not be
available to the solver, and the population is urged to converge exactly to
some desired distribution. These setups are, despite being well-motivated for
practical purposes, complicated enough to paralyze most (deep) numerical
solvers. Nevertheless, we show that the Schrödinger Bridge, as an
entropy-regularized optimal transport model, can be generalized to accept
mean-field structures, hence solving these MFGs. This is achieved via the
application of Forward-Backward Stochastic Differential Equations theory,
which, intriguingly, leads to a computational framework with a similar
structure to Temporal Difference learning. As such, it opens up novel
algorithmic connections to Deep Reinforcement Learning that we leverage to
facilitate practical training. We show that our proposed objective function
provides necessary and sufficient conditions to the mean-field problem. Our
method, named Deep Generalized Schrödinger Bridge (DeepGSB), not only
outperforms prior methods in solving classical population navigation MFGs, but
is also capable of solving 1000-dimensional opinion depolarization, setting a
new state-of-the-art numerical solver for high-dimensional MFGs. Our code will
be made available at https://github.com/ghliu/DeepGSB.
September 20, 2022
Siyuan Mei, Fuxin Fan, Andreas Maier
During orthopaedic surgery, the insertion of metallic implants or screws is
often performed under mobile C-arm systems. Due to the high attenuation of
metals, severe metal artifacts occur in 3D reconstructions, which greatly
degrade the image quality. To reduce the artifacts, many metal artifact
reduction algorithms have been developed, and metal inpainting in the
projection domain is an essential step. In this work, a score-based generative
model is trained on simulated knee projections, and the inpainted image is
obtained by removing the noise in a conditional resampling process. The results
show that the images inpainted by the score-based generative model contain more
detailed information and achieve the lowest mean absolute error and the highest
peak signal-to-noise ratio compared with interpolation- and CNN-based methods.
Besides, the score-based model can also recover projections with large circular
and rectangular masks, showing its generalization in the inpainting task.
T2V-DDPM: Thermal to Visible Face Translation using Denoising Diffusion Probabilistic Models
September 19, 2022
Nithin Gopalakrishnan Nair, Vishal M. Patel
Modern-day surveillance systems perform person recognition using deep
learning-based face verification networks. Most state-of-the-art facial
verification systems are trained using visible spectrum images. But, acquiring
images in the visible spectrum is impractical in scenarios of low-light and
nighttime conditions, and often images are captured in an alternate domain such
as the thermal infrared domain. Facial verification in thermal images is often
performed after retrieving the corresponding visible domain images. This is a
well-established problem often known as the Thermal-to-Visible (T2V) image
translation. In this paper, we propose a Denoising Diffusion Probabilistic
Model (DDPM) based solution for T2V translation specifically for facial images.
During training, the model learns the conditional distribution of visible
facial images given their corresponding thermal image through the diffusion
process. During inference, the visible domain image is obtained by starting
from Gaussian noise and performing denoising repeatedly. The existing inference
process for DDPMs is stochastic and time-consuming. Hence, we propose a novel
inference strategy for speeding up the inference time of DDPMs, specifically
for the problem of T2V image translation. We achieve the state-of-the-art
results on multiple datasets. The code and pretrained models are publicly
available at http://github.com/Nithin-GK/T2V-DDPM
Denoising Diffusion Error Correction Codes
September 16, 2022
Yoni Choukroun, Lior Wolf
cs.IT, cs.AI, cs.LG, math.IT
Error correction code (ECC) is an integral part of the physical communication
layer, ensuring reliable data transfer over noisy channels. Recently, neural
decoders have demonstrated their advantage over classical decoding techniques.
However, recent state-of-the-art neural decoders suffer from high complexity
and lack the important iterative scheme characteristic of many legacy decoders.
In this work, we propose to employ denoising diffusion models for the soft
decoding of linear codes at arbitrary block lengths. Our framework models the
forward channel corruption as a series of diffusion steps that can be reversed
iteratively. Three contributions are made: (i) a diffusion process suitable for
the decoding setting is introduced, (ii) the neural diffusion decoder is
conditioned on the number of parity errors, which indicates the level of
corruption at a given step, (iii) a line search procedure based on the code’s
syndrome obtains the optimal reverse diffusion step size. The proposed approach
demonstrates the power of diffusion models for ECC and is able to achieve state
of the art accuracy, outperforming the other neural decoders by sizable
margins, even for a single reverse diffusion step.
Brain Imaging Generation with Latent Diffusion Models
September 15, 2022
Walter H. L. Pinaya, Petru-Daniel Tudosiu, Jessica Dafflon, Pedro F da Costa, Virginia Fernandez, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
Deep neural networks have brought remarkable breakthroughs in medical image
analysis. However, due to their data-hungry nature, the modest dataset sizes in
medical imaging projects might be hindering their full potential. Generating
synthetic data provides a promising alternative, allowing researchers to
complement training datasets and conduct medical image research at a larger scale.
Diffusion models recently have caught the attention of the computer vision
community by producing photorealistic synthetic images. In this study, we
explore using Latent Diffusion Models to generate synthetic images from
high-resolution 3D brain images. We used T1w MRI images from the UK Biobank
dataset (N=31,740) to train our models to learn about the probabilistic
distribution of brain images, conditioned on covariables, such as age, sex, and
brain structure volumes. We found that our models created realistic data, and
we could use the conditioning variables to control the data generation
effectively. Besides that, we created a synthetic dataset with 100,000 brain
images and made it openly available to the scientific community.
Lossy Image Compression with Conditional Diffusion Models
September 14, 2022
Ruihan Yang, Stephan Mandt
eess.IV, cs.CV, cs.LG, stat.ML
This paper outlines an end-to-end optimized lossy image compression framework
using diffusion generative models. The approach relies on the transform coding
paradigm, where an image is mapped into a latent space for entropy coding and,
from there, mapped back to the data space for reconstruction. In contrast to
VAE-based neural compression, where the (mean) decoder is a deterministic
neural network, our decoder is a conditional diffusion model. Our approach thus
introduces an additional “content” latent variable on which the reverse
diffusion process is conditioned and uses this variable to store information
about the image. The remaining “texture” variables characterizing the
diffusion process are synthesized at decoding time. We show that the model’s
performance can be tuned toward perceptual metrics of interest. Our extensive
experiments involving multiple datasets and image quality assessment metrics
show that our approach yields stronger reported FID scores than the GAN-based
model, while also yielding competitive performance with VAE-based models in
several distortion metrics. Furthermore, training the diffusion with
$\mathcal{X}$-parameterization enables high-quality reconstructions in only a
handful of decoding steps, greatly improving the model’s practicality. Our code
is available at: \url{https://github.com/buggyyang/CDC_compression}
PET image denoising based on denoising diffusion probabilistic models
September 13, 2022
Kuang Gong, Keith A. Johnson, Georges El Fakhri, Quanzheng Li, Tinsu Pan
eess.IV, cs.CV, physics.med-ph
Due to various physical degradation factors and limited counts received, PET
image quality needs further improvements. The denoising diffusion probabilistic
models (DDPM) are distribution learning-based models, which try to transform a
normal distribution into a specific data distribution based on iterative
refinements. In this work, we proposed and evaluated different DDPM-based
methods for PET image denoising. Under the DDPM framework, one way to perform
PET image denoising is to provide the PET image and/or the prior image as the
network input. Another way is to supply the prior image as the input with the
PET image included in the refinement steps, which can fit scenarios with
different noise levels. 120 18F-FDG datasets and 140 18F-MK-6240 datasets were
utilized to evaluate the proposed DDPM-based methods. Quantitative results show
that the DDPM-based frameworks with PET information included can generate better
results than the nonlocal mean and Unet-based denoising methods. Adding
additional MR prior in the model can help achieve better performance and
further reduce the uncertainty during image denoising. Solely relying on MR
prior while ignoring the PET information can result in large bias. Regional and
surface quantification shows that employing MR prior as the network input while
embedding PET image as a data-consistency constraint during inference can
achieve the best performance. In summary, DDPM-based PET image denoising is a
flexible framework, which can efficiently utilize prior information and achieve
better performance than the nonlocal mean and Unet-based denoising methods.
Blurring Diffusion Models
September 12, 2022
Emiel Hoogeboom, Tim Salimans
Recently, Rissanen et al., (2022) have presented a new type of diffusion
process for generative modeling based on heat dissipation, or blurring, as an
alternative to isotropic Gaussian diffusion. Here, we show that blurring can
equivalently be defined through a Gaussian diffusion process with non-isotropic
noise. In making this connection, we bridge the gap between inverse heat
dissipation and denoising diffusion, and we shed light on the inductive bias
that results from this modeling choice. Finally, we propose a generalized class
of diffusion models that offers the best of both standard Gaussian denoising
diffusion and inverse heat dissipation, which we call Blurring Diffusion
Models.
Soft Diffusion: Score Matching for General Corruptions
September 12, 2022
Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alexandros G. Dimakis, Peyman Milanfar
We define a broader family of corruption processes that generalizes
previously known diffusion models. To reverse these general diffusions, we
propose a new objective called Soft Score Matching that provably learns the
score function for any linear corruption process and yields state of the art
results for CelebA. Soft Score Matching incorporates the degradation process in
the network. Our new loss trains the model to predict a clean image,
\textit{that after corruption}, matches the diffused observation. We show that
our objective learns the gradient of the likelihood under suitable regularity
conditions for a family of corruption processes. We further develop a
principled way to select the corruption levels for general diffusion processes
and a novel sampling method that we call Momentum Sampler. We show
experimentally that our framework works for general linear corruption
processes, such as Gaussian blur and masking. We achieve a state-of-the-art FID
score of $1.85$ on CelebA-64, outperforming all previous linear diffusion models.
We also show significant computational benefits compared to vanilla denoising
diffusion.
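The training target described above ("predict a clean image that, after corruption, matches the diffused observation") can be sketched as follows. The corruption operator, noise level, and `model` interface are placeholders, and the paper's exact weighting may differ.

```python
# Hedged sketch of the training target as described in the abstract.
import torch
import torch.nn.functional as F


def soft_matching_loss(model, corrupt, x0, t, sigma_t):
    # corrupt(x, t) is an assumed linear corruption (e.g. blur) at level t.
    y_t = corrupt(x0, t) + sigma_t * torch.randn_like(x0)  # diffused observation
    x0_hat = model(y_t, t)                                  # predicted clean image
    # Compare the *corrupted* prediction with the diffused observation.
    return F.mse_loss(corrupt(x0_hat, t), y_t)
```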
Diffusion Models in Vision: A Survey
September 10, 2022
Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, Mubarak Shah
Denoising diffusion models represent a recent emerging topic in computer
vision, demonstrating remarkable results in the area of generative modeling. A
diffusion model is a deep generative model that is based on two stages, a
forward diffusion stage and a reverse diffusion stage. In the forward diffusion
stage, the input data is gradually perturbed over several steps by adding
Gaussian noise. In the reverse stage, a model is tasked with recovering the
original input data by learning to gradually reverse the diffusion process,
step by step. Diffusion models are widely appreciated for the quality and
diversity of the generated samples, despite their known computational burdens,
i.e. low speeds due to the high number of steps involved during sampling. In
this survey, we provide a comprehensive review of articles on denoising
diffusion models applied in vision, comprising both theoretical and practical
contributions in the field. First, we identify and present three generic
diffusion modeling frameworks, which are based on denoising diffusion
probabilistic models, noise conditioned score networks, and stochastic
differential equations. We further discuss the relations between diffusion
models and other deep generative models, including variational auto-encoders,
generative adversarial networks, energy-based models, autoregressive models and
normalizing flows. Then, we introduce a multi-perspective categorization of
diffusion models applied in computer vision. Finally, we illustrate the current
limitations of diffusion models and envision some interesting directions for
future research.
SE(3)-DiffusionFields: Learning cost functions for joint grasp and motion optimization through diffusion
September 08, 2022
Julen Urain, Niklas Funk, Jan Peters, Georgia Chalvatzaki
Multi-objective optimization problems are ubiquitous in robotics, e.g., the
optimization of a robot manipulation task requires a joint consideration of
grasp pose configurations, collisions and joint limits. While some demands can
be easily hand-designed, e.g., the smoothness of a trajectory, several
task-specific objectives need to be learned from data. This work introduces a
method for learning data-driven SE(3) cost functions as diffusion models.
Diffusion models can represent highly-expressive multimodal distributions and
exhibit proper gradients over the entire space due to their score-matching
training objective. Learning costs as diffusion models allows their seamless
integration with other costs into a single differentiable objective function,
enabling joint gradient-based motion optimization. In this work, we focus on
learning SE(3) diffusion models for 6DoF grasping, giving rise to a novel
framework for joint grasp and motion optimization without needing to decouple
grasp selection from trajectory generation. We evaluate the representation
power of our SE(3) diffusion models w.r.t. classical generative models, and we
showcase the superior performance of our proposed optimization framework in a
series of simulated and real-world robotic manipulation tasks against
representative baselines.
Unifying Generative Models with GFlowNets
September 06, 2022
Dinghuai Zhang, Ricky T. Q. Chen, Nikolay Malkin, Yoshua Bengio
There are many frameworks for deep generative modeling, each often presented
with their own specific training algorithms and inference methods. Here, we
demonstrate the connections between existing deep generative models and the
recently introduced GFlowNet framework, a probabilistic inference machine which
treats sampling as a decision-making process. This analysis sheds light on
their overlapping traits and provides a unifying viewpoint through the lens of
learning with Markovian trajectories. Our framework provides a means for
unifying training and inference algorithms, and provides a route to shine a
unifying light over many generative models. Beyond this, we provide a practical
and experimentally verified recipe for improving generative modeling with
insights from the GFlowNet perspective.
Instrument Separation of Symbolic Music by Explicitly Guided Diffusion Model
September 05, 2022
Sangjun Han, Hyeongrae Ihm, DaeHan Ahn, Woohyung Lim
Similar to colorization in computer vision, instrument separation is the task
of assigning instrument labels (e.g. piano, guitar, …) to notes from unlabeled
mixtures that contain only performance information. To address the problem, we
adopt diffusion models and explicitly guide them to preserve consistency
between mixtures and music. The quantitative results show that our proposed
model can generate high-fidelity samples for multitrack symbolic music with
creativity.
First Hitting Diffusion Models
September 02, 2022
Mao Ye, Lemeng Wu, Qiang Liu
We propose a family of First Hitting Diffusion Models (FHDM), deep generative
models that generate data with a diffusion process that terminates at a random
first hitting time. This yields an extension of the standard fixed-time
diffusion models that terminate at a pre-specified deterministic time. Although
standard diffusion models are designed for continuous unconstrained data, FHDM
is naturally designed to learn distributions on continuous as well as a range
of discrete and structured domains. Moreover, FHDM enables an instance-dependent
termination time and accelerates the diffusion process to sample higher quality
data with fewer diffusion steps. Technically, we train FHDM by maximum
likelihood estimation on diffusion trajectories augmented from observed data
with conditional first hitting processes (i.e., bridge) derived based on Doob’s
$h$-transform, deviating from the commonly used time-reversal mechanism. We
apply FHDM to generate data in various domains such as point cloud (general
continuous distribution), climate and geographical events on earth (continuous
distribution on the sphere), unweighted graphs (distribution of binary
matrices), and segmentation maps of 2D images (high-dimensional categorical
distribution). We observe considerable improvement compared with the
state-of-the-art approaches in both quality and speed.
September 02, 2022
Lemeng Wu, Chengyue Gong, Xingchao Liu, Mao Ye, Qiang Liu
AI-based molecule generation provides a promising approach to a large area of
biomedical sciences and engineering, such as antibody design, hydrolase
engineering, or vaccine development. Because the molecules are governed by
physical laws, a key challenge is to incorporate prior information into the
training procedure to generate high-quality and realistic molecules. We propose
a simple and novel approach to steer the training of diffusion-based generative
models with physical and statistics prior information. This is achieved by
constructing physically informed diffusion bridges, stochastic processes that
guarantee to yield a given observation at the fixed terminal time. We develop a
Lyapunov-function-based method to construct and determine bridges, and propose
a number of informative prior bridges for both high-quality molecule generation
and uniformity-promoted 3D point cloud generation. With comprehensive
experiments, we show that our method provides a powerful approach to the 3D
generation task, yielding molecule structures with better quality and stability
scores and more uniformly distributed point clouds of high quality.
Diffusion Models: A Comprehensive Survey of Methods and Applications
September 02, 2022
Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, Ming-Hsuan Yang
Diffusion models have emerged as a powerful new family of deep generative
models with record-breaking performance in many applications, including image
synthesis, video generation, and molecule design. In this survey, we provide an
overview of the rapidly expanding body of work on diffusion models,
categorizing the research into three key areas: efficient sampling, improved
likelihood estimation, and handling data with special structures. We also
discuss the potential for combining diffusion models with other generative
models for enhanced results. We further review the wide-ranging applications of
diffusion models in fields spanning from computer vision, natural language
generation, temporal data modeling, to interdisciplinary applications in other
scientific disciplines. This survey aims to provide a contextualized, in-depth
look at the state of diffusion models, identifying the key areas of focus and
pointing to potential areas for further exploration. Github:
https://github.com/YangLing0818/Diffusion-Models-Papers-Survey-Taxonomy.
MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model
August 31, 2022
Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, Ziwei Liu
Human motion modeling is important for many modern graphics applications,
which typically require professional skills. In order to remove the skill
barriers for laymen, recent motion generation methods can directly generate
human motions conditioned on natural languages. However, it remains challenging
to achieve diverse and fine-grained motion generation with various text inputs.
To address this problem, we propose MotionDiffuse, the first diffusion
model-based text-driven motion generation framework, which demonstrates several
desired properties over existing methods. 1) Probabilistic Mapping. Instead of
a deterministic language-motion mapping, MotionDiffuse generates motions
through a series of denoising steps in which variations are injected. 2)
Realistic Synthesis. MotionDiffuse excels at modeling complicated data
distribution and generating vivid motion sequences. 3) Multi-Level
Manipulation. MotionDiffuse responds to fine-grained instructions on body
parts and supports arbitrary-length motion synthesis with time-varying text prompts. Our
experiments show MotionDiffuse outperforms existing SoTA methods by convincing
margins on text-driven motion generation and action-conditioned motion
generation. A qualitative analysis further demonstrates MotionDiffuse’s
controllability for comprehensive motion generation. Homepage:
https://mingyuan-zhang.github.io/projects/MotionDiffuse.html
Let us Build Bridges: Understanding and Extending Diffusion Generative Models
August 31, 2022
Xingchao Liu, Lemeng Wu, Mao Ye, Qiang Liu
Diffusion-based generative models have achieved promising results recently,
but raise an array of open questions in terms of conceptual understanding,
theoretical analysis, algorithm improvement and extensions to discrete,
structured, non-Euclidean domains. This work tries to re-examine the overall
framework, in order to gain better theoretical understandings and develop
algorithmic extensions for data from arbitrary domains. By viewing diffusion
models as latent variable models with unobserved diffusion trajectories and
applying maximum likelihood estimation (MLE) with latent trajectories imputed
from an auxiliary distribution, we show that both the model construction and
the imputation of latent trajectories amount to constructing diffusion bridge
processes that achieve deterministic values and constraints at the end point,
for which we provide a systematic study and a suite of tools. Leveraging our
framework, we present 1) a first theoretical error analysis for learning
diffusion generation models, and 2) a simple and unified approach to learning
on data from different discrete and constrained domains. Experiments show that
our methods perform superbly on generating images, semantic segments and 3D
point clouds.
A Diffusion Model Predicts 3D Shapes from 2D Microscopy Images
August 30, 2022
Dominik J. E. Waibel, Ernst Röell, Bastian Rieck, Raja Giryes, Carsten Marr
cs.CV, cs.AI, cs.LG, stat.ML, 68-06
Diffusion models are a special type of generative model, capable of
synthesising new data from a learnt distribution. We introduce DISPR, a
diffusion-based model for solving the inverse problem of three-dimensional (3D)
cell shape prediction from two-dimensional (2D) single cell microscopy images.
Using the 2D microscopy image as a prior, DISPR is conditioned to predict
realistic 3D shape reconstructions. To showcase the applicability of DISPR as a
data augmentation tool in a feature-based single cell classification task, we
extract morphological features from the red blood cells grouped into six highly
imbalanced classes. Adding features from the DISPR predictions to the three
minority classes improved the macro F1 score from $F1_\text{macro} = 55.2 \pm
4.6\%$ to $F1_\text{macro} = 72.2 \pm 4.9\%$. We thus demonstrate that
diffusion models can be successfully applied to inverse biomedical problems,
and that they learn to reconstruct 3D shapes with realistic morphological
features from 2D microscopy images.
Frido: Feature Pyramid Diffusion for Complex Scene Image Synthesis
August 29, 2022
Wan-Cyuan Fan, Yen-Chun Chen, Dongdong Chen, Yu Cheng, Lu Yuan, Yu-Chiang Frank Wang
Diffusion models (DMs) have shown great potential for high-quality image
synthesis. However, when it comes to producing images with complex scenes, how
to properly describe both image global structures and object details remains a
challenging task. In this paper, we present Frido, a Feature Pyramid Diffusion
model performing a multi-scale coarse-to-fine denoising process for image
synthesis. Our model decomposes an input image into scale-dependent vector
quantized features, followed by a coarse-to-fine gating for producing image
output. During the above multi-scale representation learning stage, additional
input conditions like text, scene graph, or image layout can be further
exploited. Thus, Frido can be also applied for conditional or cross-modality
image synthesis. We conduct extensive experiments over various unconditioned
and conditional image generation tasks, ranging from text-to-image synthesis,
layout-to-image, scene-graph-to-image, to label-to-image. More specifically, we
achieved state-of-the-art FID scores on five benchmarks, namely layout-to-image
on COCO and OpenImages, scene-graph-to-image on COCO and Visual Genome, and
label-to-image on COCO. Code is available at
https://github.com/davidhalladay/Frido.
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation
August 25, 2022
Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, Kfir Aberman
Large text-to-image models achieved a remarkable leap in the evolution of AI,
enabling high-quality and diverse synthesis of images from a given text prompt.
However, these models lack the ability to mimic the appearance of subjects in a
given reference set and synthesize novel renditions of them in different
contexts. In this work, we present a new approach for “personalization” of
text-to-image diffusion models. Given as input just a few images of a subject,
we fine-tune a pretrained text-to-image model such that it learns to bind a
unique identifier with that specific subject. Once the subject is embedded in
the output domain of the model, the unique identifier can be used to synthesize
novel photorealistic images of the subject contextualized in different scenes.
By leveraging the semantic prior embedded in the model with a new autogenous
class-specific prior preservation loss, our technique enables synthesizing the
subject in diverse scenes, poses, views and lighting conditions that do not
appear in the reference images. We apply our technique to several
previously-unassailable tasks, including subject recontextualization,
text-guided view synthesis, and artistic rendering, all while preserving the
subject’s key features. We also provide a new dataset and evaluation protocol
for this new task of subject-driven generation. Project page:
https://dreambooth.github.io/
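A hedged sketch of the two-term fine-tuning objective described above: a standard denoising loss on the few subject images plus a prior-preservation denoising loss on class images generated beforehand by the frozen original model. The schedule handling, conditioning interface, and weighting are assumptions, not the released training code.

```python
# Hedged sketch of a denoising loss with class-specific prior preservation.
import torch
import torch.nn.functional as F


def dreambooth_loss(unet, x_subject, c_subject, x_class_prior, c_class,
                    alphas_bar, lambda_prior=1.0):
    def denoise_loss(x0, cond):
        t = torch.randint(0, len(alphas_bar), (x0.shape[0],))
        a = alphas_bar[t].view(-1, 1, 1, 1)
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
        return F.mse_loss(unet(x_t, t, cond), eps)

    # Reconstruction on the subject images (unique-identifier prompt) plus
    # prior preservation on class images generated by the frozen model.
    return denoise_loss(x_subject, c_subject) + \
        lambda_prior * denoise_loss(x_class_prior, c_class)
```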
Understanding Diffusion Models: A Unified Perspective
August 25, 2022
Calvin Luo
Diffusion models have shown incredible capabilities as generative models;
indeed, they power the current state-of-the-art models on text-conditioned
image generation such as Imagen and DALL-E 2. In this work we review,
demystify, and unify the understanding of diffusion models across both
variational and score-based perspectives. We first derive Variational Diffusion
Models (VDM) as a special case of a Markovian Hierarchical Variational
Autoencoder, where three key assumptions enable tractable computation and
scalable optimization of the ELBO. We then prove that optimizing a VDM boils
down to learning a neural network to predict one of three potential objectives:
the original source input from any arbitrary noisification of it, the original
source noise from any arbitrarily noisified input, or the score function of a
noisified input at any arbitrary noise level. We then dive deeper into what it
means to learn the score function, and connect the variational perspective of a
diffusion model explicitly with the Score-based Generative Modeling perspective
through Tweedie’s Formula. Lastly, we cover how to learn a conditional
distribution using diffusion models via guidance.
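The equivalence of the three prediction targets can be checked numerically for a variance-preserving forward process x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps: predicting eps, predicting x0, and predicting the conditional score are related by Tweedie-style identities. The sketch below only verifies this algebra; the values are illustrative.

```python
# Numeric check of the epsilon / x0 / score conversions for a VP forward process.
import torch

a_bar = torch.tensor(0.6, dtype=torch.float64)
x0 = torch.randn(5, dtype=torch.float64)
eps = torch.randn(5, dtype=torch.float64)
x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps

score = -eps / (1 - a_bar).sqrt()                       # score of q(x_t | x0)
x0_from_eps = (x_t - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()
x0_from_score = (x_t + (1 - a_bar) * score) / a_bar.sqrt()

print(torch.allclose(x0_from_eps, x0))       # True
print(torch.allclose(x0_from_score, x0))     # True
```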
AT-DDPM: Restoring Faces degraded by Atmospheric Turbulence using Denoising Diffusion Probabilistic Models
August 24, 2022
Nithin Gopalakrishnan Nair, Kangfu Mei, Vishal M. Patel
Although many long-range imaging systems are designed to support extended
vision applications, a natural obstacle to their operation is degradation due
to atmospheric turbulence. Atmospheric turbulence causes significant
degradation to image quality by introducing blur and geometric distortion. In
recent years, various deep learning-based single image atmospheric turbulence
mitigation methods, including CNN-based and GAN inversion-based, have been
proposed in the literature which attempt to remove the distortion in the image.
However, some of these methods are difficult to train and often fail to
reconstruct facial features and produce unrealistic results, especially in the
case of high turbulence. Denoising Diffusion Probabilistic Models (DDPMs) have
recently gained some traction because of their stable training process and
their ability to generate high quality images. In this paper, we propose the
first DDPM-based solution for the problem of atmospheric turbulence mitigation.
We also propose a fast sampling technique for reducing the inference times for
conditional DDPMs. Extensive experiments are conducted on synthetic and
real-world data to show the significance of our model. To facilitate further
research, all code and pretrained models are publicly available at
http://github.com/Nithin-GK/AT-DDPM
PointDP: Diffusion-driven Purification against Adversarial Attacks on 3D Point Cloud Recognition
August 21, 2022
Jiachen Sun, Weili Nie, Zhiding Yu, Z. Morley Mao, Chaowei Xiao
3D point clouds are becoming a critical data representation in many real-world
applications such as autonomous driving, robotics, and medical imaging. Although
the success of deep learning further accelerates the adoption of 3D point
clouds in the physical world, deep learning is notorious for its vulnerability
to adversarial attacks. In this work, we first identify that the
state-of-the-art empirical defense, adversarial training, has a major
limitation when applied to 3D point cloud models due to gradient obfuscation. We
further propose PointDP, a purification strategy that leverages diffusion
models to defend against 3D adversarial attacks. We extensively evaluate
PointDP on six representative 3D point cloud architectures, and leverage 10+
strong and adaptive attacks to demonstrate its lower-bound robustness. Our
evaluation shows that PointDP achieves significantly better robustness than
state-of-the-art purification methods under strong attacks. Results of
certified defenses on randomized smoothing combined with PointDP will be
included in the near future.
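A generic, hedged sketch of diffusion-based purification in the spirit of the defense above: diffuse the (possibly adversarial) input to an intermediate step and run the reverse chain back before classification. The schedule, `eps_model`, and choice of `t_star` are placeholders; the point-cloud-specific details of the method are not reproduced here.

```python
# Hedged, generic diffusion-purification sketch (placeholders throughout).
import torch


def purify(x_adv, eps_model, alphas, alphas_bar, t_star):
    # Forward-diffuse the input to intermediate step t_star.
    a_bar = alphas_bar[t_star]
    x = a_bar.sqrt() * x_adv + (1 - a_bar).sqrt() * torch.randn_like(x_adv)

    # Reverse (ancestral) denoising back to a clean sample.
    for t in range(t_star, 0, -1):
        a, a_bar = alphas[t], alphas_bar[t]
        eps = eps_model(x, t)
        mean = (x - (1 - a) / (1 - a_bar).sqrt() * eps) / a.sqrt()
        noise = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + (1 - a).sqrt() * noise     # simple choice sigma_t = sqrt(beta_t)
    return x
```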
August 20, 2022
François Mazé, Faez Ahmed
Structural topology optimization, which aims to find the optimal physical
structure that maximizes mechanical performance, is vital in engineering design
applications in aerospace, mechanical, and civil engineering. Generative
adversarial networks (GANs) have recently emerged as a popular alternative to
traditional iterative topology optimization methods. However, these models are
often difficult to train, have limited generalizability, and due to their goal
of mimicking optimal structures, neglect manufacturability and performance
objectives like mechanical compliance. We propose TopoDiff - a conditional
diffusion-model-based architecture to perform performance-aware and
manufacturability-aware topology optimization that overcomes these issues. Our
model introduces a surrogate model-based guidance strategy that actively favors
structures with low compliance and good manufacturability. Our method
significantly outperforms a state-of-art conditional GAN by reducing the
average error on physical performance by a factor of eight and by producing
eleven times fewer infeasible samples. By introducing diffusion models to
topology optimization, we show that conditional diffusion models have the
ability to outperform GANs in engineering design synthesis applications too.
Our work also suggests a general framework for engineering optimization
problems using diffusion models and external performance with constraint-aware
guidance. We publicly share the data, code, and trained models here:
https://decode.mit.edu/projects/topodiff/.
Diffusion-based Time Series Imputation and Forecasting with Structured State Space Models
August 19, 2022
Juan Miguel Lopez Alcaraz, Nils Strodthoff
The imputation of missing values represents a significant obstacle for many
real-world data analysis pipelines. Here, we focus on time series data and put
forward SSSD, an imputation model that relies on two emerging technologies,
(conditional) diffusion models as state-of-the-art generative models and
structured state space models as internal model architecture, which are
particularly suited to capture long-term dependencies in time series data. We
demonstrate that SSSD matches or even exceeds state-of-the-art probabilistic
imputation and forecasting performance on a broad range of data sets and
different missingness scenarios, including the challenging blackout-missing
scenarios, where prior approaches failed to provide meaningful results.
August 19, 2022
Arpit Bansal, Eitan Borgnia, Hong-Min Chu, Jie S. Li, Hamid Kazemi, Furong Huang, Micah Goldblum, Jonas Geiping, Tom Goldstein
Standard diffusion models involve an image transform – adding Gaussian noise
– and an image restoration operator that inverts this degradation. We observe
that the generative behavior of diffusion models is not strongly dependent on
the choice of image degradation, and in fact an entire family of generative
models can be constructed by varying this choice. Even when using completely
deterministic degradations (e.g., blur, masking, and more), the training and
test-time update rules that underlie diffusion models can be easily generalized
to create generative models. The success of these fully deterministic models
calls into question the community’s understanding of diffusion models, which
relies on noise in either gradient Langevin dynamics or variational inference,
and paves the way for generalized diffusion models that invert arbitrary
processes. Our code is available at
https://github.com/arpitbansal297/Cold-Diffusion-Models
Vector Quantized Diffusion Model with CodeUnet for Text-to-Sign Pose Sequences Generation
August 19, 2022
Pan Xie, Qipeng Zhang, Taiyi Peng, Hao Tang, Yao Du, Zexian Li
The Sign Language Production (SLP) project aims to automatically translate
spoken languages into sign sequences. Our approach focuses on the
transformation of sign gloss sequences into their corresponding sign pose
sequences (G2P). In this paper, we present a novel solution for this task by
converting the continuous pose space generation problem into a discrete
sequence generation problem. We introduce the Pose-VQVAE framework, which
combines Variational Autoencoders (VAEs) with vector quantization to produce a
discrete latent representation for continuous pose sequences. Additionally, we
propose the G2P-DDM model, a discrete denoising diffusion architecture for
length-varied discrete sequence data, to model the latent prior. To further
enhance the quality of pose sequence generation in the discrete space, we
present the CodeUnet model to leverage spatial-temporal information. Lastly, we
develop a heuristic sequential clustering method to predict variable lengths of
pose sequences for corresponding gloss sequences. Our results show that our
model outperforms state-of-the-art G2P models on the public SLP evaluation
benchmark. For more generated results, please visit our project page:
\textcolor{blue}{\url{https://slpdiffusier.github.io/g2p-ddm}}
Enhancing Diffusion-Based Image Synthesis with Robust Classifier Guidance
August 18, 2022
Bahjat Kawar, Roy Ganz, Michael Elad
Denoising diffusion probabilistic models (DDPMs) are a recent family of
generative models that achieve state-of-the-art results. In order to obtain
class-conditional generation, it was suggested to guide the diffusion process
by gradients from a time-dependent classifier. While the idea is theoretically
sound, deep learning-based classifiers are infamously susceptible to
gradient-based adversarial attacks. Therefore, while traditional classifiers
may achieve good accuracy scores, their gradients are possibly unreliable and
might hinder the improvement of the generation results. Recent work discovered
that adversarially robust classifiers exhibit gradients that are aligned with
human perception, and these could better guide a generative process towards
semantically meaningful images. We utilize this observation by defining and
training a time-dependent adversarially robust classifier and use it as
guidance for a generative diffusion model. In experiments on the highly
challenging and diverse ImageNet dataset, our scheme introduces significantly
more intelligible intermediate gradients, better alignment with theoretical
findings, as well as improved generation results under several evaluation
metrics. Furthermore, we conduct an opinion survey whose findings indicate that
human raters prefer our method’s results.
Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model
August 16, 2022
Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, Shihao Ji
Diffusion Denoising Probability Models (DDPM) and Vision Transformer (ViT)
have demonstrated significant progress in generative tasks and discriminative
tasks, respectively, and thus far these models have largely been developed in
their own domains. In this paper, we establish a direct connection between DDPM
and ViT by integrating the ViT architecture into DDPM, and introduce a new
generative model called Generative ViT (GenViT). The modeling flexibility of
ViT enables us to further extend GenViT to hybrid discriminative-generative
modeling, and introduce a Hybrid ViT (HybViT). Our work is among the first to
explore a single ViT for image generation and classification jointly. We
conduct a series of experiments to analyze the performance of proposed models
and demonstrate their superiority over prior state-of-the-arts in both
generative and discriminative tasks. Our code and pre-trained models can be
found in https://github.com/sndnyang/Diffusion_ViT .
Langevin Diffusion Variational Inference
August 16, 2022
Tomas Geffner, Justin Domke
Many methods that build powerful variational distributions based on
unadjusted Langevin transitions exist. Most of these were developed using a
wide range of different approaches and techniques. Unfortunately, the lack of a
unified analysis and derivation makes developing new methods and reasoning
about existing ones a challenging task. We address this giving a single
analysis that unifies and generalizes these existing techniques. The main idea
is to augment the target and variational by numerically simulating the
underdamped Langevin diffusion process and its time reversal. The benefits of
this approach are twofold: it provides a unified formulation for many existing
methods, and it simplifies the development of new ones. In fact, using our
formulation we propose a new method that combines the strengths of previously
existing algorithms; it uses underdamped Langevin transitions and powerful
augmentations parameterized by a score network. Our empirical evaluation shows
that our proposed method consistently outperforms relevant baselines in a wide
range of tasks.
Score-Based Diffusion meets Annealed Importance Sampling
August 16, 2022
Arnaud Doucet, Will Grathwohl, Alexander G. D. G. Matthews, Heiko Strathmann
More than twenty years after its introduction, Annealed Importance Sampling
(AIS) remains one of the most effective methods for marginal likelihood
estimation. It relies on a sequence of distributions interpolating between a
tractable initial distribution and the target distribution of interest which we
simulate from approximately using a non-homogeneous Markov chain. To obtain an
importance sampling estimate of the marginal likelihood, AIS introduces an
extended target distribution to reweight the Markov chain proposal. While much
effort has been devoted to improving the proposal distribution used by AIS, an
underappreciated issue is that AIS uses a convenient but suboptimal extended
target distribution. We here leverage recent progress in score-based generative
modeling (SGM) to approximate the optimal extended target distribution
minimizing the variance of the marginal likelihood estimate for AIS proposals
corresponding to the discretization of Langevin and Hamiltonian dynamics. We
demonstrate these novel, differentiable, AIS procedures on a number of
synthetic benchmark distributions and variational auto-encoders.
Applying Regularized Schrödinger-Bridge-Based Stochastic Process in Generative Modeling
August 15, 2022
Ki-Ung Song
Compared to the existing function-based models in deep generative modeling,
the recently proposed diffusion models have achieved outstanding performance
with a stochastic-process-based approach. But a long sampling time is required
for this approach due to many timesteps for discretization. Schr"odinger
bridge (SB)-based models attempt to tackle this problem by training
bidirectional stochastic processes between distributions. However, they still
have a slow sampling speed compared to generative models such as generative
adversarial networks. And due to the training of the bidirectional stochastic
processes, they require a relatively long training time. Therefore, this study
tried to reduce the number of timesteps and training time required and proposed
regularization terms to the existing SB models to make the bidirectional
stochastic processes consistent and stable with a reduced number of timesteps.
Each regularization term was integrated into a single term to enable more
efficient training in computation time and memory usage. Applying this
regularized stochastic process to various generation tasks, the desired
translations between different distributions were obtained, and accordingly,
the possibility of generative modeling based on a stochastic process with
faster sampling speed could be confirmed. The code is available at
https://github.com/KiUngSong/RSB.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning
August 12, 2022
Zhendong Wang, Jonathan J Hunt, Mingyuan Zhou
Offline reinforcement learning (RL), which aims to learn an optimal policy
using a previously collected static dataset, is an important paradigm of RL.
Standard RL methods often perform poorly in this regime due to the function
approximation errors on out-of-distribution actions. While a variety of
regularization methods have been proposed to mitigate this issue, they are
often constrained by policy classes with limited expressiveness that can lead
to highly suboptimal solutions. In this paper, we propose representing the
policy as a diffusion model, a recent class of highly-expressive deep
generative models. We introduce Diffusion Q-learning (Diffusion-QL) that
utilizes a conditional diffusion model to represent the policy. In our
approach, we learn an action-value function and we add a term maximizing
action-values into the training loss of the conditional diffusion model, which
results in a loss that seeks optimal actions that are near the behavior policy.
We show the expressiveness of the diffusion model-based policy, and the
coupling of the behavior cloning and policy improvement under the diffusion
model both contribute to the outstanding performance of Diffusion-QL. We
illustrate the superiority of our method compared to prior works in a simple 2D
bandit example with a multimodal behavior policy. We then show that our method
can achieve state-of-the-art performance on the majority of the D4RL benchmark
tasks.
Convergence of denoising diffusion models under the manifold hypothesis
August 10, 2022
Valentin De Bortoli
Denoising diffusion models are a recent class of generative models exhibiting
state-of-the-art performance in image and audio synthesis. Such models
approximate the time-reversal of a forward noising process from a target
distribution to a reference density, which is usually Gaussian. Despite their
strong empirical results, the theoretical analysis of such models remains
limited. In particular, all current approaches crucially assume that the target
density admits a density w.r.t. the Lebesgue measure. This does not cover
settings where the target distribution is supported on a lower-dimensional
manifold or is given by some empirical distribution. In this paper, we bridge
this gap by providing the first convergence results for diffusion models in
this more general setting. In particular, we provide quantitative bounds on the
Wasserstein distance of order one between the target data distribution and the
generative distribution of the diffusion model.
Wavelet Score-Based Generative Modeling
August 09, 2022
Florentin Guth, Simon Coste, Valentin De Bortoli, Stephane Mallat
Score-based generative models (SGMs) synthesize new data samples from
Gaussian white noise by running a time-reversed Stochastic Differential
Equation (SDE) whose drift coefficient depends on some probabilistic score. The
discretization of such SDEs typically requires a large number of time steps and
hence a high computational cost. This is because of ill-conditioning properties
of the score that we analyze mathematically. We show that SGMs can be
considerably accelerated, by factorizing the data distribution into a product
of conditional probabilities of wavelet coefficients across scales. The
resulting Wavelet Score-based Generative Model (WSGM) synthesizes wavelet
coefficients with the same number of time steps at all scales, and its time
complexity therefore grows linearly with the image size. This is proved
mathematically over Gaussian distributions, and shown numerically over physical
processes at phase transition and natural image datasets.
Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning
August 08, 2022
Ting Chen, Ruixiang Zhang, Geoffrey Hinton
cs.CV, cs.AI, cs.CL, cs.LG
We present Bit Diffusion: a simple and generic approach for generating
discrete data with continuous state and continuous time diffusion models. The
main idea behind our approach is to first represent the discrete data as binary
bits, and then train a continuous diffusion model to model these bits as real
numbers which we call analog bits. To generate samples, the model first
generates the analog bits, which are then thresholded to obtain the bits that
represent the discrete variables. We further propose two simple techniques,
namely Self-Conditioning and Asymmetric Time Intervals, which lead to a
significant improvement in sample quality. Despite its simplicity, the proposed
approach can achieve strong performance in both discrete image generation and
image captioning tasks. For discrete image generation, we significantly improve
previous state-of-the-art on both CIFAR-10 (which has 3K discrete 8-bit tokens)
and ImageNet-64x64 (which has 12K discrete 8-bit tokens), outperforming the
best autoregressive model in both sample quality (measured by FID) and
efficiency. For image captioning on MS-COCO dataset, our approach achieves
competitive results compared to autoregressive models.
Pyramidal Denoising Diffusion Probabilistic Models
August 03, 2022
Dohoon Ryu, Jong Chul Ye
Recently, diffusion model have demonstrated impressive image generation
performances, and have been extensively studied in various computer vision
tasks. Unfortunately, training and evaluating diffusion models consume a lot of
time and computational resources. To address this problem, here we present a
novel pyramidal diffusion model that can generate high resolution images
starting from much coarser resolution images using a {\em single} score
function trained with a positional embedding. This enables a neural network to
be much lighter and also enables time-efficient image generation without
compromising its performances. Furthermore, we show that the proposed approach
can be also efficiently used for multi-scale super-resolution problem using a
single score function.
Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models
July 26, 2022
Robin Rombach, Andreas Blattmann, Björn Ommer
Novel architectures have recently improved generative image synthesis leading
to excellent visual quality in various tasks. Of particular note is the field
of AI-Art'', which has seen unprecedented growth with the emergence of
powerful multimodal models such as CLIP. By combining speech and image
synthesis models, so-called
prompt-engineering’’ has become established, in
which carefully selected and composed sentences are used to achieve a certain
visual style in the synthesized image. In this note, we present an alternative
approach based on retrieval-augmented diffusion models (RDMs). In RDMs, a set
of nearest neighbors is retrieved from an external database during training for
each training instance, and the diffusion model is conditioned on these
informative samples. During inference (sampling), we replace the retrieval
database with a more specialized database that contains, for example, only
images of a particular visual style. This provides a novel way to prompt a
general trained model after training and thereby specify a particular visual
style. As shown by our experiments, this approach is superior to specifying the
visual style within the text prompt. We open-source code and model weights at
https://github.com/CompVis/latent-diffusion .
Classifier-Free Diffusion Guidance
July 26, 2022
Jonathan Ho, Tim Salimans
Classifier guidance is a recently introduced method to trade off mode
coverage and sample fidelity in conditional diffusion models post training, in
the same spirit as low temperature sampling or truncation in other types of
generative models. Classifier guidance combines the score estimate of a
diffusion model with the gradient of an image classifier and thereby requires
training an image classifier separate from the diffusion model. It also raises
the question of whether guidance can be performed without a classifier. We show
that guidance can be indeed performed by a pure generative model without such a
classifier: in what we call classifier-free guidance, we jointly train a
conditional and an unconditional diffusion model, and we combine the resulting
conditional and unconditional score estimates to attain a trade-off between
sample quality and diversity similar to that obtained using classifier
guidance.
July 20, 2022
Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb, Christian Etmann
Diffusion models have emerged as one of the most promising frameworks for
deep generative modeling. In this work, we explore the potential of non-uniform
diffusion models. We show that non-uniform diffusion leads to multi-scale
diffusion models which have similar structure to this of multi-scale
normalizing flows. We experimentally find that in the same or less training
time, the multi-scale diffusion model achieves better FID score than the
standard uniform diffusion model. More importantly, it generates samples $4.4$
times faster in $128\times 128$ resolution. The speed-up is expected to be
higher in higher resolutions where more scales are used. Moreover, we show that
non-uniform diffusion leads to a novel estimator for the conditional score
function which achieves on par performance with the state-of-the-art
conditional denoising estimator. Our theoretical and experimental findings are
accompanied by an open source library MSDiff which can facilitate further
research of non-uniform diffusion models.
Unsupervised Medical Image Translation with Adversarial Diffusion Models
July 17, 2022
Muzaffer Özbey, Onat Dalmaz, Salman UH Dar, Hasan A Bedel, Şaban Özturk, Alper Güngör, Tolga Çukur
Imputation of missing images via source-to-target modality translation can
improve diversity in medical imaging protocols. A pervasive approach for
synthesizing target images involves one-shot mapping through generative
adversarial networks (GAN). Yet, GAN models that implicitly characterize the
image distribution can suffer from limited sample fidelity. Here, we propose a
novel method based on adversarial diffusion modeling, SynDiff, for improved
performance in medical image translation. To capture a direct correlate of the
image distribution, SynDiff leverages a conditional diffusion process that
progressively maps noise and source images onto the target image. For fast and
accurate image sampling during inference, large diffusion steps are taken with
adversarial projections in the reverse diffusion direction. To enable training
on unpaired datasets, a cycle-consistent architecture is devised with coupled
diffusive and non-diffusive modules that bilaterally translate between two
modalities. Extensive assessments are reported on the utility of SynDiff
against competing GAN and diffusion models in multi-contrast MRI and MRI-CT
translation. Our demonstrations indicate that SynDiff offers quantitatively and
qualitatively superior performance against competing baselines.
Threat Model-Agnostic Adversarial Defense using Diffusion Models
July 17, 2022
Tsachi Blau, Roy Ganz, Bahjat Kawar, Alex Bronstein, Michael Elad
Deep Neural Networks (DNNs) are highly sensitive to imperceptible malicious
perturbations, known as adversarial attacks. Following the discovery of this
vulnerability in real-world imaging and vision applications, the associated
safety concerns have attracted vast research attention, and many defense
techniques have been developed. Most of these defense methods rely on
adversarial training (AT) – training the classification network on images
perturbed according to a specific threat model, which defines the magnitude of
the allowed modification. Although AT leads to promising results, training on a
specific threat model fails to generalize to other types of perturbations. A
different approach utilizes a preprocessing step to remove the adversarial
perturbation from the attacked image. In this work, we follow the latter path
and aim to develop a technique that leads to robust classifiers across various
realizations of threat models. To this end, we harness the recent advances in
stochastic generative modeling, and means to leverage these for sampling from
conditional distributions. Our defense relies on an addition of Gaussian i.i.d
noise to the attacked image, followed by a pretrained diffusion process – an
architecture that performs a stochastic iterative process over a denoising
network, yielding a high perceptual quality denoised outcome. The obtained
robustness with this stochastic preprocessing step is validated through
extensive experiments on the CIFAR-10 dataset, showing that our method
outperforms the leading defense methods under various threat models.
DiffuStereo: High Quality Human Reconstruction via Diffusion-based Stereo Using Sparse Cameras
July 16, 2022
Ruizhi Shao, Zerong Zheng, Hongwen Zhang, Jingxiang Sun, Yebin Liu
We propose DiffuStereo, a novel system using only sparse cameras (8 in this
work) for high-quality 3D human reconstruction. At its core is a novel
diffusion-based stereo module, which introduces diffusion models, a type of
powerful generative models, into the iterative stereo matching network. To this
end, we design a new diffusion kernel and additional stereo constraints to
facilitate stereo matching and depth estimation in the network. We further
present a multi-level stereo network architecture to handle high-resolution (up
to 4k) inputs without requiring unaffordable memory footprint. Given a set of
sparse-view color images of a human, the proposed multi-level diffusion-based
stereo network can produce highly accurate depth maps, which are then converted
into a high-quality 3D human model through an efficient multi-view fusion
strategy. Overall, our method enables automatic reconstruction of human models
with quality on par to high-end dense-view camera rigs, and this is achieved
using a much more light-weight hardware setup. Experiments show that our method
outperforms state-of-the-art methods by a large margin both qualitatively and
quantitatively.
Progressive Deblurring of Diffusion Models for Coarse-to-Fine Image Synthesis
July 16, 2022
Sangyun Lee, Hyungjin Chung, Jaehyeon Kim, Jong Chul Ye
Recently, diffusion models have shown remarkable results in image synthesis
by gradually removing noise and amplifying signals. Although the simple
generative process surprisingly works well, is this the best way to generate
image data? For instance, despite the fact that human perception is more
sensitive to the low frequencies of an image, diffusion models themselves do
not consider any relative importance of each frequency component. Therefore, to
incorporate the inductive bias for image data, we propose a novel generative
process that synthesizes images in a coarse-to-fine manner. First, we
generalize the standard diffusion models by enabling diffusion in a rotated
coordinate system with different velocities for each component of the vector.
We further propose a blur diffusion as a special case, where each frequency
component of an image is diffused at different speeds. Specifically, the
proposed blur diffusion consists of a forward process that blurs an image and
adds noise gradually, after which a corresponding reverse process deblurs an
image and removes noise progressively. Experiments show that the proposed model
outperforms the previous method in FID on LSUN bedroom and church datasets.
Code is available at https://github.com/sangyun884/blur-diffusion.
EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations
July 14, 2022
Min Zhao, Fan Bao, Chongxuan Li, Jun Zhu
Score-based diffusion models (SBDMs) have achieved the SOTA FID results in
unpaired image-to-image translation (I2I). However, we notice that existing
methods totally ignore the training data in the source domain, leading to
sub-optimal solutions for unpaired I2I. To this end, we propose energy-guided
stochastic differential equations (EGSDE) that employs an energy function
pretrained on both the source and target domains to guide the inference process
of a pretrained SDE for realistic and faithful unpaired I2I. Building upon two
feature extractors, we carefully design the energy function such that it
encourages the transferred image to preserve the domain-independent features
and discard domain-specific ones. Further, we provide an alternative
explanation of the EGSDE as a product of experts, where each of the three
experts (corresponding to the SDE and two feature extractors) solely
contributes to faithfulness or realism. Empirically, we compare EGSDE to a
large family of baselines on three widely-adopted unpaired I2I tasks under four
metrics. EGSDE not only consistently outperforms existing SBDMs-based methods
in almost all settings but also achieves the SOTA realism results without
harming the faithful performance. Furthermore, EGSDE allows for flexible
trade-offs between realism and faithfulness and we improve the realism results
further (e.g., FID of 51.04 in Cat to Dog and FID of 50.43 in Wild to Dog on
AFHQ) by tuning hyper-parameters. The code is available at
https://github.com/ML-GSAI/EGSDE.
Adaptive Diffusion Priors for Accelerated MRI Reconstruction
July 12, 2022
Alper Güngör, Salman UH Dar, Şaban Öztürk, Yilmaz Korkmaz, Gokberk Elmas, Muzaffer Özbey, Tolga Çukur
Deep MRI reconstruction is commonly performed with conditional models that
de-alias undersampled acquisitions to recover images consistent with
fully-sampled data. Since conditional models are trained with knowledge of the
imaging operator, they can show poor generalization across variable operators.
Unconditional models instead learn generative image priors decoupled from the
operator to improve reliability against domain shifts related to the imaging
operator. Recent diffusion models are particularly promising given their high
sample fidelity. Nevertheless, inference with a static image prior can perform
suboptimally. Here we propose the first adaptive diffusion prior for MRI
reconstruction, AdaDiff, to improve performance and reliability against domain
shifts. AdaDiff leverages an efficient diffusion prior trained via adversarial
mapping over large reverse diffusion steps. A two-phase reconstruction is
executed following training: a rapid-diffusion phase that produces an initial
reconstruction with the trained prior, and an adaptation phase that further
refines the result by updating the prior to minimize data-consistency loss.
Demonstrations on multi-contrast brain MRI clearly indicate that AdaDiff
outperforms competing conditional and unconditional methods under domain
shifts, and achieves superior or on par within-domain performance.
Improving Diffusion Model Efficiency Through Patching
July 09, 2022
Troy Luhman, Eric Luhman
Diffusion models are a powerful class of generative models that iteratively
denoise samples to produce data. While many works have focused on the number of
iterations in this sampling procedure, few have focused on the cost of each
iteration. We find that adding a simple ViT-style patching transformation can
considerably reduce a diffusion model’s sampling time and memory usage. We
justify our approach both through an analysis of the diffusion model objective,
and through empirical experiments on LSUN Church, ImageNet 256, and FFHQ 1024.
We provide implementations in Tensorflow and Pytorch.
Back to the Source: Diffusion-Driven Test-Time Adaptation
July 07, 2022
Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan Shelhamer, Dequan Wang
Test-time adaptation harnesses test inputs to improve the accuracy of a model
trained on source data when tested on shifted target data. Existing methods
update the source model by (re-)training on each target domain. While
effective, re-training is sensitive to the amount and order of the data and the
hyperparameters for optimization. We instead update the target data, by
projecting all test inputs toward the source domain with a generative diffusion
model. Our diffusion-driven adaptation method, DDA, shares its models for
classification and generation across all domains. Both models are trained on
the source domain, then fixed during testing. We augment diffusion with image
guidance and self-ensembling to automatically decide how much to adapt. Input
adaptation by DDA is more robust than prior model adaptation approaches across
a variety of corruptions, architectures, and data regimes on the ImageNet-C
benchmark. With its input-wise updates, DDA succeeds where model adaptation
degrades on too little data in small batches, dependent data in non-uniform
order, or mixed data with multiple corruptions.
A Novel Unified Conditional Score-based Generative Framework for Multi-modal Medical Image Completion
July 07, 2022
Xiangxi Meng, Yuning Gu, Yongsheng Pan, Nizhuan Wang, Peng Xue, Mengkang Lu, Xuming He, Yiqiang Zhan, Dinggang Shen
Multi-modal medical image completion has been extensively applied to
alleviate the missing modality issue in a wealth of multi-modal diagnostic
tasks. However, for most existing synthesis methods, their inferences of
missing modalities can collapse into a deterministic mapping from the available
ones, ignoring the uncertainties inherent in the cross-modal relationships.
Here, we propose the Unified Multi-Modal Conditional Score-based Generative
Model (UMM-CSGM) to take advantage of Score-based Generative Model (SGM) in
modeling and stochastically sampling a target probability distribution, and
further extend SGM to cross-modal conditional synthesis for various
missing-modality configurations in a unified framework. Specifically, UMM-CSGM
employs a novel multi-in multi-out Conditional Score Network (mm-CSN) to learn
a comprehensive set of cross-modal conditional distributions via conditional
diffusion and reverse generation in the complete modality space. In this way,
the generation process can be accurately conditioned by all available
information, and can fit all possible configurations of missing modalities in a
single network. Experiments on BraTS19 dataset show that the UMM-CSGM can more
reliably synthesize the heterogeneous enhancement and irregular area in
tumor-induced lesions for any missing modalities.
Riemannian Diffusion Schrödinger Bridge
July 07, 2022
James Thornton, Michael Hutchinson, Emile Mathieu, Valentin De Bortoli, Yee Whye Teh, Arnaud Doucet
Score-based generative models exhibit state of the art performance on density
estimation and generative modeling tasks. These models typically assume that
the data geometry is flat, yet recent extensions have been developed to
synthesize data living on Riemannian manifolds. Existing methods to accelerate
sampling of diffusion models are typically not applicable in the Riemannian
setting and Riemannian score-based methods have not yet been adapted to the
important task of interpolation of datasets. To overcome these issues, we
introduce \emph{Riemannian Diffusion Schr"odinger Bridge}. Our proposed method
generalizes Diffusion Schr"odinger Bridge introduced in
\cite{debortoli2021neurips} to the non-Euclidean setting and extends Riemannian
score-based models beyond the first time reversal. We validate our proposed
method on synthetic data and real Earth and climate data.
Semantic Image Synthesis via Diffusion Models
June 30, 2022
Weilun Wang, Jianmin Bao, Wengang Zhou, Dongdong Chen, Dong Chen, Lu Yuan, Houqiang Li
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable
success in various image generation tasks compared with Generative Adversarial
Nets (GANs). Recent work on semantic image synthesis mainly follows the
\emph{de facto} GAN-based approaches, which may lead to unsatisfactory quality
or diversity of generated images. In this paper, we propose a novel framework
based on DDPM for semantic image synthesis. Unlike previous conditional
diffusion model directly feeds the semantic layout and noisy image as input to
a U-Net structure, which may not fully leverage the information in the input
semantic mask, our framework processes semantic layout and noisy image
differently. It feeds noisy image to the encoder of the U-Net structure while
the semantic layout to the decoder by multi-layer spatially-adaptive
normalization operators. To further improve the generation quality and semantic
interpretability in semantic image synthesis, we introduce the classifier-free
guidance sampling strategy, which acknowledge the scores of an unconditional
model for sampling process. Extensive experiments on three benchmark datasets
demonstrate the effectiveness of our proposed method, achieving
state-of-the-art performance in terms of fidelity (FID) and diversity (LPIPS).
SPI-GAN: Distilling Score-based Generative Models with Straight-Path Interpolations
June 29, 2022
Jinsung Jeon, Noseong Park
Score-based generative models (SGMs) are a recently proposed paradigm for
deep generative tasks and now show the state-of-the-art sampling performance.
It is known that the original SGM design solves the two problems of the
generative trilemma: i) sampling quality, and ii) sampling diversity. However,
the last problem of the trilemma was not solved, i.e., their training/sampling
complexity is notoriously high. To this end, distilling SGMs into simpler
models, e.g., generative adversarial networks (GANs), is gathering much
attention currently. We present an enhanced distillation method, called
straight-path interpolation GAN (SPI-GAN), which can be compared to the
state-of-the-art shortcut-based distillation method, called denoising diffusion
GAN (DD-GAN). However, our method corresponds to an extreme method that does
not use any intermediate shortcut information of the reverse SDE path, in which
case DD-GAN fails to obtain good results. Nevertheless, our straight-path
interpolation method greatly stabilizes the overall training process. As a
result, SPI-GAN is one of the best models in terms of the sampling
quality/diversity/time for CIFAR-10, CelebA-HQ-256, and LSUN-Church-256.
June 27, 2022
Boah Kim, Jong Chul Ye
Temporal volume images with 3D+t (4D) information are often used in medical
imaging to statistically analyze temporal dynamics or capture disease
progression. Although deep-learning-based generative models for natural images
have been extensively studied, approaches for temporal medical image generation
such as 4D cardiac volume data are limited. In this work, we present a novel
deep learning model that generates intermediate temporal volumes between source
and target volumes. Specifically, we propose a diffusion deformable model (DDM)
by adapting the denoising diffusion probabilistic model that has recently been
widely investigated for realistic image generation. Our proposed DDM is
composed of the diffusion and the deformation modules so that DDM can learn
spatial deformation information between the source and target volumes and
provide a latent code for generating intermediate frames along a geodesic path.
Once our model is trained, the latent code estimated from the diffusion module
is simply interpolated and fed into the deformation module, which enables DDM
to generate temporal frames along the continuous trajectory while preserving
the topology of the source image. We demonstrate the proposed method with the
4D cardiac MR image generation between the diastolic and systolic phases for
each subject. Compared to the existing deformation methods, our DDM achieves
high performance on temporal volume generation.
DDPM-CD: Remote Sensing Change Detection using Denoising Diffusion Probabilistic Models
June 23, 2022
Wele Gedara Chaminda Bandara, Nithin Gopalakrishnan Nair, Vishal M. Patel
Remote sensing change detection is crucial for understanding the dynamics of
our planet’s surface, facilitating the monitoring of environmental changes,
evaluating human impact, predicting future trends, and supporting
decision-making. In this work, we introduce a novel approach for change
detection that can leverage off-the-shelf, unlabeled remote sensing images in
the training process by pre-training a Denoising Diffusion Probabilistic Model
(DDPM) - a class of generative models used in image synthesis. DDPMs learn the
training data distribution by gradually converting training images into a
Gaussian distribution using a Markov chain. During inference (i.e., sampling),
they can generate a diverse set of samples closer to the training distribution,
starting from Gaussian noise, achieving state-of-the-art image synthesis
results. However, in this work, our focus is not on image synthesis but on
utilizing it as a pre-trained feature extractor for the downstream application
of change detection. Specifically, we fine-tune a lightweight change classifier
utilizing the feature representations produced by the pre-trained DDPM
alongside change labels. Experiments conducted on the LEVIR-CD, WHU-CD,
DSIFN-CD, and CDD datasets demonstrate that the proposed DDPM-CD method
significantly outperforms the existing state-of-the-art change detection
methods in terms of F1 score, IoU, and overall accuracy, highlighting the
pivotal role of pre-trained DDPM as a feature extractor for downstream
applications. We have made both the code and pre-trained models available at
https://github.com/wgcban/ddpm-cd
Guided Diffusion Model for Adversarial Purification from Random Noise
June 22, 2022
Quanlin Wu, Hang Ye, Yuntian Gu
In this paper, we propose a novel guided diffusion purification approach to
provide a strong defense against adversarial attacks. Our model achieves 89.62%
robust accuracy under PGD-L_inf attack (eps = 8/255) on the CIFAR-10 dataset.
We first explore the essential correlations between unguided diffusion models
and randomized smoothing, enabling us to apply the models to certified
robustness. The empirical results show that our models outperform randomized
smoothing by 5% when the certified L2 radius r is larger than 0.5.
(Certified!!) Adversarial Robustness for Free!
June 21, 2022
Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, J. Zico Kolter
In this paper we show how to achieve state-of-the-art certified adversarial
robustness to 2-norm bounded perturbations by relying exclusively on
off-the-shelf pretrained models. To do so, we instantiate the denoised
smoothing approach of Salman et al. 2020 by combining a pretrained denoising
diffusion probabilistic model and a standard high-accuracy classifier. This
allows us to certify 71% accuracy on ImageNet under adversarial perturbations
constrained to be within an 2-norm of 0.5, an improvement of 14 percentage
points over the prior certified SoTA using any approach, or an improvement of
30 percentage points over denoised smoothing. We obtain these results using
only pretrained diffusion models and image classifiers, without requiring any
fine tuning or retraining of model parameters.
Faster Diffusion Cardiac MRI with Deep Learning-based breath hold reduction
June 21, 2022
Michael Tanzer, Pedro Ferreira, Andrew Scott, Zohya Khalique, Maria Dwornik, Dudley Pennell, Guang Yang, Daniel Rueckert, Sonia Nielles-Vallespin
Diffusion Tensor Cardiac Magnetic Resonance (DT-CMR) enables us to probe the
microstructural arrangement of cardiomyocytes within the myocardium in vivo and
non-invasively, which no other imaging modality allows. This innovative
technology could revolutionise the ability to perform cardiac clinical
diagnosis, risk stratification, prognosis and therapy follow-up. However,
DT-CMR is currently inefficient with over six minutes needed to acquire a
single 2D static image. Therefore, DT-CMR is currently confined to research but
not used clinically. We propose to reduce the number of repetitions needed to
produce DT-CMR datasets and subsequently de-noise them, decreasing the
acquisition time by a linear factor while maintaining acceptable image quality.
Our proposed approach, based on Generative Adversarial Networks, Vision
Transformers, and Ensemble Learning, performs significantly and considerably
better than previous proposed approaches, bringing single breath-hold DT-CMR
closer to reality.
Generative Modelling With Inverse Heat Dissipation
June 21, 2022
Severi Rissanen, Markus Heinonen, Arno Solin
While diffusion models have shown great success in image generation, their
noise-inverting generative process does not explicitly consider the structure
of images, such as their inherent multi-scale nature. Inspired by diffusion
models and the empirical success of coarse-to-fine modelling, we propose a new
diffusion-like model that generates images through stochastically reversing the
heat equation, a PDE that locally erases fine-scale information when run over
the 2D plane of the image. We interpret the solution of the forward heat
equation with constant additive noise as a variational approximation in the
diffusion latent variable model. Our new model shows emergent qualitative
properties not seen in standard diffusion models, such as disentanglement of
overall colour and shape in images. Spectral analysis on natural images
highlights connections to diffusion models and reveals an implicit
coarse-to-fine inductive bias in them.
June 18, 2022
Giannis Daras, Yuval Dagan, Alexandros G. Dimakis, Constantinos Daskalakis
We prove fast mixing and characterize the stationary distribution of the
Langevin Algorithm for inverting random weighted DNN generators. This result
extends the work of Hand and Voroninski from efficient inversion to efficient
posterior sampling. In practice, to allow for increased expressivity, we
propose to do posterior sampling in the latent space of a pre-trained
generative model. To achieve that, we train a score-based model in the latent
space of a StyleGAN-2 and we use it to solve inverse problems. Our framework,
Score-Guided Intermediate Layer Optimization (SGILO), extends prior work by
replacing the sparsity regularization with a generative prior in the
intermediate layer. Experimentally, we obtain significant improvements over the
previous state-of-the-art, especially in the low measurement regime.
Diffusion models as plug-and-play priors
June 17, 2022
Alexandros Graikos, Nikolay Malkin, Nebojsa Jojic, Dimitris Samaras
We consider the problem of inferring high-dimensional data $\mathbf{x}$ in a
model that consists of a prior $p(\mathbf{x})$ and an auxiliary differentiable
constraint $c(\mathbf{x},\mathbf{y})$ on $x$ given some additional information
$\mathbf{y}$. In this paper, the prior is an independently trained denoising
diffusion generative model. The auxiliary constraint is expected to have a
differentiable form, but can come from diverse sources. The possibility of such
inference turns diffusion models into plug-and-play modules, thereby allowing a
range of potential applications in adapting models to new domains and tasks,
such as conditional generation or image segmentation. The structure of
diffusion models allows us to perform approximate inference by iterating
differentiation through the fixed denoising network enriched with different
amounts of noise at each step. Considering many noised versions of $\mathbf{x}$
in evaluation of its fitness is a novel search mechanism that may lead to new
algorithms for solving combinatorial optimization problems.
Score-based Generative Models for Calorimeter Shower Simulation
June 17, 2022
Vinicius Mikuni, Benjamin Nachman
hep-ph, cs.LG, hep-ex, physics.data-an, physics.ins-det
Score-based generative models are a new class of generative algorithms that
have been shown to produce realistic images even in high dimensional spaces,
currently surpassing other state-of-the-art models for different benchmark
categories and applications. In this work we introduce CaloScore, a score-based
generative model for collider physics applied to calorimeter shower generation.
Three different diffusion models are investigated using the Fast Calorimeter
Simulation Challenge 2022 dataset. CaloScore is the first application of a
score-based generative model in collider physics and is able to produce
high-fidelity calorimeter images for all datasets, providing an alternative
paradigm for calorimeter shower simulation.
Lossy Compression with Gaussian Diffusion
June 17, 2022
Lucas Theis, Tim Salimans, Matthew D. Hoffman, Fabian Mentzer
stat.ML, cs.IT, cs.LG, math.IT
We consider a novel lossy compression approach based on unconditional
diffusion generative models, which we call DiffC. Unlike modern compression
schemes which rely on transform coding and quantization to restrict the
transmitted information, DiffC relies on the efficient communication of pixels
corrupted by Gaussian noise. We implement a proof of concept and find that it
works surprisingly well despite the lack of an encoder transform, outperforming
the state-of-the-art generative compression method HiFiC on ImageNet 64x64.
DiffC only uses a single model to encode and denoise corrupted pixels at
arbitrary bitrates. The approach further provides support for progressive
coding, that is, decoding from partial bit streams. We perform a
rate-distortion analysis to gain a deeper understanding of its performance,
providing analytical results for multivariate Gaussian data as well as
theoretic bounds for general distributions. Furthermore, we prove that a
flow-based reconstruction achieves a 3 dB gain over ancestral sampling at high
bitrates.
A Flexible Diffusion Model
June 17, 2022
Weitao Du, Tao Yang, He Zhang, Yuanqi Du
Diffusion (score-based) generative models have been widely used for modeling
various types of complex data, including images, audios, and point clouds.
Recently, the deep connection between forward-backward stochastic differential
equations (SDEs) and diffusion-based models has been revealed, and several new
variants of SDEs are proposed (e.g., sub-VP, critically-damped Langevin) along
this line. Despite the empirical success of the hand-crafted fixed forward
SDEs, a great quantity of proper forward SDEs remain unexplored. In this work,
we propose a general framework for parameterizing the diffusion model,
especially the spatial part of the forward SDE. An abstract formalism is
introduced with theoretical guarantees, and its connection with previous
diffusion models is leveraged. We demonstrate the theoretical advantage of our
method from an optimization perspective. Numerical experiments on synthetic
datasets, MINIST and CIFAR10 are also presented to validate the effectiveness
of our framework.
SOS: Score-based Oversampling for Tabular Data
June 17, 2022
Jayoung Kim, Chaejeong Lee, Yehjin Shin, Sewon Park, Minjung Kim, Noseong Park, Jihoon Cho
Score-based generative models (SGMs) are a recent breakthrough in generating
fake images. SGMs are known to surpass other generative models, e.g.,
generative adversarial networks (GANs) and variational autoencoders (VAEs).
Being inspired by their big success, in this work, we fully customize them for
generating fake tabular data. In particular, we are interested in oversampling
minor classes since imbalanced classes frequently lead to sub-optimal training
outcomes. To our knowledge, we are the first presenting a score-based tabular
data oversampling method. Firstly, we re-design our own score network since we
have to process tabular data. Secondly, we propose two options for our
generation method: the former is equivalent to a style transfer for tabular
data and the latter uses the standard generative policy of SGMs. Lastly, we
define a fine-tuning method, which further enhances the oversampling quality.
In our experiments with 6 datasets and 10 baselines, our method outperforms
other oversampling methods in all cases.
Maximum Likelihood Training for Score-Based Diffusion ODEs by High-Order Denoising Score Matching
June 16, 2022
Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu
Score-based generative models have excellent performance in terms of
generation quality and likelihood. They model the data distribution by matching
a parameterized score network with first-order data score functions. The score
network can be used to define an ODE (“score-based diffusion ODE”) for exact
likelihood evaluation. However, the relationship between the likelihood of the
ODE and the score matching objective is unclear. In this work, we prove that
matching the first-order score is not sufficient to maximize the likelihood of
the ODE, by showing a gap between the maximum likelihood and score matching
objectives. To fill up this gap, we show that the negative likelihood of the
ODE can be bounded by controlling the first, second, and third-order score
matching errors; and we further present a novel high-order denoising score
matching method to enable maximum likelihood training of score-based diffusion
ODEs. Our algorithm guarantees that the higher-order matching error is bounded
by the training error and the lower-order errors. We empirically observe that
by high-order score matching, score-based diffusion ODEs achieve better
likelihood on both synthetic data and CIFAR-10, while retaining the high
generation quality.
Discrete Contrastive Diffusion for Cross-Modal and Conditional Generation
June 15, 2022
Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan
Diffusion probabilistic models (DPMs) have become a popular approach to
conditional generation, due to their promising results and support for
cross-modal synthesis. A key desideratum in conditional synthesis is to achieve
high correspondence between the conditioning input and generated output. Most
existing methods learn such relationships implicitly, by incorporating the
prior into the variational lower bound. In this work, we take a different route
– we explicitly enhance input-output connections by maximizing their mutual
information. To this end, we introduce a Conditional Discrete Contrastive
Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to
effectively incorporate it into the denoising process, combining the diffusion
training and contrastive learning for the first time by connecting it with the
conventional variational objectives. We demonstrate the efficacy of our
approach in evaluations with diverse multimodal conditional synthesis tasks:
dance-to-music generation, text-to-image synthesis, as well as
class-conditioned image synthesis. On each, we enhance the input-output
correspondence and achieve higher or competitive general synthesis quality.
Furthermore, the proposed approach improves the convergence of diffusion
models, reducing the number of required diffusion steps by more than 35% on two
benchmarks, significantly increasing the inference speed.
Diffusion Models for Video Prediction and Infilling
June 15, 2022
Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi
Predicting and anticipating future outcomes or reasoning about missing
information in a sequence are critical skills for agents to be able to make
intelligent decisions. This requires strong, temporally coherent generative
capabilities. Diffusion models have shown remarkable success in several
generative tasks, but have not been extensively explored in the video domain.
We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion
models to videos using 3D convolutions, and introduces a new conditioning
technique during training. By varying the mask we condition on, the model is
able to perform video prediction, infilling, and upsampling. Due to our simple
conditioning scheme, we can utilize the same architecture as used for
unconditional training, which allows us to train the model in a conditional and
unconditional fashion at the same time. We evaluate RaMViD on two benchmark
datasets for video prediction, on which we achieve state-of-the-art results,
and one for video generation. High-resolution videos are provided at
https://sites.google.com/view/video-diffusion-prediction.
Estimating the Optimal Covariance with Imperfect Mean in Diffusion Probabilistic Models
June 15, 2022
Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, Bo Zhang
Diffusion probabilistic models (DPMs) are a class of powerful deep generative
models (DGMs). Despite their success, the iterative generation process over the
full timesteps is much less efficient than other DGMs such as GANs. Thus, the
generation performance on a subset of timesteps is crucial, which is greatly
influenced by the covariance design in DPMs. In this work, we consider diagonal
and full covariances to improve the expressive power of DPMs. We derive the
optimal result for such covariances, and then correct it when the mean of DPMs
is imperfect. Both the optimal and the corrected ones can be decomposed into
terms of conditional expectations over functions of noise. Building upon it, we
propose to estimate the optimal covariance and its correction given imperfect
mean by learning these conditional expectations. Our method can be applied to
DPMs with both discrete and continuous timesteps. We consider the diagonal
covariance in our implementation for computational efficiency. For an efficient
practical implementation, we adopt a parameter sharing scheme and a two-stage
training process. Empirically, our method outperforms a wide variety of
covariance design on likelihood results, and improves the sample quality
especially on a small number of timesteps.
CARD: Classification and Regression Diffusion Models
June 15, 2022
Xizewen Han, Huangjie Zheng, Mingyuan Zhou
stat.ML, cs.LG, stat.CO, stat.ME
Learning the distribution of a continuous or categorical response variable
$\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem
in statistics and machine learning. Deep neural network-based supervised
learning algorithms have made great progress in predicting the mean of
$\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for their
ability to accurately capture the uncertainty of their predictions. In this
paper, we introduce classification and regression diffusion (CARD) models,
which combine a denoising diffusion-based conditional generative model and a
pre-trained conditional mean estimator, to accurately predict the distribution
of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding
ability of CARD in conditional distribution prediction with both toy examples
and real-world datasets; the experimental results show that CARD in
general outperforms state-of-the-art methods, including Bayesian neural
network-based ones that are designed for uncertainty estimation, especially
when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is
multi-modal. In addition, we utilize the stochastic nature of the generative
model outputs to obtain a finer granularity in model confidence assessment at
the instance level for classification tasks.
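A generic way to realize the combination described above is a conditional denoiser that, at every diffusion step, sees the covariates, the output of the pre-trained mean estimator, and the noisy response. The sketch below is a simplified stand-in (MLP denoiser, concatenation conditioning); it illustrates the interface rather than the paper's exact parameterization.

    import torch
    import torch.nn as nn

    class ConditionalResponseDenoiser(nn.Module):
        """Hedged sketch: predict the noise added to the response y_t, conditioned on
        covariates x, a pre-trained mean estimate f(x), and the diffusion timestep t."""
        def __init__(self, x_dim, y_dim, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(x_dim + 2 * y_dim + 1, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, y_dim),
            )

        def forward(self, y_t, x, f_x, t):
            # t is an integer timestep tensor of shape (B,); embedded trivially here.
            return self.net(torch.cat([y_t, x, f_x, t.float().unsqueeze(-1)], dim=-1))

At test time, repeatedly sampling y from the reverse chain for a fixed x yields an empirical conditional distribution, whose spread can serve as the instance-level confidence signal mentioned in the abstract.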
Realistic Gramophone Noise Synthesis using a Diffusion Model
June 13, 2022
Eloi Moliner, Vesa Välimäki
This paper introduces a novel data-driven strategy for synthesizing
gramophone noise audio textures. A diffusion probabilistic model is applied to
generate highly realistic quasiperiodic noises. The proposed model is designed
to generate samples of length equal to one disk revolution, but a method to
generate plausible periodic variations between revolutions is also proposed. A
guided approach is also applied as a conditioning method, where an audio signal
generated with manually-tuned signal processing is refined via reverse
diffusion to improve realism. The method has been evaluated in a subjective
listening test, in which the participants were often unable to distinguish the
synthesized signals from the real ones. The synthetic noises produced with the
best proposed unconditional method are statistically indistinguishable from
real noise recordings. This work shows the potential of diffusion models for
highly realistic audio synthesis tasks.
Convergence for score-based generative modeling with polynomial complexity
June 13, 2022
Holden Lee, Jianfeng Lu, Yixin Tan
cs.LG, math.PR, math.ST, stat.ML, stat.TH
Score-based generative modeling (SGM) is a highly successful approach for
learning a probability distribution from data and generating further samples.
We prove the first polynomial convergence guarantees for the core mechanic
behind SGM: drawing samples from a probability density $p$ given a score
estimate (an estimate of $\nabla \ln p$) that is accurate in $L^2(p)$. Compared
to previous works, we do not incur error that grows exponentially in time or
that suffers from a curse of dimensionality. Our guarantee works for any smooth
distribution and depends polynomially on its log-Sobolev constant. Using our
guarantee, we give a theoretical analysis of score-based generative modeling,
which transforms white-noise input into samples from a learned data
distribution given score estimates at different noise scales. Our analysis
gives theoretical grounding to the observation that an annealed procedure is
required in practice to generate good samples, as our proof depends essentially
on using annealing to obtain a warm start at each step. Moreover, we show that
a predictor-corrector algorithm gives better convergence than using either
portion alone.
Latent Diffusion Energy-Based Model for Interpretable Text Modeling
June 13, 2022
Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu
Latent space Energy-Based Models (EBMs), also known as energy-based priors,
have drawn growing interest in generative modeling. Fueled by its flexibility
in formulation and the strong modeling power of the latent space, recent works
built upon it have made interesting attempts aiming at the interpretability of
text modeling. However, latent space EBMs also inherit some flaws from EBMs in
data space; the degenerate MCMC sampling quality in practice can lead to poor
generation quality and instability in training, especially on data with complex
latent structures. Inspired by the recent efforts that leverage diffusion
recovery likelihood learning as a cure for the sampling issue, we introduce a
novel symbiosis between the diffusion models and latent space EBMs in a
variational learning framework, coined as the latent diffusion energy-based
model. We develop a geometric clustering-based regularization jointly with the
information bottleneck to further improve the quality of the learned latent
space. Experiments on several challenging tasks demonstrate the superior
performance of our model on interpretable text modeling over strong
counterparts.
gDDIM: Generalized denoising diffusion implicit models
June 11, 2022
Qinsheng Zhang, Molei Tao, Yongxin Chen
Our goal is to extend the denoising diffusion implicit model (DDIM) to
general diffusion models~(DMs) besides isotropic diffusions. Instead of
constructing a non-Markov noising process as in the original DDIM, we examine
the mechanism of DDIM from a numerical perspective. We discover that the DDIM
can be obtained by using some specific approximations of the score when solving
the corresponding stochastic differential equation. We present an
interpretation of the accelerating effects of DDIM that also explains the
advantages of a deterministic sampling scheme over the stochastic one for fast
sampling. Building on this insight, we extend DDIM to general DMs, coined
generalized DDIM (gDDIM), with a small but delicate modification in
parameterizing the score network. We validate gDDIM in two non-isotropic DMs:
Blurring diffusion model (BDM) and Critically-damped Langevin diffusion model
(CLD). We observe more than 20-fold acceleration in BDM. In CLD, a
diffusion model that augments the diffusion process with velocity, our
algorithm achieves an FID score of 2.26 on CIFAR10 with only 50 score
function evaluations~(NFEs) and an FID score of 2.86 with only 27 NFEs.
Code is available at https://github.com/qsh-zh/gDDIM
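For reference, the deterministic DDIM update that gDDIM generalizes can be written, in the standard isotropic case with noise prediction $\epsilon_\theta$ and cumulative signal level $\bar\alpha_t$ (this is the vanilla DDIM step, not the gDDIM parameterization itself):

$$x_{t-1} \;=\; \sqrt{\bar\alpha_{t-1}}\;\frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \;+\; \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t),$$

i.e., the current estimate of the clean sample is re-noised to the next (lower) noise level without injecting fresh randomness; gDDIM's contribution is choosing the score parameterization so that an analogous step remains accurate for non-isotropic diffusions such as BDM and CLD.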
Multi-instrument Music Synthesis with Spectrogram Diffusion
June 11, 2022
Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel
An ideal music synthesizer should be both interactive and expressive,
generating high-fidelity audio in realtime for arbitrary combinations of
instruments and notes. Recent neural synthesizers have exhibited a tradeoff
between domain-specific models that offer detailed control of only specific
instruments, or raw waveform models that can train on any music but with
minimal control and slow generation. In this work, we focus on a middle ground
of neural synthesizers that can generate audio from MIDI sequences with
arbitrary combinations of instruments in realtime. This enables training on a
wide range of transcription datasets with a single model, which in turn offers
note-level control of composition and instrumentation across a wide range of
instruments. We use a simple two-stage process: MIDI to spectrograms with an
encoder-decoder Transformer, then spectrograms to audio with a generative
adversarial network (GAN) spectrogram inverter. We compare training the decoder
as an autoregressive model and as a Denoising Diffusion Probabilistic Model
(DDPM) and find that the DDPM approach is superior both qualitatively and as
measured by audio reconstruction and Fréchet distance metrics. Given the
interactivity and generality of this approach, we find this to be a promising
first step towards interactive and expressive neural synthesis for arbitrary
combinations of instruments and notes.
How Much is Enough? A Study on Diffusion Times in Score-based Generative Models
June 10, 2022
Giulio Franzese, Simone Rossi, Lixuan Yang, Alessandro Finamore, Dario Rossi, Maurizio Filippone, Pietro Michiardi
Score-based diffusion models are a class of generative models whose dynamics
is described by stochastic differential equations that map noise into data.
While recent works have started to lay down a theoretical foundation for these
models, an analytical understanding of the role of the diffusion time T is
still lacking. Current best practice advocates for a large T to ensure that the
forward dynamics brings the diffusion sufficiently close to a known and simple
noise distribution; however, a smaller value of T should be preferred for a
better approximation of the score-matching objective and higher computational
efficiency. Starting from a variational interpretation of diffusion models, in
this work we quantify this trade-off, and suggest a new method to improve
quality and efficiency of both training and sampling, by adopting smaller
diffusion times. Indeed, we show how an auxiliary model can be used to bridge
the gap between the ideal and the simulated forward dynamics, followed by a
standard reverse diffusion process. Empirical results support our analysis; for
image data, our method is competitive w.r.t. the state-of-the-art, according to
standard sample quality metrics and log-likelihood.
Image Generation with Multimodal Priors using Denoising Diffusion Probabilistic Models
June 10, 2022
Nithin Gopalakrishnan Nair, Wele Gedara Chaminda Bandara, Vishal M Patel
Image synthesis under multi-modal priors is a useful and challenging task
that has received increasing attention in recent years. A major challenge in
using generative models to accomplish this task is the lack of paired data
containing all modalities (i.e. priors) and corresponding outputs. In recent
work, a variational auto-encoder (VAE) model was trained in a weakly supervised
manner to address this challenge. Since the generative power of VAEs is usually
limited, it is difficult for this method to synthesize images belonging to
complex distributions. To this end, we propose a solution based on denoising
diffusion probabilistic models to synthesise images under multi-modal priors.
Based on the fact that the distribution over each time step in the diffusion
model is Gaussian, in this work we show that there exists a closed-form
expression for generating the image corresponding to the given modalities. The
proposed solution does not require explicit retraining for all modalities and
can leverage the outputs of individual modalities to generate realistic images
according to different constraints. We conduct studies on two real-world
datasets to demonstrate the effectiveness of our approach.
Probability flow solution of the Fokker-Planck equation
June 09, 2022
Nicholas M. Boffi, Eric Vanden-Eijnden
cs.LG, cond-mat.dis-nn, cond-mat.stat-mech, cs.NA, math.NA, math.PR
The method of choice for integrating the time-dependent Fokker-Planck
equation in high-dimension is to generate samples from the solution via
integration of the associated stochastic differential equation. Here, we study
an alternative scheme based on integrating an ordinary differential equation
that describes the flow of probability. Acting as a transport map, this
equation deterministically pushes samples from the initial density onto samples
from the solution at any later time. Unlike integration of the stochastic
dynamics, the method has the advantage of giving direct access to quantities
that are challenging to estimate from trajectories alone, such as the
probability current, the density itself, and its entropy. The probability flow
equation depends on the gradient of the logarithm of the solution (its
“score”), and so is a priori unknown. To resolve this dependence, we model the
score with a deep neural network that is learned on-the-fly by propagating a
set of samples according to the instantaneous probability current. We show
theoretically that the proposed approach controls the KL divergence from the
learned solution to the target, while learning on external samples from the
stochastic differential equation does not control either direction of the KL
divergence. Empirically, we consider several high-dimensional Fokker-Planck
equations from the physics of interacting particle systems. We find that the
method accurately matches analytical solutions when they are available as well
as moments computed via Monte-Carlo when they are not. Moreover, the method
offers compelling predictions for the global entropy production rate that
out-perform those obtained from learning on stochastic trajectories, and can
effectively capture non-equilibrium steady-state probability currents over long
time intervals.
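Concretely, if the stochastic dynamics are $dX_t = b(X_t, t)\,dt + \sqrt{2D}\,dW_t$ with density $\rho_t$ (written here with a constant diffusion coefficient $D$ for simplicity), the probability flow ODE referred to above is

$$\frac{dx_t}{dt} \;=\; b(x_t, t) \;-\; D\,\nabla_x \log \rho_t(x_t),$$

which shares the same Fokker-Planck equation as the SDE and therefore transports samples of the initial density onto samples of $\rho_t$; the unknown term $\nabla_x \log \rho_t$ is exactly the score that the method models with a neural network.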
Diffusion probabilistic modeling of protein backbones in 3D for the motif-scaffolding problem
June 08, 2022
Brian L. Trippe, Jason Yim, Doug Tischer, David Baker, Tamara Broderick, Regina Barzilay, Tommi Jaakkola
Construction of a scaffold structure that supports a desired motif,
conferring protein function, shows promise for the design of vaccines and
enzymes. But a general solution to this motif-scaffolding problem remains open.
Current machine-learning techniques for scaffold design are either limited to
unrealistically small scaffolds (up to length 20) or struggle to produce
multiple diverse scaffolds. We propose to learn a distribution over diverse and
longer protein backbone structures via an E(3)-equivariant graph neural
network. We develop SMCDiff to efficiently sample scaffolds from this
distribution conditioned on a given motif; our algorithm is the first to
theoretically guarantee conditional samples from a diffusion model in the
large-compute limit. We evaluate our designed backbones by how well they align
with AlphaFold2-predicted structures. We show that our method can (1) sample
scaffolds up to 80 residues and (2) achieve structurally diverse scaffolds for
a fixed motif.
Accelerating Score-based Generative Models for High-Resolution Image Synthesis
June 08, 2022
Hengyuan Ma, Li Zhang, Xiatian Zhu, Jingfeng Zhang, Jianfeng Feng
Score-based generative models (SGMs) have recently emerged as a promising
class of generative models. The key idea is to produce high-quality images by
recurrently adding Gaussian noises and gradients to a Gaussian sample until
converging to the target distribution, a.k.a. the diffusion sampling. To ensure
stability of convergence in sampling and generation quality, however, this
sequential sampling process has to take a small step size and many sampling
iterations (e.g., 2000). Several acceleration methods have been proposed with
focus on low-resolution generation. In this work, we consider the acceleration
of high-resolution generation with SGMs, a more challenging yet more important
problem. We prove theoretically that this slow convergence drawback is
primarily due to ignoring the target distribution. Further, we
introduce a novel Target Distribution Aware Sampling (TDAS) method by
leveraging the structural priors in space and frequency domains. Extensive
experiments on CIFAR-10, CelebA, LSUN, and FFHQ datasets validate that TDAS can
consistently accelerate state-of-the-art SGMs, particularly on more challenging
high resolution (1024x1024) image generation tasks by up to 18.4x, whilst
largely maintaining the synthesis quality. With fewer sampling iterations, TDAS
can still generate good-quality images. In contrast, the existing methods
degrade drastically or even fail completely.
Neural Diffusion Processes
June 08, 2022
Vincent Dutordoir, Alan Saul, Zoubin Ghahramani, Fergus Simpson
Neural network approaches for meta-learning distributions over functions have
desirable properties such as increased flexibility and a reduced complexity of
inference. Building on the successes of denoising diffusion models for
generative modelling, we propose Neural Diffusion Processes (NDPs), a novel
approach that learns to sample from a rich distribution over functions through
its finite marginals. By introducing a custom attention block we are able to
incorporate properties of stochastic processes, such as exchangeability,
directly into the NDP’s architecture. We empirically show that NDPs can capture
functional distributions close to the true Bayesian posterior, demonstrating
that they can successfully emulate the behaviour of Gaussian processes and
surpass the performance of neural processes. NDPs enable a variety of
downstream tasks, including regression, implicit hyperparameter
marginalisation, non-Gaussian posterior prediction and global optimisation.
Fast Unsupervised Brain Anomaly Detection and Segmentation with Diffusion Models
June 07, 2022
Walter H. L. Pinaya, Mark S. Graham, Robert Gray, Pedro F Da Costa, Petru-Daniel Tudosiu, Paul Wright, Yee H. Mah, Andrew D. MacKinnon, James T. Teo, Rolf Jager, David Werring, Geraint Rees, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
Deep generative models have emerged as promising tools for detecting
arbitrary anomalies in data, dispensing with the necessity for manual
labelling. Recently, autoregressive transformers have achieved state-of-the-art
performance for anomaly detection in medical imaging. Nonetheless, these models
still have some intrinsic weaknesses, such as requiring images to be modelled
as 1D sequences, the accumulation of errors during the sampling process, and
the significant inference times associated with transformers. Denoising
diffusion probabilistic models are a class of non-autoregressive generative
models recently shown to produce excellent samples in computer vision
(surpassing Generative Adversarial Networks), and to achieve log-likelihoods
that are competitive with transformers while having fast inference times.
Diffusion models can be applied to the latent representations learnt by
autoencoders, making them easily scalable and great candidates for application
to high dimensional data, such as medical images. Here, we propose a method
based on diffusion models to detect and segment anomalies in brain imaging. By
training the models on healthy data and then exploring their diffusion and
reverse steps across the Markov chain, we can identify anomalous areas in the
latent space and hence identify anomalies in the pixel space. Our diffusion
models achieve competitive performance compared with autoregressive approaches
across a series of experiments with 2D CT and MRI data involving synthetic and
real pathological lesions with much reduced inference times, making their usage
clinically viable.
Universal Speech Enhancement with Score-based Diffusion
June 07, 2022
Joan Serrà, Santiago Pascual, Jordi Pons, R. Oguz Araz, Davide Scaini
Removing background noise from speech audio has been the subject of
considerable effort, especially in recent years due to the rise of virtual
communication and amateur recordings. Yet background noise is not the only
unpleasant disturbance that can prevent intelligibility: reverb, clipping,
codec artifacts, problematic equalization, limited bandwidth, or inconsistent
loudness are equally disturbing and ubiquitous. In this work, we propose to
consider the task of speech enhancement as a holistic endeavor, and present a
universal speech enhancement system that tackles 55 different distortions at
the same time. Our approach consists of a generative model that employs
score-based diffusion, together with a multi-resolution conditioning network
that performs enhancement with mixture density networks. We show that this
approach significantly outperforms the state of the art in a subjective test
performed by expert listeners. We also show that it achieves competitive
objective scores with just 4-8 diffusion steps, despite not considering any
particular strategy for fast sampling. We hope that both our methodology and
technical contributions encourage researchers and practitioners to adopt a
universal approach to speech enhancement, possibly framing it as a generative
task.
Blended Latent Diffusion
June 06, 2022
Omri Avrahami, Ohad Fried, Dani Lischinski
The tremendous progress in neural image generation, coupled with the
emergence of seemingly omnipotent vision-language models has finally enabled
text-based interfaces for creating and editing images. Handling generic images
requires a diverse underlying generative model, hence the latest works utilize
diffusion models, which were shown to surpass GANs in terms of diversity. One
major drawback of diffusion models, however, is their relatively slow inference
time. In this paper, we present an accelerated solution to the task of local
text-driven editing of generic images, where the desired edits are confined to
a user-provided mask. Our solution leverages a recent text-to-image Latent
Diffusion Model (LDM), which speeds up diffusion by operating in a
lower-dimensional latent space. We first convert the LDM into a local image
editor by incorporating Blended Diffusion into it. Next we propose an
optimization-based solution for the inherent inability of this LDM to
accurately reconstruct images. Finally, we address the scenario of performing
local edits using thin masks. We evaluate our method against the available
baselines both qualitatively and quantitatively and demonstrate that in
addition to being faster, our method achieves better precision than the
baselines while mitigating some of their artifacts.
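The core blending idea, transplanted to latent space, can be sketched as follows: at every reverse step the edited region follows the text-guided denoiser, while the background is obtained by re-noising the original latent to the matching noise level. Function names and the mask convention below are placeholders, not the authors' code.

    import torch

    @torch.no_grad()
    def blended_latent_step(z_t, z_orig, mask, t, denoise_to, noise_to):
        """One hedged sampling step of mask-confined latent editing.
        z_t:      current noisy latent being edited
        z_orig:   clean latent of the original image (background source)
        mask:     1 inside the editable region, 0 outside (latent resolution)
        denoise_to(z, t): one text-guided reverse step of the latent diffusion model
        noise_to(z0, s):  forward-diffuse a clean latent to noise level s"""
        z_edit = denoise_to(z_t, t)          # reverse step drives the edited region
        z_bg = noise_to(z_orig, t - 1)       # background re-noised to the new level
        return mask * z_edit + (1 - mask) * z_bg

Looping this step from high noise down to t = 1 keeps everything outside the user mask consistent with the source image while the masked region follows the text prompt.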
Diffusion-GAN: Training GANs with Diffusion
June 05, 2022
Zhendong Wang, Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou
Generative adversarial networks (GANs) are challenging to train stably, and a
promising remedy of injecting instance noise into the discriminator input has
not been very effective in practice. In this paper, we propose Diffusion-GAN, a
novel GAN framework that leverages a forward diffusion chain to generate
Gaussian-mixture distributed instance noise. Diffusion-GAN consists of three
components, including an adaptive diffusion process, a diffusion
timestep-dependent discriminator, and a generator. Both the observed and
generated data are diffused by the same adaptive diffusion process. At each
diffusion timestep, there is a different noise-to-data ratio and the
timestep-dependent discriminator learns to distinguish the diffused real data
from the diffused generated data. The generator learns from the discriminator’s
feedback by backpropagating through the forward diffusion chain, whose length
is adaptively adjusted to balance the noise and data levels. We theoretically
show that the discriminator’s timestep-dependent strategy gives consistent and
helpful guidance to the generator, enabling it to match the true data
distribution. We demonstrate the advantages of Diffusion-GAN over strong GAN
baselines on various datasets, showing that it can produce more realistic
images with higher stability and data efficiency than state-of-the-art GANs.
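A minimal sketch of the central mechanism, under the assumption of a standard cumulative noise schedule and a non-saturating GAN loss (both implementation choices here, not necessarily the paper's): real and generated images are pushed through the same forward diffusion before a timestep-conditioned discriminator sees them.

    import torch
    import torch.nn.functional as F

    def diffused_discriminator_loss(D, G, real, z, alphas_cumprod, max_t):
        """Hedged sketch: diffuse real and fake images with the same forward process
        at a random timestep and train the timestep-conditioned discriminator D on
        the noisy versions. alphas_cumprod is a 1-D tensor on the same device."""
        fake = G(z).detach()
        B = real.size(0)
        t = torch.randint(0, max_t, (B,), device=real.device)
        a = alphas_cumprod[t].view(B, 1, 1, 1)

        def diffuse(x):
            return a.sqrt() * x + (1 - a).sqrt() * torch.randn_like(x)

        logits_real = D(diffuse(real), t)
        logits_fake = D(diffuse(fake), t)
        # Non-saturating discriminator loss on the diffused images.
        return (F.softplus(-logits_real) + F.softplus(logits_fake)).mean()

The generator loss is obtained analogously by backpropagating through the same diffusion of G(z), and the maximum timestep max_t plays the role of the chain length that the paper adapts on the fly to balance noise and data levels.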
Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
June 05, 2022
Alon Levkovitch, Eliya Nachmani, Lior Wolf
cs.SD, cs.AI, cs.LG, eess.AS, eess.SP
We present a novel way of conditioning a pretrained denoising diffusion
speech model to produce speech in the voice of a novel person unseen during
training. The method requires a short (~3 seconds) sample from the target
person, and generation is steered at inference time, without any training
steps. At the heart of the method lies a sampling process that combines the
estimation of the denoising model with a low-pass version of the new speaker’s
sample. The objective and subjective evaluations show that our sampling method
can generate a voice similar to that of the target speaker in terms of
frequency, with an accuracy comparable to state-of-the-art methods, and without
training.
Compositional Visual Generation with Composable Diffusion Models
June 03, 2022
Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, Joshua B. Tenenbaum
Large text-guided diffusion models, such as DALLE-2, are able to generate
stunning photorealistic images given natural language descriptions. While such
models are highly flexible, they struggle to understand the composition of
certain concepts, such as confusing the attributes of different objects or
relations between objects. In this paper, we propose an alternative structured
approach for compositional generation using diffusion models. An image is
generated by composing a set of diffusion models, with each of them modeling a
certain component of the image. To do this, we interpret diffusion models as
energy-based models in which the data distributions defined by the energy
functions may be explicitly combined. The proposed method can generate scenes
at test time that are substantially more complex than those seen in training,
composing sentence descriptions, object relations, human facial attributes, and
even generalizing to new combinations that are rarely seen in the real world.
We further illustrate how our approach may be used to compose pre-trained
text-guided diffusion models and generate photorealistic images containing all
the details described in the input descriptions, including the binding of
certain object attributes that have been shown difficult for DALLE-2. These
results point to the effectiveness of the proposed method in promoting
structured generalization for visual generation. Project page:
https://energy-based-model.github.io/Compositional-Visual-Generation-with-Composable-Diffusion-Models/
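In score space, composing the individual models amounts to summing guided differences between conditional and unconditional noise predictions (the conjunction-style combination; the weights and the concept-embedding interface below are assumptions for illustration).

    import torch

    @torch.no_grad()
    def composed_noise_prediction(model, x_t, t, concept_embs, null_emb, weights):
        """Hedged sketch: combine several concepts by summing their guided score
        differences. model(x_t, t, emb) returns the predicted noise for one condition;
        null_emb denotes the unconditional (empty) condition."""
        eps_uncond = model(x_t, t, null_emb)
        eps = eps_uncond.clone()
        for emb, w in zip(concept_embs, weights):
            eps = eps + w * (model(x_t, t, emb) - eps_uncond)
        return eps  # drop-in replacement for the usual noise prediction during sampling

Because the diffusion models are treated as energy-based models here, this sum roughly corresponds to a product of the individual conditional distributions, which is what lets the composition express scenes more complex than any single condition seen in training.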
Score-Based Generative Models Detect Manifolds
June 02, 2022
Jakiw Pidstrigach
stat.ML, cs.LG, cs.NA, math.NA, math.PR, 68T99, I.2.0
Score-based generative models (SGMs) need to approximate the scores $\nabla
\log p_t$ of the intermediate distributions as well as the final distribution
$p_T$ of the forward process. The theoretical underpinnings of the effects of
these approximations are still lacking. We find precise conditions under which
SGMs are able to produce samples from an underlying (low-dimensional) data
manifold $\mathcal{M}$. This assures us that SGMs are able to generate the
“right kind of samples”. For example, taking $\mathcal{M}$ to be the subset of
images of faces, we find conditions under which the SGM robustly produces an
image of a face, even though the relative frequencies of these images might not
accurately represent the true data generating distribution. Moreover, this
analysis is a first step towards understanding the generalization properties of
SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results
provide a precise description of when the SGM memorizes its training data.
Improving Diffusion Models for Inverse Problems using Manifold Constraints
June 02, 2022
Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, Jong Chul Ye
cs.LG, cs.AI, cs.CV, stat.ML
Recently, diffusion models have been used to solve various inverse problems
in an unsupervised manner with appropriate modifications to the sampling
process. However, the current solvers, which recursively apply a reverse
diffusion step followed by a projection-based measurement consistency step,
often produce suboptimal results. By studying the generative sampling path,
here we show that current solvers throw the sample path off the data manifold,
and hence the error accumulates. To address this, we propose an additional
correction term inspired by the manifold constraint, which can be used
synergistically with the previous solvers to make the iterations close to the
manifold. The proposed manifold constraint is straightforward to implement
within a few lines of code, yet boosts the performance by a surprisingly large
margin. With extensive experiments, we show that our method is superior to the
previous methods both theoretically and empirically, producing promising
results in many applications such as image inpainting, colorization, and
sparse-view computed tomography. Code is available at
https://github.com/HJ-harry/MCG_diffusion
DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps
June 02, 2022
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, Jun Zhu
Diffusion probabilistic models (DPMs) are emerging powerful generative
models. Despite their high-quality generation performance, DPMs still suffer
from their slow sampling as they generally need hundreds or thousands of
sequential function evaluations (steps) of large neural networks to draw a
sample. Sampling from DPMs can be viewed alternatively as solving the
corresponding diffusion ordinary differential equations (ODEs). In this work,
we propose an exact formulation of the solution of diffusion ODEs. The
formulation analytically computes the linear part of the solution, rather than
leaving all terms to black-box ODE solvers as adopted in previous works. By
applying change-of-variable, the solution can be equivalently simplified to an
exponentially weighted integral of the neural network. Based on our
formulation, we propose DPM-Solver, a fast dedicated high-order solver for
diffusion ODEs with a convergence order guarantee. DPM-Solver is suitable for
both discrete-time and continuous-time DPMs without any further training.
Experimental results show that DPM-Solver can generate high-quality samples in
only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in
10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10
dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art
training-free samplers on various datasets.
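Reconstructing the abstract's description (this is the standard noise-prediction form with log-SNR $\lambda_t = \log(\alpha_t/\sigma_t)$, written from that description rather than quoted from the paper), the exact solution from time $s$ to $t$ reads

$$x_t \;=\; \frac{\alpha_t}{\alpha_s}\,x_s \;-\; \alpha_t \int_{\lambda_s}^{\lambda_t} e^{-\lambda}\,\hat\epsilon_\theta\big(\hat x_\lambda, \lambda\big)\,d\lambda,$$

so the linear part is handled analytically and only the exponentially weighted integral of the network output must be approximated; DPM-Solver obtains its high-order rules by Taylor-expanding $\hat\epsilon_\theta$ with respect to $\lambda$ over each step.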
Elucidating the Design Space of Diffusion-Based Generative Models
June 01, 2022
Tero Karras, Miika Aittala, Timo Aila, Samuli Laine
cs.CV, cs.AI, cs.LG, cs.NE, stat.ML
We argue that the theory and practice of diffusion-based generative models
are currently unnecessarily convoluted and seek to remedy the situation by
presenting a design space that clearly separates the concrete design choices.
This lets us identify several changes to both the sampling and training
processes, as well as preconditioning of the score networks. Together, our
improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a
class-conditional setting and 1.97 in an unconditional setting, with much
faster sampling (35 network evaluations per image) than prior designs. To
further demonstrate their modular nature, we show that our design changes
dramatically improve both the efficiency and quality obtainable with
pre-trained score networks from previous work, including improving the FID of a
previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after
re-training with our proposed improvements to a new SOTA of 1.36.
Torsional Diffusion for Molecular Conformer Generation
June 01, 2022
Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, Tommi Jaakkola
physics.chem-ph, cs.LG, q-bio.BM
Molecular conformer generation is a fundamental task in computational
chemistry. Several machine learning approaches have been developed, but none
have outperformed state-of-the-art cheminformatics methods. We propose
torsional diffusion, a novel diffusion framework that operates on the space of
torsion angles via a diffusion process on the hypertorus and an
extrinsic-to-intrinsic score model. On a standard benchmark of drug-like
molecules, torsional diffusion generates superior conformer ensembles compared
to machine learning and cheminformatics methods in terms of both RMSD and
chemical properties, and is orders of magnitude faster than previous
diffusion-based models. Moreover, our model provides exact likelihoods, which
we employ to build the first generalizable Boltzmann generator. Code is
available at https://github.com/gcorso/torsional-diffusion.
Discovering the Hidden Vocabulary of DALLE-2
June 01, 2022
Giannis Daras, Alexandros G. Dimakis
cs.LG, cs.CL, cs.CR, cs.CV
We discover that DALLE-2 seems to have a hidden vocabulary that can be used
to generate images with absurd prompts. For example, it seems that
\texttt{Apoploe vesrreaitais} means birds and \texttt{Contarra ccetnxniams
luryca tanniounons} (sometimes) means bugs or pests. We find that these prompts
are often consistent in isolation but also sometimes in combinations. We
present our black-box method to discover words that seem random but have some
correspondence to visual concepts. This creates important security and
interpretability challenges.
On Analyzing Generative and Denoising Capabilities of Diffusion-based Deep Generative Models
May 31, 2022
Kamil Deja, Anna Kuzina, Tomasz Trzciński, Jakub M. Tomczak
Diffusion-based Deep Generative Models (DDGMs) offer state-of-the-art
performance in generative modeling. Their main strength comes from their unique
setup in which a model (the backward diffusion process) is trained to reverse
the forward diffusion process, which gradually adds noise to the input signal.
Although DDGMs are well studied, it is still unclear how the small amount of
noise is transformed during the backward diffusion process. Here, we focus on
analyzing this problem to gain more insight into the behavior of DDGMs and
their denoising and generative capabilities. We observe a fluid transition
point that changes the functionality of the backward diffusion process from
generating a (corrupted) image from noise to denoising the corrupted image to
the final sample. Based on this observation, we postulate to divide a DDGM into
two parts: a denoiser and a generator. The denoiser could be parameterized by a
denoising auto-encoder, while the generator is a diffusion-based model with its
own set of parameters. We experimentally validate our proposition, showing its
pros and cons.
Few-Shot Diffusion Models
May 30, 2022
Giorgio Giannone, Didrik Nielsen, Ole Winther
Denoising diffusion probabilistic models (DDPM) are powerful hierarchical
latent variable models with remarkable sample generation quality and training
stability. These properties can be attributed to parameter sharing in the
generative hierarchy, as well as a parameter-free diffusion-based inference
procedure. In this paper, we present Few-Shot Diffusion Models (FSDM), a
framework for few-shot generation leveraging conditional DDPMs. FSDMs are
trained to adapt the generative process conditioned on a small set of images
from a given class by aggregating image patch information using a set-based
Vision Transformer (ViT). At test time, the model is able to generate samples
from previously unseen classes conditioned on as few as 5 samples from that
class. We empirically show that FSDM can perform few-shot generation and
transfer to new datasets. We benchmark variants of our method on complex vision
datasets for few-shot learning and compare to unconditional and conditional
DDPM baselines. Additionally, we show how conditioning the model on patch-based
input set information improves training convergence.
Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
May 30, 2022
Sungwon Kim, Heeseung Kim, Sungroh Yoon
We propose Guided-TTS 2, a diffusion-based generative model for high-quality
adaptive TTS using untranscribed data. Guided-TTS 2 combines a
speaker-conditional diffusion model with a speaker-dependent phoneme classifier
for adaptive text-to-speech. We train the speaker-conditional diffusion model
on large-scale untranscribed datasets for a classifier-free guidance method and
further fine-tune the diffusion model on the reference speech of the target
speaker for adaptation, which only takes 40 seconds. We demonstrate that
Guided-TTS 2 shows comparable performance to high-quality single-speaker TTS
baselines in terms of speech quality and speaker similarity with only ten
seconds of untranscribed data. We further show that Guided-TTS 2 outperforms
adaptive TTS baselines on multi-speaker datasets even with a zero-shot
adaptation setting. Guided-TTS 2 can adapt to a wide range of voices only using
untranscribed speech, which enables adaptive TTS with the voice of non-human
characters such as Gollum in \textit{“The Lord of the Rings”}.
Guided Diffusion Model for Adversarial Purification
May 30, 2022
Jinyi Wang, Zhaoyang Lyu, Dahua Lin, Bo Dai, Hongfei Fu
With wider application of deep neural networks (DNNs) in various algorithms
and frameworks, security threats have become one of the concerns. Adversarial
attacks disturb DNN-based image classifiers, in which attackers can
intentionally add imperceptible adversarial perturbations on input images to
fool the classifiers. In this paper, we propose a novel purification approach,
referred to as guided diffusion model for purification (GDMP), to help protect
classifiers from adversarial attacks. The core of our approach is to embed
purification into the diffusion denoising process of a Denoising Diffusion
Probabilistic Model (DDPM), so that its diffusion process could submerge the
adversarial perturbations with gradually added Gaussian noises, and both of
these noises can be simultaneously removed following a guided denoising
process. On our comprehensive experiments across various datasets, the proposed
GDMP is shown to reduce the perturbations raised by adversarial attacks to a
shallow range, thereby significantly improving the correctness of
classification. GDMP improves the robust accuracy by 5%, obtaining 90.1% under
PGD attack on the CIFAR10 dataset. Moreover, GDMP achieves 70.94% robustness on
the challenging ImageNet dataset.
Diffusion-LM Improves Controllable Text Generation
May 27, 2022
Xiang Lisa Li, John Thickstun, Ishaan Gulrajani, Percy Liang, Tatsunori B. Hashimoto
Controlling the behavior of language models (LMs) without re-training is a
major open problem in natural language generation. While recent works have
demonstrated successes on controlling simple sentence attributes (e.g.,
sentiment), there has been little progress on complex, fine-grained controls
(e.g., syntactic structure). To address this challenge, we develop a new
non-autoregressive language model based on continuous diffusions that we call
Diffusion-LM. Building upon the recent successes of diffusion models in
continuous domains, Diffusion-LM iteratively denoises a sequence of Gaussian
vectors into word vectors, yielding a sequence of intermediate latent
variables. The continuous, hierarchical nature of these intermediate variables
enables a simple gradient-based algorithm to perform complex, controllable
generation tasks. We demonstrate successful control of Diffusion-LM for six
challenging fine-grained control tasks, significantly outperforming prior work.
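The gradient-based control can be sketched as plug-and-play guidance on the continuous latents: before each reverse-diffusion update, the intermediate latent is nudged toward satisfying a differentiable control objective (e.g., a classifier for the target attribute or syntax). The step sizes and interfaces below are illustrative assumptions.

    import torch

    def controlled_denoise_step(denoise_step, control_logprob, x_t, t,
                                guidance_scale=1.0, n_grad_steps=3, lr=0.1):
        """Hedged sketch of gradient-based control on a continuous latent sequence.
        control_logprob(x, t): differentiable log-probability of the desired control
        denoise_step(x, t):    ordinary reverse-diffusion update x_t -> x_{t-1}"""
        x = x_t.detach().clone().requires_grad_(True)
        for _ in range(n_grad_steps):
            obj = control_logprob(x, t)                        # e.g. log p(attribute | x)
            grad = torch.autograd.grad(obj.sum(), x)[0]
            x = (x + lr * guidance_scale * grad).detach().requires_grad_(True)
        return denoise_step(x.detach(), t)                     # then the usual denoising move

Because the latents are continuous word vectors rather than discrete tokens, this simple gradient loop suffices where discrete language models would require search or reinforcement-style objectives.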
Maximum Likelihood Training of Implicit Nonlinear Diffusion Models
May 27, 2022
Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, Il-Chul Moon
While diverse variations of diffusion models exist, extending the linear
diffusion into a nonlinear diffusion process has been investigated by only a few
works. The effect of nonlinearity has hardly been understood, but intuitively, there
would be promising diffusion patterns to efficiently train the generative
distribution towards the data distribution. This paper introduces a
data-adaptive nonlinear diffusion process for score-based diffusion models. The
proposed Implicit Nonlinear Diffusion Model (INDM) learns by combining a
normalizing flow and a diffusion process. Specifically, INDM implicitly
constructs a nonlinear diffusion on the \textit{data space} by leveraging a
linear diffusion on the \textit{latent space} through a flow network. This flow
network is key to forming a nonlinear diffusion, as the nonlinearity depends on
the flow network. This flexible nonlinearity improves the learning curve of
INDM to nearly Maximum Likelihood Estimation (MLE) against the non-MLE curve of
DDPM++, which turns out to be an inflexible version of INDM with the flow fixed
as an identity mapping. Also, the discretization of INDM shows the sampling
robustness. In experiments, INDM achieves the state-of-the-art FID of 1.75 on
CelebA. We release our code at https://github.com/byeonghu-na/INDM.
Accelerating Diffusion Models via Early Stop of the Diffusion Process
May 25, 2022
Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, Bo Dai
Denoising Diffusion Probabilistic Models (DDPMs) have achieved impressive
performance on various generation tasks. By modeling the reverse process of
gradually diffusing the data distribution into a Gaussian distribution,
generating a sample in DDPMs can be regarded as iteratively denoising a
randomly sampled Gaussian noise. However, in practice DDPMs often need hundreds
or even thousands of denoising steps to obtain a high-quality sample from the
Gaussian noise, leading to extremely low inference efficiency. In this work, we
propose a principled acceleration strategy, referred to as Early-Stopped DDPM
(ES-DDPM), for DDPMs. The key idea is to stop the diffusion process early where
only the few initial diffusing steps are considered and the reverse denoising
process starts from a non-Gaussian distribution. By further adopting a powerful
pre-trained generative model, such as GAN and VAE, in ES-DDPM, sampling from
the target non-Gaussian distribution can be efficiently achieved by diffusing
samples obtained from the pre-trained generative model. In this way, the number
of required denoising steps is significantly reduced. In the meantime, the
sample quality of ES-DDPM also improves substantially, outperforming both the
vanilla DDPM and the adopted pre-trained generative model. On extensive
experiments across CIFAR-10, CelebA, ImageNet, LSUN-Bedroom and LSUN-Cat,
ES-DDPM obtains promising acceleration effect and performance improvement over
representative baseline methods. Moreover, ES-DDPM also demonstrates several
attractive properties, including being orthogonal to existing acceleration
methods, as well as simultaneously enabling both global semantic and local
pixel-level control in image generation.
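The early-stopping recipe can be summarized in a short sampling sketch: draw a rough sample from the pre-trained generator, forward-diffuse it for only a few steps, and then run the short reverse chain. The interfaces below are assumptions for illustration.

    import torch

    @torch.no_grad()
    def early_stopped_sample(generator, reverse_step, z, alphas_cumprod, t_stop):
        """Hedged sketch: start the reverse chain at t_stop << T from a pre-trained
        generator sample (e.g. a GAN or VAE) instead of pure Gaussian noise.
        reverse_step(x, t) performs one standard DDPM denoising update."""
        x0 = generator(z)                                        # rough but fast sample
        a = alphas_cumprod[t_stop]
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * torch.randn_like(x0)
        for t in range(t_stop, 0, -1):                           # only t_stop reverse steps
            x_t = reverse_step(x_t, t)
        return x_t

Since only the first t_stop noise levels are ever traversed, the number of denoising steps shrinks accordingly, while the short reverse chain refines whatever artifacts the pre-trained generator introduced.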
Flexible Diffusion Modeling of Long Videos
May 23, 2022
William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, Frank Wood
We present a framework for video modeling based on denoising diffusion
probabilistic models that produces long-duration video completions in a variety
of realistic environments. We introduce a generative model that can at
test-time sample any arbitrary subset of video frames conditioned on any other
subset and present an architecture adapted for this purpose. Doing so allows us
to efficiently compare and optimize a variety of schedules for the order in
which frames in a long video are sampled and use selective sparse and
long-range conditioning on previously sampled frames. We demonstrate improved
video modeling over prior work on a number of datasets and sample temporally
coherent videos over 25 minutes in length. We additionally release a new video
modeling dataset and semantically meaningful metrics based on videos generated
in the CARLA autonomous driving simulator.
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
May 23, 2022
Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, Mohammad Norouzi
We present Imagen, a text-to-image diffusion model with an unprecedented
degree of photorealism and a deep level of language understanding. Imagen
builds on the power of large transformer language models in understanding text
and hinges on the strength of diffusion models in high-fidelity image
generation. Our key discovery is that generic large language models (e.g. T5),
pretrained on text-only corpora, are surprisingly effective at encoding text
for image synthesis: increasing the size of the language model in Imagen boosts
both sample fidelity and image-text alignment much more than increasing the
size of the image diffusion model. Imagen achieves a new state-of-the-art FID
score of 7.27 on the COCO dataset, without ever training on COCO, and human
raters find Imagen samples to be on par with the COCO data itself in image-text
alignment. To assess text-to-image models in greater depth, we introduce
DrawBench, a comprehensive and challenging benchmark for text-to-image models.
With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP,
Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen
over other models in side-by-side comparisons, both in terms of sample quality
and image-text alignment. See https://imagen.research.google/ for an overview
of the results.
Planning with Diffusion for Flexible Behavior Synthesis
May 20, 2022
Michael Janner, Yilun Du, Joshua B. Tenenbaum, Sergey Levine
Model-based reinforcement learning methods often use learning only for the
purpose of estimating an approximate dynamics model, offloading the rest of the
decision-making work to classical trajectory optimizers. While conceptually
simple, this combination has a number of empirical shortcomings, suggesting
that learned models may not be well-suited to standard trajectory optimization.
In this paper, we consider what it would look like to fold as much of the
trajectory optimization pipeline as possible into the modeling problem, such
that sampling from the model and planning with it become nearly identical. The
core of our technical approach lies in a diffusion probabilistic model that
plans by iteratively denoising trajectories. We show how classifier-guided
sampling and image inpainting can be reinterpreted as coherent planning
strategies, explore the unusual and useful properties of diffusion-based
planning methods, and demonstrate the effectiveness of our framework in control
settings that emphasize long-horizon decision-making and test-time flexibility.
Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
May 19, 2022
Vikram Voleti, Alexia Jolicoeur-Martineau, Christopher Pal
Video prediction is a challenging task. The quality of video frames from
current state-of-the-art (SOTA) generative models tends to be poor and
generalization beyond the training data is difficult. Furthermore, existing
prediction frameworks are typically not capable of simultaneously handling
other video-related tasks such as unconditional generation or interpolation. In
this work, we devise a general-purpose framework called Masked Conditional
Video Diffusion (MCVD) for all of these video synthesis tasks using a
probabilistic conditional score-based denoising diffusion model, conditioned on
past and/or future frames. We train the model in a manner where we randomly and
independently mask all the past frames or all the future frames. This novel but
straightforward setup allows us to train a single model that is capable of
executing a broad range of video tasks, specifically: future/past prediction –
when only future/past frames are masked; unconditional generation – when both
past and future frames are masked; and interpolation – when neither past nor
future frames are masked. Our experiments show that this approach can generate
high-quality frames for diverse types of videos. Our MCVD models are built from
simple non-recurrent 2D-convolutional architectures, conditioning on blocks of
frames and generating blocks of frames. We generate videos of arbitrary lengths
autoregressively in a block-wise manner. Our approach yields SOTA results
across standard video prediction and interpolation benchmarks, with computation
times for training models measured in 1-12 days using $\le$ 4 GPUs. Project
page: https://mask-cond-video-diffusion.github.io ; Code :
https://github.com/voletiv/mcvd-pytorch
Diffusion Models for Adversarial Purification
May 16, 2022
Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, Anima Anandkumar
Adversarial purification refers to a class of defense methods that remove
adversarial perturbations using a generative model. These methods do not make
assumptions on the form of attack and the classification model, and thus can
defend pre-existing classifiers against unseen threats. However, their
performance currently falls behind adversarial training methods. In this work,
we propose DiffPure that uses diffusion models for adversarial purification:
Given an adversarial example, we first diffuse it with a small amount of noise
following a forward diffusion process, and then recover the clean image through
a reverse generative process. To evaluate our method against strong adaptive
attacks in an efficient and scalable way, we propose to use the adjoint method
to compute full gradients of the reverse generative process. Extensive
experiments on three image datasets including CIFAR-10, ImageNet and CelebA-HQ
with three classifier architectures including ResNet, WideResNet and ViT
demonstrate that our method achieves the state-of-the-art results,
outperforming current adversarial training and adversarial purification
methods, often by a large margin. Project page: https://diffpure.github.io.
On Conditioning the Input Noise for Controlled Image Generation with Diffusion Models
May 08, 2022
Vedant Singh, Surgan Jandial, Ayush Chopra, Siddharth Ramesh, Balaji Krishnamurthy, Vineeth N. Balasubramanian
Conditional image generation has paved the way for several breakthroughs in
image editing, generating stock photos and 3-D object generation. This
continues to be a significant area of interest with the rise of new
state-of-the-art methods that are based on diffusion models. However, diffusion
models provide very little control over the generated image, which led to
subsequent works exploring techniques like classifier guidance, that provides a
way to trade off diversity with fidelity. In this work, we explore techniques
to condition diffusion models with carefully crafted input noise artifacts.
This allows generation of images conditioned on semantic attributes. This is
different from existing approaches that input Gaussian noise and further
introduce conditioning at the diffusion model’s inference step. Our experiments
over several examples and conditional settings show the potential of our
approach.
Subspace Diffusion Generative Models
May 03, 2022
Bowen Jing, Gabriele Corso, Renato Berlinghieri, Tommi Jaakkola
Score-based models generate samples by mapping noise to data (and vice versa)
via a high-dimensional diffusion process. We question whether it is necessary
to run this entire process at high dimensionality and incur all the
inconveniences thereof. Instead, we restrict the diffusion via projections onto
subspaces as the data distribution evolves toward noise. When applied to
state-of-the-art models, our framework simultaneously improves sample quality
– reaching an FID of 2.17 on unconditional CIFAR-10 – and reduces the
computational cost of inference for the same number of denoising steps. Our
framework is fully compatible with continuous-time diffusion and retains its
flexible capabilities, including exact log-likelihoods and controllable
generation. Code is available at
https://github.com/bjing2016/subspace-diffusion.
Fast Sampling of Diffusion Models with Exponential Integrator
April 29, 2022
Qinsheng Zhang, Yongxin Chen
The past few years have witnessed the great success of Diffusion models~(DMs)
in generating high-fidelity samples in generative modeling tasks. A major
limitation of the DM is its notoriously slow sampling procedure which normally
requires hundreds to thousands of time discretization steps of the learned
diffusion process to reach the desired accuracy. Our goal is to develop a fast
sampling method for DMs with far fewer steps while retaining high
sample quality. To this end, we systematically analyze the sampling procedure
in DMs and identify key factors that affect the sample quality, among which the
method of discretization is most crucial. By carefully examining the learned
diffusion process, we propose Diffusion Exponential Integrator Sampler~(DEIS).
It is based on the Exponential Integrator designed for discretizing ordinary
differential equations (ODEs) and leverages a semilinear structure of the
learned diffusion process to reduce the discretization error. The proposed
method can be applied to any DM and can generate high-fidelity samples in as
few as 10 steps. In our experiments, it takes about 3 minutes on one A6000 GPU
to generate $50k$ images from CIFAR10. Moreover, by directly using pre-trained
DMs, we achieve state-of-the-art sampling performance when the number of score
function evaluations~(NFEs) is limited, e.g., 4.17 FID with 10 NFEs, 3.37 FID,
and 9.74 IS with only 15 NFEs on CIFAR10. Code is available at
https://github.com/qsh-zh/deis
Retrieval-Augmented Diffusion Models
April 25, 2022
Andreas Blattmann, Robin Rombach, Kaan Oktay, Jonas Müller, Björn Ommer
Novel architectures have recently improved generative image synthesis leading
to excellent visual quality in various tasks. Much of this success is due to
the scalability of these architectures and hence caused by a dramatic increase
in model complexity and in the computational resources invested in training
these models. Our work questions the underlying paradigm of compressing large
training data into ever growing parametric representations. We rather present
an orthogonal, semi-parametric approach. We complement comparably small
diffusion or autoregressive models with a separate image database and a
retrieval strategy. During training we retrieve a set of nearest neighbors from
this external database for each training instance and condition the generative
model on these informative samples. While the retrieval approach is providing
the (local) content, the model is focusing on learning the composition of
scenes based on this content. As demonstrated by our experiments, simply
swapping the database for one with different contents transfers a trained model
post-hoc to a novel domain. The evaluation shows competitive performance on
tasks which the generative model has not been trained on, such as
class-conditional synthesis, zero-shot stylization or text-to-image synthesis
without requiring paired text-image data. With negligible memory and
computational overhead for the external database and retrieval we can
significantly reduce the parameter count of the generative model and still
outperform the state-of-the-art.
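The retrieval step can be sketched independently of the generative backbone: embed the training (or query) image in a shared space, fetch its nearest neighbours from the external database, and hand their embeddings to the model as conditioning, e.g. via cross-attention. Cosine similarity and the tensor shapes below are assumptions for illustration.

    import torch

    @torch.no_grad()
    def retrieve_neighbor_conditioning(query_embs, database_embs, k=8):
        """Hedged sketch of the retrieval step.
        query_embs:    (B, D) embeddings of the training/query images
        database_embs: (N, D) embeddings of the external image database
        returns:       (B, k, D) nearest-neighbour embeddings used as conditioning"""
        q = query_embs / query_embs.norm(dim=-1, keepdim=True)
        db = database_embs / database_embs.norm(dim=-1, keepdim=True)
        sims = q @ db.t()                          # cosine similarities, (B, N)
        idx = sims.topk(k, dim=-1).indices         # indices of the k nearest neighbours
        return database_embs[idx]                  # gathered conditioning set

Swapping database_embs for a database with different contents at inference time is exactly what enables the post-hoc domain transfer described above.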
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis
April 21, 2022
Rongjie Huang, Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu, Yi Ren, Zhou Zhao
Denoising diffusion probabilistic models (DDPMs) have recently achieved
leading performance in many generative tasks. However, the cost of their
inherently iterative sampling process has hindered their application to speech synthesis. This
paper proposes FastDiff, a fast conditional diffusion model for high-quality
speech synthesis. FastDiff employs a stack of time-aware location-variable
convolutions of diverse receptive field patterns to efficiently model long-term
time dependencies with adaptive conditions. A noise schedule predictor is also
adopted to reduce the sampling steps without sacrificing the generation
quality. Based on FastDiff, we design an end-to-end text-to-speech synthesizer,
FastDiff-TTS, which generates high-fidelity speech waveforms without any
intermediate feature (e.g., Mel-spectrogram). Our evaluation of FastDiff
demonstrates the state-of-the-art results with higher-quality (MOS 4.28) speech
samples. Also, FastDiff enables a sampling speed of 58x faster than real-time
on a V100 GPU, making diffusion models practically applicable to speech
synthesis deployment for the first time. We further show that FastDiff
generalized well to the mel-spectrogram inversion of unseen speakers, and
FastDiff-TTS outperformed other competing methods in end-to-end text-to-speech
synthesis. Audio samples are available at \url{https://FastDiff.github.io/}.
A Score-based Geometric Model for Molecular Dynamics Simulations
April 19, 2022
Fang Wu, Stan Z. Li
Molecular dynamics (MD) has long been the de facto choice for simulating
complex atomistic systems from first principles. Recently, deep learning models
have become a popular way to accelerate MD. Nevertheless, existing models depend
on intermediate variables such as the potential energy or force fields to
update atomic positions, which requires additional computations to perform
back-propagation. To waive this requirement, we propose a novel model called
DiffMD by directly estimating the gradient of the log density of molecular
conformations. DiffMD relies on a score-based denoising diffusion generative
model that perturbs the molecular structure with a conditional noise depending
on atomic accelerations and treats conformations at previous timeframes as the
prior distribution for sampling. Another challenge of modeling such a
conformation generation process is that a molecule is kinetic rather than
static, which no prior work has rigorously studied. To address this challenge, we
propose an equivariant geometric Transformer as the score function in the
diffusion process to calculate corresponding gradients. It incorporates the
directions and velocities of atomic motions via 3D spherical Fourier-Bessel
representations. With multiple architectural improvements, we outperform
state-of-the-art baselines on MD17 and isomers of C7O2H10 datasets. This work
contributes to accelerating material and drug discovery.
Hierarchical Text-Conditional Image Generation with CLIP Latents
April 13, 2022
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen
Contrastive models like CLIP have been shown to learn robust representations
of images that capture both semantics and style. To leverage these
representations for image generation, we propose a two-stage model: a prior
that generates a CLIP image embedding given a text caption, and a decoder that
generates an image conditioned on the image embedding. We show that explicitly
generating image representations improves image diversity with minimal loss in
photorealism and caption similarity. Our decoders conditioned on image
representations can also produce variations of an image that preserve both its
semantics and style, while varying the non-essential details absent from the
image representation. Moreover, the joint embedding space of CLIP enables
language-guided image manipulations in a zero-shot fashion. We use diffusion
models for the decoder and experiment with both autoregressive and diffusion
models for the prior, finding that the latter are computationally more
efficient and produce higher-quality samples.
Video Diffusion Models
April 07, 2022
Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet
Generating temporally coherent high fidelity video is an important milestone
in generative modeling research. We make progress towards this milestone by
proposing a diffusion model for video generation that shows very promising
initial results. Our model is a natural extension of the standard image
diffusion architecture, and it enables jointly training from image and video
data, which we find to reduce the variance of minibatch gradients and speed up
optimization. To generate long and higher resolution videos we introduce a new
conditional sampling technique for spatial and temporal video extension that
performs better than previously proposed methods. We present the first results
on a large text-conditioned video generation task, as well as state-of-the-art
results on established benchmarks for video prediction and unconditional video
generation. Supplementary material is available at
https://video-diffusion.github.io/
KNN-Diffusion: Image Generation via Large-Scale Retrieval
April 06, 2022
Shelly Sheynin, Oron Ashual, Adam Polyak, Uriel Singer, Oran Gafni, Eliya Nachmani, Yaniv Taigman
cs.CV, cs.AI, cs.CL, cs.GR, cs.LG
Recent text-to-image models have achieved impressive results. However, since
they require large-scale datasets of text-image pairs, it is impractical to
train them on new domains where data is scarce or not labeled. In this work, we
propose using large-scale retrieval methods, in particular, efficient
k-Nearest-Neighbors (kNN), which offers novel capabilities: (1) training a
substantially small and efficient text-to-image diffusion model without any
text, (2) generating out-of-distribution images by simply swapping the
retrieval database at inference time, and (3) performing text-driven local
semantic manipulations while preserving object identity. To demonstrate the
robustness of our method, we apply our kNN approach on two state-of-the-art
diffusion backbones, and show results on several different datasets. As
evaluated by human studies and automatic metrics, our method achieves
state-of-the-art results compared to existing approaches that train
text-to-image generation models using images only (without paired text data).
Perception Prioritized Training of Diffusion Models
April 01, 2022
Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, Sungroh Yoon
Diffusion models learn to restore noisy data, which is corrupted with
different levels of noise, by optimizing the weighted sum of the corresponding
loss terms, i.e., denoising score matching loss. In this paper, we show that
restoring data corrupted with certain noise levels offers a proper pretext task
for the model to learn rich visual concepts. We propose to prioritize such
noise levels over other levels during training, by redesigning the weighting
scheme of the objective function. We show that our simple redesign of the
weighting scheme significantly improves the performance of diffusion models
regardless of the datasets, architectures, and sampling strategies.
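As a concrete illustration of the re-weighting idea described above, the sketch below shows a noise-prediction diffusion loss whose per-noise-level weight is a function of the signal-to-noise ratio. The function names and the specific weighting form p2_like_weight are our own illustrative assumptions, not the paper's exact scheme.

    import torch

    def weighted_dsm_loss(model, x0, alphas_cumprod, weight_fn):
        # One training step of an eps-prediction diffusion model with a
        # per-noise-level weight applied to the denoising score matching term.
        b = x0.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
        a_bar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
        noise = torch.randn_like(x0)
        x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise        # forward corruption
        eps_pred = model(x_t, t)
        per_example = ((eps_pred - noise) ** 2).flatten(1).mean(1)  # plain DSM term
        return (weight_fn(t, alphas_cumprod) * per_example).mean()  # re-weighted objective

    def p2_like_weight(t, alphas_cumprod, k=1.0, gamma=1.0):
        # Illustrative weight that down-weights very low-noise levels via the SNR;
        # the weighting actually proposed in the paper may differ in detail.
        snr = alphas_cumprod[t] / (1 - alphas_cumprod[t])
        return 1.0 / (k + snr) ** gamma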
Generating High Fidelity Data from Low-density Regions using Diffusion Models
March 31, 2022
Vikash Sehwag, Caner Hazirbas, Albert Gordo, Firat Ozgenel, Cristian Canton Ferrer
Our work focuses on addressing sample deficiency from low-density regions of
the data manifold in common image datasets. We leverage diffusion-process-based
generative models to synthesize novel images from low-density regions. We
observe that uniform sampling from diffusion models predominantly samples from
high-density regions of the data manifold. Therefore, we modify the sampling
process to guide it towards low-density regions while simultaneously
maintaining the fidelity of synthetic data. We rigorously demonstrate that our
process successfully generates novel high fidelity samples from low-density
regions. We further examine generated samples and show that the model does not
memorize low-density data and indeed learns to generate novel samples from
low-density regions.
Speech Enhancement with Score-Based Generative Models in the Complex STFT Domain
March 31, 2022
Simon Welker, Julius Richter, Timo Gerkmann
Score-based generative models (SGMs) have recently shown impressive results
for difficult generative tasks such as the unconditional and conditional
generation of natural images and audio signals. In this work, we extend these
models to the complex short-time Fourier transform (STFT) domain, proposing a
novel training task for speech enhancement using a complex-valued deep neural
network. We derive this training task within the formalism of stochastic
differential equations (SDEs), thereby enabling the use of predictor-corrector
samplers. We provide alternative formulations inspired by previous publications
on using generative diffusion models for speech enhancement, avoiding the need
for any prior assumptions on the noise distribution and making the training
task purely generative which, as we show, results in improved enhancement
performance.
Equivariant Diffusion for Molecule Generation in 3D
March 31, 2022
Emiel Hoogeboom, Victor Garcia Satorras, Clément Vignac, Max Welling
This work introduces a diffusion model for molecule generation in 3D that is
equivariant to Euclidean transformations. Our E(3) Equivariant Diffusion Model
(EDM) learns to denoise a diffusion process with an equivariant network that
jointly operates on both continuous (atom coordinates) and categorical features
(atom types). In addition, we provide a probabilistic analysis which admits
likelihood computation of molecules using our model. Experimentally, the
proposed method significantly outperforms previous 3D molecular generative
methods regarding the quality of generated samples and efficiency at training
time.
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping
March 31, 2022
Yuma Koizumi, Heiga Zen, Kohei Yatabe, Nanxin Chen, Michiel Bacchiani
eess.AS, cs.LG, cs.SD, stat.ML
Neural vocoder using denoising diffusion probabilistic model (DDPM) has been
improved by adaptation of the diffusion noise distribution to given acoustic
features. In this study, we propose SpecGrad that adapts the diffusion noise so
that its time-varying spectral envelope becomes close to the conditioning
log-mel spectrogram. This adaptation by time-varying filtering improves the
sound quality especially in the high-frequency bands. It is processed in the
time-frequency domain to keep the computational cost almost the same as the
conventional DDPM-based neural vocoders. Experimental results showed that
SpecGrad generates higher-fidelity speech waveform than conventional DDPM-based
neural vocoders in both analysis-synthesis and speech enhancement scenarios.
Audio demos are available at wavegrad.github.io/specgrad/.
Diffusion Models for Counterfactual Explanations
March 29, 2022
Guillaume Jeanneret, Loïc Simon, Frédéric Jurie
Counterfactual explanations have shown promising results as a post-hoc
framework to make image classifiers more explainable. In this paper, we propose
DiME, a method allowing the generation of counterfactual images using the
recent diffusion models. By leveraging the guided generative diffusion process,
our proposed methodology shows how to use the gradients of the target
classifier to generate counterfactual explanations of input instances. Further,
we analyze current approaches to evaluate spurious correlations and extend the
evaluation measurements by proposing a new metric: Correlation Difference. Our
experimental validations show that the proposed algorithm surpasses previous
State-of-the-Art results on 5 out of 6 metrics on CelebA.
Denoising Likelihood Score Matching for Conditional Score-based Data Generation
March 27, 2022
Chen-Hao Chao, Wei-Fang Sun, Bo-Wun Cheng, Yi-Chen Lo, Chia-Che Chang, Yu-Lun Liu, Yu-Lin Chang, Chia-Ping Chen, Chun-Yi Lee
Many existing conditional score-based data generation methods utilize Bayes’
theorem to decompose the gradients of a log posterior density into a mixture of
scores. These methods facilitate the training procedure of conditional score
models, as a mixture of scores can be separately estimated using a score model
and a classifier. However, our analysis indicates that the training objectives
for the classifier in these methods may lead to a serious score mismatch issue,
which corresponds to the situation that the estimated scores deviate from the
true ones. Such an issue causes the samples to be misled by the deviated scores
during the diffusion process, resulting in a degraded sampling quality. To
resolve it, we formulate a novel training objective, called Denoising
Likelihood Score Matching (DLSM) loss, for the classifier to match the
gradients of the true log likelihood density. Our experimental evidence shows
that the proposed method outperforms the previous methods on both CIFAR-10 and
CIFAR-100 benchmarks noticeably in terms of several key evaluation metrics. We
thus conclude that, by adopting DLSM, the conditional scores can be accurately
modeled, and the effect of the score mismatch issue is alleviated.
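For reference, the Bayes decomposition that the abstract alludes to is the standard identity splitting the conditional score into an unconditional score plus a classifier gradient,

$$\nabla_{x_t} \log p(x_t \mid y) \;=\; \nabla_{x_t} \log p(x_t) \;+\; \nabla_{x_t} \log p(y \mid x_t),$$

and DLSM, as described above, trains the classifier so that its gradient term matches the gradient of the true log-likelihood rather than only its class probabilities; the display is a statement of the identity, not a reproduction of the paper's full objective.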
BBDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis
March 25, 2022
Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu
eess.AS, cs.AI, cs.LG, cs.SD, eess.SP
Diffusion probabilistic models (DPMs) and their extensions have emerged as
competitive generative models yet confront challenges of efficient sampling. We
propose a new bilateral denoising diffusion model (BDDM) that parameterizes
both the forward and reverse processes with a schedule network and a score
network, which can train with a novel bilateral modeling objective. We show
that the new surrogate objective can achieve a lower bound of the log marginal
likelihood tighter than a conventional surrogate. We also find that BDDM allows
inheriting pre-trained score network parameters from any DPMs and consequently
enables speedy and stable learning of the schedule network and optimization of
a noise schedule for sampling. Our experiments demonstrate that BDDMs can
generate high-fidelity audio samples with as few as three sampling steps.
Moreover, compared to other state-of-the-art diffusion-based neural vocoders,
BDDMs produce comparable or higher quality samples indistinguishable from human
speech, notably with only seven sampling steps (143x faster than WaveGrad and
28.6x faster than DiffWave). We release our code at
https://github.com/tencent-ailab/bddm.
Accelerating Bayesian Optimization for Biological Sequence design with Denoising Autoencoders
March 23, 2022
Samuel Stanton, Wesley Maddox, Nate Gruver, Phillip Maffettone, Emily Delaney, Peyton Greenside, Andrew Gordon Wilson
cs.LG, cs.NE, q-bio.QM, stat.ML
Bayesian optimization (BayesOpt) is a gold standard for query-efficient
continuous optimization. However, its adoption for drug design has been
hindered by the discrete, high-dimensional nature of the decision variables. We
develop a new approach (LaMBO) which jointly trains a denoising autoencoder
with a discriminative multi-task Gaussian process head, allowing gradient-based
optimization of multi-objective acquisition functions in the latent space of
the autoencoder. These acquisition functions allow LaMBO to balance the
explore-exploit tradeoff over multiple design rounds, and to balance objective
tradeoffs by optimizing sequences at many different points on the Pareto
frontier. We evaluate LaMBO on two small-molecule design tasks, and introduce
new tasks optimizing \emph{in silico} and \emph{in vitro} properties of
large-molecule fluorescent proteins. In our experiments LaMBO outperforms
genetic optimizers and does not require a large pretraining corpus,
demonstrating that BayesOpt is practical and effective for biological sequence
design.
MR Image Denoising and Super-Resolution Using Regularized Reverse Diffusion
March 23, 2022
Hyungjin Chung, Eun Sun Lee, Jong Chul Ye
eess.IV, cs.AI, cs.CV, cs.LG
Patient scans from MRI often suffer from noise, which hampers the diagnostic
capability of such images. As a method to mitigate such artifacts, denoising is
largely studied both within the medical imaging community and beyond the
community as a general subject. However, recent deep neural network-based
approaches mostly rely on the minimum mean squared error (MMSE) estimates,
which tend to produce a blurred output. Moreover, such models suffer when
deployed in real-world situations: out-of-distribution data and complex noise
distributions that deviate from the usual parametric noise models. In this
work, we propose a new denoising method based on score-based reverse diffusion
sampling, which overcomes all the aforementioned drawbacks. Our network,
trained only with coronal knee scans, excels even on out-of-distribution in
vivo liver MRI data, contaminated with complex mixture of noise. Even more, we
propose a method to enhance the resolution of the denoised image with the same
network. With extensive experiments, we show that our method establishes
state-of-the-art performance, while having desirable properties which prior
MMSE denoisers did not have: flexibly choosing the extent of denoising, and
quantifying uncertainty.
Dual Diffusion Implicit Bridges for Image-to-Image Translation
March 16, 2022
Xuan Su, Jiaming Song, Chenlin Meng, Stefano Ermon
Common image-to-image translation methods rely on joint training over data
from both source and target domains. The training process requires concurrent
access to both datasets, which hinders data separation and privacy protection;
and existing models cannot be easily adapted for translation of new domain
pairs. We present Dual Diffusion Implicit Bridges (DDIBs), an image translation
method based on diffusion models, that circumvents training on domain pairs.
Image translation with DDIBs relies on two diffusion models trained
independently on each domain, and is a two-step process: DDIBs first obtain
latent encodings for source images with the source diffusion model, and then
decode such encodings using the target model to construct target images. Both
steps are defined via ordinary differential equations (ODEs), thus the process
is cycle consistent only up to discretization errors of the ODE solvers.
Theoretically, we interpret DDIBs as concatenation of source to latent, and
latent to target Schrodinger Bridges, a form of entropy-regularized optimal
transport, to explain the efficacy of the method. Experimentally, we apply
DDIBs on synthetic and high-resolution image datasets, to demonstrate their
utility in a wide variety of translation tasks and their inherent optimal
transport properties.
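A minimal one-dimensional sketch of the two-step translation described above, using analytic Gaussian scores in place of trained diffusion models and simple Euler integration of the variance-exploding probability-flow ODE; the noise range, function names, and toy domains are illustrative assumptions.

    import numpy as np

    def gaussian_score(x, sigma, mu, s2):
        # Score of N(mu, s2) data perturbed by N(0, sigma^2) noise, i.e. of N(mu, s2 + sigma^2).
        return -(x - mu) / (s2 + sigma ** 2)

    def pf_ode(x, score, sig_from, sig_to, n_steps=2000):
        # Euler integration of the VE probability-flow ODE, dx/dsigma = -sigma * score(x, sigma).
        sigmas = np.linspace(sig_from, sig_to, n_steps + 1)
        for i in range(n_steps):
            sig, dsig = sigmas[i], sigmas[i + 1] - sigmas[i]
            x = x + dsig * (-sig * score(x, sig))
        return x

    rng = np.random.default_rng(0)
    source_score = lambda x, s: gaussian_score(x, s, mu=-2.0, s2=0.25)  # "source domain" model
    target_score = lambda x, s: gaussian_score(x, s, mu=3.0, s2=1.0)    # "target domain" model

    x_source = rng.normal(-2.0, 0.5, size=5000)            # samples to translate
    z_latent = pf_ode(x_source, source_score, 1e-3, 20.0)  # step 1: encode with the source ODE
    x_target = pf_ode(z_latent, target_score, 20.0, 1e-3)  # step 2: decode with the target ODE
    print(x_target.mean(), x_target.std())  # moves close to the target statistics (about 2.75
                                            # and 1.0 here; the bias shrinks as the max noise grows)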
Diffusion Probabilistic Modeling for Video Generation
March 16, 2022
Ruihan Yang, Prakhar Srivastava, Stephan Mandt
Denoising diffusion probabilistic models are a promising new class of
generative models that mark a milestone in high-quality image generation. This
paper showcases their ability to sequentially generate video, surpassing prior
methods in perceptual and probabilistic forecasting metrics. We propose an
autoregressive, end-to-end optimized video diffusion model inspired by recent
advances in neural video compression. The model successively generates future
frames by correcting a deterministic next-frame prediction using a stochastic
residual generated by an inverse diffusion process. We compare this approach
against five baselines on four datasets involving natural and simulation-based
videos. We find significant improvements in terms of perceptual quality for all
datasets. Furthermore, by introducing a scalable version of the Continuous
Ranked Probability Score (CRPS) applicable to video, we show that our model
also outperforms existing approaches in their probabilistic frame forecasting
ability.
Score matching enables causal discovery of nonlinear additive noise models
March 08, 2022
Paul Rolland, Volkan Cevher, Matthäus Kleindessner, Chris Russel, Bernhard Schölkopf, Dominik Janzing, Francesco Locatello
This paper demonstrates how to recover causal graphs from the score of the
data distribution in non-linear additive (Gaussian) noise models. Using score
matching algorithms as a building block, we show how to design a new generation
of scalable causal discovery methods. To showcase our approach, we also propose
a new efficient method for approximating the score’s Jacobian, enabling us to
recover the causal graph. Empirically, we find that the new algorithm, called
SCORE, is competitive with state-of-the-art causal discovery methods while
being significantly faster.
Diffusion Models for Medical Anomaly Detection
March 08, 2022
Julia Wolleb, Florentin Bieder, Robin Sandkühler, Philippe C. Cattin
In medical applications, weakly supervised anomaly detection methods are of
great interest, as only image-level annotations are required for training.
Current anomaly detection methods mainly rely on generative adversarial
networks or autoencoder models. Those models are often complicated to train or
have difficulties to preserve fine details in the image. We present a novel
weakly supervised anomaly detection method based on denoising diffusion
implicit models. We combine the deterministic iterative noising and denoising
scheme with classifier guidance for image-to-image translation between diseased
and healthy subjects. Our method generates very detailed anomaly maps without
the need for a complex training procedure. We evaluate our method on the
BRATS2020 dataset for brain tumor detection and the CheXpert dataset for
detecting pleural effusions.
Dynamic Dual-Output Diffusion Models
March 08, 2022
Yaniv Benny, Lior Wolf
Iterative denoising-based generation, also known as denoising diffusion
models, has recently been shown to be comparable in quality to other classes of
generative models, and even to surpass them, including, in particular,
Generative Adversarial Networks (GANs), which are currently the state of the art in many
sub-tasks of image generation. However, a major drawback of this method is that
it requires hundreds of iterations to produce a competitive result. Recent
works have proposed solutions that allow for faster generation with fewer
iterations, but the image quality gradually deteriorates with increasingly
fewer iterations being applied during generation. In this paper, we reveal some
of the causes that affect the generation quality of diffusion models,
especially when sampling with few iterations, and come up with a simple, yet
effective, solution to mitigate them. We consider two opposite equations for
the iterative denoising: the first predicts the applied noise, and the second
predicts the image directly. Our solution takes the two options and learns to
dynamically alternate between them through the denoising process. Our proposed
solution is general and can be applied to any existing diffusion model. As we
show, when applied to various SOTA architectures, our solution immediately
improves their generation quality, with negligible added complexity and
parameters. We experiment on multiple datasets and configurations and run an
extensive ablation study to support these findings.
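The two parameterizations mentioned above are related by the standard forward-process identity; writing $\bar\alpha_t$ for the cumulative signal level, a network that predicts the noise and one that predicts the image are interchangeable via

$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon \qquad\Longleftrightarrow\qquad \hat{x}_0 = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\hat{\epsilon}}{\sqrt{\bar\alpha_t}},$$

so the dynamic alternation described in the abstract can be read, roughly, as choosing at each denoising step which of the two equivalent predictions to rely on.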
March 08, 2022
Cheng Peng, Pengfei Guo, S. Kevin Zhou, Vishal Patel, Rama Chellappa
Magnetic Resonance (MR) image reconstruction from under-sampled acquisition
promises faster scanning time. To this end, current State-of-The-Art (SoTA)
approaches leverage deep neural networks and supervised training to learn a
recovery model. While these approaches achieve impressive performances, the
learned model can be fragile on unseen degradation, e.g. when given a different
acceleration factor. These methods are also generally deterministic and provide
a single solution to an ill-posed problem; as such, it can be difficult for
practitioners to understand the reliability of the reconstruction. We introduce
DiffuseRecon, a novel diffusion model-based MR reconstruction method.
DiffuseRecon guides the generation process based on the observed signals and a
pre-trained diffusion model, and does not require additional training on
specific acceleration factors. DiffuseRecon is stochastic in nature and
generates results from a distribution of fully-sampled MR images; as such, it
allows us to explicitly visualize different potential reconstruction solutions.
Lastly, DiffuseRecon proposes an accelerated, coarse-to-fine Monte-Carlo
sampling scheme to approximate the most likely reconstruction candidate. The
proposed DiffuseRecon achieves SoTA performances reconstructing from raw
acquisition signals in fastMRI and SKM-TEA. Code will be open-sourced at
www.github.com/cpeng93/DiffuseRecon.
Score-Based Generative Models for Molecule Generation
March 07, 2022
Dwaraknath Gnaneshwar, Bharath Ramsundar, Dhairya Gandhi, Rachel Kurchin, Venkatasubramanian Viswanathan
Recent advances in generative models have made exploring design spaces easier
for de novo molecule generation. However, popular generative models like GANs
and normalizing flows face challenges such as training instabilities due to
adversarial training and architectural constraints, respectively. Score-based
generative models sidestep these challenges by modelling the gradient of the
log probability density using a score function approximation, as opposed to
modelling the density function directly, and sampling from it using annealed
Langevin Dynamics. We believe that score-based generative models could open up
new opportunities in molecule generation due to their architectural
flexibility, such as replacing the score function with an SE(3) equivariant
model. In this work, we lay the foundations by testing the efficacy of
score-based models for molecule generation. We train a Transformer-based score
function on Self-Referencing Embedded Strings (SELFIES) representations of 1.5
million samples from the ZINC dataset and use the Moses benchmarking framework
to evaluate the generated samples on a suite of metrics.
Conditional Simulation Using Diffusion Schrodinger Bridges
February 27, 2022
Yuyang Shi, Valentin De Bortoli, George Deligiannidis, Arnaud Doucet
Denoising diffusion models have recently emerged as a powerful class of
generative models. They provide state-of-the-art results, not only for
unconditional simulation, but also when used to solve conditional simulation
problems arising in a wide range of inverse problems. A limitation of these
models is that they are computationally intensive at generation time as they
require simulating a diffusion process over a long time horizon. When
performing unconditional simulation, a Schrödinger bridge formulation of
generative modeling leads to a theoretically grounded algorithm shortening
generation time which is complementary to other proposed acceleration
techniques. We extend the Schrödinger bridge framework to conditional
simulation. We demonstrate this novel methodology on various applications
including image super-resolution, optimal filtering for state-space models and
the refinement of pre-trained networks. Our code can be found at
https://github.com/vdeborto/cdsb.
Diffusion Causal Models for Counterfactual Estimation
February 21, 2022
Pedro Sanchez, Sotirios A. Tsaftaris
We consider the task of counterfactual estimation from observational imaging
data given a known causal structure. In particular, quantifying the causal
effect of interventions for high-dimensional data with neural networks remains
an open challenge. Herein we propose Diff-SCM, a deep structural causal model
that builds on recent advances of generative energy-based models. In our
setting, inference is performed by iteratively sampling gradients of the
marginal and conditional distributions entailed by the causal model.
Counterfactual estimation is achieved by firstly inferring latent variables
with deterministic forward diffusion, then intervening on a reverse diffusion
process using the gradients of an anti-causal predictor w.r.t the input.
Furthermore, we propose a metric for evaluating the generated counterfactuals.
We find that Diff-SCM produces more realistic and minimal counterfactuals than
baselines on MNIST data and can also be applied to ImageNet data. Code is
available https://github.com/vios-s/Diff-SCM.
Pseudo Numerical Methods for Diffusion Models on Manifolds
February 20, 2022
Luping Liu, Yi Ren, Zhijie Lin, Zhou Zhao
cs.CV, cs.LG, cs.NA, math.NA, stat.ML
Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality
samples such as image and audio samples. However, DDPMs require hundreds to
thousands of iterations to produce final samples. Several prior works have
successfully accelerated DDPMs through adjusting the variance schedule (e.g.,
Improved Denoising Diffusion Probabilistic Models) or the denoising equation
(e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these
acceleration methods cannot maintain the quality of samples and even introduce
new noise at a high speedup rate, which limits their practicability. To
accelerate the inference process while keeping the sample quality, we provide a
fresh perspective that DDPMs should be treated as solving differential
equations on manifolds. Under such a perspective, we propose pseudo numerical
methods for diffusion models (PNDMs). Specifically, we figure out how to solve
differential equations on manifolds and show that DDIMs are simple cases of
pseudo numerical methods. We change several classical numerical methods to
corresponding pseudo numerical methods and find that the pseudo linear
multi-step method is the best in most situations. According to our experiments,
by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can
generate higher quality synthetic images with only 50 steps compared with
1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps
(by around 0.4 in FID) and have good generalization on different variance
schedules. Our implementation is available at
https://github.com/luping-liu/PNDM.
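As a rough illustration of the pseudo linear multi-step idea, the sketch below combines the last few noise predictions with classical Adams-Bashforth coefficients and then applies a deterministic DDIM-style transfer step. The exact update rules and warm-up procedure of PNDM differ, so treat this as an assumption-laden simplification with hypothetical function names.

    import torch

    def ddim_transfer(x_t, eps, a_bar_t, a_bar_prev):
        # Deterministic DDIM-style update: reconstruct x0 from the noise estimate,
        # then re-noise it to the previous (lower) noise level.
        x0_hat = (x_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
        return a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps

    def plms_step(model, x_t, t, t_prev, a_bar, eps_buffer):
        # eps_buffer holds the three most recent noise predictions, oldest first;
        # a_bar is a 1-D tensor of cumulative alpha-bar values indexed by timestep.
        eps_t = model(x_t, t)
        e1, e2, e3 = eps_buffer[-1], eps_buffer[-2], eps_buffer[-3]
        eps_comb = (55 * eps_t - 59 * e1 + 37 * e2 - 9 * e3) / 24  # Adams-Bashforth combination
        x_prev = ddim_transfer(x_t, eps_comb, a_bar[t], a_bar[t_prev])
        return x_prev, eps_buffer[1:] + [eps_t]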
Truncated Diffusion Probabilistic Models
February 19, 2022
Huangjie Zheng, Pengcheng He, Weizhu Chen, Mingyuan Zhou
Employing a forward diffusion chain to gradually map the data to a noise
distribution, diffusion-based generative models learn how to generate the data
by inferring a reverse diffusion chain. However, this approach is slow and
costly because it needs many forward and reverse steps. We propose a faster and
cheaper approach that adds noise not until the data become pure random noise,
but until they reach a hidden noisy data distribution that we can confidently
learn. Then, we use fewer reverse steps to generate data by starting from this
hidden distribution that is made similar to the noisy data. We reveal that the
proposed model can be cast as an adversarial auto-encoder empowered by both the
diffusion process and a learnable implicit prior. Experimental results show
even with a significantly smaller number of reverse diffusion steps, the
proposed truncated diffusion probabilistic models can provide consistent
improvements over the non-truncated ones in terms of performance in both
unconditional and text-guided image generations.
Understanding DDPM Latent Codes Through Optimal Transport
February 14, 2022
Valentin Khrulkov, Gleb Ryzhakov, Andrei Chertkov, Ivan Oseledets
stat.ML, cs.AI, cs.LG, cs.NA, math.AP, math.NA
Diffusion models have recently outperformed alternative approaches to model
the distribution of natural images, such as GANs. Such diffusion models allow
for deterministic sampling via the probability flow ODE, giving rise to a
latent space and an encoder map. While having important practical applications,
such as estimation of the likelihood, the theoretical properties of this map
are not yet fully understood. In the present work, we partially address this
question for the popular case of the VP SDE (DDPM) approach. We show that,
perhaps surprisingly, the DDPM encoder map coincides with the optimal transport
map for common distributions; we support this claim theoretically and by
extensive numerical experiments.
Learning Fast Samplers for Diffusion Models by Differentiating Through Sample Quality
February 11, 2022
Daniel Watson, William Chan, Jonathan Ho, Mohammad Norouzi
Diffusion models have emerged as an expressive family of generative models
rivaling GANs in sample quality and autoregressive models in likelihood scores.
Standard diffusion models typically require hundreds of forward passes through
the model to generate a single high-fidelity sample. We introduce
Differentiable Diffusion Sampler Search (DDSS): a method that optimizes fast
samplers for any pre-trained diffusion model by differentiating through sample
quality scores. We also present Generalized Gaussian Diffusion Models (GGDM), a
family of flexible non-Markovian samplers for diffusion models. We show that
optimizing the degrees of freedom of GGDM samplers by maximizing sample quality
scores via gradient descent leads to improved sample quality. Our optimization
procedure backpropagates through the sampling process using the
reparametrization trick and gradient rematerialization. DDSS achieves strong
results on unconditional image generation across various datasets (e.g., FID
scores on LSUN church 128x128 of 11.6 with only 10 inference steps, and 4.82
with 20 steps, compared to 51.1 and 14.9 with strongest DDPM/DDIM baselines).
Our method is compatible with any pre-trained diffusion model without
fine-tuning or re-training required.
Conditional Diffusion Probabilistic Model for Speech Enhancement
February 10, 2022
Yen-Ju Lu, Zhong-Qiu Wang, Shinji Watanabe, Alexander Richard, Cheng Yu, Yu Tsao
Speech enhancement is a critical component of many user-oriented audio
applications, yet current systems still suffer from distorted and unnatural
outputs. While generative models have shown strong potential in speech
synthesis, they are still lagging behind in speech enhancement. This work
leverages recent advances in diffusion probabilistic models, and proposes a
novel speech enhancement algorithm that incorporates characteristics of the
observed noisy speech signal into the diffusion and reverse processes. More
specifically, we propose a generalized formulation of the diffusion
probabilistic model named conditional diffusion probabilistic model that, in
its reverse process, can adapt to non-Gaussian real noises in the estimated
speech signal. In our experiments, we demonstrate strong performance of the
proposed approach compared to representative generative models, and investigate
the generalization capability of our models to other datasets with noise
characteristics unseen during training.
Diffusion bridges vector quantized Variational AutoEncoders
February 10, 2022
Max Cohen, Guillaume Quispe, Sylvain Le Corff, Charles Ollion, Eric Moulines
Vector Quantized-Variational AutoEncoders (VQ-VAE) are generative models
based on discrete latent representations of the data, where inputs are mapped
to a finite set of learned embeddings. To generate new samples, an
autoregressive prior distribution over the discrete states must be trained
separately. This prior is generally very complex and leads to slow generation.
In this work, we propose a new model to train the prior and the encoder/decoder
networks simultaneously. We build a diffusion bridge between a continuous coded
vector and a non-informative prior distribution. The latent discrete states are
then given as random functions of these continuous vectors. We show that our
model is competitive with the autoregressive prior on the mini-ImageNet and
CIFAR datasets and is efficient in both optimization and sampling. Our framework
also extends the standard VQ-VAE and enables end-to-end training.
InferGrad: Improving Diffusion Models for Vocoder by Considering Inference in Training
February 08, 2022
Zehua Chen, Xu Tan, Ke Wang, Shifeng Pan, Danilo Mandic, Lei He, Sheng Zhao
eess.AS, cs.AI, cs.CL, cs.LG, cs.SD
Denoising diffusion probabilistic models (diffusion models for short) require
a large number of iterations in inference to achieve the generation quality
that matches or surpasses the state-of-the-art generative models, which
invariably results in slow inference speed. Previous approaches aim to optimize
the choice of inference schedule over a few iterations to speed up inference.
However, this results in reduced generation quality, mainly because the
inference process is optimized separately, without jointly optimizing with the
training process. In this paper, we propose InferGrad, a diffusion model for
vocoder that incorporates inference process into training, to reduce the
inference iterations while maintaining high generation quality. More
specifically, during training, we generate data from random noise through a
reverse process under inference schedules with a few iterations, and impose a
loss to minimize the gap between the generated and ground-truth data samples.
Then, unlike existing approaches, the training of InferGrad considers the
inference process. The advantages of InferGrad are demonstrated through
experiments on the LJSpeech dataset, showing that InferGrad achieves better
voice quality than the baseline WaveGrad under the same conditions, and
maintains the same voice quality as the baseline with a $3$x speedup ($2$
iterations for InferGrad vs. $6$ iterations for WaveGrad).
Riemannian Score-Based Generative Modeling
February 06, 2022
Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, Arnaud Doucet
Score-based generative models (SGMs) are a powerful class of generative
models that exhibit remarkable empirical performance. Score-based generative
modelling (SGM) consists of a "noising" stage, whereby a diffusion is used to
gradually add Gaussian noise to data, and a generative model, which entails a
"denoising" process defined by approximating the time-reversal of the
diffusion. Existing SGMs assume that data is supported on a Euclidean space,
i.e. a manifold with flat geometry. In many domains such as robotics,
geoscience or protein modelling, data is often naturally described by
distributions living on Riemannian manifolds and current SGM techniques are not
appropriate. We introduce here Riemannian Score-based Generative Models
(RSGMs), a class of generative models extending SGMs to Riemannian manifolds.
We demonstrate our approach on a variety of manifolds, and in particular with
earth and climate science spherical data.
Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations
February 05, 2022
Jaehyeong Jo, Seul Lee, Sung Ju Hwang
Generating graph-structured data requires learning the underlying
distribution of graphs. Yet, this is a challenging problem, and the previous
graph generative methods either fail to capture the permutation-invariance
property of graphs or cannot sufficiently model the complex dependency between
nodes and edges, which is crucial for generating real-world graphs such as
molecules. To overcome such limitations, we propose a novel score-based
generative model for graphs with a continuous-time framework. Specifically, we
propose a new graph diffusion process that models the joint distribution of the
nodes and edges through a system of stochastic differential equations (SDEs).
Then, we derive novel score matching objectives tailored for the proposed
diffusion process to estimate the gradient of the joint log-density with
respect to each component, and introduce a new solver for the system of SDEs to
efficiently sample from the reverse diffusion process. We validate our graph
generation method on diverse datasets, on which it achieves performance either
significantly superior to or competitive with the baselines. Further
analysis shows that our method is able to generate molecules that lie close to
the training distribution yet do not violate the chemical valency rule,
demonstrating the effectiveness of the system of SDEs in modeling the node-edge
relationships. Our code is available at https://github.com/harryjo97/GDSS.
Progressive Distillation for Fast Sampling of Diffusion Models
February 01, 2022
Tim Salimans, Jonathan Ho
Diffusion models have recently shown great promise for generative modeling,
outperforming GANs on perceptual quality and autoregressive models at density
estimation. A remaining downside is their slow sampling time: generating high
quality samples takes many hundreds or thousands of model evaluations. Here we
make two contributions to help eliminate this downside: First, we present new
parameterizations of diffusion models that provide increased stability when
using few sampling steps. Second, we present a method to distill a trained
deterministic diffusion sampler, using many steps, into a new diffusion model
that takes half as many sampling steps. We then keep progressively applying
this distillation procedure to our model, halving the number of required
sampling steps each time. On standard image generation benchmarks like
CIFAR-10, ImageNet, and LSUN, we start out with state-of-the-art samplers
taking as many as 8192 steps, and are able to distill down to models taking as
few as 4 steps without losing much perceptual quality, achieving, for example,
an FID of 3.0 on CIFAR-10 in 4 steps. Finally, we show that the full progressive
distillation procedure does not take more time than it takes to train the
original model, thus representing an efficient solution for generative modeling
using diffusion at both train and test time.
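The sketch below conveys the core training signal in a deliberately simplified form: the student is regressed onto the result of two deterministic teacher steps taken in latent space. The paper's actual algorithm distills with a derived prediction target and specific weighting, so the ddim_step details and the loss below are illustrative assumptions rather than the authors' procedure.

    import torch

    def ddim_step(model, z_t, a_bar_t, a_bar_s):
        # Deterministic update from signal level a_bar_t to a_bar_s (eps-parameterization).
        eps = model(z_t, a_bar_t)
        x0_hat = (z_t - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
        return a_bar_s.sqrt() * x0_hat + (1 - a_bar_s).sqrt() * eps

    def distill_loss(student, teacher, z_t, a_bar_t, a_bar_mid, a_bar_s):
        # Two frozen teacher steps (t -> mid -> s) define the target for one student step (t -> s).
        with torch.no_grad():
            z_mid = ddim_step(teacher, z_t, a_bar_t, a_bar_mid)
            z_target = ddim_step(teacher, z_mid, a_bar_mid, a_bar_s)
        z_student = ddim_step(student, z_t, a_bar_t, a_bar_s)
        return ((z_student - z_target) ** 2).mean()

    # Once the student converges, it becomes the next round's teacher,
    # halving the number of required sampling steps again.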
From data to functa: Your data point is a function and you should treat it like one
January 28, 2022
Emilien Dupont, Hyunjik Kim, S. M. Ali Eslami, Danilo Rezende, Dan Rosenbaum
It is common practice in deep learning to represent a measurement of the
world on a discrete grid, e.g. a 2D grid of pixels. However, the underlying
signal represented by these measurements is often continuous, e.g. the scene
depicted in an image. A powerful continuous alternative is then to represent
these measurements using an implicit neural representation, a neural function
trained to output the appropriate measurement value for any input spatial
location. In this paper, we take this idea to its next level: what would it
take to perform deep learning on these functions instead, treating them as
data? In this context we refer to the data as functa, and propose a framework
for deep learning on functa. This view presents a number of challenges around
efficient conversion from data to functa, compact representation of functa, and
effectively solving downstream tasks on functa. We outline a recipe to overcome
these challenges and apply it to a wide range of data modalities including
images, 3D shapes, neural radiance fields (NeRF) and data on manifolds. We
demonstrate that this approach has various compelling properties across data
modalities, in particular on the canonical tasks of generative modeling, data
imputation, novel view synthesis and classification. Code:
https://github.com/deepmind/functa
DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs
January 28, 2022
Songxiang Liu, Dan Su, Dong Yu
Denoising diffusion probabilistic models (DDPMs) are expressive generative
models that have been used to solve a variety of speech synthesis problems.
However, because of their high sampling costs, DDPMs are difficult to use in
real-time speech processing applications. In this paper, we introduce
DiffGAN-TTS, a novel DDPM-based text-to-speech (TTS) model achieving
high-fidelity and efficient speech synthesis. DiffGAN-TTS is based on denoising
diffusion generative adversarial networks (GANs), which adopt an
adversarially-trained expressive model to approximate the denoising
distribution. We show with multi-speaker TTS experiments that DiffGAN-TTS can
generate high-fidelity speech samples within only 4 denoising steps. We present
an active shallow diffusion mechanism to further speed up inference. A
two-stage training scheme is proposed, with a basic TTS acoustic model trained
at stage one providing valuable prior information for a DDPM trained at stage
two. Our experiments show that DiffGAN-TTS can achieve high synthesis
performance with only 1 denoising step.
Denoising Diffusion Restoration Models
January 27, 2022
Bahjat Kawar, Michael Elad, Stefano Ermon, Jiaming Song
Many interesting tasks in image restoration can be cast as linear inverse
problems. A recent family of approaches for solving these problems uses
stochastic algorithms that sample from the posterior distribution of natural
images given the measurements. However, efficient solutions often require
problem-specific supervised training to model the posterior, whereas
unsupervised methods that are not problem-specific typically rely on
inefficient iterative methods. This work addresses these issues by introducing
Denoising Diffusion Restoration Models (DDRM), an efficient, unsupervised
posterior sampling method. Motivated by variational inference, DDRM takes
advantage of a pre-trained denoising diffusion generative model for solving any
linear inverse problem. We demonstrate DDRM’s versatility on several image
datasets for super-resolution, deblurring, inpainting, and colorization under
various amounts of measurement noise. DDRM outperforms the current leading
unsupervised methods on the diverse ImageNet dataset in reconstruction quality,
perceptual quality, and runtime, being 5x faster than the nearest competitor.
DDRM also generalizes well for natural images out of the distribution of the
observed ImageNet training set.
Unsupervised Denoising of Retinal OCT with Diffusion Probabilistic Model
January 27, 2022
Dewei Hu, Yuankai K. Tao, Ipek Oguz
Optical coherence tomography (OCT) is a prevalent non-invasive imaging method
which provides high-resolution volumetric visualization of the retina. However,
its inherent defect, speckle noise, can seriously degrade tissue
visibility in OCT. Deep learning based approaches have been widely used for
image restoration, but most of these require a noise-free reference image for
supervision. In this study, we present a diffusion probabilistic model that is
fully unsupervised to learn from noise instead of signal. A diffusion process
is defined by adding a sequence of Gaussian noise to self-fused OCT b-scans.
Then the reverse process of diffusion, modeled by a Markov chain, provides an
adjustable level of denoising. Our experiment results demonstrate that our
method can significantly improve the image quality with a simple working
pipeline and a small amount of training data.
RePaint: Inpainting using Denoising Diffusion Probabilistic Models
January 24, 2022
Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, Luc Van Gool
Free-form inpainting is the task of adding new content to an image in the
regions specified by an arbitrary binary mask. Most existing approaches train
for a certain distribution of masks, which limits their generalization
capabilities to unseen mask types. Furthermore, training with pixel-wise and
perceptual losses often leads to simple textural extensions towards the missing
areas instead of semantically meaningful generation. In this work, we propose
RePaint: A Denoising Diffusion Probabilistic Model (DDPM) based inpainting
approach that is applicable to even extreme masks. We employ a pretrained
unconditional DDPM as the generative prior. To condition the generation
process, we only alter the reverse diffusion iterations by sampling the
unmasked regions using the given image information. Since this technique does
not modify or condition the original DDPM network itself, the model produces
high-quality and diverse output images for any inpainting form. We validate our
method for both faces and general-purpose image inpainting using standard and
extreme masks.
RePaint outperforms state-of-the-art Autoregressive, and GAN approaches for
at least five out of six mask distributions.
Github Repository: git.io/RePaint
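A compact sketch of the conditioning mechanism described above: at every reverse step, the known (unmasked) pixels are replaced by a correspondingly noised version of the given image, while the masked region comes from the unconditional DDPM sample. The helper names and the omission of RePaint's resampling schedule are our simplifications.

    import torch

    def repaint_reverse_step(denoiser_step, forward_noise, x_t, t, known_image, mask):
        # denoiser_step(x_t, t) -> x_{t-1}: one reverse step of a pretrained unconditional DDPM.
        # forward_noise(x0, t-1) -> x_{t-1}: sample the forward process at level t-1 from clean data.
        # mask: 1 where content must be inpainted, 0 where the image is given.
        x_unknown = denoiser_step(x_t, t)                # generated content for the hole
        x_known = forward_noise(known_image, t - 1)      # given content, noised to level t-1
        return mask * x_unknown + (1 - mask) * x_known   # stitch the two per pixel

    # The full RePaint sampler additionally "resamples": it repeatedly jumps back to a
    # higher noise level and redoes these steps to better harmonize known and unknown regions.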
Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models
January 17, 2022
Fan Bao, Chongxuan Li, Jun Zhu, Bo Zhang
Diffusion probabilistic models (DPMs) represent a class of powerful
generative models. Despite their success, the inference of DPMs is expensive
since it generally needs to iterate over thousands of timesteps. A key problem
in the inference is to estimate the variance in each timestep of the reverse
process. In this work, we present a surprising result that both the optimal
reverse variance and the corresponding optimal KL divergence of a DPM have
analytic forms w.r.t. its score function. Building upon it, we propose
Analytic-DPM, a training-free inference framework that estimates the analytic
forms of the variance and KL divergence using the Monte Carlo method and a
pretrained score-based model. Further, to correct the potential bias caused by
the score-based model, we derive both lower and upper bounds of the optimal
variance and clip the estimate for a better result. Empirically, our
analytic-DPM improves the log-likelihood of various DPMs, produces high-quality
samples, and meanwhile enjoys a 20x to 80x speed up.
Probabilistic Mass Mapping with Neural Score Estimation
January 14, 2022
Benjamin Remy, Francois Lanusse, Niall Jeffrey, Jia Liu, Jean-Luc Starck, Ken Osato, Tim Schrabback
astro-ph.CO, astro-ph.IM, cs.LG
Weak lensing mass-mapping is a useful tool to access the full distribution of
dark matter on the sky, but because of intrinsic galaxy ellipticities and finite
fields/missing data, the recovery of dark matter maps constitutes a challenging
ill-posed inverse problem. We introduce a novel methodology allowing for
efficient sampling of the high-dimensional Bayesian posterior of the weak
lensing mass-mapping problem, and relying on simulations for defining a fully
non-Gaussian prior. We aim to demonstrate the accuracy of the method on
simulations, and then proceed to applying it to the mass reconstruction of the
HST/ACS COSMOS field. The proposed methodology combines elements of Bayesian
statistics, analytic theory, and a recent class of Deep Generative Models based
on Neural Score Matching. This approach allows us to do the following: 1) Make
full use of analytic cosmological theory to constrain the 2pt statistics of the
solution. 2) Learn from cosmological simulations any differences between this
analytic prior and full simulations. 3) Obtain samples from the full Bayesian
posterior of the problem for robust Uncertainty Quantification. We demonstrate
the method on the $\kappa$TNG simulations and find that the posterior mean
significantly outperforms previous methods (Kaiser-Squires, Wiener filter,
Sparsity priors) both on root-mean-square error and in terms of the Pearson
correlation. We further illustrate the interpretability of the recovered
posterior by establishing a close correlation between posterior convergence
values and SNR of clusters artificially introduced into a field. Finally, we
apply the method to the reconstruction of the HST/ACS COSMOS field and yield
the highest quality convergence map of this field to date.
DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents
January 02, 2022
Kushagra Pandey, Avideep Mukherjee, Piyush Rai, Abhishek Kumar
Diffusion probabilistic models have been shown to generate state-of-the-art
results on several competitive image synthesis benchmarks but lack a
low-dimensional, interpretable latent space, and are slow at generation. On the
other hand, standard Variational Autoencoders (VAEs) typically have access to a
low-dimensional latent space but exhibit poor sample quality. We present
DiffuseVAE, a novel generative framework that integrates VAE within a diffusion
model framework, and leverage this to design novel conditional
parameterizations for diffusion models. We show that the resulting model equips
diffusion models with a low-dimensional VAE inferred latent code which can be
used for downstream tasks like controllable synthesis. The proposed method also
improves upon the speed vs quality tradeoff exhibited in standard unconditional
DDPM/DDIM models (for instance, FID of 16.47 vs 34.36 using a standard DDIM on
the CelebA-HQ-128 benchmark using T=10 reverse process steps) without having
explicitly trained for such an objective. Furthermore, the proposed model
exhibits synthesis quality comparable to state-of-the-art models on standard
image synthesis benchmarks like CIFAR-10 and CelebA-64 while outperforming most
existing VAE-based methods. Lastly, we show that the proposed method exhibits
inherent generalization to different types of noise in the conditioning signal.
For reproducibility, our source code is publicly available at
https://github.com/kpandey008/DiffuseVAE.
Ito-Taylor Sampling Scheme for Denoising Diffusion Probabilistic Models using Ideal Derivatives
December 26, 2021
Hideyuki Tachibana, Mocho Go, Muneyoshi Inahara, Yotaro Katayama, Yotaro Watanabe
stat.ML, cs.CV, cs.LG, cs.SD, eess.AS
Diffusion generative models have emerged as a new challenger to popular deep
neural generative models such as GANs, but have the drawback that they often
require a huge number of neural function evaluations (NFEs) during synthesis
unless some sophisticated sampling strategies are employed. This paper proposes
new efficient samplers based on the numerical schemes derived by the familiar
Taylor expansion, which directly solves the ODE/SDE of interest. In general, it
is not easy to compute the derivatives that are required in higher-order Taylor
schemes, but in the case of diffusion models, this difficulty is alleviated by
the trick that the authors call "ideal derivative substitution," in which the
higher-order derivatives are replaced by tractable ones. To derive ideal
derivatives, the authors argue that the "single point approximation," in which
the true score function is approximated by a conditional one, holds in many
cases, and consider the derivatives of this approximation. Applying the
resulting quasi-Taylor samplers to image generation tasks, the authors
experimentally confirmed that the proposed samplers could synthesize plausible
images in a small number of NFEs, and that the performance was better than or
at the same level as
DDIM and Runge-Kutta methods. The paper also argues the relevance of the
proposed samplers to the existing ones mentioned above.
High-Resolution Image Synthesis with Latent Diffusion Models
December 20, 2021
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer
By decomposing the image formation process into a sequential application of
denoising autoencoders, diffusion models (DMs) achieve state-of-the-art
synthesis results on image data and beyond. Additionally, their formulation
allows for a guiding mechanism to control the image generation process without
retraining. However, since these models typically operate directly in pixel
space, optimization of powerful DMs often consumes hundreds of GPU days and
inference is expensive due to sequential evaluations. To enable DM training on
limited computational resources while retaining their quality and flexibility,
we apply them in the latent space of powerful pretrained autoencoders. In
contrast to previous work, training diffusion models on such a representation
allows for the first time to reach a near-optimal point between complexity
reduction and detail preservation, greatly boosting visual fidelity. By
introducing cross-attention layers into the model architecture, we turn
diffusion models into powerful and flexible generators for general conditioning
inputs such as text or bounding boxes and high-resolution synthesis becomes
possible in a convolutional manner. Our latent diffusion models (LDMs) achieve
a new state of the art for image inpainting and highly competitive performance
on various tasks, including unconditional image generation, semantic scene
synthesis, and super-resolution, while significantly reducing computational
requirements compared to pixel-based DMs. Code is available at
https://github.com/CompVis/latent-diffusion .
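A minimal sketch of the training recipe the abstract describes: images are mapped into the latent space of a frozen, pretrained autoencoder and an ordinary noise-prediction diffusion loss is applied there. The module names (autoencoder.encode, unet) and the schedule handling are illustrative assumptions, not the released codebase's API.

    import torch

    def latent_diffusion_loss(autoencoder, unet, images, alphas_cumprod):
        # 1) Encode to the (much smaller) latent space with a frozen autoencoder.
        with torch.no_grad():
            z0 = autoencoder.encode(images)
        # 2) Standard DDPM-style corruption of the latents.
        b = z0.shape[0]
        t = torch.randint(0, len(alphas_cumprod), (b,), device=z0.device)
        a_bar = alphas_cumprod[t].view(b, *([1] * (z0.dim() - 1)))
        noise = torch.randn_like(z0)
        z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise
        # 3) Noise-prediction objective in latent space; conditioning (text, boxes, ...)
        #    would enter the U-Net here, e.g. via cross-attention.
        return ((unet(z_t, t) - noise) ** 2).mean()

    # Sampling runs the reverse diffusion entirely in latent space and decodes only the
    # final latent back to pixels, which is where the computational savings come from.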
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
December 20, 2021
Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, Mark Chen
Diffusion models have recently been shown to generate high-quality synthetic
images, especially when paired with a guidance technique to trade off diversity
for fidelity. We explore diffusion models for the problem of text-conditional
image synthesis and compare two different guidance strategies: CLIP guidance
and classifier-free guidance. We find that the latter is preferred by human
evaluators for both photorealism and caption similarity, and often produces
photorealistic samples. Samples from a 3.5 billion parameter text-conditional
diffusion model using classifier-free guidance are favored by human evaluators
to those from DALL-E, even when the latter uses expensive CLIP reranking.
Additionally, we find that our models can be fine-tuned to perform image
inpainting, enabling powerful text-driven image editing. We train a smaller
model on a filtered dataset and release the code and weights at
https://github.com/openai/glide-text2im.
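For context, the classifier-free guidance compared against CLIP guidance above can be summarized as extrapolating between a conditional and an unconditional noise prediction; with guidance scale $s \ge 1$ it reads

$$\hat\epsilon_\theta(x_t \mid c) \;=\; \epsilon_\theta(x_t \mid \varnothing) \;+\; s\,\big(\epsilon_\theta(x_t \mid c) - \epsilon_\theta(x_t \mid \varnothing)\big),$$

where $\varnothing$ denotes a dropped (empty) caption; larger $s$ trades diversity for fidelity to the caption.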
Heavy-tailed denoising score matching
December 17, 2021
Jacob Deasy, Nikola Simidjievski, Pietro Liò
Score-based model research in the last few years has produced state of the
art generative models by employing Gaussian denoising score-matching (DSM).
However, the Gaussian noise assumption has several high-dimensional
limitations, motivating a more concrete route toward even higher-dimensional
PDF estimation in the future. We outline this limitation before extending the theory
to a broader family of noising distributions – namely, the generalised normal
distribution. To theoretically ground this, we relax a key assumption in
(denoising) score matching theory, demonstrating that distributions which are
differentiable almost everywhere permit the same objective simplification as
Gaussians. For noise vector norm distributions, we demonstrate favourable
concentration of measure in the high-dimensional spaces prevalent in deep
learning. In the process, we uncover a skewed noise vector norm distribution
and develop an iterative noise scaling algorithm to consistently initialise the
multiple levels of noise in annealed Langevin dynamics (LD). On the practical
side, our use of heavy-tailed DSM leads to improved score estimation,
controllable sampling convergence, and more balanced unconditional generative
performance for imbalanced datasets.
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs
December 15, 2021
Zhisheng Xiao, Karsten Kreis, Arash Vahdat
A wide variety of deep generative models has been developed in the past
decade. Yet, these models often struggle with simultaneously addressing three
key requirements including: high sample quality, mode coverage, and fast
sampling. We call the challenge imposed by these requirements the generative
learning trilemma, as the existing models often trade some of them for others.
Particularly, denoising diffusion models have shown impressive sample quality
and diversity, but their expensive sampling does not yet allow them to be
applied in many real-world applications. In this paper, we argue that slow
sampling in these models is fundamentally attributed to the Gaussian assumption
in the denoising step which is justified only for small step sizes. To enable
denoising with large steps, and hence, to reduce the total number of denoising
steps, we propose to model the denoising distribution using a complex
multimodal distribution. We introduce denoising diffusion generative
adversarial networks (denoising diffusion GANs) that model each denoising step
using a multimodal conditional GAN. Through extensive evaluations, we show that
denoising diffusion GANs obtain sample quality and diversity competitive with
original diffusion models while being 2000$\times$ faster on the CIFAR-10
dataset. Compared to traditional GANs, our model exhibits better mode coverage
and sample diversity. To the best of our knowledge, denoising diffusion GAN is
the first model that reduces sampling cost in diffusion models to an extent
that allows them to be applied to real-world applications inexpensively.
Project page and code can be found at
https://nvlabs.github.io/denoising-diffusion-gan
Score-Based Generative Modeling with Critically-Damped Langevin Diffusion
December 14, 2021
Tim Dockhorn, Arash Vahdat, Karsten Kreis
Score-based generative models (SGMs) have demonstrated remarkable synthesis
quality. SGMs rely on a diffusion process that gradually perturbs the data
towards a tractable distribution, while the generative model learns to denoise.
The complexity of this denoising task is, apart from the data distribution
itself, uniquely determined by the diffusion process. We argue that current
SGMs employ overly simplistic diffusions, leading to unnecessarily complex
denoising processes, which limit generative modeling performance. Based on
connections to statistical mechanics, we propose a novel critically-damped
Langevin diffusion (CLD) and show that CLD-based SGMs achieve superior
performance. CLD can be interpreted as running a joint diffusion in an extended
space, where the auxiliary variables can be considered “velocities” that are
coupled to the data variables as in Hamiltonian dynamics. We derive a novel
score matching objective for CLD and show that the model only needs to learn
the score function of the conditional distribution of the velocity given data,
an easier task than learning scores of the data directly. We also derive a new
sampling scheme for efficient synthesis from CLD-based diffusion models. We
find that CLD outperforms previous SGMs in synthesis quality for similar
network architectures and sampling compute budgets. We show that our novel
sampler for CLD significantly outperforms solvers such as Euler–Maruyama. Our
framework provides new insights into score-based denoising diffusion models and
can be readily used for high-resolution image synthesis. Project page and code:
https://nv-tlabs.github.io/CLD-SGM.
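To make the extended-space idea concrete, here is a rough, illustrative Euler-Maruyama-style update on a coupled position-velocity state, where the model only supplies a score for the velocity given the data; the coefficients are placeholders rather than the paper's tuned CLD dynamics:
    import numpy as np

    def cld_like_step(x, v, score_v, dt=0.01, gamma=2.0, rng=None):
        """One illustrative step on the extended (x, v) state. x is the data
        variable, v an auxiliary "velocity"; score_v stands in for a learned
        score of v given x. Coefficients are placeholders."""
        rng = np.random.default_rng() if rng is None else rng
        x_new = x + v * dt
        v_new = (v + (-x - gamma * v - gamma * score_v(x, v)) * dt
                 + np.sqrt(2.0 * gamma * dt) * rng.standard_normal(v.shape))
        return x_new, v_new

    # Toy usage with a stand-in score function.
    score_v = lambda x, v: -v
    x, v = np.random.randn(16, 2), np.random.randn(16, 2)
    for _ in range(100):
        x, v = cld_like_step(x, v, score_v)
    print(x.mean(), v.mean())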
Step-unrolled Denoising Autoencoders for Text Generation
December 13, 2021
Nikolay Savinov, Junyoung Chung, Mikolaj Binkowski, Erich Elsen, Aaron van den Oord
In this paper we propose a new generative model of text, Step-unrolled
Denoising Autoencoder (SUNDAE), that does not rely on autoregressive models.
Similarly to denoising diffusion techniques, SUNDAE is repeatedly applied on a
sequence of tokens, starting from random inputs and improving them each time
until convergence. We present a simple new improvement operator that converges
in fewer iterations than diffusion methods, while qualitatively producing
better samples on natural language datasets. SUNDAE achieves state-of-the-art
results (among non-autoregressive methods) on the WMT’14 English-to-German
translation task and good qualitative results on unconditional language
modeling on the Colossal Cleaned Common Crawl dataset and a dataset of Python
code from GitHub. The non-autoregressive nature of SUNDAE opens up
possibilities beyond left-to-right prompted generation, by filling in arbitrary
blank patterns in a template.
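The unroll-until-convergence decoding loop can be sketched in a few lines; the improvement operator below is a toy stand-in, not the SUNDAE network:
    import numpy as np

    def sundae_style_decode(denoise_fn, vocab_size, seq_len, num_steps=10, rng=None):
        """Start from random tokens and repeatedly apply an improvement
        operator until the sequence stops changing or a step budget is hit."""
        rng = np.random.default_rng() if rng is None else rng
        tokens = rng.integers(0, vocab_size, size=seq_len)
        for _ in range(num_steps):
            new_tokens = denoise_fn(tokens)
            if np.array_equal(new_tokens, tokens):   # converged
                break
            tokens = new_tokens
        return tokens

    # Toy "improvement operator" that nudges every token towards id 0.
    toy_denoiser = lambda t: np.maximum(t - 1, 0)
    print(sundae_style_decode(toy_denoiser, vocab_size=50, seq_len=8, num_steps=60))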
More Control for Free! Image Synthesis with Semantic Diffusion Guidance
December 10, 2021
Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell
Controllable image synthesis models allow creation of diverse images based on
text instructions or guidance from a reference image. Recently, denoising
diffusion probabilistic models have been shown to generate more realistic
imagery than prior methods, and have been successfully demonstrated in
unconditional and class-conditional settings. We investigate fine-grained,
continuous control of this model class, and introduce a novel unified framework
for semantic diffusion guidance, which allows either language or image
guidance, or both. Guidance is injected into a pretrained unconditional
diffusion model using the gradient of image-text or image matching scores,
without re-training the diffusion model. We explore CLIP-based language
guidance as well as both content and style-based image guidance in a unified
framework. Our text-guided synthesis approach can be applied to datasets
without associated text annotations. We conduct experiments on FFHQ and LSUN
datasets, and show results on fine-grained text-guided image synthesis,
synthesis of images related to a style or content reference image, and examples
with both textual and image guidance.
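The guidance itself amounts to shifting each denoising step along the gradient of an image-text (or image-image) matching score; a schematic sketch, with the CLIP-style gradient left as a stand-in callable:
    import numpy as np

    def guided_mean(mean, variance, match_score_grad, guidance_scale=1.0):
        """Shift the predicted denoising mean along the gradient of a matching
        score (e.g. image-text similarity), without retraining the diffusion
        model. match_score_grad stands in for the gradient of the CLIP-style
        score with respect to the current image estimate."""
        return mean + guidance_scale * variance * match_score_grad(mean)

    # Toy usage with a placeholder gradient.
    grad_fn = lambda img: -img
    mean = np.random.randn(3, 32, 32)
    shifted = guided_mean(mean, variance=0.1, match_score_grad=grad_fn)
    print(shifted.shape)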
DiffuseMorph: Unsupervised Deformable Image Registration Using Diffusion Model
December 09, 2021
Boah Kim, Inhwa Han, Jong Chul Ye
Deformable image registration is one of the fundamental tasks in medical
imaging. Classical registration algorithms usually require a high computational
cost for iterative optimizations. Although deep-learning-based methods have
been developed for fast image registration, it is still challenging to obtain
realistic continuous deformations from a moving image to a fixed image with
fewer topological folding problems. To address this, here we present a novel
diffusion-model-based image registration method, called DiffuseMorph.
DiffuseMorph not only generates synthetic deformed images through reverse
diffusion but also allows image registration by deformation fields.
Specifically, the deformation fields are generated by the conditional score
function of the deformation between the moving and fixed images, so that the
registration can be performed from continuous deformation by simply scaling the
latent feature of the score. Experimental results on 2D facial and 3D medical
image registration tasks demonstrate that our method provides flexible
deformations with topology preservation capability.
Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction
December 09, 2021
Hyungjin Chung, Byeongsu Sim, Jong Chul Ye
eess.IV, cs.CV, cs.LG, stat.ML
Diffusion models have recently attained significant interest within the
community owing to their strong performance as generative models. Furthermore,
their application to inverse problems has demonstrated state-of-the-art
performance. Unfortunately, diffusion models have a critical downside: they
are inherently slow to sample from, needing a few thousand steps of iteration to
generate images from pure Gaussian noise. In this work, we show that starting
from Gaussian noise is unnecessary. Instead, starting from a single forward
diffusion with better initialization significantly reduces the number of
sampling steps in the reverse conditional diffusion. This phenomenon is
formally explained by the contraction theory of stochastic difference
equations such as our conditional diffusion strategy: the alternating
application of reverse diffusion followed by a non-expansive data consistency
step. The new sampling strategy, dubbed Come-Closer-Diffuse-Faster (CCDF), also
reveals a new insight on how the existing feed-forward neural network
approaches for inverse problems can be synergistically combined with the
diffusion models. Experimental results with super-resolution, image inpainting,
and compressed sensing MRI demonstrate that our method can achieve
state-of-the-art reconstruction performance at significantly reduced sampling
steps.
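The better initialization amounts to forward-diffusing an initial estimate to an intermediate time and starting the reverse pass there; a minimal sketch under a standard DDPM schedule (names and values are illustrative):
    import numpy as np

    def ccdf_style_init(x_init, t0, alphas_cumprod, rng=None):
        """Forward-diffuse an initial reconstruction x_init to time t0 and
        return the noisy starting point for a shortened reverse pass.
        alphas_cumprod is the usual DDPM cumulative product schedule."""
        rng = np.random.default_rng() if rng is None else rng
        a_bar = alphas_cumprod[t0]
        noise = rng.standard_normal(x_init.shape)
        return np.sqrt(a_bar) * x_init + np.sqrt(1.0 - a_bar) * noise

    # Toy usage: linear noise schedule and a crude initial estimate.
    betas = np.linspace(1e-4, 0.02, 1000)
    alphas_cumprod = np.cumprod(1.0 - betas)
    x_init = np.random.rand(1, 64, 64)      # e.g. a feed-forward reconstruction
    x_t0 = ccdf_style_init(x_init, t0=200, alphas_cumprod=alphas_cumprod)
    print(x_t0.shape)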
A Conditional Point Diffusion-Refinement Paradigm for 3D Point Cloud Completion
December 07, 2021
Zhaoyang Lyu, Zhifeng Kong, Xudong Xu, Liang Pan, Dahua Lin
3D point cloud is an important 3D representation for capturing real world 3D
objects. However, real-scanned 3D point clouds are often incomplete, and it is
important to recover complete point clouds for downstream applications. Most
existing point cloud completion methods use Chamfer Distance (CD) loss for
training. The CD loss estimates correspondences between two point clouds by
searching nearest neighbors, which does not capture the overall point density
distribution on the generated shape, and therefore likely leads to non-uniform
point cloud generation. To tackle this problem, we propose a novel Point
Diffusion-Refinement (PDR) paradigm for point cloud completion. PDR consists of
a Conditional Generation Network (CGNet) and a ReFinement Network (RFNet). The
CGNet uses a conditional generative model called the denoising diffusion
probabilistic model (DDPM) to generate a coarse completion conditioned on the
partial observation. DDPM establishes a one-to-one pointwise mapping between
the generated point cloud and the uniform ground truth, and then optimizes the
mean squared error loss to realize uniform generation. The RFNet refines the
coarse output of the CGNet and further improves the quality of the completed point
cloud. Furthermore, we develop a novel dual-path architecture for both
networks. The architecture can (1) effectively and efficiently extract
multi-level features from partially observed point clouds to guide completion,
and (2) accurately manipulate spatial locations of 3D points to obtain smooth
surfaces and sharp details. Extensive experimental results on various benchmark
datasets show that our PDR paradigm outperforms previous state-of-the-art
methods for point cloud completion. Remarkably, with the help of the RFNet, we
can accelerate the iterative generation process of the DDPM by up to 50 times
without much performance drop.
Deblurring via Stochastic Refinement
December 05, 2021
Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G. Dimakis, Peyman Milanfar
Image deblurring is an ill-posed problem with multiple plausible solutions
for a given input image. However, most existing methods produce a deterministic
estimate of the clean image and are trained to minimize pixel-level distortion.
These metrics are known to be poorly correlated with human perception, and
often lead to unrealistic reconstructions. We present an alternative framework
for blind deblurring based on conditional diffusion models. Unlike existing
techniques, we train a stochastic sampler that refines the output of a
deterministic predictor and is capable of producing a diverse set of plausible
reconstructions for a given input. This leads to a significant improvement in
perceptual quality over existing state-of-the-art methods across multiple
standard benchmarks. Our predict-and-refine approach also enables much more
efficient sampling compared to typical diffusion models. Combined with a
carefully tuned network architecture and inference procedure, our method is
competitive in terms of distortion metrics such as PSNR. These results show
clear benefits of our diffusion-based method for deblurring and challenge the
widely used strategy of producing a single, deterministic reconstruction.
Noise Distribution Adaptive Self-Supervised Image Denoising using Tweedie Distribution and Score-Matching
December 05, 2021
Kwanyoung Kim, Taesung Kwon, Jong Chul Ye
eess.IV, cs.CV, cs.LG, stat.ML
Tweedie distributions are a special case of exponential dispersion models,
which are often used in classical statistics as distributions for generalized
linear models. Here, we reveal that Tweedie distributions also play key roles
in the modern deep learning era, leading to a distribution-independent
self-supervised image denoising formula without clean reference images.
Specifically, by combining with the recent Noise2Score self-supervised image
denoising approach and the saddle point approximation of Tweedie distribution,
we can provide a general closed-form denoising formula that can be used for
large classes of noise distributions without ever knowing the underlying noise
distribution. Similar to the original Noise2Score, the new approach is composed
of two successive steps: score matching using perturbed noisy images, followed
by a closed form image denoising formula via distribution-independent Tweedie’s
formula. This also suggests a systematic algorithm to estimate the noise model
and noise parameters for a given noisy image data set. Through extensive
experiments, we demonstrate that the proposed method can accurately estimate
noise models and parameters, and provide the state-of-the-art self-supervised
image denoising performance on benchmark and real-world datasets.
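For the familiar Gaussian special case, Tweedie's formula already gives a closed-form posterior-mean denoiser from the score; a minimal sketch (the paper's contribution is extending this idea to general Tweedie noise models via a saddle-point approximation, which is not reproduced here):
    import numpy as np

    def tweedie_denoise_gaussian(y, score_fn, sigma):
        """Gaussian-noise Tweedie denoising: E[x | y] = y + sigma^2 * score(y),
        where score(y) approximates the gradient of log p(y)."""
        return y + (sigma ** 2) * score_fn(y)

    # Toy check: for y = x + N(0, sigma^2) with x ~ N(0, 1), the exact score of
    # p(y) is -y / (1 + sigma^2), so the denoiser reduces to y / (1 + sigma^2).
    sigma = 0.5
    y = np.random.randn(5) * np.sqrt(1 + sigma ** 2)
    exact_score = lambda y: -y / (1 + sigma ** 2)
    print(tweedie_denoise_gaussian(y, exact_score, sigma))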
SegDiff: Image Segmentation with Diffusion Probabilistic Models
December 01, 2021
Tomer Amit, Tal Shaharbany, Eliya Nachmani, Lior Wolf
Diffusion Probabilistic Methods are employed for state-of-the-art image
generation. In this work, we present a method for extending such models for
performing image segmentation. The method learns end-to-end, without relying on
a pre-trained backbone. The information in the input image and in the current
estimation of the segmentation map is merged by summing the output of two
encoders. Additional encoding layers and a decoder are then used to iteratively
refine the segmentation map, using a diffusion model. Since the diffusion model
is probabilistic, it is applied multiple times, and the results are merged into
a final segmentation map. The new method produces state-of-the-art results on
the Cityscapes validation set, the Vaihingen building segmentation benchmark,
and the MoNuSeg dataset.
Diffusion Autoencoders: Toward a Meaningful and Decodable Representation
November 30, 2021
Konpat Preechakul, Nattanat Chatthee, Suttisak Wizadwongsa, Supasorn Suwajanakorn
Diffusion probabilistic models (DPMs) have achieved remarkable quality in
image generation that rivals GANs’. But unlike GANs, DPMs use a set of latent
variables that lack semantic meaning and cannot serve as a useful
representation for other tasks. This paper explores the possibility of using
DPMs for representation learning and seeks to extract a meaningful and
decodable representation of an input image via autoencoding. Our key idea is to
use a learnable encoder for discovering the high-level semantics, and a DPM as
the decoder for modeling the remaining stochastic variations. Our method can
encode any image into a two-part latent code, where the first part is
semantically meaningful and linear, and the second part captures stochastic
details, allowing near-exact reconstruction. This capability enables
challenging applications that currently foil GAN-based methods, such as
attribute manipulation on real images. We also show that this two-level
encoding improves denoising efficiency and naturally facilitates various
downstream tasks including few-shot conditional sampling. Please visit our
project page: https://Diff-AE.github.io/
Vector Quantized Diffusion Model for Text-to-Image Synthesis
November 29, 2021
Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo
We present the vector quantized diffusion (VQ-Diffusion) model for
text-to-image generation. This method is based on a vector quantized
variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional
variant of the recently developed Denoising Diffusion Probabilistic Model
(DDPM). We find that this latent-space method is well-suited for text-to-image
generation tasks because it not only eliminates the unidirectional bias of
existing methods but also allows us to incorporate a mask-and-replace diffusion
strategy to avoid the accumulation of errors, which is a serious problem with
existing methods. Our experiments show that the VQ-Diffusion produces
significantly better text-to-image generation results when compared with
conventional autoregressive (AR) models with similar numbers of parameters.
Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can
handle more complex scenes and improve the synthesized image quality by a large
margin. Finally, we show that the image generation computation in our method
can be made highly efficient by reparameterization. With traditional AR
methods, the text-to-image generation time increases linearly with the output
image resolution and hence is quite time-consuming even for normal-size images.
The VQ-Diffusion allows us to achieve a better trade-off between quality and
speed. Our experiments indicate that the VQ-Diffusion model with the
reparameterization is fifteen times faster than traditional AR methods while
achieving a better image quality.
Blended Diffusion for Text-driven Editing of Natural Images
November 29, 2021
Omri Avrahami, Dani Lischinski, Ohad Fried
Natural language offers a highly intuitive interface for image editing. In
this paper, we introduce the first solution for performing local (region-based)
edits in generic natural images, based on a natural language description along
with an ROI mask. We achieve our goal by leveraging and combining a pretrained
language-image model (CLIP), to steer the edit towards a user-provided text
prompt, with a denoising diffusion probabilistic model (DDPM) to generate
natural-looking results. To seamlessly fuse the edited region with the
unchanged parts of the image, we spatially blend noised versions of the input
image with the local text-guided diffusion latent at a progression of noise
levels. In addition, we show that adding augmentations to the diffusion process
mitigates adversarial results. We compare against several baselines and related
methods, both qualitatively and quantitatively, and show that our method
outperforms these solutions in terms of overall realism, ability to preserve
the background and matching the text. Finally, we show several text-driven
editing applications, including adding a new object to an image,
removing/replacing/altering existing objects, background replacement, and image
extrapolation. Code is available at:
https://omriavrahami.com/blended-diffusion-page/
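The spatial blending step can be sketched as keeping the text-guided latent inside the ROI mask and a correspondingly noised copy of the input outside it (variable names are illustrative):
    import numpy as np

    def blend_step(guided_latent, noisy_input, mask):
        """Keep the text-guided diffusion latent inside the ROI mask and the
        (appropriately noised) original image outside it."""
        return mask * guided_latent + (1.0 - mask) * noisy_input

    # Toy usage: edit only the central square of a 64x64 image.
    mask = np.zeros((64, 64)); mask[16:48, 16:48] = 1.0
    guided = np.random.randn(64, 64)
    noised_orig = np.random.randn(64, 64)
    print(blend_step(guided, noised_orig, mask).shape)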
Conditional Image Generation with Score-Based Diffusion Models
November 26, 2021
Georgios Batzolis, Jan Stanczuk, Carola-Bibiane Schönlieb, Christian Etmann
Score-based diffusion models have emerged as one of the most promising
frameworks for deep generative modelling. In this work we conduct a systematic
comparison and theoretical analysis of different approaches to learning
conditional probability distributions with score-based diffusion models. In
particular, we prove results which provide a theoretical justification for one
of the most successful estimators of the conditional score. Moreover, we
introduce a multi-speed diffusion framework, which leads to a new estimator for
the conditional score, performing on par with previous state-of-the-art
approaches. Our theoretical and experimental findings are accompanied by an
open source library MSDiff which allows for application and further research of
multi-speed diffusion models.
Solving Inverse Problems in Medical Imaging with Score-Based Generative Models
November 15, 2021
Yang Song, Liyue Shen, Lei Xing, Stefano Ermon
eess.IV, cs.CV, cs.LG, stat.ML
Reconstructing medical images from partial measurements is an important
inverse problem in Computed Tomography (CT) and Magnetic Resonance Imaging
(MRI). Existing solutions based on machine learning typically train a model to
directly map measurements to medical images, leveraging a training dataset of
paired images and measurements. These measurements are typically synthesized
from images using a fixed physical model of the measurement process, which
hinders the generalization capability of models to unknown measurement
processes. To address this issue, we propose a fully unsupervised technique for
inverse problem solving, leveraging the recently introduced score-based
generative models. Specifically, we first train a score-based generative model
on medical images to capture their prior distribution. Given measurements and a
physical model of the measurement process at test time, we introduce a sampling
method to reconstruct an image consistent with both the prior and the observed
measurements. Our method does not assume a fixed measurement process during
training, and can thus be flexibly adapted to different measurement processes
at test time. Empirically, we observe comparable or better performance to
supervised learning techniques in several medical imaging tasks in CT and MRI,
while demonstrating significantly better generalization to unknown measurement
processes.
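Schematically, the test-time sampler alternates a score-driven prior update with a data-consistency correction for the supplied measurement operator; a toy numpy sketch with stand-in score and measurement models:
    import numpy as np

    def posterior_sample(score_fn, A, AT, y, x0, num_steps=200, step=1e-2,
                         consistency=1.0, rng=None):
        """Alternate Langevin-style prior updates with a gradient step towards
        data consistency on ||A x - y||^2. A / AT are the measurement operator
        and its adjoint; all ingredients here are stand-ins."""
        rng = np.random.default_rng() if rng is None else rng
        x = x0.copy()
        for _ in range(num_steps):
            x = x + step * score_fn(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
            x = x - consistency * step * AT(A(x) - y)   # data-consistency step
        return x

    # Toy usage: recover a vector from a random linear measurement.
    M = np.random.randn(20, 50)
    A, AT = (lambda x: M @ x), (lambda r: M.T @ r)
    x_true = np.random.randn(50)
    y = A(x_true)
    score_fn = lambda x: -x                              # Gaussian prior stand-in
    x_hat = posterior_sample(score_fn, A, AT, y, x0=np.zeros(50))
    print(np.linalg.norm(A(x_hat) - y))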
Simulating Diffusion Bridges with Score Matching
November 14, 2021
Jeremy Heng, Valentin De Bortoli, Arnaud Doucet, James Thornton
We consider the problem of simulating diffusion bridges, which are diffusion
processes that are conditioned to initialize and terminate at two given states.
The simulation of diffusion bridges has applications in diverse scientific
fields and plays a crucial role in the statistical inference of
discretely-observed diffusions. This is known to be a challenging problem that
has received much attention in the last two decades. This article contributes
to this rich body of literature by presenting a new avenue to obtain diffusion
bridge approximations. Our approach is based on a backward time representation
of a diffusion bridge, which may be simulated if one can time-reverse the
unconditioned diffusion. We introduce a variational formulation to learn this
time-reversal with function approximation and rely on a score matching method
to circumvent intractability. Another iteration of our proposed methodology
approximates the Doob’s $h$-transform defining the forward time representation
of a diffusion bridge. We discuss algorithmic considerations and extensions,
and present numerical results on an Ornstein–Uhlenbeck process, a model from
financial econometrics for interest rates, and a model from genetics for cell
differentiation and development to illustrate the effectiveness of our
approach.
Palette: Image-to-Image Diffusion Models
November 10, 2021
Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, Mohammad Norouzi
This paper develops a unified framework for image-to-image translation based
on conditional diffusion models and evaluates this framework on four
challenging image-to-image translation tasks, namely colorization, inpainting,
uncropping, and JPEG restoration. Our simple implementation of image-to-image
diffusion models outperforms strong GAN and regression baselines on all tasks,
without task-specific hyper-parameter tuning, architecture customization, or
any auxiliary loss or sophisticated new techniques needed. We uncover the
impact of an L2 vs. L1 loss in the denoising diffusion objective on sample
diversity, and demonstrate the importance of self-attention in the neural
architecture through empirical studies. Importantly, we advocate a unified
evaluation protocol based on ImageNet, with human evaluation and sample quality
scores (FID, Inception Score, Classification Accuracy of a pre-trained
ResNet-50, and Perceptual Distance against original images). We expect this
standardized evaluation protocol to play a role in advancing image-to-image
translation research. Finally, we show that a generalist, multi-task diffusion
model performs as well or better than task-specific specialist counterparts.
Check out https://diffusion-palette.github.io for an overview of the results.
Estimating High Order Gradients of the Data Distribution by Denoising
November 08, 2021
Chenlin Meng, Yang Song, Wenzhe Li, Stefano Ermon
The first order derivative of a data density can be estimated efficiently by
denoising score matching, and has become an important component in many
applications, such as image generation and audio synthesis. Higher order
derivatives provide additional local information about the data distribution
and enable new applications. Although they can be estimated via automatic
differentiation of a learned density model, this can amplify estimation errors
and is expensive in high dimensional settings. To overcome these limitations,
we propose a method to directly estimate high order derivatives (scores) of a
data density from samples. We first show that denoising score matching can be
interpreted as a particular case of Tweedie’s formula. By leveraging Tweedie’s
formula on higher order moments, we generalize denoising score matching to
estimate higher order derivatives. We demonstrate empirically that models
trained with the proposed method can approximate second order derivatives more
efficiently and accurately than via automatic differentiation. We show that our
models can be used to quantify uncertainty in denoising and to improve the
mixing speed of Langevin dynamics via Ozaki discretization for sampling
synthetic data and natural images.
Realistic galaxy image simulation via score-based generative models
November 02, 2021
Michael J. Smith, James E. Geach, Ryan A. Jackson, Nikhil Arora, Connor Stone, Stéphane Courteau
astro-ph.IM, astro-ph.GA, cs.LG
We show that a Denoising Diffusion Probabilistic Model (DDPM), a class of
score-based generative model, can be used to produce realistic mock images that
mimic observations of galaxies. Our method is tested with Dark Energy
Spectroscopic Instrument (DESI) grz imaging of galaxies from the Photometry and
Rotation curve OBservations from Extragalactic Surveys (PROBES) sample and
galaxies selected from the Sloan Digital Sky Survey. Subjectively, the
generated galaxies are highly realistic when compared with samples from the
real dataset. We quantify the similarity by borrowing from the deep generative
learning literature, using the 'Fréchet Inception Distance' to test for
subjective and morphological similarity. We also introduce the 'Synthetic
Galaxy Distance' metric to compare the emergent physical properties (such as
total magnitude, colour and half light radius) of a ground truth parent and
synthesised child dataset. We argue that the DDPM approach produces sharper and
more realistic images than other generative methods such as Adversarial
Networks (with the downside of more costly inference), and could be used to
produce large samples of synthetic observations tailored to a specific imaging
survey. We demonstrate two potential uses of the DDPM: (1) accurate in-painting
of occluded data, such as satellite trails, and (2) domain transfer, where new
input images can be processed to mimic the properties of the DDPM training set.
Here we 'DESI-fy' cartoon images as a proof of concept for domain transfer.
Finally, we suggest potential applications for score-based approaches that
could motivate further research on this topic within the astronomical
community.
Zero-Shot Translation using Diffusion Models
November 02, 2021
Eliya Nachmani, Shaked Dovrat
In this work, we show a novel method for neural machine translation (NMT),
using a denoising diffusion probabilistic model (DDPM), adjusted for textual
data, following recent advances in the field. We show that it’s possible to
translate sentences non-autoregressively using a diffusion model conditioned on
the source sentence. We also show that our model is able to translate between
pairs of languages unseen during training (zero-shot learning).
Likelihood Training of Schrodinger Bridge using Forward-Backward SDEs Theory
October 21, 2021
Tianrong Chen, Guan-Horng Liu, Evangelos A. Theodorou
stat.ML, cs.LG, math.AP, math.OC
Schrödinger Bridge (SB) is an entropy-regularized optimal transport problem
that has received increasing attention in deep generative modeling for its
mathematical flexibility compared to the Score-based Generative Model (SGM).
However, it remains unclear whether the optimization principle of SB relates to
the modern training of deep generative models, which often rely on constructing
log-likelihood objectives. This raises questions on the suitability of SB models
as a principled alternative for generative applications. In this work, we
present a novel computational framework for likelihood training of SB models
grounded on Forward-Backward Stochastic Differential Equations Theory - a
mathematical methodology from stochastic optimal control that transforms
the optimality condition of SB into a set of SDEs. Crucially, these SDEs can be
used to construct the likelihood objectives for SB that, surprisingly,
generalize the ones for SGM as special cases. This leads to a new optimization
principle that inherits the same SB optimality yet without losing applications
of modern generative training techniques, and we show that the resulting
training algorithm achieves comparable results on generating realistic images
on MNIST, CelebA, and CIFAR10. Our code is available at
https://github.com/ghliu/SB-FBSDE.
Controllable and Compositional Generation with Latent-Space Energy-Based Models
October 21, 2021
Weili Nie, Arash Vahdat, Anima Anandkumar
Controllable generation is one of the key requirements for successful
adoption of deep generative models in real-world applications, but it still
remains as a great challenge. In particular, the compositional ability to
generate novel concept combinations is out of reach for most current models. In
this work, we use energy-based models (EBMs) to handle compositional generation
over a set of attributes. To make them scalable to high-resolution image
generation, we introduce an EBM in the latent space of a pre-trained generative
model such as StyleGAN. We propose a novel EBM formulation representing the
joint distribution of data and attributes together, and we show how sampling
from it is formulated as solving an ordinary differential equation (ODE). Given
a pre-trained generator, all we need for controllable generation is to train an
attribute classifier. Sampling with ODEs is done efficiently in the latent
space and is robust to hyperparameters. Thus, our method is simple, fast to
train, and efficient to sample. Experimental results show that our method
outperforms the state-of-the-art in both conditional sampling and sequential
editing. In compositional generation, our method excels at zero-shot generation
of unseen attribute combinations. Also, by composing energy functions with
logical operators, this work is the first to achieve such compositionality in
generating photo-realistic images of resolution 1024x1024. Code is available at
https://github.com/NVlabs/LACE.
Diffusion Normalizing Flow
October 14, 2021
Qinsheng Zhang, Yongxin Chen
We present a novel generative modeling method called diffusion normalizing
flow based on stochastic differential equations (SDEs). The algorithm consists
of two neural SDEs: a forward SDE that gradually adds noise to the data to
transform the data into Gaussian random noise, and a backward SDE that
gradually removes the noise to sample from the data distribution. By jointly
training the two neural SDEs to minimize a common cost function that quantifies
the difference between the two, the backward SDE converges to a diffusion
process that starts with a Gaussian distribution and ends with the desired data
distribution. Our method is closely related to normalizing flow and diffusion
probabilistic models and can be viewed as a combination of the two. Compared
with normalizing flow, diffusion normalizing flow is able to learn
distributions with sharp boundaries. Compared with diffusion probabilistic
models, diffusion normalizing flow requires fewer discretization steps and thus
has better sampling efficiency. Our algorithm demonstrates competitive
performance in both high-dimension data density estimation and image generation
tasks.
Crystal Diffusion Variational Autoencoder for Periodic Material Generation
October 12, 2021
Tian Xie, Xiang Fu, Octavian-Eugen Ganea, Regina Barzilay, Tommi Jaakkola
cs.LG, cond-mat.mtrl-sci, physics.comp-ph
Generating the periodic structure of stable materials is a long-standing
challenge for the material design community. This task is difficult because
stable materials only exist in a low-dimensional subspace of all possible
periodic arrangements of atoms: 1) the coordinates must lie in the local energy
minimum defined by quantum mechanics, and 2) global stability also requires the
structure to follow the complex, yet specific bonding preferences between
different atom types. Existing methods fail to incorporate these factors and
often lack proper invariances. We propose a Crystal Diffusion Variational
Autoencoder (CDVAE) that captures the physical inductive bias of material
stability. By learning from the data distribution of stable materials, the
decoder generates materials in a diffusion process that moves atomic
coordinates towards a lower energy state and updates atom types to satisfy
bonding preferences between neighbors. Our model also explicitly encodes
interactions across periodic boundaries and respects permutation, translation,
rotation, and periodic invariances. We significantly outperform past methods in
three tasks: 1) reconstructing the input structure, 2) generating valid,
diverse, and realistic materials, and 3) generating materials that optimize a
specific property. We also provide several standard datasets and evaluation
metrics for the broader machine learning community.
Score-based diffusion models for accelerated MRI
October 08, 2021
Hyungjin Chung, Jong Chul Ye
eess.IV, cs.AI, cs.CV, cs.LG
Score-based diffusion models provide a powerful way to model images using the
gradient of the data distribution. Leveraging the learned score function as a
prior, here we introduce a way to sample data from a conditional distribution
given the measurements, such that the model can be readily used for solving
inverse problems in imaging, especially for accelerated MRI. In short, we train
a continuous time-dependent score function with denoising score matching. Then,
at the inference stage, we iterate between numerical SDE solver and data
consistency projection step to achieve reconstruction. Our model requires
magnitude images only for training, and yet is able to reconstruct
complex-valued data, and even extends to parallel imaging. The proposed method
is agnostic to sub-sampling patterns, and can be used with any sampling
schemes. Also, due to its generative nature, our approach can quantify
uncertainty, which is not possible with standard regression settings. On top of
all the advantages, our method also has very strong performance, even beating
the models trained with full supervision. With extensive experiments, we verify
the superiority of our method in terms of quality and practicality.
Score-based Generative Neural Networks for Large-Scale Optimal Transport
October 07, 2021
Max Daniels, Tyler Maunu, Paul Hand
We consider the fundamental problem of sampling the optimal transport
coupling between given source and target distributions. In certain cases, the
optimal transport plan takes the form of a one-to-one mapping from the source
support to the target support, but learning or even approximating such a map is
computationally challenging for large and high-dimensional datasets due to the
high cost of linear programming routines and an intrinsic curse of
dimensionality. We study instead the Sinkhorn problem, a regularized form of
optimal transport whose solutions are couplings between the source and the
target distribution. We introduce a novel framework for learning the Sinkhorn
coupling between two distributions in the form of a score-based generative
model. Conditioned on source data, our procedure iterates Langevin Dynamics to
sample target data according to the regularized optimal coupling. Key to this
approach is a neural network parametrization of the Sinkhorn problem, and we
prove convergence of gradient descent with respect to network parameters in
this formulation. We demonstrate its empirical success on a variety of large
scale optimal transport tasks.
DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation
October 06, 2021
Gwanghyun Kim, Taesung Kwon, Jong Chul Ye
Recently, GAN inversion methods combined with Contrastive Language-Image
Pretraining (CLIP) enable zero-shot image manipulation guided by text prompts.
However, their applications to diverse real images are still difficult due to
the limited GAN inversion capability. Specifically, these approaches often have
difficulties in reconstructing images with novel poses, views, and highly
variable contents compared to the training data, altering object identity, or
producing unwanted image artifacts. To mitigate these problems and enable
faithful manipulation of real images, we propose a novel method, dubbed
DiffusionCLIP, that performs text-driven image manipulation using diffusion
models. Based on full inversion capability and high-quality image generation
power of recent diffusion models, our method performs zero-shot image
manipulation successfully even between unseen domains and takes another step
towards general application by manipulating images from a widely varying
ImageNet dataset. Furthermore, we propose a novel noise combination method that
allows straightforward multi-attribute manipulation. Extensive experiments and
human evaluation confirmed robust and superior manipulation performance of our
methods compared to the existing baselines. Code is available at
https://github.com/gwang-kim/DiffusionCLIP.git.
Autoregressive Diffusion Models
October 05, 2021
Emiel Hoogeboom, Alexey A. Gritsenko, Jasmijn Bastings, Ben Poole, Rianne van den Berg, Tim Salimans
We introduce Autoregressive Diffusion Models (ARDMs), a model class
encompassing and generalizing order-agnostic autoregressive models (Uria et
al., 2014) and absorbing discrete diffusion (Austin et al., 2021), which we
show are special cases of ARDMs under mild assumptions. ARDMs are simple to
implement and easy to train. Unlike standard ARMs, they do not require causal
masking of model representations, and can be trained using an efficient
objective similar to modern probabilistic diffusion models that scales
favourably to highly-dimensional data. At test time, ARDMs support parallel
generation which can be adapted to fit any given generation budget. We find
that ARDMs require significantly fewer steps than discrete diffusion models to
attain the same performance. Finally, we apply ARDMs to lossless compression,
and show that they are uniquely suited to this task. Contrary to existing
approaches based on bits-back coding, ARDMs obtain compelling results not only
on complete datasets, but also on compressing single data points. Moreover,
this can be done using a modest number of network calls for (de)compression due
to the model’s adaptable parallel generation.
Score-Based Generative Classifiers
October 01, 2021
Roland S. Zimmermann, Lukas Schott, Yang Song, Benjamin A. Dunn, David A. Klindt
The tremendous success of generative models in recent years raises the
question whether they can also be used to perform classification. Generative
models have been used as adversarially robust classifiers on simple datasets
such as MNIST, but this robustness has not been observed on more complex
datasets like CIFAR-10. Additionally, on natural image datasets, previous
results have suggested a trade-off between the likelihood of the data and
classification accuracy. In this work, we investigate score-based generative
models as classifiers for natural images. We show that these models not only
obtain competitive likelihood values but simultaneously achieve
state-of-the-art classification accuracy for generative classifiers on
CIFAR-10. Nevertheless, we find that these models are only slightly, if at all,
more robust than discriminative baseline models on out-of-distribution tasks
based on common image corruptions. Similarly and contrary to prior results, we
find that score-based models are prone to worst-case distribution shifts in the form
of adversarial perturbations. Our work highlights that score-based generative
models are closing the gap in classification accuracy compared to standard
discriminative models. While they do not yet deliver on the promise of
adversarial and out-of-domain robustness, they provide a different approach to
classification that warrants further research.
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme
September 28, 2021
Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov, Jiansheng Wei
Voice conversion is a common speech synthesis task which can be solved in
different ways depending on a particular real-world scenario. The most
challenging one, often referred to as one-shot many-to-many voice conversion,
consists in copying the target voice from only one reference utterance in the
most general case when both source and target speakers do not belong to the
training dataset. We present a scalable high-quality solution based on
diffusion probabilistic modeling and demonstrate its superior quality compared
to state-of-the-art one-shot voice conversion approaches. Moreover, focusing on
real-time applications, we investigate general principles which can make
diffusion models faster while keeping synthesis quality at a high level. As a
result, we develop a novel Stochastic Differential Equations solver suitable
for various diffusion model types and generative tasks as shown through
empirical studies and justify it by theoretical analysis.
Bilateral Denoising Diffusion Models
August 26, 2021
Max W. Y. Lam, Jun Wang, Rongjie Huang, Dan Su, Dong Yu
cs.LG, cs.AI, cs.SD, eess.AS, eess.SP
Denoising diffusion probabilistic models (DDPMs) have emerged as competitive
generative models yet brought challenges to efficient sampling. In this paper,
we propose novel bilateral denoising diffusion models (BDDMs), which take
significantly fewer steps to generate high-quality samples. From a bilateral
modeling objective, BDDMs parameterize the forward and reverse processes with a
score network and a scheduling network, respectively. We show that a new lower
bound tighter than the standard evidence lower bound can be derived as a
surrogate objective for training the two networks. In particular, BDDMs are
efficient, simple-to-train, and capable of further improving any pre-trained
DDPM by optimizing the inference noise schedules. Our experiments demonstrated
that BDDMs can generate high-fidelity samples with as few as 3 sampling steps
and produce comparable or even higher quality samples than DDPMs using 1000
steps with only 16 sampling steps (a 62x speedup).
ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis
August 19, 2021
Patrick Esser, Robin Rombach, Andreas Blattmann, Björn Ommer
Autoregressive models and their sequential factorization of the data
likelihood have recently demonstrated great potential for image representation
and synthesis. Nevertheless, they incorporate image context in a linear 1D
order by attending only to previously synthesized image patches above or to the
left. Not only is this unidirectional, sequential bias of attention unnatural
for images as it disregards large parts of a scene until synthesis is almost
complete. It also processes the entire image on a single scale, thus ignoring
more global contextual information up to the gist of the entire scene. As a
remedy we incorporate a coarse-to-fine hierarchy of context by combining the
autoregressive formulation with a multinomial diffusion process: Whereas a
multistage diffusion process successively removes information to coarsen an
image, we train a (short) Markov chain to invert this process. In each stage,
the resulting autoregressive ImageBART model progressively incorporates context
from previous stages in a coarse-to-fine manner. Experiments show greatly
improved image modification capabilities over autoregressive models while also
providing high-fidelity image generation, both of which are enabled through
efficient training in a compressed latent space. Specifically, our approach can
take unrestricted, user-provided masks into account to perform local image
editing. Thus, in contrast to pure autoregressive models, it can solve
free-form image inpainting and, in the case of conditional models, local,
text-guided image modification without requiring mask-specific training.
ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models
August 06, 2021
Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, Youngjune Gwon, Sungroh Yoon
Denoising diffusion probabilistic models (DDPM) have shown remarkable
performance in unconditional image generation. However, due to the
stochasticity of the generative process in DDPM, it is challenging to generate
images with the desired semantics. In this work, we propose Iterative Latent
Variable Refinement (ILVR), a method to guide the generative process in DDPM to
generate high-quality images based on a given reference image. Here, the
refinement of the generative process in DDPM enables a single DDPM to sample
images from various sets directed by the reference image. The proposed ILVR
method generates high-quality images while controlling the generation. The
controllability of our method allows adaptation of a single DDPM without any
additional learning in various image generation tasks, such as generation from
various downsampling factors, multi-domain image translation, paint-to-image,
and editing with scribbles.
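The refinement can be sketched as swapping in the low-frequency content of a noised reference at every step; the low-pass filter below is a crude box-filter stand-in for the paper's downsampling operator:
    import numpy as np

    def low_pass(img, factor=4):
        """Crude box downsample followed by nearest-neighbour upsample."""
        h, w = img.shape
        small = img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
        return np.kron(small, np.ones((factor, factor)))

    def ilvr_style_refine(x_t, y_t, factor=4):
        """Keep the high frequencies of the current sample x_t but take the
        low frequencies from the noised reference y_t."""
        return x_t - low_pass(x_t, factor) + low_pass(y_t, factor)

    x_t = np.random.randn(64, 64)
    y_t = np.random.randn(64, 64)
    print(ilvr_style_refine(x_t, y_t).shape)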
Robust Compressed Sensing MRI with Deep Generative Priors
August 03, 2021
Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G. Dimakis, Jonathan I. Tamir
cs.LG, cs.CV, cs.IT, math.IT, stat.ML
The CSGM framework (Bora-Jalal-Price-Dimakis’17) has shown that deep
generative priors can be powerful tools for solving inverse problems. However,
to date this framework has been empirically successful only on certain datasets
(for example, human faces and MNIST digits), and it is known to perform poorly
on out-of-distribution samples. In this paper, we present the first successful
application of the CSGM framework on clinical MRI data. We train a generative
prior on brain scans from the fastMRI dataset, and show that posterior sampling
via Langevin dynamics achieves high quality reconstructions. Furthermore, our
experiments and theory show that posterior sampling is robust to changes in the
ground-truth distribution and measurement process. Our code and models are
available at: https://github.com/utcsilab/csgm-mri-langevin.
SDEdit: Image Synthesis and Editing with Stochastic Differential Equations
August 02, 2021
Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, Stefano Ermon
Guided image synthesis enables everyday users to create and edit
photo-realistic images with minimum effort. The key challenge is balancing
faithfulness to the user input (e.g., hand-drawn colored strokes) and realism
of the synthesized image. Existing GAN-based methods attempt to achieve such
balance using either conditional GANs or GAN inversions, which are challenging
and often require additional training data or loss functions for individual
applications. To address these issues, we introduce a new image synthesis and
editing method, Stochastic Differential Editing (SDEdit), based on a diffusion
model generative prior, which synthesizes realistic images by iteratively
denoising through a stochastic differential equation (SDE). Given an input
image with user guide of any type, SDEdit first adds noise to the input, then
subsequently denoises the resulting image through the SDE prior to increase its
realism. SDEdit does not require task-specific training or inversions and can
naturally achieve the balance between realism and faithfulness. SDEdit
significantly outperforms state-of-the-art GAN-based methods by up to 98.09% on
realism and 91.72% on overall satisfaction scores, according to a human
perception study, on multiple tasks, including stroke-based image synthesis and
editing as well as image compositing.
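The add-noise-then-denoise procedure is easy to sketch; the reverse step below is a toy stand-in and the noise level t0 is an illustrative choice:
    import numpy as np

    def sdedit_style_edit(guide_img, denoise_step, t0=0.5, num_steps=500, rng=None):
        """Perturb the user guide to an intermediate noise level t0 in (0, 1],
        then run the reverse process from there instead of from pure noise."""
        rng = np.random.default_rng() if rng is None else rng
        start = int(t0 * num_steps)
        x = guide_img + np.sqrt(t0) * rng.standard_normal(guide_img.shape)
        for t in reversed(range(start)):
            x = denoise_step(x, t)
        return x

    # Toy usage with a denoiser that just shrinks the signal slightly.
    toy_denoise = lambda x, t: 0.999 * x
    guide = np.random.rand(32, 32)               # e.g. a stroke painting
    print(sdedit_style_edit(guide, toy_denoise).shape)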
Score-Based Point Cloud Denoising
July 23, 2021
Shitong Luo, Wei Hu
Point clouds acquired from scanning devices are often perturbed by noise,
which affects downstream tasks such as surface reconstruction and analysis. The
distribution of a noisy point cloud can be viewed as the distribution of a set
of noise-free samples $p(x)$ convolved with some noise model $n$, leading to
$(p * n)(x)$ whose mode is the underlying clean surface. To denoise a noisy
point cloud, we propose to increase the log-likelihood of each point from $p *
n$ via gradient ascent – iteratively updating each point’s position. Since $p *
n$ is unknown at test-time, and we only need the score (i.e., the gradient of
the log-probability function) to perform gradient ascent, we propose a neural
network architecture to estimate the score of $p * n$ given only noisy point
clouds as input. We derive objective functions for training the network and
develop a denoising algorithm leveraging on the estimated scores. Experiments
demonstrate that the proposed model outperforms state-of-the-art methods under
a variety of noise models, and shows the potential to be applied in other tasks
such as point cloud upsampling. The code is available at
https://github.com/luost26/score-denoise.
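The denoising loop is plain gradient ascent along the estimated score; a toy sketch with an analytic stand-in for the score network:
    import numpy as np

    def denoise_points(points, score_fn, num_iters=30, step=0.05):
        """Iteratively update point positions by gradient ascent on the
        estimated log-density of the noisy point distribution."""
        x = points.copy()
        for _ in range(num_iters):
            x = x + step * score_fn(x)
        return x

    # Toy usage: noisy samples around the unit circle; the stand-in "score"
    # points back towards the circle.
    theta = np.random.rand(500) * 2 * np.pi
    clean = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    noisy = clean + 0.1 * np.random.randn(500, 2)
    score_fn = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True) - x
    print(denoise_points(noisy, score_fn).shape)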
Interpreting diffusion score matching using normalizing flow
July 21, 2021
Wenbo Gong, Yingzhen Li
Score matching (SM), and its related counterpart, Stein discrepancy (SD)
have achieved great success in model training and evaluations. However, recent
research shows their limitations when dealing with certain types of
distributions. One possible fix is incorporating the original score matching
(or Stein discrepancy) with a diffusion matrix, which is called diffusion score
matching (DSM) (or diffusion Stein discrepancy (DSD)). However, the lack of
interpretation of the diffusion matrix limits its usage to simple distributions
and manually chosen matrices. In this work, we aim to fill this gap by
interpreting the diffusion matrix using normalizing flows. Specifically, we
theoretically prove that DSM (or DSD) is equivalent to the original score
matching (or Stein discrepancy) evaluated in the transformed space defined by
the normalizing flow, where the diffusion matrix is the inverse of the flow’s
Jacobian matrix. In addition, we also build its connection to Riemannian
manifolds and further extend it to continuous flows, where the change of DSM is
characterized by an ODE.
Beyond In-Place Corruption: Insertion and Deletion in Denoising Probabilistic Models
July 16, 2021
Daniel D. Johnson, Jacob Austin, Rianne van den Berg, Daniel Tarlow
Denoising diffusion probabilistic models (DDPMs) have shown impressive
results on sequence generation by iteratively corrupting each example and then
learning to map corrupted versions back to the original. However, previous work
has largely focused on in-place corruption, adding noise to each pixel or token
individually while keeping their locations the same. In this work, we consider
a broader class of corruption processes and denoising models over sequence data
that can insert and delete elements, while still being efficient to train and
sample from. We demonstrate that these models outperform standard in-place
models on an arithmetic sequence task, and that when trained on the text8
dataset they can be used to fix spelling errors without any fine-tuning.
Denoising diffusion probabilistic models for replica exchange
July 15, 2021
Yihang Wang, Lukas Herron, Pratyush Tiwary
cond-mat.stat-mech, physics.bio-ph, physics.comp-ph, physics.data-an
Using simulations or experiments performed at some set of temperatures to
learn about the physics or chemistry at some other arbitrary temperature is a
problem of immense practical and theoretical relevance. Here we develop a
framework based on statistical mechanics and generative Artificial Intelligence
that allows solving this problem. Specifically, we work with denoising
diffusion probabilistic models, and show how these models in combination with
replica exchange molecular dynamics achieve superior sampling of the
biomolecular energy landscape at temperatures that were never even simulated
without assuming any particular slow degrees of freedom. The key idea is to
treat the temperature as a fluctuating random variable and not a control
parameter as is usually done. This allows us to directly sample from the joint
probability distribution in configuration and temperature space. The results
here are demonstrated for a chirally symmetric peptide and single-strand
ribonucleic acid undergoing conformational transitions in all-atom water. We
demonstrate how we can discover transition states and metastable states that
were previously unseen at the temperature of interest, and even bypass the need
to perform further simulations for a wide range of temperatures. At the same
time, any unphysical states are easily identifiable through very low Boltzmann
weights. The procedure, while shown here for a class of molecular simulations,
should be more generally applicable to mixing information across simulations
and experiments with varying control parameters.
CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation
July 07, 2021
Yusuke Tashiro, Jiaming Song, Yang Song, Stefano Ermon
The imputation of missing values in time series has many applications in
healthcare and finance. While autoregressive models are natural candidates for
time series imputation, score-based diffusion models have recently outperformed
existing counterparts including autoregressive models in many tasks such as
image generation and audio synthesis, and would be promising for time series
imputation. In this paper, we propose Conditional Score-based Diffusion models
for Imputation (CSDI), a novel time series imputation method that utilizes
score-based diffusion models conditioned on observed data. Unlike existing
score-based approaches, the conditional diffusion model is explicitly trained
for imputation and can exploit correlations between observed values. On
healthcare and environmental data, CSDI improves by 40-65% over existing
probabilistic imputation methods on popular performance metrics. In addition,
deterministic imputation by CSDI reduces the error by 5-20% compared to the
state-of-the-art deterministic imputation methods. Furthermore, CSDI can also
be applied to time series interpolation and probabilistic forecasting, and is
competitive with existing baselines. The code is available at
https://github.com/ermongroup/CSDI.
Structured Denoising Diffusion Models in Discrete State-Spaces
July 07, 2021
Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, Rianne van den Berg
cs.LG, cs.AI, cs.CL, cs.CV
Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown
impressive results on image and waveform generation in continuous state spaces.
Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs),
diffusion-like generative models for discrete data that generalize the
multinomial diffusion model of Hoogeboom et al. 2021, by going beyond
corruption processes with uniform transition probabilities. This includes
corruption with transition matrices that mimic Gaussian kernels in continuous
space, matrices based on nearest neighbors in embedding space, and matrices
that introduce absorbing states. The third allows us to draw a connection
between diffusion models and autoregressive and mask-based generative models.
We show that the choice of transition matrix is an important design decision
that leads to improved results in image and text domains. We also introduce a
new loss function that combines the variational lower bound with an auxiliary
cross entropy loss. For text, this model class achieves strong results on
character-level text generation while scaling to large vocabularies on LM1B. On
the image dataset CIFAR-10, our models approach the sample quality and exceed
the log-likelihood of the continuous-space DDPM model.
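An absorbing-state transition matrix of the kind described (each token either stays put or jumps to a [MASK] state with probability beta_t) can be written down directly; a small illustrative sketch:
    import numpy as np

    def absorbing_transition_matrix(vocab_size, beta_t):
        """D3PM-style absorbing-state corruption: with probability beta_t a
        token jumps to the final [MASK] index, otherwise it is unchanged."""
        K = vocab_size
        Q = (1.0 - beta_t) * np.eye(K)
        Q[:, K - 1] += beta_t                    # mass moved to the absorbing state
        Q[K - 1, :] = 0.0
        Q[K - 1, K - 1] = 1.0                    # [MASK] stays [MASK]
        return Q

    Q = absorbing_transition_matrix(vocab_size=6, beta_t=0.2)
    print(Q.sum(axis=1))                         # each row sums to 1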
Variational Diffusion Models
July 01, 2021
Diederik P. Kingma, Tim Salimans, Ben Poole, Jonathan Ho
Diffusion-based generative models have demonstrated a capacity for
perceptually impressive synthesis, but can they also be great likelihood-based
models? We answer this in the affirmative, and introduce a family of
diffusion-based generative models that obtain state-of-the-art likelihoods on
standard image density estimation benchmarks. Unlike other diffusion-based
models, our method allows for efficient optimization of the noise schedule
jointly with the rest of the model. We show that the variational lower bound
(VLB) simplifies to a remarkably short expression in terms of the
signal-to-noise ratio of the diffused data, thereby improving our theoretical
understanding of this model class. Using this insight, we prove an equivalence
between several models proposed in the literature. In addition, we show that
the continuous-time VLB is invariant to the noise schedule, except for the
signal-to-noise ratio at its endpoints. This enables us to learn a noise
schedule that minimizes the variance of the resulting VLB estimator, leading to
faster optimization. Combining these advances with architectural improvements,
we obtain state-of-the-art likelihoods on image density estimation benchmarks,
outperforming autoregressive models that have dominated these benchmarks for
many years, with often significantly faster optimization. In addition, we show
how to use the model as part of a bits-back compression scheme, and demonstrate
lossless compression rates close to the theoretical optimum. Code is available
at https://github.com/google-research/vdm .
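As a minimal sketch of the SNR-centred view of the diffusion loss, the snippet
below uses a fixed monotone log-SNR schedule and a placeholder denoiser; the
paper instead learns the schedule jointly with the model, so every concrete
choice here is an assumption.

    import torch

    def log_snr(t, gamma_min=-10.0, gamma_max=10.0):
        # Simple monotone log-SNR schedule (fixed here; learnable in the paper).
        return gamma_max - (gamma_max - gamma_min) * t

    def diffuse(x0, t):
        # Variance-preserving corruption: z_t = alpha_t * x0 + sigma_t * eps,
        # with alpha_t^2 = sigmoid(log_snr) and sigma_t^2 = sigmoid(-log_snr).
        g = log_snr(t)
        alpha = torch.sqrt(torch.sigmoid(g))[:, None]
        sigma = torch.sqrt(torch.sigmoid(-g))[:, None]
        eps = torch.randn_like(x0)
        return alpha * x0 + sigma * eps, eps

    def diffusion_loss(denoiser, x0, s, t):
        # Discrete-time term: 0.5 * (SNR(s) - SNR(t)) * ||x0 - x_hat(z_t, t)||^2, s < t.
        z_t, _ = diffuse(x0, t)
        x_hat = denoiser(z_t, t)
        snr_s, snr_t = torch.exp(log_snr(s)), torch.exp(log_snr(t))
        return (0.5 * (snr_s - snr_t) * ((x0 - x_hat) ** 2).sum(-1)).mean()

    denoiser = lambda z, t: z                   # placeholder network
    x0 = torch.randn(8, 4)
    t = torch.rand(8)
    print(diffusion_loss(denoiser, x0, 0.9 * t, t).item())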
Diffusion Priors in Variational Autoencoders
June 29, 2021
Antoine Wehenkel, Gilles Louppe
Among likelihood-based approaches for deep generative modelling, variational
autoencoders (VAEs) offer scalable amortized posterior inference and fast
sampling. However, VAEs are also more and more outperformed by competing models
such as normalizing flows (NFs), deep-energy models, or the new denoising
diffusion probabilistic models (DDPMs). In this preliminary work, we improve
VAEs by demonstrating how DDPMs can be used for modelling the prior
distribution of the latent variables. The diffusion prior model improves upon
Gaussian priors of classical VAEs and is competitive with NF-based priors.
Finally, we hypothesize that hierarchical VAEs could similarly benefit from the
enhanced capacity of diffusion priors.
Deep Generative Learning via Schrödinger Bridge
June 19, 2021
Gefei Wang, Yuling Jiao, Qian Xu, Yang Wang, Can Yang
We propose to learn a generative model via entropy interpolation with a
Schr"{o}dinger Bridge. The generative learning task can be formulated as
interpolating between a reference distribution and a target distribution based
on the Kullback-Leibler divergence. At the population level, this entropy
interpolation is characterized via an SDE on $[0,1]$ with a time-varying drift
term. At the sample level, we derive our Schrödinger Bridge algorithm by
plugging the drift term estimated by a deep score estimator and a deep density
ratio estimator into the Euler-Maruyama method. Under some mild smoothness
assumptions of the target distribution, we prove the consistency of both the
score estimator and the density ratio estimator, and then establish the
consistency of the proposed Schrödinger Bridge approach. Our theoretical
results guarantee that the distribution learned by our approach converges to
the target distribution. Experimental results on multimodal synthetic data and
benchmark data support our theoretical findings and indicate that the
generative model via Schrödinger Bridge is comparable with state-of-the-art
GANs, suggesting a new formulation of generative learning. We demonstrate its
usefulness in image interpolation and image inpainting.
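At the sample level the method amounts to plugging an estimated drift into the
Euler-Maruyama scheme; below is a generic Euler-Maruyama integrator with a toy,
hand-written drift standing in for the learned score and density-ratio drift.

    import numpy as np

    def euler_maruyama(x0, drift, diffusion, t0=0.0, t1=1.0, n_steps=200, rng=None):
        """Integrate dX = drift(X, t) dt + diffusion(t) dW from t0 to t1."""
        rng = rng or np.random.default_rng(0)
        x, dt = np.array(x0, dtype=float), (t1 - t0) / n_steps
        for i in range(n_steps):
            t = t0 + i * dt
            x = x + drift(x, t) * dt + diffusion(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
        return x

    # Toy drift pulling samples toward a two-mode target; it stands in for the
    # drift built from the deep score and density-ratio estimators.
    drift = lambda x, t: -(x - np.sign(x) * 2.0)
    samples = euler_maruyama(np.random.default_rng(1).standard_normal(1000), drift, lambda t: 0.5)
    print(samples.mean(), samples.std())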
ScoreGrad: Multivariate Probabilistic Time Series Forecasting with Continuous Energy-based Generative Models
June 18, 2021
Tijin Yan, Hongwei Zhang, Tong Zhou, Yufeng Zhan, Yuanqing Xia
Multivariate time series prediction has attracted a lot of attention because
of its wide applications, such as intelligent transportation and AIOps.
Generative models have achieved impressive results in time series modeling
because they can model the data distribution and take noise into account.
However, many existing works cannot be widely applied because of constraints on
the functional form of the generative model or sensitivity to hyperparameters.
In this paper, we propose ScoreGrad, a multivariate probabilistic time series
forecasting framework based on continuous energy-based generative models.
ScoreGrad is composed of a time series feature extraction module and a
conditional stochastic differential equation based score matching module. The
prediction is obtained by iteratively solving the reverse-time SDE. To the best
of our knowledge, ScoreGrad is the first continuous energy-based generative
model used for time series forecasting. Furthermore, ScoreGrad achieves
state-of-the-art results on six real-world datasets. The impact of
hyperparameters and sampler types on performance is also explored. Code is
available at
https://github.com/yantijin/ScoreGradPred.
Non Gaussian Denoising Diffusion Models
June 14, 2021
Eliya Nachmani, Robin San Roman, Lior Wolf
cs.LG, cs.CV, cs.SD, eess.AS
Generative diffusion processes are an emerging and effective tool for image
and speech generation. In existing methods, the underlying noise distribution
of the diffusion process is Gaussian. However, fitting distributions with more
degrees of freedom could improve the performance of such generative models. In
this work, we investigate other types of noise distribution for the diffusion
process. Specifically, we show that noise drawn from a Gamma distribution
provides improved results for image and speech generation. Moreover, we show
that using a mixture of Gaussian noise variables in the diffusion process
improves performance over a diffusion process based on a single distribution.
Our approach preserves the ability to efficiently sample states during the
training diffusion process while using Gamma noise or a mixture of noises.
CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis
June 14, 2021
Simon Rouard, Gaëtan Hadjeres
In this paper, we propose a novel score-based generative model for
unconditional raw audio synthesis. Our proposal builds upon the latest
developments on diffusion process modeling with stochastic differential
equations, which already demonstrated promising results on image generation. We
motivate novel heuristics for the choice of the diffusion processes better
suited for audio generation, and consider the use of a conditional U-Net to
approximate the score function. While previous diffusion-based approaches to
audio were mainly designed as speech vocoders at medium resolution, our
method termed CRASH (Controllable Raw Audio Synthesis with High-resolution)
allows us to generate short percussive sounds in 44.1kHz in a controllable way.
Through extensive experiments, we showcase on a drum sound generation task the
numerous sampling schemes offered by our method (unconditional generation,
deterministic generation, inpainting, interpolation, variations,
class-conditional sampling) and propose the class-mixing sampling, a novel way
to generate “hybrid” sounds. Our proposed method closes the gap with GAN-based
methods on raw audio, while offering more flexible generation capabilities with
lighter and easier-to-train models.
D2C: Diffusion-Denoising Models for Few-shot Conditional Generation
June 12, 2021
Abhishek Sinha, Jiaming Song, Chenlin Meng, Stefano Ermon
Conditional generative models of high-dimensional images have many
applications, but supervision signals from conditions to images can be
expensive to acquire. This paper describes Diffusion-Decoding models with
Contrastive representations (D2C), a paradigm for training unconditional
variational autoencoders (VAEs) for few-shot conditional image generation. D2C
uses a learned diffusion-based prior over the latent representations to improve
generation and contrastive self-supervised learning to improve representation
quality. D2C can adapt to novel generation tasks conditioned on labels or
manipulation constraints, by learning from as few as 100 labeled examples. On
conditional generation from new labels, D2C achieves superior performance over
state-of-the-art VAEs and diffusion models. On conditional image manipulation,
D2C generations are two orders of magnitude faster to produce than StyleGAN2
ones and are preferred by 50-60% of the human evaluators in a double-blind
study.
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior
June 11, 2021
Sang-gil Lee, Heeseung Kim, Chaehun Shin, Xu Tan, Chang Liu, Qi Meng, Tao Qin, Wei Chen, Sungroh Yoon, Tie-Yan Liu
stat.ML, cs.LG, cs.SD, eess.AS
Denoising diffusion probabilistic models have been recently proposed to
generate high-quality samples by estimating the gradient of the data density.
The framework defines the prior noise as a standard Gaussian distribution,
whereas the corresponding data distribution may be more complicated than the
standard Gaussian distribution, which potentially introduces inefficiency in
denoising the prior noise into the data sample because of the discrepancy
between the data and the prior. In this paper, we propose PriorGrad to improve
the efficiency of the conditional diffusion model for speech synthesis (for
example, a vocoder using a mel-spectrogram as the condition) by applying an
adaptive prior derived from the data statistics based on the conditional
information. We formulate the training and sampling procedures of PriorGrad and
demonstrate the advantages of an adaptive prior through a theoretical analysis.
Focusing on the speech synthesis domain, we consider the recently proposed
diffusion-based speech generative models based on both the spectral and time
domains and show that PriorGrad achieves faster convergence and inference with
superior performance, leading to an improved perceptual quality and robustness
to a smaller network capacity, and thereby demonstrating the efficiency of a
data-dependent adaptive prior.
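A heavily hedged sketch of the adaptive-prior idea: derive a diagonal Gaussian
prior from statistics of the conditioning mel-spectrogram, so the prior noise
already reflects the conditional information; the particular statistic and
normalization below are assumptions, not PriorGrad's exact recipe.

    import numpy as np

    def adaptive_prior_stddev(mel, floor=0.1):
        """Map per-frame energy of the conditioning mel-spectrogram to the
        standard deviation of a diagonal Gaussian prior (illustrative only)."""
        energy = np.exp(mel).mean(axis=0)            # rough per-frame energy
        std = energy / energy.max()
        return np.clip(std, floor, 1.0)

    mel = np.random.default_rng(0).standard_normal((80, 200))   # 80 bins x 200 frames
    std = adaptive_prior_stddev(mel)
    prior_noise = std * np.random.default_rng(1).standard_normal(200)
    print(std.min(), std.max(), prior_noise.shape)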
Adversarial purification with Score-based generative models
June 11, 2021
Jongmin Yoon, Sung Ju Hwang, Juho Lee
While adversarial training is considered as a standard defense method against
adversarial attacks for image classifiers, adversarial purification, which
purifies attacked images into clean images with a standalone purification
model, has shown promises as an alternative defense method. Recently, an
Energy-Based Model (EBM) trained with Markov-Chain Monte-Carlo (MCMC) has been
highlighted as a purification model, where an attacked image is purified by
running a long Markov-chain using the gradients of the EBM. Yet, the
practicality of the adversarial purification using an EBM remains questionable
because the number of MCMC steps required for such purification is too large.
In this paper, we propose a novel adversarial purification method based on an
EBM trained with Denoising Score-Matching (DSM). We show that an EBM trained
with DSM can quickly purify attacked images within a few steps. We further
introduce a simple yet effective randomized purification scheme that injects
random noises into images before purification. This process screens the
adversarial perturbations imposed on images by the random noises and brings the
images to the regime where the EBM can denoise well. We show that our
purification method is robust against various attacks and demonstrate its
state-of-the-art performances.
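A minimal sketch of the randomized purification scheme, assuming a stand-in
score function: inject Gaussian noise to screen the adversarial perturbation,
then take a few score-following (denoising) steps back toward high-density
images.

    import numpy as np

    def purify(x_adv, score_fn, sigma=0.25, n_steps=10, step_size=0.05, rng=None):
        """Randomized purification sketch: add noise, then follow the score."""
        rng = rng or np.random.default_rng(0)
        x = x_adv + sigma * rng.standard_normal(x_adv.shape)   # screen the attack
        for _ in range(n_steps):
            x = x + step_size * score_fn(x)                    # move up the log-density
        return x

    # Stand-in score of an isotropic Gaussian centred at the clean image.
    x_clean = np.zeros(16)
    score_fn = lambda x: -(x - x_clean)
    x_adv = x_clean + 0.1 * np.sign(np.random.default_rng(2).standard_normal(16))
    print(np.abs(purify(x_adv, score_fn) - x_clean).mean())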
Score-based Generative Modeling in Latent Space
June 10, 2021
Arash Vahdat, Karsten Kreis, Jan Kautz
Score-based generative models (SGMs) have recently demonstrated impressive
results in terms of both sample quality and distribution coverage. However,
they are usually applied directly in data space and often require thousands of
network evaluations for sampling. Here, we propose the Latent Score-based
Generative Model (LSGM), a novel approach that trains SGMs in a latent space,
relying on the variational autoencoder framework. Moving from data to latent
space allows us to train more expressive generative models, apply SGMs to
non-continuous data, and learn smoother SGMs in a smaller space, resulting in
fewer network evaluations and faster sampling. To enable training LSGMs
end-to-end in a scalable and stable manner, we (i) introduce a new
score-matching objective suitable to the LSGM setting, (ii) propose a novel
parameterization of the score function that allows SGM to focus on the mismatch
of the target distribution with respect to a simple Normal one, and (iii)
analytically derive multiple techniques for variance reduction of the training
objective. LSGM obtains a state-of-the-art FID score of 2.10 on CIFAR-10,
outperforming all existing generative results on this dataset. On
CelebA-HQ-256, LSGM is on a par with previous SGMs in sample quality while
outperforming them in sampling time by two orders of magnitude. In modeling
binary images, LSGM achieves state-of-the-art likelihood on the binarized
OMNIGLOT dataset. Our project page and code can be found at
https://nvlabs.github.io/LSGM .
Soft Truncation: A Universal Training Technique of Score-based Diffusion Model for High Precision Score Estimation
June 10, 2021
Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, Il-Chul Moon
Recent advances in diffusion models bring state-of-the-art performance on
image generation tasks. However, empirical results from previous research in
diffusion models imply an inverse correlation between density estimation and
sample generation performances. This paper investigates with sufficient
empirical evidence that such inverse correlation happens because density
estimation is significantly contributed by small diffusion time, whereas sample
generation mainly depends on large diffusion time. However, training a score
network well across the entire diffusion time is demanding because the loss
scale is significantly imbalanced at each diffusion time. For successful
training, therefore, we introduce Soft Truncation, a universally applicable
training technique for diffusion models, that softens the fixed and static
truncation hyperparameter into a random variable. In experiments, Soft
Truncation achieves state-of-the-art performance on CIFAR-10, CelebA, CelebA-HQ
256x256, and STL-10 datasets.
Learning to Efficiently Sample from Diffusion Probabilistic Models
June 07, 2021
Daniel Watson, Jonathan Ho, Mohammad Norouzi, William Chan
Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a powerful
family of generative models that can yield high-fidelity samples and
competitive log-likelihoods across a range of domains, including image and
speech synthesis. Key advantages of DDPMs include ease of training, in contrast
to generative adversarial networks, and speed of generation, in contrast to
autoregressive models. However, DDPMs typically require hundreds-to-thousands
of steps to generate a high fidelity sample, making them prohibitively
expensive for high dimensional problems. Fortunately, DDPMs allow trading
generation speed for sample quality through adjusting the number of refinement
steps as a post process. Prior work has been successful in improving generation
speed through handcrafting the time schedule by trial and error. We instead
view the selection of the inference time schedules as an optimization problem,
and introduce an exact dynamic programming algorithm that finds the optimal
discrete time schedules for any pre-trained DDPM. Our method exploits the fact
that ELBO can be decomposed into separate KL terms, and given any computation
budget, discovers the time schedule that maximizes the training ELBO exactly.
Our method is efficient, has no hyper-parameters of its own, and can be applied
to any pre-trained DDPM with no retraining. We discover inference time
schedules requiring as few as 32 refinement steps, while sacrificing less than
0.1 bits per dimension compared to the default 4,000 steps used on ImageNet
64x64 [Ho et al., 2020; Nichol and Dhariwal, 2021].
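A small dynamic-programming sketch in the spirit of the abstract: given
per-step costs (random placeholders here; in the paper these are the
decomposed KL terms of the ELBO), find the schedule with a fixed budget of
refinement steps that minimizes the total cost.

    import numpy as np

    def best_schedule(cost, budget):
        """cost[j, i] (i < j) is the cost of one step from time j to time i.
        Returns (total_cost, schedule) with the schedule going from T down to 0."""
        T = cost.shape[0] - 1
        INF = float("inf")
        dp = np.full((budget + 1, T + 1), INF)
        arg = np.zeros((budget + 1, T + 1), dtype=int)
        dp[0, 0] = 0.0                                  # at time 0 with no steps left
        for k in range(1, budget + 1):
            for j in range(1, T + 1):
                for i in range(j):                      # candidate next (smaller) time
                    c = cost[j, i] + dp[k - 1, i]
                    if c < dp[k, j]:
                        dp[k, j], arg[k, j] = c, i
        schedule, j = [T], T                            # recover the optimal path
        for k in range(budget, 0, -1):
            j = arg[k, j]
            schedule.append(j)
        return dp[budget, T], schedule

    cost = np.random.default_rng(0).random((21, 21))    # placeholder per-step costs
    print(best_schedule(cost, budget=5))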
A Variational Perspective on Diffusion-Based Generative Models and Score Matching
June 05, 2021
Chin-Wei Huang, Jae Hyun Lim, Aaron Courville
Discrete-time diffusion-based generative models and score matching methods
have shown promising results in modeling high-dimensional image data. Recently,
Song et al. (2021) show that diffusion processes that transform data into noise
can be reversed via learning the score function, i.e. the gradient of the
log-density of the perturbed data. They propose to plug the learned score
function into an inverse formula to define a generative diffusion process.
Despite the empirical success, a theoretical underpinning of this procedure is
still lacking. In this work, we approach the (continuous-time) generative
diffusion directly and derive a variational framework for likelihood
estimation, which includes continuous-time normalizing flows as a special case,
and can be seen as an infinitely deep variational autoencoder. Under this
framework, we show that minimizing the score-matching loss is equivalent to
maximizing a lower bound of the likelihood of the plug-in reverse SDE proposed
by Song et al. (2021), bridging the theoretical gap.
Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling
June 01, 2021
Valentin De Bortoli, James Thornton, Jeremy Heng, Arnaud Doucet
Progressively applying Gaussian noise transforms complex data distributions
to approximately Gaussian. Reversing this dynamic defines a generative model.
When the forward noising process is given by a Stochastic Differential Equation
(SDE), Song et al. (2021) demonstrate how the time inhomogeneous drift of the
associated reverse-time SDE may be estimated using score-matching. A limitation
of this approach is that the forward-time SDE must be run for a sufficiently
long time for the final distribution to be approximately Gaussian. In contrast,
solving the Schr"odinger Bridge problem (SB), i.e. an entropy-regularized
optimal transport problem on path spaces, yields diffusions which generate
samples from the data distribution in finite time. We present Diffusion SB
(DSB), an original approximation of the Iterative Proportional Fitting (IPF)
procedure to solve the SB problem, and provide theoretical analysis along with
generative modeling experiments. The first DSB iteration recovers the
methodology proposed by Song et al. (2021), with the flexibility of using
shorter time intervals, as subsequent DSB iterations reduce the discrepancy
between the final-time marginal of the forward (resp. backward) SDE with
respect to the prior (resp. data) distribution. Beyond generative modeling, DSB
offers a widely applicable computational optimal transport tool as the
continuous state-space analogue of the popular Sinkhorn algorithm (Cuturi,
2013).
On Fast Sampling of Diffusion Probabilistic Models
May 31, 2021
Zhifeng Kong, Wei Ping
In this work, we propose FastDPM, a unified framework for fast sampling in
diffusion probabilistic models. FastDPM generalizes previous methods and gives
rise to new algorithms with improved sample quality. We systematically
investigate the fast sampling methods under this framework across different
domains, on different datasets, and with different amount of conditional
information provided for generation. We find the performance of a particular
method depends on data domains (e.g., image or audio), the trade-off between
sampling speed and sample quality, and the amount of conditional information.
We further provide insights and recipes on the choice of methods for
practitioners.
SNIPS: Solving Noisy Inverse Problems Stochastically
May 31, 2021
Bahjat Kawar, Gregory Vaksman, Michael Elad
In this work we introduce a novel stochastic algorithm dubbed SNIPS, which
draws samples from the posterior distribution of any linear inverse problem,
where the observation is assumed to be contaminated by additive white Gaussian
noise. Our solution incorporates ideas from Langevin dynamics and Newton’s
method, and exploits a pre-trained minimum mean squared error (MMSE) Gaussian
denoiser. The proposed approach relies on an intricate derivation of the
posterior score function that includes a singular value decomposition (SVD) of
the degradation operator, in order to obtain a tractable iterative algorithm
for the desired sampling. Due to its stochasticity, the algorithm can produce
multiple high perceptual quality samples for the same noisy observation. We
demonstrate the abilities of the proposed paradigm for image deblurring,
super-resolution, and compressive sensing. We show that the samples produced
are sharp, detailed and consistent with the given measurements, and their
diversity exposes the inherent uncertainty in the inverse problem being solved.
Cascaded Diffusion Models for High Fidelity Image Generation
May 30, 2021
Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans
We show that cascaded diffusion models are capable of generating high
fidelity images on the class-conditional ImageNet generation benchmark, without
any assistance from auxiliary image classifiers to boost sample quality. A
cascaded diffusion model comprises a pipeline of multiple diffusion models that
generate images of increasing resolution, beginning with a standard diffusion
model at the lowest resolution, followed by one or more super-resolution
diffusion models that successively upsample the image and add higher resolution
details. We find that the sample quality of a cascading pipeline relies
crucially on conditioning augmentation, our proposed method of data
augmentation of the lower resolution conditioning inputs to the
super-resolution models. Our experiments show that conditioning augmentation
prevents compounding error during sampling in a cascaded model, helping us to
train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at
128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep, and
classification accuracy scores of 63.02% (top-1) and 84.06% (top-5) at 256x256,
outperforming VQ-VAE-2.
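A hedged sketch of conditioning augmentation: perturb the low-resolution
conditioning input during training so that each super-resolution stage is
robust to imperfect samples from the preceding stage at test time; Gaussian
noise and the strength range below are assumptions about one simple variant.

    import numpy as np

    def conditioning_augmentation(low_res, noise_level, rng=None):
        """Noise the conditioning input; return it with the level, which is
        also given to the model so it can adapt to the corruption strength."""
        rng = rng or np.random.default_rng(0)
        return low_res + noise_level * rng.standard_normal(low_res.shape), noise_level

    # During training, draw a random augmentation strength per batch; at
    # sampling time a fixed strength is used for the conditioning inputs.
    low_res_batch = np.random.default_rng(1).random((8, 64, 64, 3))
    level = float(np.random.default_rng(2).uniform(0.0, 0.3))
    augmented, level = conditioning_augmentation(low_res_batch, level)
    print(augmented.shape, level)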
Representation Learning in Continuous-Time Score-Based Generative Models
May 29, 2021
Korbinian Abstreiter, Sarthak Mittal, Stefan Bauer, Bernhard Schölkopf, Arash Mehrjou
Diffusion-based methods represented as stochastic differential equations on a
continuous-time domain have recently proven successful as a non-adversarial
generative model. Training such models relies on denoising score matching,
which can be seen as multi-scale denoising autoencoders. Here, we augment the
denoising score matching framework to enable representation learning without
any supervised signal. GANs and VAEs learn representations by directly
transforming latent codes to data samples. In contrast, the introduced
diffusion-based representation learning relies on a new formulation of the
denoising score matching objective and thus encodes the information needed for
denoising. We illustrate how this difference allows for manual control of the
level of details encoded in the representation. Using the same approach, we
propose to learn an infinite-dimensional latent code that achieves improvements
of state-of-the-art models on semi-supervised image classification. We also
compare the quality of learned representations of diffusion score matching with
other methods like autoencoder and contrastively trained systems through their
performances on downstream tasks.
Gotta Go Fast When Generating Data with Score-Based Models
May 28, 2021
Alexia Jolicoeur-Martineau, Ke Li, Rémi Piché-Taillefer, Tal Kachman, Ioannis Mitliagkas
cs.LG, cs.CV, math.OC, stat.ML
Score-based (denoising diffusion) generative models have recently gained a
lot of success in generating realistic and diverse data. These approaches
define a forward diffusion process for transforming data to noise and generate
data by reversing it (thereby going from noise to data). Unfortunately, current
score-based models generate data very slowly due to the sheer number of score
network evaluations required by numerical SDE solvers.
In this work, we aim to accelerate this process by devising a more efficient
SDE solver. Existing approaches rely on the Euler-Maruyama (EM) solver, which
uses a fixed step size. We found that naively replacing it with other SDE
solvers fares poorly - they either result in low-quality samples or become
slower than EM. To get around this issue, we carefully devise an SDE solver
with adaptive step sizes tailored to score-based generative models piece by
piece. Our solver requires only two score function evaluations, rarely rejects
samples, and leads to high-quality samples. Our approach generates data 2 to 10
times faster than EM while achieving better or equal sample quality. For
high-resolution images, our method leads to significantly higher quality
samples than all other methods tested. Our SDE solver has the benefit of
requiring no step size tuning.
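A generic adaptive step-size integrator in the spirit of the abstract (not the
authors' exact solver): two drift evaluations per step, with a Heun-style
proposal compared against plain Euler-Maruyama (sharing the same Brownian
increment) as a local error estimate that grows or shrinks the step.

    import numpy as np

    def adaptive_reverse_sde(x, drift, diffusion, t1=1.0, t0=0.0, h=0.05, tol=1e-2, rng=None):
        """Integrate a reverse-time SDE from t1 down to t0 with adaptive steps."""
        rng = rng or np.random.default_rng(0)
        t = t1
        while t > t0:
            h = min(h, t - t0)
            dw = np.sqrt(h) * rng.standard_normal(x.shape)
            d1 = drift(x, t)
            x_euler = x - d1 * h + diffusion(t) * dw
            d2 = drift(x_euler, t - h)
            x_heun = x - 0.5 * (d1 + d2) * h + diffusion(t) * dw
            if np.max(np.abs(x_heun - x_euler)) <= tol:   # accept and grow the step
                x, t, h = x_heun, t - h, h * 1.1
            else:                                         # reject and shrink the step
                h *= 0.5
        return x

    drift = lambda x, t: -x / (1.0 + t)                   # placeholder reverse-time drift
    print(adaptive_reverse_sde(np.random.default_rng(1).standard_normal(4), drift, lambda t: 1.0))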
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion
May 28, 2021
Songxiang Liu, Yuewen Cao, Dan Su, Helen Meng
Singing voice conversion (SVC) is a promising technique that can enrich
human-computer interaction by endowing a computer with the ability to
produce high-fidelity and expressive singing voice. In this paper, we propose
DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
DiffSVC uses phonetic posteriorgrams (PPGs) as content features. A denoising
module is trained in DiffSVC, which takes the corrupted mel spectrogram produced
by the diffusion/forward process and its corresponding step information as input
to predict the added Gaussian noise. We use PPGs, fundamental frequency
features and loudness features as auxiliary input to assist the denoising
process. Experiments show that DiffSVC can achieve superior conversion
performance in terms of naturalness and voice similarity to current
state-of-the-art SVC approaches.
Grad-TTS: A diffusion probabilistic model for text-to-speech
May 13, 2021
Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Kudinov
Recently, denoising diffusion probabilistic models and generative score
matching have shown high potential in modelling complex data distributions
while stochastic calculus has provided a unified point of view on these
techniques allowing for flexible inference schemes. In this paper we introduce
Grad-TTS, a novel text-to-speech model with score-based decoder producing
mel-spectrograms by gradually transforming noise predicted by encoder and
aligned with text input by means of Monotonic Alignment Search. The framework
of stochastic differential equations helps us to generalize conventional
diffusion probabilistic models to the case of reconstructing data from noise
with different parameters and allows us to make this reconstruction flexible by
explicitly controlling the trade-off between sound quality and inference speed.
Subjective human evaluation shows that Grad-TTS is competitive with
state-of-the-art text-to-speech approaches in terms of Mean Opinion Score. We
will make the code publicly available shortly.
Diffusion models beat GANs on image synthesis
May 11, 2021
Prafulla Dhariwal, Alex Nichol
cs.LG, cs.AI, cs.CV, stat.ML
We show that diffusion models can achieve image sample quality superior to
the current state-of-the-art generative models. We achieve this on
unconditional image synthesis by finding a better architecture through a series
of ablations. For conditional image synthesis, we further improve sample
quality with classifier guidance: a simple, compute-efficient method for
trading off diversity for fidelity using gradients from a classifier. We
achieve an FID of 2.97 on ImageNet 128$\times$128, 4.59 on ImageNet
256$\times$256, and 7.72 on ImageNet 512$\times$512, and we match BigGAN-deep
even with as few as 25 forward passes per sample, all while maintaining better
coverage of the distribution. Finally, we find that classifier guidance
combines well with upsampling diffusion models, further improving FID to 3.94
on ImageNet 256$\times$256 and 3.85 on ImageNet 512$\times$512. We release our
code at https://github.com/openai/guided-diffusion
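Classifier guidance, as summarized above, shifts the model's score (or mean)
by the gradient of a classifier's log-probability for the target class, scaled
by a guidance weight; the sketch below uses placeholder networks, so shapes and
signatures are assumptions.

    import torch

    def guided_score(score_model, classifier, x_t, t, y, guidance_scale=1.0):
        """Shift the unconditional score by grad_x log p(y | x_t)."""
        x_t = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_t, t), dim=-1)
        selected = log_probs[torch.arange(x_t.shape[0]), y].sum()
        grad = torch.autograd.grad(selected, x_t)[0]
        return score_model(x_t, t) + guidance_scale * grad

    # Placeholders standing in for the trained diffusion model and classifier.
    score_model = lambda x, t: -x
    classifier = lambda x, t: x @ torch.randn(8, 10)
    x_t, y = torch.randn(4, 8), torch.tensor([1, 2, 3, 4])
    print(guided_score(score_model, classifier, x_t, 0.5, y, guidance_scale=2.0).shape)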
Learning Gradient Fields for Molecular Conformation Generation
May 09, 2021
Chence Shi, Shitong Luo, Minkai Xu, Jian Tang
cs.LG, physics.chem-ph, q-bio.BM
We study a fundamental problem in computational chemistry known as molecular
conformation generation, trying to predict stable 3D structures from 2D
molecular graphs. Existing machine learning approaches usually first predict
distances between atoms and then generate a 3D structure satisfying the
distances, where noise in predicted distances may induce extra errors during 3D
coordinate generation. Inspired by the traditional force field methods for
molecular dynamics simulation, in this paper, we propose a novel approach
called ConfGF by directly estimating the gradient fields of the log density of
atomic coordinates. The estimated gradient fields allow directly generating
stable conformations via Langevin dynamics. However, the problem is very
challenging as the gradient fields are roto-translation equivariant. We notice
that estimating the gradient fields of atomic coordinates can be translated to
estimating the gradient fields of interatomic distances, and hence develop a
novel algorithm based on recent score-based generative models to effectively
estimate these gradients. Experimental results across multiple tasks show that
ConfGF outperforms previous state-of-the-art baselines by a significant margin.
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism
May 06, 2021
Jinglin Liu, Chengxi Li, Yi Ren, Feiyang Chen, Zhou Zhao
Singing voice synthesis (SVS) systems are built to synthesize high-quality
and expressive singing voice, in which the acoustic model generates the
acoustic features (e.g., mel-spectrogram) given a music score. Previous singing
acoustic models adopt a simple loss (e.g., L1 and L2) or generative adversarial
network (GAN) to reconstruct the acoustic features, while they suffer from
over-smoothing and unstable training issues respectively, which hinder the
naturalness of synthesized singing. In this work, we propose DiffSinger, an
acoustic model for SVS based on the diffusion probabilistic model. DiffSinger
is a parameterized Markov chain that iteratively converts the noise into
mel-spectrogram conditioned on the music score. By implicitly optimizing
variational bound, DiffSinger can be stably trained and generate realistic
outputs. To further improve the voice quality and speed up inference, we
introduce a shallow diffusion mechanism to make better use of the prior
knowledge learned by the simple loss. Specifically, DiffSinger starts
generation at a shallow step smaller than the total number of diffusion steps,
according to the intersection of the diffusion trajectories of the ground-truth
mel-spectrogram and the one predicted by a simple mel-spectrogram decoder.
Besides, we propose boundary prediction methods to locate the intersection and
determine the shallow step adaptively. The evaluations conducted on a Chinese
singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS
work. Extensional experiments also prove the generalization of our methods on
text-to-speech task (DiffSpeech). Audio samples: https://diffsinger.github.io.
Codes: https://github.com/MoonInTheRiver/DiffSinger. The old title of this
work: “Diffsinger: Diffusion acoustic model for singing voice synthesis”.
Image super-resolution via iterative refinement
April 15, 2021
Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, Mohammad Norouzi
We present SR3, an approach to image Super-Resolution via Repeated
Refinement. SR3 adapts denoising diffusion probabilistic models to conditional
image generation and performs super-resolution through a stochastic denoising
process. Inference starts with pure Gaussian noise and iteratively refines the
noisy output using a U-Net model trained on denoising at various noise levels.
SR3 exhibits strong performance on super-resolution tasks at different
magnification factors, on faces and natural images. We conduct human evaluation
on a standard 8X face super-resolution task on CelebA-HQ, comparing with SOTA
GAN methods. SR3 achieves a fool rate close to 50%, suggesting photo-realistic
outputs, while GANs do not exceed a fool rate of 34%. We further show the
effectiveness of SR3 in cascaded image generation, where generative models are
chained with super-resolution models, yielding a competitive FID score of 11.3
on ImageNet.
UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models
April 12, 2021
Hiroshi Sasaki, Chris G. Willcocks, Toby P. Breckon
We propose a novel unpaired image-to-image translation method that uses
denoising diffusion probabilistic models without requiring adversarial
training. Our method, UNpaired Image Translation with Denoising Diffusion
Probabilistic Models (UNIT-DDPM), trains a generative model to infer the joint
distribution of images over both domains as a Markov chain by minimising a
denoising score matching objective conditioned on the other domain. In
particular, we update both domain translation models simultaneously, and we
generate target domain images by a denoising Markov Chain Monte Carlo approach
that is conditioned on the input source domain images, based on Langevin
dynamics. Our approach provides stable model training for image-to-image
translation and generates high-quality image outputs. This enables
state-of-the-art Fréchet Inception Distance (FID) performance on several
public datasets, including both colour and multispectral imagery, significantly
outperforming the contemporary adversarial image-to-image translation methods.
On tuning consistent annealed sampling for denoising score matching
April 08, 2021
Joan Serrà, Santiago Pascual, Jordi Pons
cs.LG, cs.AI, cs.CV, cs.SD, eess.AS
Score-based generative models provide state-of-the-art quality for image and
audio synthesis. Sampling from these models is performed iteratively, typically
employing a discretized series of noise levels and a predefined scheme. In this
note, we first overview three common sampling schemes for models trained with
denoising score matching. Next, we focus on one of them, consistent annealed
sampling, and study its hyper-parameter boundaries. We then highlight a
possible formulation of such hyper-parameter that explicitly considers those
boundaries and facilitates tuning when using few or a variable number of steps.
Finally, we highlight some connections of the formulation with other sampling
schemes.
3D Shape Generation and Completion through Point-Voxel Diffusion
April 08, 2021
Linqi Zhou, Yilun Du, Jiajun Wu
We propose a novel approach for probabilistic generative modeling of 3D
shapes. Unlike most existing models that learn to deterministically translate a
latent vector to a shape, our model, Point-Voxel Diffusion (PVD), is a unified,
probabilistic formulation for unconditional shape generation and conditional,
multi-modal shape completion. PVD marries denoising diffusion models with the
hybrid, point-voxel representation of 3D shapes. It can be viewed as a series
of denoising steps, reversing the diffusion process from observed point cloud
data to Gaussian noise, and is trained by optimizing a variational lower bound
to the (conditional) likelihood function. Experiments demonstrate that PVD is
capable of synthesizing high-fidelity shapes, completing partial point clouds,
and generating multiple completion results from single-view depth scans of real
objects.
Noise Estimation for Generative Diffusion Models
April 06, 2021
Robin San-Roman, Eliya Nachmani, Lior Wolf
Generative diffusion models have emerged as leading models in speech and
image generation. However, in order to perform well with a small number of
denoising steps, a costly tuning of the set of noise parameters is needed. In
this work, we present a simple and versatile learning scheme that can
step-by-step adjust those noise parameters, for any given number of steps,
whereas previous work needs to be retuned for each number of steps separately.
Furthermore, without modifying the weights of the diffusion model, we are able
to significantly improve the synthesis results, for a small number of steps.
Our approach comes at a negligible computation cost.
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling
April 06, 2021
Junhyeok Lee, Seungu Han
In this work, we introduce NU-Wave, the first neural audio upsampling model
to produce waveforms of sampling rate 48kHz from coarse 16kHz or 24kHz inputs,
while prior works could generate only up to 16kHz. NU-Wave is the first
diffusion probabilistic model for audio super-resolution which is engineered
based on neural vocoders. NU-Wave generates high-quality audio that achieves
high performance in terms of signal-to-noise ratio (SNR), log-spectral distance
(LSD), and accuracy of the ABX test. In all cases, NU-Wave outperforms the
baseline models despite the substantially smaller model capacity (3.0M
parameters) than baselines (5.4-21%). The audio samples of our model are
available at https://mindslab-ai.github.io/nuwave, and the code will be made
available soon.
Symbolic music generation with diffusion models
March 30, 2021
Gautam Mittal, Jesse Engel, Curtis Hawthorne, Ian Simon
cs.SD, cs.LG, eess.AS, stat.ML
Score-based generative models and diffusion probabilistic models have been
successful at generating high-quality samples in continuous domains such as
images and audio. However, due to their Langevin-inspired sampling mechanisms,
their application to discrete and sequential data has been limited. In this
work, we present a technique for training diffusion models on sequential data
by parameterizing the discrete domain in the continuous latent space of a
pre-trained variational autoencoder. Our method is non-autoregressive and
learns to generate sequences of latent embeddings through the reverse process
and offers parallel generation with a constant number of iterative refinement
steps. We apply this technique to modeling symbolic music and show strong
unconditional generation and post-hoc conditional infilling results compared to
autoregressive language models operating over the same continuous embeddings.
Diffusion Probabilistic Models for 3D Point Cloud Generation
March 02, 2021
Shitong Luo, Wei Hu
We present a probabilistic model for point cloud generation, which is
fundamental for various 3D vision tasks such as shape completion, upsampling,
synthesis and data augmentation. Inspired by the diffusion process in
non-equilibrium thermodynamics, we view points in point clouds as particles in
a thermodynamic system in contact with a heat bath, which diffuse from the
original distribution to a noise distribution. Point cloud generation thus
amounts to learning the reverse diffusion process that transforms the noise
distribution to the distribution of a desired shape. Specifically, we propose
to model the reverse diffusion process for point clouds as a Markov chain
conditioned on certain shape latent. We derive the variational bound in closed
form for training and provide implementations of the model. Experimental
results demonstrate that our model achieves competitive performance in point
cloud generation and auto-encoding. The code is available at
\url{https://github.com/luost26/diffusion-point-cloud}.
Improved denoising diffusion probabilistic models
February 18, 2021
Alex Nichol, Prafulla Dhariwal
Denoising diffusion probabilistic models (DDPM) are a class of generative
models which have recently been shown to produce excellent samples. We show
that with a few simple modifications, DDPMs can also achieve competitive
log-likelihoods while maintaining high sample quality. Additionally, we find
that learning variances of the reverse diffusion process allows sampling with
an order of magnitude fewer forward passes with a negligible difference in
sample quality, which is important for the practical deployment of these
models. We additionally use precision and recall to compare how well DDPMs and
GANs cover the target distribution. Finally, we show that the sample quality
and likelihood of these models scale smoothly with model capacity and training
compute, making them easily scalable. We release our code at
https://github.com/openai/improved-diffusion
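One way the reverse-process variances can be learned, paraphrased here with
hedging: interpolate in log space between the two analytic variance choices at
step t using a network output v in [0, 1]; the schedule and values below are
toy assumptions.

    import numpy as np

    def learned_reverse_variance(v, beta_t, beta_tilde_t):
        """Interpolate between the upper bound beta_t and the lower bound
        beta_tilde_t on the reverse-process variance, in log space."""
        return np.exp(v * np.log(beta_t) + (1.0 - v) * np.log(beta_tilde_t))

    betas = np.linspace(1e-4, 0.02, 1000)               # toy linear beta schedule
    alphas_bar = np.cumprod(1.0 - betas)
    t = 500
    beta_tilde_t = (1.0 - alphas_bar[t - 1]) / (1.0 - alphas_bar[t]) * betas[t]
    print(learned_reverse_variance(v=0.3, beta_t=betas[t], beta_tilde_t=beta_tilde_t))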
Argmax flows and multinomial diffusion: Towards non-autoregressive language models
February 10, 2021
Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, Max Welling
Generative flows and diffusion models have been predominantly trained on
ordinal data, for example natural images. This paper introduces two extensions
of flows and diffusion for categorical data such as language or image
segmentation: Argmax Flows and Multinomial Diffusion. Argmax Flows are defined
by a composition of a continuous distribution (such as a normalizing flow), and
an argmax function. To optimize this model, we learn a probabilistic inverse
for the argmax that lifts the categorical data to a continuous space.
Multinomial Diffusion gradually adds categorical noise in a diffusion process,
for which the generative denoising process is learned. We demonstrate that our
method outperforms existing dequantization approaches on text modelling and
modelling on image segmentation maps in log-likelihood.
Autoregressive Denoising Diffusion Models for Multivariate Probabilistic Time Series Forecasting
January 28, 2021
Kashif Rasul, Calvin Seward, Ingmar Schuster, Roland Vollgraf
In this work, we propose \texttt{TimeGrad}, an autoregressive model for
multivariate probabilistic time series forecasting which samples from the data
distribution at each time step by estimating its gradient. To this end, we use
diffusion probabilistic models, a class of latent variable models closely
connected to score matching and energy-based methods. Our model learns
gradients by optimizing a variational bound on the data likelihood and at
inference time converts white noise into a sample of the distribution of
interest through a Markov chain using Langevin sampling. We demonstrate
experimentally that the proposed autoregressive denoising diffusion model is
the new state-of-the-art multivariate probabilistic forecasting method on
real-world data sets with thousands of correlated dimensions. We hope that this
method is a useful tool for practitioners and lays the foundation for future
research in this area.
Stochastic Image Denoising by Sampling from the Posterior Distribution
January 23, 2021
Bahjat Kawar, Gregory Vaksman, Michael Elad
Image denoising is a well-known and well studied problem, commonly targeting
a minimization of the mean squared error (MSE) between the outcome and the
original image. Unfortunately, especially for severe noise levels, such Minimum
MSE (MMSE) solutions may lead to blurry output images. In this work we propose
a novel stochastic denoising approach that produces viable and high perceptual
quality results, while maintaining a small MSE. Our method employs Langevin
dynamics that relies on a repeated application of any given MMSE denoiser,
obtaining the reconstructed image by effectively sampling from the posterior
distribution. Due to its stochasticity, the proposed algorithm can produce a
variety of high-quality outputs for a given noisy input, all shown to be
legitimate denoising results. In addition, we present an extension of our
algorithm for handling the inpainting problem, recovering missing pixels while
removing noise from partially given data.
Maximum likelihood training of score-based diffusion models
January 22, 2021
Yang Song, Conor Durkan, Iain Murray, Stefano Ermon
Score-based diffusion models synthesize samples by reversing a stochastic
process that diffuses data to noise, and are trained by minimizing a weighted
combination of score matching losses. The log-likelihood of score-based
diffusion models can be tractably computed through a connection to continuous
normalizing flows, but log-likelihood is not directly optimized by the weighted
combination of score matching losses. We show that for a specific weighting
scheme, the objective upper bounds the negative log-likelihood, thus enabling
approximate maximum likelihood training of score-based diffusion models. We
empirically observe that maximum likelihood training consistently improves the
likelihood of score-based diffusion models across multiple datasets, stochastic
processes, and model architectures. Our best models achieve negative
log-likelihoods of 2.83 and 3.76 bits/dim on CIFAR-10 and ImageNet 32x32
without any data augmentation, on a par with state-of-the-art autoregressive
models on these tasks.
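A sketch of a denoising score-matching loss with the likelihood weighting
lambda(t) = g(t)^2 that the abstract alludes to; the toy VE-style forward
process, its marginal standard deviation, and the placeholder network are all
assumptions rather than the paper's exact setup.

    import torch

    def weighted_dsm_loss(score_model, x0, t, g, marginal_std):
        """Score matching weighted by g(t)^2 (the 'likelihood weighting')."""
        std = marginal_std(t).view(-1, 1)
        eps = torch.randn_like(x0)
        x_t = x0 + std * eps                      # VE-style perturbation (assumption)
        target = -eps / std                       # score of the perturbation kernel
        w = g(t).view(-1, 1) ** 2
        return (w * (score_model(x_t, t) - target) ** 2).sum(-1).mean()

    score_model = lambda x, t: torch.zeros_like(x)   # placeholder network
    g = lambda t: torch.sqrt(2.0 * t)                # toy VE diffusion coefficient
    marginal_std = lambda t: t                       # matching toy marginal std
    x0, t = torch.randn(16, 8), torch.rand(16) + 0.1
    print(weighted_dsm_loss(score_model, x0, t, g, marginal_std).item())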
Knowledge distillation in iterative generative models for improved sampling speed
January 07, 2021
Eric Luhman, Troy Luhman
Iterative generative models, such as noise conditional score networks and
denoising diffusion probabilistic models, produce high quality samples by
gradually denoising an initial noise vector. However, their denoising process
has many steps, making them 2-3 orders of magnitude slower than other
generative models such as GANs and VAEs. In this paper, we establish a novel
connection between knowledge distillation and image generation with a technique
that distills a multi-step denoising process into a single step, resulting in a
sampling speed similar to other single-step generative models. Our Denoising
Student generates high quality samples comparable to GANs on the CIFAR-10 and
CelebA datasets, without adversarial training. We demonstrate that our method
scales to higher resolutions through experiments on 256 x 256 LSUN. Code and
checkpoints are available at https://github.com/tcl9876/Denoising_Student
Learning energy-based models by diffusion recovery likelihood
December 15, 2020
Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, Diederik P. Kingma
While energy-based models (EBMs) exhibit a number of desirable properties,
training and sampling on high-dimensional datasets remains challenging.
Inspired by recent progress on diffusion probabilistic models, we present a
diffusion recovery likelihood method to tractably learn and sample from a
sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM
is trained with recovery likelihood, which maximizes the conditional
probability of the data at a certain noise level given their noisy versions at
a higher noise level. Optimizing recovery likelihood is more tractable than
marginal likelihood, as sampling from the conditional distributions is much
easier than sampling from the marginal distributions. After training,
synthesized images can be generated by the sampling process that initializes
from Gaussian white noise distribution and progressively samples the
conditional distributions at decreasingly lower noise levels. Our method
generates high fidelity samples on various image datasets. On unconditional
CIFAR-10 our method achieves FID 9.58 and inception score 8.30, superior to the
majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs,
our long-run MCMC samples from the conditional distributions do not diverge and
still represent realistic images, allowing us to accurately estimate the
normalized density of data even for high-dimensional datasets. Our
implementation is available at https://github.com/ruiqigao/recovery_likelihood.
Score-Based Generative Modeling through Stochastic Differential Equations
November 26, 2020
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole
Creating noise from data is easy; creating data from noise is generative
modeling. We present a stochastic differential equation (SDE) that smoothly
transforms a complex data distribution to a known prior distribution by slowly
injecting noise, and a corresponding reverse-time SDE that transforms the prior
distribution back into the data distribution by slowly removing the noise.
Crucially, the reverse-time SDE depends only on the time-dependent gradient
field (a.k.a. score) of the perturbed data distribution. By leveraging advances
in score-based generative modeling, we can accurately estimate these scores
with neural networks, and use numerical SDE solvers to generate samples. We
show that this framework encapsulates previous approaches in score-based
generative modeling and diffusion probabilistic modeling, allowing for new
sampling procedures and new modeling capabilities. In particular, we introduce
a predictor-corrector framework to correct errors in the evolution of the
discretized reverse-time SDE. We also derive an equivalent neural ODE that
samples from the same distribution as the SDE, but additionally enables exact
likelihood computation, and improved sampling efficiency. In addition, we
provide a new way to solve inverse problems with score-based models, as
demonstrated with experiments on class-conditional generation, image
inpainting, and colorization. Combined with multiple architectural
improvements, we achieve record-breaking performance for unconditional image
generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a
competitive likelihood of 2.99 bits/dim, and demonstrate high fidelity
generation of 1024 x 1024 images for the first time from a score-based
generative model.
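A minimal predictor-corrector sketch: a reverse-time Euler-Maruyama predictor
step followed by Langevin corrector steps that use the same score estimate; the
toy score, drift, diffusion coefficient, and corrector step size are
placeholders, not the paper's tuned choices.

    import numpy as np

    def pc_sampler(score, f, g, x, t1=1.0, t0=1e-3, n_steps=500, corrector_steps=1,
                   langevin_eps=1e-4, rng=None):
        """Sample the reverse-time SDE dX = [f - g^2 * score] dt + g dW-bar."""
        rng = rng or np.random.default_rng(0)
        dt = (t1 - t0) / n_steps
        for i in range(n_steps):
            t = t1 - i * dt
            # Predictor: one reverse-time Euler-Maruyama step.
            drift = f(x, t) - g(t) ** 2 * score(x, t)
            x = x - drift * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
            # Corrector: a few Langevin MCMC steps at the new time.
            for _ in range(corrector_steps):
                x = x + langevin_eps * score(x, t - dt) \
                      + np.sqrt(2 * langevin_eps) * rng.standard_normal(x.shape)
        return x

    score = lambda x, t: -x                              # score of a standard normal
    x = pc_sampler(score, f=lambda x, t: np.zeros_like(x), g=lambda t: 1.0,
                   x=np.random.default_rng(1).standard_normal(2000))
    print(x.mean(), x.std())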
Probabilistic Mapping of Dark Matter by Neural Score Matching
November 16, 2020
Benjamin Remy, Francois Lanusse, Zaccharie Ramzi, Jia Liu, Niall Jeffrey, Jean-Luc Starck
The Dark Matter present in the Large-Scale Structure of the Universe is
invisible, but its presence can be inferred through the small gravitational
lensing effect it has on the images of far away galaxies. By measuring this
lensing effect on a large number of galaxies it is possible to reconstruct maps
of the Dark Matter distribution on the sky. This, however, represents an
extremely challenging inverse problem due to missing data and noise dominated
measurements. In this work, we present a novel methodology for addressing such
inverse problems by combining elements of Bayesian statistics, analytic
physical theory, and a recent class of Deep Generative Models based on Neural
Score Matching. This approach allows to do the following: (1) make full use of
analytic cosmological theory to constrain the 2pt statistics of the solution,
(2) learn from cosmological simulations any differences between this analytic
prior and full simulations, and (3) obtain samples from the full Bayesian
posterior of the problem for robust Uncertainty Quantification. We present an
application of this methodology on the first deep-learning-assisted Dark Matter
map reconstruction of the Hubble Space Telescope COSMOS field.
Denoising Score-Matching for Uncertainty Quantification in Inverse Problems
November 16, 2020
Zaccharie Ramzi, Benjamin Remy, Francois Lanusse, Jean-Luc Starck, Philippe Ciuciu
stat.ML, cs.CV, cs.LG, eess.SP, physics.med-ph
Deep neural networks have proven extremely efficient at solving a wide range
of inverse problems, but most often the uncertainty on the solution they
provide is hard to quantify. In this work, we propose a generic Bayesian
framework for solving inverse problems, in which we limit the use of deep
neural networks to learning a prior distribution on the signals to recover. We
adopt recent denoising score matching techniques to learn this prior from data,
and subsequently use it as part of an annealed Hamiltonian Monte-Carlo scheme
to sample the full posterior of image inverse problems. We apply this framework
to Magnetic Resonance Image (MRI) reconstruction and illustrate how this
approach not only yields high quality reconstructions but can also be used to
assess the uncertainty on particular features of a reconstructed image.
VoiceGrad: Non-Parallel Any-to-Many Voice Conversion with Annealed Langevin Dynamics
October 06, 2020
Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, Shogo Seki
In this paper, we propose a non-parallel any-to-many voice conversion (VC)
method termed VoiceGrad. Inspired by WaveGrad, a recently introduced novel
waveform generation method, VoiceGrad is based upon the concepts of score
matching and Langevin dynamics. It uses weighted denoising score matching to
train a score approximator, a fully convolutional network with a U-Net
structure designed to predict the gradient of the log density of the speech
feature sequences of multiple speakers, and performs VC by using annealed
Langevin dynamics to iteratively update an input feature sequence towards the
nearest stationary point of the target distribution based on the trained score
approximator network. Thanks to the nature of this concept, VoiceGrad enables
any-to-many VC, a VC scenario in which the speaker of input speech can be
arbitrary, and allows for non-parallel training, which requires no parallel
utterances or transcriptions.
Denoising diffusion implicit models
October 06, 2020
Jiaming Song, Chenlin Meng, Stefano Ermon
Denoising diffusion probabilistic models (DDPMs) have achieved high quality
image generation without adversarial training, yet they require simulating a
Markov chain for many steps to produce a sample. To accelerate sampling, we
present denoising diffusion implicit models (DDIMs), a more efficient class of
iterative implicit probabilistic models with the same training procedure as
DDPMs. In DDPMs, the generative process is defined as the reverse of a
Markovian diffusion process. We construct a class of non-Markovian diffusion
processes that lead to the same training objective, but whose reverse process
can be much faster to sample from. We empirically demonstrate that DDIMs can
produce high quality samples $10 \times$ to $50 \times$ faster in terms of
wall-clock time compared to DDPMs, allow us to trade off computation for sample
quality, and can perform semantically meaningful image interpolation directly
in the latent space.
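The deterministic (eta = 0) DDIM update, sketched with a placeholder noise
predictor: because the current and previous alpha-bar values need not come from
adjacent timesteps, whole blocks of steps can be skipped, which is where the
reported speed-up comes from.

    import numpy as np

    def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev):
        """One deterministic DDIM update from timestep t to an earlier one."""
        x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
        return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_hat

    betas = np.linspace(1e-4, 0.02, 1000)               # toy linear beta schedule
    alpha_bar = np.cumprod(1.0 - betas)
    eps_model = lambda x, t: np.zeros_like(x)           # placeholder noise predictor
    x = np.random.default_rng(0).standard_normal(16)    # start from pure noise
    for t in range(999, 0, -50):                        # skip 50 timesteps at a time
        x = ddim_step(x, eps_model(x, t), alpha_bar[t], alpha_bar[max(t - 50, 0)])
    print(x[:4])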
DiffWave: A versatile diffusion model for audio synthesis
September 21, 2020
Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, Bryan Catanzaro
eess.AS, cs.CL, cs.LG, cs.SD, stat.ML
In this work, we propose DiffWave, a versatile diffusion probabilistic model
for conditional and unconditional waveform generation. The model is
non-autoregressive, and converts the white noise signal into structured
waveform through a Markov chain with a constant number of steps at synthesis.
It is efficiently trained by optimizing a variant of the variational bound on the
data likelihood. DiffWave produces high-fidelity audios in different waveform
generation tasks, including neural vocoding conditioned on mel spectrogram,
class-conditional generation, and unconditional generation. We demonstrate that
DiffWave matches a strong WaveNet vocoder in terms of speech quality (MOS: 4.44
versus 4.43), while synthesizing orders of magnitude faster. In particular, it
significantly outperforms autoregressive and GAN-based waveform models in the
challenging unconditional generation task in terms of audio quality and sample
diversity from various automatic and human evaluations.
Adversarial score matching and improved sampling for image generation
September 11, 2020
Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, Ioannis Mitliagkas
Denoising Score Matching with Annealed Langevin Sampling (DSM-ALS) has
recently found success in generative modeling. The approach works by first
training a neural network to estimate the score of a distribution, and then
using Langevin dynamics to sample from the data distribution assumed by the
score network. Despite the convincing visual quality of samples, this method
appears to perform worse than Generative Adversarial Networks (GANs) under the
Fréchet Inception Distance, a standard metric for generative models.
We show that this apparent gap vanishes when denoising the final Langevin
samples using the score network. In addition, we propose two improvements to
DSM-ALS: 1) Consistent Annealed Sampling as a more stable alternative to
Annealed Langevin Sampling, and 2) a hybrid training formulation, composed of
both Denoising Score Matching and adversarial objectives. By combining these
two techniques and exploring different network architectures, we elevate score
matching methods and obtain results competitive with state-of-the-art image
generation on CIFAR-10.
WaveGrad: Estimating Gradients for Waveform Generation
September 02, 2020
Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, William Chan
eess.AS, cs.LG, cs.SD, stat.ML
This paper introduces WaveGrad, a conditional model for waveform generation
which estimates gradients of the data density. The model is built on prior work
on score matching and diffusion probabilistic models. It starts from a Gaussian
white noise signal and iteratively refines the signal via a gradient-based
sampler conditioned on the mel-spectrogram. WaveGrad offers a natural way to
trade inference speed for sample quality by adjusting the number of refinement
steps, and bridges the gap between non-autoregressive and autoregressive models
in terms of audio quality. We find that it can generate high fidelity audio
samples using as few as six iterations. Experiments reveal WaveGrad to generate
high fidelity audio, outperforming adversarial non-autoregressive baselines and
matching a strong likelihood-based autoregressive baseline using fewer
sequential operations. Audio samples are available at
https://wavegrad.github.io/.
Learning gradient fields for shape generation
August 14, 2020
Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, Bharath Hariharan
In this work, we propose a novel technique to generate shapes from point
cloud data. A point cloud can be viewed as samples from a distribution of 3D
points whose density is concentrated near the surface of the shape. Point cloud
generation thus amounts to moving randomly sampled points to high-density
areas. We generate point clouds by performing stochastic gradient ascent on an
unnormalized probability density, thereby moving sampled points toward the
high-likelihood regions. Our model directly predicts the gradient of the log
density field and can be trained with a simple objective adapted from
score-based generative models. We show that our method can reach
state-of-the-art performance for point cloud auto-encoding and generation,
while also allowing for extraction of a high-quality implicit surface. Code is
available at https://github.com/RuojinCai/ShapeGF.
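A minimal sketch of the sampling procedure described above, assuming a trained
network grad_field(points, sigma, shape_code) that predicts the gradient of the
(noise-perturbed) log-density at each 3D point; the names, signature, and
hyperparameters are illustrative, not the released ShapeGF API.

    import torch

    @torch.no_grad()
    def generate_point_cloud(grad_field, shape_code, n_points=2048,
                             sigmas=(1.0, 0.3, 0.1, 0.03), n_steps=25, eps=1e-4):
        x = torch.randn(n_points, 3)                 # randomly sampled initial 3D points
        for sigma in sigmas:                         # decreasing perturbation levels
            step = eps * (sigma / sigmas[-1]) ** 2   # shrink the step with the noise level
            for _ in range(n_steps):
                noise = torch.randn_like(x)
                x = x + step * grad_field(x, sigma, shape_code) + (2 * step) ** 0.5 * noise
        return x                                     # points concentrate near the surface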
Efficient Learning of Generative Models via Finite-Difference Score Matching
July 07, 2020
Tianyu Pang, Kun Xu, Chongxuan Li, Yang Song, Stefano Ermon, Jun Zhu
Several machine learning applications involve the optimization of
higher-order derivatives (e.g., gradients of gradients) during training, which
can be expensive with respect to memory and computation even with automatic
differentiation. As a typical example in generative modeling, score matching
(SM) involves the optimization of the trace of a Hessian. To improve computing
efficiency, we rewrite the SM objective and its variants in terms of
directional derivatives, and present a generic strategy to efficiently
approximate any-order directional derivative with finite difference (FD). Our
approximation only involves function evaluations, which can be executed in
parallel, and no gradient computations. Thus, it reduces the total
computational cost while also improving numerical stability. We provide two
instantiations by reformulating variants of SM objectives into the FD forms.
Empirically, we demonstrate that our methods produce results comparable to the
gradient-based counterparts while being much more computationally efficient.
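The core trick can be illustrated in a few lines: any first- or second-order
directional derivative is approximated by central finite differences using only
function evaluations, with no backpropagation. Here f maps a batch of inputs to
scalars (e.g., an unnormalized log-density); the names are illustrative and this
is not the paper's full training code.

    import torch

    def fd_directional_derivatives(f, x, v, delta=1e-3):
        # normalize the direction so that delta has a consistent scale across inputs
        v = v / v.flatten(1).norm(dim=1).view(-1, *([1] * (x.dim() - 1)))
        f_plus, f_mid, f_minus = f(x + delta * v), f(x), f(x - delta * v)
        first = (f_plus - f_minus) / (2 * delta)              # ~ v^T grad f(x)
        second = (f_plus - 2 * f_mid + f_minus) / delta ** 2  # ~ v^T Hessian_f(x) v
        return first, second

    # sanity check on a quadratic, where both approximations are exact
    f = lambda x: 0.5 * (x ** 2).sum(dim=1)
    x = torch.randn(4, 8, dtype=torch.float64)
    v = torch.randn(4, 8, dtype=torch.float64)
    first, second = fd_directional_derivatives(f, x, v)       # second is ~1 for unit directions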
Denoising diffusion probabilistic models
June 19, 2020
Jonathan Ho, Ajay Jain, Pieter Abbeel
We present high quality image synthesis results using diffusion probabilistic
models, a class of latent variable models inspired by considerations from
nonequilibrium thermodynamics. Our best results are obtained by training on a
weighted variational bound designed according to a novel connection between
diffusion probabilistic models and denoising score matching with Langevin
dynamics, and our models naturally admit a progressive lossy decompression
scheme that can be interpreted as a generalization of autoregressive decoding.
On the unconditional CIFAR10 dataset, we obtain an Inception score of 9.46 and
a state-of-the-art FID score of 3.17. On 256x256 LSUN, we obtain sample quality
similar to ProgressiveGAN. Our implementation is available at
https://github.com/hojonathanho/diffusion
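A compact sketch of the simplified epsilon-prediction objective this connection
leads to; the noise-prediction network eps_model is a placeholder, and the
linear beta schedule shown is one standard choice rather than the only
configuration in the released code.

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)                 # linear noise schedule
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)

    def ddpm_loss(eps_model, x0):
        t = torch.randint(0, T, (x0.shape[0],))           # random timestep per example
        a = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
        eps = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # diffuse x0 to step t in closed form
        return ((eps - eps_model(x_t, t)) ** 2).mean()    # predict the added noise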
Rethinking the Role of Gradient-Based Attribution Methods for Model Interpretability
June 16, 2020
Suraj Srinivas, Francois Fleuret
Current methods for the interpretability of discriminative deep neural
networks commonly rely on the model’s input-gradients, i.e., the gradients of
the output logits w.r.t. the inputs. The common assumption is that these
input-gradients contain information regarding $p_{\theta} ( y \mid x)$, the
model’s discriminative capabilities, thus justifying their use for
interpretability. However, in this work we show that these input-gradients can
be arbitrarily manipulated as a consequence of the shift-invariance of softmax
without changing the discriminative function. This leaves an open question: if
input-gradients can be arbitrary, why are they highly structured and
explanatory in standard models?
We investigate this by re-interpreting the logits of standard softmax-based
classifiers as unnormalized log-densities of the data distribution and show
that input-gradients can be viewed as gradients of a class-conditional density
model $p_{\theta}(x \mid y)$ implicit within the discriminative model. This
leads us to hypothesize that the highly structured and explanatory nature of
input-gradients may be due to the alignment of this class-conditional model
$p_{\theta}(x \mid y)$ with that of the ground truth data distribution
$p_{\text{data}} (x \mid y)$. We test this hypothesis by studying the effect of
density alignment on gradient explanations. To achieve this alignment we use
score-matching, and propose novel approximations to this algorithm to enable
training large-scale models.
Our experiments show that improving the alignment of the implicit density
model with the data distribution enhances gradient structure and explanatory
power while reducing this alignment has the opposite effect. Overall, our
finding that input-gradients capture information regarding an implicit
generative model implies that we need to re-think their use for interpreting
discriminative models.
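The shift-invariance argument is easy to verify numerically: adding an arbitrary
input-dependent scalar g(x) to every logit leaves the softmax outputs, and hence
the classifier's predictions, unchanged, while the input-gradients of the logits
change. The toy linear "model" below is purely illustrative.

    import torch

    x = torch.randn(5, dtype=torch.float64, requires_grad=True)
    W = torch.randn(3, 5, dtype=torch.float64)

    def logits(x):
        return W @ x

    def shifted_logits(x):
        return W @ x + 10.0 * (x ** 2).sum()              # the same g(x) added to every class logit

    # identical predictions ...
    print(torch.allclose(torch.softmax(logits(x), 0), torch.softmax(shifted_logits(x), 0)))  # True
    # ... but different input-gradients of the class-0 logit
    g1, = torch.autograd.grad(logits(x)[0], x)
    g2, = torch.autograd.grad(shifted_logits(x)[0], x)
    print(torch.allclose(g1, g2))                          # False: the gradients were manipulated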
Improved techniques for training score-based generative models
June 16, 2020
Yang Song, Stefano Ermon
Score-based generative models can produce high quality image samples
comparable to GANs, without requiring adversarial optimization. However,
existing training procedures are limited to images of low resolution (typically
below 32x32), and can be unstable under some settings. We provide a new
theoretical analysis of learning and sampling from score models in high
dimensional spaces, explaining existing failure modes and motivating new
solutions that generalize across datasets. To enhance stability, we also
propose to maintain an exponential moving average of model weights. With these
improvements, we can effortlessly scale score-based generative models to images
with unprecedented resolutions ranging from 64x64 to 256x256. Our score-based
models can generate high-fidelity samples that rival best-in-class GANs on
various image datasets, including CelebA, FFHQ, and multiple LSUN categories.
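The exponential moving average of model weights mentioned in the abstract can be
sketched as follows; the decay value and usage pattern are typical choices
rather than the paper's exact released configuration. The shadow copy is updated
after every optimizer step and used for sampling.

    import copy
    import torch

    class EMA:
        def __init__(self, model, decay=0.999):
            self.decay = decay
            self.shadow = copy.deepcopy(model).eval()      # frozen copy used for sampling
            for p in self.shadow.parameters():
                p.requires_grad_(False)

        @torch.no_grad()
        def update(self, model):
            # shadow <- decay * shadow + (1 - decay) * current weights
            for s, p in zip(self.shadow.parameters(), model.parameters()):
                s.mul_(self.decay).add_(p, alpha=1 - self.decay)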
Permutation invariant graph generation via score-based generative modeling
March 02, 2020
Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, Stefano Ermon
Learning generative models for graph-structured data is challenging because
graphs are discrete, combinatorial, and the underlying data distribution is
invariant to the ordering of nodes. However, most of the existing generative
models for graphs are not invariant to the chosen ordering, which might lead to
an undesirable bias in the learned distribution. To address this difficulty, we
propose a permutation invariant approach to modeling graphs, using the recent
framework of score-based generative modeling. In particular, we design a
permutation equivariant, multi-channel graph neural network to model the
gradient of the data distribution at the input graph (a.k.a., the score
function). This permutation equivariant model of gradients implicitly defines a
permutation invariant distribution for graphs. We train this graph neural
network with score matching and sample from it with annealed Langevin dynamics.
In our experiments, we first demonstrate the capacity of this new architecture
in learning discrete graph algorithms. For graph generation, we find that our
learning approach achieves better or comparable results to existing models on
benchmark datasets.
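The equivariance property the construction relies on can be checked with a toy
message-passing layer (not the paper's multi-channel architecture): the update
H = relu(A X W1 + X W2) commutes with any relabeling of nodes, so stacking such
layers yields a permutation-equivariant model of the score.

    import torch

    torch.manual_seed(0)
    n, d = 6, 4
    A = torch.randint(0, 2, (n, n)).float()
    A = torch.triu(A, 1); A = A + A.T                     # symmetric adjacency, no self-loops
    X = torch.randn(n, d)
    W1, W2 = torch.randn(d, d), torch.randn(d, d)

    def layer(A, X):
        return torch.relu(A @ X @ W1 + X @ W2)            # simple equivariant message passing

    P = torch.eye(n)[torch.randperm(n)]                   # random permutation matrix
    lhs = layer(P @ A @ P.T, P @ X)                       # relabel nodes, then apply the layer
    rhs = P @ layer(A, X)                                 # apply the layer, then relabel
    print(torch.allclose(lhs, rhs, atol=1e-5))            # True: the layer is equivariant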
Source Separation with Deep Generative Prior
February 19, 2020
Vivek Jayaram, John Thickstun
Despite substantial progress in signal source separation, results for richly
structured data continue to contain perceptible artifacts. In contrast, recent
deep generative models can produce authentic samples in a variety of domains
that are indistinguishable from samples of the data distribution. This paper
introduces a Bayesian approach to source separation that uses generative models
as priors over the components of a mixture of sources, and noise-annealed
Langevin dynamics to sample from the posterior distribution of sources given a
mixture. This decouples the source separation problem from generative modeling,
enabling us to directly use cutting-edge generative models as priors. The
method achieves state-of-the-art performance for MNIST digit separation. We
introduce new methodology for evaluating separation quality on richer datasets,
providing quantitative evaluation of separation results on CIFAR-10. We also
provide qualitative results on LSUN.
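The separation procedure can be sketched as follows, assuming noise-conditioned
score models score1 and score2 for the two source domains (names, signature, and
schedule are illustrative; the authors' exact update may differ). With the
mixture constraint m = s1 + s2, the second source is parameterized as m - s1 and
noise-annealed Langevin dynamics is run on s1 alone.

    import torch

    @torch.no_grad()
    def separate(mixture, score1, score2, sigmas=(1.0, 0.5, 0.2, 0.1, 0.05),
                 n_steps=100, eps=2e-5):
        s1 = torch.rand_like(mixture)                      # initial guess for source 1
        for sigma in sigmas:                               # anneal the noise level downward
            step = eps * (sigma / sigmas[-1]) ** 2
            for _ in range(n_steps):
                # gradient of log p1(s1) + log p2(mixture - s1) with respect to s1
                grad = score1(s1, sigma) - score2(mixture - s1, sigma)
                s1 = s1 + 0.5 * step * grad + step ** 0.5 * torch.randn_like(s1)
        return s1, mixture - s1                            # the two recovered sources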
Generative Modeling by Estimating Gradients of the Data Distribution
July 12, 2019
Yang Song, Stefano Ermon
We introduce a new generative model where samples are produced via Langevin
dynamics using gradients of the data distribution estimated with score
matching. Because gradients can be ill-defined and hard to estimate when the
data resides on low-dimensional manifolds, we perturb the data with different
levels of Gaussian noise, and jointly estimate the corresponding scores, i.e.,
the vector fields of gradients of the perturbed data distribution for all noise
levels. For sampling, we propose an annealed Langevin dynamics where we use
gradients corresponding to gradually decreasing noise levels as the sampling
process gets closer to the data manifold. Our framework allows flexible model
architectures, requires no sampling during training or the use of adversarial
methods, and provides a learning objective that can be used for principled
model comparisons. Our models produce samples comparable to GANs on MNIST,
CelebA and CIFAR-10 datasets, achieving a new state-of-the-art inception score
of 8.87 on CIFAR-10. Additionally, we demonstrate that our models learn
effective representations via image inpainting experiments.
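The annealed Langevin sampler described above can be condensed to a few lines;
score_model(x, sigma) stands in for the noise-conditional score network, and the
hyperparameters are typical values rather than the paper's exact settings.

    import torch

    @torch.no_grad()
    def annealed_langevin(score_model, shape, sigmas, n_steps=100, eps=2e-5):
        x = torch.rand(shape)                              # uniform initialization
        for sigma in sigmas:                               # noise levels ordered large -> small
            step = eps * (sigma / sigmas[-1]) ** 2         # step size shrinks with the noise level
            for _ in range(n_steps):
                z = torch.randn_like(x)
                x = x + 0.5 * step * score_model(x, sigma) + step ** 0.5 * z
        return x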
Deep Unsupervised Learning using Nonequilibrium Thermodynamics
March 12, 2015
Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, Surya Ganguli
cs.LG, cond-mat.dis-nn, q-bio.NC, stat.ML
A central problem in machine learning involves modeling complex data-sets
using highly flexible families of probability distributions in which learning,
sampling, inference, and evaluation are still analytically or computationally
tractable. Here, we develop an approach that simultaneously achieves both
flexibility and tractability. The essential idea, inspired by non-equilibrium
statistical physics, is to systematically and slowly destroy structure in a
data distribution through an iterative forward diffusion process. We then learn
a reverse diffusion process that restores structure in data, yielding a highly
flexible and tractable generative model of the data. This approach allows us to
rapidly learn, sample from, and evaluate probabilities in deep generative
models with thousands of layers or time steps, as well as to compute
conditional and posterior probabilities under the learned model. We
additionally release an open source reference implementation of the algorithm.
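The forward process that slowly destroys structure is just repeated Gaussian
corruption; a short sketch is below, with an arbitrary illustrative beta
schedule. Learning to reverse this trajectory is what yields the generative
model.

    import torch

    def forward_diffuse(x0, T=1000, beta_min=1e-4, beta_max=0.02):
        betas = torch.linspace(beta_min, beta_max, T)
        x, trajectory = x0, [x0]
        for beta in betas:                                 # q(x_t | x_{t-1}) = N(sqrt(1-beta) x_{t-1}, beta I)
            x = (1 - beta).sqrt() * x + beta.sqrt() * torch.randn_like(x)
            trajectory.append(x)
        return trajectory                                  # trajectory[-1] is close to isotropic Gaussian noise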