Adversarial Score Distillation: When score distillation meets GAN
December 01, 2023
Min Wei, Jingkai Zhou, Junyao Sun, Xuesong Zhang
Existing score distillation methods are sensitive to the classifier-free
guidance (CFG) scale, manifesting as over-smoothness or instability at small
CFG scales and over-saturation at large ones. To explain and analyze these issues, we
revisit the derivation of Score Distillation Sampling (SDS) and decipher
existing score distillation with the Wasserstein Generative Adversarial Network
(WGAN) paradigm. With the WGAN paradigm, we find that existing score
distillation either employs a fixed sub-optimal discriminator or conducts
incomplete discriminator optimization, resulting in the scale-sensitive issue.
We propose the Adversarial Score Distillation (ASD), which maintains an
optimizable discriminator and updates it using the complete optimization
objective. Experiments show that the proposed ASD performs favorably in 2D
distillation and text-to-3D tasks against existing methods. Furthermore, to
explore the generalization ability of our WGAN paradigm, we extend ASD to the
image editing task, which achieves competitive results. The project page and
code are at https://github.com/2y7c3/ASD.
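For reference, the Score Distillation Sampling gradient that the paper revisits is usually written (following DreamFusion) as
\[
\nabla_\theta \mathcal{L}_{\mathrm{SDS}} \;=\; \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \,\right],
\qquad x = g(\theta),\quad x_t = \sqrt{\bar\alpha_t}\, x + \sqrt{1-\bar\alpha_t}\,\epsilon,
\]
where $\epsilon_\phi$ is the pretrained (CFG-guided) denoiser and $g(\theta)$ renders the optimized parameters (e.g., a NeRF or a 2D image). This is only the standard form for context; the paper's contribution is the WGAN reading of these terms and the resulting optimizable discriminator, not a change to this formula.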
TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models
December 01, 2023
Pengxiang Li, Zhili Liu, Kai Chen, Lanqing Hong, Yunzhi Zhuge, Dit-Yan Yeung, Huchuan Lu, Xu Jia
Diffusion models have gained prominence in generating data for perception
tasks such as image classification and object detection. However, the potential
in generating high-quality tracking sequences, a crucial aspect in the field of
video perception, has not been fully investigated. To address this gap, we
propose TrackDiffusion, a novel architecture designed to generate continuous
video sequences from tracklets. TrackDiffusion represents a significant
departure from traditional layout-to-image (L2I) generation and copy-paste
synthesis, which focus on static image elements such as bounding boxes: it
empowers image diffusion models to encompass dynamic and continuous tracking
trajectories, thereby capturing complex motion nuances and ensuring instance
consistency across video frames. For the first time, we demonstrate that the
generated video sequences can be utilized for training multi-object tracking
(MOT) systems, leading to significant improvement in tracker performance.
Experimental results show that our model significantly enhances instance
consistency in generated video sequences, leading to improved perceptual
metrics. Our approach achieves an improvement of 8.7 in TrackAP and 11.8 in
TrackAP$_{50}$ on the YTVIS dataset, underscoring its potential to redefine the
standards of video data generation for MOT tasks and beyond.
DREAM: Diffusion Rectification and Estimation-Adaptive Models
November 30, 2023
Jinxin Zhou, Tianyu Ding, Tianyi Chen, Jiachen Jiang, Ilya Zharkov, Zhihui Zhu, Luming Liang
We present DREAM, a novel training framework representing Diffusion
Rectification and Estimation-Adaptive Models, requiring minimal code changes
(just three lines) yet significantly enhancing the alignment of training with
sampling in diffusion models. DREAM features two components: diffusion
rectification, which adjusts training to reflect the sampling process, and
estimation adaptation, which balances perception against distortion. When
applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff
between minimizing distortion and preserving high image quality. Experiments
demonstrate DREAM’s superiority over standard diffusion-based SR methods,
showing a $2$ to $3\times $ faster training convergence and a $10$ to
$20\times$ reduction in necessary sampling steps to achieve comparable or
superior results. We hope DREAM will inspire a rethinking of diffusion model
training paradigms.
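As an illustration of the general idea only (not necessarily the paper's exact three-line change), a self-estimate feedback step that brings training inputs closer to what the sampler actually sees might look like the following PyTorch sketch; `model`, `alpha_bar`, and `lam` are assumed names.

```python
import torch
import torch.nn.functional as F

def dream_like_step(model, x0, t, alpha_bar, lam=1.0):
    """Hedged sketch of self-estimate feedback during diffusion training.
    Not the paper's exact recipe: the mixing rule and lam are assumptions."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps            # standard forward noising
    with torch.no_grad():
        eps_hat = model(x_t, t)                            # model's own noise estimate
    eps_mix = eps + lam * eps_hat                          # feed the estimate back
    x_t_mix = a.sqrt() * x0 + (1 - a).sqrt() * eps_mix     # rectified training input
    return F.mse_loss(model(x_t_mix, t), eps_mix)          # predict the mixed noise
```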
S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion
November 30, 2023
Or Greenberg, Eran Kishon, Dani Lischinski
Image-to-image translation (I2IT) refers to the process of transforming
images from a source domain to a target domain while maintaining a fundamental
connection in terms of image content. In the past few years, remarkable
advancements in I2IT were achieved by Generative Adversarial Networks (GANs),
which nevertheless struggle with translations requiring high precision.
Recently, Diffusion Models have established themselves as the engine of choice
for image generation. In this paper we introduce S2ST, a novel framework
designed to accomplish global I2IT in complex photorealistic images, such as
day-to-night or clear-to-rain translations of automotive scenes. S2ST operates
within the seed space of a Latent Diffusion Model, thereby leveraging the
powerful image priors learned by the latter. We show that S2ST surpasses
state-of-the-art GAN-based I2IT methods, as well as diffusion-based approaches,
for complex automotive scenes, improving fidelity while respecting the target
domain’s appearance across a variety of domains. Notably, S2ST obviates the
necessity for training domain-specific translation networks.
Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction
November 30, 2023
Hsin-Ying Lee, Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
Content generated by recent advanced Text-to-Image (T2I) diffusion models is
sometimes too imaginative for existing off-the-shelf semantic property
predictors to estimate, owing to an unmitigable domain gap. We introduce DMP, a
pipeline utilizing pre-trained T2I models as a prior for pixel-level semantic
prediction tasks. To address the misalignment between deterministic prediction
tasks and stochastic T2I models, we reformulate the diffusion process through a
sequence of interpolations, establishing a deterministic mapping between input
RGB images and output prediction distributions. To preserve generalizability,
we use low-rank adaptation to fine-tune pre-trained models. Extensive
experiments across five tasks, including 3D property estimation, semantic
segmentation, and intrinsic image decomposition, showcase the efficacy of the
proposed method. Despite limited-domain training data, the approach yields
faithful estimations for arbitrary images, surpassing existing state-of-the-art
algorithms.
One-step Diffusion with Distribution Matching Distillation
November 30, 2023
Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Frédo Durand, William T. Freeman, Taesung Park
Diffusion models generate high-quality images but require dozens of forward
passes. We introduce Distribution Matching Distillation (DMD), a procedure to
transform a diffusion model into a one-step image generator with minimal impact
on image quality. We enforce the one-step image generator to match the
diffusion model at the distribution level by minimizing an approximate KL
divergence whose gradient can be expressed as the difference between two score
functions: one for the target distribution and the other for the synthetic
distribution produced by our one-step generator. The score functions are parameterized as
two diffusion models trained separately on each distribution. Combined with a
simple regression loss matching the large-scale structure of the multi-step
diffusion outputs, our method outperforms all published few-step diffusion
approaches, reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot
COCO-30k, comparable to Stable Diffusion but orders of magnitude faster.
Utilizing FP16 inference, our model can generate images at 20 FPS on modern
hardware.
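Schematically, with a one-step generator $G_\theta$ and score estimates $s_{\mathrm{real}}$ and $s_{\mathrm{fake}}$ for the diffused real and synthetic distributions, the distribution-matching gradient described above takes the form
\[
\nabla_\theta D_{\mathrm{KL}}\big(p_{\mathrm{fake}} \,\|\, p_{\mathrm{real}}\big)
\;\approx\; \mathbb{E}_{z,t}\!\left[\, w_t \big(s_{\mathrm{fake}}(x_t, t) - s_{\mathrm{real}}(x_t, t)\big)\, \frac{\partial G_\theta(z)}{\partial \theta} \,\right],
\]
where $x_t$ is a noised version of $G_\theta(z)$; the weighting $w_t$ and sign conventions here are illustrative, with the exact definitions left to the paper.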
Continual Diffusion with STAMINA: STack-And-Mask INcremental Adapters
November 30, 2023
James Seale Smith, Yen-Chang Hsu, Zsolt Kira, Yilin Shen, Hongxia Jin
Recent work has demonstrated a remarkable ability to customize text-to-image
diffusion models to multiple, fine-grained concepts in a sequential (i.e.,
continual) manner while only providing a few example images for each concept.
This setting is known as continual diffusion. Here, we ask the question: Can we
scale these methods to longer concept sequences without forgetting? Although
prior work mitigates the forgetting of previously learned concepts, we show
that its capacity to learn new tasks reaches saturation over longer sequences.
We address this challenge by introducing a novel method, STack-And-Mask
INcremental Adapters (STAMINA), which is composed of low-ranked
attention-masked adapters and customized MLP tokens. STAMINA is designed to
enhance the robust fine-tuning properties of LoRA for sequential concept
learning via learnable hard-attention masks parameterized with low rank MLPs,
enabling precise, scalable learning via sparse adaptation. Notably, all
introduced trainable parameters can be folded back into the model after
training, inducing no additional inference parameter costs. We show that
STAMINA outperforms the prior SOTA for the setting of text-to-image continual
customization on a 50-concept benchmark composed of landmarks and human faces,
with no stored replay data. Additionally, we extend our method to the setting
of continual learning for image classification, demonstrating that our gains
also translate to state-of-the-art performance on this standard benchmark.
DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars
November 30, 2023
Tobias Kirschstein, Simon Giebenhain, Matthias Nießner
DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person,
offering intuitive control over both pose and expression. We propose a
diffusion-based neural renderer that leverages generic 2D priors to produce
compelling images of faces. For coarse guidance of the expression and head
pose, we render a neural parametric head model (NPHM) from the target
viewpoint, which acts as a proxy geometry of the person. Additionally, to
enhance the modeling of intricate facial expressions, we condition
DiffusionAvatars directly on the expression codes obtained from NPHM via
cross-attention. Finally, to synthesize consistent surface details across
different viewpoints and expressions, we rig learnable spatial features to the
head’s surface via TriPlane lookup in NPHM’s canonical space. We train
DiffusionAvatars on RGB videos and corresponding tracked NPHM meshes of a
person and test the obtained avatars in both self-reenactment and animation
scenarios. Our experiments demonstrate that DiffusionAvatars generates
temporally consistent and visually appealing videos for novel poses and
expressions of a person, outperforming existing approaches.
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
November 30, 2023
Zhenyu Zhou, Defang Chen, Can Wang, Chun Chen
Sampling from diffusion models can be treated as solving the corresponding
ordinary differential equations (ODEs), with the aim of obtaining an accurate
solution with as few function evaluations (NFE) as possible.
Recently, various fast samplers utilizing higher-order ODE solvers have emerged
and achieved better performance than the initial first-order one. However,
these numerical methods inherently result in certain approximation errors,
which significantly degrade sample quality at extremely small NFE (e.g.,
around 5). In contrast, based on the geometric observation that each sampling
trajectory almost lies in a two-dimensional subspace embedded in the ambient
space, we propose Approximate MEan-Direction Solver (AMED-Solver) that
eliminates truncation errors by directly learning the mean direction for fast
diffusion sampling. Besides, our method can be easily used as a plugin to
further improve existing ODE-based samplers. Extensive experiments on image
synthesis with the resolution ranging from 32 to 256 demonstrate the
effectiveness of our method. With only 5 NFE, we achieve 7.14 FID on CIFAR-10,
13.75 FID on ImageNet 64$\times$64, and 12.79 FID on LSUN Bedroom. Our code is
available at https://github.com/zhyzhouu/amed-solver.
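For context, the sampling ODE that such solvers integrate is the probability-flow ODE,
\[
\frac{\mathrm{d}x}{\mathrm{d}t} \;=\; f(x,t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x),
\]
so each sampling step must approximate an integral of the score along the trajectory; AMED-Solver, per the abstract, replaces higher-order polynomial approximations of this step with a directly learned mean direction (details in the paper).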
On Exact Inversion of DPM-Solvers
November 30, 2023
Seongmin Hong, Kyeonghyun Lee, Suh Yoon Jeon, Hyewon Bae, Se Young Chun
Diffusion probabilistic models (DPMs) are a key component in modern
generative models. DPM-solvers have significantly reduced latency and enhanced
quality, but finding their exact inverse (i.e., the initial noise corresponding
to a given image) remains challenging. Here we investigate the
exact inversions for DPM-solvers and propose algorithms to perform them when
samples are generated by the first-order as well as higher-order DPM-solvers.
For each explicit denoising step in DPM-solvers, we formulate the inversion
using implicit methods such as gradient descent or the forward step method,
ensuring robustness to large classifier-free guidance, unlike the prior
approach based on fixed-point iteration. Experimental results demonstrate that
our proposed exact inversion methods significantly reduce the error of both
image and noise reconstructions, greatly enhance the ability to distinguish
invisible watermarks, and consistently prevent unintended background changes
during image editing. Project page:
\url{https://smhongok.github.io/inv-dpm.html}.
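As a rough illustration of the implicit-inversion idea (a sketch, not the authors' algorithm), any explicit solver step can be inverted by solving for its pre-image with gradient descent; `solver_step` below is a hypothetical callable implementing one DPM-solver update.

```python
import torch

def invert_solver_step(solver_step, x_out, x_init, n_iters=200, lr=0.05):
    """Find x such that solver_step(x) matches x_out, via gradient descent."""
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(solver_step(x), x_out)
        loss.backward()
        opt.step()
    return x.detach()
```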
Diffusion Models Without Attention
November 30, 2023
Jing Nathan Yan, Jiatao Gu, Alexander M. Rush
In recent advancements in high-fidelity image generation, Denoising Diffusion
Probabilistic Models (DDPMs) have emerged as a key player. However, their
application at high resolutions presents significant computational challenges.
Current methods, such as patchifying, expedite processes in UNet and
Transformer architectures but at the expense of representational capacity.
Addressing this, we introduce the Diffusion State Space Model (DiffuSSM), an
architecture that supplants attention mechanisms with a more scalable state
space model backbone. This approach effectively handles higher resolutions
without resorting to global compression, thus preserving detailed image
representation throughout the diffusion process. Our focus on FLOP-efficient
architectures in diffusion training marks a significant step forward.
Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions
demonstrate that DiffuSSMs are on par with or even outperform existing diffusion
models with attention modules in FID and Inception Score metrics while
significantly reducing total FLOP usage.
SMaRt: Improving GANs with Score Matching Regularity
November 30, 2023
Mengfei Xia, Yujun Shen, Ceyuan Yang, Ran Yi, Wenping Wang, Yong-jin Liu
Generative adversarial networks (GANs) usually struggle in learning from
highly diverse data, whose underlying manifold is complex. In this work, we
revisit the mathematical foundations of GANs, and theoretically reveal that the
native adversarial loss for GAN training is insufficient to fix the problem
that subsets of the generated data manifold with positive Lebesgue measure lie
outside the real data manifold. Instead, we find that score matching serves as a
valid solution to this issue thanks to its capability of persistently pushing
the generated data points towards the real data manifold. We thereby propose to
improve the optimization of GANs with score matching regularity (SMaRt).
Regarding the empirical evidence, we first design a toy example to show that
training GANs with the aid of a ground-truth score function can help reproduce
the real data distribution more accurately, and then confirm that our approach
can consistently boost the synthesis performance of various state-of-the-art
GANs on real-world datasets with pre-trained diffusion models acting as the
approximate score function. For instance, when training Aurora on the ImageNet
64x64 dataset, we manage to improve FID from 8.87 to 7.11, on par with the
performance of the one-step consistency model. The source code will be made public.
HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models
November 30, 2023
Zhonghao Wang, Wei Wei, Yang Zhao, Zhisheng Xiao, Mark Hasegawa-Johnson, Humphrey Shi, Tingbo Hou
cs.CV, cs.AI, cs.CL, cs.LG
This paper explores advancements in high-fidelity personalized image
generation through the utilization of pre-trained text-to-image diffusion
models. While previous approaches have made significant strides in generating
versatile scenes based on text descriptions and a few input images, challenges
persist in maintaining the subject fidelity within the generated images. In
this work, we introduce an innovative algorithm named HiFi Tuner to enhance the
appearance preservation of objects during personalized image generation. Our
proposed method employs a parameter-efficient fine-tuning framework, comprising
a denoising process and a pivotal inversion process. Key enhancements include
the utilization of mask guidance, a novel parameter regularization technique,
and the incorporation of step-wise subject representations to elevate the
sample fidelity. Additionally, we propose a reference-guided generation
approach that leverages the pivotal inversion of a reference image to mitigate
unwanted subject variations and artifacts. We further extend our method to a
novel image editing task: substituting the subject in an image through textual
manipulations. Experimental evaluations conducted on the DreamBooth dataset
using the Stable Diffusion model showcase promising results. Fine-tuning solely
on textual embeddings improves CLIP-T score by 3.6 points and improves DINO
score by 9.6 points over Textual Inversion. When fine-tuning all parameters,
HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2
points over DreamBooth, establishing a new state of the art.
DiffGEPCI: 3D MRI Synthesis from mGRE Signals using 2.5D Diffusion Model
November 29, 2023
Yuyang Hu, Satya V. V. N. Kothapalli, Weijie Gan, Alexander L. Sukstanskii, Gregory F. Wu, Manu Goyal, Dmitriy A. Yablonskiy, Ulugbek S. Kamilov
We introduce a new framework called DiffGEPCI for cross-modality generation
in magnetic resonance imaging (MRI) using a 2.5D conditional diffusion model.
DiffGEPCI can synthesize high-quality Fluid Attenuated Inversion Recovery
(FLAIR) and Magnetization Prepared-Rapid Gradient Echo (MPRAGE) images, without
acquiring corresponding measurements, by leveraging multi-Gradient-Recalled
Echo (mGRE) MRI signals as conditional inputs. DiffGEPCI operates in a two-step
fashion: it initially estimates a 3D volume slice-by-slice using the axial
plane and subsequently applies a refinement algorithm (referred to as 2.5D) to
enhance the quality of the coronal and sagittal planes. Experimental validation
on real mGRE data shows that DiffGEPCI achieves excellent performance,
surpassing generative adversarial networks (GANs) and traditional diffusion
models.
Do text-free diffusion models learn discriminative visual representations?
November 29, 2023
Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava
While many unsupervised learning models focus on one family of tasks, either
generative or discriminative, we explore the possibility of a unified
representation learner: a model which addresses both families of tasks
simultaneously. We identify diffusion models, a state-of-the-art method for
generative tasks, as a prime candidate. Such models involve training a U-Net to
iteratively predict and remove noise, and the resulting model can synthesize
high-fidelity, diverse, novel images. We find that the intermediate feature
maps of the U-Net are diverse, discriminative feature representations. We
propose a novel attention mechanism for pooling feature maps and further
leverage this mechanism as DifFormer, a transformer feature fusion of features
from different diffusion U-Net blocks and noise steps. We also develop DifFeed,
a novel feedback mechanism tailored to diffusion. We find that diffusion models
are better than GANs, and, with our fusion and feedback mechanisms, can compete
with state-of-the-art unsupervised image representation learning methods for
discriminative tasks - image classification with full and semi-supervision,
transfer for fine-grained classification, object detection and segmentation,
and semantic segmentation. Our project website
(https://mgwillia.github.io/diffssl/) and code
(https://github.com/soumik-kanad/diffssl) are available publicly.
SODA: Bottleneck Diffusion Models for Representation Learning
November 29, 2023
Drew A. Hudson, Daniel Zoran, Mateusz Malinowski, Andrew K. Lampinen, Andrew Jaegle, James L. McClelland, Loic Matthey, Felix Hill, Alexander Lerchner
We introduce SODA, a self-supervised diffusion model, designed for
representation learning. The model incorporates an image encoder, which
distills a source view into a compact representation, that, in turn, guides the
generation of related novel views. We show that by imposing a tight bottleneck
between the encoder and a denoising decoder, and leveraging novel view
synthesis as a self-supervised objective, we can turn diffusion models into
strong representation learners, capable of capturing visual semantics in an
unsupervised manner. To the best of our knowledge, SODA is the first diffusion
model to succeed at ImageNet linear-probe classification, and, at the same
time, it accomplishes reconstruction, editing and synthesis tasks across a wide
range of datasets. Further investigation reveals the disentangled nature of its
emergent latent space, that serves as an effective interface to control and
manipulate the model’s produced images. All in all, we aim to shed light on the
exciting and promising potential of diffusion models, not only for image
generation, but also for learning rich and robust representations.
Leveraging Graph Diffusion Models for Network Refinement Tasks
November 29, 2023
Puja Trivedi, Ryan Rossi, David Arbour, Tong Yu, Franck Dernoncourt, Sungchul Kim, Nedim Lipka, Namyong Park, Nesreen K. Ahmed, Danai Koutra
Most real-world networks are noisy and incomplete samples from an unknown
target distribution. Refining them by correcting corruptions or inferring
unobserved regions typically improves downstream performance. Inspired by the
impressive generative capabilities of diffusion models, which have been used to
correct corruptions in images, and by the similarity between “in-painting” and
filling in missing nodes and edges conditioned on the observed graph, we
propose a novel graph
generative framework, SGDM, which is based on subgraph diffusion. Our framework
not only improves the scalability and fidelity of graph diffusion models, but
also leverages the reverse process to perform novel, conditional generation
tasks. In particular, through extensive empirical analysis and a set of novel
metrics, we demonstrate that our proposed model effectively supports the
following refinement tasks for partially observable networks: T1: denoising
extraneous subgraphs, T2: expanding existing subgraphs and T3: performing
“style” transfer by regenerating a particular subgraph to match the
characteristics of a different node or subgraph.
Fair Text-to-Image Diffusion via Fair Mapping
November 29, 2023
Jia Li, Lijie Hu, Jingfeng Zhang, Tianhang Zheng, Hua Zhang, Di Wang
cs.CV, cs.AI, cs.CY, cs.LG
In this paper, we address the limitations of existing text-to-image diffusion
models in generating demographically fair results when given human-related
descriptions. These models often struggle to disentangle the target language
context from sociocultural biases, resulting in biased image generation. To
overcome this challenge, we propose Fair Mapping, a general, model-agnostic,
and lightweight approach that modifies a pre-trained text-to-image model by
controlling the prompt to achieve fair image generation. One key advantage of
our approach is its high efficiency. The training process only requires
updating a small number of parameters in an additional linear mapping network.
This not only reduces the computational cost but also accelerates the
optimization process. We first demonstrate the issue of bias in generated
results caused by language biases in text-guided diffusion models. By
developing a mapping network that projects language embeddings into an unbiased
space, we enable the generation of relatively balanced demographic results
based on a keyword specified in the prompt. With comprehensive experiments on
face image generation, we show that our method significantly improves image
generation performance when prompted with descriptions related to human faces.
By effectively addressing the issue of bias, we produce more fair and diverse
image outputs. This work contributes to the field of text-to-image generation
by enhancing the ability to generate images that accurately reflect the
intended demographic characteristics specified in the text.
Using Ornstein-Uhlenbeck Process to understand Denoising Diffusion Probabilistic Model and its Noise Schedules
November 29, 2023
Javier E. Santos, Yen Ting Lin
stat.ML, cond-mat.stat-mech, cs.AI, cs.LG, math-ph, math.MP
The aim of this short note is to show that the Denoising Diffusion
Probabilistic Model (DDPM), a non-homogeneous discrete-time Markov process, can
be represented by a time-homogeneous continuous-time Markov process observed at
non-uniformly sampled discrete times. Surprisingly, this continuous-time Markov
process is the well-known and well-studied Ornstein-Uhlenbeck (OU) process,
which was developed in the 1930s for studying Brownian particles in harmonic
potentials. We
establish the formal equivalence between DDPM and the OU process using its
analytical solution. We further demonstrate that the design problem of the
noise scheduler for non-homogeneous DDPM is equivalent to designing observation
times for the OU process. We present several heuristic designs for observation
times based on principled quantities such as auto-variance and Fisher
Information and connect them to ad hoc noise schedules for DDPM. Interestingly,
we show that the Fisher-Information-motivated schedule corresponds exactly to the
cosine schedule, which was developed without any theoretical foundation but is
the current state-of-the-art noise schedule.
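Concretely, the correspondence can be sketched as follows (assuming unit stationary variance). The DDPM forward kernel and the OU transition kernel are
\[
q(x_t \mid x_0) = \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\big),
\qquad
X_\tau \mid X_0 \sim \mathcal{N}\!\big(e^{-\theta \tau} X_0,\ (1 - e^{-2\theta\tau})\, I\big),
\]
where the latter solves $\mathrm{d}X_\tau = -\theta X_\tau\, \mathrm{d}\tau + \sqrt{2\theta}\, \mathrm{d}W_\tau$. Matching the two kernels gives observation times $\tau_t = -\log \bar\alpha_t / (2\theta)$, so choosing a noise schedule $\bar\alpha_t$ is equivalent to choosing when to observe the OU process.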
Probabilistic Copyright Protection Can Fail for Text-to-Image Generative Models
November 29, 2023
Xiang Li, Qianli Shen, Kenji Kawaguchi
cs.CR, cs.AI, cs.CV, cs.MM
The booming use of text-to-image generative models has raised concerns about
their high risk of producing copyright-infringing content. While probabilistic
copyright protection methods provide a probabilistic guarantee against such
infringement, in this paper, we introduce Virtually Assured Amplification
Attack (VA3), a novel online attack framework that exposes the vulnerabilities
of these protection mechanisms. The proposed framework significantly amplifies
the probability of generating infringing content through sustained interactions
with generative models, with a lower-bounded success probability for each
engagement. Our theoretical and experimental results demonstrate the
effectiveness of our approach and highlight the potential risk of implementing
probabilistic copyright protection in practical applications of text-to-image
generative models. Code is available at https://github.com/South7X/VA3.
HiDiffusion: Unlocking High-Resolution Creativity and Efficiency in Low-Resolution Trained Diffusion Models
November 29, 2023
Shen Zhang, Zhaowei Chen, Zhenyu Zhao, Zhenyuan Chen, Yao Tang, Yuhao Chen, Wengang Cao, Jiajun Liang
We introduce HiDiffusion, a tuning-free framework comprised of
Resolution-Aware U-Net (RAU-Net) and Modified Shifted Window Multi-head
Self-Attention (MSW-MSA) to enable pretrained large text-to-image diffusion
models to efficiently generate high-resolution images (e.g. 1024$\times$1024)
that surpass the training image resolution. Pretrained diffusion models
encounter unreasonable object duplication in generating images beyond the
training image resolution. We attribute it to the mismatch between the feature
map size of high-resolution images and the receptive field of U-Net’s
convolution. To address this issue, we propose a simple yet scalable method
named RAU-Net. RAU-Net dynamically adjusts the feature map size to match the
convolution’s receptive field in the deep block of U-Net. Another obstacle in
high-resolution synthesis is the slow inference speed of U-Net. Our
observations reveal that the global self-attention in the top block, despite
exhibiting locality, consumes the majority of computational resources.
To tackle this issue, we propose MSW-MSA. Unlike previous window attention
mechanisms, our method uses a much larger window size and dynamically shifts
windows to better accommodate diffusion models. Extensive experiments
demonstrate that our HiDiffusion can scale diffusion models to generate
1024$\times$1024, 2048$\times$2048, or even 4096$\times$4096 resolution images,
while simultaneously reducing inference time by 40\%-60\%, achieving
state-of-the-art performance on high-resolution image synthesis. The most
significant revelation of our work is that a pretrained diffusion model on
low-resolution images is scalable for high-resolution generation without
further tuning. We hope this revelation can provide insights for future
research on the scalability of diffusion models.
MMA-Diffusion: MultiModal Attack on Diffusion Models
November 29, 2023
Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Nan Xu, Qiang Xu
In recent years, Text-to-Image (T2I) models have seen remarkable
advancements, gaining widespread adoption. However, this progress has
inadvertently opened avenues for potential misuse, particularly in generating
inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces
MMA-Diffusion, a framework that presents a significant and realistic threat to
the security of T2I models by effectively circumventing current defensive
measures in both open-source models and commercial online services. Unlike
previous approaches, MMA-Diffusion leverages both textual and visual modalities
to bypass safeguards like prompt filters and post-hoc safety checkers, thus
exposing and highlighting the vulnerabilities in existing defense mechanisms.
When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation
November 29, 2023
Xiaoming Li, Xinyu Hou, Chen Change Loy
Text-to-image diffusion models have remarkably excelled in producing diverse,
high-quality, and photo-realistic images. This advancement has spurred a
growing interest in incorporating specific identities into generated content.
Most current methods employ an inversion approach to embed a target visual
concept into the text embedding space using a single reference image. However,
the newly synthesized faces either closely resemble the reference image in
terms of facial attributes, such as expression, or exhibit a reduced capacity
for identity preservation. Text descriptions intended to guide the facial
attributes of the synthesized face may fall short, owing to the intricate
entanglement of identity information with identity-irrelevant facial attributes
derived from the reference image. To address these issues, we present the novel
use of the extended StyleGAN embedding space $\mathcal{W}_+$, to achieve
enhanced identity preservation and disentanglement for diffusion models. By
aligning this semantically meaningful human face latent space with
text-to-image diffusion models, we succeed in maintaining high fidelity in
identity preservation, coupled with the capacity for semantic editing.
Additionally, we propose new training objectives to balance the influences of
both prompt and identity conditions, ensuring that the identity-irrelevant
background remains unaffected during facial attribute modifications. Extensive
experiments reveal that our method adeptly generates personalized text-to-image
outputs that are not only compatible with prompt descriptions but also amenable
to common StyleGAN editing directions in diverse settings. Our source code will
be available at \url{https://github.com/csxmli2016/w-plus-adapter}.
DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Diffusion Model
November 29, 2023
Jiuming Liu, Guangming Wang, Weicai Ye, Chaokang Jiang, Jinru Han, Zhe Liu, Guofeng Zhang, Dalong Du, Hesheng Wang
Scene flow estimation, which aims to predict per-point 3D displacements of
dynamic scenes, is a fundamental task in the computer vision field. However,
previous works commonly suffer from unreliable correlation caused by locally
constrained searching ranges, and struggle with accumulated inaccuracy arising
from the coarse-to-fine structure. To alleviate these problems, we propose a
novel uncertainty-aware scene flow estimation network (DifFlow3D) with the
diffusion probabilistic model. Iterative diffusion-based refinement is designed
to enhance the correlation robustness and resilience to challenging cases,
e.g., dynamics, noisy inputs, repetitive patterns, etc. To restrain the
generation diversity, three key flow-related features are leveraged as
conditions in our diffusion model. Furthermore, we also develop an uncertainty
estimation module within diffusion to evaluate the reliability of estimated
scene flow. Our DifFlow3D achieves state-of-the-art performance, with 6.7\% and
19.1\% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets.
Notably, our method achieves an unprecedented millimeter-level accuracy
(0.0089m in EPE3D) on the KITTI dataset. Additionally, our diffusion-based
refinement paradigm can be readily integrated as a plug-and-play module into
existing scene flow networks, significantly increasing their estimation
accuracy. Codes will be released later.
Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
November 28, 2023
Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle
We propose CompFuser, an image generation pipeline that enhances spatial
comprehension and attribute assignment in text-to-image generative models. Our
pipeline enables the interpretation of instructions defining spatial
relationships between objects in a scene, such as “An image of a gray cat on
the left of an orange dog”, and the generation of corresponding images. This is
especially important for giving the user more control. CompFuser
overcomes the limitation of existing text-to-image diffusion models by decoding
the generation of multiple objects into iterative steps: first generating a
single object and then editing the image by placing additional objects in their
designated positions. To create training data for spatial comprehension and
attribute assignment we introduce a synthetic data generation process, that
leverages a frozen large language model and a frozen layout-based diffusion
model for object placement. We compare our approach to strong baselines and
show that our model outperforms state-of-the-art image generation models in
spatial comprehension and attribute assignment, despite being 3x to 5x smaller
in parameters.
Shadows Don’t Lie and Lines Can’t Bend! Generative Models don’t know Projective Geometry…for now
November 28, 2023
Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, D. A. Forsyth, Anand Bhattad
cs.CV, cs.AI, cs.GR, cs.LG
Generative models can produce impressively realistic images. This paper
demonstrates that generated images have geometric features different from those
of real images. We build a set of collections of generated images, prequalified
to fool simple, signal-based classifiers into believing they are real. We then
show that prequalified generated images can be identified reliably by
classifiers that only look at geometric properties. We use three such
classifiers. All three classifiers are denied access to image pixels, and look
only at derived geometric features. The first classifier looks at the
perspective field of the image, the second looks at lines detected in the
image, and the third looks at relations between detected objects and shadows.
Our procedure detects generated images more reliably than SOTA local signal
based detectors, for images from a number of distinct generators. Saliency maps
suggest that the classifiers can identify geometric problems reliably. We
conclude that current generators cannot reliably reproduce geometric properties
of real images.
DiffuseBot: Breeding Soft Robots With Physics-Augmented Generative Diffusion Models
November 28, 2023
Tsun-Hsuan Wang, Juntian Zheng, Pingchuan Ma, Yilun Du, Byungchul Kim, Andrew Spielberg, Joshua Tenenbaum, Chuang Gan, Daniela Rus
cs.RO, cs.AI, cs.CV, cs.LG
Nature evolves creatures with a high complexity of morphological and
behavioral intelligence, while computational methods lag behind in approaching
that diversity and efficacy. Co-optimization of artificial creatures’
morphology and control in silico shows promise for applications in physical
soft robotics and virtual character creation; such approaches, however, require
developing new learning algorithms that can reason about function atop pure
structure. In this paper, we present DiffuseBot, a physics-augmented diffusion
model that generates soft robot morphologies capable of excelling in a wide
spectrum of tasks. DiffuseBot bridges the gap between virtually generated
content and physical utility by (i) augmenting the diffusion process with a
physical dynamical simulation which provides a certificate of performance, and
(ii) introducing a co-design procedure that jointly optimizes physical design
and control by leveraging information about physical sensitivities from
differentiable simulation. We showcase a range of simulated and fabricated
robots along with their capabilities. Check our website at
https://diffusebot.github.io/
Adversarial Diffusion Distillation
November 28, 2023
Axel Sauer, Dominik Lorenz, Andreas Blattmann, Robin Rombach
We introduce Adversarial Diffusion Distillation (ADD), a novel training
approach that efficiently samples large-scale foundational image diffusion
models in just 1-4 steps while maintaining high image quality. We use score
distillation to leverage large-scale off-the-shelf image diffusion models as a
teacher signal in combination with an adversarial loss to ensure high image
fidelity even in the low-step regime of one or two sampling steps. Our analyses
show that our model clearly outperforms existing few-step methods (GANs, Latent
Consistency Models) in a single step and reaches the performance of
state-of-the-art diffusion models (SDXL) in only four steps. ADD is the first
method to unlock single-step, real-time image synthesis with foundation models.
Code and weights available under
https://github.com/Stability-AI/generative-models and
https://huggingface.co/stabilityai/ .
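Schematically, the student generator is trained with a weighted combination of the two signals named above,
\[
\mathcal{L}_{\mathrm{ADD}} \;=\; \mathcal{L}_{\mathrm{adv}}\big(\hat{x}_\theta\big) \;+\; \lambda\, \mathcal{L}_{\mathrm{distill}}\big(\hat{x}_\theta,\ \hat{x}_\psi\big),
\]
where $\hat{x}_\theta$ is the student's one- to few-step sample, $\hat{x}_\psi$ the frozen diffusion teacher's target, and $\lambda$ a weighting; this is a paraphrase of the abstract, and the precise form of each term is the paper's.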
Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
November 28, 2023
Danah Yatim, Rafail Fridman, Omer Bar Tal, Yoni Kasten, Tali Dekel
We present a new method for text-driven motion transfer - synthesizing a
video that complies with an input text prompt describing the target objects and
scene while maintaining an input video’s motion and scene layout. Prior methods
are confined to transferring motion across two subjects within the same or
closely related object categories and are applicable for limited domains (e.g.,
humans). In this work, we consider a significantly more challenging setting in
which the target and source objects differ drastically in shape and
fine-grained motion characteristics (e.g., translating a jumping dog into a
dolphin). To this end, we leverage a pre-trained and fixed text-to-video
diffusion model, which provides us with generative and motion priors. The
pillar of our method is a new space-time feature loss derived directly from the
model. This loss guides the generation process to preserve the overall motion
of the input video while complying with the target object in terms of shape and
fine-grained motion traits.
Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following
November 28, 2023
Yutong Feng, Biao Gong, Di Chen, Yujun Shen, Yu Liu, Jingren Zhou
Existing text-to-image (T2I) diffusion models usually struggle in
interpreting complex prompts, especially those with quantity, object-attribute
binding, and multi-subject descriptions. In this work, we introduce a semantic
panel as the middleware in decoding texts to images, supporting the generator
to better follow instructions. The panel is obtained through arranging the
visual concepts parsed from the input text with the aid of large language models,
and then injected into the denoising network as a detailed control signal to
complement the text condition. To facilitate text-to-panel learning, we come up
with a carefully designed semantic formatting protocol, accompanied by a
fully-automatic data preparation pipeline. Thanks to such a design, our
approach, which we call Ranni, manages to enhance a pre-trained T2I generator
regarding its textual controllability. More importantly, the introduction of
the generative middleware brings a more convenient form of interaction (i.e.,
directly adjusting the elements in the panel or using language instructions)
and further allows users to finely customize their generation, based on which
we develop a practical system and showcase its potential in continuous
generation and chatting-based editing. Our project page is at
https://ranni-t2i.github.io/Ranni.
November 28, 2023
Chen Zhao, Weiling Cai, Chenyu Dong, Chengwei Hu
Underwater images are subject to intricate and diverse degradation,
inevitably affecting the effectiveness of underwater visual tasks. However,
most approaches primarily operate in the raw pixel space of images, which
limits the exploration of the frequency characteristics of underwater images,
leading to an inadequate utilization of deep models’ representational
capabilities in producing high-quality images. In this paper, we introduce a
novel Underwater Image Enhancement (UIE) framework, named WF-Diff, designed to
fully leverage the characteristics of frequency domain information and
diffusion models. WF-Diff consists of two detachable networks: Wavelet-based
Fourier information interaction network (WFI2-net) and Frequency Residual
Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency
domain information, WFI2-net aims to achieve preliminary enhancement of
frequency information in the wavelet space. Our proposed FRDAM can further
refine the high- and low-frequency information of the initial enhanced images,
which can be viewed as a plug-and-play universal module to adjust the detail of
the underwater images. With the above techniques, our algorithm achieves SOTA
performance on real-world underwater image datasets and competitive visual
quality.
Denoising Diffusion Probabilistic Models for Image Inpainting of Cell Distributions in the Human Brain
November 28, 2023
Jan-Oliver Kropp, Christian Schiffer, Katrin Amunts, Timo Dickscheid
Recent advances in imaging and high-performance computing have made it
possible to image the entire human brain at the cellular level. This provides
the basis for studying the multi-scale architecture of the brain, from its
subdivision into brain areas and nuclei, cortical layers, columns, and cell
clusters down to single-cell morphology. Methods for brain mapping and cell
segmentation exploit such images to enable rapid and automated analysis of
cytoarchitecture and cell distribution in complete series of histological
sections. However, the presence of inevitable processing artifacts in the image
data caused by missing sections, tears in the tissue, or staining variations
remains the primary reason for gaps in the resulting image data. To this end we
aim to provide a model that can fill in missing information in a reliable way,
following the true cell distribution at different scales. Inspired by the
recent success in image generation, we propose a denoising diffusion
probabilistic model (DDPM), trained on light-microscopic scans of cell-body
stained sections. We extend this model with the RePaint method to impute
missing or replace corrupted image data. We show that our trained DDPM is able
to generate highly realistic image information for this purpose, generating
plausible cell statistics and cytoarchitectonic patterns. We validate its
outputs using two established downstream task models trained on the same data.
Robust Diffusion GAN using Semi-Unbalanced Optimal Transport
November 28, 2023
Quan Dao, Binh Ta, Tung Pham, Anh Tran
Diffusion models, a type of generative model, have demonstrated great
potential for synthesizing highly detailed images. By integrating with GAN,
advanced diffusion models like DDGAN \citep{xiao2022DDGAN} could approach
real-time performance for expansive practical applications. While DDGAN has
effectively addressed the challenges of generative modeling, namely producing
high-quality samples, covering different data modes, and achieving faster
sampling, it remains susceptible to performance drops caused by datasets that
are corrupted with outlier samples. This work introduces a robust training
technique based on semi-unbalanced optimal transport to mitigate the impact of
outliers effectively. Through comprehensive evaluations, we demonstrate that
our robust diffusion GAN (RDGAN) outperforms vanilla DDGAN in terms of the
aforementioned generative modeling criteria, i.e., image quality, mode coverage
of distribution, and inference speed, and exhibits improved robustness when
dealing with both clean and corrupted datasets.
MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices
November 28, 2023
Yang Zhao, Yanwu Xu, Zhisheng Xiao, Tingbo Hou
The deployment of large-scale text-to-image diffusion models on mobile
devices is impeded by their substantial model size and slow inference speed. In
this paper, we propose \textbf{MobileDiffusion}, a highly efficient
text-to-image diffusion model obtained through extensive optimizations in both
architecture and sampling techniques. We conduct a comprehensive examination of
model architecture design to reduce redundancy, enhance computational
efficiency, and minimize the model’s parameter count, while preserving image
generation quality. Additionally, we employ distillation and diffusion-GAN
finetuning techniques on MobileDiffusion to achieve 8-step and 1-step inference
respectively. Empirical studies, conducted both quantitatively and
qualitatively, demonstrate the effectiveness of our proposed techniques.
MobileDiffusion achieves a remarkable \textbf{sub-second} inference speed for
generating a $512\times512$ image on mobile devices, establishing a new state
of the art.
DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser
November 28, 2023
Peng Chen, Xiaobao Wei, Ming Lu, Yitong Zhu, Naiming Yao, Xingyu Xiao, Hui Chen
Speech-driven 3D facial animation has been an attractive task in both
academia and industry. Traditional methods mostly focus on learning a
deterministic mapping from speech to animation. Recent approaches start to
consider the non-deterministic fact of speech-driven 3D face animation and
employ the diffusion model for the task. However, personalizing facial
animation and accelerating animation generation are still two major limitations
of existing diffusion-based methods. To address the above limitations, we
propose DiffusionTalker, a diffusion-based method that utilizes contrastive
learning to personalize 3D facial animation and knowledge distillation to
accelerate 3D animation generation. Specifically, to enable personalization, we
introduce a learnable talking identity to aggregate knowledge in audio
sequences. The proposed identity embeddings extract customized facial cues
across different people in a contrastive learning manner. During inference,
users can obtain personalized facial animation based on input audio, reflecting
a specific talking style. We then distill the trained diffusion model, which
requires hundreds of steps, into a lightweight model with 8 steps for acceleration.
Extensive experiments are conducted to demonstrate that our method outperforms
state-of-the-art methods. The code will be released.
Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance
November 28, 2023
Siyu Xing, Jie Cao, Huaibo Huang, Xiao-Yu Zhang, Ran He
Flow matching, as a generative modeling paradigm, achieves notable success
across various domains. However, existing methods use either multi-round
training or knowledge within minibatches, posing challenges in finding a
favorable coupling strategy for straight trajectories. To address this issue,
we propose a novel approach, Straighter trajectories of Flow Matching
(StraightFM). It straightens trajectories with a coupling strategy guided by a
diffusion model at the level of the entire distribution. First, we propose a coupling
strategy to straighten trajectories, creating couplings between image and noise
samples under diffusion model guidance. Second, StraightFM also integrates real
data to enhance training, employing a neural network to parameterize another
coupling process from images to noise samples. StraightFM is jointly optimized
with couplings from the above two mutually complementary directions, resulting
in straighter trajectories and enabling both one-step and few-step generation.
Extensive experiments demonstrate that StraightFM yields high-quality samples
with fewer steps. StraightFM generates visually appealing images with a lower
FID among diffusion and traditional flow matching methods within 5 sampling
steps when trained on pixel space. In the latent space (i.e., Latent
Diffusion), StraightFM achieves a lower KID value compared to existing methods
on the CelebA-HQ 256 dataset in fewer than 10 sampling steps.
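For context, with a linear interpolant between a noise sample $x_0$ and a data sample $x_1$, the flow-matching objective whose coupling StraightFM seeks to improve can be written as
\[
\mathcal{L}_{\mathrm{FM}} \;=\; \mathbb{E}_{t,\,(x_0,x_1)}\, \big\|\, v_\theta\big((1-t)\,x_0 + t\,x_1,\ t\big) - (x_1 - x_0) \,\big\|^2,
\]
where the choice of coupling $(x_0, x_1)$ — here guided by a diffusion model — determines how straight the learned trajectories are; the notation is generic rather than the paper's.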
Federated Learning with Diffusion Models for Privacy-Sensitive Vision Tasks
November 28, 2023
Ye Lin Tun, Chu Myaet Thwal, Ji Su Yoon, Sun Moo Kang, Chaoning Zhang, Choong Seon Hong
Diffusion models have shown great potential for vision-related tasks,
particularly for image generation. However, their training is typically
conducted in a centralized manner, relying on data collected from publicly
available sources. This approach may not be feasible or practical in many
domains, such as the medical field, which involves privacy concerns over data
collection. Despite the challenges associated with privacy-sensitive data, such
domains could still benefit from valuable vision services provided by diffusion
models. Federated learning (FL) plays a crucial role in enabling decentralized
model training without compromising data privacy. Instead of collecting data,
an FL system gathers model parameters, effectively safeguarding the private
data of different parties involved. This makes FL systems vital for managing
decentralized learning tasks, especially in scenarios where privacy-sensitive
data is distributed across a network of clients. Nonetheless, FL presents its
own set of challenges due to its distributed nature and privacy-preserving
properties. Therefore, in this study, we explore the FL strategy to train
diffusion models, paving the way for the development of federated diffusion
models. We conduct experiments on various FL scenarios, and our findings
demonstrate that federated diffusion models have great potential to deliver
vision services to privacy-sensitive domains.
TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering
November 28, 2023
Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei
The diffusion model has proven to be a powerful generative model in recent
years, yet generating visual text remains a challenge. Several methods
alleviated this issue by incorporating explicit text position and content as
guidance on where and what text to render. However, these methods still suffer
from several drawbacks, such as limited flexibility and automation, constrained
capability of layout prediction, and restricted style diversity. In this paper,
we present TextDiffuser-2, aiming to unleash the power of language models for
text rendering. Firstly, we fine-tune a large language model for layout
planning. The large language model is capable of automatically generating
keywords for text rendering and also supports layout modification through
chatting. Secondly, we utilize the language model within the diffusion model to
encode the position and texts at the line level. Unlike previous methods that
employed tight character-level guidance, this approach generates more diverse
text images. We conduct extensive experiments and incorporate user studies
involving human participants as well as GPT-4V, validating TextDiffuser-2’s
capacity to achieve a more rational text layout and generation with enhanced
diversity. The code and model will be available at
\url{https://aka.ms/textdiffuser-2}.
PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation
November 28, 2023
Jian Ma, Chen Chen, Qingsong Xie, Haonan Lu
Text-to-image diffusion models are well-known for their ability to generate
realistic images based on textual prompts. However, the existing works have
predominantly focused on English, lacking support for non-English text-to-image
models. The most commonly used translation methods cannot solve the generation
problem related to language culture, while training from scratch on a specific
language dataset is prohibitively expensive. In this paper, we propose a
simple plug-and-play language transfer method based on knowledge
distillation. All we need to do is train a lightweight MLP-like
parameter-efficient adapter (PEA) with only 6M parameters under teacher
knowledge distillation along with a small parallel data corpus. We are
surprised to find that freezing the parameters of UNet can still achieve
remarkable performance on the language-specific prompt evaluation set,
demonstrating that PEA can stimulate the potential generation ability of the
original UNet. Additionally, it closely approaches the performance of the
English text-to-image model on a general prompt evaluation set. Furthermore,
our adapter can be used as a plugin to achieve significant results in
downstream tasks in cross-lingual text-to-image generation. Code will be
available at: https://github.com/OPPO-Mente-Lab/PEA-Diffusion
Manifold Preserving Guided Diffusion
November 28, 2023
Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J. Zico Kolter, Ruslan Salakhutdinov, Stefano Ermon
Despite the recent advancements, conditional image generation still faces
challenges of cost, generalizability, and the need for task-specific training.
In this paper, we propose Manifold Preserving Guided Diffusion (MPGD), a
training-free conditional generation framework that leverages pretrained
diffusion models and off-the-shelf neural networks with minimal additional
inference cost for a broad range of tasks. Specifically, we leverage the
manifold hypothesis to refine the guided diffusion steps and introduce a
shortcut algorithm in the process. We then propose two methods for on-manifold
training-free guidance using pre-trained autoencoders and demonstrate that our
shortcut inherently preserves the manifolds when applied to latent diffusion
models. Our experiments show that MPGD is efficient and effective for solving a
variety of conditional generation applications in low-compute settings, and can
consistently offer up to 3.8x speed-ups with the same number of diffusion steps
while maintaining high sample quality compared to the baselines.
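The following is only a rough sketch of on-manifold guidance with a pre-trained autoencoder, under assumed interfaces (`encoder`, `decoder`, and `loss_fn` are hypothetical); it illustrates the flavor of manipulating the denoiser's clean estimate in latent space, not the paper's exact shortcut algorithm.

```python
import torch

def on_manifold_guidance(x0_hat, loss_fn, encoder, decoder, step_size=0.1):
    """Move the clean estimate along the guidance gradient, but only within
    the autoencoder's latent space, keeping the update near the data manifold."""
    z = encoder(x0_hat).detach().requires_grad_(True)
    loss = loss_fn(decoder(z))               # task-specific guidance loss
    (grad,) = torch.autograd.grad(loss, z)
    return decoder(z - step_size * grad).detach()
```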
Test-time Adaptation of Discriminative Models via Diffusion Generative Feedback
November 27, 2023
Mihir Prabhudesai, Tsung-Wei Ke, Alexander C. Li, Deepak Pathak, Katerina Fragkiadaki
cs.CV, cs.AI, cs.LG, cs.RO
The advancements in generative modeling, particularly the advent of diffusion
models, have sparked a fundamental question: how can these models be
effectively used for discriminative tasks? In this work, we find that
generative models can be great test-time adapters for discriminative models.
Our method, Diffusion-TTA, adapts pre-trained discriminative models such as
image classifiers, segmenters and depth predictors, to each unlabelled example
in the test set using generative feedback from a diffusion model. We achieve
this by modulating the conditioning of the diffusion model using the output of
the discriminative model. We then maximize the image likelihood objective by
backpropagating the gradients to the discriminative model’s parameters. We show
Diffusion-TTA significantly enhances the accuracy of various large-scale
pre-trained discriminative models, such as ImageNet classifiers, CLIP models,
image pixel labellers and image depth predictors. Diffusion-TTA outperforms
existing test-time adaptation methods, including TTT-MAE and TENT, and
particularly shines in online adaptation setups, where the discriminative model
is continually adapted to each example in the test set. We provide access to
code, results, and visualizations on our website:
https://diffusion-tta.github.io/.
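A minimal sketch of the adaptation loop described above, assuming a frozen conditional noise predictor `eps_model`, a learned table of per-class conditioning embeddings, and a simple cosine schedule; it illustrates the idea rather than reproducing the released code.
```python
import math
import torch
import torch.nn.functional as F

def cosine_alpha_bar(t, num_timesteps=1000):
    """Simple cosine noise schedule (an assumption; any schedule works)."""
    return torch.cos((t.float() / num_timesteps + 0.008) / 1.008 * math.pi / 2) ** 2

def diffusion_tta_step(image, classifier, eps_model, class_embeddings, optimizer,
                       num_timesteps=1000):
    """One hedged test-time-adaptation step in the spirit of Diffusion-TTA:
    the classifier's softmax output weights per-class conditioning embeddings,
    and the resulting diffusion denoising loss is backpropagated into the
    classifier while the diffusion model itself stays frozen."""
    probs = F.softmax(classifier(image), dim=-1)              # (B, C)
    cond = probs @ class_embeddings                           # (B, D)

    t = torch.randint(0, num_timesteps, (image.size(0),), device=image.device)
    noise = torch.randn_like(image)
    a_bar = cosine_alpha_bar(t, num_timesteps).view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * image + (1 - a_bar).sqrt() * noise

    eps_pred = eps_model(x_t, t, cond)                        # frozen diffusion model
    loss = F.mse_loss(eps_pred, noise)                        # proxy for image likelihood

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                          # updates classifier params
    return loss.item()
```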
Self-correcting LLM-controlled Diffusion Models
November 27, 2023
Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell
Text-to-image generation has witnessed significant progress with the advent
of diffusion models. Despite the ability to generate photorealistic images,
current text-to-image diffusion models still often struggle to accurately
interpret and follow complex input text prompts. In contrast to existing models
that aim to generate images only with their best effort, we introduce
Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that
generates an image from the input prompt, assesses its alignment with the
prompt, and performs self-corrections on the inaccuracies in the generated
image. Steered by an LLM controller, SLD turns text-to-image generation into an
iterative closed-loop process, ensuring correctness in the resulting image. SLD
is not only training-free but can also be seamlessly integrated with diffusion
models behind API access, such as DALL-E 3, to further boost the performance of
state-of-the-art diffusion models. Experimental results show that our approach
can rectify a majority of incorrect generations, particularly in generative
numeracy, attribute binding, and spatial relationships. Furthermore, by simply
adjusting the instructions to the LLM, SLD can perform image editing tasks,
bridging the gap between text-to-image generation and image editing pipelines.
We will make our code available for future research and applications.
DiffSLVA: Harnessing Diffusion Models for Sign Language Video Anonymization
November 27, 2023
Zhaoyang Xia, Carol Neidle, Dimitris N. Metaxas
Since American Sign Language (ASL) has no standard written form, Deaf signers
frequently share videos in order to communicate in their native language.
However, since both hands and face convey critical linguistic information in
signed languages, sign language videos cannot preserve signer privacy. While
signers have expressed interest, for a variety of applications, in sign
language video anonymization that would effectively preserve linguistic
content, attempts to develop such technology have had limited success, given
the complexity of hand movements and facial expressions. Existing approaches
rely predominantly on precise pose estimations of the signer in video footage
and often require sign language video datasets for training. These requirements
prevent them from processing videos ‘in the wild,’ in part because of the
limited diversity present in current sign language video datasets. To address
these limitations, our research introduces DiffSLVA, a novel methodology that
utilizes pre-trained large-scale diffusion models for zero-shot text-guided
sign language video anonymization. We incorporate ControlNet, which leverages
low-level image features such as HED (Holistically-Nested Edge Detection)
edges, to circumvent the need for pose estimation. Additionally, we develop a
specialized module dedicated to capturing facial expressions, which are
critical for conveying essential linguistic information in signed languages. We
then combine the above methods to achieve anonymization that better preserves
the essential linguistic content of the original signer. This innovative
methodology makes possible, for the first time, sign language video
anonymization that could be used for real-world applications, which would offer
significant benefits to the Deaf and Hard-of-Hearing communities. We
demonstrate the effectiveness of our approach with a series of signer
anonymization experiments.
Closing the ODE-SDE gap in score-based diffusion models through the Fokker-Planck equation
November 27, 2023
Teo Deveney, Jan Stanczuk, Lisa Maria Kreusser, Chris Budd, Carola-Bibiane Schönlieb
cs.LG, cs.NA, math.NA, stat.ML
Score-based diffusion models have emerged as one of the most promising
frameworks for deep generative modelling, due to their state-of-the-art
performance in many generation tasks while relying on mathematical foundations
such as stochastic differential equations (SDEs) and ordinary differential
equations (ODEs). Empirically, it has been reported that ODE-based samples are
inferior to SDE-based samples. In this paper, we rigorously describe the range
of dynamics and approximations that arise when training score-based diffusion
models, including the true SDE dynamics, the neural approximations, the various
approximate particle dynamics that result, as well as their associated
Fokker–Planck equations and the neural network approximations of these
Fokker–Planck equations. We systematically analyse the difference between the
ODE and SDE dynamics of score-based diffusion models, and link it to an
associated Fokker–Planck equation. We derive a theoretical upper bound on the
Wasserstein 2-distance between the ODE- and SDE-induced distributions in terms
of a Fokker–Planck residual. We also show numerically that conventional
score-based diffusion models can exhibit significant differences between ODE-
and SDE-induced distributions which we demonstrate using explicit comparisons.
Moreover, we show numerically that reducing the Fokker–Planck residual by
adding it as an additional regularisation term leads to closing the gap between
ODE- and SDE-induced distributions. Our experiments suggest that this
regularisation can improve the distribution generated by the ODE; however,
this can come at the cost of degraded SDE sample quality.
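To make the regularisation idea concrete, one way to write a score-form Fokker–Planck residual for the SDE $dx = f(x,t)\,dt + g(t)\,dW$ with learned score $s_\theta$ is sketched below; the notation is assumed here and the exact residual used in the paper may differ.
```latex
\mathcal{R}_{\mathrm{FP}}(x,t) = \partial_t s_\theta(x,t)
  - \nabla_x\!\Big[\tfrac{1}{2}g(t)^2\big(\nabla_x\!\cdot s_\theta(x,t)
    + \lVert s_\theta(x,t)\rVert^2\big)
    - f(x,t)\cdot s_\theta(x,t) - \nabla_x\!\cdot f(x,t)\Big],
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{score}}
  + \lambda\,\mathbb{E}_{x,t}\big[\lVert \mathcal{R}_{\mathrm{FP}}(x,t)\rVert^2\big].
```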
DiffAnt: Diffusion Models for Action Anticipation
November 27, 2023
Zeyun Zhong, Chengzhi Wu, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer
Anticipating future actions is inherently uncertain. Given an observed video
segment containing ongoing actions, multiple subsequent actions can plausibly
follow. This uncertainty becomes even larger when predicting far into the
future. However, the majority of existing action anticipation models adhere to
a deterministic approach, neglecting to account for future uncertainties. In
this work, we rethink action anticipation from a generative view, employing
diffusion models to capture different possible future actions. In this
framework, future actions are iteratively generated from standard Gaussian
noise in the latent space, conditioned on the observed video, and subsequently
transitioned into the action space. Extensive experiments on four benchmark
datasets, i.e., Breakfast, 50Salads, EpicKitchens, and EGTEA Gaze+, are
performed and the proposed method achieves superior or comparable results to
state-of-the-art methods, showing the effectiveness of a generative approach
for action anticipation. Our code and trained models will be published on
GitHub.
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
November 27, 2023
Yushi Huang, Ruihao Gong, Jing Liu, Tianlong Chen, Xianglong Liu
The Diffusion model, a prevalent framework for image generation, encounters
significant challenges in terms of broad applicability due to its extended
inference times and substantial memory requirements. Efficient Post-training
Quantization (PTQ) is pivotal for addressing these issues in traditional
models. Different from traditional models, diffusion models heavily depend on
the time-step $t$ to achieve satisfactory multi-round denoising. Usually, $t$
from the finite set $\{1, \ldots, T\}$ is encoded into a temporal feature by a
few modules that are entirely independent of the sampling data. However, existing PTQ
methods do not optimize these modules separately. They adopt inappropriate
reconstruction targets and complex calibration methods, resulting in a severe
disturbance of the temporal feature and denoising trajectory, as well as a low
compression efficiency. To solve these, we propose a Temporal Feature
Maintenance Quantization (TFMQ) framework building upon a Temporal Information
Block which is just related to the time-step $t$ and unrelated to the sampling
data. Powered by the pioneering block design, we devise temporal information
aware reconstruction (TIAR) and finite set calibration (FSC) to align the
full-precision temporal features in a limited time. Equipped with this
framework, we can preserve most of the temporal information and ensure the
end-to-end generation quality. Extensive experiments on various datasets and
diffusion models prove our state-of-the-art results. Remarkably, our
quantization approach, for the first time, achieves model performance nearly on
par with the full-precision model under 4-bit weight quantization.
Additionally, our method incurs almost no extra computational cost and
accelerates quantization time by $2.0 \times$ on LSUN-Bedrooms $256 \times 256$
compared to previous works.
Regularization by Texts for Latent Diffusion Inverse Solvers
November 27, 2023
Jeongsol Kim, Geon Yeong Park, Hyungjin Chung, Jong Chul Ye
The recent advent of diffusion models has led to significant progress in
solving inverse problems, leveraging these models as effective generative
priors. Nonetheless, challenges related to the ill-posed nature of such
problems remain, often due to inherent ambiguities in measurements. Drawing
inspiration from the human ability to resolve visual ambiguities through
perceptual biases, here we introduce a novel latent diffusion inverse solver by
incorporating regularization by texts (TReg). Specifically, TReg applies a
textual description of the preconceived solution during the reverse
sampling phase; this description is dynamically reinforced through
null-text optimization for adaptive negation. Our comprehensive experimental
results demonstrate that TReg successfully mitigates ambiguity in latent
diffusion inverse solvers, enhancing their effectiveness and accuracy.
LFSRDiff: Light Field Image Super-Resolution via Diffusion Models
November 27, 2023
Wentao Chao, Fuqing Duan, Xuechun Wang, Yingqian Wang, Guanghui Wang
Light field (LF) image super-resolution (SR) is a challenging problem due to
its inherent ill-posed nature, where a single low-resolution (LR) input LF
image can correspond to multiple potential super-resolved outcomes. Despite
this complexity, mainstream LF image SR methods typically adopt a deterministic
approach, generating only a single output supervised by pixel-wise loss
functions. This tendency often produces blurry and unrealistic results.
Although diffusion models can capture the distribution of potential SR results
by iteratively predicting Gaussian noise during the denoising process, they are
primarily designed for general images and struggle to effectively handle the
unique characteristics and information present in LF images. To address these
limitations, we introduce LFSRDiff, the first diffusion-based LF image SR
model, by incorporating the LF disentanglement mechanism. Our novel
contribution includes the introduction of a disentangled U-Net for diffusion
models, enabling more effective extraction and fusion of both spatial and
angular information within LF images. Through comprehensive experimental
evaluations and comparisons with the state-of-the-art LF image SR methods, the
proposed approach consistently produces diverse and realistic SR results. It
achieves the best perceptual quality in terms of LPIPS. It also demonstrates
the ability to effectively control the trade-off between perception and
distortion. The code is available at
\url{https://github.com/chaowentao/LFSRDiff}.
Functional Diffusion
November 26, 2023
Biao Zhang, Peter Wonka
We propose a new class of generative diffusion models, called functional
diffusion. In contrast to previous work, functional diffusion works on samples
that are represented by functions with a continuous domain. Functional
diffusion can be seen as an extension of classical diffusion models to an
infinite-dimensional domain. Functional diffusion is very versatile as images,
videos, audio, 3D shapes, deformations, etc., can be handled by the same
framework with minimal changes. In addition, functional diffusion is especially
suited for irregular data or data defined in non-standard domains. In our work,
we derive the necessary foundations for functional diffusion and propose a
first implementation based on the transformer architecture. We show generative
results on complicated signed distance functions and deformation functions
defined on 3D surfaces.
ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model
November 24, 2023
Eslam Mohamed Bakr, Liangbing Zhao, Vincent Tao Hu, Matthieu Cord, Patrick Perez, Mohamed Elhoseiny
Diffusion-based generative models excel in perceptually impressive synthesis
but face challenges in interpretability. This paper introduces
ToddlerDiffusion, an interpretable 2D diffusion image-synthesis framework
inspired by the human generation system. Unlike traditional diffusion models
with opaque denoising steps, our approach decomposes the generation process
into simpler, interpretable stages: generating contours, a palette, and a
detailed colored image. This not only enhances overall performance but also
enables robust editing and interaction capabilities. Each stage is meticulously
formulated for efficiency and accuracy, surpassing Stable-Diffusion (LDM).
Extensive experiments on datasets like LSUN-Churches and COCO validate our
approach, consistently outperforming existing methods. ToddlerDiffusion
achieves notable efficiency, matching LDM performance on LSUN-Churches while
operating three times faster with a 3.76 times smaller architecture. Our source
code is provided in the supplementary material and will be publicly accessible.
DemoFusion: Democratising High-Resolution Image Generation With No $$$
November 24, 2023
Ruoyi Du, Dongliang Chang, Timothy Hospedales, Yi-Zhe Song, Zhanyu Ma
High-resolution image generation with Generative Artificial Intelligence
(GenAI) has immense potential but, due to the enormous capital investment
required for training, it is increasingly centralised to a few large
corporations, and hidden behind paywalls. This paper aims to democratise
high-resolution GenAI by advancing the frontier of high-resolution generation
while remaining accessible to a broad audience. We demonstrate that existing
Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution
image generation. Our novel DemoFusion framework seamlessly extends open-source
GenAI models, employing Progressive Upscaling, Skip Residual, and Dilated
Sampling mechanisms to achieve higher-resolution image generation. The
progressive nature of DemoFusion requires more passes, but the intermediate
results can serve as “previews”, facilitating rapid prompt iteration.
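A minimal sketch of the progressive outer loop implied above, assuming a hypothetical `base_sample_fn` that runs ordinary low-resolution generation and a `denoise_from_fn(z, strength)` that re-noises a latent to an intermediate timestep and denoises it back; the Skip Residual and Dilated Sampling mechanisms are omitted.
```python
import torch
import torch.nn.functional as F

def progressive_upscale_generate(base_sample_fn, denoise_from_fn, latent,
                                 num_stages=2, noise_strength=0.6):
    """Hedged sketch of a DemoFusion-style progressive loop: generate at the
    native resolution, then repeatedly upscale and refine with a partial
    diffusion round. Intermediate results double as fast previews."""
    z = base_sample_fn(latent)           # native-resolution "preview"
    previews = [z]
    for _ in range(num_stages):
        z_up = F.interpolate(z, scale_factor=2, mode="bicubic",
                             align_corners=False)
        z = denoise_from_fn(z_up, strength=noise_strength)  # SDEdit-style refinement
        previews.append(z)
    return z, previews
```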
On diffusion-based generative models and their error bounds: The log-concave case with full convergence estimates
November 22, 2023
Stefano Bruno, Ying Zhang, Dong-Young Lim, Ömer Deniz Akyildiz, Sotirios Sabanis
cs.LG, math.OC, math.PR, stat.ML
We provide full theoretical guarantees for the convergence behaviour of
diffusion-based generative models under the assumption of strongly logconcave
data distributions while our approximating class of functions used for score
estimation is made of Lipschitz continuous functions. We demonstrate via a
motivating example, sampling from a Gaussian distribution with unknown mean,
the power of our approach. In this case, explicit estimates are provided
for the associated optimization problem, i.e. score approximation, while these
are combined with the corresponding sampling estimates. As a result, we obtain
the best known upper bound estimates in terms of key quantities of interest,
such as the dimension and rates of convergence, for the Wasserstein-2 distance
between the data distribution (Gaussian with unknown mean) and our sampling
algorithm.
Beyond the motivating example and in order to allow for the use of a diverse
range of stochastic optimizers, we present our results using an $L^2$-accurate
score estimation assumption, which crucially is formed under an expectation
with respect to the stochastic optimizer and our novel auxiliary process that
uses only known information. This approach yields the best known convergence
rate for our sampling algorithm.
DiffusionMat: Alpha Matting as Sequential Refinement Learning
November 22, 2023
Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo
In this paper, we introduce DiffusionMat, a novel image matting framework
that employs a diffusion model for the transition from coarse to refined alpha
mattes. Diverging from conventional methods that utilize trimaps merely as
loose guidance for alpha matte prediction, our approach treats image matting as
a sequential refinement learning process. This process begins with the addition
of noise to trimaps and iteratively denoises them using a pre-trained diffusion
model, which incrementally guides the prediction towards a clean alpha matte.
The key innovation of our framework is a correction module that adjusts the
output at each denoising step, ensuring that the final result is consistent
with the input image’s structures. We also introduce the Alpha Reliability
Propagation, a novel technique designed to maximize the utility of available
guidance by selectively enhancing the trimap regions with confident alpha
information, thus simplifying the correction task. To train the correction
module, we devise specialized loss functions that target the accuracy of the
alpha matte’s edges and the consistency of its opaque and transparent regions.
We evaluate our model across several image matting benchmarks, and the results
indicate that DiffusionMat consistently outperforms existing methods. Project
page at \url{https://cnnlstm.github.io/DiffusionMat}.
Accelerating Inference in Molecular Diffusion Models with Latent Representations of Protein Structure
November 22, 2023
Ian Dunn, David Ryan Koes
Diffusion generative models have emerged as a powerful framework for
addressing problems in structural biology and structure-based drug design.
These models operate directly on 3D molecular structures. Due to the
unfavorable scaling of graph neural networks (GNNs) with graph size as well as
the relatively slow inference speeds inherent to diffusion models, many
existing molecular diffusion models rely on coarse-grained representations of
protein structure to make training and inference feasible. However, such
coarse-grained representations discard essential information for modeling
molecular interactions and impair the quality of generated structures. In this
work, we present a novel GNN-based architecture for learning latent
representations of molecular structure. When trained end-to-end with a
diffusion model for de novo ligand design, our model achieves comparable
performance to one with an all-atom protein representation while exhibiting a
3-fold reduction in inference time.
Guided Flows for Generative Modeling and Decision Making
November 22, 2023
Qinqing Zheng, Matt Le, Neta Shaul, Yaron Lipman, Aditya Grover, Ricky T. Q. Chen
cs.LG, cs.AI, cs.CV, cs.RO, stat.ML
Classifier-free guidance is a key component for improving the performance of
conditional generative models for many downstream tasks. It drastically
improves the quality of samples produced, but has so far only been used for
diffusion models. Flow Matching (FM), an alternative simulation-free approach,
trains Continuous Normalizing Flows (CNFs) based on regressing vector fields.
It remains an open question whether classifier-free guidance can be performed
for Flow Matching models, and to what extent it improves performance. In
this paper, we explore the usage of Guided Flows for a variety of downstream
applications involving conditional image generation, speech synthesis, and
reinforcement learning. In particular, we are the first to apply flow models to
the offline reinforcement learning setting. We also show that Guided Flows
significantly improves the sample quality in image generation and zero-shot
text-to-speech synthesis, and can use drastically less computation without
affecting the agent’s overall performance.
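The guidance rule itself mirrors the diffusion case: the guided velocity is the unconditional velocity plus a scaled difference to the conditional one. A minimal sketch, assuming a model that returns the learned velocity and accepts `cond=None` for the unconditional branch (here t=0 is noise and t=1 is data by convention):
```python
import torch

def guided_velocity(model, x_t, t, cond, guidance_scale=2.0):
    """Classifier-free guidance applied to a Flow Matching vector field."""
    v_cond = model(x_t, t, cond)
    v_uncond = model(x_t, t, None)
    return v_uncond + guidance_scale * (v_cond - v_uncond)

@torch.no_grad()
def sample_guided_flow(model, x0, cond, steps=50, guidance_scale=2.0):
    """Simple Euler integration of the guided ODE from noise to data."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.size(0),), i * dt, device=x.device)
        x = x + dt * guided_velocity(model, x, t, cond, guidance_scale)
    return x
```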
Recognition-Guided Diffusion Model for Scene Text Image Super-Resolution
November 22, 2023
Yuxuan Zhou, Liangcai Gao, Zhi Tang, Baole Wei
Scene Text Image Super-Resolution (STISR) aims to enhance the resolution and
legibility of text within low-resolution (LR) images, consequently elevating
recognition accuracy in Scene Text Recognition (STR). Previous methods
predominantly employ discriminative Convolutional Neural Networks (CNNs)
augmented with diverse forms of text guidance to address this issue.
Nevertheless, they remain deficient when confronted with severely blurred
images, due to their insufficient generation capability when little structural
or semantic information can be extracted from original images. Therefore, we
introduce RGDiffSR, a Recognition-Guided Diffusion model for scene text image
Super-Resolution, which exhibits great generative diversity and fidelity even
in challenging scenarios. Moreover, we propose a Recognition-Guided Denoising
Network to guide the diffusion model in generating LR-consistent results through
succinct semantic guidance. Experiments on the TextZoom dataset demonstrate the
superiority of RGDiffSR over prior state-of-the-art methods in both text
recognition accuracy and image fidelity.
Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models
November 22, 2023
Mengyang Feng, Jinlin Liu, Miaomiao Cui, Xuansong Xie
This is a technical report on the 360-degree panoramic image generation task
based on diffusion models. Unlike ordinary 2D images, 360-degree panoramic
images capture the entire $360^\circ\times 180^\circ$ field of view. So the
rightmost and the leftmost sides of the 360 panoramic image should join
seamlessly, which is the main challenge in this field. However, the current
diffusion pipeline is not appropriate for generating such a seamless 360-degree
panoramic image. To this end, we propose a circular blending strategy on both
the denoising and VAE decoding stages to maintain the geometry continuity.
Based on this, we present two models for \textbf{Text-to-360-panoramas} and
\textbf{Single-Image-to-360-panoramas} tasks. The code has been released as an
open-source project at
\href{https://github.com/ArcherFMY/SD-T2I-360PanoImage}{https://github.com/ArcherFMY/SD-T2I-360PanoImage}
and
\href{https://www.modelscope.cn/models/damo/cv_diffusion_text-to-360panorama-image_generation/summary}{ModelScope}
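A toy sketch of the circular blending idea on a latent tensor, using a fixed linear cross-fade over the wrap-around seam; the overlap width and where the blend is applied (each denoising step and the VAE decode, per the abstract) are assumptions here.
```python
import torch

def circular_blend(latent, overlap=8):
    """Hedged sketch of circular blending for seamless panoramas: cross-fade
    the leftmost and rightmost columns of the latent so the two ends of the
    360-degree image agree after wrapping."""
    ramp = torch.linspace(0, 1, overlap, device=latent.device).view(1, 1, 1, overlap)
    left = latent[..., :overlap]
    right = latent[..., -overlap:]
    blended = ramp * left + (1 - ramp) * right
    out = latent.clone()
    out[..., :overlap] = blended     # both edges share the blended strip,
    out[..., -overlap:] = blended    # so the wrap-around seam is continuous
    return out
```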
On the Limitation of Diffusion Models for Synthesizing Training Datasets
November 22, 2023
Shin'ya Yamaguchi, Takuma Fukuda
Synthetic samples from diffusion models are promising for leveraging in
training discriminative models as replications of real training datasets.
However, we found that the synthetic datasets degrade classification
performance over real datasets even when using state-of-the-art diffusion
models. This means that modern diffusion models do not perfectly represent the
data distribution for the purpose of replicating datasets for training
discriminative tasks. This paper investigates the gap between synthetic and
real samples by analyzing the synthetic samples reconstructed from real samples
through the diffusion and reverse process. By varying the time steps starting
the reverse process in the reconstruction, we can control the trade-off between
the information in the original real data and the information added by
diffusion models. Through assessing the reconstructed samples and trained
models, we found that the synthetic data are concentrated in modes of the
training data distribution as the reverse step increases, and thus struggle
to cover the outer edges of the distribution. Our findings imply that
modern diffusion models are insufficient to replicate training data
distribution perfectly, and there is room for the improvement of generative
modeling in the replication of training datasets.
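The analysis protocol described above can be sketched as follows, with an assumed scheduler interface (`add_noise(x, noise, t)` and `step(eps, t, x)` returning the previous sample); this is illustrative rather than the authors' code.
```python
import torch

@torch.no_grad()
def reconstruct_from_t(real_image, t_start, eps_model, scheduler):
    """Diffuse a real image up to timestep `t_start`, then run the reverse
    process from there. Small `t_start` keeps most of the original
    information; large `t_start` lets the diffusion model contribute more."""
    noise = torch.randn_like(real_image)
    x = scheduler.add_noise(real_image, noise, t_start)
    for t in reversed(range(t_start)):
        t_batch = torch.full((x.size(0),), t, device=x.device, dtype=torch.long)
        eps = eps_model(x, t_batch)
        x = scheduler.step(eps, t, x)   # assumed to return the previous sample
    return x
```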
Diffusion Model Alignment Using Direct Preference Optimization
November 21, 2023
Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, Nikhil Naik
cs.CV, cs.AI, cs.GR, cs.LG
Large language models (LLMs) are fine-tuned using human comparison data with
Reinforcement Learning from Human Feedback (RLHF) methods to make them better
aligned with users’ preferences. In contrast to LLMs, human preference learning
has not been widely explored in text-to-image diffusion models; the best
existing approach is to fine-tune a pretrained model using carefully curated
high quality images and captions to improve visual appeal and text alignment.
We propose Diffusion-DPO, a method to align diffusion models to human
preferences by directly optimizing on human comparison data. Diffusion-DPO is
adapted from the recently developed Direct Preference Optimization (DPO), a
simpler alternative to RLHF which directly optimizes a policy that best
satisfies human preferences under a classification objective. We re-formulate
DPO to account for a diffusion model notion of likelihood, utilizing the
evidence lower bound to derive a differentiable objective. Using the Pick-a-Pic
dataset of 851K crowdsourced pairwise preferences, we fine-tune the base model
of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with
Diffusion-DPO. Our fine-tuned base model significantly outperforms both base
SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement
model in human evaluation, improving visual appeal and prompt alignment. We
also develop a variant that uses AI feedback and has comparable performance to
training on human preferences, opening the door for scaling of diffusion model
alignment methods.
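A heavily simplified sketch of the pairwise objective: per-sample denoising errors stand in for the diffusion likelihood, and a frozen reference model anchors the comparison. Constants, timestep weighting, and the exact ELBO-derived form from the paper are not reproduced here.
```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_model, ref_model, x_w, x_l, t, noise, alpha_bar, beta=5000.0):
    """DPO-style pairwise loss on denoising errors (hedged sketch).

    x_w / x_l: latents of the preferred / rejected image, noised with the
    same `noise` at timestep `t`; `alpha_bar` holds the cumulative schedule
    values for `t` (shape (B,))."""
    a = alpha_bar.view(-1, 1, 1, 1)

    def err(model, x0):
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
        pred = model(x_t, t)
        return ((pred - noise) ** 2).mean(dim=(1, 2, 3))

    with torch.no_grad():
        ref_w, ref_l = err(ref_model, x_w), err(ref_model, x_l)
    cur_w, cur_l = err(eps_model, x_w), err(eps_model, x_l)

    # Reward lowering the error on the preferred sample (relative to the
    # reference) more than on the rejected one.
    inside = -beta * ((cur_w - ref_w) - (cur_l - ref_l))
    return -F.logsigmoid(inside).mean()
```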
Stable Diffusion For Aerial Object Detection
November 21, 2023
Yanan Jian, Fuxun Yu, Simranjit Singh, Dimitrios Stamoulis
Aerial object detection is a challenging task, in which one major obstacle
lies in the limitations of large-scale data collection and the long-tail
distribution of certain classes. Synthetic data offers a promising solution,
especially with recent advances in diffusion-based methods like stable
diffusion (SD). However, the direct application of diffusion methods to aerial
domains poses unique challenges: stable diffusion’s optimization for rich
ground-level semantics doesn’t align with the sparse nature of aerial objects,
and the extraction of post-synthesis object coordinates remains problematic. To
address these challenges, we introduce a synthetic data augmentation framework
tailored for aerial images. It encompasses sparse-to-dense region of interest
(ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model
with low-rank adaptation (LORA) to circumvent exhaustive retraining, and
finally, a Copy-Paste method to compose synthesized objects with backgrounds,
providing a nuanced approach to aerial object detection through synthetic data.
Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models
November 20, 2023
Rohit Gandikota, Joanna Materzynska, Tingrui Zhou, Antonio Torralba, David Bau
We present a method to create interpretable concept sliders that enable
precise control over attributes in image generations from diffusion models. Our
approach identifies a low-rank parameter direction corresponding to one concept
while minimizing interference with other attributes. A slider is created using
a small set of prompts or sample images; thus slider directions can be created
for either textual or visual concepts. Concept Sliders are plug-and-play: they
can be composed efficiently and continuously modulated, enabling precise
control over image generation. In quantitative experiments comparing to
previous editing techniques, our sliders exhibit stronger targeted edits with
lower interference. We showcase sliders for weather, age, styles, and
expressions, as well as slider compositions. We show how sliders can transfer
latents from StyleGAN for intuitive editing of visual concepts for which
textual description is difficult. We also find that our method can help address
persistent quality issues in Stable Diffusion XL including repair of object
deformations and fixing distorted hands. Our code, data, and trained sliders
are available at https://sliders.baulab.info/
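A minimal sketch of the slider mechanism as a scaled low-rank offset on a frozen linear layer; the rank, initialization, and training recipe are assumptions here, not the authors' method.
```python
import torch
import torch.nn as nn

class SliderLinear(nn.Module):
    """A slider as a low-rank weight offset: the frozen base projection is
    shifted by `scale * B @ A`, where (A, B) is a learned rank-r direction
    for one concept and `scale` is the user-controlled slider value
    (it can be negative to push the concept in the opposite direction)."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = 0.0                      # set at inference time

    def forward(self, x):
        delta = (x @ self.A.t()) @ self.B.t() # low-rank concept direction
        return self.base(x) + self.scale * delta
```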
FrePolad: Frequency-Rectified Point Latent Diffusion for Point Cloud Generation
November 20, 2023
Chenliang Zhou, Fangcheng Zhong, Param Hanji, Zhilin Guo, Kyle Fogarty, Alejandro Sztrajman, Hongyun Gao, Cengiz Oztireli
We propose FrePolad: frequency-rectified point latent diffusion, a point
cloud generation pipeline integrating a variational autoencoder (VAE) with a
denoising diffusion probabilistic model (DDPM) for the latent distribution.
FrePolad simultaneously achieves high quality, diversity, and flexibility in
point cloud cardinality for generation tasks while maintaining high
computational efficiency. The improvement in generation quality and diversity
is achieved through (1) a novel frequency rectification module via spherical
harmonics designed to retain high-frequency content while learning the point
cloud distribution; and (2) a latent DDPM to learn the regularized yet complex
latent distribution. In addition, FrePolad supports variable point cloud
cardinality by formulating the sampling of points as conditional distributions
over a latent shape distribution. Finally, the low-dimensional latent space
encoded by the VAE contributes to FrePolad’s fast and scalable sampling. Our
quantitative and qualitative results demonstrate the state-of-the-art
performance of FrePolad in terms of quality, diversity, and computational
efficiency.
Pyramid Diffusion for Fine 3D Large Scene Generation
November 20, 2023
Yuheng Liu, Xinke Li, Xueting Li, Lu Qi, Chongshou Li, Ming-Hsuan Yang
Directly transferring the 2D techniques to 3D scene generation is challenging
due to significant resolution reduction and the scarcity of comprehensive
real-world 3D scene datasets. To address these issues, our work introduces the
Pyramid Discrete Diffusion model (PDD) for 3D scene generation. This novel
approach employs a multi-scale model capable of progressively generating
high-quality 3D scenes from coarse to fine. In this way, the PDD can generate
high-quality scenes within limited resource constraints and does not require
additional data sources. To the best of our knowledge, we are the first to
adopt the simple but effective coarse-to-fine strategy for 3D large scene
generation. Our experiments, covering both unconditional and conditional
generation, have yielded impressive results, showcasing the model’s
effectiveness and robustness in generating realistic and detailed 3D scenes.
Our code will be available to the public.
Reti-Diff: Illumination Degradation Image Restoration with Retinex-based Latent Diffusion Model
November 20, 2023
Chunming He, Chengyu Fang, Yulun Zhang, Kai Li, Longxiang Tang, Chenyu You, Fengyang Xiao, Zhenhua Guo, Xiu Li
Illumination degradation image restoration (IDIR) techniques aim to improve
the visibility of degraded images and mitigate the adverse effects of
deteriorated illumination. Among these algorithms, diffusion model (DM)-based
methods have shown promising performance but are often burdened by heavy
computational demands and pixel misalignment issues when predicting the
image-level distribution. To tackle these problems, we propose to leverage DM
within a compact latent space to generate concise guidance priors and introduce
a novel solution called Reti-Diff for the IDIR task. Reti-Diff comprises two
key components: the Retinex-based latent DM (RLDM) and the Retinex-guided
transformer (RGformer). To ensure detailed reconstruction and illumination
correction, RLDM is empowered to acquire Retinex knowledge and extract
reflectance and illumination priors. These priors are subsequently utilized by
RGformer to guide the decomposition of image features into their respective
reflectance and illumination components. Following this, RGformer further
enhances and consolidates the decomposed features, resulting in the production
of refined images with consistent content and robustness to handle complex
degradation scenarios. Extensive experiments show that Reti-Diff outperforms
existing methods on three IDIR tasks, as well as downstream applications. Code
will be available at \url{https://github.com/ChunmingHe/Reti-Diff}.
Deep Equilibrium Diffusion Restoration with Parallel Sampling
November 20, 2023
Jiezhang Cao, Yue Shi, Kai Zhang, Yulun Zhang, Radu Timofte, Luc Van Gool
Diffusion-based image restoration (IR) methods aim to use diffusion models to
recover high-quality (HQ) images from degraded images and achieve promising
performance. Due to the inherent property of diffusion models, most of these
methods need long serial sampling chains to restore HQ images step-by-step. As
a result, it leads to expensive sampling time and high computation costs.
Moreover, such long sampling chains hinder understanding the relationship
between the restoration results and the inputs, since it is hard to compute
gradients over the whole chain. In this work, we aim to rethink the
diffusion-based IR models through a different perspective, i.e., a deep
equilibrium (DEQ) fixed point system. Specifically, we derive an analytical
solution by modeling the entire sampling chain in diffusion-based IR models as
a joint multivariate fixed point system. With the help of the analytical
solution, we are able to conduct single-image sampling in a parallel way and
restore HQ images without training. Furthermore, we compute fast gradients in
the DEQ and find that initialization optimization can boost performance and
control the generation direction. Extensive experiments on benchmarks
demonstrate the effectiveness of our proposed method on typical IR tasks and
real-world settings. The code and models will be made publicly available.
Advancing Urban Renewal: An Automated Approach to Generating Historical Arcade Facades with Stable Diffusion Models
November 20, 2023
Zheyuan Kuang, Jiaxin Zhang, Yiying Huang, Yunqin Li
Urban renewal and transformation processes necessitate the preservation of
the historical urban fabric, particularly in districts known for their
architectural and historical significance. These regions, with their diverse
architectural styles, have traditionally required extensive preliminary
research, often leading to subjective results. However, the advent of machine
learning models has opened up new avenues for generating building facade
images. Despite this, creating high-quality images for historical district
renovations remains challenging, due to the complexity and diversity inherent
in such districts. In response to these challenges, our study introduces a new
methodology for automatically generating images of historical arcade facades,
utilizing Stable Diffusion models conditioned on textual descriptions. By
classifying and tagging a variety of arcade styles, we have constructed several
realistic arcade facade image datasets. We trained multiple low-rank adaptation
(LoRA) models to control the stylistic aspects of the generated images,
supplemented by ControlNet models for improved precision and authenticity. Our
approach has demonstrated high levels of precision, authenticity, and diversity
in the generated images, showing promising potential for real-world urban
renewal projects. This new methodology offers a more efficient and accurate
alternative to conventional design processes in urban renewal, bypassing issues
of unconvincing image details, lack of precision, and limited stylistic
variety. Future research could focus on integrating this two-dimensional image
generation with three-dimensional modeling techniques, providing a more
comprehensive solution for renovating architectural facades in historical
districts.
Fast Controllable Diffusion Models for Undersampled MRI Reconstruction
November 20, 2023
Wei Jiang, Zhuang Xiong, Feng Liu, Nan Ye, Hongfu Sun
Supervised deep learning methods have shown promise in undersampled Magnetic
Resonance Imaging (MRI) reconstruction, but their requirement for paired data
limits their generalizability to the diverse MRI acquisition parameters.
Recently, unsupervised controllable generative diffusion models have been
applied to undersampled MRI reconstruction, without paired data or model
retraining for different MRI acquisitions. However, diffusion models are
generally slow in sampling and state-of-the-art acceleration techniques can
lead to sub-optimal results when directly applied to the controllable
generation process. This study introduces a new algorithm called
Predictor-Projector-Noisor (PPN), which enhances and accelerates controllable
generation of diffusion models for undersampled MRI reconstruction. Our results
demonstrate that PPN produces high-fidelity MR images that conform to
undersampled k-space measurements with significantly shorter reconstruction
time than other controllable sampling methods. In addition, the unsupervised
PPN accelerated diffusion models are adaptable to different MRI acquisition
parameters, making them more practical for clinical use than supervised
learning techniques.
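The measurement-consistency step at the heart of such controllable MRI sampling can be sketched as a k-space projection (single-coil, noiseless case assumed; the actual PPN algorithm wraps predictor and noisor stages around it):
```python
import torch

def kspace_projection(x, measured_kspace, mask):
    """Keep the model's prediction at unmeasured k-space locations and
    overwrite measured locations with the acquired data. `mask` is a 0/1
    float tensor that is 1 where k-space was sampled."""
    k = torch.fft.fft2(x, norm="ortho")
    k = mask * measured_kspace + (1 - mask) * k   # enforce data consistency
    return torch.fft.ifft2(k, norm="ortho").real
```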
Gaussian Interpolation Flows
November 20, 2023
Yuan Gao, Jian Huang, Yuling Jiao
Gaussian denoising has emerged as a powerful principle for constructing
simulation-free continuous normalizing flows for generative modeling. Despite
their empirical successes, theoretical properties of these flows and the
regularizing effect of Gaussian denoising have remained largely unexplored. In
this work, we aim to address this gap by investigating the well-posedness of
simulation-free continuous normalizing flows built on Gaussian denoising.
Through a unified framework termed Gaussian interpolation flow, we establish
the Lipschitz regularity of the flow velocity field, the existence and
uniqueness of the flow, and the Lipschitz continuity of the flow map and the
time-reversed flow map for several rich classes of target distributions. This
analysis also sheds light on the auto-encoding and cycle-consistency properties
of Gaussian interpolation flows. Additionally, we delve into the stability of
these flows in source distributions and perturbations of the velocity field,
using the quadratic Wasserstein distance as a metric. Our findings offer
valuable insights into the learning techniques employed in Gaussian
interpolation flows for generative modeling, providing a solid theoretical
foundation for end-to-end error analyses of learning GIFs with empirical
observations.
FDDM: Unsupervised Medical Image Translation with a Frequency-Decoupled Diffusion Model
November 19, 2023
Yunxiang Li, Hua-Chieh Shao, Xiaoxue Qian, You Zhang
Diffusion models have demonstrated significant potential in producing
high-quality images for medical image translation to aid disease diagnosis,
localization, and treatment. Nevertheless, current diffusion models have
limited success in achieving faithful image translations that can accurately
preserve the anatomical structures of medical images, especially for unpaired
datasets. The preservation of structural and anatomical details is essential to
reliable medical diagnosis and treatment planning, as structural mismatches can
lead to disease misidentification and treatment errors. In this study, we
introduced a frequency-decoupled diffusion model (FDDM), a novel framework that
decouples the frequency components of medical images in the Fourier domain
during the translation process, to allow structure-preserved high-quality image
conversion. FDDM applies an unsupervised frequency conversion module to
translate the source medical images into frequency-specific outputs and then
uses the frequency-specific information to guide a following diffusion model
for final source-to-target image translation. We conducted extensive
evaluations of FDDM using a public brain MR-to-CT translation dataset, showing
its superior performance against other GAN-, VAE-, and diffusion-based models.
Metrics including the Frechet inception distance (FID), the peak
signal-to-noise ratio (PSNR), and the structural similarity index measure
(SSIM) were assessed. FDDM achieves an FID of 29.88, less than half of the
second best. These results demonstrated FDDM’s prowess in generating
highly-realistic target-domain images while maintaining the faithfulness of
translated anatomical structures.
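To illustrate the frequency decoupling, the sketch below splits an image into low- and high-frequency components with a fixed radial cutoff in the Fourier domain; FDDM's conversion module is learned, so this is only the fixed-filter analogue.
```python
import torch

def split_frequencies(image, cutoff=0.1):
    """Split an image into low-frequency (structure) and high-frequency
    (detail) components; `cutoff` is a fraction of the Nyquist radius."""
    k = torch.fft.fftshift(torch.fft.fft2(image, norm="ortho"), dim=(-2, -1))
    h, w = image.shape[-2:]
    yy, xx = torch.meshgrid(
        torch.linspace(-1, 1, h, device=image.device),
        torch.linspace(-1, 1, w, device=image.device),
        indexing="ij",
    )
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).float()
    low_k, high_k = k * low_mask, k * (1 - low_mask)

    def back(kk):
        return torch.fft.ifft2(torch.fft.ifftshift(kk, dim=(-2, -1)), norm="ortho").real

    return back(low_k), back(high_k)
```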
A Survey of Emerging Applications of Diffusion Probabilistic Models in MRI
November 19, 2023
Yuheng Fan, Hanxi Liao, Shiqi Huang, Yimin Luo, Huazhu Fu, Haikun Qi
Diffusion probabilistic models (DPMs) which employ explicit likelihood
characterization and a gradual sampling process to synthesize data, have gained
increasing research interest. Despite their huge computational burdens due to
the large number of steps involved during sampling, DPMs are widely appreciated
in various medical imaging tasks for their high-quality and diversity of
generation. Magnetic resonance imaging (MRI) is an important medical imaging
modality with excellent soft tissue contrast and superb spatial resolution,
which possesses unique opportunities for diffusion models. Although there is a
recent surge of studies exploring DPMs in MRI, a survey paper of DPMs
specifically designed for MRI applications is still lacking. This review
article aims to help researchers in the MRI community to grasp the advances of
DPMs in different applications. We first introduce the theory of two dominant
kinds of DPMs, categorized according to whether the diffusion time step is
discrete or continuous, and then provide a comprehensive review of emerging
DPMs in MRI, including reconstruction, image generation, image translation,
segmentation, anomaly detection, and further research topics. Finally, we
discuss the general limitations as well as limitations specific to the MRI
tasks of DPMs and point out potential areas that are worth further exploration.
GaussianDiffusion: 3D Gaussian Splatting for Denoising Diffusion Probabilistic Models with Structured Noise
November 19, 2023
Xinhai Li, Huaibin Wang, Kuo-Kun Tseng
Text-to-3D, known for its efficient generation methods and expansive creative
potential, has garnered significant attention in the AIGC domain. However, the
amalgamation of NeRF and 2D diffusion models frequently yields oversaturated
images, posing severe limitations on downstream industrial applications due to
the constraints of the pixelwise rendering method. Gaussian splatting has recently
superseded the traditional pointwise sampling technique prevalent in NeRF-based
methodologies, revolutionizing various aspects of 3D reconstruction. This paper
introduces a novel text-to-3D content generation framework based on Gaussian
splatting, enabling fine control over image saturation through individual
Gaussian sphere transparencies, thereby producing more realistic images. The
challenge of achieving multi-view consistency in 3D generation significantly
impedes modeling complexity and accuracy. Taking inspiration from SJC, we
explore employing multi-view noise distributions to perturb images generated by
3D Gaussian splatting, aiming to rectify inconsistencies in multi-view
geometry. We ingeniously devise an efficient method to generate noise that
produces Gaussian noise from diverse viewpoints, all originating from a shared
noise source. Furthermore, vanilla 3D Gaussian-based generation tends to trap
models in local minima, causing artifacts like floaters, burrs, or
proliferative elements. To mitigate these issues, we propose the variational
Gaussian splatting technique to enhance the quality and stability of 3D
appearance. To our knowledge, our approach represents the first comprehensive
utilization of Gaussian splatting across the entire spectrum of 3D content
generation processes.
Wasserstein Convergence Guarantees for a General Class of Score-Based Generative Models
November 18, 2023
Xuefeng Gao, Hoang M. Nguyen, Lingjiong Zhu
Score-based generative models (SGMs) are a recent class of deep generative
models with state-of-the-art performance in many applications. In this paper,
we establish convergence guarantees for a general class of SGMs in
2-Wasserstein distance, assuming accurate score estimates and smooth
log-concave data distribution. We specialize our result to several concrete
SGMs with specific choices of forward processes modelled by stochastic
differential equations, and obtain an upper bound on the iteration complexity
for each model, which demonstrates the impacts of different choices of the
forward processes. We also provide a lower bound when the data distribution is
Gaussian. Numerically, we experiment with SGMs using different forward processes,
some of which are newly proposed in this paper, for unconditional image
generation on CIFAR-10. We find that the experimental results are in good
agreement with our theoretical predictions on the iteration complexity, and the
models with our newly proposed forward processes can outperform existing
models.
SDDPM: Speckle Denoising Diffusion Probabilistic Models
November 17, 2023
Soumee Guha, Scott T. Acton
Coherent imaging systems, such as medical ultrasound and synthetic aperture
radar (SAR), are subject to corruption from speckle due to sub-resolution
scatterers. Since speckle is multiplicative in nature, the constituent image
regions become corrupted to different extents. The task of denoising such
images requires algorithms specifically designed for removing signal-dependent
noise. This paper proposes a novel image denoising algorithm for removing
signal-dependent multiplicative noise with diffusion models, called Speckle
Denoising Diffusion Probabilistic Models (SDDPM). We derive the mathematical
formulations for the forward process, the reverse process, and the training
objective. In the forward process, we apply multiplicative noise to a given
image and prove that the forward process is Gaussian. We show that the reverse
process is also Gaussian and the final training objective can be expressed as
the Kullback Leibler (KL) divergence between the forward and reverse processes.
As derived in the paper, the final denoising task is a single-step process,
thereby reducing the denoising time significantly. We have trained our model
with natural land-use images and ultrasound images for different noise levels.
Extensive experiments centered around two different applications show that
SDDPM is robust and performs significantly better than the comparative models
even when the images are severely corrupted.
K-space Cold Diffusion: Learning to Reconstruct Accelerated MRI without Noise
November 16, 2023
Guoyao Shen, Mengyu Li, Chad W. Farris, Stephan Anderson, Xin Zhang
eess.IV, cs.CV, cs.LG, physics.med-ph
Deep learning-based MRI reconstruction models now achieve superior
performance. Most recently, diffusion models have shown remarkable
performance in image generation, in-painting, super-resolution, image editing
and more. As a generalized diffusion model, cold diffusion further broadens the
scope and considers models built around arbitrary image transformations such as
blurring, down-sampling, etc. In this paper, we propose a k-space cold
diffusion model that performs image degradation and restoration in k-space
without the need for Gaussian noise. We provide comparisons with multiple deep
learning-based MRI reconstruction models and perform tests on a well-known
large open-source MRI dataset. Our results show that this novel way of
performing degradation can generate high-quality reconstruction images for
accelerated MRI.
The Chosen One: Consistent Characters in Text-to-Image Diffusion Models
November 16, 2023
Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen-Or, Dani Lischinski
Recent advances in text-to-image generation models have unlocked vast
potential for visual creativity. However, these models struggle with generation
of consistent characters, a crucial aspect for numerous real-world applications
such as story visualization, game development asset design, advertising, and
more. Current methods typically rely on multiple pre-existing images of the
target character or involve labor-intensive manual processes. In this work, we
propose a fully automated solution for consistent character generation, with
the sole input being a text prompt. We introduce an iterative procedure that,
at each stage, identifies a coherent set of images sharing a similar identity
and extracts a more consistent identity from this set. Our quantitative
analysis demonstrates that our method strikes a better balance between prompt
alignment and identity consistency compared to the baseline methods, and these
findings are reinforced by a user study. To conclude, we showcase several
practical applications of our approach. Project page is available at
https://omriavrahami.com/the-chosen-one
Score-based generative models learn manifold-like structures with constrained mixing
November 16, 2023
Li Kevin Wenliang, Ben Moran
How do score-based generative models (SBMs) learn the data distribution
supported on a low-dimensional manifold? We investigate the score model of a
trained SBM through its linear approximations and subspaces spanned by local
feature vectors. During diffusion as the noise decreases, the local
dimensionality increases and becomes more varied between different sample
sequences. Importantly, we find that the learned vector field mixes samples by
a non-conservative field within the manifold, although it denoises with normal
projections as if there is an energy function in off-manifold directions. At
each noise level, the subspace spanned by the local features overlap with an
effective density function. These observations suggest that SBMs can flexibly
mix samples with the learned score field while carefully maintaining a
manifold-like structure of the data distribution.
DSR-Diff: Depth Map Super-Resolution with Diffusion Model
November 16, 2023
Yuan Shi, Bin Xia, Rui Zhu, Qingmin Liao, Wenming Yang
Color-guided depth map super-resolution (CDSR) improves the spatial resolution
of a low-quality depth map with the corresponding high-quality color map,
benefiting various applications such as 3D reconstruction, virtual reality, and
augmented reality. While conventional CDSR methods typically rely on
convolutional neural networks or transformers, diffusion models (DMs) have
demonstrated notable effectiveness in high-level vision tasks. In this work, we
present a novel CDSR paradigm that utilizes a diffusion model within the latent
space to generate guidance for depth map super-resolution. The proposed method
comprises a guidance generation network (GGN), a depth map super-resolution
network (DSRN), and a guidance recovery network (GRN). The GGN is specifically
designed to generate the guidance while managing its compactness. Additionally,
we integrate a simple but effective feature fusion module and a
transformer-style feature extraction module into the DSRN, enabling it to
leverage guided priors in the extraction, fusion, and reconstruction of
multi-modal images. Taking into account both accuracy and efficiency, our
proposed method has shown superior performance in extensive experiments when
compared to state-of-the-art methods. Our codes will be made available at
https://github.com/shiyuan7/DSR-Diff.
Diffusion-Augmented Neural Processes
November 16, 2023
Lorenzo Bonito, James Requeima, Aliaksandra Shysheya, Richard E. Turner
Over the last few years, Neural Processes have become a useful modelling tool
in many application areas, such as healthcare and climate sciences, in which
data are scarce and prediction uncertainty estimates are indispensable.
However, the current state of the art in the field (AR CNPs; Bruinsma et al.,
2023) presents a few issues that prevent its widespread deployment. This work
proposes an alternative, diffusion-based approach to NPs which, through
conditioning on noised datasets, addresses many of these limitations, whilst
also exceeding SOTA performance.
DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics
November 16, 2023
Aniket Roy, Maiterya Suin, Anshul Shah, Ketul Shah, Jiang Liu, Rama Chellappa
Diffusion models have advanced generative AI significantly in terms of
editing and creating naturalistic images. However, efficiently improving
generated image quality is still of paramount interest. In this context, we
propose a generic “naturalness” preserving loss function, viz., kurtosis
concentration (KC) loss, which can be readily applied to any standard diffusion
model pipeline to elevate the image quality. Our motivation stems from the
projected kurtosis concentration property of natural images, which states that
natural images have nearly constant kurtosis values across different band-pass
versions of the image. To retain the “naturalness” of the generated images, we
enforce reducing the gap between the highest and lowest kurtosis values across
the band-pass versions (e.g., Discrete Wavelet Transform (DWT)) of images. Note
that our approach does not require any additional guidance like classifier or
classifier-free guidance to improve the image quality. We validate the proposed
approach for three diverse tasks, viz., (1) personalized few-shot finetuning
using text guidance, (2) unconditional image generation, and (3) image
super-resolution. Integrating the proposed KC loss has improved the perceptual
quality across all these tasks in terms of FID, MUSIQ score, and user
evaluation.
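A hedged sketch of a kurtosis-concentration style penalty: band-pass versions of the image are built here with a simple Laplacian-pyramid decomposition (the paper uses a DWT), and the loss is the spread between the largest and smallest band kurtosis.
```python
import torch
import torch.nn.functional as F

def kurtosis(x, eps=1e-6):
    """Sample kurtosis E[(x-mu)^4] / sigma^4 over all pixels, per image."""
    x = x.flatten(1)
    mu = x.mean(dim=1, keepdim=True)
    var = x.var(dim=1, unbiased=False, keepdim=True)
    return (((x - mu) ** 4).mean(dim=1, keepdim=True) / (var ** 2 + eps)).squeeze(1)

def kc_loss(image, levels=3):
    """Penalise the spread of kurtosis values across band-pass versions of
    the image (assumes even spatial dimensions for the pooling)."""
    bands, cur = [], image
    for _ in range(levels):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:], mode="bilinear",
                           align_corners=False)
        bands.append(cur - up)           # band-pass residual at this scale
        cur = down
    k = torch.stack([kurtosis(b) for b in bands], dim=1)   # (B, levels)
    return (k.max(dim=1).values - k.min(dim=1).values).mean()
```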
Privacy Threats in Stable Diffusion Models
November 15, 2023
Thomas Cilloni, Charles Fleming, Charles Walter
This paper introduces a novel approach to membership inference attacks (MIA)
targeting stable diffusion computer vision models, specifically focusing on the
highly sophisticated Stable Diffusion V2 by StabilityAI. MIAs aim to extract
sensitive information about a model’s training data, posing significant privacy
concerns. Despite its advancements in image synthesis, our research reveals
privacy vulnerabilities in the stable diffusion models’ outputs. Exploiting
this information, we devise a black-box MIA that only needs to query the victim
model repeatedly. Our methodology involves observing the output of a stable
diffusion model at different generative epochs and training a classification
model to distinguish when a series of intermediates originated from a training
sample or not. We propose numerous ways to measure the membership features and
discuss what works best. The attack’s efficacy is assessed using the ROC AUC
method, demonstrating a 60\% success rate in inferring membership information.
This paper contributes to the growing body of research on privacy and security
in machine learning, highlighting the need for robust defenses against MIAs.
Our findings prompt a reevaluation of the privacy implications of stable
diffusion models, urging practitioners and developers to implement enhanced
security measures to safeguard against such attacks.
DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model
November 15, 2023
Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, Kai Zhang
We propose \textbf{DMV3D}, a novel 3D generation approach that uses a
transformer-based 3D large reconstruction model to denoise multi-view
diffusion. Our reconstruction model incorporates a triplane NeRF representation
and can denoise noisy multi-view images via NeRF reconstruction and rendering,
achieving single-stage 3D generation in $\sim$30s on a single A100 GPU. We train
\textbf{DMV3D} on large-scale multi-view image datasets of highly diverse
objects using only image reconstruction losses, without accessing 3D assets. We
demonstrate state-of-the-art results for the single-image reconstruction
problem where probabilistic modeling of unseen object parts is required for
generating diverse reconstructions with sharp textures. We also show
high-quality text-to-3D generation results outperforming previous 3D diffusion
models. Our project website is at: https://justimyhxu.github.io/projects/dmv3d/ .
A Spectral Diffusion Prior for Hyperspectral Image Super-Resolution
November 15, 2023
Jianjun Liu, Zebin Wu, Liang Xiao
Fusion-based hyperspectral image (HSI) super-resolution aims to produce a
high-spatial-resolution HSI by fusing a low-spatial-resolution HSI and a
high-spatial-resolution multispectral image. Such an HSI super-resolution
process can be modeled as an inverse problem, where the prior knowledge is
essential for obtaining the desired solution. Motivated by the success of
diffusion models, we propose a novel spectral diffusion prior for fusion-based
HSI super-resolution. Specifically, we first investigate the spectrum
generation problem and design a spectral diffusion model to model the spectral
data distribution. Then, within the maximum a posteriori framework, we keep the
transition information between every two neighboring states during the reverse
generative process, thereby embedding the knowledge of the trained spectral
diffusion model into the fusion problem as a regularization term. Finally, we
treat each generation step of the final optimization problem as a subproblem
and employ Adam to solve these subproblems in reverse sequence. Experimental
results on both synthetic and real datasets
demonstrate the effectiveness of the proposed approach. The code of the
proposed approach will be available on https://github.com/liuofficial/SDP.
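As a rough illustration of the optimization scheme sketched above, the snippet below treats each reverse step as a small sub-problem that balances spatial and spectral data fidelity against a regularizer pulling the estimate toward the diffusion prior's transition, solved with a few Adam iterations. The operators `spatial_degrade`, `spectral_degrade`, and `prior_denoise`, as well as the hyper-parameters, are placeholders rather than the paper's exact formulation.

```python
# Illustrative sketch; operators and hyper-parameters are placeholders.
import torch

def fuse_hsi(y_lr_hsi, y_hr_msi, x_init, prior_denoise,
             spatial_degrade, spectral_degrade,
             T=50, inner_iters=10, lam=0.1, lr=1e-2):
    x = x_init.clone()
    for t in reversed(range(T)):                        # reverse sequence of steps
        x_prior = prior_denoise(x.detach(), t)          # transition from the prior
        x = x.detach().requires_grad_(True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(inner_iters):                    # solve this step's subproblem
            loss = ((spatial_degrade(x) - y_lr_hsi) ** 2).mean() \
                 + ((spectral_degrade(x) - y_hr_msi) ** 2).mean() \
                 + lam * ((x - x_prior) ** 2).mean()    # diffusion-prior regularizer
            opt.zero_grad()
            loss.backward()
            opt.step()
    return x.detach()
```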
A Diffusion Model Based Quality Enhancement Method for HEVC Compressed Video
November 15, 2023
Zheng Liu, Honggang Qi
Video post-processing methods can improve the quality of compressed videos at
the decoder side. Most of the existing methods need to train corresponding
models for compressed videos with different quantization parameters to improve
the quality of compressed videos. However, in most cases, the quantization
parameters of the decoded video are unknown. This makes existing methods have
their limitations in improving video quality. To tackle this problem, this work
proposes a diffusion model based post-processing method for compressed videos.
The proposed method first estimates the feature vectors of the compressed video
and then uses the estimated feature vectors as the prior information for the
quality enhancement model to adaptively enhance the quality of compressed video
with different quantization parameters. Experimental results show that the
quality enhancement results of our proposed method on mixed datasets are
superior to existing methods.
Towards Graph-Aware Diffusion Modeling for Collaborative Filtering
November 15, 2023
Yunqin Zhu, Chao Wang, Hui Xiong
Recovering masked feedback with neural models is a popular paradigm in
recommender systems. Seeing the success of diffusion models in solving
ill-posed inverse problems, we introduce a conditional diffusion framework for
collaborative filtering that iteratively reconstructs a user’s hidden
preferences guided by its historical interactions. To better align with the
intrinsic characteristics of implicit feedback data, we implement forward
diffusion by applying synthetic smoothing filters to interaction signals on an
item-item graph. The resulting reverse diffusion can be interpreted as a
personalized process that gradually refines preference scores. Through graph
Fourier transform, we equivalently characterize this model as an anisotropic
Gaussian diffusion in the graph spectral domain, establishing both forward and
reverse formulations. Our model outperforms state-of-the-art methods by a large
margin on one dataset and yields competitive results on the others.
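A small, self-contained sketch of what graph-spectral forward smoothing could look like for a single user's interaction vector is given below: the signal is moved into the graph Fourier domain of an item-item graph, attenuated more strongly at high graph frequencies as the step t grows, and perturbed with frequency-dependent (anisotropic) noise. The specific filter and noise schedule are illustrative assumptions, not the paper's exact parameterization.

```python
# Illustrative sketch; the filter and noise schedule are assumptions.
import numpy as np

def graph_fourier_basis(adj):
    """Eigendecomposition of the normalized Laplacian of an item-item graph."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-8)))
    laplacian = np.eye(adj.shape[0]) - d_inv_sqrt @ adj @ d_inv_sqrt
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # graph frequencies / GFT basis
    return eigvals, eigvecs

def forward_smooth(x, eigvals, eigvecs, t, noise_scale=0.1):
    """One forward smoothing step for an interaction vector x over the items."""
    x_hat = eigvecs.T @ x                              # graph Fourier transform
    decay = np.exp(-t * eigvals)                       # low-pass smoothing filter
    noise_std = noise_scale * np.sqrt(1.0 - decay**2)  # anisotropic per-frequency noise
    x_hat_t = decay * x_hat + noise_std * np.random.randn(*x_hat.shape)
    return eigvecs @ x_hat_t                           # back to the item (vertex) domain
```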
EDMSound: Spectrogram Based Diffusion Models for Efficient and High-Quality Audio Synthesis
November 15, 2023
Ge Zhu, Yutong Wen, Marc-André Carbonneau, Zhiyao Duan
Audio diffusion models can synthesize a wide variety of sounds. Existing
models often operate in the latent domain with cascaded phase recovery modules
to reconstruct the waveform, which poses challenges when generating
high-fidelity audio. In this paper, we propose EDMSound, a diffusion-based
generative model in the spectrogram domain under the framework of elucidated
diffusion models (EDM). Combined with an efficient deterministic sampler, we
achieve a Fréchet audio distance (FAD) score similar to the top-ranked baseline
with only 10 steps and reach state-of-the-art performance with 50 steps on the
DCASE2023 foley sound generation benchmark. We also reveal a potential concern
with diffusion-based audio generation models: they tend to generate samples
with high perceptual similarity to their training data. Project page:
https://agentcooper2002.github.io/EDMSound/
I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
November 07, 2023
Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, Jingren Zhou
Video synthesis has recently made remarkable strides benefiting from the
rapid development of diffusion models. However, it still encounters challenges
in terms of semantic accuracy, clarity, and spatio-temporal continuity. These
challenges primarily arise from the scarcity of well-aligned text-video data and the
complex inherent structure of videos, making it difficult for the model to
simultaneously ensure semantic and qualitative excellence. In this report, we
propose a cascaded I2VGen-XL approach that enhances model performance by
decoupling these two factors and ensures the alignment of the input data by
utilizing static images as a form of crucial guidance. I2VGen-XL consists of
two stages: i) the base stage guarantees coherent semantics and preserves
content from input images by using two hierarchical encoders, and ii) the
refinement stage enhances the video’s details by incorporating an additional
brief text and improves the resolution to 1280$\times$720. To improve the
diversity, we collect around 35 million single-shot text-video pairs and 6
billion text-image pairs to optimize the model. By this means, I2VGen-XL can
simultaneously enhance the semantic accuracy, continuity of details and clarity
of generated videos. Through extensive experiments, we have investigated the
underlying principles of I2VGen-XL and compared it with current top methods,
which can demonstrate its effectiveness on diverse data. The source code and
models will be publicly available at \url{https://i2vgen-xl.github.io}.
Generative learning for nonlinear dynamics
November 07, 2023
William Gilpin
cs.LG, nlin.CD, physics.comp-ph
Modern generative machine learning models demonstrate surprising ability to
create realistic outputs far beyond their training data, such as photorealistic
artwork, accurate protein structures, or conversational text. These successes
suggest that generative models learn to effectively parametrize and sample
arbitrarily complex distributions. Beginning half a century ago, foundational
works in nonlinear dynamics used tools from information theory to infer
properties of chaotic attractors from time series, motivating the development
of algorithms for parametrizing chaos in real datasets. In this perspective, we
aim to connect these classical works to emerging themes in large-scale
generative statistical learning. We first consider classical attractor
reconstruction, which mirrors constraints on latent representations learned by
state space models of time series. We next revisit early efforts to use
symbolic approximations to compare minimal discrete generators underlying
complex processes, a problem relevant to modern efforts to distill and
interpret black-box statistical models. Emerging interdisciplinary works bridge
nonlinear dynamics and learning theory, such as operator-theoretic methods for
complex fluid flows, or detection of broken detailed balance in biological
datasets. We anticipate that future machine learning techniques may revisit
other classical concepts from nonlinear dynamics, such as transinformation
decay and complexity-entropy tradeoffs.
Generative Structural Design Integrating BIM and Diffusion Model
November 07, 2023
Zhili He, Yu-Hsing Wang, Jian Zhang
Intelligent structural design using AI can effectively reduce time overhead
and increase efficiency. It has the potential to become the future design
paradigm, assisting and even replacing engineers, and has thus become a
research hotspot in the academic community. However, current methods have some
limitations to be addressed, whether in terms of application scope, visual
quality of generated results, or evaluation metrics of results. This study
proposes a comprehensive solution. Firstly, we introduce building information
modeling (BIM) into intelligent structural design and establish a structural
design pipeline integrating BIM and generative AI, which is a powerful
supplement to the previous frameworks that only considered CAD drawings. In
order to improve the perceptual quality and details of generations, this study
makes 3 contributions. Firstly, in terms of generation framework, inspired by
the process of human drawing, a novel 2-stage generation framework is proposed
to replace the traditional end-to-end framework to reduce the generation
difficulty for AI models. Secondly, in terms of generative AI tools adopted,
diffusion models (DMs) are introduced to replace widely used generative
adversarial network (GAN)-based models, and a novel physics-based conditional
diffusion model (PCDM) is proposed to consider different design prerequisites.
Thirdly, in terms of neural networks, an attention block (AB) consisting of a
self-attention block (SAB) and a parallel cross-attention block (PCAB) is
designed to facilitate cross-domain data fusion. The quantitative and
qualitative results demonstrate the powerful generation and representation
capabilities of PCDM. Necessary ablation studies are conducted to examine the
validity of the methods. This study also shows that DMs have the potential to
replace GANs and become the new benchmark for generative problems in civil
engineering.
November 07, 2023
Pengze Zhang, Hubery Yin, Chen Li, Xiaohua Xie
Continuous diffusion models are commonly acknowledged to display a
deterministic probability flow, whereas discrete diffusion models do not. In
this paper, we aim to establish the fundamental theory for the probability flow
of discrete diffusion models. Specifically, we first prove that the continuous
probability flow is the Monge optimal transport map under certain conditions,
and also present an equivalent evidence for discrete cases. In view of these
findings, we are then able to define the discrete probability flow in line with
the principles of optimal transport. Finally, drawing upon our newly
established definitions, we propose a novel sampling method that surpasses
previous discrete diffusion models in its ability to generate more certain
outcomes. Extensive experiments on the synthetic toy dataset and the CIFAR-10
dataset have validated the effectiveness of our proposed discrete probability
flow. Code is released at:
https://github.com/PangzeCheung/Discrete-Probability-Flow.
Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models
November 07, 2023
Shengzhe Zhou, Zejian Lee, Shengyuan Zhang, Lefan Hou, Changyuan Yang, Guang Yang, Lingyun Sun
Denoising Diffusion models have exhibited remarkable capabilities in image
generation. However, generating high-quality samples requires a large number of
iterations. Knowledge distillation for diffusion models is an effective method
to address this limitation with a shortened sampling process but causes
degraded generative quality. Based on our analysis with bias-variance
decomposition and experimental observations, we attribute the degradation to
the spatial fitting error occurring in the training of both the teacher and
student model. Accordingly, we propose $\textbf{S}$patial
$\textbf{F}$itting-$\textbf{E}$rror $\textbf{R}$eduction
$\textbf{D}$istillation model ($\textbf{SFERD}$). SFERD utilizes attention
guidance from the teacher model and a designed semantic gradient predictor to
reduce the student’s fitting error. Empirically, our proposed model facilitates
high-quality sample generation in a few function evaluations. We achieve an FID
of 5.31 on CIFAR-10 and 9.39 on ImageNet 64$\times$64 with only one step,
outperforming existing diffusion methods. Our study provides a new perspective
on diffusion distillation by highlighting the intrinsic denoising ability of
models.
Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems
November 06, 2023
Derek Lilienthal, Paul Mello, Magdalini Eirinaki, Stas Tiomkin
cs.IR, cs.AI, cs.CR, cs.LG
While recommender systems have become an integral component of the Web
experience, their heavy reliance on user data raises privacy and security
concerns. Substituting user data with synthetic data can address these
concerns, but accurately replicating these real-world datasets has been a
notoriously challenging problem. Recent advancements in generative AI have
demonstrated the impressive capabilities of diffusion models in generating
realistic data across various domains. In this work we introduce a Score-based
Diffusion Recommendation Module (SDRM), which captures the intricate patterns
of real-world datasets required for training highly accurate recommender
systems. SDRM allows for the generation of synthetic data that can replace
existing datasets to preserve user privacy, or augment existing datasets to
address excessive data sparsity. Our method outperforms competing baselines
such as generative adversarial networks, variational autoencoders, and recently
proposed diffusion models in synthesizing various datasets to replace or
augment the original data, with an average improvement of 4.30% in Recall@$k$
and 4.65% in NDCG@$k$.
TS-Diffusion: Generating Highly Complex Time Series with Diffusion Models
November 06, 2023
Yangming Li
While current generative models have achieved promising performances in
time-series synthesis, they either make strong assumptions on the data format
(e.g., regularities) or rely on pre-processing approaches (e.g.,
interpolations) to simplify the raw data. In this work, we consider a class of
time series with three common challenging properties, namely sampling
irregularities, missingness, and large feature-temporal dimensions, and
introduce a general model, TS-Diffusion, to process such complex time series.
Our model consists of three parts under the point-process framework. The
first part is a neural ordinary differential equation (ODE) encoder that
converts time series into dense representations, with a jump technique to
capture sampling irregularities and a self-attention mechanism to handle
missing values. The second component of TS-Diffusion is a diffusion model that
learns from these time-series representations, which can have a complex
distribution because of their high dimensionality. The third part is a decoder,
another ODE, that generates time series with irregularities and missing values
given their representations. We
have conducted extensive experiments on multiple time-series datasets,
demonstrating that TS-Diffusion achieves excellent results on both conventional
and complex time series and significantly outperforms previous baselines.
LDM3D-VR: Latent Diffusion Model for 3D VR
November 06, 2023
Gabriela Ben Melech Stan, Diana Wofk, Estelle Aflalo, Shao-Yen Tseng, Zhipeng Cai, Michael Paulitsch, Vasudev Lal
Latent diffusion models have proven to be state-of-the-art in the creation
and manipulation of visual outputs. However, as far as we know, the generation
of depth maps jointly with RGB is still limited. We introduce LDM3D-VR, a suite
of diffusion models targeting virtual reality development that includes
LDM3D-pano and LDM3D-SR. These models enable the generation of panoramic RGBD
based on textual prompts and the upscaling of low-resolution inputs to
high-resolution RGBD, respectively. Our models are fine-tuned from existing
pretrained models on datasets containing panoramic/high-resolution RGB images,
depth maps and captions. Both models are evaluated in comparison to existing
related methods.
A Two-Stage Generative Model with CycleGAN and Joint Diffusion for MRI-based Brain Tumor Detection
November 06, 2023
Wenxin Wang, Zhuo-Xu Cui, Guanxun Cheng, Chentao Cao, Xi Xu, Ziwei Liu, Haifeng Wang, Yulong Qi, Dong Liang, Yanjie Zhu
Accurate detection and segmentation of brain tumors is critical for medical
diagnosis. However, current supervised learning methods require extensively
annotated images and the state-of-the-art generative models used in
unsupervised methods often have limitations in covering the whole data
distribution. In this paper, we propose a novel framework Two-Stage Generative
Model (TSGM) that combines Cycle Generative Adversarial Network (CycleGAN) and
Variance Exploding stochastic differential equation using joint probability
(VE-JP) to improve brain tumor detection and segmentation. The CycleGAN is
trained on unpaired data to generate abnormal images from healthy images as a
data prior. Then VE-JP is applied to reconstruct healthy images using
synthetic paired abnormal images as a guide, altering only pathological
regions while leaving healthy regions unchanged. Notably, our method directly
learns the joint probability distribution for conditional generation. The
residual between the input and reconstructed images indicates abnormalities,
and a thresholding method is subsequently applied to obtain segmentation
results. Furthermore, the multimodal results are combined with different
weights to further improve segmentation accuracy. We validated our method on
three datasets and compared it with other unsupervised methods for anomaly
detection and segmentation. DSC scores of 0.8590 on the BraTS2020 dataset,
0.6226 on the ITCS dataset, and 0.7403 on an in-house dataset show that our
method achieves better segmentation performance and generalizes better.
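The residual-and-threshold step described above can be sketched in a few lines; `reconstruct_healthy` stands in for the VE-JP reverse process, and the modality weights and threshold are illustrative placeholders.

```python
# Illustrative sketch; reconstruct_healthy, weights, and threshold are assumed.
import numpy as np

def segment_anomaly(inputs, reconstruct_healthy, weights=None, threshold=0.5):
    """inputs: dict mapping modality name -> image array."""
    if weights is None:
        weights = {m: 1.0 / len(inputs) for m in inputs}    # equal weighting by default
    residual = sum(weights[m] * np.abs(inputs[m] - reconstruct_healthy(inputs[m]))
                   for m in inputs)                          # weighted multimodal residual
    residual = (residual - residual.min()) / (residual.max() - residual.min() + 1e-8)
    return residual > threshold                              # binary tumor mask
```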
Diffusion-based Radiotherapy Dose Prediction Guided by Inter-slice Aware Structure Encoding
November 06, 2023
Zhenghao Feng, Lu Wen, Jianghong Xiao, Yuanyuan Xu, Xi Wu, Jiliu Zhou, Xingchen Peng, Yan Wang
Deep learning (DL) has successfully automated dose distribution prediction in
radiotherapy planning, enhancing both efficiency and quality. However, existing
methods suffer from over-smoothing because their commonly used L1 or L2 losses
amount to posterior averaging. To alleviate this limitation, we
propose a diffusion model-based method (DiffDose) for predicting the
radiotherapy dose distribution of cancer patients. Specifically, the DiffDose
model contains a forward process and a reverse process. In the forward process,
DiffDose transforms dose distribution maps into pure Gaussian noise by
gradually adding small noise and a noise predictor is simultaneously trained to
estimate the noise added at each timestep. In the reverse process, it removes
the noise from the pure Gaussian noise in multiple steps with the well-trained
noise predictor and finally outputs the predicted dose distribution maps…
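The forward/reverse scheme described above follows the standard noise-prediction recipe; a generic training step consistent with that description might look like the sketch below, where the noise predictor is conditioned on the patient anatomy. The network interface and conditioning are assumptions for illustration.

```python
# Generic DDPM-style training step; network interface is an assumption.
import torch
import torch.nn.functional as F

def diffdose_training_step(noise_predictor, dose_map, anatomy, alphas_cumprod):
    b = dose_map.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (b,), device=dose_map.device)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    noise = torch.randn_like(dose_map)
    noisy = a_bar.sqrt() * dose_map + (1 - a_bar).sqrt() * noise  # forward process
    pred = noise_predictor(noisy, t, anatomy)                     # conditional noise predictor
    return F.mse_loss(pred, noise)                                # noise estimation loss
```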
Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion
November 05, 2023
Ethan Pronovost, Meghana Reddy Ganesina, Noureldin Hendy, Zeyu Wang, Andres Morales, Kai Wang, Nicholas Roy
Automated creation of synthetic traffic scenarios is a key part of validating
the safety of autonomous vehicles (AVs). In this paper, we propose Scenario
Diffusion, a novel diffusion-based architecture for generating traffic
scenarios that enables controllable scenario generation. We combine latent
diffusion, object detection and trajectory regression to generate distributions
of synthetic agent poses, orientations and trajectories simultaneously. To
provide additional control over the generated scenario, this distribution is
conditioned on a map and sets of tokens describing the desired scenario. We
show that our approach has sufficient expressive capacity to model diverse
traffic patterns and generalizes to different geographical regions.
Domain Transfer in Latent Space (DTLS) Wins on Image Super-Resolution – a Non-Denoising Model
November 04, 2023
Chun-Chuen Hui, Wan-Chi Siu, Ngai-Fong Law
Large-scale image super-resolution is a challenging computer vision task,
since vast information is missing in a highly degraded image, for example at
x16 super-resolution. Diffusion models have been used successfully in recent
years in extreme super-resolution applications, in which Gaussian noise is used
to form a latent photo-realistic space and acts as a link between the space of
latent vectors and the latent photo-realistic space. Quite a few sophisticated
mathematical derivations on mapping the statistics of Gaussian noise underpin
the success of diffusion models. In this paper we propose a simple approach
that avoids Gaussian noise but adopts some basic structures of diffusion
models for efficient image super-resolution.
Essentially, we propose a DNN to perform domain transfer between neighboring
domains, which learns the differences in statistical properties to
facilitate gradual interpolation with results of reasonable quality. Further
quality improvement is achieved by conditioning the domain transfer with
reference to the input LR image. Experimental results show that our method
outperforms not only state-of-the-art large-scale super-resolution models, but
also the current diffusion models for image super-resolution. The approach can
readily be extended to other image-to-image tasks, such as image enlightening,
inpainting, denoising, etc.
Stable Diffusion Reference Only: Image Prompt and Blueprint Jointly Guided Multi-Condition Diffusion Model for Secondary Painting
November 04, 2023
Hao Ai, Lu Sheng
Stable Diffusion and ControlNet have achieved excellent results in the field
of image generation and synthesis. However, due to the granularity and method
of their control, the efficiency gains are limited for professional artistic
creation such as comics and animation production, whose main work is secondary
painting. In the current workflow, fixing characters and image styles often
requires lengthy text prompts, or even further training through
TextualInversion, DreamBooth, or other methods, which is complicated and
expensive for painters. Therefore, we present a new method in this paper,
Stable Diffusion Reference Only, an image-to-image self-supervised model that
uses only two types of conditional images for precise, controllable generation
to accelerate secondary painting. The first type of conditional image serves as
an image prompt, supplying the necessary conceptual and color information for
generation. The second type is the blueprint image, which controls the visual
structure of the generated image and is natively embedded into the original
UNet, eliminating the need for ControlNet. We release all the code for the
module and pipeline, together with a trained controllable character line-art
coloring model, at https://github.com/aihao2000/stable-diffusion-reference-only,
which achieves state-of-the-art results in this field. This verifies the
effectiveness of the structure and greatly improves the production efficiency
of animations, comics, and fanworks.
Sparse Training of Discrete Diffusion Models for Graph Generation
November 03, 2023
Yiming Qin, Clement Vignac, Pascal Frossard
Generative models for graphs often encounter scalability challenges due to
the inherent need to predict interactions for every node pair. Despite the
sparsity often exhibited by real-world graphs, the unpredictable sparsity
patterns of their adjacency matrices, stemming from their unordered nature,
lead to quadratic computational complexity. In this work, we introduce
SparseDiff, a denoising diffusion model for graph generation that is able to
exploit sparsity during its training phase. At the core of SparseDiff is a
message-passing neural network tailored to predict only a subset of edges
during each forward pass. When combined with a sparsity-preserving noise model,
this model can efficiently work with edge-list representations of graphs,
paving the way for scalability to much larger structures. During the sampling
phase, SparseDiff iteratively populates the adjacency matrix from its prior
state, ensuring prediction of the full graph while controlling memory
utilization. Experimental results show that SparseDiff simultaneously matches
state-of-the-art generation performance on both small and large graphs,
highlighting the versatility of our method.
Score Models for Offline Goal-Conditioned Reinforcement Learning
November 03, 2023
Harshit Sikchi, Rohan Chitnis, Ahmed Touati, Alborz Geramifard, Amy Zhang, Scott Niekum
Offline Goal-Conditioned Reinforcement Learning (GCRL) is tasked with
learning to achieve multiple goals in an environment purely from offline
datasets using sparse reward functions. Offline GCRL is pivotal for developing
generalist agents capable of leveraging pre-existing datasets to learn diverse
and reusable skills without hand-engineering reward functions. However,
contemporary approaches to GCRL based on supervised learning and contrastive
learning are often suboptimal in the offline setting. An alternative
perspective on GCRL optimizes for occupancy matching, but necessitates learning
a discriminator, which subsequently serves as a pseudo-reward for downstream
RL. Inaccuracies in the learned discriminator can cascade, negatively
influencing the resulting policy. We present a novel approach to GCRL under a
new lens of mixture-distribution matching, leading to our discriminator-free
method: SMORe. The key insight is combining the occupancy matching perspective
of GCRL with a convex dual formulation to derive a learning objective that can
better leverage suboptimal offline data. SMORe learns scores or unnormalized
densities representing the importance of taking an action at a state for
reaching a particular goal. SMORe is principled and our extensive experiments
on the fully offline GCRL benchmark composed of robot manipulation and
locomotion tasks, including high-dimensional observations, show that SMORe can
outperform state-of-the-art baselines by a significant margin.
Latent Diffusion Model for Conditional Reservoir Facies Generation
November 03, 2023
Daesoo Lee, Oscar Ovanger, Jo Eidsvik, Erlend Aune, Jacob Skauvold, Ragnar Hauge
physics.geo-ph, cs.LG, stat.ML
Creating accurate and geologically realistic reservoir facies based on
limited measurements is crucial for field development and reservoir management,
especially in the oil and gas sector. Traditional two-point geostatistics,
while foundational, often struggle to capture complex geological patterns.
Multi-point statistics offers more flexibility, but comes with its own
challenges. With the rise of Generative Adversarial Networks (GANs) and their
success in various fields, there has been a shift towards using them for facies
generation. However, recent advances in the computer vision domain have shown
the superiority of diffusion models over GANs. Motivated by this, a novel
Latent Diffusion Model is proposed, which is specifically designed for
conditional generation of reservoir facies. The proposed model produces
high-fidelity facies realizations that rigorously preserve conditioning data.
It significantly outperforms a GAN-based alternative.
On the Generalization Properties of Diffusion Models
November 03, 2023
Puheng Li, Zhong Li, Huishuai Zhang, Jiang Bian
Diffusion models are a class of generative models that serve to establish a
stochastic transport map between an empirically observed, yet unknown, target
distribution and a known prior. Despite their remarkable success in real-world
applications, a theoretical understanding of their generalization capabilities
remains underdeveloped. This work embarks on a comprehensive theoretical
exploration of the generalization attributes of diffusion models. We establish
theoretical estimates of the generalization gap that evolves in tandem with the
training dynamics of score-based diffusion models, suggesting a polynomially
small generalization error ($O(n^{-2/5}+m^{-4/5})$) on both the sample size $n$
and the model capacity $m$, evading the curse of dimensionality (i.e., not
exponentially large in the data dimension) when early-stopped. Furthermore, we
extend our quantitative analysis to a data-dependent scenario, wherein target
distributions are portrayed as a succession of densities with progressively
increasing distances between modes. This precisely elucidates the adverse
effect of “modes shift” in ground truths on the model generalization. Moreover,
these estimates are not solely theoretical constructs but have also been
confirmed through numerical simulations. Our findings contribute to the
rigorous understanding of diffusion models’ generalization properties and
provide insights that may guide practical applications.
PDF: Point Diffusion Implicit Function for Large-scale Scene Neural Representation
November 03, 2023
Yuhan Ding, Fukun Yin, Jiayuan Fan, Hui Li, Xin Chen, Wen Liu, Chongshan Lu, Gang YU, Tao Chen
Recent advances in implicit neural representations have achieved impressive
results by sampling and fusing individual points along sampling rays in the
sampling space. However, due to the explosively growing sampling space, finely
representing and synthesizing detailed textures remains a challenge for
unbounded large-scale outdoor scenes. To alleviate the dilemma of using
individual points to perceive the entire colossal space, we explore learning
the surface distribution of the scene to provide structural priors and reduce
the samplable space, and we propose a Point Diffusion implicit Function, PDF, for
large-scale scene neural representation. The core of our method is a
large-scale point cloud super-resolution diffusion module that enhances the
sparse point cloud reconstructed from several training images into a dense
point cloud as an explicit prior. Then in the rendering stage, only sampling
points with prior points within the sampling radius are retained. That is, the
sampling space is reduced from the unbounded space to the scene surface.
Meanwhile, to fill in the background of the scene that cannot be provided by
point clouds, the region sampling based on Mip-NeRF 360 is employed to model
the background representation. Extensive experiments have demonstrated the
effectiveness of our method for large-scale scene novel view synthesis, which
outperforms relevant state-of-the-art baselines.
Investigating the Behavior of Diffusion Models for Accelerating Electronic Structure Calculations
November 02, 2023
Daniel Rothchild, Andrew S. Rosen, Eric Taw, Connie Robinson, Joseph E. Gonzalez, Aditi S. Krishnapriyan
physics.chem-ph, cond-mat.mtrl-sci, cs.LG, physics.comp-ph
We present an investigation into diffusion models for molecular generation,
with the aim of better understanding how their predictions compare to the
results of physics-based calculations. The investigation into these models is
driven by their potential to significantly accelerate electronic structure
calculations using machine learning, without requiring expensive
first-principles datasets for training interatomic potentials. We find that the
inference process of a popular diffusion model for de novo molecular generation
is divided into an exploration phase, where the model chooses the atomic
species, and a relaxation phase, where it adjusts the atomic coordinates to
find a low-energy geometry. As training proceeds, we show that the model
initially learns about the first-order structure of the potential energy
surface, and then later learns about higher-order structure. We also find that
the relaxation phase of the diffusion model can be re-purposed to sample the
Boltzmann distribution over conformations and to carry out structure
relaxations. For structure relaxations, the model finds geometries with ~10x
lower energy than those produced by a classical force field for small organic
molecules. Initializing a density functional theory (DFT) relaxation at the
diffusion-produced structures yields a >2x speedup to the DFT relaxation when
compared to initializing at structures relaxed with a classical force field.
De-Diffusion Makes Text a Strong Cross-Modal Interface
November 01, 2023
Chen Wei, Chenxi Liu, Siyuan Qiao, Zhishuai Zhang, Alan Yuille, Jiahui Yu
We demonstrate text as a strong cross-modal interface. Rather than relying on
deep embeddings to connect image and language as the interface representation,
our approach represents an image as text, from which we enjoy the
interpretability and flexibility inherent to natural language. We employ an
autoencoder that uses a pre-trained text-to-image diffusion model for decoding.
The encoder is trained to transform an input image into text, which is then fed
into the fixed text-to-image diffusion decoder to reconstruct the original
input – a process we term De-Diffusion. Experiments validate both the
precision and comprehensiveness of De-Diffusion text in representing images,
such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs
for diverse multi-modal tasks. For example, a single De-Diffusion model can
generalize to provide transferable prompts for different text-to-image tools,
and also achieves a new state of the art on open-ended vision-language tasks by
simply prompting large language models with few-shot examples.
Intriguing Properties of Data Attribution on Diffusion Models
November 01, 2023
Xiaosen Zheng, Tianyu Pang, Chao Du, Jing Jiang, Min Lin
Data attribution seeks to trace model outputs back to training data. With the
recent development of diffusion models, data attribution has become a desired
module to properly assign valuations for high-quality or copyrighted training
samples, ensuring that data contributors are fairly compensated or credited.
Several theoretically motivated methods have been proposed to implement data
attribution, in an effort to improve the trade-off between computational
scalability and effectiveness. In this work, we conduct extensive experiments
and ablation studies on attributing diffusion models, specifically focusing on
DDPMs trained on CIFAR-10 and CelebA, as well as a Stable Diffusion model
LoRA-finetuned on ArtBench. Intriguingly, we report counter-intuitive
observations that theoretically unjustified design choices for attribution
empirically outperform previous baselines by a large margin, in terms of both
linear datamodeling score and counterfactual evaluation. Our work presents a
significantly more efficient approach for attributing diffusion models, while
the unexpected findings suggest that at least in non-convex settings,
constructions guided by theoretical assumptions may lead to inferior
attribution performance. The code is available at
https://github.com/sail-sg/D-TRAK.
Diffusion models for probabilistic programming
November 01, 2023
Simon Dirmeier, Fernando Perez-Cruz
We propose Diffusion Model Variational Inference (DMVI), a novel method for
automated approximate inference in probabilistic programming languages (PPLs).
DMVI utilizes diffusion models as variational approximations to the true
posterior distribution by deriving a novel bound to the marginal likelihood
objective used in Bayesian modelling. DMVI is easy to implement, allows
hassle-free inference in PPLs without the drawbacks of, e.g., variational
inference using normalizing flows, and does not impose any constraints on the
underlying neural network model. We evaluate DMVI on a set of common Bayesian
models and show that its posterior inferences are in general more accurate than
those of contemporary methods used in PPLs while having a similar computational
cost and requiring less manual tuning.
Adaptive Latent Diffusion Model for 3D Medical Image to Image Translation: Multi-modal Magnetic Resonance Imaging Study
November 01, 2023
Jonghun Kim, Hyunjin Park
Multi-modal images play a crucial role in comprehensive evaluations in
medical image analysis, providing complementary information for identifying
clinically important biomarkers. However, in clinical practice, acquiring
multiple modalities can be challenging due to reasons such as scan cost,
limited scan time, and safety considerations. In this paper, we propose a model
based on the latent diffusion model (LDM) that leverages switchable blocks for
image-to-image translation in 3D medical images without patch cropping. The 3D
LDM, combined with conditioning on the target modality, allows generating
high-quality target-modality images in 3D, overcoming the shortcoming of missing
out-of-slice information in 2D generation methods. The switchable block, denoted
as multiple switchable spatially adaptive normalization (MS-SPADE), dynamically
transforms source latents to the desired style of the target latents to help
with the diffusion process. The MS-SPADE block allows us to have one single
model to tackle many translation tasks of one source modality to various
targets, removing the need for separate translation models for different scenarios.
Our model exhibited successful image synthesis across different source-target
modality scenarios and surpassed other models in quantitative evaluations
tested on multi-modal brain magnetic resonance imaging datasets of four
different modalities and an independent IXI dataset. Our model demonstrated
successful image synthesis across various modalities even allowing for
one-to-many modality translations. Furthermore, it outperformed other
one-to-one translation models in quantitative evaluations.
Score Normalization for a Faster Diffusion Exponential Integrator Sampler
October 31, 2023
Guoxuan Xia, Duolikun Danier, Ayan Das, Stathi Fotiadis, Farhang Nabiei, Ushnish Sengupta, Alberto Bernacchia
Recently, Zhang et al. have proposed the Diffusion Exponential Integrator
Sampler (DEIS) for fast generation of samples from Diffusion Models. It
leverages the semi-linear nature of the probability flow ordinary differential
equation (ODE) in order to greatly reduce integration error and improve
generation quality at low numbers of function evaluations (NFEs). Key to this
approach is the score function reparameterisation, which reduces the
integration error incurred from using a fixed score function estimate over each
integration step. The original authors use the default parameterisation of
models trained for noise prediction, i.e., the score multiplied by the standard
deviation of the conditional forward noising distribution. We find that
although the mean absolute value of this score parameterisation is close to
constant for a large portion of the reverse sampling process, it changes
rapidly at the end of sampling. As a simple fix, we propose to instead
reparameterise the score (at inference) by dividing it by the average absolute
value of previous score estimates at that time step collected from offline high
NFE generations. We find that our score normalisation (DEIS-SN) consistently
improves FID compared to vanilla DEIS, showing an improvement at 10 NFEs from
6.44 to 5.57 on CIFAR-10 and from 5.9 to 4.95 on LSUN-Church 64x64. Our code is
available at https://github.com/mtkresearch/Diffusion-DEIS-SN.
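The normalisation itself is simple enough to sketch: offline, collect the mean absolute score at each timestep from a few high-NFE generations; at inference, divide the current score estimate by that stored statistic. The `score_fn` interface and the way offline samples are stored below are assumptions.

```python
# Illustrative sketch of per-timestep score normalisation; interfaces are assumed.
import torch

@torch.no_grad()
def collect_score_norms(score_fn, offline_samples, timesteps):
    """offline_samples[t]: list of latents x_t recorded at step t during
    high-NFE generations run offline."""
    norms = {}
    for t in timesteps:
        vals = [score_fn(x_t, t).abs().mean() for x_t in offline_samples[t]]
        norms[t] = torch.stack(vals).mean()            # average |score| at step t
    return norms

def normalised_score(score_fn, x_t, t, norms, eps=1e-8):
    return score_fn(x_t, t) / (norms[t] + eps)          # reparameterised score at inference
```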
Diversity and Diffusion: Observations on Synthetic Image Distributions with Stable Diffusion
October 31, 2023
David Marwood, Shumeet Baluja, Yair Alon
Recent progress in text-to-image (TTI) systems, such as StableDiffusion,
Imagen, and DALL-E 2, has made it possible to create realistic images with
simple text prompts. It is tempting to use these systems to eliminate the
manual task of obtaining natural images for training a new machine learning
classifier. However, in all of the experiments performed to date, classifiers
trained solely with synthetic images perform poorly at inference, despite the
images used for training appearing realistic. Examining this apparent
incongruity in detail gives insight into the limitations of the underlying
image generation processes. Through the lens of diversity in image creation
vs. accuracy of what is created, we dissect the semantic mismatches between
what is modeled in synthetic vs. natural images. This elucidates the respective
roles of the image-language model, CLIP, and the image generation model, the
diffusion model. We find four issues that limit the usefulness of TTI systems
for this task: ambiguity, adherence to prompt, lack of diversity, and inability
to represent the underlying concept. We further present surprising insights
into the geometry of CLIP embeddings.
October 31, 2023
Yuxin Zhang, Clément Huneau, Jérôme Idier, Diana Mateus
Despite its wide use in medicine, ultrasound imaging faces several challenges
related to its poor signal-to-noise ratio and multiple sources of noise and
artefacts. Enhancing ultrasound image quality involves balancing concurrent
factors like contrast, resolution, and speckle preservation. In recent years,
there has been progress both in model-based and learning-based approaches to
improve ultrasound image reconstruction. Bringing the best from both worlds, we
propose a hybrid approach leveraging advances in diffusion models. To this end,
we adapt Denoising Diffusion Restoration Models (DDRM) to incorporate
ultrasound physics through a linear direct model and an unsupervised
fine-tuning of the prior diffusion model. We conduct comprehensive experiments
on simulated, in-vitro, and in-vivo data, demonstrating the efficacy of our
approach in achieving high-quality image reconstructions from a single plane
wave input and in comparison to state-of-the-art methods. Finally, given the
stochastic nature of the method, we analyse in depth the statistical properties
of single and multiple-sample reconstructions, experimentally show the
informativeness of their variance, and provide an empirical model relating this
behaviour to speckle noise. The code and data are available at: (upon
acceptance).
October 31, 2023
Reza Basiri, Karim Manji, Francois Harton, Alisha Poonja, Milos R. Popovic, Shehroz S. Khan
Diabetic Foot Ulcer (DFU) is a serious skin wound requiring specialized care.
However, real DFU datasets are limited, hindering clinical training and
research activities. In recent years, generative adversarial networks and
diffusion models have emerged as powerful tools for generating synthetic images
with remarkable realism and diversity in many applications. This paper explores
the potential of diffusion models for synthesizing DFU images and evaluates
their authenticity through expert clinician assessments. Additionally,
evaluation metrics such as Frechet Inception Distance (FID) and Kernel
Inception Distance (KID) are examined to assess the quality of the synthetic
DFU images. A dataset of 2,000 DFU images is used for training the diffusion
model, and the synthetic images are generated by applying diffusion processes.
The results indicate that the diffusion model successfully synthesizes visually
indistinguishable DFU images. 70% of the time, clinicians marked synthetic DFU
images as real DFUs. However, clinicians demonstrated higher unanimous
confidence in rating real images than synthetic ones. The study also reveals
that FID and KID metrics do not significantly align with clinicians’
assessments, suggesting alternative evaluation approaches are needed. The
findings highlight the potential of diffusion models for generating synthetic
DFU images and their impact on medical training programs and research in wound
detection and classification.
Beyond U: Making Diffusion Models Faster & Lighter
October 31, 2023
Sergio Calvo-Ordonez, Jiahao Huang, Lipei Zhang, Guang Yang, Carola-Bibiane Schonlieb, Angelica I Aviles-Rivero
Diffusion models are a family of generative models that yield record-breaking
performance in tasks such as image synthesis, video generation, and molecule
design. Despite their capabilities, their efficiency, especially in the reverse
denoising process, remains a challenge due to slow convergence rates and high
computational costs. In this work, we introduce an approach that leverages
continuous dynamical systems to design a novel denoising network for diffusion
models that is more parameter-efficient, exhibits faster convergence, and
demonstrates increased noise robustness. Experimenting with denoising
diffusion probabilistic models, our framework operates with approximately a
quarter of the parameters and 30% of the Floating Point Operations (FLOPs)
compared to standard U-Nets in Denoising Diffusion Probabilistic Models
(DDPMs). Furthermore, our model is up to 70% faster in inference than the
baseline models when measured in equal conditions while converging to better
quality solutions.
Scaling Riemannian Diffusion Models
October 30, 2023
Aaron Lou, Minkai Xu, Stefano Ermon
Riemannian diffusion models draw inspiration from standard Euclidean space
diffusion models to learn distributions on general manifolds. Unfortunately,
the additional geometric complexity renders the diffusion transition term
inexpressible in closed form, so prior methods resort to imprecise
approximations of the score matching training objective that degrade
performance and preclude applications in high dimensions. In this work, we
reexamine these approximations and propose several practical improvements. Our
key observation is that most relevant manifolds are symmetric spaces, which are
much more amenable to computation. By leveraging and combining various
ans"{a}tze, we can quickly compute relevant quantities to high precision. On
low dimensional datasets, our correction produces a noticeable improvement,
allowing diffusion to compete with other methods. Additionally, we show that
our method enables us to scale to high dimensional tasks on nontrivial
manifolds. In particular, we model QCD densities on $SU(n)$ lattices and
contrastively learned embeddings on high dimensional hyperspheres.
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
October 30, 2023
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan
Video generation has increasingly gained interest in both academia and
industry. Although commercial tools can generate plausible videos, there is a
limited number of open-source models available for researchers and engineers.
In this work, we introduce two diffusion models for high-quality video
generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V
models synthesize a video based on a given text input, while I2V models
incorporate an additional image input. Our proposed T2V model can generate
realistic and cinematic-quality videos with a resolution of $1024 \times 576$,
outperforming other open-source T2V models in terms of quality. The I2V model
is designed to produce videos that strictly adhere to the content of the
provided reference image, preserving its content, structure, and style. This
model is the first open-source I2V foundation model capable of transforming a
given image into a video clip while maintaining content preservation
constraints. We believe that these open-source video generation models will
contribute significantly to the technological advancements within the
community.
Text-to-3D with Classifier Score Distillation
October 30, 2023
Xin Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Song-Hai Zhang, Xiaojuan Qi
Text-to-3D generation has made remarkable progress recently, particularly
with methods based on Score Distillation Sampling (SDS) that leverage
pre-trained 2D diffusion models. While the usage of classifier-free guidance is
well acknowledged to be crucial for successful optimization, it is considered
an auxiliary trick rather than the most essential component. In this paper, we
re-evaluate the role of classifier-free guidance in score distillation and
discover a surprising finding: the guidance alone is enough for effective
text-to-3D generation tasks. We name this method Classifier Score Distillation
(CSD), which can be interpreted as using an implicit classification model for
generation. This new perspective reveals new insights for understanding
existing techniques. We validate the effectiveness of CSD across a variety of
text-to-3D tasks including shape generation, texture synthesis, and shape
editing, achieving results superior to those of state-of-the-art methods. Our
project page is https://xinyu-andy.github.io/Classifier-Score-Distillation
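To make the contrast with SDS explicit, a hedged sketch of the per-iteration gradient is shown below: the classifier-only variant keeps just the difference between conditional and unconditional noise predictions, whereas an SDS-style update would also subtract the injected noise. The UNet interface, noise schedule handling, and weighting are assumptions.

```python
# Illustrative sketch; the UNet interface and weighting are assumptions.
import torch

def classifier_score_gradient(unet, latents, text_emb, null_emb, t,
                              alphas_cumprod, w=1.0):
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(latents)
    noisy = a_bar.sqrt() * latents + (1 - a_bar).sqrt() * noise  # forward noising
    eps_cond = unet(noisy, t, text_emb)      # conditional noise prediction
    eps_uncond = unet(noisy, t, null_emb)    # unconditional noise prediction
    # An SDS-style update would use (eps_uncond + s * (eps_cond - eps_uncond)) - noise;
    # the classifier-only view keeps just the implicit-classifier direction:
    return w * (eps_cond - eps_uncond)
```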
Noise-Free Score Distillation
October 26, 2023
Oren Katzir, Or Patashnik, Daniel Cohen-Or, Dani Lischinski
Score Distillation Sampling (SDS) has emerged as the de facto approach for
text-to-content generation in non-image domains. In this paper, we reexamine
the SDS process and introduce a straightforward interpretation that demystifies
the necessity for large Classifier-Free Guidance (CFG) scales, rooted in the
distillation of an undesired noise term. Building upon our interpretation, we
propose a novel Noise-Free Score Distillation (NFSD) process, which requires
minimal modifications to the original SDS framework. Through this streamlined
design, we achieve more effective distillation of pre-trained text-to-image
diffusion models while using a nominal CFG scale. This strategic choice allows
us to prevent the over-smoothing of results, ensuring that the generated data
is both realistic and complies with the desired prompt. To demonstrate the
efficacy of NFSD, we provide qualitative examples that compare NFSD and SDS, as
well as several other methods.
SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching
October 26, 2023
Xinghui Li, Jingyi Lu, Kai Han, Victor Prisacariu
In this paper, we address the challenge of matching semantically similar
keypoints across image pairs. Existing research indicates that the intermediate
output of the UNet within Stable Diffusion (SD) can serve as robust image
feature maps for such a matching task. We demonstrate that by employing a basic
prompt tuning technique, the inherent potential of Stable Diffusion can be
harnessed, resulting in a significant enhancement in accuracy over previous
approaches. We further introduce a novel conditional prompting module that
conditions the prompt on the local details of the input image pairs, leading to
a further improvement in performance. We designate our approach as SD4Match,
short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of
SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets
new benchmarks in accuracy across all these datasets. Particularly, SD4Match
outperforms the previous state-of-the-art by a margin of 12 percentage points
on the challenging SPair-71k dataset.
Discrete Diffusion Language Modeling by Estimating the Ratios of the Data Distribution
October 25, 2023
Aaron Lou, Chenlin Meng, Stefano Ermon
Despite their groundbreaking performance for many generative modeling tasks,
diffusion models have fallen short on discrete data domains such as natural
language. Crucially, standard diffusion models rely on the well-established
theory of score matching, but efforts to generalize this to discrete structures
have not yielded the same empirical gains. In this work, we bridge this gap by
proposing score entropy, a novel discrete score matching loss that is more
stable than existing methods, forms an ELBO for maximum likelihood training,
and can be efficiently optimized with a denoising variant. We scale our Score
Entropy Discrete Diffusion models (SEDD) to the experimental setting of GPT-2,
achieving highly competitive likelihoods while also introducing distinct
algorithmic advantages. In particular, when comparing similarly sized SEDD and
GPT-2 models, SEDD attains comparable perplexities (normally within $+10\%$ of,
and sometimes outperforming, the baseline). Furthermore, SEDD models learn a
more faithful sequence distribution (around $4\times$ better compared to GPT-2
models with ancestral sampling as measured by large models), can trade off
compute for generation quality (needing only $16\times$ fewer network
evaluations to match GPT-2), and enables arbitrary infilling beyond the
standard left to right prompting.
Using Diffusion Models to Generate Synthetic Labelled Data for Medical Image Segmentation
October 25, 2023
Daniel Saragih, Pascal Tyrrell
In this paper, we proposed and evaluated a pipeline for generating synthetic
labeled polyp images with the aim of augmenting automatic medical image
segmentation models. In doing so, we explored the use of diffusion models to
generate and style synthetic labeled data. The HyperKvasir dataset, consisting
of 1000 images of polyps in the human GI tract obtained from 2008 to 2016
during clinical endoscopies, was used for training and testing. Furthermore, we
conducted a qualitative expert review and computed the Fréchet Inception
Distance (FID) and Multi-Scale Structural Similarity (MS-SSIM) between the output images
and the source images to evaluate our samples. To evaluate its augmentation
potential, a segmentation model was trained with the synthetic data to compare
their performance with the real data and previous Generative Adversarial
Networks (GAN) methods. These models were evaluated using the Dice loss (DL)
and Intersection over Union (IoU) score. Our pipeline generated images that
more closely resembled real images according to the FID scores (GAN: $118.37
\pm 1.06 \text{ vs SD: } 65.99 \pm 0.37$). Improvements over GAN methods were
seen on average when the segmenter was entirely trained (DL difference:
$-0.0880 \pm 0.0170$, IoU difference: $0.0993 \pm 0.01493$) or augmented (DL
difference: GAN $-0.1140 \pm 0.0900 \text{ vs SD }-0.1053 \pm 0.0981$, IoU
difference: GAN $0.01533 \pm 0.03831 \text{ vs SD }0.0255 \pm 0.0454$) with
synthetic data. Overall, we obtained more realistic synthetic images and
improved segmentation model performance when fully or partially trained on
synthetic data.
Multi-scale Diffusion Denoised Smoothing
October 25, 2023
Jongheon Jeong, Jinwoo Shin
Along with recent diffusion models, randomized smoothing has become one of a
few tangible approaches that offer adversarial robustness to models at scale,
e.g., large pre-trained models. Specifically, one can perform
randomized smoothing on any classifier via a simple “denoise-and-classify”
pipeline, so-called denoised smoothing, given that an accurate denoiser is
available, such as a diffusion model. In this paper, we present scalable methods
to address the current trade-off between certified robustness and accuracy in
denoised smoothing. Our key idea is to “selectively” apply smoothing among
multiple noise scales, coined multi-scale smoothing, which can be efficiently
implemented with a single diffusion model. This approach also suggests a new
objective to compare the collective robustness of multi-scale smoothed
classifiers, and raises the question of which representation of the diffusion
model would maximize the objective. To address this, we propose to further
fine-tune the diffusion model (a) to perform consistent denoising whenever the original image
is recoverable, but (b) to generate rather diverse outputs otherwise. Our
experiments show that the proposed multi-scale smoothing scheme combined with
diffusion fine-tuning enables strong certified robustness available with high
noise level while maintaining its accuracy close to non-smoothed classifiers.
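For readers unfamiliar with denoised smoothing, the sketch below shows the basic denoise-and-classify loop with a simplified per-input scale selection (pick the noise scale with the most confident consensus); the selection rule and interfaces are illustrative stand-ins, not the paper's certified procedure.

```python
# Illustrative sketch; the scale-selection rule and interfaces are assumptions.
import torch

@torch.no_grad()
def multiscale_smoothed_predict(x, denoiser, classifier,
                                sigmas=(0.25, 0.5, 1.0), n_samples=64):
    best = None
    for sigma in sigmas:
        noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
        logits = classifier(denoiser(noisy, sigma))      # denoise, then classify
        votes = logits.argmax(dim=1)
        top_class = votes.mode().values
        agreement = (votes == top_class).float().mean()  # consensus at this scale
        if best is None or agreement > best[1]:
            best = (int(top_class), float(agreement), sigma)
    return best  # (predicted class, agreement, selected noise scale)
```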
Fuse Your Latents: Video Editing with Multi-source Latent Diffusion Models
October 25, 2023
Tianyi Lu, Xing Zhang, Jiaxi Gu, Hang Xu, Renjing Pei, Songcen Xu, Zuxuan Wu
Latent Diffusion Models (LDMs) are renowned for their powerful capabilities
in image and video synthesis. Yet, video editing methods suffer from
insufficient pre-training data or video-by-video re-training costs. To
address this gap, we propose FLDM (Fused Latent Diffusion Model), a
training-free framework that achieves text-guided video editing by applying
off-the-shelf image editing methods in video LDMs. Specifically, FLDM fuses
latents from an image LDM and a video LDM during the denoising process. In
this way, temporal consistency can be maintained by the video LDM while the
high fidelity of the image LDM can also be exploited. Meanwhile, FLDM is highly
flexible, since both the image LDM and the video LDM can be replaced, so advanced
image editing methods such as InstructPix2Pix and ControlNet can be exploited.
To the best of our knowledge, FLDM is the first method to adapt off-the-shelf
image editing methods into video LDMs for video editing. Extensive quantitative
and qualitative experiments demonstrate that FLDM can improve the textual
alignment and temporal consistency of edited videos.
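The fusion step can be pictured with the following sketch, which blends frame-wise image-LDM updates with the video LDM's temporally aware update at every denoising step; the blending weight and the two step functions are assumptions for illustration.

```python
# Illustrative sketch; step interfaces and the fusion weight are assumptions.
import torch

@torch.no_grad()
def fused_denoise(video_latents, video_ldm_step, image_ldm_step, timesteps,
                  fuse_weight=0.5):
    z = video_latents                                    # (frames, c, h, w)
    for t in timesteps:
        z_video = video_ldm_step(z, t)                   # temporally consistent update
        z_image = torch.stack([image_ldm_step(z[f], t)   # frame-wise editing update
                               for f in range(z.shape[0])])
        z = fuse_weight * z_image + (1 - fuse_weight) * z_video  # fuse the latents
    return z
```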
DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object Detection
October 25, 2023
Se-Ho Kim, Inyong Koo, Inyoung Lee, Byeongjun Park, Changick Kim
Denoising diffusion models show remarkable performance in generative tasks,
and their potential applications in perception tasks are gaining interest. In
this paper, we introduce a novel framework named DiffRef3D which adopts the
diffusion process on 3D object detection with point clouds for the first time.
Specifically, we formulate the proposal refinement stage of two-stage 3D object
detectors as a conditional diffusion process. During training, DiffRef3D
gradually adds noise to the residuals between proposals and target objects,
then applies the noisy residuals to proposals to generate hypotheses. The
refinement module utilizes these hypotheses to denoise the noisy residuals and
generate accurate box predictions. In the inference phase, DiffRef3D generates
initial hypotheses by sampling noise from a Gaussian distribution as residuals
and refines the hypotheses through iterative steps. DiffRef3D is a versatile
proposal refinement framework that consistently improves the performance of
existing 3D object detection models. We demonstrate the significance of
DiffRef3D through extensive experiments on the KITTI benchmark. Code will be
available.
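A simplified sketch of the training-time hypothesis generation described above, assuming a (x, y, z, dx, dy, dz, yaw) box parameterization and a standard DDPM noising schedule; the helper names are illustrative, not the DiffRef3D code.

```python
import torch

def make_noisy_hypotheses(proposals, targets, t, alphas_cumprod):
    """Generate noisy box hypotheses for proposal refinement training (sketch).

    proposals, targets: (N, 7) boxes; t: integer diffusion timestep;
    alphas_cumprod: 1-D tensor holding the DDPM schedule.
    """
    residuals = targets - proposals                      # what the refiner should recover
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(residuals)
    noisy_residuals = a_bar.sqrt() * residuals + (1 - a_bar).sqrt() * noise
    hypotheses = proposals + noisy_residuals             # boxes seen by the refinement module
    return hypotheses, noisy_residuals, noise
```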
Generative Pre-training for Speech with Flow Matching
October 25, 2023
Alexander H. Liu, Matt Le, Apoorv Vyas, Bowen Shi, Andros Tjandra, Wei-Ning Hsu
eess.AS, cs.CL, cs.LG, cs.SD
Generative models have gained increasing attention in recent years for
their remarkable success in tasks that require estimating and sampling from a
data distribution to generate high-fidelity synthetic data. In speech,
text-to-speech synthesis and neural vocoding are good examples where generative
models have shone. While generative models have been applied to various tasks
in speech, there exists no general-purpose generative model that
models speech directly. In this work, we take a step toward this direction by
showing a single pre-trained generative model can be adapted to different
downstream tasks with strong performance. Specifically, we pre-trained a
generative model, named SpeechFlow, on 60k hours of untranscribed speech with
Flow Matching and masked conditions. Experimental results show that the
pre-trained generative model can be fine-tuned with task-specific data to match
or surpass existing expert models on speech enhancement, separation, and
synthesis. Our work suggests that a foundational model for speech generation
tasks can be built with generative pre-training.
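Below is a minimal sketch of what a flow-matching objective with masked conditioning can look like on speech features, assuming a linear interpolation path between Gaussian noise and data; the masking scheme, model signature, and all names are placeholders, not SpeechFlow's implementation.

```python
import torch

def masked_flow_matching_loss(model, x1, mask_ratio=0.7):
    """One training step of flow matching with a masked condition (sketch).

    x1: batch of speech features, shape (B, T, D), e.g. mel spectrograms.
    The model sees a partially masked copy of x1 as its condition and learns the
    velocity field transporting Gaussian noise x0 to the data x1.
    """
    x0 = torch.randn_like(x1)                              # source: standard Gaussian
    t = torch.rand(x1.shape[0], 1, 1)                      # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                             # point on the linear path
    target_velocity = x1 - x0                              # d xt / d t along the path

    cond_mask = (torch.rand(x1.shape[:2]) > mask_ratio).float().unsqueeze(-1)
    condition = x1 * cond_mask                             # masked features as condition

    pred_velocity = model(xt, t.squeeze(-1).squeeze(-1), condition)
    return ((pred_velocity - target_velocity) ** 2).mean()
```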
Score Matching-based Pseudolikelihood Estimation of Neural Marked Spatio-Temporal Point Process with Uncertainty Quantification
October 25, 2023
Zichong Li, Qunzhi Xu, Zhenghao Xu, Yajun Mei, Tuo Zhao, Hongyuan Zha
Spatio-temporal point processes (STPPs) are potent mathematical tools for
modeling and predicting events with both temporal and spatial features. Despite
their versatility, most existing methods for learning STPPs either assume a
restricted form of the spatio-temporal distribution, or suffer from inaccurate
approximations of the intractable integral in the likelihood training
objective. These issues typically arise from the normalization term of the
probability density function. Moreover, current techniques fail to provide
uncertainty quantification for model predictions, such as confidence intervals
for the predicted event’s arrival time and confidence regions for the event’s
location, which is crucial given the considerable randomness of the data. To
tackle these challenges, we introduce SMASH: a Score MAtching-based
pSeudolikeliHood estimator for learning marked STPPs with uncertainty
quantification. Specifically, our framework adopts a normalization-free
objective by estimating the pseudolikelihood of marked STPPs through
score-matching and offers uncertainty quantification for the predicted event
time, location and mark by computing confidence regions over the generated
samples. The superior performance of our proposed framework is demonstrated
through extensive experiments in both event prediction and uncertainty
quantification.
RAEDiff: Denoising Diffusion Probabilistic Models Based Reversible Adversarial Examples Self-Generation and Self-Recovery
October 25, 2023
Fan Xing, Xiaoyi Zhou, Xuefeng Fan, Zhuo Tian, Yan Zhao
cs.CR, cs.AI, cs.GR, cs.LG
Collected and annotated datasets, which are obtained through extensive
efforts, are effective for training Deep Neural Network (DNN) models. However,
these datasets are susceptible to misuse by unauthorized users, resulting
in infringement of the Intellectual Property (IP) rights of the dataset
creators. Reversible Adversarial Examples (RAEs) can help solve the issue
of IP protection for datasets. RAEs are adversarially perturbed images that can
be restored to the original. As a cutting-edge approach, the RAE scheme can both
prevent unauthorized users from engaging in malicious model training and ensure
legitimate usage by authorized users.
Nevertheless, in the existing work, RAEs still rely on the embedded auxiliary
information for restoration, which may compromise their adversarial abilities.
In this paper, a novel self-generation and self-recovery method, named
RAEDiff, is introduced for generating RAEs based on a Denoising Diffusion
Probabilistic Model (DDPM). It diffuses datasets into a Biased Gaussian
Distribution (BGD) and utilizes the prior knowledge of the DDPM for generating
and recovering RAEs. The experimental results demonstrate that RAEDiff
effectively self-generates adversarial perturbations for DNN models, including
Artificial Intelligence Generated Content (AIGC) models, while also exhibiting
significant self-recovery capabilities.
A Diffusion Weighted Graph Framework for New Intent Discovery
October 24, 2023
Wenkai Shi, Wenbin An, Feng Tian, Qinghua Zheng, QianYing Wang, Ping Chen
New Intent Discovery (NID) aims to recognize both new and known intents from
unlabeled data with the aid of limited labeled data containing only known
intents. Without considering structure relationships between samples, previous
methods generate noisy supervisory signals which cannot strike a balance
between quantity and quality, hindering the formation of new intent clusters
and effective transfer of the pre-training knowledge. To mitigate this
limitation, we propose a novel Diffusion Weighted Graph Framework (DWGF) to
capture both semantic similarities and structure relationships inherent in
data, enabling more sufficient and reliable supervisory signals. Specifically,
for each sample, we diffuse neighborhood relationships along semantic paths
guided by the nearest neighbors for multiple hops to characterize its local
structure discriminatively. Then, we sample its positive keys and weight them
based on semantic similarities and local structures for contrastive learning.
During inference, we further propose a Graph Smoothing Filter (GSF) to explicitly
utilize the structure relationships to filter high-frequency noise embodied in
semantically ambiguous samples on the cluster boundary. Extensive experiments
show that our method outperforms state-of-the-art models on all evaluation
metrics across multiple benchmark datasets. Code and data are available at
https://github.com/yibai-shi/DWGF.
A Comparative Study of Variational Autoencoders, Normalizing Flows, and Score-based Diffusion Models for Electrical Impedance Tomography
October 24, 2023
Huihui Wang, Guixian Xu, Qingping Zhou
Electrical Impedance Tomography (EIT) is a widely employed imaging technique
in industrial inspection, geophysical prospecting, and medical imaging.
However, the inherent nonlinearity and ill-posedness of EIT image
reconstruction present challenges for classical regularization techniques, such
as the critical selection of regularization terms and the lack of prior
knowledge. Deep generative models (DGMs) have been shown to play a crucial role
in learning implicit regularizers and prior knowledge. This study aims to
investigate the potential of three DGMs (variational autoencoder networks,
normalizing flows, and score-based diffusion models) to learn implicit
regularizers in learning-based EIT imaging. We first introduce background
information on EIT imaging and its inverse problem formulation. Next, we
propose three algorithms for performing EIT inverse problems based on
corresponding DGMs. Finally, we present numerical and visual experiments, which
reveal that (1) no single method consistently outperforms the others across all
settings, and (2) when reconstructing an object with 2 anomalies using a
well-trained model based on a training dataset containing 4 anomalies, the
conditional normalizing flow model (CNF) exhibits the best generalization in
low-level noise, while the conditional score-based diffusion model (CSD*)
demonstrates the best generalization in high-level noise settings. We hope our
preliminary efforts will encourage other researchers to assess their DGMs in
EIT and other nonlinear inverse problems.
Discriminator Guidance for Autoregressive Diffusion Models
October 24, 2023
Filip Ekström Kelvinius, Fredrik Lindsten
We introduce discriminator guidance in the setting of Autoregressive
Diffusion Models. The use of a discriminator to guide a diffusion process has
previously been used for continuous diffusion models, and in this work we
derive ways of using a discriminator together with a pretrained generative
model in the discrete case. First, we show that using an optimal discriminator
will correct the pretrained model and enable exact sampling from the underlying
data distribution. Second, to account for the realistic scenario of using a
sub-optimal discriminator, we derive a sequential Monte Carlo algorithm which
iteratively takes the predictions from the discriminator into account during the
generation process. We test these approaches on the task of generating
molecular graphs and show how the discriminator improves the generative
performance over using only the pretrained model.
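The sequential Monte Carlo idea can be illustrated with the hedged sketch below: particles proposed by the pretrained model are re-weighted by a discriminator and resampled. The weight form, resampling scheme, and callable signatures are assumptions for illustration, not the authors' algorithm.

```python
import torch

def smc_resample_step(particles, log_weights, generator_step, discriminator):
    """One discriminator-guided SMC step for a discrete diffusion model (sketch).

    particles: tensor of partially generated sequences, shape (num_particles, length).
    log_weights: running log importance weights, shape (num_particles,).
    generator_step: extends each particle by one step of the pretrained model.
    discriminator: returns per-particle log-likelihood-ratio-style scores.
    """
    particles = generator_step(particles)                    # propose with pretrained model
    log_weights = log_weights + discriminator(particles)     # re-weight via discriminator

    # Multinomial resampling keeps particles the discriminator finds realistic.
    probs = torch.softmax(log_weights, dim=0)
    idx = torch.multinomial(probs, num_samples=len(probs), replacement=True)
    particles = particles[idx]
    log_weights = torch.zeros_like(log_weights)              # reset weights after resampling
    return particles, log_weights
```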
Improving Diffusion Models for ECG Imputation with an Augmented Template Prior
October 24, 2023
Alexander Jenkins, Zehua Chen, Fu Siong Ng, Danilo Mandic
Pulsative signals such as the electrocardiogram (ECG) are extensively
collected as part of routine clinical care. However, noisy and poor-quality
recordings are a major issue for signals collected using mobile health systems,
decreasing the signal quality, leading to missing values, and affecting
automated downstream tasks. Recent studies have explored the imputation of
missing values in ECG with probabilistic time-series models. Nevertheless, in
comparison with the deterministic models, their performance is still limited,
as the variations across subjects and heart-beat relationships are not
explicitly considered in the training objective. In this work, to improve the
imputation and forecasting accuracy for ECG with probabilistic models, we
present a template-guided denoising diffusion probabilistic model (DDPM),
PulseDiff, which is conditioned on an informative prior for a range of health
conditions. Specifically, 1) we first extract a subject-level pulsative
template from the observed values to use as an informative prior of the missing
values, which personalises the prior; 2) we then add beat-level stochastic
shift terms to augment the prior, which considers variations in the position
and amplitude of the prior at each beat; 3) we finally design a confidence
score to consider the health condition of the subject, which ensures our prior
is provided safely. Experiments with the PTBXL dataset reveal that PulseDiff
improves the performance of two strong DDPM baseline models, CSDI and
SSSD$^{S4}$, verifying that our method guides the generation of DDPMs while
managing the uncertainty. When combined with SSSD$^{S4}$, PulseDiff outperforms
the leading deterministic model for short-interval missing data and is
comparable for long-interval data loss.
AutoDiff: combining Auto-encoder and Diffusion model for tabular data synthesizing
October 24, 2023
Namjoon Suh, Xiaofeng Lin, Din-Yin Hsieh, Merhdad Honarkhah, Guang Cheng
Diffusion models have become a main paradigm for synthetic data generation in
many subfields of modern machine learning, including computer vision, language
modeling, and speech synthesis. In this paper, we leverage the power of diffusion
models for generating synthetic tabular data. The heterogeneous features in
tabular data have been a main obstacle in tabular data synthesis, and we tackle
this problem by employing an auto-encoder architecture. Compared with
state-of-the-art tabular synthesizers, the synthetic tables produced by our
model show strong statistical fidelity to the real data and perform well in
downstream machine learning tasks. We conducted experiments
on $15$ publicly available datasets. Notably, our model adeptly captures the
correlations among features, which has been a long-standing challenge in
tabular data synthesis. Our code is available at
https://github.com/UCLA-Trustworthy-AI-Lab/AutoDiffusion.
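A high-level sketch of the two-stage recipe the abstract describes, assuming table rows have already been numerically encoded, a simple MLP auto-encoder, and a standard DDPM in the latent space; class names, dimensions, and the training loop are placeholders, not the AutoDiff code.

```python
import torch
import torch.nn as nn

class TabularAutoEncoder(nn.Module):
    """Maps numerically encoded table rows to a continuous latent space (sketch)."""
    def __init__(self, n_features, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                 nn.Linear(128, latent_dim))
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, n_features))

    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

def train_latent_diffusion(rows, denoiser, autoencoder, alphas_cumprod, steps=1000):
    """Stage 2 (sketch): fit a DDPM denoiser on latents of a trained auto-encoder."""
    opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
    for _ in range(steps):
        with torch.no_grad():
            _, z0 = autoencoder(rows)                        # encode real rows to latents
        t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],))
        a_bar = alphas_cumprod[t].unsqueeze(-1)
        noise = torch.randn_like(z0)
        zt = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * noise  # DDPM forward process
        loss = ((denoiser(zt, t) - noise) ** 2).mean()       # predict the added noise
        opt.zero_grad(); loss.backward(); opt.step()
```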
Matryoshka Diffusion Models
October 23, 2023
Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Josh Susskind, Navdeep Jaitly
Diffusion models are the de facto approach for generating high-quality images
and videos, but learning high-dimensional models remains a formidable task due
to computational and optimization challenges. Existing methods often resort to
training cascaded models in pixel space or using a downsampled latent space of
a separately trained auto-encoder. In this paper, we introduce Matryoshka
Diffusion Models (MDM), an end-to-end framework for high-resolution image and
video synthesis. We propose a diffusion process that denoises inputs at
multiple resolutions jointly and uses a NestedUNet architecture where features
and parameters for small-scale inputs are nested within those of large scales.
In addition, MDM enables a progressive training schedule from lower to higher
resolutions, which leads to significant improvements in optimization for
high-resolution generation. We demonstrate the effectiveness of our approach on
various benchmarks, including class-conditioned image generation,
high-resolution text-to-image, and text-to-video applications. Remarkably, we
can train a single pixel-space model at resolutions of up to 1024x1024 pixels,
demonstrating strong zero-shot generalization using the CC12M dataset, which
contains only 12 million images.
Wonder3D: Single Image to 3D using Cross-Domain Diffusion
October 23, 2023
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, Wenping Wang
In this work, we introduce Wonder3D, a novel method for efficiently
generating high-fidelity textured meshes from single-view images. Recent methods
based on Score Distillation Sampling (SDS) have shown the potential to recover
3D geometry from 2D diffusion priors, but they typically suffer from
time-consuming per-shape optimization and inconsistent geometry. In contrast,
certain works directly produce 3D information via fast network inferences, but
their results are often of low quality and lack geometric details. To
holistically improve the quality, consistency, and efficiency of image-to-3D
tasks, we propose a cross-domain diffusion model that generates multi-view
normal maps and the corresponding color images. To ensure consistency, we
employ a multi-view cross-domain attention mechanism that facilitates
information exchange across views and modalities. Lastly, we introduce a
geometry-aware normal fusion algorithm that extracts high-quality surfaces from
the multi-view 2D representations. Our extensive evaluations demonstrate that
our method achieves high-quality reconstruction results, robust generalization,
and reasonably good efficiency compared to prior works.
DICE: Diverse Diffusion Model with Scoring for Trajectory Prediction
October 23, 2023
Younwoo Choi, Ray Coden Mercurius, Soheil Mohamad Alizadeh Shabestary, Amir Rasouli
Road user trajectory prediction in dynamic environments is a challenging but
crucial task for various applications, such as autonomous driving. One of the
main challenges in this domain is the multimodal nature of future trajectories
stemming from the unknown yet diverse intentions of the agents. Diffusion
models have been shown to be very effective in capturing such stochasticity in
prediction tasks. However, these models involve many computationally expensive
denoising steps and sampling operations that make them a less desirable option
for real-time safety-critical applications. To this end, we present a novel
framework that leverages diffusion models for predicting future trajectories in
a computationally efficient manner. To minimize the computational bottlenecks
in iterative sampling, we employ an efficient sampling mechanism that allows us
to maximize the number of sampled trajectories for improved accuracy while
keeping inference real-time. Moreover, we propose a scoring
mechanism to select the most plausible trajectories by assigning relative
ranks. We show the effectiveness of our approach by conducting empirical
evaluations on common pedestrian (UCY/ETH) and autonomous driving (nuScenes)
benchmark datasets on which our model achieves state-of-the-art performance on
several subsets and metrics.
Diffusion-Based Adversarial Purification for Speaker Verification
October 22, 2023
Yibo Bai, Xiao-Lei Zhang
Recently, automatic speaker verification (ASV) based on deep learning has
proven easily contaminated by adversarial attacks, a new type of attack that
injects imperceptible perturbations into audio signals so as to make ASV produce
wrong decisions. This poses a significant threat to the security and
reliability of ASV systems. To address this issue, we propose a Diffusion-Based
Adversarial Purification (DAP) method that enhances the robustness of ASV
systems against such adversarial attacks. Our method leverages a conditional
denoising diffusion probabilistic model to effectively purify the adversarial
examples and mitigate the impact of perturbations. DAP first introduces
controlled noise into adversarial examples, and then performs a reverse
denoising process to reconstruct clean audio. Experimental results demonstrate
the efficacy of the proposed DAP in enhancing the security of ASV and meanwhile
minimizing the distortion of the purified audio signals.
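The noise-then-denoise purification step can be sketched as follows, assuming a generic diffusion model with a reverse-step method; the `denoise_step` API, the conditioning, and the choice of `t_star` are assumptions for illustration, not the DAP implementation.

```python
import torch

def purify_audio(x_adv, diffusion, t_star, alphas_cumprod, condition=None):
    """Diffusion-based adversarial purification (sketch): noise, then denoise.

    x_adv: possibly adversarial waveform or spectrogram tensor.
    t_star: amount of forward noise injected; larger values wash out stronger
            attacks but also distort more of the clean signal.
    """
    a_bar = alphas_cumprod[t_star]
    noise = torch.randn_like(x_adv)
    x_t = a_bar.sqrt() * x_adv + (1 - a_bar).sqrt() * noise   # controlled forward noising

    # Reverse denoising from t_star back to 0 with the (conditional) diffusion model.
    for t in range(t_star, 0, -1):
        x_t = diffusion.denoise_step(x_t, t, condition)       # assumed reverse-step API
    return x_t
```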
Diffusion-based Data Augmentation for Nuclei Image Segmentation
October 22, 2023
Xinyi Yu, Guanbin Li, Wei Lou, Siqi Liu, Xiang Wan, Yan Chen, Haofeng Li
Nuclei segmentation is a fundamental but challenging task in the quantitative
analysis of histopathology images. Although fully-supervised deep
learning-based methods have made significant progress, a large number of
labeled images are required to achieve great segmentation performance.
Since manually labeling all nuclei instances in a dataset is
inefficient, obtaining a large-scale human-annotated dataset is time-consuming
and labor-intensive. Therefore, augmenting a dataset with only a few labeled
images to improve the segmentation performance is of significant research and
application value. In this paper, we introduce the first diffusion-based
augmentation method for nuclei segmentation. The idea is to synthesize a large
number of labeled images to facilitate training the segmentation model. To
achieve this, we propose a two-step strategy. In the first step, we train an
unconditional diffusion model to synthesize the Nuclei Structure that is
defined as the representation of pixel-level semantic and distance transform.
Each synthetic nuclei structure will serve as a constraint on histopathology
image synthesis and is further post-processed to be an instance map. In the
second step, we train a conditional diffusion model to synthesize
histopathology images based on nuclei structures. The synthetic histopathology
images paired with synthetic instance maps will be added to the real dataset
for training the segmentation model. The experimental results show that by
augmenting a 10% labeled real dataset with synthetic samples, one can achieve
comparable segmentation results with the fully-supervised baseline.
Fast Diffusion GAN Model for Symbolic Music Generation Controlled by Emotions
October 21, 2023
Jincheng Zhang, György Fazekas, Charalampos Saitis
Diffusion models have shown promising results for a wide range of generative
tasks with continuous data, such as image and audio synthesis. However, little
progress has been made on using diffusion models to generate discrete symbolic
music, because this class of generative models is not well suited to
discrete data and its iterative sampling process is computationally
expensive. In this work, we propose a diffusion model combined with a
Generative Adversarial Network, aiming to (i) alleviate one of the remaining
challenges in algorithmic music generation which is the control of generation
towards a target emotion, and (ii) mitigate the slow sampling drawback of
diffusion models applied to symbolic music generation. We first used a trained
Variational Autoencoder to obtain embeddings of a symbolic music dataset with
emotion labels and then used those to train a diffusion model. Our results
demonstrate the successful control of our diffusion model to generate symbolic
music with a desired emotion. Our model achieves a several-orders-of-magnitude
improvement in computational cost, requiring merely four time steps to denoise,
whereas current state-of-the-art diffusion models for symbolic music generation
require on the order of thousands of steps.
GraphMaker: Can Diffusion Models Generate Large Attributed Graphs?
October 20, 2023
Mufei Li, Eleonora Kreačić, Vamsi K. Potluru, Pan Li
Large-scale graphs with node attributes are fundamental in real-world
scenarios, such as social and financial networks. The generation of synthetic
graphs that emulate real-world ones is pivotal in graph machine learning,
aiding network evolution understanding and data utility preservation when
original data cannot be shared. Traditional models for graph generation suffer
from limited model capacity. Recent developments in diffusion models have shown
promise only in graph structure generation or the generation of small
molecular graphs with attributes. However, their applicability to large
attributed graphs remains unaddressed due to challenges in capturing intricate
patterns and scalability. This paper introduces GraphMaker, a novel diffusion
model tailored for generating large attributed graphs. We study the diffusion
models that either couple or decouple graph structure and node attribute
generation to address their complex correlation. We also employ node-level
conditioning and adopt a minibatch strategy for scalability. We further propose
a new evaluation pipeline using models trained on generated synthetic graphs
and tested on original graphs to evaluate the quality of synthetic data.
Empirical evaluations on real-world datasets showcase GraphMaker’s superiority
in generating realistic and diverse large attributed graphs beneficial for
downstream tasks.
DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics
October 20, 2023
Kaiwen Zheng, Cheng Lu, Jianfei Chen, Jun Zhu
Diffusion probabilistic models (DPMs) have exhibited excellent performance
for high-fidelity image generation while suffering from inefficient sampling.
Recent works accelerate the sampling procedure by proposing fast ODE solvers
that leverage the specific ODE form of DPMs. However, they highly rely on
specific parameterization during inference (such as noise/data prediction),
which might not be the optimal choice. In this work, we propose a novel
formulation towards the optimal parameterization during sampling that minimizes
the first-order discretization error of the ODE solution. Based on such
formulation, we propose DPM-Solver-v3, a new fast ODE solver for DPMs by
introducing several coefficients efficiently computed on the pretrained model,
which we call empirical model statistics. We further incorporate multistep
methods and a predictor-corrector framework, and propose some techniques for
improving sample quality at small numbers of function evaluations (NFE) or
large guidance scales. Experiments show that DPM-Solver-v3 achieves
consistently better or comparable performance in both unconditional and
conditional sampling with both pixel-space and latent-space DPMs, especially in
5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on
unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable
Diffusion, bringing a speed-up of 15%$\sim$30% compared to previous
state-of-the-art training-free methods. Code is available at
https://github.com/thu-ml/DPM-Solver-v3.
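For context, a first-order deterministic update of the diffusion probability-flow ODE (essentially a DDIM step) is sketched below; DPM-Solver-v3 goes further by replacing the fixed noise/data parameterization with coefficients estimated from the pretrained model, which this sketch does not show.

```python
import torch

@torch.no_grad()
def ddim_step(x_t, t, t_prev, eps_model, alphas_cumprod):
    """One deterministic first-order step of the diffusion probability-flow ODE."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, t)                                   # noise prediction
    x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean sample
    return a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps # jump directly to t_prev
```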
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation
October 19, 2023
Sihan Xu, Ziqiao Ma, Yidong Huang, Honglak Lee, Joyce Chai
Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks
but lack an intuitive interface for consistent image-to-image (I2I)
translation. Various methods have been explored to address this issue,
including mask-based methods, attention-based methods, and image-conditioning.
However, it remains a critical challenge to enable unpaired I2I translation
with pre-trained DMs while maintaining satisfying consistency. This paper
introduces CycleNet, a novel but simple method that incorporates cycle
consistency into DMs to regularize image manipulation. We validate CycleNet on
unpaired I2I tasks of different granularities. Besides the scene and object
level translation, we additionally contribute a multi-domain I2I translation
dataset to study the physical state changes of objects. Our empirical studies
show that CycleNet is superior in translation consistency and quality, and can
generate high-quality images for out-of-domain distributions with a simple
change of the textual prompt. CycleNet is a practical framework, which is
robust even with very limited training data (around 2k) and requires minimal
computational resources (1 GPU) to train. Project homepage:
https://cyclenetweb.github.io/
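The generic cycle-consistency idea can be written as the hedged sketch below: translate an image with one prompt, translate it back with the inverse prompt, and penalize the reconstruction error. This is a pixel-space illustration under assumed `translate` and prompt names, not CycleNet's exact formulation in the diffusion process.

```python
import torch

def cycle_consistency_loss(x, translate, prompt_ab, prompt_ba):
    """Cycle-consistency regularizer for text-guided I2I translation (sketch).

    translate(x, prompt): runs a conditional diffusion model to map an image to
    the domain described by `prompt`; prompt_ab / prompt_ba describe the forward
    and backward translations (e.g. "summer scene" and "winter scene").
    """
    x_ab = translate(x, prompt_ab)            # A -> B
    x_aba = translate(x_ab, prompt_ba)        # B -> back to A
    return ((x_aba - x) ** 2).mean()          # reconstruction should match the input
```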
EMIT-Diff: Enhancing Medical Image Segmentation via Text-Guided Diffusion Model
October 19, 2023
Zheyuan Zhang, Lanhong Yao, Bin Wang, Debesh Jha, Elif Keles, Alpay Medetalibeyoglu, Ulas Bagci
Large-scale, highly varied, and high-quality data are crucial for developing
robust and successful deep-learning models for medical applications since they
potentially enable better generalization performance and avoid overfitting.
However, the scarcity of high-quality labeled data always presents significant
challenges. This paper proposes a novel approach to address this challenge by
developing controllable diffusion models for medical image synthesis, called
EMIT-Diff. We leverage recent diffusion probabilistic models to generate
realistic and diverse synthetic medical image data that preserve the essential
characteristics of the original medical images by incorporating edge
information of objects to guide the synthesis process. In our approach, we
ensure that the synthesized samples adhere to medically relevant constraints
and preserve the underlying structure of imaging data. Due to the random
sampling process by the diffusion model, we can generate an arbitrary number of
synthetic images with diverse appearances. To validate the effectiveness of our
proposed method, we conduct an extensive set of medical image segmentation
experiments on multiple datasets, including Ultrasound breast (+13.87%), CT
spleen (+0.38%), and MRI prostate (+7.78%), achieving significant improvements
over the baseline segmentation methods. To the best of our knowledge, these
promising results demonstrate for the first time the effectiveness of EMIT-Diff
for medical image segmentation and show the feasibility of introducing a
text-guided diffusion model for general medical image segmentation
tasks. With carefully designed ablation experiments, we investigate the
influence of various data augmentation ratios, hyper-parameter settings, patch
size for generating random merging mask settings, and combined influence with
different network architectures.
Product of Gaussian Mixture Diffusion Models
October 19, 2023
Martin Zach, Erich Kobler, Antonin Chambolle, Thomas Pock
In this work we tackle the problem of estimating the density $ f_X $ of a
random variable $ X $ by successive smoothing, such that the smoothed random
variable $ Y $ fulfills the diffusion partial differential equation $
(\partial_t - \Delta_1)f_Y(\,\cdot\,, t) = 0 $ with initial condition $
f_Y(\,\cdot\,, 0) = f_X $. We propose a product-of-experts-type model utilizing
Gaussian mixture experts and study configurations that admit an analytic
expression for $ f_Y (\,\cdot\,, t) $. In particular, with a focus on image
processing, we derive conditions for models acting on filter-, wavelet-, and
shearlet responses. Our construction naturally allows the model to be trained
simultaneously over the entire diffusion horizon using empirical Bayes. We show
numerical results for image denoising where our models are competitive while
being tractable, interpretable, and having only a small number of learnable
parameters. As a byproduct, our models can be used for reliable noise
estimation, allowing blind denoising of images corrupted by heteroscedastic
noise.
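For intuition, if an expert is a Gaussian mixture and $\Delta_1$ denotes the Laplacian in the first argument with the standard normalization, the heat equation is solved by convolution with a Gaussian kernel, so a mixture stays a mixture with inflated covariances. This is a sketch of the kind of analytic expression the construction exploits; the paper's filter-, wavelet-, and shearlet-domain models are more involved.

```latex
% Heat-flow smoothing keeps a Gaussian mixture a Gaussian mixture (sketch).
f_X(x) = \sum_{k} w_k \, \mathcal{N}\!\left(x;\ \mu_k,\ \Sigma_k\right)
\;\Longrightarrow\;
f_Y(x, t) = \left(f_X * G_{2t}\right)(x)
          = \sum_{k} w_k \, \mathcal{N}\!\left(x;\ \mu_k,\ \Sigma_k + 2t\,\mathrm{I}\right),
\qquad G_{2t} = \mathcal{N}(0,\ 2t\,\mathrm{I}).
```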
Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation
October 19, 2023
Mariia Zameshina, Olivier Teytaud, Laurent Najman
Latent diffusion models excel at producing high-quality images from text.
Yet, concerns appear about the lack of diversity in the generated imagery. To
tackle this, we introduce Diverse Diffusion, a method for boosting image
diversity beyond gender and ethnicity, spanning into richer realms, including
color diversity. Diverse Diffusion is a general unsupervised technique that can
be applied to existing text-to-image models. Our approach focuses on finding
vectors in the Stable Diffusion latent space that are distant from each other.
We generate multiple vectors in the latent space until we find a set of vectors
that meets the desired distance requirements and the required batch size. To
evaluate the effectiveness of our diversity methods, we conduct experiments
examining various characteristics, including color diversity, LPIPS metric, and
ethnicity/gender representation in images featuring humans. The results of our
experiments emphasize the significance of diversity in generating realistic and
varied images, offering valuable insights for improving text-to-image models.
Through the enhancement of image diversity, our approach contributes to the
creation of more inclusive and representative AI-generated art.
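The latent selection step can be sketched as a simple rejection-style search; the distance criterion, threshold, and function name below are illustrative assumptions rather than the authors' implementation.

```python
import torch

def sample_diverse_latents(n, shape, min_dist, max_tries=1000):
    """Search for a batch of mutually distant diffusion latents (sketch).

    Repeatedly draws Gaussian latents and accepts one only if its Euclidean
    distance to every already-accepted latent exceeds min_dist, until the
    required batch size n is reached.
    """
    accepted = []
    for _ in range(max_tries):
        z = torch.randn(shape)
        if all((z - a).norm() >= min_dist for a in accepted):
            accepted.append(z)
        if len(accepted) == n:
            return torch.stack(accepted)
    raise RuntimeError("could not find enough distant latents; lower min_dist")
```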
Closed-Form Diffusion Models
October 19, 2023
Christopher Scarvelis, Haitz Sáez de Ocáriz Borde, Justin Solomon
Score-based generative models (SGMs) sample from a target distribution by
iteratively transforming noise using the score function of the perturbed
target. For any finite training set, this score function can be evaluated in
closed form, but the resulting SGM memorizes its training data and does not
generate novel samples. In practice, one approximates the score by training a
neural network via score-matching. The error in this approximation promotes
generalization, but neural SGMs are costly to train and sample, and the
effective regularization this error provides is not well-understood
theoretically. In this work, we instead explicitly smooth the closed-form score
to obtain an SGM that generates novel samples without training. We analyze our
model and propose an efficient nearest-neighbor-based estimator of its score
function. Using this estimator, our method achieves sampling times competitive
with neural SGMs while running on consumer-grade CPUs.
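The closed-form score of a Gaussian-smoothed empirical distribution, restricted to nearest neighbours for efficiency, can be sketched as below. This is the plain un-regularized closed form under an isotropic Gaussian kernel; the paper's explicit smoothing of the score (which prevents memorization) is not shown, and the function name is a placeholder.

```python
import torch

def smoothed_closed_form_score(x, train_data, sigma, k=64):
    """Approximate score of a Gaussian-smoothed empirical distribution (sketch).

    The exact score is a softmax-weighted average of (x_i - x) / sigma^2 over all
    training points x_i; restricting the sum to the k nearest neighbours gives a
    cheap estimator. x: (D,) query point; train_data: (N, D) training set.
    """
    d2 = ((train_data - x) ** 2).sum(dim=-1)            # squared distances to all points
    d2_k, idx = torch.topk(d2, k, largest=False)        # keep the k nearest neighbours
    w = torch.softmax(-d2_k / (2 * sigma ** 2), dim=0)  # Gaussian responsibility weights
    return (w.unsqueeze(-1) * (train_data[idx] - x)).sum(dim=0) / sigma ** 2
```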
Bayesian Flow Networks in Continual Learning
October 18, 2023
Mateusz Pyla, Kamil Deja, Bartłomiej Twardowski, Tomasz Trzciński
Bayesian Flow Networks (BFNs) have recently been proposed as one of the most
promising directions toward universal generative modelling, with the ability to
learn any data type. Their power comes from the expressiveness of neural
networks and Bayesian inference, which makes them suitable in the context of
continual learning. We delve into the mechanics behind BFNs and conduct
experiments to empirically verify their generative capabilities on
non-stationary data.
To Generate or Not? Safety-Driven Unlearned Diffusion Models Are Still Easy To Generate Unsafe Images … For Now
October 18, 2023
Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, Sijia Liu
The recent advances in diffusion models (DMs) have revolutionized the
generation of complex and diverse images. However, these models also introduce
potential safety hazards, such as the production of harmful content and
infringement of data copyrights. Although there have been efforts to create
safety-driven unlearning methods to counteract these challenges, doubts remain
about their capabilities. To bridge this uncertainty, we propose an evaluation
framework built upon adversarial attacks (also referred to as adversarial
prompts), in order to discern the trustworthiness of these safety-driven
unlearned DMs. Specifically, our research explores the (worst-case) robustness
of unlearned DMs in eradicating unwanted concepts, styles, and objects,
assessed by the generation of adversarial prompts. We develop a novel
adversarial learning approach called UnlearnDiff that leverages the inherent
classification capabilities of DMs to streamline the generation of adversarial
prompts, making the process as simple for generative models as it is for image
classification attacks. Through comprehensive benchmarking, we assess the unlearning
robustness of five prevalent unlearned DMs across multiple tasks. Our results
underscore the effectiveness and efficiency of UnlearnDiff when compared to
state-of-the-art adversarial prompting methods. Codes are available at
https://github.com/OPTML-Group/Diffusion-MU-Attack. WARNING: This paper
contains model outputs that may be offensive in nature.
A Survey on Video Diffusion Models
October 16, 2023
Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, Yu-Gang Jiang
The recent wave of AI-generated content (AIGC) has witnessed substantial
success in computer vision, with the diffusion model playing a crucial role in
this achievement. Due to their impressive generative capabilities, diffusion
models are gradually superseding methods based on GANs and auto-regressive
Transformers, demonstrating exceptional performance not only in image
generation and editing, but also in the realm of video-related research.
However, existing surveys mainly focus on diffusion models in the context of
image generation, with few up-to-date reviews on their application in the video
domain. To address this gap, this paper presents a comprehensive review of
video diffusion models in the AIGC era. Specifically, we begin with a concise
introduction to the fundamentals and evolution of diffusion models.
Subsequently, we present an overview of research on diffusion models in the
video domain, categorizing the work into three key areas: video generation,
video editing, and other video understanding tasks. We conduct a thorough
review of the literature in these three key areas, including further
categorization and practical contributions in the field. Finally, we discuss
the challenges faced by research in this domain and outline potential future
developmental trends. A comprehensive list of video diffusion models studied in
this survey is available at
https://github.com/ChenHsing/Awesome-Video-Diffusion-Models.
Self-supervised Fetal MRI 3D Reconstruction Based on Radiation Diffusion Generation Model
October 16, 2023
Junpeng Tan, Xin Zhang, Yao Lv, Xiangmin Xu, Gang Li
Although the use of multiple stacks can handle slice-to-volume motion
correction and artifact removal problems, there are still several problems: 1)
The slice-to-volume method usually uses slices as input, which cannot solve the
problem of uniform intensity distribution and complementarity in regions of
different fetal MRI stacks; 2) The integrity of 3D space is not considered,
which adversely affects the discrimination and generation of globally
consistent information in fetal MRI; 3) Fetal MRI with severe motion artifacts
in the real world cannot achieve high-quality super-resolution reconstruction.
To address these issues, we propose a novel fetal brain MRI high-quality volume
reconstruction method, called the Radiation Diffusion Generation Model (RDGM).
It is a self-supervised generation method that combines the idea of
Neural Radiance Fields (NeRF), based on coordinate generation, with a diffusion
model based on super-resolution generation. To solve regional intensity
heterogeneity in different directions, we use a pre-trained transformer model
for slice registration, and then, a new regionally Consistent Implicit Neural
Representation (CINR) network sub-module is proposed. CINR can generate the
initial volume by combining a coordinate association map of two different
coordinate mapping spaces. To enhance volume global consistency and
discrimination, we introduce the Volume Diffusion Super-resolution Generation
(VDSG) mechanism. The global intensity discriminant generation from
volume-to-volume is carried out using the idea of diffusion generation, and
CINR becomes the deviation intensity generation network of the volume-to-volume
diffusion model. Finally, the experimental results on real-world fetal brain
MRI stacks demonstrate the state-of-the-art performance of our method.
Generative Entropic Neural Optimal Transport To Map Within and Across Spaces
October 13, 2023
Dominik Klein, Théo Uscidda, Fabian Theis, Marco Cuturi
Learning measure-to-measure mappings is a crucial task in machine learning,
featured prominently in generative modeling. Recent years have witnessed a
surge of techniques that draw inspiration from optimal transport (OT) theory.
Combined with neural network models, these methods collectively known as
\textit{Neural OT} use optimal transport as an inductive bias: such mappings
should be optimal w.r.t. a given cost function, in the sense that they are able
to move points in a thrifty way, within (by minimizing displacements) or across
spaces (by being isometric). This principle, while intuitive, is often
confronted with several practical challenges that require adapting the OT
toolbox: cost functions other than the squared-Euclidean cost can be
challenging to handle, the deterministic formulation of Monge maps leaves
little flexibility, mapping across incomparable spaces raises multiple
challenges, while the mass conservation constraint inherent to OT can provide
too much credit to outliers. While each of these mismatches between practice
and theory has been addressed independently in various works, we propose in
this work an elegant framework to unify them, called \textit{generative
entropic neural optimal transport} (GENOT). GENOT can accommodate any cost
function; handles randomness using conditional generative models; can map
points across incomparable spaces, and can be used as an \textit{unbalanced}
solver. We evaluate our approach through experiments conducted on various
synthetic datasets and demonstrate its practicality in single-cell biology. In
this domain, GENOT proves to be valuable for tasks such as modeling cell
development, predicting cellular responses to drugs, and translating between
different data modalities of cells.
Unseen Image Synthesis with Diffusion Models
October 13, 2023
Ye Zhu, Yu Wu, Zhiwei Deng, Olga Russakovsky, Yan Yan
While the current trend in the generative field is scaling up towards larger
models and more training data for generalized domain representations, we go the
opposite direction in this work by synthesizing unseen domain images without
additional training. We do so via latent sampling and geometric optimization
using pre-trained and frozen Denoising Diffusion Probabilistic Models (DDPMs)
on single-domain datasets. Our key observation is that DDPMs pre-trained even
just on single-domain images are already equipped with sufficient
representation abilities to reconstruct arbitrary images from the inverted
latent encoding following bi-directional deterministic diffusion and denoising
trajectories. This motivates us to investigate the statistical and geometric
behaviors of the Out-Of-Distribution (OOD) samples from unseen image domains in
the latent spaces along the denoising chain. Notably, we theoretically and
empirically show that the inverted OOD samples also establish Gaussians that
are distinguishable from the original In-Domain (ID) samples in the
intermediate latent spaces, which allows us to sample from them directly.
Geometrical domain-specific and model-dependent information of the unseen
subspace (e.g., sample-wise distance and angles) is used to further optimize
the sampled OOD latent encodings from the estimated Gaussian prior. We conduct
extensive analysis and experiments using pre-trained diffusion models (DDPM,
iDDPM) on different datasets (AFHQ, CelebA-HQ, LSUN-Church, and LSUN-Bedroom),
proving the effectiveness of this novel perspective to explore and re-think the
diffusion models’ data synthesis generalization ability.
Sampling from Mean-Field Gibbs Measures via Diffusion Processes
October 13, 2023
Ahmed El Alaoui, Andrea Montanari, Mark Sellke
We consider Ising mixed $p$-spin glasses at high-temperature and without
external field, and study the problem of sampling from the Gibbs distribution
$\mu$ in polynomial time. We develop a new sampling algorithm with complexity
of the same order as evaluating the gradient of the Hamiltonian and, in
particular, at most linear in the input size. We prove that, at sufficiently
high-temperature, it produces samples from a distribution $\mu^{alg}$ which is
close in normalized Wasserstein distance to $\mu$. Namely, there exists a
coupling of $\mu$ and $\mu^{alg}$ such that if $({\boldsymbol x},{\boldsymbol
x}^{alg})\in\{-1,+1\}^n\times \{-1,+1\}^n$ is a pair drawn from this coupling,
then $n^{-1}\,{\mathbb E}\bigl[\|{\boldsymbol x}-{\boldsymbol
x}^{alg}\|_2^2\bigr]=o_n(1)$. For the case of the Sherrington-Kirkpatrick model,
our algorithm succeeds in the full replica-symmetric phase.
We complement this result with a negative one for sampling algorithms
satisfying a certain `stability' property, which is verified by many standard
techniques.
No stable algorithm can approximately sample at temperatures below the onset
of shattering, even under the normalized Wasserstein metric. Further, no
algorithm can sample at temperatures below the onset of replica symmetry
breaking.
Our sampling method implements a discretized version of a diffusion process
that has recently become popular in machine learning under the name of
`denoising diffusion.' We derive the same process from the general construction
of stochastic localization. Implementing the diffusion process requires
efficiently approximating the mean of the tilted measure. To this end, we use an
approximate message passing algorithm that, as we prove, achieves sufficiently
accurate mean estimation.
DDMT: Denoising Diffusion Mask Transformer Models for Multivariate Time Series Anomaly Detection
October 13, 2023
Chaocheng Yang, Tingyin Wang, Xuanhui Yan
Anomaly detection in multivariate time series has emerged as a crucial
challenge in time series research, with significant research implications in
various fields such as fraud detection, fault diagnosis, and system state
estimation. Reconstruction-based models have shown promising potential in
recent years for detecting anomalies in time series data. However, due to the
rapid increase in data scale and dimensionality, the issues of noise and Weak
Identity Mapping (WIM) during time series reconstruction have become
increasingly pronounced. To address this, we introduce a novel Adaptive Dynamic
Neighbor Mask (ADNM) mechanism and integrate it with the Transformer and
Denoising Diffusion Model, creating a new framework for multivariate time
series anomaly detection, named Denoising Diffusion Mask Transformer (DDMT).
The ADNM module is introduced to mitigate information leakage between input and
output features during data reconstruction, thereby alleviating the problem of
WIM during reconstruction. The Denoising Diffusion Transformer (DDT) employs
the Transformer as the internal neural network structure of the Denoising Diffusion
Model. It learns the stepwise generation process of time series data to model
the probability distribution of the data, capturing normal data patterns and
progressively restoring time series data by removing noise, resulting in a
clear recovery of anomalies. To the best of our knowledge, this is the first
model that combines Denoising Diffusion Model and the Transformer for
multivariate time series anomaly detection. Experimental evaluations were
conducted on five publicly available multivariate time series anomaly detection
datasets. The results demonstrate that the model effectively identifies
anomalies in time series data, achieving state-of-the-art performance in
anomaly detection.
Debias the Training of Diffusion Models
October 12, 2023
Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, Feng Zhao
Diffusion models have demonstrated compelling generation quality by
optimizing the variational lower bound through a simple denoising score
matching loss. In this paper, we provide theoretical evidence that the
prevailing practice of using a constant loss weight strategy in diffusion
models leads to biased estimation during the training phase. Simply optimizing
the denoising network to predict Gaussian noise with constant weighting may
hinder precise estimations of original images. To address the issue, we propose
an elegant and effective weighting strategy grounded in the theoretically
unbiased principle. Moreover, we conduct a comprehensive and systematic
exploration to dissect the inherent bias problem deriving from constant
weighting loss from the perspectives of its existence, impact and reasons.
These analyses are expected to advance our understanding and demystify the
inner workings of diffusion models. Through empirical evaluation, we
demonstrate that our proposed debiased estimation method significantly enhances
sample quality without the reliance on complex techniques, and exhibits
improved efficiency compared to the baseline method both in training and
sampling processes.
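A minimal sketch of what replacing the constant loss weight looks like in a standard noise-prediction training loop; the specific weight function used here (a clipped SNR weight) is a placeholder for illustration, not the weighting derived in the paper.

```python
import torch

def weighted_denoising_loss(eps_model, x0, t, alphas_cumprod):
    """Noise-prediction loss with a per-timestep weight instead of a constant one.

    x0: batch of images, shape (B, C, H, W); t: integer timesteps, shape (B,).
    """
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # DDPM forward process

    snr = a_bar / (1 - a_bar)                 # signal-to-noise ratio at timestep t
    w = snr.clamp(max=5.0)                    # placeholder weight; the paper derives its own
    per_sample = ((eps_model(xt, t) - noise) ** 2).mean(dim=(1, 2, 3))
    return (w.squeeze() * per_sample).mean()
```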
Neural Diffusion Models
October 12, 2023
Grigory Bartosh, Dmitry Vetrov, Christian A. Naesseth
Diffusion models have shown remarkable performance on many generative tasks.
Despite recent success, most diffusion models are restricted in that they only
allow linear transformations of the data distribution. In contrast, a broader
family of transformations can potentially help train generative distributions
more efficiently, simplifying the reverse process and closing the gap between
the true negative log-likelihood and the variational approximation. In this
paper, we present Neural Diffusion Models (NDMs), a generalization of
conventional diffusion models that enables defining and learning time-dependent
non-linear transformations of data. We show how to optimise NDMs using a
variational bound in a simulation-free setting. Moreover, we derive a
time-continuous formulation of NDMs, which allows fast and reliable inference
using off-the-shelf numerical ODE and SDE solvers. Finally, we demonstrate the
utility of NDMs with learnable transformations through experiments on standard
image generation benchmarks, including CIFAR-10, downsampled versions of
ImageNet and CelebA-HQ. NDMs outperform conventional diffusion models in terms
of likelihood and produce high-quality samples.
Interpretable Diffusion via Information Decomposition
October 12, 2023
Xianghao Kong, Ollie Liu, Han Li, Dani Yogatama, Greg Ver Steeg
cs.LG, cs.AI, cs.IT, math.IT
Denoising diffusion models enable conditional generation and density modeling
of complex relationships like images and text. However, the nature of the
learned relationships is opaque, making it difficult to understand precisely
what relationships between words and parts of an image are captured, or to
predict the effect of an intervention. We illuminate the fine-grained
relationships learned by diffusion models by noticing a precise relationship
between diffusion and information decomposition. Exact expressions for mutual
information and conditional mutual information can be written in terms of the
denoising model. Furthermore, pointwise estimates can easily be computed as
well, allowing us to ask questions about the relationships between specific
images and captions. Decomposing information even further to understand which
variables in a high-dimensional space carry information is a long-standing
problem. For diffusion models, we show that a natural non-negative
decomposition of mutual information emerges, allowing us to quantify
informative relationships between words and pixels in an image. We exploit
these new relations to measure the compositional understanding of diffusion
models, to do unsupervised localization of objects in images, and to measure
effects when selectively editing images through prompt interventions.
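As a rough sketch, following the I-MMSE-style relation between denoising error and information (the exact normalization and parameterization in the paper may differ), the information a caption $\mathbf{y}$ carries about an image $\mathbf{x}$ can be written as an integral over noise levels of the gap between unconditional and conditional denoising errors:

```latex
% Mutual information as an integral (over SNR levels \gamma) of denoising-error gaps.
I(\mathbf{x};\mathbf{y}) \;=\; \tfrac{1}{2}\int \Big(
  \mathbb{E}\big[\|\mathbf{x}-\hat{\mathbf{x}}_\gamma(\mathbf{x}_\gamma)\|^2\big]
 -\mathbb{E}\big[\|\mathbf{x}-\hat{\mathbf{x}}_\gamma(\mathbf{x}_\gamma,\mathbf{y})\|^2\big]
\Big)\, d\gamma .
```

Dropping the expectations gives pointwise (and, summing over coordinates, pixelwise) versions, which is what enables per-word, per-region attributions of the kind described above.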
Efficient Integrators for Diffusion Generative Models
October 11, 2023
Kushagra Pandey, Maja Rudolph, Stephan Mandt
cs.LG, cs.AI, cs.CV, stat.ML
Diffusion models suffer from slow sample generation at inference time.
Therefore, developing a principled framework for fast deterministic/stochastic
sampling for a broader class of diffusion models is a promising direction. We
propose two complementary frameworks for accelerating sample generation in
pre-trained models: Conjugate Integrators and Splitting Integrators. Conjugate
integrators generalize DDIM, mapping the reverse diffusion dynamics to a more
amenable space for sampling. In contrast, splitting-based integrators, commonly
used in molecular dynamics, reduce the numerical simulation error by cleverly
alternating between numerical updates involving the data and auxiliary
variables. After extensively studying these methods empirically and
theoretically, we present a hybrid method that leads to the best-reported
performance for diffusion models in augmented spaces. Applied to Phase Space
Langevin Diffusion [Pandey & Mandt, 2023] on CIFAR-10, our deterministic and
stochastic samplers achieve FID scores of 2.11 and 2.36 in only 100 network
function evaluations (NFE) as compared to 2.57 and 2.63 for the best-performing
baselines, respectively. Our code and model checkpoints will be made publicly
available at \url{https://github.com/mandt-lab/PSLD}.
Generative Modeling with Phase Stochastic Bridges
October 11, 2023
Tianrong Chen, Jiatao Gu, Laurent Dinh, Evangelos A. Theodorou, Josh Susskind, Shuangfei Zhai
Diffusion models (DMs) represent state-of-the-art generative models for
continuous inputs. DMs work by constructing a Stochastic Differential Equation
(SDE) in the input space (i.e., position space), and using a neural network to
reverse it. In this work, we introduce a novel generative modeling framework
grounded in \textbf{phase space dynamics}, where a phase space is defined as
{an augmented space encompassing both position and velocity.} Leveraging
insights from Stochastic Optimal Control, we construct a path measure in the
phase space that enables efficient sampling. {In contrast to DMs, our framework
demonstrates the capability to generate realistic data points at an early stage
of dynamics propagation.} This early prediction sets the stage for efficient
data generation by leveraging additional velocity information along the
trajectory. On standard image generation benchmarks, our model yields favorable
performance over baselines in the regime of small Number of Function
Evaluations (NFEs). Furthermore, our approach rivals the performance of
diffusion models equipped with efficient sampling techniques, underscoring its
potential as a new tool for generative modeling.
Score Regularized Policy Optimization through Diffusion Behavior
October 11, 2023
Huayu Chen, Cheng Lu, Zhengyi Wang, Hang Su, Jun Zhu
Recent developments in offline reinforcement learning have uncovered the
immense potential of diffusion modeling, which excels at representing
heterogeneous behavior policies. However, sampling from diffusion policies is
considerably slow because it necessitates tens to hundreds of iterative
inference steps for one action. To address this issue, we propose to extract an
efficient deterministic inference policy from critic models and pretrained
diffusion behavior models, leveraging the latter to directly regularize the
policy gradient with the behavior distribution’s score function during
optimization. Our method enjoys powerful generative capabilities of diffusion
modeling while completely circumventing the computationally intensive and
time-consuming diffusion sampling scheme, both during training and evaluation.
Extensive results on D4RL tasks show that our method boosts action sampling
speed by more than 25 times compared with various leading diffusion-based
methods in locomotion tasks, while still maintaining state-of-the-art
performance.
Generative Modeling on Manifolds Through Mixture of Riemannian Diffusion Processes
October 11, 2023
Jaehyeong Jo, Sung Ju Hwang
Learning the distribution of data on Riemannian manifolds is crucial for
modeling data from non-Euclidean space, which is required by many applications
from diverse scientific fields. Yet, existing generative models on manifolds
suffer from expensive divergence computation or rely on approximations of heat
kernel. These limitations restrict their applicability to simple geometries and
hinder scalability to high dimensions. In this work, we introduce the
Riemannian Diffusion Mixture, a principled framework for building a generative
process on manifolds as a mixture of endpoint-conditioned diffusion processes
instead of relying on the denoising approach of previous diffusion models, for
which the generative process is characterized by its drift guiding toward the
most probable endpoint with respect to the geometry of the manifold. We further
propose a simple yet efficient training objective for learning the mixture
process, which is readily applicable to general manifolds. Our method
outperforms previous generative models on various manifolds while scaling to
high dimensions and requires a dramatically reduced number of in-training
simulation steps for general manifolds.
State of the Art on Diffusion Models for Visual Computing
October 11, 2023
Ryan Po, Wang Yifan, Vladislav Golyanik, Kfir Aberman, Jonathan T. Barron, Amit H. Bermano, Eric Ryan Chan, Tali Dekel, Aleksander Holynski, Angjoo Kanazawa, C. Karen Liu, Lingjie Liu, Ben Mildenhall, Matthias Nießner, Björn Ommer, Christian Theobalt, Peter Wonka, Gordon Wetzstein
cs.AI, cs.CV, cs.GR, cs.LG
The field of visual computing is rapidly advancing due to the emergence of
generative artificial intelligence (AI), which unlocks unprecedented
capabilities for the generation, editing, and reconstruction of images, videos,
and 3D scenes. In these domains, diffusion models are the generative AI
architecture of choice. Within the last year alone, the literature on
diffusion-based tools and applications has seen exponential growth and relevant
papers are published across the computer graphics, computer vision, and AI
communities with new works appearing daily on arXiv. This rapid growth of the
field makes it difficult to keep up with all recent developments. The goal of
this state-of-the-art report (STAR) is to introduce the basic mathematical
concepts of diffusion models, implementation details and design choices of the
popular Stable Diffusion model, as well as overview important aspects of these
generative AI tools, including personalization, conditioning, inversion, among
others. Moreover, we give a comprehensive overview of the rapidly growing
literature on diffusion-based generation and editing, categorized by the type
of generated medium, including 2D images, videos, 3D objects, locomotion, and
4D scenes. Finally, we discuss available datasets, metrics, open challenges,
and social implications. This STAR provides an intuitive starting point to
explore this exciting topic for researchers, artists, and practitioners alike.
Diffusion Prior Regularized Iterative Reconstruction for Low-dose CT
October 10, 2023
Wenjun Xia, Yongyi Shi, Chuang Niu, Wenxiang Cong, Ge Wang
eess.IV, cs.LG, physics.med-ph
Computed tomography (CT) involves a patient’s exposure to ionizing radiation.
To reduce the radiation dose, we can either lower the X-ray photon count or
down-sample projection views. However, either of the ways often compromises
image quality. To address this challenge, here we introduce an iterative
reconstruction algorithm regularized by a diffusion prior. Drawing on the
exceptional imaging prowess of the denoising diffusion probabilistic model
(DDPM), we merge it with a reconstruction procedure that prioritizes data
fidelity. This fusion capitalizes on the merits of both techniques, delivering
exceptional reconstruction results in an unsupervised framework. To further
enhance the efficiency of the reconstruction process, we incorporate the
Nesterov momentum acceleration technique. This enhancement facilitates superior
diffusion sampling in fewer steps. As demonstrated in our experiments, our
method offers a potential pathway to high-definition CT image reconstruction
with minimized radiation.
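The alternation between data fidelity and the diffusion prior, with a Nesterov-style momentum term, can be sketched as below; the operators `A`/`At`, the `diffusion_denoise` callable, and the step sizes are assumptions for illustration, not the paper's algorithm.

```python
import torch

def diffusion_regularized_recon(y, A, At, diffusion_denoise,
                                n_iters=50, step=1.0, momentum=0.9):
    """Iterative CT reconstruction with a diffusion prior and momentum (sketch).

    y: measured sinogram; A / At: forward projection and its adjoint (callables).
    diffusion_denoise(x, i): one diffusion-prior refinement of the current image.
    """
    x = At(y)                                   # simple back-projection initialization
    x_prev = x.clone()
    for i in range(n_iters):
        v = x + momentum * (x - x_prev)         # Nesterov-style extrapolation
        grad = At(A(v) - y)                     # gradient of the data-fidelity term
        x_prev = x
        x = v - step * grad                     # data-fidelity update
        x = diffusion_denoise(x, i)             # diffusion-prior regularization step
    return x
```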
Stochastic Super-resolution of Cosmological Simulations with Denoising Diffusion Models
October 10, 2023
Andreas Schanz, Florian List, Oliver Hahn
astro-ph.CO, astro-ph.IM, cs.LG
In recent years, deep learning models have been successfully employed for
augmenting low-resolution cosmological simulations with small-scale
information, a task known as “super-resolution”. So far, these cosmological
super-resolution models have relied on generative adversarial networks (GANs),
which can achieve highly realistic results, but suffer from various
shortcomings (e.g. low sample diversity). We introduce denoising diffusion
models as a powerful generative model for super-resolving cosmic large-scale
structure predictions (as a first proof-of-concept in two dimensions). To
obtain accurate results down to small scales, we develop a new “filter-boosted”
training approach that redistributes the importance of different scales in the
pixel-wise training objective. We demonstrate that our model not only produces
convincing super-resolution images and power spectra consistent at the percent
level, but is also able to reproduce the diversity of small-scale features
consistent with a given low-resolution simulation. This enables uncertainty
quantification for the generated small-scale features, which is critical for
the usefulness of such super-resolution models as a viable surrogate model for
cosmic structure formation.
What Does Stable Diffusion Know about the 3D Scene?
October 10, 2023
Guanqi Zhan, Chuanxia Zheng, Weidi Xie, Andrew Zisserman
Recent advances in generative models like Stable Diffusion enable the
generation of highly photo-realistic images. Our objective in this paper is to
probe the diffusion network to determine to what extent it ‘understands’
different properties of the 3D scene depicted in an image. To this end, we make
the following contributions: (i) We introduce a protocol to evaluate whether a
network models a number of physical ‘properties’ of the 3D scene by probing for
explicit features that represent these properties. The probes are applied on
datasets of real images with annotations for the property. (ii) We apply this
protocol to properties covering scene geometry, scene material, support
relations, lighting, and view dependent measures. (iii) We find that Stable
Diffusion is good at a number of properties including scene geometry, support
relations, shadows and depth, but less performant for occlusion. (iv) We also
apply the probes to other models trained at large-scale, including DINO and
CLIP, and find their performance inferior to that of Stable Diffusion.
Latent Diffusion Counterfactual Explanations
October 10, 2023
Karim Farid, Simon Schrodi, Max Argus, Thomas Brox
Counterfactual explanations have emerged as a promising method for
elucidating the behavior of opaque black-box models. Recently, several works
leveraged pixel-space diffusion models for counterfactual generation. To handle
noisy, adversarial gradients during counterfactual generation – causing
unrealistic artifacts or mere adversarial perturbations – they required either
auxiliary adversarially robust models or computationally intensive guidance
schemes. However, such requirements limit their applicability, e.g., in
scenarios with restricted access to the model’s training data. To address these
limitations, we introduce Latent Diffusion Counterfactual Explanations (LDCE).
LDCE harnesses the capabilities of recent class- or text-conditional foundation
latent diffusion models to expedite counterfactual generation and focus on the
important, semantic parts of the data. Furthermore, we propose a novel
consensus guidance mechanism to filter out noisy, adversarial gradients that
are misaligned with the diffusion model’s implicit classifier. We demonstrate
the versatility of LDCE across a wide spectrum of models trained on diverse
datasets with different learning paradigms. Finally, we showcase how LDCE can
provide insights into model errors, enhancing our understanding of black-box
model behavior.
Leveraging Diffusion-Based Image Variations for Robust Training on Poisoned Data
October 10, 2023
Lukas Struppek, Martin B. Hentschel, Clifton Poth, Dominik Hintersdorf, Kristian Kersting
Backdoor attacks pose a serious security threat for training neural networks
as they surreptitiously introduce hidden functionalities into a model. Such
backdoors remain silent during inference on clean inputs, evading detection due
to inconspicuous behavior. However, once a specific trigger pattern appears in
the input data, the backdoor activates, causing the model to execute its
concealed function. Detecting such poisoned samples within vast datasets is
virtually impossible through manual inspection. To address this challenge, we
propose a novel approach that enables model training on potentially poisoned
datasets by utilizing the power of recent diffusion models. Specifically, we
create synthetic variations of all training samples, leveraging the inherent
resilience of diffusion models to potential trigger patterns in the data. By
combining this generative approach with knowledge distillation, we produce
student models that maintain their general performance on the task while
exhibiting robust resistance to backdoor triggers.
JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling
October 10, 2023
Jingyang Zhang, Shiwei Li, Yuanxun Lu, Tian Fang, David McKinnon, Yanghai Tsin, Long Quan, Yao Yao
We introduce JointNet, a novel neural network architecture for modeling the
joint distribution of images and an additional dense modality (e.g., depth
maps). JointNet is extended from a pre-trained text-to-image diffusion model,
where a copy of the original network is created for the new dense modality
branch and is densely connected with the RGB branch. The RGB branch is locked
during network fine-tuning, which enables efficient learning of the new
modality distribution while maintaining the strong generalization ability of
the large-scale pre-trained diffusion model. We demonstrate the effectiveness
of JointNet by using RGBD diffusion as an example and through extensive
experiments, showcasing its applicability in a variety of applications,
including joint RGBD generation, dense depth prediction, depth-conditioned
image generation, and coherent tile-based 3D panorama generation.
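A minimal sketch of the branch setup described above, assuming a generic PyTorch UNet: the pretrained RGB branch is duplicated to initialize the dense-modality branch and then frozen, so fine-tuning only touches the new branch. The dense cross-branch connections are omitted; this is an illustration, not the JointNet code.

```python
import copy
import torch.nn as nn

def build_jointnet_branches(pretrained_rgb_unet: nn.Module):
    """Duplicate a pretrained text-to-image UNet to serve as the dense-modality
    branch, and lock the original RGB branch so that only the new branch (plus
    any cross-branch connections added elsewhere) is fine-tuned."""
    rgb_branch = pretrained_rgb_unet
    dense_branch = copy.deepcopy(pretrained_rgb_unet)   # new modality branch

    for p in rgb_branch.parameters():                   # RGB branch is frozen
        p.requires_grad_(False)
    rgb_branch.eval()

    trainable_params = list(dense_branch.parameters())  # dense branch trains
    return rgb_branch, dense_branch, trainable_params
```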
Latent Diffusion Model for DNA Sequence Generation
October 09, 2023
Zehui Li, Yuhao Ni, Tim August B. Huygelen, Akashaditya Das, Guoxuan Xia, Guy-Bart Stan, Yiren Zhao
The harnessing of machine learning, especially deep generative models, has
opened up promising avenues in the field of synthetic DNA sequence generation.
Whilst Generative Adversarial Networks (GANs) have gained traction for this
application, they often face issues such as limited sample diversity and mode
collapse. On the other hand, Diffusion Models are a promising new class of
generative models that are not burdened with these problems, enabling them to
reach the state-of-the-art in domains such as image generation. In light of
this, we propose a novel latent diffusion model, DiscDiff, tailored for
discrete DNA sequence generation. By simply embedding discrete DNA sequences
into a continuous latent space using an autoencoder, we are able to leverage
the powerful generative abilities of continuous diffusion models for the
generation of discrete data. Additionally, we introduce Fréchet
Reconstruction Distance (FReD) as a new metric to measure the sample quality of
DNA sequence generations. Our DiscDiff model demonstrates an ability to
generate synthetic DNA sequences that align closely with real DNA in terms of
Motif Distribution, Latent Embedding Distribution (FReD), and Chromatin
Profiles. Additionally, we contribute a comprehensive cross-species dataset of
150K unique promoter-gene sequences from 15 species, enriching resources for
future generative modelling in genomics. We will make our code public upon
publication.
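The abstract does not spell out how FReD is computed, but a Fréchet-style distance between Gaussian fits of two embedding sets (the same formula FID uses) is the natural reading. A hedged sketch, with the embedding network left abstract:

```python
import numpy as np
from scipy import linalg

def frechet_distance(emb_real, emb_gen):
    """Fréchet distance between Gaussian fits of two embedding sets; shown only
    as a plausible shape for a FReD-style metric, since the paper's exact
    embedding/reconstruction network is not specified.
    emb_real, emb_gen: arrays of shape (n_samples, dim)."""
    mu1, mu2 = emb_real.mean(0), emb_gen.mean(0)
    c1 = np.cov(emb_real, rowvar=False)
    c2 = np.cov(emb_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):          # drop numerical imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```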
DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models
October 09, 2023
Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
Diffusion models have gained prominence in generating high-quality sequences
of text. Nevertheless, current approaches predominantly represent discrete text
within a continuous diffusion space, which incurs substantial computational
overhead during training and results in slower sampling speeds. In this paper,
we introduce a soft absorbing state that facilitates the diffusion model in
learning to reconstruct discrete mutations based on the underlying Gaussian
space, thereby enhancing its capacity to recover conditional signals. During
the sampling phase, we employ state-of-the-art ODE solvers within the
continuous space to expedite the sampling process. Comprehensive experimental
evaluations reveal that our proposed method effectively accelerates the
training convergence by 4x and generates samples of similar quality 800x
faster, rendering it significantly closer to practical application.
The code is released at https://github.com/Shark-NLP/DiffuSeq.
DiffCPS: Diffusion Model based Constrained Policy Search for Offline Reinforcement Learning
October 09, 2023
Longxiang He, Linrui Zhang, Junbo Tan, Xueqian Wang
Constrained policy search (CPS) is a fundamental problem in offline
reinforcement learning, which is generally solved by advantage weighted
regression (AWR). However, previous methods may still encounter
out-of-distribution actions due to the limited expressivity of Gaussian-based
policies. On the other hand, directly applying the state-of-the-art models with
distribution expression capabilities (i.e., diffusion models) in the AWR
framework is insufficient, since AWR requires exact policy probability
densities, which are intractable for diffusion models. In this paper, we propose
a novel approach called $\textbf{Diffusion Model based Constrained Policy
Search (DiffCPS)}$, which tackles the diffusion-based constrained policy search
without resorting to AWR. The theoretical analysis reveals our key insights by
leveraging the action distribution of the diffusion model to eliminate the
policy distribution constraint in the CPS and then utilizing the Evidence Lower
Bound (ELBO) of diffusion-based policy to approximate the KL constraint.
Consequently, DiffCPS admits the high expressivity of diffusion models while
circumventing the cumbersome density calculation brought by AWR. Extensive
experimental results based on the D4RL benchmark demonstrate the efficacy of
our approach. We empirically show that DiffCPS achieves better or at least
competitive performance compared to traditional AWR-based baselines as well as
recent diffusion-based offline RL methods. The code is now available at
https://github.com/felix-thu/DiffCPS.
Latent Diffusion Model for Medical Image Standardization and Enhancement
October 08, 2023
Md Selim, Jie Zhang, Faraneh Fathi, Michael A. Brooks, Ge Wang, Guoqiang Yu, Jin Chen
Computed tomography (CT) serves as an effective tool for lung cancer
screening, diagnosis, treatment, and prognosis, providing a rich source of
features to quantify temporal and spatial tumor changes. Nonetheless, the
diversity of CT scanners and customized acquisition protocols can introduce
significant inconsistencies in texture features, even when assessing the same
patient. This variability poses a fundamental challenge for subsequent research
that relies on consistent image features. Existing CT image standardization
models predominantly utilize GAN-based supervised or semi-supervised learning,
but their performance remains limited. We present DiffusionCT, an innovative
score-based DDPM model that operates in the latent space to transform disparate
non-standard distributions into a standardized form. The architecture comprises
a U-Net-based encoder-decoder, augmented by a DDPM model integrated at the
bottleneck position. First, the encoder-decoder is trained independently,
without embedding DDPM, to capture the latent representation of the input data.
Second, the latent DDPM model is trained while keeping the encoder-decoder
parameters fixed. Finally, the decoder uses the transformed latent
representation to generate a standardized CT image, providing a more consistent
basis for downstream analysis. Empirical tests on patient CT images indicate
notable improvements in image standardization using DiffusionCT. Additionally,
the model significantly reduces image noise in SPAD images, further validating
the effectiveness of DiffusionCT for advanced imaging tasks.
October 07, 2023
Zihan Zhou, Ruiying Liu, Tianshu Yu
physics.comp-ph, cs.AI, cs.LG
Diffusion-based generative models in SE(3)-invariant space have demonstrated
promising performance in molecular conformation generation, but typically
require solving stochastic differential equations (SDEs) with thousands of
update steps. Till now, it remains unclear how to effectively accelerate this
procedure explicitly in SE(3)-invariant space, which greatly hinders its wide
application in the real world. In this paper, we systematically study the
diffusion mechanism in SE(3)-invariant space via the lens of approximate errors
induced by existing methods. We thereby develop more precise approximations in
SE(3) in the context of projected differential equations, and provide both
theoretical analysis and empirical evidence relating hyper-parameters to such
errors. Altogether, we propose a novel acceleration scheme for generating
molecular conformations in SE(3)-invariant space. Experimentally, our scheme
can generate high-quality conformations with 50x–100x speedup compared to
existing methods.
October 07, 2023
Theodor Nguyen, Guangzhi Sun, Xianrui Zheng, Chao Zhang, Philip C Woodland
We propose DiffSpEx, a generative target speaker extraction method based on
score-based generative modelling through stochastic differential equations.
DiffSpEx deploys a continuous-time stochastic diffusion process in the complex
short-time Fourier transform domain, starting from the target speaker source
and converging to a Gaussian distribution centred on the mixture of sources.
For the reverse-time process, a parametrised score function is conditioned on a
target speaker embedding to extract the target speaker from the mixture of
sources. We utilise ECAPA-TDNN target speaker embeddings and condition the
score function alternately on the SDE time embedding and the target speaker
embedding. The potential of DiffSpEx is demonstrated with the WSJ0-2mix
dataset, achieving an SI-SDR of 12.9 dB and a NISQA score of 3.56. Moreover, we
show that fine-tuning a pre-trained DiffSpEx model to a specific speaker
further improves performance, enabling personalisation in target speaker
extraction.
DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures
October 07, 2023
Wenhao Li, Xiu Su, Shan You, Fei Wang, Chen Qian, Chang Xu
Diffusion models have recently exhibited remarkable performance in data
synthesis. After a diffusion path is selected, a base model, such as a UNet, operates
as a denoising autoencoder, primarily predicting noises that need to be
eliminated step by step. Consequently, it is crucial to employ a model that
aligns with the expected budgets to facilitate superior synthetic performance.
In this paper, we meticulously analyze the diffusion model and engineer a base
model search approach, denoted “DiffNAS”. Specifically, we leverage GPT-4 as a
supernet to expedite the search, supplemented with a search memory to enhance
the results. Moreover, we employ RFID as a proxy to promptly rank the
experimental outcomes produced by GPT-4. We also adopt a rapid-convergence
training strategy to boost search efficiency. Rigorous experimentation
corroborates that our algorithm can augment the search efficiency by 2 times
under GPT-based scenarios, while also attaining an FID of 2.82 on CIFAR-10, a
0.37 improvement over the benchmark IDDPM algorithm.
October 06, 2023
Jiarui Hai, Helin Wang, Dongchao Yang, Karan Thakkar, Najim Dehak, Mounya Elhilali
Common target sound extraction (TSE) approaches have primarily relied on
discriminative methods to separate the target sound while minimizing
interference from unwanted sources, with varying success in separating the
target from the background. This study introduces DPM-TSE, the first
generative method based on diffusion probabilistic modeling (DPM) for
target sound extraction, to achieve both cleaner target renderings as well as
improved separability from unwanted sounds. The technique also tackles common
background noise issues with DPM by introducing a correction method for noise
schedules and sample steps. This approach is evaluated using both objective and
subjective quality metrics on the FSD Kaggle 2018 dataset. The results show
that DPM-TSE has a significant improvement in perceived quality in terms of
target extraction and purity.
Generative Diffusion From An Action Principle
October 06, 2023
Akhil Premkumar
Generative diffusion models synthesize new samples by reversing a diffusive
process that converts a given data set to generic noise. This is accomplished
by training a neural network to match the gradient of the log of the
probability distribution of a given data set, also called the score. By casting
reverse diffusion as an optimal control problem, we show that score matching
can be derived from an action principle, like the ones commonly used in
physics. We use this insight to demonstrate the connection between different
classes of diffusion models.
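For reference, the denoising form of the score-matching objective the abstract builds on, in standard notation (the weighting $\lambda(t)$ and Gaussian noising kernel $p_t(x_t \mid x_0)$ are conventional choices, not the paper's own):
$$\mathcal{L}(\theta)=\mathbb{E}_{t,\;x_0,\;x_t\sim p_t(x_t\mid x_0)}\Big[\lambda(t)\,\big\|\,s_\theta(x_t,t)-\nabla_{x_t}\log p_t(x_t\mid x_0)\,\big\|^2\Big].$$
The paper's contribution is to show that this objective can be derived from an action principle once reverse diffusion is cast as an optimal control problem.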
Diffusion Random Feature Model
October 06, 2023
Esha Saha, Giang Tran
Diffusion probabilistic models have been successfully used to generate data
from noise. However, most diffusion models are computationally expensive and
difficult to interpret, and often lack theoretical justification. Random feature
models on the other hand have gained popularity due to their interpretability
but their application to complex machine learning tasks remains limited. In
this work, we present a diffusion model-inspired deep random feature model that
is interpretable and gives comparable numerical results to a fully connected
neural network having the same number of trainable parameters. Specifically, we
extend existing results for random features and derive generalization bounds
between the distribution of sampled data and the true distribution using
properties of score matching. We validate our findings by generating samples on
the fashion MNIST dataset and instrumental audio data.
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
October 06, 2023
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, Hang Zhao
Latent Diffusion models (LDMs) have achieved remarkable results in
synthesizing high-resolution images. However, the iterative sampling process is
computationally intensive and leads to slow generation. Inspired by Consistency
Models (Song et al.), we propose Latent Consistency Models (LCMs), enabling
swift inference with minimal steps on any pre-trained LDM, including Stable
Diffusion (Rombach et al.). Viewing the guided reverse diffusion process as
solving an augmented probability flow ODE (PF-ODE), LCMs are designed to
directly predict the solution of such ODE in latent space, mitigating the need
for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently
distilled from pre-trained classifier-free guided diffusion models, a
high-quality 768x768, 2-4-step LCM takes only 32 A100 GPU hours for training.
Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method
that is tailored for fine-tuning LCMs on customized image datasets. Evaluation
on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve
state-of-the-art text-to-image generation performance with few-step inference.
Project Page: https://latent-consistency-models.github.io/
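A generic consistency-style few-step sampler, sketched only to illustrate the idea of directly predicting the PF-ODE solution in latent space; `f_theta` and `sigma` are placeholders, and the released LCM sampler differs in its details.

```python
import torch

@torch.no_grad()
def few_step_sample(f_theta, latents_shape, timesteps, sigma, device="cpu"):
    """Consistency-style multi-step sampler.
    f_theta(z, t): maps a noisy latent at time t to its predicted clean latent.
    sigma(t)     : noise scale at time t.
    timesteps    : decreasing list of times, e.g. 2-4 entries."""
    z = torch.randn(latents_shape, device=device) * sigma(timesteps[0])
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        z0 = f_theta(z, t_cur)                 # jump straight to the ODE endpoint
        z = z0 + sigma(t_next) * torch.randn_like(z0)   # re-noise to next level
    return f_theta(z, timesteps[-1])           # final clean prediction
```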
VI-Diff: Unpaired Visible-Infrared Translation Diffusion Model for Single Modality Labeled Visible-Infrared Person Re-identification
October 06, 2023
Han Huang, Yan Huang, Liang Wang
Visible-Infrared person re-identification (VI-ReID) in real-world scenarios
poses a significant challenge due to the high cost of cross-modality data
annotation. Different sensing cameras, such as RGB/IR cameras for good/poor
lighting conditions, make it costly and error-prone to identify the same person
across modalities. To overcome this, we explore the use of single-modality
labeled data for the VI-ReID task, which is more cost-effective and practical.
By labeling pedestrians in only one modality (e.g., visible images) and
retrieving in another modality (e.g., infrared images), we aim to create a
training set containing both originally labeled and modality-translated data
using unpaired image-to-image translation techniques. In this paper, we propose
VI-Diff, a diffusion model that effectively addresses the task of
Visible-Infrared person image translation. Through comprehensive experiments,
we demonstrate that VI-Diff outperforms existing diffusion and GAN models,
making it a promising solution for VI-ReID with single-modality labeled data
and a good starting point for future study. Code will be available.
Observation-Guided Diffusion Probabilistic Models
October 06, 2023
Junoh Kang, Jinyoung Choi, Sungik Choi, Bohyung Han
We propose a novel diffusion model called observation-guided diffusion
probabilistic model (OGDM), which effectively addresses the trade-off between
quality control and fast sampling. Our approach reestablishes the training
objective by integrating the guidance of the observation process with the
Markov chain in a principled way. This is achieved by introducing an additional
loss term derived from the observation based on the conditional discriminator
on the noise level, which employs a Bernoulli distribution indicating whether
its input lies on the (noisy) real manifold or not. This strategy allows us to
optimize the more accurate negative log-likelihood induced in the inference
stage especially when the number of function evaluations is limited. The
proposed training method is also advantageous even when incorporated only into
the fine-tuning process, and it is compatible with various fast inference
strategies, since our method yields better denoising networks using exactly the
same inference procedure without incurring extra computational cost. We
demonstrate the effectiveness of the proposed training algorithm using diverse
inference methods on strong diffusion model baselines.
Diffusion Models as Masked Audio-Video Learners
October 05, 2023
Elvis Nunez, Yanzi Jin, Mohammad Rastegari, Sachin Mehta, Maxwell Horton
cs.SD, cs.CV, cs.MM, eess.AS
Over the past several years, the synchronization between audio and visual
signals has been leveraged to learn richer audio-visual representations. Aided
by the large availability of unlabeled videos, many unsupervised training
frameworks have demonstrated impressive results in various downstream audio and
video tasks. Recently, Masked Audio-Video Learners (MAViL) has emerged as a
state-of-the-art audio-video pre-training framework. MAViL couples contrastive
learning with masked autoencoding to jointly reconstruct audio spectrograms and
video frames by fusing information from both modalities. In this paper, we
study the potential synergy between diffusion models and MAViL, seeking to
derive mutual benefits from these two frameworks. The incorporation of
diffusion into MAViL, combined with various training efficiency methodologies
that include a masking-ratio curriculum and adaptive batch
sizing, results in a notable 32% reduction in pre-training Floating-Point
Operations (FLOPS) and an 18% decrease in pre-training wall clock time.
Crucially, this enhanced efficiency does not compromise the model’s performance
in downstream audio-classification tasks when compared to MAViL’s performance.
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
October 05, 2023
Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, Katerina Fragkiadaki
cs.CV, cs.AI, cs.LG, cs.RO
Text-to-image diffusion models have recently emerged at the forefront of
image generation, powered by very large-scale unsupervised or weakly supervised
text-to-image training datasets. Due to their unsupervised training,
controlling their behavior in downstream tasks, such as maximizing
human-perceived image quality, image-text alignment, or ethical image
generation, is difficult. Recent works finetune diffusion models to downstream
reward functions using vanilla reinforcement learning, notorious for the high
variance of the gradient estimators. In this paper, we propose AlignProp, a
method that aligns diffusion models to downstream reward functions using
end-to-end backpropagation of the reward gradient through the denoising
process. While naive implementation of such backpropagation would require
prohibitive memory resources for storing the partial derivatives of modern
text-to-image models, AlignProp finetunes low-rank adapter weight modules and
uses gradient checkpointing to render its memory usage viable. We test
AlignProp in finetuning diffusion models to various objectives, such as
image-text semantic alignment, aesthetics, compressibility and controllability
of the number of objects present, as well as their combinations. We show
AlignProp achieves higher rewards in fewer training steps than alternatives,
while being conceptually simpler, making it a straightforward choice for
optimizing diffusion models for differentiable reward functions of interest.
Code and Visualization results are available at https://align-prop.github.io/.
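A conceptual sketch (not AlignProp's API) of reward backpropagation through the sampling chain, with gradient checkpointing keeping memory manageable and only LoRA adapter weights assumed trainable.

```python
import torch
from torch.utils.checkpoint import checkpoint

def reward_finetune_step(denoise_step, decode, reward_fn, timesteps, z_T, optimizer):
    """Hypothetical end-to-end reward backpropagation step.
    denoise_step(z, t): one differentiable denoising update of a UNet whose only
                        trainable parameters are low-rank (LoRA) adapters.
    decode            : maps latents to images.
    optimizer         : built over the adapter parameters only."""
    z = z_T
    for t in timesteps:
        # Gradient checkpointing: recompute activations during backward instead
        # of storing them, so the full sampling chain fits in memory.
        z = checkpoint(denoise_step, z, t, use_reentrant=False)
    loss = -reward_fn(decode(z)).mean()   # maximize reward = minimize its negative
    optimizer.zero_grad()
    loss.backward()                       # gradients flow into the LoRA adapters
    optimizer.step()
    return loss.item()
```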
Stochastic interpolants with data-dependent couplings
October 05, 2023
Michael S. Albergo, Mark Goldstein, Nicholas M. Boffi, Rajesh Ranganath, Eric Vanden-Eijnden
Generative models inspired by dynamical transport of measure – such as flows
and diffusions – construct a continuous-time map between two probability
densities. Conventionally, one of these is the target density, only accessible
through samples, while the other is taken as a simple base density that is
data-agnostic. In this work, using the framework of stochastic interpolants, we
formalize how to \textit{couple} the base and the target densities. This
enables us to incorporate information about class labels or continuous
embeddings to construct dynamical transport maps that serve as conditional
generative models. We show that these transport maps can be learned by solving
a simple square loss regression problem analogous to the standard independent
setting. We demonstrate the usefulness of constructing dependent couplings in
practice through experiments in super-resolution and in-painting.
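A hedged sketch of the square-loss regression described above, specialized to a plain linear interpolant with a data-dependent coupling (paired `x0`, `x1`, such as a degraded image and its clean counterpart); the paper's framework allows more general interpolants and latent noise.

```python
import torch

def coupled_interpolant_loss(v_theta, x0, x1):
    """Velocity regression for a stochastic-interpolant-style model with a
    data-dependent coupling: x0 and x1 are paired samples rather than
    independent base/target draws.
    v_theta(xt, t): learned velocity field."""
    b = x0.shape[0]
    t = torch.rand(b, *([1] * (x0.dim() - 1)), device=x0.device)
    xt = (1.0 - t) * x0 + t * x1          # linear interpolant I(t, x0, x1)
    target = x1 - x0                      # time derivative of the interpolant
    return ((v_theta(xt, t) - target) ** 2).mean()
```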
Multimarginal generative modeling with stochastic interpolants
October 05, 2023
Michael S. Albergo, Nicholas M. Boffi, Michael Lindsey, Eric Vanden-Eijnden
Given a set of $K$ probability densities, we consider the multimarginal
generative modeling problem of learning a joint distribution that recovers
these densities as marginals. The structure of this joint distribution should
identify multi-way correspondences among the prescribed marginals. We formalize
an approach to this task within a generalization of the stochastic interpolant
framework, leading to efficient learning algorithms built upon dynamical
transport of measure. Our generative models are defined by velocity and score
fields that can be characterized as the minimizers of simple quadratic
objectives, and they are defined on a simplex that generalizes the time
variable in the usual dynamical transport framework. The resulting transport on
the simplex is influenced by all marginals, and we show that multi-way
correspondences can be extracted. The identification of such correspondences
has applications to style transfer, algorithmic fairness, and data
decorruption. In addition, the multimarginal perspective enables an efficient
algorithm for reducing the dynamical transport cost in the ordinary
two-marginal setting. We demonstrate these capacities with several numerical
examples.
Wasserstein Distortion: Unifying Fidelity and Realism
October 05, 2023
Yang Qiu, Aaron B. Wagner, Johannes Ballé, Lucas Theis
cs.IT, cs.CV, eess.IV, math.IT
We introduce a distortion measure for images, Wasserstein distortion, that
simultaneously generalizes pixel-level fidelity on the one hand and realism on
the other. We show how Wasserstein distortion reduces mathematically to a pure
fidelity constraint or a pure realism constraint under different parameter
choices. Pairs of images that are close under Wasserstein distortion illustrate
its utility. In particular, we generate random textures that have high fidelity
to a reference texture in one location of the image and smoothly transition to
an independent realization of the texture as one moves away from this point.
Connections between Wasserstein distortion and models of the human visual
system are noted.
Diffusing on Two Levels and Optimizing for Multiple Properties: A Novel Approach to Generating Molecules with Desirable Properties
October 05, 2023
Siyuan Guo, Jihong Guan, Shuigeng Zhou
In the past decade, Artificial Intelligence driven drug design and discovery
has been a hot research topic, where an important branch is molecule generation
by generative models, from GAN-based models and VAE-based models to the latest
diffusion-based models. However, most existing models pursue only the basic
properties like validity and uniqueness of the generated molecules, while a few
go further to explicitly optimize a single important molecular property (e.g.,
QED or PlogP), which leaves most generated molecules of little use in
practice. In this paper, we present a novel approach to generating molecules
with desirable properties, which expands the diffusion model framework with
multiple innovative designs. The novelty is two-fold. On the one hand,
considering that the structures of molecules are complex and diverse, and
molecular properties are usually determined by some substructures (e.g.
pharmacophores), we propose to perform diffusion on two structural levels:
molecules and molecular fragments respectively, with which a mixed Gaussian
distribution is obtained for the reverse diffusion process. To get desirable
molecular fragments, we develop a novel electronic effect based fragmentation
method. On the other hand, we introduce two ways to explicitly optimize
multiple molecular properties under the diffusion model framework. First, as
potential drug molecules must be chemically valid, we optimize molecular
validity by an energy-guidance function. Second, since potential drug molecules
should be desirable in various properties, we employ a multi-objective
mechanism to optimize multiple molecular properties simultaneously. Extensive
experiments with two benchmark datasets QM9 and ZINC250k show that the
molecules generated by our proposed method have better validity, uniqueness,
novelty, Fréchet ChemNet Distance (FCD), QED, and PlogP than those generated
by current SOTA models.
Denoising Diffusion Step-aware Models
October 05, 2023
Shuai Yang, Yukang Chen, Luozhou Wang, Shu Liu, Yingcong Chen
Denoising Diffusion Probabilistic Models (DDPMs) have garnered popularity for
data generation across various domains. However, a significant bottleneck is
the necessity for whole-network computation during every step of the generative
process, leading to high computational overheads. This paper presents a novel
framework, Denoising Diffusion Step-aware Models (DDSM), to address this
challenge. Unlike conventional approaches, DDSM employs a spectrum of neural
networks whose sizes are adapted according to the importance of each generative
step, as determined through evolutionary search. This step-wise network
variation effectively circumvents redundant computational efforts, particularly
in less critical steps, thereby enhancing the efficiency of the diffusion
model. Furthermore, the step-aware design can be seamlessly integrated with
other efficiency-geared diffusion models such as DDIMs and latent diffusion,
thus broadening the scope of computational savings. Empirical evaluations
demonstrate that DDSM achieves computational savings of 49% for CIFAR-10, 61%
for CelebA-HQ, 59% for LSUN-bedroom, 71% for AFHQ, and 76% for ImageNet, all
without compromising the generation quality. Our code and models will be
publicly available.
EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models
October 05, 2023
Yefei He, Jing Liu, Weijia Wu, Hong Zhou, Bohan Zhuang
Diffusion models have demonstrated remarkable capabilities in image synthesis
and related generative tasks. Nevertheless, their practicality for low-latency
real-world applications is constrained by substantial computational costs and
latency issues. Quantization is a dominant way to compress and accelerate
diffusion models, where post-training quantization (PTQ) and quantization-aware
training (QAT) are two main approaches, each bearing its own properties. While
PTQ exhibits efficiency in terms of both time and data usage, it may lead to
diminished performance in low bit-width. On the other hand, QAT can alleviate
performance degradation but comes with substantial demands on computational and
data resources. To capitalize on the advantages while avoiding their respective
drawbacks, we introduce a data-free and parameter-efficient fine-tuning
framework for low-bit diffusion models, dubbed EfficientDM, to achieve
QAT-level performance with PTQ-like efficiency. Specifically, we propose a
quantization-aware variant of the low-rank adapter (QALoRA) that can be merged
with model weights and jointly quantized to low bit-width. The fine-tuning
process distills the denoising capabilities of the full-precision model into
its quantized counterpart, eliminating the requirement for training data. We
also introduce scale-aware optimization and employ temporal learned step-size
quantization to further enhance performance. Extensive experimental results
demonstrate that our method significantly outperforms previous PTQ-based
diffusion models while maintaining similar time and data efficiency.
Specifically, there is only a marginal 0.05 sFID increase when quantizing both
weights and activations of LDM-4 to 4-bit on ImageNet 256x256. Compared to
QAT-based methods, our EfficientDM also boasts a 16.2x faster quantization
speed with comparable generation quality.
Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization
October 04, 2023
Dinghuai Zhang, Ricky Tian Qi Chen, Cheng-Hao Liu, Aaron Courville, Yoshua Bengio
cs.LG, cs.AI, stat.CO, stat.ME, stat.ML
We tackle the problem of sampling from intractable high-dimensional density
functions, a fundamental task that often appears in machine learning and
statistics. We extend recent sampling-based approaches that leverage controlled
stochastic processes to model approximate samples from these target densities.
The main drawback of these approaches is that the training objective requires
full trajectories to compute, resulting in sluggish credit assignment, since the
learning signal is present only at the terminal time. In this work, we present
Diffusion Generative Flow Samplers
(DGFS), a sampling-based framework where the learning process can be tractably
broken down into short partial trajectory segments, via parameterizing an
additional “flow function”. Our method takes inspiration from the theory
developed for generative flow networks (GFlowNets), allowing us to make use of
intermediate learning signals and benefit from off-policy exploration
capabilities. Through a variety of challenging experiments, we demonstrate that
DGFS results in more accurate estimates of the normalization constant than
closely-related prior methods.
On Memorization in Diffusion Models
October 04, 2023
Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, Ye Wang
Due to their capacity to generate novel and high-quality samples, diffusion
models have attracted significant research interest in recent years. Notably,
the typical training objective of diffusion models, i.e., denoising score
matching, has a closed-form optimal solution that can only generate samples
replicating the training data. This indicates that a memorization behavior is
theoretically expected, which contradicts the common generalization ability of
state-of-the-art diffusion models, and thus calls for a deeper understanding.
Looking into this, we first observe that memorization behaviors tend to occur
on smaller-sized datasets, which motivates our definition of effective model
memorization (EMM), a metric measuring the maximum size of training data at
which a learned diffusion model approximates its theoretical optimum. Then, we
quantify the impact of the influential factors on these memorization behaviors
in terms of EMM, focusing primarily on data distribution, model configuration,
and training procedure. Besides comprehensive empirical results identifying the
influential factors, we surprisingly find that conditioning training data on
uninformative random labels can significantly trigger memorization in
diffusion models. Our study holds practical significance for diffusion model
users and offers clues to theoretical research in deep generative models. Code
is available at https://github.com/sail-sg/DiffMemorize.
Generalization in diffusion models arises from geometry-adaptive harmonic representation
October 04, 2023
Zahra Kadkhodaie, Florentin Guth, Eero P. Simoncelli, Stéphane Mallat
High-quality samples generated with score-based reverse diffusion algorithms
provide evidence that deep neural networks (DNN) trained for denoising can
learn high-dimensional densities, despite the curse of dimensionality. However,
recent reports of memorization of the training set raise the question of
whether these networks are learning the “true” continuous density of the data.
Here, we show that two denoising DNNs trained on non-overlapping subsets of a
dataset learn nearly the same score function, and thus the same density, with a
surprisingly small number of training images. This strong generalization
demonstrates an alignment of powerful inductive biases in the DNN architecture
and/or training algorithm with properties of the data distribution. We analyze
these, demonstrating that the denoiser performs a shrinkage operation in a
basis adapted to the underlying image. Examination of these bases reveals
oscillating harmonic structures along contours and in homogeneous image
regions. We show that trained denoisers are inductively biased towards these
geometry-adaptive harmonic representations by demonstrating that they arise
even when the network is trained on image classes supported on low-dimensional
manifolds, for which the harmonic basis is suboptimal. Additionally, we show
that the denoising performance of the networks is near-optimal when trained on
regular image classes for which the optimal basis is known to be
geometry-adaptive and harmonic.
MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation
October 04, 2023
Yuan Zhong, Suhan Cui, Jiaqi Wang, Xiaochen Wang, Ziyi Yin, Yaqing Wang, Houping Xiao, Mengdi Huai, Ting Wang, Fenglong Ma
Health risk prediction is one of the fundamental tasks under predictive
modeling in the medical domain, which aims to forecast the potential health
risks that patients may face in the future using their historical Electronic
Health Records (EHR). Researchers have developed several risk prediction models
to handle the unique challenges of EHR data, such as its sequential nature,
high dimensionality, and inherent noise. These models have yielded impressive
results. Nonetheless, a key issue undermining their effectiveness is data
insufficiency. A variety of data generation and augmentation methods have been
introduced to mitigate this issue by expanding the size of the training data
set through the learning of underlying data distributions. However, the
performance of these methods is often limited due to their task-unrelated
design. To address these shortcomings, this paper introduces a novel,
end-to-end diffusion-based risk prediction model, named MedDiffusion. It
enhances risk prediction performance by creating synthetic patient data during
training to enlarge sample space. Furthermore, MedDiffusion discerns hidden
relationships between patient visits using a step-wise attention mechanism,
enabling the model to automatically retain the most vital information for
generating high-quality data. Experimental evaluation on four real-world
medical datasets demonstrates that MedDiffusion outperforms 14 cutting-edge
baselines in terms of PR-AUC, F1, and Cohen’s Kappa. We also conduct ablation
studies and benchmark our model against GAN-based alternatives to further
validate the rationality and adaptability of our model design. Additionally, we
analyze generated data to offer fresh insights into the model’s
interpretability.
Learning to Reach Goals via Diffusion
October 04, 2023
Vineet Jain, Siamak Ravanbakhsh
Diffusion models are a powerful class of generative models capable of mapping
random noise in high-dimensional spaces to a target manifold through iterative
denoising. In this work, we present a novel perspective on goal-conditioned
reinforcement learning by framing it within the context of diffusion modeling.
Analogous to the diffusion process, where Gaussian noise is used to create
random trajectories that walk away from the data manifold, we construct
trajectories that move away from potential goal states. We then learn a
goal-conditioned policy analogous to the score function. This approach, which
we call Merlin, can reach predefined or novel goals from an arbitrary initial
state without learning a separate value function. We consider three choices for
the noise model to replace Gaussian noise in diffusion: reverse play from the
buffer, a reverse dynamics model, and a novel non-parametric approach. We
theoretically justify our approach and validate it on offline goal-reaching
tasks. Empirical results are competitive with state-of-the-art methods, which
suggests this perspective on diffusion for RL is a simple, scalable, and
effective direction for sequential decision-making.
SE(3)-Stochastic Flow Matching for Protein Backbone Generation
October 03, 2023
Avishek Joey Bose, Tara Akhound-Sadegh, Kilian Fatras, Guillaume Huguet, Jarrid Rector-Brooks, Cheng-Hao Liu, Andrei Cristian Nica, Maksym Korablyov, Michael Bronstein, Alexander Tong
The computational design of novel protein structures has the potential to
impact numerous scientific disciplines greatly. Toward this goal, we introduce
$\text{FoldFlow}$, a series of novel generative models of increasing modeling
power based on the flow-matching paradigm over $3\text{D}$ rigid motions –
i.e. the group $\text{SE(3)}$ – enabling accurate modeling of protein
backbones. We first introduce $\text{FoldFlow-Base}$, a simulation-free
approach to learning deterministic continuous-time dynamics and matching
invariant target distributions on $\text{SE(3)}$. We next accelerate training
by incorporating Riemannian optimal transport to create $\text{FoldFlow-OT}$,
leading to the construction of simpler and more stable flows. Finally, we
design $\text{FoldFlow-SFM}$ coupling both Riemannian OT and simulation-free
training to learn stochastic continuous-time dynamics over $\text{SE(3)}$. Our
family of $\text{FoldFlow}$ generative models offers several key advantages over
previous approaches to the generative modeling of proteins: they are more
stable and faster to train than diffusion-based approaches, and our models
enjoy the ability to map any invariant source distribution to any invariant
target distribution over $\text{SE(3)}$. Empirically, we validate our FoldFlow
models on protein backbone generation of up to $300$ amino acids leading to
high-quality designable, diverse, and novel samples.
Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models
October 03, 2023
Kota Sueyoshi, Takashi Matsubara
Diffusion models have achieved remarkable results in generating high-quality,
diverse, and creative images. However, when it comes to text-based image
generation, they often fail to capture the intended meaning presented in the
text. For instance, a specified object may not be generated, an unnecessary
object may be generated, and an adjective may alter objects it was not intended
to modify. Moreover, we found that relationships indicating possession between
objects are often overlooked. While users’ intentions in text are diverse,
existing methods tend to specialize in only some of these aspects. In this
paper, we propose Predicated Diffusion, a unified framework to express users’
intentions. We consider that the root of the above issues lies in the text
encoder, which often focuses only on individual words and neglects the logical
relationships between them. The proposed method does not solely rely on the
text encoder, but instead, represents the intended meaning in the text as
propositions using predicate logic and treats the pixels in the attention maps
as the fuzzy predicates. This enables us to obtain a differentiable loss
function that, when minimized, makes the image fulfill the proposition. When
compared to several existing methods, we demonstrated that Predicated Diffusion
can generate images that are more faithful to various text prompts, as verified
by human evaluators and pretrained image-text models.
Conditional Diffusion Distillation
October 02, 2023
Kangfu Mei, Mauricio Delbracio, Hossein Talebi, Zhengzhong Tu, Vishal M. Patel, Peyman Milanfar
Generative diffusion models provide strong priors for text-to-image
generation and thereby serve as a foundation for conditional generation tasks
such as image editing, restoration, and super-resolution. However, one major
limitation of diffusion models is their slow sampling time. To address this
challenge, we present a novel conditional distillation method designed to
supplement the diffusion priors with the help of image conditions, allowing for
conditional sampling with very few steps. We directly distill the unconditional
pre-training in a single stage through joint-learning, largely simplifying the
previous two-stage procedures that involve both distillation and conditional
finetuning separately. Furthermore, our method enables a new
parameter-efficient distillation mechanism that distills each task with only a
small number of additional parameters combined with the shared frozen
unconditional backbone. Experiments across multiple tasks including
super-resolution, image editing, and depth-to-image generation demonstrate that
our method outperforms existing distillation techniques for the same sampling
time. Notably, our method is the first distillation strategy that can match the
performance of the much slower fine-tuned conditional diffusion models.
Mirror Diffusion Models for Constrained and Watermarked Generation
October 02, 2023
Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou, Molei Tao
Modern successes of diffusion models in learning complex, high-dimensional
data distributions are attributed, in part, to their capability to construct
diffusion processes with analytic transition kernels and score functions. The
tractability results in a simulation-free framework with stable regression
losses, from which reversed, generative processes can be learned at scale.
However, when data is confined to a constrained set as opposed to a standard
Euclidean space, these desirable characteristics appear to be lost based on
prior attempts. In this work, we propose Mirror Diffusion Models (MDM), a new
class of diffusion models that generate data on convex constrained sets without
losing any tractability. This is achieved by learning diffusion processes in a
dual space constructed from a mirror map, which, crucially, is a standard
Euclidean space. We derive efficient computation of mirror maps for popular
constrained sets, such as simplices and $\ell_2$-balls, showing significantly
improved performance of MDM over existing methods. For safety and privacy
purposes, we also explore constrained sets as a new mechanism to embed
invisible but quantitative information (i.e., watermarks) in generated data,
for which MDM serves as a compelling approach. Our work brings new algorithmic
opportunities for learning tractable diffusion on complex domains.
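For intuition, here is one standard mirror-map pair for the probability simplex (the entropic map and its softmax inverse), which realizes the idea of running diffusion in an unconstrained dual space; the paper derives its own efficient mirror maps for simplices and $\ell_2$-balls.

```python
import torch

def simplex_to_dual(x, eps=1e-12):
    """Entropic mirror map (gradient of negative entropy): sends a point in the
    interior of the probability simplex to unconstrained dual space, where the
    diffusion process can run freely. One standard choice, not MDM's exact map."""
    return torch.log(x.clamp_min(eps))

def dual_to_simplex(y):
    """Inverse mirror map (gradient of the conjugate): softmax maps any dual
    point back onto the simplex, so generated samples always satisfy the
    constraint. Note dual_to_simplex(simplex_to_dual(x)) == x on the simplex."""
    return torch.softmax(y, dim=-1)
```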
Light Schrödinger Bridge
October 02, 2023
Alexander Korotin, Nikita Gushchin, Evgeny Burnaev
Despite the recent advances in the field of computational Schrodinger Bridges
(SB), most existing SB solvers are still heavyweight and require complex
optimization of several neural networks. It turns out that there is no
principal solver that plays the role of a simple-yet-effective baseline for SB,
just like, e.g., the $k$-means method in clustering, logistic regression in
classification or Sinkhorn algorithm in discrete optimal transport. We address
this issue and propose a novel fast and simple SB solver. Our development is a
smart combination of two ideas which recently appeared in the field: (a)
parameterization of the Schrodinger potentials with sum-exp quadratic functions
and (b) viewing the log-Schrodinger potentials as the energy functions. We show
that combined together these ideas yield a lightweight, simulation-free and
theoretically justified SB solver with a simple straightforward optimization
objective. As a result, it allows solving SB in moderate dimensions in a matter
of minutes on CPU without a painful hyperparameter selection. Our light solver
resembles the Gaussian mixture model which is widely used for density
estimation. Inspired by this similarity, we also prove an important theoretical
result showing that our light solver is a universal approximator of SBs. The
code for the LightSB solver can be found at
https://github.com/ngushchin/LightSB
Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion
October 01, 2023
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, Stefano Ermon
cs.LG, cs.AI, cs.CV, stat.ML
Consistency Models (CM) (Song et al., 2023) accelerate score-based diffusion
model sampling at the cost of sample quality but lack a natural way to
trade off quality for speed. To address this limitation, we propose Consistency
Trajectory Model (CTM), a generalization encompassing CM and score-based models
as special cases. CTM trains a single neural network that can – in a single
forward pass – output scores (i.e., gradients of log-density) and enables
unrestricted traversal between any initial and final time along the Probability
Flow Ordinary Differential Equation (ODE) in a diffusion process. CTM enables
the efficient combination of adversarial training and denoising score matching
loss to enhance performance and achieves new state-of-the-art FIDs for
single-step diffusion model sampling on CIFAR-10 (FID 1.73) and ImageNet at
64x64 resolution (FID 2.06). CTM also enables a new family of sampling schemes,
both deterministic and stochastic, involving long jumps along the ODE solution
trajectories. It consistently improves sample quality as computational budgets
increase, avoiding the degradation seen in CM. Furthermore, CTM’s access to the
score accommodates all diffusion model inference techniques, including exact
likelihood computation.
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models
September 30, 2023
Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Gaetan Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-jin Liu
The generation of stylistic 3D facial animations driven by speech poses a
significant challenge as it requires learning a many-to-many mapping between
speech, style, and the corresponding natural facial motion. However, existing
methods either employ a deterministic model for speech-to-motion mapping or
encode the style using a one-hot encoding scheme. Notably, the one-hot encoding
approach fails to capture the complexity of the style and thus limits
generalization ability. In this paper, we propose DiffPoseTalk, a generative
framework based on the diffusion model combined with a style encoder that
extracts style embeddings from short reference videos. During inference, we
employ classifier-free guidance to guide the generation process based on the
speech and style. We extend this to include the generation of head poses,
thereby enhancing user perception. Additionally, we address the shortage of
scanned 3D talking face data by training our model on reconstructed 3DMM
parameters from a high-quality, in-the-wild audio-visual dataset. Our extensive
experiments and user study demonstrate that our approach outperforms
state-of-the-art methods. The code and dataset will be made publicly available.
Efficient Planning with Latent Diffusion
September 30, 2023
Wenhao Li
Temporal abstraction and efficient planning pose significant challenges in
offline reinforcement learning, mainly when dealing with domains that involve
temporally extended tasks and delayed sparse rewards. Existing methods
typically plan in the raw action space and can be inefficient and inflexible.
Latent action spaces offer a more flexible paradigm, capturing only possible
actions within the behavior policy support and decoupling the temporal
structure between planning and modeling. However, current latent-action-based
methods are limited to discrete spaces and require expensive planning. This
paper presents a unified framework for continuous latent action space
representation learning and planning by leveraging latent, score-based
diffusion models. We establish the theoretical equivalence between planning in
the latent action space and energy-guided sampling with a pretrained diffusion
model and incorporate a novel sequence-level exact sampling method. Our
proposed method, $\texttt{LatentDiffuser}$, demonstrates competitive
performance on low-dimensional locomotion control tasks and surpasses existing
methods in higher-dimensional tasks.
Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis
September 30, 2023
Nithin Gopalakrishnan Nair, Anoop Cherian, Suhas Lohit, Ye Wang, Toshiaki Koike-Akino, Vishal M. Patel, Tim K. Marks
Conditional generative models typically demand large annotated training sets
to achieve high-quality synthesis. As a result, there has been significant
interest in designing models that perform plug-and-play generation, i.e., to
use a predefined or pretrained model, which is not explicitly trained on the
generative task, to guide the generative process (e.g., using language).
However, such guidance is typically useful only towards synthesizing high-level
semantics rather than editing fine-grained details as in image-to-image
translation tasks. To this end, and capitalizing on the powerful fine-grained
generative control offered by the recent diffusion-based generative models, we
introduce Steered Diffusion, a generalized framework for photorealistic
zero-shot conditional image generation using a diffusion model trained for
unconditional generation. The key idea is to steer the image generation of the
diffusion model at inference time via designing a loss using a pre-trained
inverse model that characterizes the conditional task. This loss modulates the
sampling trajectory of the diffusion process. Our framework allows for easy
incorporation of multiple conditions during inference. We present experiments
using steered diffusion on several tasks including inpainting, colorization,
text-guided semantic editing, and image super-resolution. Our results
demonstrate clear qualitative and quantitative improvements over
state-of-the-art diffusion-based plug-and-play models while adding negligible
additional computational cost.
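An illustrative single guided reverse step in the spirit of the abstract, where a loss built from a pretrained inverse model (e.g., a downsampler for super-resolution) modulates the sampling trajectory; all names and the exact update form are assumptions, not the paper's implementation.

```python
import torch

def steered_step(x_t, t, eps_model, inverse_model, condition, alpha_bar,
                 guidance_scale=1.0):
    """One guided reverse-diffusion update: the gradient of a loss defined
    through a pretrained inverse model nudges the unconditional sampler.
    The caller then applies the usual DDPM/DDIM update using `eps`."""
    x_t = x_t.detach().requires_grad_(True)
    eps = eps_model(x_t, t)
    # Predicted clean image (Tweedie-style posterior mean estimate)
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
    # Steering loss: discrepancy between the inverse model's output and the condition
    loss = ((inverse_model(x0_hat) - condition) ** 2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    # Nudge the sample against the loss gradient before the standard update
    return x_t.detach() - guidance_scale * grad, eps.detach()
```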
FashionFlow: Leveraging Diffusion Models for Dynamic Fashion Video Synthesis from Static Imagery
September 29, 2023
Tasin Islam, Alina Miron, XiaoHui Liu, Yongmin Li
Our study introduces a new image-to-video generator called FashionFlow. By
utilising a diffusion model, we are able to create short videos from still
images. Our approach involves developing and connecting relevant components
with the diffusion model, which sets our work apart. The components include the
use of pseudo-3D convolutional layers to generate videos efficiently. VAE and
CLIP encoders capture vital characteristics from still images to influence the
diffusion model. Our research demonstrates a successful synthesis of fashion
videos featuring models posing from various angles, showcasing the fit and
appearance of the garment. Our findings hold great promise for improving and
enhancing the shopping experience for the online fashion industry.
Directly Fine-Tuning Diffusion Models on Differentiable Rewards
September 29, 2023
Kevin Clark, Paul Vicol, Kevin Swersky, David J Fleet
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method
for fine-tuning diffusion models to maximize differentiable reward functions,
such as scores from human preference models. We first show that it is possible
to backpropagate the reward function gradient through the full sampling
procedure, and that doing so achieves strong performance on a variety of
rewards, outperforming reinforcement learning-based approaches. We then propose
more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to
only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance
gradient estimates for the case when K=1. We show that our methods work well
for a variety of reward functions and can be used to substantially improve the
aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw
connections between our approach and prior work, providing a unifying
perspective on the design space of gradient-based fine-tuning algorithms.
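The truncation idea is straightforward to picture in code. The following is a hedged sketch of DRaFT-K with a DDIM-style sampler, where gradients flow only through the last K sampling steps and the negative reward serves as the training loss; the sampler update, `unet`, and `reward_model` are illustrative placeholders rather than the authors' code:

    import torch

    def draft_k_loss(unet, reward_model, x_T, alphas_bar, K=1):
        """Sample with gradients enabled only in the last K steps, then score the result."""
        x = x_T
        T = len(alphas_bar)
        for i in reversed(range(T)):
            track = i < K                                   # truncated backpropagation (DRaFT-K)
            with torch.set_grad_enabled(track):
                a_t = alphas_bar[i]
                a_prev = alphas_bar[i - 1] if i > 0 else torch.tensor(1.0)
                eps = unet(x, i)
                x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
                x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps
            if not track:
                x = x.detach()
        return -reward_model(x).mean()                      # maximize the differentiable reward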
In search of dispersed memories: Generative diffusion models are associative memory networks
September 29, 2023
Luca Ambrogioni
Uncovering the mechanisms behind long-term memory is one of the most
fascinating open problems in neuroscience and artificial intelligence.
Artificial associative memory networks have been used to formalize important
aspects of biological memory. Generative diffusion models are a class of
generative machine learning techniques that have shown great performance in
many tasks. Like associative memory systems, these networks define a dynamical
system that converges to a set of target states. In this work we show that
generative diffusion models can be interpreted as energy-based models and that,
when trained on discrete patterns, their energy function is (asymptotically)
identical to that of modern Hopfield networks. This equivalence allows us to
interpret the supervised training of diffusion models as a synaptic learning
process that encodes the associative dynamics of a modern Hopfield network in
the weight structure of a deep neural network. Leveraging this connection, we
formulate a generalized framework for understanding the formation of long-term
memory, where creative generation and memory recall can be seen as parts of a
unified continuum.
Diffusion Models as Stochastic Quantization in Lattice Field Theory
September 29, 2023
Lingxiao Wang, Gert Aarts, Kai Zhou
In this work, we establish a direct connection between generative diffusion
models (DMs) and stochastic quantization (SQ). The DM is realized by
approximating the reversal of a stochastic process dictated by the Langevin
equation, generating samples from a prior distribution to effectively mimic the
target distribution. Using numerical simulations, we demonstrate that the DM
can serve as a global sampler for generating quantum lattice field
configurations in two-dimensional $\phi^4$ theory. We demonstrate that DMs can
notably reduce autocorrelation times in the Markov chain, especially in the
critical region where standard Markov Chain Monte-Carlo (MCMC) algorithms
experience critical slowing down. The findings can potentially inspire further
advancements in lattice field theory simulations, in particular in cases where
it is expensive to generate large ensembles.
Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning
September 29, 2023
Zihan Ding, Chi Jin
Score-based generative models such as diffusion models have been shown to be
effective in modeling multi-modal data, from image generation to reinforcement
learning (RL). However, the inference process of diffusion models can be slow,
which hinders their use in RL with iterative sampling. We propose
to apply the consistency model as an efficient yet expressive policy
representation, namely consistency policy, with an actor-critic style algorithm
for three typical RL settings: offline, offline-to-online and online. For
offline RL, we demonstrate the expressiveness of generative models as policies
from multi-modal data. For offline-to-online RL, the consistency policy is
shown to be more computationally efficient than the diffusion policy, with
comparable performance. For online RL, the consistency policy demonstrates a
significant speedup and even higher average performance than the diffusion
policy.
Denoising Diffusion Bridge Models
September 29, 2023
Linqi Zhou, Aaron Lou, Samar Khanna, Stefano Ermon
Diffusion models are powerful generative models that map noise to data using
stochastic processes. However, for many applications such as image editing, the
model input comes from a distribution that is not random noise. As such,
diffusion models must rely on cumbersome methods like guidance or projected
sampling to incorporate this information in the generative process. In our
work, we propose Denoising Diffusion Bridge Models (DDBMs), a natural
alternative to this paradigm based on diffusion bridges, a family of processes
that interpolate between two paired distributions given as endpoints. Our
method learns the score of the diffusion bridge from data and maps from one
endpoint distribution to the other by solving a (stochastic) differential
equation based on the learned score. Our method naturally unifies several
classes of generative models, such as score-based diffusion models and
OT-Flow-Matching, allowing us to adapt existing design and architectural
choices to our more general problem. Empirically, we apply DDBMs to challenging
image datasets in both pixel and latent space. On standard image translation
problems, DDBMs achieve significant improvement over baseline methods, and,
when we reduce the problem to image generation by setting the source
distribution to random noise, DDBMs achieve comparable FID scores to
state-of-the-art methods despite being built for a more general task.
Distilling ODE Solvers of Diffusion Models into Smaller Steps
September 28, 2023
Sanghwan Kim, Hao Tang, Fisher Yu
Distillation techniques have substantially improved the sampling speed of
diffusion models, allowing generation in only one or a few steps. However,
these distillation methods require extensive training for each
dataset, sampler, and network, which limits their practical applicability. To
address this limitation, we propose a straightforward distillation approach,
Distilled-ODE solvers (D-ODE solvers), that optimizes the ODE solver rather
than training the denoising network. D-ODE solvers are formulated by simply
applying a single parameter adjustment to existing ODE solvers. Subsequently,
D-ODE solvers with smaller steps are optimized by ODE solvers with larger steps
through distillation over a batch of samples. Our comprehensive experiments
indicate that D-ODE solvers outperform existing ODE solvers, including DDIM,
PNDM, DPM-Solver, DEIS, and EDM, especially when generating samples with fewer
steps. Our method incurs negligible computational overhead compared to previous
distillation techniques, enabling simple and rapid integration with previous
samplers. Qualitative analysis further shows that D-ODE solvers enhance image
quality while preserving the sampling trajectory of ODE solvers.
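The recipe can be pictured as follows: freeze the denoising network, attach one free scalar to each step of a few-step solver, and fit those scalars so that the few-step outputs match samples produced by the same solver with many more steps. The solver interfaces and the fitting loop below are illustrative assumptions, not the released implementation:

    import torch

    def distill_d_ode_scalars(student_solver, teacher_solver, denoiser, noise, n_steps, iters=200):
        """Fit one scalar per student step by matching a many-step teacher over a batch."""
        lambdas = torch.ones(n_steps, requires_grad=True)        # single parameter adjustment per step
        opt = torch.optim.Adam([lambdas], lr=1e-2)
        with torch.no_grad():
            target = teacher_solver(denoiser, noise)             # reference samples from larger steps
        for _ in range(iters):
            opt.zero_grad()
            pred = student_solver(denoiser, noise, lambdas)      # few steps, scaled denoiser outputs
            loss = ((pred - target) ** 2).mean()
            loss.backward()
            opt.step()
        return lambdas.detach()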
DiffGAN-F2S: Symmetric and Efficient Denoising Diffusion GANs for Structural Connectivity Prediction from Brain fMRI
September 28, 2023
Qiankun Zuo, Ruiheng Li, Yi Di, Hao Tian, Changhong Jing, Xuhang Chen, Shuqiang Wang
Mapping from functional connectivity (FC) to structural connectivity (SC) can
facilitate multimodal brain network fusion and discover potential biomarkers
for clinical implications. However, it is challenging to directly bridge the
reliable non-linear mapping relations between SC and functional magnetic
resonance imaging (fMRI). In this paper, a novel diffusion generative
adversarial network-based fMRI-to-SC (DiffGAN-F2S) model is proposed to predict
SC from brain fMRI in an end-to-end manner. To be specific, the proposed
DiffGAN-F2S leverages denoising diffusion probabilistic models (DDPMs) and
adversarial learning to efficiently generate high-fidelity SC through a few
steps from fMRI. By designing the dual-channel multi-head spatial attention
(DMSA) and graph convolutional modules, the symmetric graph generator first
captures global relations among directly and indirectly connected brain regions,
then models the local brain region interactions. It can uncover the complex
mapping relations between fMRI and structural connectivity. Furthermore, the
spatially connected consistency loss is devised to constrain the generator to
preserve global-local topological information for accurate intrinsic SC
prediction. Testing on the public Alzheimer’s Disease Neuroimaging Initiative
(ADNI) dataset, the proposed model can effectively generate empirical
SC-preserved connectivity from four-dimensional imaging data and shows superior
performance in SC prediction compared with other related models. Furthermore,
the proposed model can identify the vast majority of important brain regions
and connections derived from the empirical method, providing an alternative way
to fuse multimodal brain networks and analyze clinical disease.
Exploiting the Signal-Leak Bias in Diffusion Models
September 27, 2023
Martin Nicolas Everaert, Athanasios Fitsios, Marco Bocchio, Sami Arpa, Sabine Süsstrunk, Radhakrishna Achanta
There is a bias in the inference pipeline of most diffusion models. This bias
arises from a signal leak whose distribution deviates from the noise
distribution, creating a discrepancy between training and inference processes.
We demonstrate that this signal-leak bias is particularly significant when
models are tuned to a specific style, causing sub-optimal style matching.
Recent research tries to avoid the signal leakage during training. We instead
show how we can exploit this signal-leak bias in existing diffusion models to
allow more control over the generated images. This enables us to generate
images with more varied brightness, and images that better match a desired
style or color. By modeling the distribution of the signal leak in the spatial
frequency and pixel domains, and including a signal leak in the initial latent,
we generate images that better match expected results without any additional
training.
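One way to picture the inference-time trick: start sampling from an initial latent that already contains a leak component matched to the desired brightness or style, rather than from pure noise. The mixing below assumes a standard DDPM forward process and placeholder leak statistics; the paper models the leak distribution more carefully in the spatial-frequency and pixel domains:

    import torch

    def initial_latent_with_leak(shape, leak_mean, leak_std, alpha_bar_T):
        """Build the initial latent the way the forward process would noise a sampled leak."""
        leak = leak_mean + leak_std * torch.randn(shape)    # sample from a modeled leak distribution
        noise = torch.randn(shape)
        return alpha_bar_T.sqrt() * leak + (1 - alpha_bar_T).sqrt() * noise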
High Perceptual Quality Wireless Image Delivery with Denoising Diffusion Models
September 27, 2023
Selim F. Yilmaz, Xueyan Niu, Bo Bai, Wei Han, Lei Deng, Deniz Gunduz
eess.IV, cs.CV, cs.IT, cs.LG, cs.MM, math.IT
We consider the image transmission problem over a noisy wireless channel via
deep learning-based joint source-channel coding (DeepJSCC) along with a
denoising diffusion probabilistic model (DDPM) at the receiver. Specifically,
we are interested in the perception-distortion trade-off in the practical
finite block length regime, in which separate source and channel coding can be
highly suboptimal. We introduce a novel scheme that utilizes the range-null
space decomposition of the target image. We transmit the range-space of the
image after encoding and employ DDPM to progressively refine its null space
contents. Through extensive experiments, we demonstrate significant
improvements in distortion and perceptual quality of reconstructed images
compared to standard DeepJSCC and the state-of-the-art generative
learning-based method. We will publicly share our source code to facilitate
further research and reproducibility.
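For a linear degradation operator A, the range-null space split used at the receiver can be sketched as follows: the range component A^+ A x is pinned down by the transmitted measurement, while the null component (I - A^+ A) x is invisible to A and is the part the DDPM progressively refines. The toy operator below is an assumption for illustration, not the paper's codec:

    import torch

    def range_null_split(x, A):
        """Decompose x into a measurement-consistent range part and a null-space part."""
        A_pinv = torch.linalg.pinv(A)
        range_part = A_pinv @ (A @ x)        # determined by the transmitted measurement
        null_part = x - range_part           # left for the diffusion prior to refine
        return range_part, null_part

    A = torch.randn(32, 128)                 # toy linear degradation
    x = torch.randn(128, 1)
    r, n = range_null_split(x, A)
    print((A @ n).abs().max())               # approximately zero: the null part is invisible to A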
Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation
September 27, 2023
Xin Yuan, Michael Maire
We develop a neural network architecture which, trained in an unsupervised
manner as a denoising diffusion model, simultaneously learns to both generate
and segment images. Learning is driven entirely by the denoising diffusion
objective, without any annotation or prior knowledge about regions during
training. A computational bottleneck, built into the neural architecture,
encourages the denoising network to partition an input into regions, denoise
them in parallel, and combine the results. Our trained model generates both
synthetic images and, by simple examination of its internal predicted
partitions, a semantic segmentation of those images. Without any finetuning, we
directly apply our unsupervised model to the downstream task of segmenting real
images via noising and subsequently denoising them. Experiments demonstrate
that our model achieves accurate unsupervised image segmentation and
high-quality synthetic image generation across multiple datasets.
Learning Using Generated Privileged Information by Text-to-Image Diffusion Models
September 26, 2023
Rafael-Edy Menadil, Mariana-Iuliana Georgescu, Radu Tudor Ionescu
Learning Using Privileged Information is a particular type of knowledge
distillation where the teacher model benefits from an additional data
representation during training, called privileged information, improving the
student model, which does not see the extra representation. However, privileged
information is rarely available in practice. To this end, we propose a text
classification framework that harnesses text-to-image diffusion models to
generate artificial privileged information. The generated images and the
original text samples are further used to train multimodal teacher models based
on state-of-the-art transformer-based architectures. Finally, the knowledge
from multimodal teachers is distilled into a text-based (unimodal) student.
Hence, by employing a generative model to produce synthetic data as privileged
information, we guide the training of the student model. Our framework, called
Learning Using Generated Privileged Information (LUGPI), yields noticeable
performance gains on four text classification data sets, demonstrating its
potential in text classification without any additional cost during inference.
Diffusion-based Holistic Texture Rectification and Synthesis
September 26, 2023
Guoqing Hao, Satoshi Iizuka, Kensho Hara, Edgar Simo-Serra, Hirokatsu Kataoka, Kazuhiro Fukui
We present a novel framework for rectifying occlusions and distortions in
degraded texture samples from natural images. Traditional texture synthesis
approaches focus on generating textures from pristine samples, which
necessitate meticulous preparation by humans and are often unattainable in most
natural images. These challenges stem from the frequent occlusions and
distortions of texture samples in natural images due to obstructions and
variations in object surface geometry. To address these issues, we propose a
framework that synthesizes holistic textures from degraded samples in natural
images, extending the applicability of exemplar-based texture synthesis
techniques. Our framework utilizes a conditional Latent Diffusion Model (LDM)
with a novel occlusion-aware latent transformer. This latent transformer not
only effectively encodes texture features from partially-observed samples
necessary for the generation process of the LDM, but also explicitly captures
long-range dependencies in samples with large occlusions. To train our model,
we introduce a method for generating synthetic data by applying geometric
transformations and free-form mask generation to clean textures. Experimental
results demonstrate that our framework significantly outperforms existing
methods both qualitatively and quantitatively. Furthermore, we conduct
comprehensive ablation studies to validate the different components of our
proposed framework. Results are corroborated by a perceptual user study which
highlights the efficiency of our proposed approach.
Text-image guided Diffusion Model for generating Deepfake celebrity interactions
September 26, 2023
Yunzhuo Chen, Nur Al Hasan Haldar, Naveed Akhtar, Ajmal Mian
Deepfake images are fast becoming a serious concern due to their realism.
Diffusion models have recently demonstrated highly realistic visual content
generation, which makes them an excellent potential tool for Deepfake
generation. To curb their exploitation for Deepfakes, it is imperative to first
explore the extent to which diffusion models can be used to generate realistic
content that is controllable with convenient prompts. This paper devises and
explores a novel method in that regard. Our technique alters the popular stable
diffusion model to generate a controllable high-quality Deepfake image with
text and image prompts. In addition, the original Stable Diffusion model
severely lacks the ability to generate quality images containing multiple
persons. The modified diffusion model addresses this problem by using the input
anchor image's latent at the beginning of inference rather than a Gaussian
random latent as input. Hence, we focus on generating forged content for celebrity interactions,
which may be used to spread rumors. We also apply Dreambooth to enhance the
realism of our fake images. Dreambooth trains the pairing of center words and
specific features to produce more refined and personalized output images. Our
results show that with the devised scheme, it is possible to create fake visual
content with alarming realism, such that the content can serve as believable
evidence of meetings between powerful political figures.
Bootstrap Diffusion Model Curve Estimation for High Resolution Low-Light Image Enhancement
September 26, 2023
Jiancheng Huang, Yifan Liu, Shifeng Chen
Learning-based methods have attracted a lot of research attention and led to
significant improvements in low-light image enhancement. However, most of them
still suffer from two main problems: expensive computational cost in high
resolution images and unsatisfactory performance in simultaneous enhancement
and denoising. To address these problems, we propose BDCE, a bootstrap
diffusion model that exploits the learning of the distribution of the curve
parameters instead of the normal-light image itself. Specifically, we adopt the
curve estimation method to handle the high-resolution images, where the curve
parameters are estimated by our bootstrap diffusion model. In addition, a
denoising module is applied in each iteration of curve adjustment to denoise
the intermediate enhanced result. We evaluate BDCE on commonly
used benchmark datasets, and extensive experiments show that it achieves
state-of-the-art qualitative and quantitative performance.
Multiple Noises in Diffusion Model for Semi-Supervised Multi-Domain Translation
September 25, 2023
Tsiry Mayet, Simon Bernard, Clement Chatelain, Romain Herault
Domain-to-domain translation involves generating a target domain sample given
a condition in the source domain. Most existing methods focus on fixed input
and output domains, i.e. they only work for specific configurations (i.e. for
two domains, either $D_1\rightarrow{}D_2$ or $D_2\rightarrow{}D_1$). This paper
proposes Multi-Domain Diffusion (MDD), a conditional diffusion framework for
multi-domain translation in a semi-supervised context. Unlike previous methods,
MDD does not require defining input and output domains, allowing translation
between any partition of domains within a set (such as $(D_1,
D_2)\rightarrow{}D_3$, $D_2\rightarrow{}(D_1, D_3)$, $D_3\rightarrow{}D_1$,
etc. for 3 domains), without the need to train separate models for each domain
configuration. The key idea behind MDD is to leverage the noise formulation of
diffusion models by incorporating one noise level per domain, which allows
missing domains to be modeled with noise in a natural way. This transforms the
training task from a simple reconstruction task to a domain translation task,
where the model relies on less noisy domains to reconstruct more noisy domains.
We present results on a multi-domain (with more than two domains) synthetic
image translation dataset with challenging semantic domain inversion.
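A hedged sketch of the per-domain noise idea: each domain in a training example receives its own timestep, so observed domains can be kept nearly clean while missing domains are fully noised, and the model learns to reconstruct noisier domains from less noisy ones. The tensor shapes and noise schedule below are assumptions:

    import torch

    def noise_per_domain(domains, alphas_bar, timesteps):
        """domains: list of [B, C, H, W] tensors; timesteps: one timestep index per domain."""
        noisy, eps_list = [], []
        for x, t in zip(domains, timesteps):
            a = alphas_bar[t]
            eps = torch.randn_like(x)
            noisy.append(a.sqrt() * x + (1 - a).sqrt() * eps)   # standard forward process, per domain
            eps_list.append(eps)
        return noisy, eps_list                                  # missing domains simply get a large t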
NetDiffus: Network Traffic Generation by Diffusion Models through Time-Series Imaging
September 23, 2023
Nirhoshan Sivaroopan, Dumindu Bandara, Chamara Madarasingha, Guilluame Jourjon, Anura Jayasumana, Kanchana Thilakarathna
Network data analytics are now at the core of almost every networking
solution. Nonetheless, limited access to networking data has been an enduring
challenge due to many reasons including complexity of modern networks,
commercial sensitivity, privacy and regulatory constraints. In this work, we
explore how to leverage recent advancements in Diffusion Models (DM) to
generate synthetic network traffic data. We develop an end-to-end framework -
NetDiffus that first converts one-dimensional time-series network traffic into
two-dimensional images, and then synthesizes representative images for the
original data. We demonstrate that NetDiffus outperforms the state-of-the-art
traffic generation methods based on Generative Adversarial Networks (GANs) by
providing a 66.4% increase in fidelity of the generated data and an 18.1% improvement
in downstream machine learning tasks. We evaluate NetDiffus on seven diverse
traffic traces and show that utilizing synthetic data significantly improves
traffic fingerprinting, anomaly detection and traffic classification.
Domain-Guided Conditional Diffusion Model for Unsupervised Domain Adaptation
September 23, 2023
Yulong Zhang, Shuhao Chen, Weisen Jiang, Yu Zhang, Jiangang Lu, James T. Kwok
Limited transferability hinders the performance of deep learning models when
applied to new application scenarios. Recently, Unsupervised Domain Adaptation
(UDA) has achieved significant progress in addressing this issue via learning
domain-invariant features. However, the performance of existing UDA methods is
constrained by the large domain shift and limited target domain data. To
alleviate these issues, we propose DomAin-guided Conditional Diffusion Model
(DACDM) to generate high-fidelity and diverse samples for the target domain.
In the proposed DACDM, by introducing class information, the labels of
generated samples can be controlled, and a domain classifier is further
introduced in DACDM to guide the generated samples for the target domain. The
generated samples help existing UDA methods transfer from the source domain to
the target domain more easily, thus improving the transfer performance.
Extensive experiments on various benchmarks demonstrate that DACDM brings a
large improvement to the performance of existing UDA methods.
MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation
September 22, 2023
Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy
We present MosaicFusion, a simple yet effective diffusion-based data
augmentation approach for large vocabulary instance segmentation. Our method is
training-free and does not rely on any label supervision. Two key designs
enable us to employ an off-the-shelf text-to-image diffusion model as a useful
dataset generator for object instances and mask annotations. First, we divide
an image canvas into several regions and perform a single round of diffusion
process to generate multiple instances simultaneously, conditioning on
different text prompts. Second, we obtain corresponding instance masks by
aggregating cross-attention maps associated with object prompts across layers
and diffusion time steps, followed by simple thresholding and edge-aware
refinement processing. Without bells and whistles, our MosaicFusion can produce
a significant amount of synthetic labeled data for both rare and novel
categories. Experimental results on the challenging LVIS long-tailed and
open-vocabulary benchmarks demonstrate that MosaicFusion can significantly
improve the performance of existing instance segmentation models, especially
for rare and novel categories. Code will be released at
https://github.com/Jiahao000/MosaicFusion.
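As a rough sketch of the mask-extraction step, the snippet below averages the cross-attention maps associated with an object's prompt tokens over layers and diffusion timesteps, normalizes them, and thresholds the result. The tensor layout and the plain thresholding are assumptions; the paper additionally applies edge-aware refinement:

    import torch

    def mask_from_cross_attention(attn, token_ids, threshold=0.5):
        """attn: [L, T, H*W, num_tokens] cross-attention stacked over layers L and timesteps T."""
        maps = attn[..., token_ids].mean(dim=(0, 1, -1))              # aggregate tokens, layers, steps
        maps = (maps - maps.min()) / (maps.max() - maps.min() + 1e-8)
        return (maps > threshold).float()                             # binary instance mask over H*W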
Deep learning probability flows and entropy production rates in active matter
September 22, 2023
Nicholas M. Boffi, Eric Vanden-Eijnden
cond-mat.stat-mech, cond-mat.soft, cs.LG, cs.NA, math.NA
Active matter systems, from self-propelled colloids to motile bacteria, are
characterized by the conversion of free energy into useful work at the
microscopic scale. These systems generically involve physics beyond the reach
of equilibrium statistical mechanics, and a persistent challenge has been to
understand the nature of their nonequilibrium states. The entropy production
rate and the magnitude of the steady-state probability current provide
quantitative ways to do so by measuring the breakdown of time-reversal symmetry
and the strength of nonequilibrium transport of measure. Yet, their efficient
computation has remained elusive, as they depend on the system’s unknown and
high-dimensional probability density. Here, building upon recent advances in
generative modeling, we develop a deep learning framework that estimates the
score of this density. We show that the score, together with the microscopic
equations of motion, gives direct access to the entropy production rate, the
probability current, and their decomposition into local contributions from
individual particles, spatial regions, and degrees of freedom. To represent the
score, we introduce a novel, spatially-local transformer-based network
architecture that learns high-order interactions between particles while
respecting their underlying permutation symmetry. We demonstrate the broad
utility and scalability of the method by applying it to several
high-dimensional systems of interacting active particles undergoing
motility-induced phase separation (MIPS). We show that a single instance of our
network trained on a system of 4096 particles at one packing fraction can
generalize to other regions of the phase diagram, including systems with as
many as 32768 particles. We use this observation to quantify the spatial
structure of the departure from equilibrium in MIPS as a function of the number
of particles and the packing fraction.
A Diffusion-Model of Joint Interactive Navigation
September 21, 2023
Matthew Niedoba, Jonathan Wilder Lavington, Yunpeng Liu, Vasileios Lioutas, Justice Sefas, Xiaoxuan Liang, Dylan Green, Setareh Dabiri, Berend Zwartsenberg, Adam Scibior, Frank Wood
Simulation of autonomous vehicle systems requires that simulated traffic
participants exhibit diverse and realistic behaviors. The use of prerecorded
real-world traffic scenarios in simulation ensures realism but the rarity of
safety critical events makes large scale collection of driving scenarios
expensive. In this paper, we present DJINN - a diffusion based method of
generating traffic scenarios. Our approach jointly diffuses the trajectories of
all agents, conditioned on a flexible set of state observations from the past,
present, or future. On popular trajectory forecasting datasets, we report state
of the art performance on joint trajectory metrics. In addition, we demonstrate
how DJINN flexibly enables direct test-time sampling from a variety of valuable
conditional distributions including goal-based sampling, behavior-class
sampling, and scenario editing.
Latent Diffusion Models for Structural Component Design
September 20, 2023
Ethan Herron, Jaydeep Rade, Anushrut Jignasu, Baskar Ganapathysubramanian, Aditya Balu, Soumik Sarkar, Adarsh Krishnamurthy
Recent advances in generative modeling, namely diffusion models, have
revolutionized the field, enabling high-quality image generation tailored to
user needs. This paper proposes a framework for the generative
design of structural components. Specifically, we employ a Latent Diffusion
model to generate potential designs of a component that can satisfy a set of
problem-specific loading conditions. One of the distinct advantages our
approach offers over other generative approaches, such as generative
adversarial networks (GANs), is that it permits the editing of existing
designs. We train our model using a dataset of geometries obtained from
structural topology optimization utilizing the SIMP algorithm. Consequently,
our framework generates inherently near-optimal designs. Our work presents
quantitative results that support the structural performance of the generated
designs and the variability in potential candidate designs. Furthermore, we
provide evidence of the scalability of our framework by operating over voxel
domains with resolutions varying from $32^3$ to $128^3$. Our framework can be
used as a starting point for generating novel near-optimal designs similar to
topology-optimized designs.
Light Field Diffusion for Single-View Novel View Synthesis
September 20, 2023
Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Xiaohui Xie
Single-view novel view synthesis, the task of generating images from new
viewpoints based on a single reference image, is an important but challenging
task in computer vision. Recently, Denoising Diffusion Probabilistic Model
(DDPM) has become popular in this area due to its strong ability to generate
high-fidelity images. However, current diffusion-based methods directly rely on
camera pose matrices as viewing conditions, globally and implicitly introducing
3D constraints. These methods may suffer from inconsistency among generated
images from different perspectives, especially in regions with intricate
textures and structures. In this work, we present Light Field Diffusion (LFD),
a conditional diffusion-based model for single-view novel view synthesis.
Unlike previous methods that employ camera pose matrices, LFD transforms the
camera view information into light field encoding and combines it with the
reference image. This design introduces local pixel-wise constraints within the
diffusion models, thereby encouraging better multi-view consistency.
Experiments on several datasets show that our LFD can efficiently generate
high-fidelity images and maintain better 3D consistency even in intricate
regions. Our method can generate images with higher quality than NeRF-based
models, and we obtain sample quality similar to other diffusion-based models
but with only one-third of the model size.
Assessing the capacity of a denoising diffusion probabilistic model to reproduce spatial context
September 19, 2023
Rucha Deshpande, Muzaffer Özbey, Hua Li, Mark A. Anastasio, Frank J. Brooks
eess.IV, cs.CV, cs.LG, stat.ML
Diffusion models have emerged as a popular family of deep generative models
(DGMs). In the literature, it has been claimed that one class of diffusion
models – denoising diffusion probabilistic models (DDPMs) – demonstrate
superior image synthesis performance as compared to generative adversarial
networks (GANs). To date, these claims have been evaluated using either
ensemble-based methods designed for natural images, or conventional measures of
image quality such as structural similarity. However, there remains an
important need to understand the extent to which DDPMs can reliably learn
medical imaging domain-relevant information, which is referred to as 'spatial
context' in this work. To address this, a systematic assessment of the ability
of DDPMs to learn spatial context relevant to medical imaging applications is
reported for the first time. A key aspect of the studies is the use of
stochastic context models (SCMs) to produce training data. In this way, the
ability of the DDPMs to reliably reproduce spatial context can be
quantitatively assessed by use of post-hoc image analyses. Error-rates in
DDPM-generated ensembles are reported, and compared to those corresponding to a
modern GAN. The studies reveal new and important insights regarding the
capacity of DDPMs to learn spatial context. Notably, the results demonstrate
that DDPMs hold significant capacity for generating contextually correct images
that are 'interpolated' between training samples, which may benefit
data-augmentation tasks in ways that GANs cannot.
PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance
September 19, 2023
Peiqing Yang, Shangchen Zhou, Qingyi Tao, Chen Change Loy
Exploiting pre-trained diffusion models for restoration has recently become a
favored alternative to the traditional task-specific training approach.
Previous works have achieved noteworthy success by limiting the solution space
using explicit degradation models. However, these methods often fall short when
faced with complex degradations as they generally cannot be precisely modeled.
In this paper, we propose PGDiff by introducing partial guidance, a fresh
perspective that is more adaptable to real-world degradations compared to
existing works. Rather than specifically defining the degradation process, our
approach models the desired properties, such as image structure and color
statistics of high-quality images, and applies this guidance during the reverse
diffusion process. These properties are readily available and make no
assumptions about the degradation process. When combined with a diffusion
prior, this partial guidance can deliver appealing results across a range of
restoration tasks. Additionally, PGDiff can be extended to handle composite
tasks by consolidating multiple high-quality image properties, achieved by
integrating the guidance from respective tasks. Experimental results
demonstrate that our method not only outperforms existing diffusion-prior-based
approaches but also competes favorably with task-specific models.
Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
September 19, 2023
Yatong Bai, Trung Dang, Dung Tran, Kazuhito Koishida, Somayeh Sojoudi
cs.SD, cs.LG, cs.MM, eess.AS
Diffusion models power a vast majority of text-to-audio (TTA) generation
methods. Unfortunately, these models suffer from slow inference speed due to
iterative queries to the underlying denoising network, making them unsuitable for
scenarios with inference time or computational constraints. This work modifies
the recently proposed consistency distillation framework to train TTA models
that require only a single neural network query. In addition to incorporating
classifier-free guidance into the distillation process, we leverage the
availability of generated audio during distillation training to fine-tune the
consistency TTA model with novel loss functions in the audio space, such as the
CLAP score. Our objective and subjective evaluation results on the AudioCaps
dataset show that consistency models retain diffusion models’ high generation
quality and diversity while reducing the number of queries by a factor of 400.
Forgedit: Text Guided Image Editing via Learning and Forgetting
September 19, 2023
Shiwen Zhang, Shuai Xiao, Weilin Huang
Text-guided image editing on real images, given only the image and the target
text prompt as inputs, is a very general and challenging problem. It requires
the editing model to reason by itself which parts of the image should be
edited, to preserve the characteristics of the original image, and to perform
complicated non-rigid edits. Previous fine-tuning based solutions are
time-consuming and vulnerable to overfitting, limiting their editing
capabilities. To tackle these issues, we design a novel text guided image
editing method, Forgedit. First, we propose a novel fine-tuning framework which
learns to reconstruct the given image in less than one minute by vision
language joint learning. Then we introduce vector subtraction and vector
projection to explore the proper text embedding for editing. We also find a
general property of UNet structures in Diffusion Models and, inspired by this
finding, we design forgetting strategies to diminish the fatal overfitting
issues and significantly boost the editing abilities of Diffusion Models. Our
method, Forgedit, implemented with Stable Diffusion, achieves new
state-of-the-art results on the challenging text guided image editing benchmark
TEdBench, surpassing the previous SOTA method Imagic with Imagen, in terms of
both CLIP score and LPIPS score. Codes are available at
https://github.com/witcherofresearch/Forgedit.
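As a toy illustration of the embedding arithmetic mentioned above, the helpers below apply vector subtraction and vector projection to a pair of text embeddings; how Forgedit combines and weights these operations in practice is more involved, so treat the naming and usage here as assumptions:

    import torch

    def embedding_subtraction(e_edit, e_source):
        """Remove the reconstructed source direction from the editing embedding."""
        return e_edit - e_source

    def embedding_projection(e_edit, e_source):
        """Keep only the component of the editing embedding along the source direction."""
        d = e_source / (e_source.norm() + 1e-8)
        return (e_edit @ d) * d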
Learning End-to-End Channel Coding with Diffusion Models
September 19, 2023
Muah Kim, Rick Fritschek, Rafael F. Schaefer
The training of neural encoders via deep learning necessitates a
differentiable channel model due to the backpropagation algorithm. This
requirement can be sidestepped by approximating either the channel distribution
or its gradient through pilot signals in real-world scenarios. The initial
approach draws upon the latest advancements in image generation, utilizing
generative adversarial networks (GANs) or their enhanced variants to generate
channel distributions. In this paper, we address this channel approximation
challenge with diffusion models, which have demonstrated high sample quality in
image generation. We offer an end-to-end channel coding framework underpinned
by diffusion models and propose an efficient training algorithm. Our
simulations with various channel models establish that our diffusion models
learn the channel distribution accurately, thereby achieving near-optimal
end-to-end symbol error rates (SERs). We also note a significant advantage of
diffusion models: A robust generalization capability in high signal-to-noise
ratio regions, in contrast to GAN variants that suffer from an error floor.
Furthermore, we examine the trade-off between sample quality and sampling
speed, when an accelerated sampling algorithm is deployed, and investigate the
effect of the noise scheduling on this trade-off. With an apt choice of noise
scheduling, sampling time can be significantly reduced with a minor increase in
SER.
Diffusion-based speech enhancement with a weighted generative-supervised learning loss
September 19, 2023
Jean-Eudes Ayilo, Mostafa Sadeghi, Romain Serizel
cs.CV, cs.SD, eess.AS, eess.SP, stat.ML
Diffusion-based generative models have recently gained attention in speech
enhancement (SE), providing an alternative to conventional supervised methods.
These models transform clean speech training samples into Gaussian noise
centered at noisy speech, and subsequently learn a parameterized model to
reverse this process, conditionally on noisy speech. Unlike supervised methods,
generative-based SE approaches usually rely solely on an unsupervised loss,
which may result in less efficient incorporation of conditioned noisy speech.
To address this issue, we propose augmenting the original diffusion training
objective with a mean squared error (MSE) loss, measuring the discrepancy
between estimated enhanced speech and ground-truth clean speech at each reverse
process iteration. Experimental results demonstrate the effectiveness of our
proposed methodology.
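A minimal sketch of the augmented objective, assuming a standard DDPM-style parameterization (the paper's forward process is centered at the noisy speech, which is simplified away here): the usual noise-prediction loss is combined with an MSE between the implied clean-speech estimate and the ground truth, with a placeholder weight:

    import torch

    def weighted_se_loss(model, x_clean, y_noisy, t, alpha_bar_t, lam=0.5):
        """Generative diffusion loss plus a supervised MSE term on the clean-speech estimate."""
        eps = torch.randn_like(x_clean)
        x_t = alpha_bar_t.sqrt() * x_clean + (1 - alpha_bar_t).sqrt() * eps      # forward process
        eps_hat = model(x_t, y_noisy, t)                                         # conditioned on noisy speech
        generative = ((eps_hat - eps) ** 2).mean()                               # unsupervised diffusion term
        x0_hat = (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
        supervised = ((x0_hat - x_clean) ** 2).mean()                            # MSE to ground-truth clean speech
        return generative + lam * supervised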
AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration
September 19, 2023
Lijiang Li, Huixia Li, Xiawu Zheng, Jie Wu, Xuefeng Xiao, Rui Wang, Min Zheng, Xin Pan, Fei Chao, Rongrong Ji
Diffusion models are emerging expressive generative models, in which a large
number of time steps (inference steps) are required for a single image
generation. To accelerate such a tedious process, reducing steps uniformly is
considered an undisputed principle of diffusion models. We argue that
such a uniform assumption is not the optimal solution in practice; i.e., we can
find different optimal time steps for different models. Therefore, we propose
to search the optimal time steps sequence and compressed model architecture in
a unified framework to achieve effective image generation for diffusion models
without any further training. Specifically, we first design a unified search
space that consists of all possible time steps and various architectures. Then,
a two stage evolutionary algorithm is introduced to find the optimal solution
in the designed search space. To further accelerate the search process, we
employ FID score between generated and real samples to estimate the performance
of the sampled examples. As a result, the proposed method is (i) training-free,
obtaining the optimal time steps and model architecture without any training
process; (ii) orthogonal to most advanced diffusion samplers, with which it can
be integrated to gain better sample quality; and (iii) generalized, in that the
searched time steps and architectures can be directly applied to different
diffusion models with the same guidance scale. Experimental results show that
our method achieves excellent performance by using only a few time steps, e.g.
17.86 FID score on ImageNet 64 $\times$ 64 with only four steps, compared to
138.66 with DDIM. The code is available at
https://github.com/lilijiangg/AutoDiffusion.
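The search itself can be pictured with a very small evolutionary loop over candidate timestep sequences, scored by an FID estimate on generated samples. `estimate_fid` and `sample_with_steps` are hypothetical callables and the single-population mutation scheme is an assumption; the paper additionally searches over architectures with a two-stage algorithm:

    import random

    def evolve_timesteps(estimate_fid, sample_with_steps, T=1000, n_steps=4, pop_size=8, generations=20):
        """Evolve a timestep schedule of length n_steps that minimizes an FID estimate."""
        def random_schedule():
            return sorted(random.sample(range(T), n_steps), reverse=True)

        def mutate(schedule):
            s = set(schedule)
            s.discard(random.choice(schedule))            # drop one step ...
            while len(s) < n_steps:
                s.add(random.randrange(T))                # ... and draw replacements
            return sorted(s, reverse=True)

        population = [random_schedule() for _ in range(pop_size)]
        for _ in range(generations):
            population.sort(key=lambda s: estimate_fid(sample_with_steps(s)))
            parents = population[: pop_size // 2]         # keep the best half
            children = [mutate(random.choice(parents)) for _ in range(pop_size - len(parents))]
            population = parents + children
        return min(population, key=lambda s: estimate_fid(sample_with_steps(s)))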
Diffusion Methods for Generating Transition Paths
September 19, 2023
Luke Triplett, Jianfeng Lu
physics.comp-ph, cs.LG, stat.ML
In this work, we seek to simulate rare transitions between metastable states
using score-based generative models. An efficient method for generating
high-quality transition paths is valuable for the study of molecular systems
since data is often difficult to obtain. We develop two novel methods for path
generation in this paper: a chain-based approach and a midpoint-based approach.
The first biases the original dynamics to facilitate transitions, while the
second mirrors splitting techniques and breaks down the original transition
into smaller transitions. Numerical results of generated transition paths for
the Müller potential and for Alanine dipeptide demonstrate the effectiveness
of these approaches in both the data-rich and data-scarce regimes.
Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees
September 18, 2023
Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
Tabular data is hard to acquire and is subject to missing values. This paper
proposes a novel approach to generate and impute mixed-type (continuous and
categorical) tabular data using score-based diffusion and conditional flow
matching. Contrary to previous work that relies on neural networks to learn the
score function or the vector field, we instead rely on XGBoost, a popular
Gradient-Boosted Tree (GBT) method. We empirically show on 27 different
datasets that our approach i) generates highly realistic synthetic data when
the training dataset is either clean or tainted by missing data and ii)
generates diverse plausible data imputations. Furthermore, our method
outperforms deep-learning generation methods on data generation and is
competitive on data imputation. Finally, it can be trained in parallel using
CPUs without the need for a GPU. To make it easily accessible, we release our
code through a Python library and an R package.
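A hedged sketch of the tree-based idea for the continuous, flow-matching case: discretize the interpolation time and, for each time bin, fit a gradient-boosted regressor from the interpolated point to the straight-line velocity. The binning, hyperparameters, and the use of scikit-learn's MultiOutputRegressor are illustrative choices, not the released library:

    import numpy as np
    import xgboost as xgb
    from sklearn.multioutput import MultiOutputRegressor

    def fit_flow_matching_trees(X, n_time_bins=10, seed=0):
        """X: [N, D] table of continuous features; returns one velocity regressor per time bin."""
        rng = np.random.default_rng(seed)
        models = []
        for k in range(n_time_bins):
            t = (k + 0.5) / n_time_bins
            z = rng.standard_normal(X.shape)                      # noise endpoint
            x_t = (1 - t) * z + t * X                             # point on the straight path
            v = X - z                                             # conditional flow-matching target
            reg = MultiOutputRegressor(xgb.XGBRegressor(n_estimators=100, max_depth=4))
            reg.fit(x_t, v)                                       # trees stand in for the neural network
            models.append(reg)
        return models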
What is a Fair Diffusion Model? Designing Generative Text-To-Image Models to Incorporate Various Worldviews
September 18, 2023
Zoe De Simone, Angie Boggust, Arvind Satyanarayan, Ashia Wilson
cs.LG, cs.AI, cs.CV, cs.CY
Generative text-to-image (GTI) models produce high-quality images from short
textual descriptions and are widely used in academic and creative domains.
However, GTI models frequently amplify biases from their training data, often
producing prejudiced or stereotypical images. Yet, current bias mitigation
strategies are limited and primarily focus on enforcing gender parity across
occupations. To enhance GTI bias mitigation, we introduce DiffusionWorldViewer,
a tool to analyze and manipulate GTI models’ attitudes, values, stories, and
expectations of the world that impact their generated images. Through an
interactive interface deployed as a web-based GUI and Jupyter Notebook plugin,
DiffusionWorldViewer categorizes existing demographics of GTI-generated images
and provides interactive methods to align image demographics with user
worldviews. In a study with 13 GTI users, we find that DiffusionWorldViewer
allows users to represent their varied viewpoints about what GTI outputs are
fair and, in doing so, challenges current notions of fairness that assume a
universal worldview.
Speeding Up Speech Synthesis In Diffusion Models By Reducing Data Distribution Recovery Steps Via Content Transfer
September 18, 2023
Peter Ochieng
Diffusion based vocoders have been criticised for being slow due to the many
steps required during sampling. Moreover, the commonly implemented loss
function is designed such that the target is the original input $x_0$ or the
error $\epsilon_0$. For early time steps of the reverse process, this results
in large prediction errors, which can lead to speech distortions and increase
the learning time. We propose a setup where the targets are the different
outputs of the forward-process time steps, with the goal of reducing the
magnitude of prediction errors and the training time. We use the
different layers of a neural network (NN) to perform denoising by training them
to learn to generate representations similar to the noised outputs in the
forward process of the diffusion. The NN layers learn to progressively denoise
the input in the reverse process until finally the final layer estimates the
clean speech. To avoid 1:1 mapping between layers of the neural network and the
forward process steps, we define a skip parameter $\tau>1$ such that an NN
layer is trained to cumulatively remove the noise injected in the $\tau$ steps
in the forward process. This significantly reduces the number of data
distribution recovery steps and, consequently, the time to generate speech. We
show through extensive evaluation that the proposed technique generates
high-fidelity speech in competitive time that outperforms current
state-of-the-art tools. The proposed technique is also able to generalize well
to unseen speech.
Gradpaint: Gradient-Guided Inpainting with Diffusion Models
September 18, 2023
Asya Grechka, Guillaume Couairon, Matthieu Cord
Denoising Diffusion Probabilistic Models (DDPMs) have recently achieved
remarkable results in conditional and unconditional image generation. The
pre-trained models can be adapted without further training to different
downstream tasks, by guiding their iterative denoising process at inference
time to satisfy additional constraints. For the specific task of image
inpainting, the current guiding mechanism relies on copying-and-pasting the
known regions from the input image at each denoising step. However, diffusion
models are strongly conditioned by the initial random noise, and therefore
struggle to harmonize predictions inside the inpainting mask with the real
parts of the input image, often producing results with unnatural artifacts.
Our method, dubbed GradPaint, steers the generation towards a globally
coherent image. At each step in the denoising process, we leverage the model’s
“denoised image estimation” by calculating a custom loss measuring its
coherence with the masked input image. Our guiding mechanism uses the gradient
obtained from backpropagating this loss through the diffusion model itself.
GradPaint generalizes well to diffusion models trained on various datasets,
improving upon current state-of-the-art supervised and unsupervised methods.
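The guidance can be sketched as a single reverse step, similar in spirit to the steered-sampling sketch earlier in this list but with the loss defined against the known pixels: compute the denoised estimate, measure its mismatch with the unmasked regions, backpropagate through the diffusion model, and subtract the gradient before continuing. The DDIM-style update and the guidance scale are simplifying assumptions:

    import torch

    def gradpaint_step(denoiser, x_t, t, a_t, a_prev, known_image, mask, scale=1.0):
        """One guided denoising step that stays coherent with the known (unmasked) regions."""
        x_t = x_t.detach().requires_grad_(True)
        eps = denoiser(x_t, t)
        x0_hat = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()       # "denoised image estimation"
        loss = (mask * (x0_hat - known_image) ** 2).mean()         # coherence with the known pixels
        grad = torch.autograd.grad(loss, x_t)[0]                   # gradient through the model itself
        x_prev = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps
        return (x_prev - scale * grad).detach()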
Progressive Text-to-Image Diffusion with Soft Latent Direction
September 18, 2023
YuTeng Ye, Jiale Cai, Hang Zhou, Guanwen Li, Youjia Zhang, Zikai Song, Chenxing Gao, Junqing Yu, Wei Yang
In spite of the rapidly evolving landscape of text-to-image generation, the
synthesis and manipulation of multiple entities while adhering to specific
relational constraints pose enduring challenges. This paper introduces an
innovative progressive synthesis and editing operation that systematically
incorporates entities into the target image, ensuring their adherence to
spatial and relational constraints at each sequential step. Our key insight
stems from the observation that while a pre-trained text-to-image diffusion
model adeptly handles one or two entities, it often falters when dealing with a
greater number. To address this limitation, we propose harnessing the
capabilities of a Large Language Model (LLM) to decompose intricate and
protracted text descriptions into coherent directives adhering to stringent
formats. To facilitate the execution of directives involving distinct semantic
operations, namely insertion, editing, and erasing, we formulate the Stimulus,
Response, and Fusion (SRF) framework. Within this framework, latent regions are
gently stimulated in alignment with each operation, followed by the fusion of
the responsive latent components to achieve cohesive entity manipulation. Our
proposed framework yields notable advancements in object synthesis,
particularly when confronted with intricate and lengthy textual inputs.
Consequently, it establishes a new benchmark for text-to-image generation
tasks, further elevating the field’s performance standards.
Regularised Diffusion-Shock Inpainting
September 15, 2023
Kristina Schaefer, Joachim Weickert
We introduce regularised diffusion–shock (RDS) inpainting as a modification
of diffusion–shock inpainting from our SSVM 2023 conference paper. RDS
inpainting combines two carefully chosen components: homogeneous diffusion and
coherence-enhancing shock filtering. It benefits from the complementary synergy
of its building blocks: The shock term propagates edge data with perfect
sharpness and directional accuracy over large distances due to its high degree
of anisotropy. Homogeneous diffusion fills large areas efficiently. The second
order equation underlying RDS inpainting inherits a maximum–minimum principle
from its components, which is also fulfilled in the discrete case, in contrast
to competing anisotropic methods. The regularisation addresses the largest
drawback of the original model: It allows a drastic reduction in model
parameters without any loss in quality. Furthermore, we extend RDS inpainting
to vector-valued data. Our experiments show a performance that is comparable to
or better than many inpainting models, including anisotropic processes of
second or fourth order.
DCTTS: Discrete Diffusion Model with Contrastive Learning for Text-to-speech Generation
September 13, 2023
Zhichao Wu, Qiulin Li, Sixing Liu, Qun Yang
In the Text-to-speech(TTS) task, the latent diffusion model has excellent
fidelity and generalization, but its expensive resource consumption and slow
inference speed have always been challenging. This paper proposes Discrete
Diffusion Model with Contrastive Learning for Text-to-Speech Generation(DCTTS).
The following contributions are made by DCTTS: 1) The TTS diffusion model based
on discrete space significantly lowers the computational consumption of the
diffusion model and improves sampling speed; 2) The contrastive learning method
based on discrete space is used to enhance the alignment between
speech and text and improve sampling quality; and 3) It uses an efficient text
encoder to simplify the model’s parameters and increase computational
efficiency. The experimental results demonstrate that the approach proposed in
this paper has outstanding speech synthesis quality and sampling speed while
significantly reducing the resource consumption of diffusion model. The
synthesized samples are available at https://github.com/lawtherWu/DCTTS.
Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models
September 12, 2023
Zalan Fabian, Berk Tinaz, Mahdi Soltanolkotabi
eess.IV, cs.LG, I.2.6; I.4.5
Inverse problems arise in a multitude of applications, where the goal is to
recover a clean signal from noisy and possibly (non)linear observations. The
difficulty of a reconstruction problem depends on multiple factors, such as the
structure of the ground truth signal, the severity of the degradation, the
implicit bias of the reconstruction model and the complex interactions between
the above factors. This results in natural sample-by-sample variation in the
difficulty of a reconstruction task, which is often overlooked by contemporary
techniques. Recently, diffusion-based inverse problem solvers have established
new state-of-the-art in various reconstruction tasks. However, they have the
drawback of being computationally prohibitive. Our key observation in this
paper is that most existing solvers lack the ability to adapt their compute
power to the difficulty of the reconstruction task, resulting in long inference
times, subpar performance and wasteful resource allocation. We propose a novel
method that we call severity encoding, to estimate the degradation severity of
noisy, degraded signals in the latent space of an autoencoder. We show that the
estimated severity has strong correlation with the true corruption level and
can give useful hints at the difficulty of reconstruction problems on a
sample-by-sample basis. Furthermore, we propose a reconstruction method based
on latent diffusion models that leverages the predicted degradation severities
to fine-tune the reverse diffusion sampling trajectory and thus achieve
sample-adaptive inference times. We utilize latent diffusion posterior sampling
to maintain data consistency with observations. We perform experiments on both
linear and nonlinear inverse problems and demonstrate that our technique
achieves performance comparable to state-of-the-art diffusion-based techniques,
with significant improvements in computational efficiency.
On the Contraction Coefficient of the Schrödinger Bridge for Stochastic Linear Systems
September 12, 2023
Alexis M. H. Teter, Yongxin Chen, Abhishek Halder
math.OC, cs.LG, cs.SY, eess.SY, stat.ML
Schrödinger bridge is a stochastic optimal control problem to steer a
given initial state density to another, subject to controlled diffusion and
deadline constraints. A popular method to numerically solve the Schrödinger
bridge problems, in both the classical and the linear system settings, is via
contractive fixed point recursions. These recursions can be seen as dynamic
versions of the well-known Sinkhorn iterations, and under mild assumptions,
they solve the so-called Schrödinger systems with guaranteed linear
convergence. In this work, we study a priori estimates for the contraction
coefficients associated with the convergence of the respective Schrödinger
systems. We provide new geometric and control-theoretic interpretations for the
same. Building on these newfound interpretations, we point out the possibility
of improved computation for the worst-case contraction coefficients of linear
SBPs by preconditioning the endpoint support sets.
Reasoning with Latent Diffusion in Offline Reinforcement Learning
September 12, 2023
Siddarth Venkatraman, Shivesh Khaitan, Ravi Tej Akella, John Dolan, Jeff Schneider, Glen Berseth
Offline reinforcement learning (RL) holds promise as a means to learn
high-reward policies from a static dataset, without the need for further
environment interactions. However, a key challenge in offline RL lies in
effectively stitching portions of suboptimal trajectories from the static
dataset while avoiding extrapolation errors arising due to a lack of support in
the dataset. Existing approaches use conservative methods that are tricky to
tune and struggle with multi-modal data (as we show) or rely on noisy Monte
Carlo return-to-go samples for reward conditioning. In this work, we propose a
novel approach that leverages the expressiveness of latent diffusion to model
in-support trajectory sequences as compressed latent skills. This facilitates
learning a Q-function while avoiding extrapolation error via
batch-constraining. The latent space is also expressive and gracefully copes
with multi-modal data. We show that the learned temporally-abstract latent
space encodes richer task-specific information for offline RL tasks as compared
to raw state-actions. This improves credit assignment and facilitates faster
reward propagation during Q-learning. Our method demonstrates state-of-the-art
performance on the D4RL benchmarks, particularly excelling in long-horizon,
sparse-reward tasks.
InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation
September 12, 2023
Xingchao Liu, Xiwen Zhang, Jianzhu Ma, Jian Peng, Qiang Liu
Diffusion models have revolutionized text-to-image generation with their
exceptional quality and creativity. However, their multi-step sampling process is
known to be slow, often requiring tens of inference steps to obtain
satisfactory results. Previous attempts to improve its sampling speed and
reduce computational costs through distillation have been unsuccessful in
achieving a functional one-step model. In this paper, we explore a recent
method called Rectified Flow, which, thus far, has only been applied to small
datasets. The core of Rectified Flow lies in its \emph{reflow} procedure, which
straightens the trajectories of probability flows, refines the coupling between
noises and images, and facilitates the distillation process with student
models. We propose a novel text-conditioned pipeline to turn Stable Diffusion
(SD) into an ultra-fast one-step model, in which we find reflow plays a
critical role in improving the assignment between noise and images. Leveraging
our new pipeline, we create, to the best of our knowledge, the first one-step
diffusion-based text-to-image generator with SD-level image quality, achieving
an FID (Fréchet Inception Distance) of $23.3$ on MS COCO 2017-5k, surpassing
the previous state-of-the-art technique, progressive distillation, by a
significant margin ($37.2$ $\rightarrow$ $23.3$ in FID). By utilizing an
expanded network with 1.7B parameters, we further improve the FID to $22.4$. We
call our one-step models \emph{InstaFlow}. On MS COCO 2014-30k, InstaFlow
yields an FID of $13.1$ in just $0.09$ second, the best in $\leq 0.1$ second
regime, outperforming the recent StyleGAN-T ($13.9$ in $0.1$ second). Notably,
the training of InstaFlow only costs 199 A100 GPU days. Project
page:~\url{https://github.com/gnobitab/InstaFlow}.
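The reflow procedure at the heart of this pipeline can be isolated in a toy
setting: pair each noise sample with the image the teacher model produces from it,
then retrain a velocity field on straight-line interpolations between the paired
endpoints, so the probability-flow trajectories straighten and a single Euler step
suffices. A hedged PyTorch sketch in 2D; the tiny MLP and the synthetic coupling
are placeholders for the Stable Diffusion U-Net and its generated noise-image
pairs:
```python
import torch
import torch.nn as nn

# Toy velocity field v_theta(x, t); in InstaFlow this role is played by the SD U-Net.
velocity = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(velocity.parameters(), lr=1e-3)

# Assumed coupling (z0, z1): z0 ~ N(0, I) noise, z1 the sample the teacher model
# produced from that exact noise. Here a fixed shift fakes the teacher outputs.
z0 = torch.randn(4096, 2)
z1 = z0 + torch.tensor([3.0, -1.0])   # placeholder for teacher-generated samples

for step in range(2000):
    idx = torch.randint(0, z0.shape[0], (256,))
    x0, x1 = z0[idx], z1[idx]
    t = torch.rand(256, 1)
    xt = (1 - t) * x0 + t * x1                    # straight-line interpolation
    target = x1 - x0                              # constant velocity along the line
    pred = velocity(torch.cat([xt, t], dim=1))
    loss = ((pred - target) ** 2).mean()          # rectified-flow / reflow objective
    opt.zero_grad(); loss.backward(); opt.step()

# One-step generation after reflow: a single Euler step of the (now nearly
# straight) probability-flow ODE from t=0 to t=1.
with torch.no_grad():
    z = torch.randn(8, 2)
    x = z + velocity(torch.cat([z, torch.zeros(8, 1)], dim=1))
```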
Elucidating the solution space of extended reverse-time SDE for diffusion models
September 12, 2023
Qinpeng Cui, Xinyi Zhang, Zongqing Lu, Qingmin Liao
Diffusion models (DMs) demonstrate potent image generation capabilities in
various generative modeling tasks. Nevertheless, their primary limitation lies
in slow sampling speed, requiring hundreds or thousands of sequential function
evaluations through large neural networks to generate high-quality images.
Sampling from DMs can be seen alternatively as solving corresponding stochastic
differential equations (SDEs) or ordinary differential equations (ODEs). In
this work, we formulate the sampling process as an extended reverse-time SDE
(ER SDE), unifying prior explorations into ODEs and SDEs. Leveraging the
semi-linear structure of ER SDE solutions, we offer exact solutions and
arbitrarily high-order approximate solutions for VP SDE and VE SDE,
respectively. Based on the solution space of the ER SDE, we yield mathematical
insights elucidating the superior performance of ODE solvers over SDE solvers
in terms of fast sampling. Additionally, we unveil that VP SDE solvers stand on
par with their VE SDE counterparts. Finally, we devise fast and training-free
samplers, ER-SDE-Solvers, achieving state-of-the-art performance across all
stochastic samplers. Experimental results demonstrate achieving 3.45 FID in 20
function evaluations and 2.24 FID in 50 function evaluations on the ImageNet
$64\times64$ dataset.
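For orientation on the solution space being unified here: the standard
reverse-time SDE is $dx = \left[f(x,t) - g(t)^{2}\nabla_x \log p_t(x)\right]dt +
g(t)\,d\bar{w}$, while the probability flow ODE sharing the same marginals $p_t$
is $\frac{dx}{dt} = f(x,t) - \frac{1}{2} g(t)^{2}\nabla_x \log p_t(x)$. These are
the textbook forms rather than the paper's extended ER SDE notation; the absence
of injected noise in the ODE is the usual intuition for why ODE solvers tolerate
larger step sizes, which the solution-space analysis above makes precise.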
Diffusion-based Adversarial Purification for Robust Deep MRI Reconstruction
September 11, 2023
Ismail Alkhouri, Shijun Liang, Rongrong Wang, Qing Qu, Saiprasad Ravishankar
Deep learning (DL) techniques have been extensively employed in magnetic
resonance imaging (MRI) reconstruction, delivering notable performance
enhancements over traditional non-DL methods. Nonetheless, recent studies have
identified vulnerabilities in these models during testing, namely, their
susceptibility to (\textit{i}) worst-case measurement perturbations and to
(\textit{ii}) variations in training/testing settings like acceleration factors
and k-space sampling locations. This paper addresses the robustness challenges
by leveraging diffusion models. In particular, we present a robustification
strategy that improves the resilience of DL-based MRI reconstruction methods by
utilizing pretrained diffusion models as noise purifiers. In contrast to
conventional robustification methods for DL-based MRI reconstruction, such as
adversarial training (AT), our proposed approach eliminates the need to tackle
a minimax optimization problem. It only necessitates fine-tuning on purified
examples. Our experimental results highlight the efficacy of our approach in
mitigating the aforementioned instabilities when compared to leading
robustification approaches for deep MRI reconstruction, including AT and
randomized smoothing.
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips
September 11, 2023
Yufei Ye, Poorvi Hebbar, Abhinav Gupta, Shubham Tulsiani
We tackle the task of reconstructing hand-object interactions from short
video clips. Given an input video, our approach casts 3D inference as a
per-video optimization and recovers a neural 3D representation of the object
shape, as well as the time-varying motion and hand articulation. While the
input video naturally provides some multi-view cues to guide 3D inference,
these are insufficient on their own due to occlusions and limited viewpoint
variations. To obtain accurate 3D, we augment the multi-view signals with
generic data-driven priors to guide reconstruction. Specifically, we learn a
diffusion network to model the conditional distribution of (geometric)
renderings of objects conditioned on hand configuration and category label, and
leverage it as a prior to guide the novel-view renderings of the reconstructed
scene. We empirically evaluate our approach on egocentric videos across 6
object categories, and observe significant improvements over prior single-view
and multi-view methods. Finally, we demonstrate our system’s ability to
reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person
interactions.
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
September 11, 2023
Anna Deichler, Shivam Mehta, Simon Alexanderson, Jonas Beskow
eess.AS, cs.HC, cs.LG, cs.SD, 68T42, I.2.6; I.2.7
This paper describes a system developed for the GENEA (Generation and
Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our
solution builds on an existing diffusion-based motion synthesis model. We
propose a contrastive speech and motion pretraining (CSMP) module, which learns
a joint embedding for speech and gesture with the aim to learn a semantic
coupling between these modalities. The output of the CSMP module is used as a
conditioning signal in the diffusion-based gesture synthesis model in order to
achieve semantically-aware co-speech gesture generation. Our entry achieved
the highest human-likeness and the highest speech appropriateness ratings among the
submitted entries. This indicates that our system is a promising approach to
achieve human-like co-speech gestures in agents that carry semantic meaning.
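The contrastive speech-and-motion pretraining step follows the familiar symmetric
InfoNCE recipe; a hedged PyTorch sketch with random tensors standing in for the
speech and gesture encoder outputs (the embedding width, batch size, and
temperature are illustrative, not the paper's settings):
```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, motion_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired speech / gesture embeddings."""
    s = F.normalize(speech_emb, dim=-1)
    m = F.normalize(motion_emb, dim=-1)
    logits = s @ m.t() / temperature            # similarity of every pair in the batch
    targets = torch.arange(s.shape[0])          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage: 8 paired clips embedded into a 128-dimensional joint space
loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```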
Treatment-aware Diffusion Probabilistic Model for Longitudinal MRI Generation and Diffuse Glioma Growth Prediction
September 11, 2023
Qinghui Liu, Elies Fuster-Garcia, Ivar Thokle Hovden, Donatas Sederevicius, Karoline Skogen, Bradley J MacIntosh, Edvard Grødem, Till Schellhorn, Petter Brandal, Atle Bjørnerud, Kyrre Eeg Emblem
Diffuse gliomas are malignant brain tumors that grow widespread through the
brain. The complex interactions between neoplastic cells and normal tissue, as
well as the treatment-induced changes often encountered, make glioma tumor
growth modeling challenging. In this paper, we present a novel end-to-end
network capable of generating future tumor masks and realistic MRIs of how the
tumor will look at any future time points for different treatment plans. Our
approach is based on cutting-edge diffusion probabilistic models and
deep-segmentation neural networks. We included sequential multi-parametric
magnetic resonance images (MRI) and treatment information as conditioning
inputs to guide the generative diffusion process. This allows for tumor growth
estimates at any given time point. We trained the model using real-world
postoperative longitudinal MRI data with glioma tumor growth trajectories
represented as tumor segmentation maps over time. The model has demonstrated
promising performance across a range of tasks, including the generation of
high-quality synthetic MRIs with tumor masks, time-series tumor segmentations,
and uncertainty estimates. Combined with the treatment-aware generated MRIs,
the tumor growth predictions with uncertainty estimates can provide useful
information for clinical decision-making.
Diff-Privacy: Diffusion-based Face Privacy Protection
September 11, 2023
Xiao He, Mingrui Zhu, Dongxin Chen, Nannan Wang, Xinbo Gao
Privacy protection has become a top priority as the proliferation of AI
techniques has led to widespread collection and misuse of personal data.
Anonymization and visual identity information hiding are two important facial
privacy protection tasks that aim to remove identification characteristics from
facial images at the human perception level. However, they have a significant
difference in that the former aims to prevent the machine from recognizing
correctly, while the latter needs to ensure the accuracy of machine
recognition. Therefore, it is difficult to train a model to complete these two
tasks simultaneously. In this paper, we unify the task of anonymization and
visual identity information hiding and propose a novel face privacy protection
method based on diffusion models, dubbed Diff-Privacy. Specifically, we train
our proposed multi-scale image inversion module (MSI) to obtain a set of SDM
format conditional embeddings of the original image. Based on the conditional
embeddings, we design corresponding embedding scheduling strategies and
construct different energy functions during the denoising process to achieve
anonymization and visual identity information hiding. Extensive experiments
have been conducted to validate the effectiveness of our proposed framework in
protecting facial privacy.
Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood
September 10, 2023
Yaxuan Zhu, Jianwen Xie, Yingnian Wu, Ruiqi Gao
Training energy-based models (EBMs) with maximum likelihood estimation on
high-dimensional data can be both challenging and time-consuming. As a result,
there is a noticeable gap in sample quality between EBMs and other generative
frameworks like GANs and diffusion models. To close this gap, inspired by the
recent efforts of learning EBMs by maximizing diffusion recovery likelihood
(DRL), we propose cooperative diffusion recovery likelihood (CDRL), an
effective approach to tractably learn and sample from a series of EBMs defined
on increasingly noisy versions of a dataset, paired with an initializer model
for each EBM. At each noise level, the initializer model learns to amortize the
sampling process of the EBM, and the two models are jointly estimated within a
cooperative training framework. Samples from the initializer serve as starting
points that are refined by a few sampling steps from the EBM. With the refined
samples, the EBM is optimized by maximizing recovery likelihood, while the
initializer is optimized by learning from the difference between the refined
samples and the initial samples. We develop a new noise schedule and a variance
reduction technique to further improve the sample quality. Combining these
advances, we significantly boost the FID scores compared to existing EBM
methods on CIFAR-10 and ImageNet 32x32, with a 2x speedup over DRL. In
addition, we extend our method to compositional generation and image inpainting
tasks, and showcase the compatibility of CDRL with classifier-free guidance for
conditional generation, achieving similar trade-offs between sample quality and
sample diversity as in diffusion models.
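The cooperative loop summarized above can be sketched at a single noise level: the
initializer proposes samples, a few Langevin steps under the EBM refine them, the
EBM is then updated contrastively on data versus refined samples, and the
initializer regresses toward the refined samples. A heavily simplified PyTorch
sketch on toy 2D data; the contrastive EBM loss and step sizes below are schematic
stand-ins for the paper's recovery-likelihood objective and noise schedule:
```python
import torch
import torch.nn as nn

energy = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 1))    # EBM
init_net = nn.Sequential(nn.Linear(2, 128), nn.SiLU(), nn.Linear(128, 2))  # initializer
opt_e = torch.optim.Adam(energy.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(init_net.parameters(), lr=1e-4)

def langevin_refine(x, n_steps=5, step_size=0.01):
    """A few Langevin steps that lower the energy, starting from initializer samples."""
    x = x.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - 0.5 * step_size * grad + (step_size ** 0.5) * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()

for step in range(1000):
    data = torch.randn(256, 2) * 0.5 + torch.tensor([2.0, 0.0])  # toy data at this noise level
    z = torch.randn(256, 2)
    proposal = init_net(z)                        # initializer amortizes EBM sampling
    refined = langevin_refine(proposal.detach())  # a few EBM sampling steps refine the proposals

    # EBM update: contrastive stand-in for maximizing recovery likelihood
    loss_e = energy(data).mean() - energy(refined).mean()
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

    # Initializer update: learn from the gap between its proposals and the refined samples
    loss_g = ((proposal - refined) ** 2).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```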
SA-Solver: Stochastic Adams Solver for Fast Sampling of Diffusion Models
September 10, 2023
Shuchen Xue, Mingyang Yi, Weijian Luo, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, Zhi-Ming Ma
Diffusion Probabilistic Models (DPMs) have achieved considerable success in
generation tasks. As sampling from DPMs is equivalent to solving diffusion SDE
or ODE which is time-consuming, numerous fast sampling methods built upon
improved differential equation solvers are proposed. The majority of such
techniques consider solving the diffusion ODE due to its superior efficiency.
However, stochastic sampling could offer additional advantages in generating
diverse and high-quality data. In this work, we engage in a comprehensive
analysis of stochastic sampling from two aspects: variance-controlled diffusion
SDE and linear multi-step SDE solver. Based on our analysis, we propose
SA-Solver, which is an improved efficient stochastic Adams method for solving
diffusion SDE to generate data with high quality. Our experiments show that
SA-Solver achieves: 1) improved or comparable performance compared with the
existing state-of-the-art sampling methods for few-step sampling; 2) SOTA FID
scores on substantial benchmark datasets under a suitable number of function
evaluations (NFEs).
Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning
September 10, 2023
Guisheng Liu, Yi Li, Zhengcong Fei, Haiyan Fu, Xiangyang Luo, Yanqing Guo
While impressive performance has been achieved in image captioning, the
limited diversity of the generated captions and the large parameter scale
remain major barriers to the real-world application of these systems. In this
work, we propose a lightweight image captioning network in combination with
continuous diffusion, called Prefix-diffusion. To achieve diversity, we design
an efficient method that injects prefix image embeddings into the denoising
process of the diffusion model. In order to reduce trainable parameters, we
employ a pre-trained model to extract image features and further design an
extra mapping network. Prefix-diffusion is able to generate diverse captions
with relatively few parameters, while maintaining the fluency and relevance of
the captions benefiting from the generative capabilities of the diffusion
model. Our work paves the way for scaling up diffusion models for image
captioning, and achieves promising performance compared with recent approaches.
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask
September 08, 2023
Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng
Recent advancements in diffusion models have showcased their impressive
capacity to generate visually striking images. Nevertheless, ensuring a close
match between the generated image and the given prompt remains a persistent
challenge. In this work, we identify that a crucial factor leading to the
text-image mismatch issue is the inadequate cross-modality relation learning
between the prompt and the output image. To better align the prompt and image
content, we advance the cross-attention with an adaptive mask, which is
conditioned on the attention maps and the prompt embeddings, to dynamically
adjust the contribution of each text token to the image features. This
mechanism explicitly diminishes the ambiguity in semantic information embedding
from the text encoder, leading to a boost of text-to-image consistency in the
synthesized images. Our method, termed MaskDiffusion, is training-free and
hot-pluggable for popular pre-trained diffusion models. When applied to the
latent diffusion models, our MaskDiffusion can significantly improve the
text-to-image consistency with negligible computation overhead compared to the
original diffusion models.
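The reweighting idea can be illustrated in isolation: derive a per-token weight
from the cross-attention map itself and fold it back into the attention logits
before the softmax, so each text token's contribution to the image features is
adjusted dynamically. The toy sketch below is only a schematic of that mechanism;
the actual mask in MaskDiffusion is also conditioned on the prompt embeddings, and
the concentration statistic used here is a made-up stand-in:
```python
import torch

def masked_cross_attention(attn_scores, temperature=1.0):
    """attn_scores: (num_pixels, num_tokens) raw cross-attention logits.

    Derives a per-token weight from how strongly each token binds anywhere in the
    image, then rescales the logits before the softmax so weakly grounded tokens
    contribute less to every pixel.
    """
    probs = attn_scores.softmax(dim=-1)            # standard cross-attention
    per_token_peak = probs.max(dim=0).values       # peak response of each text token
    mask = (per_token_peak / per_token_peak.max()).clamp(min=1e-3)  # illustrative mask
    reweighted = attn_scores + temperature * mask.log()             # soft masking in logit space
    return reweighted.softmax(dim=-1)

# toy usage: a 64x64 latent flattened to 4096 pixels, 77 text tokens
out = masked_cross_attention(torch.randn(4096, 77))
```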
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
September 07, 2023
Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo
We present InstructDiffusion, a unifying and generic framework for aligning
computer vision tasks with human instructions. Unlike existing approaches that
integrate prior knowledge and pre-define the output space (e.g., categories and
coordinates) for each vision task, we cast diverse vision tasks into a
human-intuitive image-manipulating process whose output space is a flexible and
interactive pixel space. Concretely, the model is built upon the diffusion
process and is trained to predict pixels according to user instructions, such
as encircling the man’s left shoulder in red or applying a blue mask to the
left car. InstructDiffusion could handle a variety of vision tasks, including
understanding tasks (such as segmentation and keypoint detection) and
generative tasks (such as editing and enhancement). It even exhibits the
ability to handle unseen tasks and outperforms prior methods on novel datasets.
This represents a significant step towards a generalist modeling interface for
vision tasks, advancing artificial general intelligence in the field of
computer vision.
DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection
September 07, 2023
Manlin Zhang, Jie Wu, Yuxi Ren, Ming Li, Jie Qin, Xuefeng Xiao, Wei Liu, Rui Wang, Min Zheng, Andy J. Ma
Data is the cornerstone of deep learning. This paper reveals that the
recently developed Diffusion Model is a scalable data engine for object
detection. Existing methods for scaling up detection-oriented data often
require manual collection or generative models to obtain target images,
followed by data augmentation and labeling to produce training pairs, which are
costly, complex, or lacking diversity. To address these issues, we
present DiffusionEngine (DE), a data scaling-up engine that provides
high-quality detection-oriented training pairs in a single stage. DE consists
of a pre-trained diffusion model and an effective Detection-Adapter,
contributing to generating scalable, diverse and generalizable detection data
in a plug-and-play manner. Detection-Adapter is learned to align the implicit
semantic and location knowledge in off-the-shelf diffusion models with
detection-aware signals to make better bounding-box predictions. Additionally,
we contribute two datasets, i.e., COCO-DE and VOC-DE, to scale up existing
detection benchmarks for facilitating follow-up research. Extensive experiments
demonstrate that data scaling-up via DE can achieve significant improvements in
diverse scenarios, such as various detection algorithms, self-supervised
pre-training, data-sparse, label-scarce, cross-domain, and semi-supervised
learning. For example, when using DE with a DINO-based adapter to scale up
data, mAP is improved by 3.1% on COCO, 7.6% on VOC, and 11.5% on Clipart.
Text-to-feature diffusion for audio-visual few-shot learning
September 07, 2023
Otniel-Bogdan Mercea, Thomas Hummel, A. Sophia Koepke, Zeynep Akata
Training deep learning models for video classification from audio-visual data
commonly requires immense amounts of labeled training data collected via a
costly process. A challenging and underexplored, yet much cheaper, setup is
few-shot learning from video data. In particular, the inherently multi-modal
nature of video data with sound and visual information has not been leveraged
extensively for the few-shot video classification task. Therefore, we introduce
a unified audio-visual few-shot video classification benchmark on three
datasets, i.e. the VGGSound-FSL, UCF-FSL, ActivityNet-FSL datasets, where we
adapt and compare ten methods. In addition, we propose AV-DIFF, a
text-to-feature diffusion framework, which first fuses the temporal and
audio-visual features via cross-modal attention and then generates multi-modal
features for the novel classes. We show that AV-DIFF obtains state-of-the-art
performance on our proposed benchmark for audio-visual (generalised) few-shot
learning. Our benchmark paves the way for effective audio-visual classification
when only limited labeled data is available. Code and data are available at
https://github.com/ExplainableML/AVDIFF-GFSL.
DiffDefense: Defending against Adversarial Attacks via Diffusion Models
September 07, 2023
Hondamunige Prasanna Silva, Lorenzo Seidenari, Alberto Del Bimbo
This paper presents a novel reconstruction method that leverages Diffusion
Models to protect machine learning classifiers against adversarial attacks, all
without requiring any modifications to the classifiers themselves. The
susceptibility of machine learning models to minor input perturbations renders
them vulnerable to adversarial attacks. While diffusion-based methods are
typically disregarded for adversarial defense due to their slow reverse
process, this paper demonstrates that our proposed method offers robustness
against adversarial threats while preserving clean accuracy, speed, and
plug-and-play compatibility. Code at:
https://github.com/HondamunigePrasannaSilva/DiffDefence.
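Purification of this kind usually amounts to pushing the (possibly attacked) input
part-way into the forward noising process and running the reverse chain back to
$t=0$ before classification. A hedged sketch of that recipe with a standard DDPM
ancestral loop; `eps_model` and the noise schedule are assumed to come from a
pretrained diffusion model and are not this repository's API:
```python
import torch

def purify(x_adv, eps_model, alphas_cumprod, t_star=100):
    """Adversarial purification sketch: partial forward noising, then DDPM reverse steps.

    x_adv          : possibly adversarial input batch in [0, 1]
    eps_model      : assumed pretrained noise predictor eps_model(x_t, t)
    alphas_cumprod : cumulative product of (1 - beta_t), shape (T,)
    t_star         : how deep to push into the schedule; large enough to wash out
                     the perturbation, small enough to keep the image semantics
    """
    a_bar = alphas_cumprod
    # forward: q(x_t | x_0) in closed form
    x_t = a_bar[t_star].sqrt() * x_adv + (1 - a_bar[t_star]).sqrt() * torch.randn_like(x_adv)
    # reverse: ancestral DDPM steps from t_star back down to 0
    for t in range(t_star, 0, -1):
        alpha_t = a_bar[t] / a_bar[t - 1]
        eps = eps_model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
        mean = (x_t - (1 - alpha_t) / (1 - a_bar[t]).sqrt() * eps) / alpha_t.sqrt()
        sigma_t = ((1 - a_bar[t - 1]) / (1 - a_bar[t]) * (1 - alpha_t)).sqrt()
        noise = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
        x_t = mean + sigma_t * noise
    return x_t.clamp(0, 1)

# usage sketch (schedule and eps_model come from the pretrained diffusion model):
# betas = torch.linspace(1e-4, 0.02, 1000)
# alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
# logits = classifier(purify(x_adv, eps_model, alphas_cumprod))
```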
Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation
September 07, 2023
Jiaxi Gu, Shicong Wang, Haoyu Zhao, Tianyi Lu, Xing Zhang, Zuxuan Wu, Songcen Xu, Wei Zhang, Yu-Gang Jiang, Hang Xu
Inspired by the remarkable success of Latent Diffusion Models (LDMs) for
image synthesis, we study LDM for text-to-video generation, which is a
formidable challenge due to the computational and memory constraints during
both model training and inference. A single LDM is usually only capable of
generating a very limited number of video frames. Some existing works focus on
separate prediction models for generating more video frames; these, however,
suffer from additional training cost and frame-level jittering. In this paper, we
propose a framework called “Reuse and Diffuse”, dubbed $\textit{VidRD}$, to
produce more frames following the frames already generated by an LDM.
Conditioned on an initial video clip with a small number of frames, additional
frames are iteratively generated by reusing the original latent features and
following the previous diffusion process. Besides, for the autoencoder used for
translation between pixel space and latent space, we inject temporal layers
into its decoder and fine-tune these layers for higher temporal consistency. We
also propose a set of strategies for composing video-text data that involve
diverse content from multiple existing datasets including video datasets for
action recognition and image-text datasets. Extensive experiments show that our
method achieves good results in both quantitative and qualitative evaluations.
Our project page is available
$\href{https://anonymous0x233.github.io/ReuseAndDiffuse/}{here}$.
Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter
September 06, 2023
Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, Dong Xu
Recent research has explored the utilization of pre-trained text-image
discriminative models, such as CLIP, to tackle the challenges associated with
open-vocabulary semantic segmentation. However, it is worth noting that the
alignment process based on contrastive learning employed by these models may
unintentionally result in the loss of crucial localization information and
object completeness, which are essential for achieving accurate semantic
segmentation. More recently, there has been an emerging interest in extending
the application of diffusion models beyond text-to-image generation tasks,
particularly in the domain of semantic segmentation. These approaches utilize
diffusion models either for generating annotated data or for extracting
features to facilitate semantic segmentation. This typically involves training
segmentation models by generating a considerable amount of synthetic data or
incorporating additional mask annotations. To this end, we uncover the
potential of generative text-to-image conditional diffusion models as highly
efficient open-vocabulary semantic segmenters, and introduce a novel
training-free approach named DiffSegmenter. Specifically, by feeding an input
image and candidate classes into an off-the-shelf pre-trained conditional
latent diffusion model, the cross-attention maps produced by the denoising
U-Net are directly used as segmentation scores, which are further refined and
completed by the followed self-attention maps. Additionally, we carefully
design effective textual prompts and a category filtering mechanism to further
enhance the segmentation results. Extensive experiments on three benchmark
datasets show that the proposed DiffSegmenter achieves impressive results for
open-vocabulary semantic segmentation.
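The scoring rule lends itself to a compact sketch: cross-attention maps provide
per-class evidence at each spatial location, and the self-attention map, read as a
pixel-affinity matrix, propagates and completes that evidence. Random tensors
stand in below for maps extracted from the denoising U-Net, and the single matrix
product is a simplification of the paper's refinement:
```python
import torch

def diffusion_segmenter_scores(cross_attn, self_attn):
    """cross_attn: (num_pixels, num_classes) attention from image latents to class tokens.
       self_attn : (num_pixels, num_pixels) attention among image latents.

    Self-attention acts as an affinity matrix that refines and completes the raw
    cross-attention scores before the per-pixel argmax.
    """
    affinity = self_attn.softmax(dim=-1)              # row-normalized pixel affinities
    refined = affinity @ cross_attn                   # propagate class evidence to similar pixels
    return refined.softmax(dim=-1)                    # per-pixel class distribution

# toy usage: a 32x32 latent grid (1024 pixels), 3 candidate classes
scores = diffusion_segmenter_scores(torch.randn(1024, 3), torch.randn(1024, 1024))
segmentation = scores.argmax(dim=-1).reshape(32, 32)
```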
Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation
September 06, 2023
Hyunwoo Ryu, Jiwoo Kim, Hyunseok An, Junwoo Chang, Joohwan Seo, Taehan Kim, Yubin Kim, Chaewon Hwang, Jongeun Choi, Roberto Horowitz
Diffusion generative modeling has become a promising approach for learning
robotic manipulation tasks from stochastic human demonstrations. In this paper,
we present Diffusion-EDFs, a novel SE(3)-equivariant diffusion-based approach
for visual robotic manipulation tasks. We show that our proposed method
achieves remarkable data efficiency, requiring only 5 to 10 human
demonstrations for effective end-to-end training in less than an hour.
Furthermore, our benchmark experiments demonstrate that our approach has
superior generalizability and robustness compared to state-of-the-art methods.
Lastly, we validate our methods with real hardware experiments. Project
Website: https://sites.google.com/view/diffusion-edfs/home
Diffusion on the Probability Simplex
September 05, 2023
Griffin Floto, Thorsteinn Jonsson, Mihai Nica, Scott Sanner, Eric Zhengyu Zhu
Diffusion models learn to reverse the progressive noising of a data
distribution to create a generative model. However, the desired continuous
nature of the noising process can be at odds with discrete data. To deal with
this tension between continuous and discrete objects, we propose a method of
performing diffusion on the probability simplex. Using the probability simplex
naturally creates an interpretation where points correspond to categorical
probability distributions. Our method uses the softmax function applied to an
Ornstein-Uhlenbeck process, a well-known stochastic differential equation. We
find that our methodology also naturally extends to include diffusion on the
unit cube which has applications for bounded image generation.
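The construction is compact: run an Ornstein-Uhlenbeck process $dx_t = -\theta
x_t\,dt + \sigma\,dW_t$ in $\mathbb{R}^K$ and view it through the softmax map, so
every state is a categorical distribution on the simplex. A small NumPy simulation
of the forward (noising) direction only, with illustrative parameter values; the
generative model in the paper learns to reverse this process:
```python
import numpy as np

def simulate_simplex_forward(x0, theta=1.0, sigma=1.0, dt=1e-2, n_steps=500, seed=0):
    """Forward noising on the probability simplex: an Ornstein-Uhlenbeck process
    in R^K simulated with Euler-Maruyama, viewed through softmax at every step."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    path = []
    for _ in range(n_steps):
        x = x - theta * x * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.shape)
        p = np.exp(x - x.max())
        path.append(p / p.sum())          # point on the simplex = categorical distribution
    return np.array(path)

# start near the one-hot vertex for class 0 of a 4-category variable
traj = simulate_simplex_forward(x0=[5.0, 0.0, 0.0, 0.0])
print(traj[-1])   # the heavily noised categorical distribution
```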
Diffusion-based 3D Object Detection with Random Boxes
September 05, 2023
Xin Zhou, Jinghua Hou, Tingting Yao, Dingkang Liang, Zhe Liu, Zhikang Zou, Xiaoqing Ye, Jianwei Cheng, Xiang Bai
3D object detection is an essential task for achieving autonomous driving.
Existing anchor-based detection methods rely on empirical heuristics for setting
anchors, which makes the algorithms lack elegance. In recent years, we have
witnessed the rise of several generative models, among which diffusion models
show great potential for learning the transformation of two distributions. Our
proposed Diff3Det migrates the diffusion model to proposal generation for 3D
object detection by considering the detection boxes as generative targets.
During training, the object boxes diffuse from the ground truth boxes to the
Gaussian distribution, and the decoder learns to reverse this noise process. In
the inference stage, the model progressively refines a set of random boxes to
the prediction results. We provide detailed experiments on the KITTI benchmark
and achieve promising performance compared to classical anchor-based 3D
detection methods.
Diffusion Generative Inverse Design
September 05, 2023
Marin Vlastelica, Tatiana López-Guevara, Kelsey Allen, Peter Battaglia, Arnaud Doucet, Kimberley Stachenfeld
Inverse design refers to the problem of optimizing the input of an objective
function in order to enact a target outcome. For many real-world engineering
problems, the objective function takes the form of a simulator that predicts
how the system state will evolve over time, and the design challenge is to
optimize the initial conditions that lead to a target outcome. Recent
developments in learned simulation have shown that graph neural networks (GNNs)
can be used for accurate, efficient, differentiable estimation of simulator
dynamics, and support high-quality design optimization with gradient- or
sampling-based optimization procedures. However, optimizing designs from
scratch requires many expensive model queries, and these procedures exhibit
basic failures on either non-convex or high-dimensional problems. In this work,
we show how denoising diffusion models (DDMs) can be used to solve inverse
design problems efficiently and propose a particle sampling algorithm for
further improving their efficiency. We perform experiments on a number of fluid
dynamics design challenges, and find that our approach substantially reduces
the number of calls to the simulator compared to standard techniques.
sasdim: self-adaptive noise scaling diffusion model for spatial time series imputation
September 05, 2023
Shunyang Zhang, Senzhang Wang, Xianzhen Tan, Ruochen Liu, Jian Zhang, Jianxin Wang
Spatial time series imputation is critically important to many real
applications such as intelligent transportation and air quality monitoring.
Although recent transformer and diffusion model based approaches have achieved
significant performance gains compared with conventional statistic based
methods, spatial time series imputation still remains a challenging issue
due to the complex spatio-temporal dependencies and the noise uncertainty of
the spatial time series data. Especially, recent diffusion process based models
may introduce random noise to the imputations, and thus cause negative impact
on the model performance. To this end, we propose a self-adaptive noise scaling
diffusion model named SaSDim to more effectively perform spatial time series
imputation. Specifically, we propose a new loss function that can scale the noise
to a similar intensity, and propose an across spatial-temporal global
convolution module to more effectively capture the dynamic spatial-temporal
dependencies. Extensive experiments conducted on three real world datasets
verify the effectiveness of SaSDim by comparison with current state-of-the-art
baselines.
Gradient Domain Diffusion Models for Image Synthesis
September 05, 2023
Yuanhao Gong
cs.CV, cs.LG, cs.MM, cs.PF, eess.IV
Diffusion models are getting popular in generative image and video synthesis.
However, due to the diffusion process, they require a large number of steps to
converge. To tackle this issue, in this paper, we propose to perform the
diffusion process in the gradient domain, where the convergence becomes faster.
There are two reasons. First, thanks to the Poisson equation, the gradient
domain is mathematically equivalent to the original image domain. Therefore,
each diffusion step in the image domain has a unique corresponding gradient
domain representation. Second, the gradient domain is much sparser than the
image domain. As a result, gradient domain diffusion models converge faster.
Several numerical experiments confirm that the gradient domain diffusion models
are more efficient than the original diffusion models. The proposed method can
be applied in a wide range of applications such as image processing, computer
vision and machine learning tasks.
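The claimed equivalence via the Poisson equation is easy to make explicit: from
the gradient field one recovers the image, up to its lost mean, by solving
$\nabla^2 u = \mathrm{div}(g_x, g_y)$, which has a closed-form FFT solution under
periodic boundaries. A NumPy sketch of that round trip (the diffusion model itself
would operate on $g_x, g_y$; the boundary handling here is the simplest possible
choice):
```python
import numpy as np

def image_to_gradients(img):
    gx = np.roll(img, -1, axis=1) - img       # forward differences, periodic boundary
    gy = np.roll(img, -1, axis=0) - img
    return gx, gy

def gradients_to_image(gx, gy):
    """Recover the image from its gradient field by solving the Poisson equation with FFTs."""
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))   # divergence of (gx, gy)
    h, w = div.shape
    wx = 2 * np.cos(2 * np.pi * np.arange(w) / w) - 2     # eigenvalues of the discrete Laplacian
    wy = 2 * np.cos(2 * np.pi * np.arange(h) / h) - 2
    denom = wy[:, None] + wx[None, :]
    denom[0, 0] = 1.0                          # the DC component is unrecoverable
    u_hat = np.fft.fft2(div) / denom
    u_hat[0, 0] = 0.0                          # fix the lost mean to zero
    return np.real(np.fft.ifft2(u_hat))

img = np.random.rand(64, 64)
rec = gradients_to_image(*image_to_gradients(img))
print(np.allclose(rec, img - img.mean(), atol=1e-8))   # exact up to the lost mean
```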
Softmax Bias Correction for Quantized Generative Models
September 04, 2023
Nilesh Prasad Pandey, Marios Fournarakis, Chirag Patel, Markus Nagel
Post-training quantization (PTQ) is the go-to compression technique for large
generative models, such as stable diffusion or large language models. PTQ
methods commonly keep the softmax activation in higher precision as it has been
shown to be very sensitive to quantization noise. However, this can lead to a
significant runtime and power overhead during inference on resource-constrained
edge devices. In this work, we investigate the source of the softmax
sensitivity to quantization and show that the quantization operation leads to a
large bias in the softmax output, causing accuracy degradation. To overcome
this issue, we propose an offline bias correction technique that improves the
quantizability of softmax without additional compute during deployment, as it
can be readily absorbed into the quantization parameters. We demonstrate the
effectiveness of our method on stable diffusion v1.5 and 125M-size OPT language
model, achieving significant accuracy improvement for 8-bit quantized softmax.
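The correction itself is straightforward to sketch: on a small calibration set,
measure the average offset the quantizer introduces on the softmax output and
subtract that constant at deployment, where it can be folded into the output
quantization parameters. The crude uniform quantizer below is a stand-in for a
real PTQ pipeline, and the scalar correction is only one of several granularities
one could choose:
```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fake_quant(x, n_bits=8):
    """Crude uniform quantize-dequantize for values in [0, 1] (softmax outputs)."""
    levels = 2 ** n_bits - 1
    return np.round(x * levels) / levels

# Offline calibration: estimate the average offset the quantizer introduces.
rng = np.random.default_rng(0)
calib_logits = rng.normal(size=(512, 64))          # stand-in for attention logits
p_fp = softmax(calib_logits)
bias = (fake_quant(p_fp) - p_fp).mean()            # scalar correction, computed once

# Deployment: subtract the precomputed bias from the quantized softmax; in a real
# pipeline this constant is absorbed into the output quantization parameters.
test_logits = rng.normal(size=(8, 64))
p_corrected = fake_quant(softmax(test_logits)) - bias
```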
Relay Diffusion: Unifying diffusion process across resolutions for image synthesis
September 04, 2023
Jiayan Teng, Wendi Zheng, Ming Ding, Wenyi Hong, Jianqiao Wangni, Zhuoyi Yang, Jie Tang
Diffusion models have achieved great success in image synthesis but still face
challenges in high-resolution generation. Through the lens of discrete cosine
transformation, we find the main reason is that \emph{the same noise level on a
higher resolution results in a higher Signal-to-Noise Ratio in the frequency
domain}. In this work, we present Relay Diffusion Model (RDM), which transfers
a low-resolution image or noise into an equivalent high-resolution one for
diffusion model via blurring diffusion and block noise. Therefore, the
diffusion process can continue seamlessly in any new resolution or model
without restarting from pure noise or low-resolution conditioning. RDM achieves
state-of-the-art FID on CelebA-HQ and sFID on ImageNet 256$\times$256,
surpassing previous works such as ADM, LDM and DiT by a large margin. All the
codes and checkpoints are open-sourced at
\url{https://github.com/THUDM/RelayDiffusion}.
GenSelfDiff-HIS: Generative Self-Supervision Using Diffusion for Histopathological Image Segmentation
September 04, 2023
Vishnuvardhan Purma, Suhas Srinath, Seshan Srirangarajan, Aanchal Kakkar, Prathosh A. P
Histopathological image segmentation is a laborious and time-intensive task,
often requiring analysis from experienced pathologists for accurate
examinations. To reduce this burden, supervised machine-learning approaches
have been adopted using large-scale annotated datasets for histopathological
image analysis. However, in several scenarios, the availability of large-scale
annotated data is a bottleneck while training such models. Self-supervised
learning (SSL) is an alternative paradigm that provides some respite by
constructing models utilizing only the unannotated data which is often
abundant. The basic idea of SSL is to train a network to perform one or many
pseudo or pretext tasks on unannotated data and use it subsequently as the
basis for a variety of downstream tasks. It is seen that the success of SSL
depends critically on the considered pretext task. While there have been many
efforts in designing pretext tasks for classification problems, there haven’t
been many attempts at SSL for histopathological segmentation. Motivated by
this, we propose an SSL approach for segmenting histopathological images via
generative diffusion models in this paper. Our method is based on the
observation that diffusion models effectively solve an image-to-image
translation task akin to a segmentation task. Hence, we propose generative
diffusion as the pretext task for histopathological image segmentation. We also
propose a multi-loss function-based fine-tuning for the downstream task. We
validate our method using several metrics on two publicly available datasets
along with a newly proposed head and neck (HN) cancer dataset containing
hematoxylin and eosin (H\&E) stained images along with annotations. Codes will
be made public at
https://github.com/PurmaVishnuVardhanReddy/GenSelfDiff-HIS.git.
FinDiff: Diffusion Models for Financial Tabular Data Generation
September 04, 2023
Timur Sattarov, Marco Schreyer, Damian Borth
The sharing of microdata, such as fund holdings and derivative instruments,
by regulatory institutions presents a unique challenge due to strict data
confidentiality and privacy regulations. These challenges often hinder the
ability of both academics and practitioners to conduct collaborative research
effectively. The emergence of generative models, particularly diffusion models,
capable of synthesizing data mimicking the underlying distributions of
real-world data presents a compelling solution. This work introduces ‘FinDiff’,
a diffusion model designed to generate real-world financial tabular data for a
variety of regulatory downstream tasks, for example, economic scenario modeling,
stress tests, and fraud detection. The model uses embedding encodings to model
mixed modality financial data, comprising both categorical and numeric
attributes. The performance of FinDiff in generating synthetic tabular
financial data is evaluated against state-of-the-art baseline models using
three real-world financial datasets (including two publicly available datasets
and one proprietary dataset). Empirical results demonstrate that FinDiff excels
in generating synthetic tabular financial data with high fidelity, privacy, and
utility.
Accelerating Markov Chain Monte Carlo sampling with diffusion models
September 04, 2023
N. T. Hunt-Smith, W. Melnitchouk, F. Ringer, N. Sato, A. W. Thomas, M. J. White
Global fits of physics models require efficient methods for exploring
high-dimensional and/or multimodal posterior functions. We introduce a novel
method for accelerating Markov Chain Monte Carlo (MCMC) sampling by pairing a
Metropolis-Hastings algorithm with a diffusion model that can draw global
samples with the aim of approximating the posterior. We briefly review
diffusion models in the context of image synthesis before providing a
streamlined diffusion model tailored towards low-dimensional data arrays. We
then present our adapted Metropolis-Hastings algorithm which combines local
proposals with global proposals taken from a diffusion model that is regularly
trained on the samples produced during the MCMC run. Our approach leads to a
significant reduction in the number of likelihood evaluations required to
obtain an accurate representation of the Bayesian posterior across several
analytic functions, as well as for a physical example based on a global
analysis of parton distribution functions. Our method is extensible to other
MCMC techniques, and we briefly compare our method to similar approaches based
on normalizing flows. A code implementation can be found at
https://github.com/NickHunt-Smith/MCMC-diffusion.
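The sampler structure is standard Metropolis-Hastings with two proposal types: a
symmetric local random walk, whose proposal densities cancel, and an occasional
global independence proposal, whose density must enter the acceptance ratio. In
the sketch below a wide Gaussian with a known log-density stands in for the
periodically retrained diffusion model (whose proposal density the paper has to
approximate); the target, dimensions, and mixing probability are illustrative:
```python
import numpy as np

def log_target(x):
    """Toy multimodal posterior: an equal mixture of two Gaussians."""
    a = -0.5 * np.sum((x - 3.0) ** 2)
    b = -0.5 * np.sum((x + 3.0) ** 2)
    return np.logaddexp(a, b)

def log_global_proposal(x, scale=5.0):
    # Stand-in density for the global proposal. In the paper this role is played by
    # a diffusion model retrained on past chain samples; here, a wide Gaussian.
    return -0.5 * np.sum(x ** 2) / scale ** 2 - x.size * np.log(scale * np.sqrt(2 * np.pi))

def draw_global_proposal(dim, rng, scale=5.0):
    return rng.normal(scale=scale, size=dim)

def mh_with_global_proposals(n_steps=20000, dim=2, local_std=0.5, p_global=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    chain = []
    for _ in range(n_steps):
        if rng.random() < p_global:
            # independence proposal: the proposal density enters the acceptance ratio
            y = draw_global_proposal(dim, rng)
            log_alpha = (log_target(y) + log_global_proposal(x)
                         - log_target(x) - log_global_proposal(y))
        else:
            # symmetric local random walk: proposal densities cancel
            y = x + rng.normal(scale=local_std, size=dim)
            log_alpha = log_target(y) - log_target(x)
        if np.log(rng.random()) < log_alpha:
            x = y
        chain.append(x)
    return np.array(chain)

samples = mh_with_global_proposals()   # the global moves let the chain hop between modes
```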
Diffusion Models with Deterministic Normalizing Flow Priors
September 03, 2023
Mohsen Zand, Ali Etemad, Michael Greenspan
For faster sampling and higher sample quality, we propose DiNof
($\textbf{Di}$ffusion with $\textbf{No}$rmalizing $\textbf{f}$low priors), a
technique that makes use of normalizing flows and diffusion models. We use
normalizing flows to parameterize the noisy data at any arbitrary step of the
diffusion process and utilize it as the prior in the reverse diffusion process.
More specifically, the forward noising process turns a data distribution into
partially noisy data, which are subsequently transformed into a Gaussian
distribution by a nonlinear process. The backward denoising procedure begins
with a prior created by sampling from the Gaussian distribution and applying
the invertible normalizing flow transformations deterministically. To generate
the data distribution, the prior then undergoes the remaining diffusion
stochastic denoising procedure. Through the reduction of the number of total
diffusion steps, we are able to speed up both the forward and backward
processes. More importantly, we improve the expressive power of diffusion
models by employing both deterministic and stochastic mappings. Experiments on
standard image generation datasets demonstrate the advantage of the proposed
method over existing approaches. On the unconditional CIFAR10 dataset, for
example, we achieve an FID of 2.01 and an Inception score of 9.96. Our method
also demonstrates competitive performance on CelebA-HQ-256 dataset as it
obtains an FID score of 7.11. Code is available at
https://github.com/MohsenZand/DiNof.
NADiffuSE: Noise-aware Diffusion-based Model for Speech Enhancement
September 03, 2023
Wen Wang, Dongchao Yang, Qichen Ye, Bowen Cao, Yuexian Zou
The goal of speech enhancement (SE) is to eliminate the background
interference from the noisy speech signal. Generative models such as diffusion
models (DM) have been applied to the task of SE because of better
generalization in unseen noisy scenes. Technical routes for the DM-based SE
methods can be summarized into three types: task-adapted diffusion process
formulation, generator-plus-conditioner (GPC) structures and the multi-stage
frameworks. We focus on the first two approaches, which are constructed under
the GPC architecture and use the task-adapted diffusion process to better deal
with the real noise. However, the performance of these SE models is limited by
the following issues: (a) Non-Gaussian noise estimation in the task-adapted
diffusion process. (b) Conditional domain bias caused by the weak conditioner
design in the GPC structure. (c) Large amount of residual noise caused by
unreasonable interpolation operations during inference. To solve the above
problems, we propose a noise-aware diffusion-based SE model (NADiffuSE) to
boost the SE performance, where the noise representation is extracted from the
noisy speech signal and introduced as a global conditional information for
estimating the non-Gaussian components. Furthermore, the anchor-based inference
algorithm is employed to achieve a compromise between the speech distortion and
noise residual. In order to mitigate the performance degradation caused by the
conditional domain bias in the GPC framework, we investigate three model
variants, all of which can be viewed as multi-stage SE based on the
preprocessing networks for Mel spectrograms. Experimental results show that
NADiffuSE outperforms other DM-based SE models under the GPC infrastructure.
Audio samples are available at: https://square-of-w.github.io/NADiffuSE-demo/.
VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders
September 03, 2023
Xuyang Liu, Siteng Huang, Yachen Kang, Honggang Chen, Donglin Wang
Large-scale text-to-image diffusion models have shown impressive capabilities
across various generative tasks, enabled by strong vision-language alignment
obtained through pre-training. However, most vision-language discriminative
tasks require extensive fine-tuning on carefully-labeled datasets to acquire
such alignment, with great cost in time and computing resources. In this work,
we explore directly applying a pre-trained generative diffusion model to the
challenging discriminative task of visual grounding without any fine-tuning and
additional training dataset. Specifically, we propose VGDiffZero, a simple yet
effective zero-shot visual grounding framework based on text-to-image diffusion
models. We also design a comprehensive region-scoring method considering both
global and local contexts of each isolated proposal. Extensive experiments on
RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong
performance on zero-shot visual grounding.
DiCLET-TTS: Diffusion Model based Cross-lingual Emotion Transfer for Text-to-Speech – A Study between English and Mandarin
September 02, 2023
Tao Li, Chenxu Hu, Jian Cong, Xinfa Zhu, Jingbei Li, Qiao Tian, Yuping Wang, Lei Xie
While the performance of cross-lingual TTS based on monolingual corpora has
been significantly improved recently, generating cross-lingual speech still
suffers from the foreign accent problem, leading to limited naturalness.
Besides, current cross-lingual methods ignore modeling emotion, which is
indispensable paralinguistic information in speech delivery. In this paper, we
propose DiCLET-TTS, a Diffusion model based Cross-Lingual Emotion Transfer
method that can transfer emotion from a source speaker to the intra- and
cross-lingual target speakers. Specifically, to relieve the foreign accent
problem while improving the emotion expressiveness, the terminal distribution
of the forward diffusion process is parameterized into a speaker-irrelevant but
emotion-related linguistic prior by a prior text encoder with the emotion
embedding as a condition. To address the weaker emotional expressiveness
problem caused by speaker disentanglement in emotion embedding, a novel
orthogonal projection based emotion disentangling module (OP-EDM) is proposed
to learn the speaker-irrelevant but emotion-discriminative embedding. Moreover,
a condition-enhanced DPM decoder is introduced to strengthen the modeling
ability of the speaker and the emotion in the reverse diffusion process to
further improve emotion expressiveness in speech delivery. Cross-lingual
emotion transfer experiments show the superiority of DiCLET-TTS over various
competitive models and the good design of OP-EDM in learning speaker-irrelevant
but emotion-discriminative embedding.
September 02, 2023
Yu Guan, Chuanming Yu, Shiyu Lu, Zhuoxu Cui, Dong Liang, Qiegen Liu
Most existing MRI reconstruction methods perform targeted reconstruction of
the entire MR image without taking specific tissue regions into consideration.
This may fail to emphasize the reconstruction accuracy on important tissues
for diagnosis. In this study, leveraging a combination of the properties of
k-space data and the diffusion process, our novel scheme focuses on mining the
multi-frequency prior with different strategies to preserve fine texture
details in the reconstructed image. In addition, a diffusion process can
converge more quickly if its target distribution closely resembles the noise
distribution in the process. This can be accomplished through various
high-frequency prior extractors. This finding further solidifies the
effectiveness of the score-based generative model. On top of all the
advantages, our method improves the accuracy of MRI reconstruction and
accelerates the sampling process. Experimental results verify that the proposed
method successfully obtains more accurate reconstruction and outperforms
state-of-the-art methods.
Diffusion Modeling with Domain-conditioned Prior Guidance for Accelerated MRI and qMRI Reconstruction
September 02, 2023
Wanyu Bian, Albert Jang, Fang Liu
This study introduces a novel approach for image reconstruction based on a
diffusion model conditioned on the native data domain. Our method is applied to
multi-coil MRI and quantitative MRI reconstruction, leveraging the
domain-conditioned diffusion model within the frequency and parameter domains.
The prior MRI physics are used as embeddings in the diffusion model, enforcing
data consistency to guide the training and sampling process, characterizing MRI
k-space encoding in MRI reconstruction, and leveraging MR signal modeling for
qMRI reconstruction. Furthermore, a gradient descent optimization is
incorporated into the diffusion steps, enhancing feature learning and improving
denoising. The proposed method demonstrates significant promise, particularly
for reconstructing images at high acceleration factors. Notably, it maintains
great reconstruction accuracy and efficiency for static and quantitative MRI
reconstruction across diverse anatomical structures. Beyond its immediate
applications, this method provides potential generalization capability, making
it adaptable to inverse problems across various domains.
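For readers unfamiliar with the data-consistency ingredient mentioned above, its
generic multi-coil form is a gradient step on the acquisition model: with forward
operator $A = MFS$ (coil sensitivities $S$, Fourier transform $F$, k-space
sampling mask $M$) and measured k-space $y$, each denoising step is followed by
$x \leftarrow x - \lambda A^{H}(Ax - y)$. This is only the textbook form of such a
step; the paper's domain-conditioned embeddings, qMRI signal model, and choice of
step size $\lambda$ are not reproduced here.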