Gang LIU Xin CHEN Zhixiang GAO
Photo animation transforms photos of real-world scenes into anime-style images, which is a challenging task in AIGC (AI-generated content). Although previous methods have achieved promising results, they often introduce noticeable artifacts or distortions. In this paper, we propose a novel double-tail generative adversarial network (DTGAN) for fast photo animation. DTGAN is the third version of the AnimeGAN series and is therefore also called AnimeGANv3. The generator of DTGAN has two output tails: a support tail for outputting coarse-grained anime-style images and a main tail for refining them. In DTGAN, we propose a novel learnable normalization technique, termed linearly adaptive denormalization (LADE), to prevent artifacts in the generated images. To improve the visual quality of the generated anime-style images, two novel loss functions suitable for photo animation are proposed: 1) a region smoothing loss, which weakens the texture details of the generated images to achieve anime effects with abstract details; and 2) a fine-grained revision loss, which eliminates artifacts and noise in the generated anime-style images while preserving clear edges. Furthermore, the generator of DTGAN is a lightweight framework with only 1.02 million parameters in the inference phase. The proposed DTGAN can easily be trained end-to-end with unpaired training data. Extensive experiments demonstrate, both qualitatively and quantitatively, that our method produces high-quality anime-style images from real-world photos and performs better than state-of-the-art models.
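As a rough illustration of the LADE idea, the following PyTorch sketch instance-normalizes a feature map and then denormalizes it with affine parameters predicted by learnable 1×1 convolutions of the input itself; this reading of "linearly adaptive denormalization" is an assumption, not the paper's exact formulation.

```python
# Hedged sketch of a LADE-style layer; the exact design is an assumption.
import torch
import torch.nn as nn

class LADE(nn.Module):
    def __init__(self, channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        # Affine parameters are predicted linearly (1x1 conv) from the input.
        self.to_gamma = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mean = x.mean(dim=(2, 3), keepdim=True)
        var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
        x_norm = (x - mean) / torch.sqrt(var + self.eps)    # instance norm
        return self.to_gamma(x) * x_norm + self.to_beta(x)  # denormalization
```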
Xing ZHU Yuxuan LIU Lingyu LIANG Tao WANG Zuoyong LI Qiaoming DENG Yubo LIU
Recently, many deep-learning-based AI-aided layout design systems have been developed to reduce tedious manual intervention. However, most methods focus on a specific generation task. This paper explores the challenging problem of multiple layout design generation (LDG), which generates floor plans or urban plans from a boundary input under a unified framework. One of the main challenges of multiple LDG is obtaining reasonable topological structures in the generated layouts, given irregular boundaries and layout elements that vary across design types. This paper formulates the multiple LDG task as an image-to-image translation problem and proposes a conditional generative adversarial network (GAN) with adaptive modules, called LDGAN. The framework of LDGAN is based on a generator-discriminator architecture, where the generator integrates conditional convolution constrained by the boundary input and an attention module with channel and spatial features. Qualitative and quantitative experiments were conducted on the SCUT-AutoALP and RPLAN datasets, and the comparison with state-of-the-art methods illustrates the effectiveness and superiority of the proposed LDGAN.
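One plausible form of a convolution "constrained by the boundary input" is spatial modulation of the convolution output by features derived from the boundary map; the sketch below is a hypothetical rendering, not the paper's exact operator.

```python
# Hypothetical boundary-conditioned convolution (SPADE-like modulation).
import torch
import torch.nn as nn

class BoundaryConditionedConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, cond_ch: int):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # Per-pixel scale and shift predicted from the boundary features.
        self.scale = nn.Conv2d(cond_ch, out_ch, 3, padding=1)
        self.shift = nn.Conv2d(cond_ch, out_ch, 3, padding=1)

    def forward(self, x, boundary_feat):
        h = self.conv(x)
        return h * (1 + self.scale(boundary_feat)) + self.shift(boundary_feat)
```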
Shi-Long SHEN Ai-Guo WU Yong XU
This paper presents a generative model for two types of person image generation. First, the model is applied to pose-guided person image generation, i.e., converting the pose of a source person image to a target pose while preserving the texture of the source image. Second, the model is also used for clothing-guided person image generation, i.e., changing the clothing texture of a source person image to a desired clothing texture. The core idea of the proposed model is to establish a multi-scale correspondence, which effectively addresses the misalignment introduced by transferring pose and thereby preserves richer appearance information. Specifically, the proposed model consists of two stages: 1) it first generates a target semantic map imposed on the target pose to provide more accurate guidance during generation; 2) after obtaining multi-scale feature maps from the encoder, the multi-scale correspondence is established, which is useful for fine-grained generation. Experimental results show that the proposed method is superior to state-of-the-art methods in pose-guided person image generation and demonstrate its effectiveness in clothing-guided person image generation.
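A hypothetical sketch of one scale of such a correspondence, realized here as attention from target-pose features to source features (the attention form is an assumption; the abstract does not state the exact operator):

```python
# One scale of a feature correspondence via attention (assumed form).
import torch
import torch.nn.functional as F

def correspondence(source_feat: torch.Tensor, target_feat: torch.Tensor):
    # source_feat, target_feat: (B, C, H, W) feature maps at one scale.
    b, c, h, w = source_feat.shape
    src = source_feat.flatten(2)                      # (B, C, HW)
    tgt = target_feat.flatten(2)                      # (B, C, HW)
    attn = F.softmax(tgt.transpose(1, 2) @ src / c ** 0.5, dim=-1)
    warped = (attn @ src.transpose(1, 2)).transpose(1, 2)
    return warped.view(b, c, h, w)  # source appearance aligned to the target
```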
In this paper, we propose an improved generative adversarial network with an attention module in the generator, which enhances the generator's effectiveness. Furthermore, recent work has shown that generator conditioning affects GAN performance. Leveraging this insight, we explore the effect of different normalizations (spectral normalization, instance normalization) on the generator and discriminator. Moreover, an enhanced loss function based on the Wasserstein divergence distance alleviates the difficulty of training GANs in practice.
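For reference, a minimal sketch of a Wasserstein-divergence discriminator loss; the constants k = 2 and p = 6 are the values commonly used in the WGAN-div literature, and their use here is an assumption.

```python
# WGAN-div style discriminator loss for 4D image batches (sketch).
import torch

def wgan_div_d_loss(D, real, fake, k: float = 2.0, p: float = 6.0):
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    penalty = k * grad.flatten(1).norm(2, dim=1).pow(p).mean()
    return D(fake).mean() - D(real).mean() + penalty
```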
Junki OSHIBA Motoi IWATA Koichi KISE
Recently, deep learning for guided image generation has been progressing. Many methods have been proposed to generate an animation of facial expression change from a single face image by transferring facial expression information to that image. In particular, methods that use facial landmarks as facial expression information can generate a wide variety of facial expressions. However, most methods focus on humans rather than anime characters. We attempted to apply several existing methods to anime characters by training them on an anime character face dataset; however, they generated images with noise, even in regions where there was no change. The first order motion model (FOMM) is an image generation method that takes two images as input and transfers the facial expression or pose of one to the other. By explicitly calculating the difference between the two images based on optical flow, FOMM can generate images with little noise in the unchanged regions. We focus on the face image generation aspect of FOMM. However, FOMM cannot use a facial landmark as a facial expression target, because the appearances of a face image and a facial landmark are quite different. Therefore, we propose an advanced FOMM method that uses facial landmarks as facial expression targets. In the proposed method, we change the input data and data flow to accommodate facial landmarks. Additionally, to generate face images whose expressions follow the target landmarks more closely, we introduce a landmark estimation loss, computed by comparing the landmarks detected in the generated image with the target landmarks. Our experiments on an anime character face image dataset demonstrate that our method is effective for landmark-guided face image generation for anime characters. Furthermore, our method outperformed other methods quantitatively and generated face images with less noise.
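A minimal sketch of the landmark estimation loss described above, assuming a differentiable landmark detector and an L1 distance (the distance metric is an assumption):

```python
# Landmark estimation loss: compare landmarks detected on the generated
# image with the target landmarks (L1 distance assumed).
import torch.nn.functional as F

def landmark_estimation_loss(detector, generated_image, target_landmarks):
    predicted = detector(generated_image)   # (B, K, 2) landmark coordinates
    return F.l1_loss(predicted, target_landmarks)
```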
Weiguo ZHANG Jiaqi LU Jing ZHANG Xuewen LI Qi ZHAO
Haze seriously degrades the quality of license plate recognition and reduces the performance of visual processing algorithms. To improve the quality of hazy images, this paper proposes a license plate recognition algorithm for hazy weather. The algorithm consists of two parts. The first part is MPGAN image dehazing, which uses a generative adversarial network to dehaze the image and combines multi-scale convolution with a perceptual loss: multi-scale convolution is conducive to better feature extraction, and the perceptual loss makes up for the shortcoming that the mean square error (MSE) is greatly affected by outliers. The second part is license plate recognition: YOLOv3 first locates the license plate, an STN network rectifies it, and an improved LPRNet finally extracts the license plate information. Experimental results show that the proposed dehazing model achieves good results, with PSNR and SSIM scores better than other representative algorithms. In a comparison with the LPRNet algorithm, the proposed license plate recognition pipeline reaches an average accuracy of 93.9%.
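A sketch of a VGG-based perceptual loss of the kind the dehazing stage combines with multi-scale convolution; the choice of VGG-16 up to relu3_3 and the MSE distance are assumptions.

```python
# Perceptual loss on frozen VGG-16 features (sketch; layer choice assumed).
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16]
        for p in vgg.parameters():
            p.requires_grad_(False)        # feature extractor stays fixed
        self.vgg = vgg.eval()
        self.criterion = nn.MSELoss()

    def forward(self, dehazed, clear):
        return self.criterion(self.vgg(dehazed), self.vgg(clear))
```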
In this letter, we propose a dehazing algorithm based on a deep neural network and semi-supervised learning. The dehazing network uses a pyramidal architecture to recover the haze-free scene from a single hazy image in coarse-to-fine order. To faithfully restore objects at different scales, we incorporate cascaded multi-scale convolutional blocks into each level of the pyramid. Feature fusion and transfer in the network are achieved through paths constructed by interleaved residual connections. For better generalization to the complicated haze of real-world environments, we also devise a discriminator that enables semi-supervised adversarial training. Experimental results demonstrate that the proposed method outperforms comparable ones, with higher quantitative metrics and more visually pleasing outputs. It can also enhance the robustness of object detection under haze.
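A hypothetical sketch of a cascaded multi-scale convolutional block with a residual connection (the parallel 1×1/3×3/5×5 branches are an assumption; the abstract does not give the exact design):

```python
# Multi-scale convolutional block with residual connection (assumed design).
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, k, padding=k // 2) for k in (1, 3, 5))
        self.fuse = nn.Conv2d(3 * ch, ch, kernel_size=1)

    def forward(self, x):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(out)          # residual connection
```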
The spectral envelope is a speech parameter that significantly affects vocoder quality. The Vector Quantized Variational AutoEncoder (VQ-VAE) is a state-of-the-art end-to-end quantization method based on deep learning. This paper proposes a new technique, called VQ-VAE-EMGAN, that improves the embedding-space learning of VQ-VAE with a generative adversarial network for quantizing the spectral envelope parameters. In experiments, we designed the quantizer for the spectral envelope parameters of the WORLD vocoder extracted from 16 kHz speech waveforms. The results show that, compared to the conventional VQ-VAE, the proposed technique reduces the log spectral distortion (LSD) by around 0.5 dB and increases the PESQ by around 0.17 on average over the four target bit operating points.
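At the core of the method is the standard VQ-VAE codebook lookup; a minimal sketch with a straight-through estimator follows (the codebook size and dimensionality are arbitrary examples).

```python
# VQ-VAE vector quantizer with straight-through gradients (sketch).
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z_e):                             # z_e: (B, T, dim)
        flat = z_e.reshape(-1, z_e.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)  # distances to codes
        idx = dist.argmin(dim=-1).view(z_e.shape[:-1])
        z_q = self.codebook(idx)
        # Straight-through estimator: copy gradients back to the encoder.
        return z_e + (z_q - z_e).detach(), idx
```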
Hiroya YAMAMOTO Daichi KITAHARA Hiroki KURODA Akira HIRABAYASHI
This paper addresses single-image super-resolution (SR) based on convolutional neural networks (CNNs). It is known that recovery of high-frequency components in the output SR images of CNNs trained with least-squares or least-absolute errors is insufficient. To generate realistic high-frequency components, SR methods using generative adversarial networks (GANs), composed of one generator and one discriminator, have been developed. However, when the generator tries to induce the discriminator's misjudgment, it generates not only realistic high-frequency components but also artifacts, and objective indices such as PSNR decrease. To reduce the artifacts in GAN-based SR methods, we consider the set of all SR images whose squared errors between their downscaled versions and the input image are within a certain range, and propose applying the metric projection onto this consistent set in the output layers of the generators. The proposed technique guarantees consistency between output SR images and input images, and generators with the proposed projection can generate high-frequency components with few artifacts while keeping low-frequency components appropriate for the known noise level. Numerical experiments show that the proposed technique reduces the artifacts of a GAN-based SR method while generating realistic high-frequency components with better PSNR values in both noise-free and noisy situations. Since the proposed technique can be integrated into various generators whenever the downscaling process is known, it gives existing methods consistency with their input images without degrading other aspects of SR performance.
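A sketch of the metric projection onto the consistent set {y : ‖A(y) − x‖ ≤ ε}, under the simplifying assumption that the downscaling operator A satisfies AAᵀ = I so that the projection has a closed form (the abstract only requires the downscaling process to be known):

```python
# Projection of an SR estimate y onto {y : ||A(y) - x|| <= eps},
# assuming A composed with its adjoint At is the identity.
import torch

def project_consistent(y, x, A, At, eps: float):
    r = A(y) - x                                   # low-resolution residual
    norm = r.flatten(1).norm(dim=1).clamp(min=1e-12)
    # Shrink the residual onto the eps-ball only where the bound is violated.
    scale = (1 - eps / norm).clamp(min=0.0).view(-1, 1, 1, 1)
    return y - At(scale * r)
```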
Lin CAO Kaixuan LI Kangning DU Yanan GUO Peiran SONG Tao WANG Chong FU
Face sketch synthesis refers to transforming facial photos into sketches. Recent research on face sketch synthesis has achieved great success thanks to the development of Generative Adversarial Networks (GANs). However, these generative methods are prone to neglecting detailed information and thus losing individual-specific features, such as glasses and headdresses. In this paper, we propose a novel method called Feature Learning Generative Adversarial Network (FL-GAN) to synthesize detail-preserving, high-quality sketches. Specifically, the proposed FL-GAN consists of a Feature Learning (FL) module and an Adversarial Learning (AL) module. The FL module learns the detailed information of the image in a latent space and guides the AL module to synthesize detail-preserving sketches. The AL module learns the structure and texture of sketches and improves the quality of the synthesized sketches through an adversarial learning strategy. Quantitative and qualitative comparisons with seven state-of-the-art methods (LLE, MRF, MWF, RSLCR, RL, FCN, and GAN) on four facial sketch datasets demonstrate the superiority of this method.
Yung-Hui LI Muhammad Saqlain ASLAM Latifa Nabila HARFIYA Ching-Chun CHANG
The recent development of deep learning-based generative models has sharply intensified interest in data synthesis and its applications. Data synthesis takes on added importance especially for pattern recognition tasks in which some classes of data are rare and difficult to collect. In an iris dataset, for instance, the minority-class samples include images of eyes with glasses, oversized or undersized pupils, misaligned iris locations, and irises occluded or contaminated by eyelids, eyelashes, or lighting reflections. Such class-imbalanced datasets often result in biased classification performance. Generative adversarial networks (GANs) are one of the most promising frameworks that learn to generate synthetic data through a two-player minimax game between a generator and a discriminator. In this paper, we utilize the state-of-the-art conditional Wasserstein generative adversarial network with gradient penalty (CWGAN-GP) to generate minority-class iris images, which saves a huge amount of human labor in rare-data collection. With our model, researchers can generate as many iris images of rare cases as they want, which helps in developing deep learning algorithms whenever a large dataset is needed.
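For reference, the standard gradient penalty used by the WGAN-GP framework on which CWGAN-GP builds (conditioning inputs are omitted for brevity):

```python
# WGAN-GP gradient penalty on interpolates of real and fake images (sketch).
import torch

def gradient_penalty(D, real, fake, lambda_gp: float = 10.0):
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grad = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lambda_gp * ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```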
Jinhua WANG Xuewei LI Hongzhe LIU
At present, the generative adversarial network (GAN) plays an important role in learning tasks. The basic idea of a GAN is to train the discriminator and generator simultaneously. A GAN-based inverse tone mapping method can generate high dynamic range (HDR) images of a scene from multiple image sequences of that scene with different exposures. However, subsequent tone mapping is then needed to display the result on a general device. This paper proposes an end-to-end multi-exposure image fusion algorithm based on a relativistic GAN (called RaGAN-EF), which can directly fuse multiple image sequences with different exposures to generate a high-quality image displayable on a general device without further processing. The RaGAN is used to design the loss function, which retains more details from the source images. In addition, the number of input image sequences in multi-exposure image fusion is often uncertain, which limits the applicability of many existing GANs. This paper proposes a convolutional layer with weights shared between channels, which solves the problem of variable input length. Experimental results demonstrate that the proposed method performs better in terms of both objective evaluation and visual quality.
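A hypothetical sketch of a convolutional layer with weights shared between channels: the same single-channel kernel is applied to every exposure, so the layer accepts any number N of exposures (single-channel exposures and mean pooling over exposures are assumptions):

```python
# Weight-shared convolution over a variable number of exposures (sketch).
import torch
import torch.nn as nn

class SharedExposureConv(nn.Module):
    def __init__(self, out_ch: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(1, out_ch, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, exposures):          # exposures: (B, N, H, W), N varies
        b, n, h, w = exposures.shape
        feats = self.conv(exposures.reshape(b * n, 1, h, w))
        feats = feats.reshape(b, n, -1, h, w)
        return feats.mean(dim=1)           # pool over the exposure dimension
```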
Rintaro YANAGI Ren TOGO Takahiro OGAWA Miki HASEYAMA
Various cross-modal retrieval methods have been proposed that can retrieve images related to a query sentence without text annotations. Although these methods achieve a high level of retrieval performance, they were developed for a single-domain retrieval setting. When the retrieval candidate images come from various domains, their retrieval performance may degrade. To deal with this problem, we propose a new domain-adaptive cross-modal retrieval method. By translating the modality and domains of a query and candidate images, our method can accurately retrieve the desired images in a different-domain retrieval setting. Experimental results on clipart and painting datasets show that the proposed method achieves better retrieval performance than conventional and state-of-the-art methods.
Yuki SAITO Kei AKUZAWA Kentaro TACHIBANA
This paper presents a method for many-to-one voice conversion (VC) using phonetic posteriorgrams (PPGs), based on adversarial training of deep neural networks (DNNs). A conventional method for many-to-one VC can learn a mapping function from input acoustic features to target acoustic features through separately trained DNN-based speech recognition and synthesis models. However, 1) the differences among speakers observed in PPGs and 2) the over-smoothing effect of generated acoustic features degrade the converted speech quality. Our method performs domain-adversarial training of the recognition model to reduce the PPG differences. In addition, it incorporates a generative adversarial network into the training of the synthesis model to alleviate the over-smoothing effect. Unlike the conventional method, ours jointly trains the recognition and synthesis models so that they are optimized for many-to-one VC. Experimental evaluation demonstrates that the proposed method significantly improves the converted speech quality compared with conventional VC methods.
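One standard way to realize such domain-adversarial training is a gradient reversal layer between the recognition model's features and a speaker classifier; a minimal sketch follows (the paper's exact adversarial setup may differ).

```python
# Gradient reversal layer for domain-adversarial training (sketch).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip gradients so the front-end un-learns speaker identity.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam: float = 1.0):
    return GradReverse.apply(x, lam)
```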
Chao-Yuan KAO Sangwook PARK Alzahra BADI David K. HAN Hanseok KO
Performance in automatic speech recognition (ASR) degrades dramatically in noisy environments. To alleviate this problem, a variety of deep networks based on convolutional and recurrent neural networks have been proposed using L1 or L2 losses. In this letter, we propose a new orthogonal gradient penalty (OGP) method for Wasserstein Generative Adversarial Networks (WGANs) applied to denoising and despeeching models. The WGAN integrates a multi-task autoencoder that estimates not only speech features but also noise features from noisy speech. While achieving a 14.1% improvement in the Wasserstein distance convergence rate, features enhanced with the proposed OGP are tested in ASR and achieve 9.7%, 8.6%, 6.2%, and 4.8% WER improvements over the DDAE, MTAE, R-CED (CNN), and RNN models, respectively.
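A minimal sketch of a multi-task autoencoder of the kind described, with one head estimating speech features and another estimating noise features (layer sizes are placeholders):

```python
# Multi-task autoencoder: shared encoder, speech head, and noise head.
import torch
import torch.nn as nn

class MultiTaskAE(nn.Module):
    def __init__(self, feat_dim: int = 257, hidden: int = 512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.speech_head = nn.Linear(hidden, feat_dim)  # clean-speech estimate
        self.noise_head = nn.Linear(hidden, feat_dim)   # noise estimate

    def forward(self, noisy):
        h = self.encoder(noisy)
        return self.speech_head(h), self.noise_head(h)
```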
Joanna Kazzandra DUMAGPI Woo-Young JUNG Yong-Jin JEONG
Threat object recognition in x-ray security images is an important practical application of computer vision. However, research in this field has been limited by the lack of available datasets that mirror the practical setting for such applications. In this paper, we present a novel GAN-based anomaly detection (GBAD) approach as a solution to the extreme class-imbalance problem in multi-label classification. The method helps suppress the surge in false positives induced by training a CNN on a non-practical dataset. We evaluate our method on a large-scale x-ray image database that closely emulates practical scenarios in port security inspection systems. Experiments demonstrate improvement over the existing algorithm.
Hyun KWON Yongchul KIM Hyunsoo YOON Daeseon CHOI
We propose new CAPTCHA image generation systems that use generative adversarial network (GAN) techniques to strengthen CAPTCHAs against solvers. CAPTCHA images are widely used across the web today to verify whether a user is human. We introduce two systems for generating CAPTCHA images: the distance GAN (D-GAN) and the composite GAN (C-GAN). The D-GAN adds distance values to the original CAPTCHA images to generate new ones, and the C-GAN generates a CAPTCHA image by composing multiple source images. To evaluate the performance of the proposed schemes, we used CAPTCHA-breaking software as the CAPTCHA solver and compared the resistance of the original source images and the generated CAPTCHA images against it. The results show that the proposed schemes improve resistance to the CAPTCHA solver by over 67.1% and 89.8%, respectively, depending on the system.