
Deep Face Drawing: Deep generation of face images from sketches


26 Feb 2023

Introduction

Creating realistic human face images from scratch benefits various applications, including criminal investigation, character design, and educational training. Due to their simplicity, conciseness, and ease of use, sketches are often used to depict desired faces. Recently proposed deep-learning-based image-to-image translation techniques allow automatic generation of photo images from sketches for various object categories, including human faces, and lead to impressive results.

This system consists of three main modules, namely, CE (Component Embedding), FM (Feature Mapping), and IS (Image Synthesis).

The CE module adopts an auto-encoder architecture and separately learns five feature descriptors from the face sketch data, namely for "left-eye", "right-eye", "nose", "mouth", and "remainder", to locally span the component manifolds.

The FM and IS modules together form another deep learning sub-network for conditional image generation, and map component feature vectors to realistic images.

Methodology

A possible approach to synthesizing realistic faces from hand-drawn sketches is to first project an input sketch to such a 3D face space and then synthesize a face image from the generated 3D face. However, such a global parametric model is not flexible enough to accommodate rich image details or support local editing. Motivated by the effectiveness of a local-to-global structure for faithful local detail synthesis, this method instead models the shape spaces of face components in the image domain.

Figure: overview of the network architecture (upper half: CE module; lower half: FM and IS modules).

The upper half is the Component Embedding module. We learn feature embeddings of face components using individual auto-encoders. The feature vectors of component samples are considered as point samples of the underlying component manifolds and are used to refine an input hand-drawn sketch by projecting its individual parts to the corresponding component manifolds.

The lower half illustrates a sub-network consisting of the Feature Mapping (FM) and the Image Synthesis (IS) modules. The FM module decodes the component feature vectors into corresponding multi-channel feature maps (H × W × 32), which are combined according to the spatial locations of the corresponding facial components before being passed to the IS module.
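As a rough illustration of this step, the following PyTorch sketch expands one 512-dimensional component vector into a 32-channel feature map and writes it into a full-face feature canvas at the component's window location. The layer configuration, window size, and placement logic here are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FeatureMapping(nn.Module):
    """Sketch of one per-component FM decoder: it expands a 512-d component
    vector into a 32-channel feature map of the component's window size.
    The layer counts and kernel sizes are assumptions, not the paper's exact ones."""
    def __init__(self, latent=512, out_size=128, channels=32):
        super().__init__()
        self.start = out_size // 16                      # spatial size before four upsamplings
        self.fc = nn.Linear(latent, 256 * self.start ** 2)
        def up(cin, cout):
            return [nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.ReLU(inplace=True)]
        self.net = nn.Sequential(*up(256, 128), *up(128, 64), *up(64, 32),
                                 nn.ConvTranspose2d(32, channels, 4, 2, 1))

    def forward(self, z):
        h = self.fc(z).view(z.size(0), 256, self.start, self.start)
        return self.net(h)

def place_component(canvas, fmap, cy, cx):
    """Write a component feature map into the full-face canvas at its window,
    mirroring the spatial combination performed before the IS module."""
    s = fmap.shape[-1]
    h = s // 2
    canvas[:, :, cy - h:cy + h, cx - h:cx + h] = fmap
    return canvas

# Usage idea: fill a 32-channel, 512x512 canvas from the "remainder" map first,
# then overwrite it with the local component maps at their window centres.
```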

Sketch-to-Image Synthesis Architecture

The deep learning framework takes as input a sketch image and generates a high-quality facial image of size 512×512. It consists of two sub-networks. The first sub-network is the CE module, which is responsible for learning feature embeddings of individual face components using separate auto-encoder networks; this step turns component sketches into semantically meaningful feature vectors. The second sub-network consists of two modules, FM and IS. FM turns the component feature vectors into corresponding feature maps to improve the information flow. The feature maps of the individual face components are then combined according to the face structure and finally passed to IS for face image synthesis.

Component Embedding Module.

Since human faces share a clear structure, we decompose a face sketch into five components, denoted as S_c, c ∈ {1, 2, 3, 4, 5}, for "left-eye", "right-eye", "nose", "mouth", and "remainder", respectively. To handle the details in-between components, we define the first four components simply by using four overlapping windows centered at the individual face components (derived from the pre-labeled segmentation masks in the dataset). The "remainder" image corresponding to the "remainder" component is the same as the original sketch image but with the eyes, nose, and mouth removed. Here we treat "left-eye" and "right-eye" separately to best explore the flexibility in the generated faces. To better control the details of individual components, we learn a local feature embedding for each face component type. We thus obtain the feature descriptors of individual components by using five auto-encoder networks, denoted as {E_c, D_c}, with E_c being an encoder and D_c a decoder for component c.
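As an illustration of this decomposition, the sketch below crops four overlapping component windows from a 512×512 sketch and blanks them out of a copy to form the "remainder" image. The window centres and sizes are placeholder values; the paper derives its windows from the pre-labeled segmentation masks.

```python
import numpy as np

# Illustrative crop windows (center_y, center_x, size) for a 512x512 sketch.
# These values are assumptions for this sketch, not the paper's settings.
WINDOWS = {
    "left_eye":  (256, 190, 128),
    "right_eye": (256, 322, 128),
    "nose":      (300, 256, 160),
    "mouth":     (370, 256, 192),
}

def decompose_sketch(sketch: np.ndarray) -> dict:
    """Split a 512x512 sketch into the five overlapping components."""
    components = {}
    for name, (cy, cx, s) in WINDOWS.items():
        h = s // 2
        components[name] = sketch[cy - h:cy + h, cx - h:cx + h].copy()
    # "Remainder": the full sketch with the four local windows blanked out
    # (assuming a white background encoded as 1.0).
    remainder = sketch.copy()
    for cy, cx, s in WINDOWS.values():
        h = s // 2
        remainder[cy - h:cy + h, cx - h:cx + h] = 1.0
    components["remainder"] = remainder
    return components
```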

Each auto-encoder consists of five encoding layers and five decoding layers. We add a fully connected layer in the middle to ensure the latent descriptor is of 512 dimensions for all five components. We experimented with different numbers of dimensions for the latent representation (128, 256, 512) and found that 512 dimensions are enough to reconstruct and represent the sketch details; lower-dimensional representations tend to lead to blurry results. Based on trial and error, we append a residual block after every convolution/deconvolution operation in each encoding/decoding layer, instead of only using convolution and deconvolution layers, to construct the latent descriptors. We use the Adam solver during training. Please find the details of the network architectures and the parameter settings in the supplemental materials.
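A minimal PyTorch sketch of one per-component auto-encoder following this description is shown below: five stride-2 encoding layers, a fully connected 512-dimensional bottleneck, five decoding layers, and a residual block after every (de)convolution. The channel counts and the residual-block design are assumptions, since the exact settings are given only in the supplemental materials.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block; the paper's exact block design is in its supplement."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class ComponentAutoEncoder(nn.Module):
    """One per-component auto-encoder: five stride-2 conv layers, a 512-d
    fully connected bottleneck, and five deconv layers, with a residual block
    after every (de)convolution. Channel widths are illustrative assumptions."""
    def __init__(self, in_size=128, base=32, latent=512):
        super().__init__()
        chs = [1, base, base * 2, base * 4, base * 8, base * 16]
        enc = []
        for cin, cout in zip(chs[:-1], chs[1:]):
            enc += [nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                    nn.ReLU(inplace=True), ResBlock(cout)]
        self.encoder = nn.Sequential(*enc)
        self.spatial = in_size // 32                      # size after five stride-2 layers
        flat = chs[-1] * self.spatial ** 2
        self.to_latent = nn.Linear(flat, latent)
        self.from_latent = nn.Linear(latent, flat)
        rev = chs[::-1]
        dec = []
        for i, (cin, cout) in enumerate(zip(rev[:-1], rev[1:])):
            dec.append(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1))
            if i < len(rev) - 2:                          # no activation on the output layer
                dec += [nn.ReLU(inplace=True), ResBlock(cout)]
        self.decoder = nn.Sequential(*dec)

    def encode(self, x):
        return self.to_latent(self.encoder(x).flatten(1))

    def decode(self, z):
        h = self.from_latent(z).view(z.size(0), -1, self.spatial, self.spatial)
        return self.decoder(h)

    def forward(self, x):
        return self.decode(self.encode(x))
```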

Feature Mapping Module.

Given an input sketch, we can project its individual parts to the component manifolds to increase its plausibility. One possible solution to synthesize a realistic image is to first convert the feature vectors of the projected manifold points back to component sketches using the learned decoders {D_c}, then perform component-level sketch-to-image synthesis (e.g., based on [38]), and finally fuse the component images together into a complete face. However, this straightforward solution easily leads to inconsistencies in the synthesized results in terms of both local details and global styles, since there is no mechanism to coordinate the individual generation processes. Another possible solution is to first fuse the decoded component sketches into a complete face sketch and then perform sketch-to-image synthesis to get a face image. This solution also easily causes artifacts (e.g., misalignment between face components, incompatible hairstyles) in the fused sketch, and such artifacts are inherited by the synthesized image, since existing deep learning solutions for sketch-to-image synthesis tend to use input sketches as rather hard constraints, as discussed previously. The FM module avoids these issues by mapping the refined component feature vectors directly to feature maps, which are then combined spatially and synthesized jointly.

Image Synthesis Module

Given the combined feature maps, the IS module converts them to a realistic face image. We implement this module using a conditional GAN architecture, which takes the feature maps as input to a generator, with the generation guided by a discriminator. Like the global generator in pix2pixHD, our generator contains an encoding part, residual blocks, and a decoding part, and the input feature maps go through these units sequentially. Similar to pix2pixHD, the discriminator judges the samples in a multi-scale manner: we downsample the input to multiple sizes and use multiple discriminators to process the inputs at different scales. This setting lets the network learn the high-level correlations among parts implicitly.
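The multi-scale discriminator can be sketched roughly as follows; the number of scales, layer widths, and the PatchGAN-style block design are assumptions inspired by the pix2pixHD setup the module follows, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """A small PatchGAN-style discriminator; layer sizes are assumptions."""
    def __init__(self, in_ch=32 + 3):                 # condition feature maps + RGB image
        super().__init__()
        def block(cin, cout, norm=True):
            layers = [nn.Conv2d(cin, cout, 4, stride=2, padding=1)]
            if norm:
                layers.append(nn.InstanceNorm2d(cout))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(*block(in_ch, 64, norm=False), *block(64, 128),
                                 *block(128, 256), nn.Conv2d(256, 1, 4, padding=1))
    def forward(self, x):
        return self.net(x)

class MultiScaleDiscriminator(nn.Module):
    """Runs separate discriminators on progressively downsampled copies of the
    input, mirroring the multi-scale setup borrowed from pix2pixHD."""
    def __init__(self, num_scales=3, in_ch=32 + 3):
        super().__init__()
        self.discs = nn.ModuleList(PatchDiscriminator(in_ch) for _ in range(num_scales))
        self.down = nn.AvgPool2d(3, stride=2, padding=1)
    def forward(self, feature_maps, image):
        x = torch.cat([feature_maps, image], dim=1)    # condition on the feature maps
        outputs = []
        for d in self.discs:
            outputs.append(d(x))
            x = self.down(x)                           # next discriminator sees a smaller copy
        return outputs
```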

Two-stage Training

Here we adopt a two-stage strategy to train our network on our dataset of sketch-image pairs. In Stage I, we train only the CE module, using component sketches to train the five individual auto-encoders for feature embeddings. The training is done in a self-supervised manner, with a mean squared error (MSE) loss between an input sketch image and the reconstructed image. In Stage II, we fix the parameters of the trained component encoders and train the remaining unknown parameters in the FM and IS modules together in an end-to-end manner. For the GAN in the IS module, besides the GAN loss, we also incorporate an L1 loss to further guide the generator and thus ensure the pixel-wise quality of generated images. We use a perceptual loss on the discriminator features to compare the high-level difference between real and generated images. Due to the different characteristics of female and male portraits, we train the network on the complete set but constrain the search space to the male or female subspace at test time.
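The two stages can be summarized by the loss sketches below. The LSGAN-style adversarial term and the loss weights are assumptions for illustration; the perceptual term compares discriminator features as described above, but its weighting is likewise assumed.

```python
import torch
import torch.nn.functional as F

# Stage I (per component): self-supervised reconstruction of the component sketch.
def stage1_loss(autoencoder, sketch):
    recon = autoencoder(sketch)
    return F.mse_loss(recon, sketch)

# Stage II: component encoders frozen; FM + IS trained end to end.
# lambda_l1 / lambda_perc are assumed weights, not the paper's values.
def stage2_generator_loss(disc_outputs_fake, fake_img, real_img,
                          perc_fake, perc_real,
                          lambda_l1=100.0, lambda_perc=10.0):
    # Adversarial term over all discriminator scales (LSGAN-style, an assumption).
    gan = sum(F.mse_loss(o, torch.ones_like(o)) for o in disc_outputs_fake)
    # Pixel-wise L1 term between generated and ground-truth images.
    l1 = F.l1_loss(fake_img, real_img)
    # Perceptual term over discriminator feature maps of fake vs. real images.
    perc = sum(F.l1_loss(pf, pr) for pf, pr in zip(perc_fake, perc_real))
    return gan + lambda_l1 * l1 + lambda_perc * perc
```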

Manifold Projection

Let S = {s_i} denote the set of sketch images used to train the feature embeddings of face components. For each component c, we can obtain a set of points in the c-component feature space by using the trained encoders, denoted as F_c = {f_i^c = E_c(s_i^c)}. Although each feature space is 512-dimensional, given that similar component images are placed close together in these feature spaces, we assume that all the points in F_c lie on an underlying low-dimensional manifold, denoted as M_c, and further assume each component manifold is locally linear: each point and its neighbors lie on or close to a locally linear patch of the manifold.


Figure: illustration of manifold projection.

Given a new feature vector f̃^c, we replace it with the projected feature vector f_proj^c, computed from the K nearest neighbors of f̃^c in F_c.
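A possible implementation of this projection, assuming standard locally linear (LLE-style) reconstruction weights over the K nearest neighbours, is sketched below; the value of K and the regularization term are illustrative choices.

```python
import numpy as np

def project_to_manifold(f_new, feature_bank, K=10):
    """Replace f_new with a locally linear combination of its K nearest
    neighbours in the component feature bank (shape: num_samples x 512).
    The LLE-style weights and the regularisation are assumptions."""
    dists = np.linalg.norm(feature_bank - f_new, axis=1)
    idx = np.argsort(dists)[:K]
    N = feature_bank[idx]                        # (K, 512) neighbour features
    # Solve for weights w that best reconstruct f_new from its neighbours,
    # subject to sum(w) = 1 (standard LLE reconstruction weights).
    G = (N - f_new) @ (N - f_new).T              # local Gram matrix
    G += np.eye(K) * 1e-3 * np.trace(G)          # regularisation for stability
    w = np.linalg.solve(G, np.ones(K))
    w /= w.sum()
    return w @ N                                 # projected feature vector
```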

Conclusion

This article presented a novel deep learning framework for synthesizing realistic face images from rough and/or incomplete freehand sketches. The framework takes a local-to-global approach: it first decomposes a sketched face into components, refines the individual components by projecting them to component manifolds defined by the existing component samples in the feature spaces, maps the refined feature vectors to feature maps for spatial combination, and finally translates the combined feature maps to a realistic image.
