Training an Anime Generator

Project Info

A Generative Adversarial Network (StyleGAN 2 model) to draw anime faces.

◷ 14/06/2021 – ??/??/????

Difficulty: ★ ★ ☆ ☆ ☆

This project was created to test Nvidia’s StyleGAN 2 ADA. I decided to jump into this project with minimal prior knowledge of both neural networks and coding, which was likely a mistake!


  • cwRsync (Windows replacement for rsync)
  • Microsoft Visual Studio 2019
  • Python
  • OpenCV
  • PyTorch
  • CUDA Toolkit 11
  • waifu2x-caffe – Upscaling results
  • WSL + Windows / Ubuntu
  • Bulk Rename Utility – Useful!
  • Decently powerful NVIDIA GPU with 8GB+ VRAM


G Wern – Dataset creation and cleaning. An amazing article to read.

Towards Data Science – Training StyleGAN.

Images for the training dataset were taken from Danbooru 2020 (SFW?, 512px): G Wern – Dataset

Note that not all images in this dataset are completely SFW. Be mindful of this if you decide to download the data for yourself.

Creating the Dataset

To begin, I downloaded the Danbooru 2020 dataset.

I used the 512px version of the dataset. This meant many of the downloaded images were too small to be usable for StyleGAN (in total, 49,875 of the 1,000,000 images were used), but it saved on disk space and download time.

I used rsync in WSL to download the dataset:

rsync --recursive --verbose rsync:// ./DanbooruDataset

While you can use cwRsync Client instead, I do not recommend this as it had issues with various file directories.

Once the images were downloaded, I modified G Wern’s script to crop all the images into square portraits:

These portraits were checked for grayscale using code from Stack Overflow (and any flagged images were moved for manual checking):

The value MSE_cutoff=200 could be changed, although higher numbers gave more false positives (which were fine since they were manually checked afterwards).
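The check itself is not reproduced here, so the following is only a minimal sketch of the idea behind an MSE cutoff, not the Stack Overflow code. It assumes each pixel is an (R, G, B) tuple; the names colour_mse and is_grayscale are hypothetical.

```python
def colour_mse(pixels):
    """Average, over all pixels, of the summed squared deviation of
    R, G and B from that pixel's own mean.

    A perfectly gray pixel (R == G == B) contributes 0.
    """
    total, count = 0.0, 0
    for r, g, b in pixels:
        mu = (r + g + b) / 3
        total += (r - mu) ** 2 + (g - mu) ** 2 + (b - mu) ** 2
        count += 1
    if count == 0:
        raise ValueError("no pixels given")
    return total / count


def is_grayscale(pixels, mse_cutoff=200):
    """Flag an image as grayscale when its colour MSE is at or below the cutoff."""
    return colour_mse(pixels) <= mse_cutoff
```

With OpenCV, the pixel list could come from something like cv2.imread(path).reshape(-1, 3).tolist(); the BGR channel order does not matter here because the measure treats all three channels the same way. Raising mse_cutoff lets faintly tinted images through as "grayscale", which matches the extra false positives described above.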

I also used findimagedupes to find similar images, and exported these to a text file, where I could move them out:

for file in $(cat /mnt/d/lbpcascade_animeface-master/CroppedDataset/output.txt); do mv "$file" /mnt/d/lbpcascade_animeface-master/CroppedDataset/originals; done

In total, only 35,000 of the 650,000 downloaded images were suitable for training.

A collage of the training dataset

Training test run

To test the dataset and make sure everything could run properly, I ran StyleGAN 2 ADA and set it to train on 20,000 images.

After cloning the GitHub repo, installing missing dependencies and finally starting training, my RTX 2060 Super quickly ran out of VRAM (as expected!). Changing the batch size fixed this, but it didn’t fix the horribly long training time, so I downscaled all 512×512 images to 256×256 using Python, with the intent of upscaling the 256×256 outputs back to higher resolutions later.

I also noticed a few non-faces appearing in the dataset (roughly 1 in 100?), which was below the 5% error rate expected from Nagadomi’s Anime Face detector. I removed some by hand, but quite a few remained. I decided this was acceptable, mainly because I was not prepared to manually scan through all 20,000 images to make sure that they were all faces.

The training rate was ~55s/kimg and StyleGAN was trained for 6 hours. The test run produced the following results:

Training after 6 hours

The results were all similarly drawn and low quality, but they showed that setup was complete and that the model was ready for final training.
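For a rough sense of scale, the test run's figures can be turned into a count of images processed. This sketch assumes the ~55 s/kimg rate held steady for the whole 6 hours and that the test set held exactly 20,000 images; the function name is only illustrative.

```python
def kimg_processed(hours, sec_per_kimg):
    """Thousands of images shown to the network in `hours` at a steady rate."""
    return hours * 3600 / sec_per_kimg


test_kimg = kimg_processed(hours=6, sec_per_kimg=55)
# Average number of times each training image was seen.
passes = test_kimg * 1000 / 20_000

print(f"{test_kimg:.0f} kimg, about {passes:.1f} passes over the dataset")
```

Under those assumptions the test run covered roughly 393 kimg, or a little under 20 passes over the 20,000 images.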

Final training

With the test run successful, I added the remaining ~15,000 images and re-ran the duplicate and size checks, followed by a very brief manual look through the dataset. The dataset was 3.22GB in size, with 33,357 images in total.

Once everything was complete, the final dataset was copied and the StyleGAN dataset was created with

Training started with the following command:

D:\stylegan2-ada-pytorch-main\ --kimg 50000 --data=D:\stylegan2-ada-pytorch-main\StyleGANDataset --outdir=D:\stylegan2-ada-pytorch-main\finalModel

After 12 hours of training, I realised I could effectively double the available dataset by adding --mirror=1 to the training command, so I updated the command and resumed from the previous model:

D:\stylegan2-ada-pytorch-main\ --kimg 50000 --data=D:\stylegan2-ada-pytorch-main\StyleGANDataset --outdir=D:\stylegan2-ada-pytorch-main\finalModel --mirror=1 --resume D:\stylegan2-ada-pytorch-main\finalModel\00000-StyleGANDataset-auto1-kimg50000\network-snapshot-000800.pkl
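StyleGAN handles the mirroring itself, so nothing needs to be written by hand. The sketch below only illustrates why horizontal flips double the effective dataset; it uses a toy image stored as rows of pixel values, and both function names are hypothetical.

```python
def mirror(image):
    """Return a left-right flipped copy of an image stored as rows of pixels."""
    return [row[::-1] for row in image]


def with_mirrors(dataset):
    """Yield every image followed by its horizontal mirror."""
    for image in dataset:
        yield image
        yield mirror(image)


toy_image = [[1, 2, 3],
             [4, 5, 6]]
augmented = list(with_mirrors([toy_image]))
print(len(augmented))  # 2: the original plus its mirror
```

A flipped face is still a plausible face, which is why this kind of augmentation suits portraits. It would be a poor fit for data where left and right carry meaning, such as images containing text.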

Later, I tried moving training to Google Colab for a possible speed boost, but I found that the Tesla T4 I was assigned was still slower than my 2060 Super: it averaged 75.8 sec/kimg, compared to an average of 62 sec/kimg from my 2060 Super over the entire day. The 2060 Super's average was itself inflated by other GPU usage (mostly gaming).

There is a possibility that the P100s are faster than my 2060 Super, but they are very hard to obtain. Combining this with the fact that the VM resets every 12 hours, I decided it would be best to train on my local machine.
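To put the two rates side by side, here is a small sketch comparing how long each GPU would need for the same amount of training. It assumes both averages hold steady; the 1,000 kimg target is an arbitrary illustration and not a figure from this project.

```python
import math

T4_SEC_PER_KIMG = 75.8       # Colab Tesla T4 average
LOCAL_SEC_PER_KIMG = 62.0    # RTX 2060 Super average
SESSION_HOURS = 12           # Colab VM reset interval


def hours_for(kimg, sec_per_kimg):
    """Hours needed to train `kimg` thousand images at a steady rate."""
    return kimg * sec_per_kimg / 3600


target_kimg = 1000  # hypothetical target, for illustration only
t4_hours = hours_for(target_kimg, T4_SEC_PER_KIMG)
local_hours = hours_for(target_kimg, LOCAL_SEC_PER_KIMG)
slowdown = T4_SEC_PER_KIMG / LOCAL_SEC_PER_KIMG - 1
sessions = math.ceil(t4_hours / SESSION_HOURS)

print(f"T4: {t4_hours:.1f} h over {sessions} Colab sessions")
print(f"2060 Super: {local_hours:.1f} h")
print(f"T4 is about {slowdown:.0%} slower")
```

On these numbers the T4 takes around 22% longer per kimg, and any run longer than 12 hours has to be split across VM resets and resumed from a snapshot.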

Training on Google Colab


After 10 days and 18 hours, I stopped training the final model, as the results were barely improving. It ended on a value of 9.26 for the metric fid50k_full.

Although training stopped at a value of 9.26, this was noticeably worse than the low of 7.3 seen 3 days before the end of training (for FID, lower is better). However, I wanted to share the latest model I had available, which in this case was not necessarily the best. This served as a lesson for any future projects: more training does not always mean a better model. I only noticed this after writing the results section and generating images, so all images seen here, as well as on my results page, use the latest network snapshot. The latest snapshot does seem to handle non-modern art styles better than the earlier model.
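One way to avoid this in future would be to pick the snapshot with the lowest FID instead of the latest one. This is a minimal sketch, assuming the metric log is a JSON-lines file where each record has a "results" dictionary and a "snapshot_pkl" name; that layout is my assumption about the training output, and the snapshot names in the sample are placeholders.

```python
import json


def best_snapshot(jsonl_lines, metric="fid50k_full"):
    """Return (snapshot name, score) for the record with the lowest metric value."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        raise ValueError("no metric records found")
    best = min(records, key=lambda r: r["results"][metric])
    return best["snapshot_pkl"], best["results"][metric]


# Placeholder records using the two FID values mentioned above.
sample = [
    '{"results": {"fid50k_full": 7.3}, "snapshot_pkl": "network-snapshot-A.pkl"}',
    '{"results": {"fid50k_full": 9.26}, "snapshot_pkl": "network-snapshot-B.pkl"}',
]
print(best_snapshot(sample))  # ('network-snapshot-A.pkl', 7.3)
```

In practice the lines would be read from the metric file in the output directory, for example with open(path) as f: best_snapshot(f). A lower FID does not guarantee better-looking images for every style, so it is still worth comparing samples by eye.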


Impressively, the network was able to generate a wide range of different styles, recreating the many unique art styles seen in the Danbooru dataset:

While the images could fool someone at first glance, they usually fell apart on closer inspection:


  • Clothing artifacts
  • Ears


  • Hands
  • Ears

The network generally tripped up on the same features, commonly ears, clothing, hair ties and eye distances.

At other times, however, the network missed completely:

Overall, however, the network did a decent job, with the majority of the images having only small defects. A hand-selected few were marginally better than the rest.