Exploring DreamBooth + Stable Diffusion 2.1 for Album Art

Using fine-tuned AI models to synthesize your own subjects with personalized styles

Sean Dearnaley
18 min read · Dec 28, 2022
Generated image of a couple looking at an image of a robot in an art gallery
the future of art : artificial intelligence (midjourney v4 dec 22)

Update Feb.19.23:

I wish I had time to explore more, but I’m working hard on some Ag-Tech issues at my company, Indigo. I haven’t even had time to formally prepare my release… but here are some tips hot off the press: use ControlNet to add even more personal creative context. Please watch Sebastian Kamph’s fantastic videos on the subject:

ControlNet — Revealing my Workflow to Perfect Images. — YouTube

Support is already available in the amazing AUTOMATIC1111/stable-diffusion-webui: Stable Diffusion web UI (github.com) through the https://github.com/Mikubill/sd-webui-controlnet extension.

Also: ControlNET and Stable Diffusion: A Game Changer for AI Image Generation | by Tristan Wolff | Feb, 2023 | Bootcamp (uxdesign.cc)

Introduction

There’s something truly fulfilling about creating something from scratch, and I’ve always been drawn to art as a way to express myself. Growing up, I was constantly doodling and drawing, and as I got older, I became interested in 3D modeling, music and other forms of digital art. While art has always been a vital outlet for human creativity, we’ve recently seen the emergence of machines that can generate images, challenging traditional ideas about what art is and how it’s created.

In this article, I will share thoughts and techniques for working with Stable Diffusion, the popular open-source latent text-to-image diffusion model, along with adjacent tools and techniques, and I will walk through my workflow for creating album art with personal avatars.

With the rapid pace of innovation, it’s important to stay up-to-date on tutorials and documentation. However, this can be a challenge as many online resources quickly become outdated. Instead of providing specific steps, I want to help you develop a mental framework for navigating this constantly evolving landscape without feeling overwhelmed. By keeping in mind what you need to know and double-checking the dates of any information you come across, you can stay on top of the latest developments in the field.

woman in art gallery looking at a painting of a cyborg like woman
the future of art : artificial intelligence (midjourney v4 dec 22)

Art Created by Artificial Intelligence

Artificial intelligence (AI) has been used to create art for decades, with early examples dating back to the 1960s and 1970s. Beginning in the early 1970s, artist Harold Cohen developed the program AARON, which was able to create paintings and drawings in a variety of styles. In the 1970s and 1980s, AI techniques were also used more extensively in computer-aided design, allowing more complex and realistic images to be created. In the 1990s, AI-generated art went beyond visual effects, with algorithms being used to generate music and poetry, and robots being programmed to create paintings and sculptures.

More recently, AI has been used to create text-to-image generators that make use of diffusion models. These models are trained on millions of images with associated captions and learn the relationship between words and images, allowing them to generate novel images from text. Applications like Midjourney, DALL-E 2, and Stable Diffusion have become popular, and companies like Google have developed their own private AI art generators, such as Imagen and Parti.

However, there is still debate over whether or not AI-generated art can truly be considered “art” in the same way as human-created art. One early example of this debate can be seen in the case of the painting “The Portrait of Edmond de Belamy,” which was created by a group of artists and an AI program and sold at auction for $432,500 in 2018. The painting was part of a series called “The Belamy Family,” which was created using a machine learning algorithm called a Generative Adversarial Network (GAN). The GAN was trained on a dataset of 15,000 portraits from the 14th to the 20th centuries and was able to generate new portraits based on this training. In 2022, as AI went mainstream, there were major controversies: artists flooded the portfolio site and marketplace ArtStation with “No AI Art” protest images after it refused to ban the content.

photorealistic generated image from midjourney showing a woman in a gallery looking a painting of a robot girl
the future of art : artificial intelligence (midjourney v4 dec 22)

How to stay up to date

Staying up to date can be challenging, especially if you’re not accustomed to reading research papers. One useful source of current information is Discord, where you can ask questions and join communities such as the r/StableDiffusion and official Stable Diffusion Discord servers. Alternatively, you can check out the corresponding Reddit forums at https://www.reddit.com/r/machinelearning and https://www.reddit.com/r/StableDiffusion/.

DreamBooth

Text-to-image synthesis models can generate realistic images from a given text description, but they often struggle to accurately depict specific subjects in various contexts. In August 2022, researchers at Google published a paper titled “DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation,” which presents a method for “personalizing” these models.

several pictures of corgis recontextualized in different scenarios
from DreamBooth paper

With this personalized approach, the model can synthesize the specified subject in various contexts using the unique identifier in the text prompt, even if those contexts do not appear in the reference images. This technique shows effectiveness in a range of tasks, including subject recontextualization, text-guided view synthesis, appearance modification, and artistic rendering.

This technique allowed people to fine-tune a text-to-image model with a few examples, meaning new visual concepts could be added to the model. However, neither the model nor the pre-trained weights of Imagen are available to the public. Fortunately, AWS AI scientist Zhisheng Xiao implemented the idea of DreamBooth using Stable Diffusion and made it publicly available: DreamBooth-Stable-Diffusion.
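To make the idea concrete, here is a minimal sketch of the prior-preservation objective that DreamBooth adds during fine-tuning, as it appears in Stable Diffusion implementations: the UNet’s noise prediction on your subject’s images is fit alongside a weighted term on auto-generated class images, which keeps the model from forgetting what the class looks like. The tensors below are random stand-ins for real noise predictions, and the weight value is only illustrative.

```python
import torch
import torch.nn.functional as F

# Random stand-ins for the UNet's noise predictions and the true noise added
# to the latents of an instance image (your subject) and a class/prior image.
noise_pred_instance = torch.randn(1, 4, 64, 64)
noise_instance = torch.randn(1, 4, 64, 64)
noise_pred_prior = torch.randn(1, 4, 64, 64)
noise_prior = torch.randn(1, 4, 64, 64)

prior_loss_weight = 1.0  # illustrative weight for the prior-preservation term

# Fit the subject while preserving the class prior
instance_loss = F.mse_loss(noise_pred_instance, noise_instance)
prior_loss = F.mse_loss(noise_pred_prior, noise_prior)
loss = instance_loss + prior_loss_weight * prior_loss
print(float(loss))
```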

DreamBooth-SD has already been forked and optimized for different platforms. Some talented contributors have even managed to reduce the memory requirements to run on consumer graphics cards, making it much more accessible (see DreamBooth-SD-optimized and Shivam). TheLastBen’s DreamBooth colab is very popular and supports fine-tuning with the Stable Diffusion V2 model.

Artists could find DreamBooth extremely useful in a number of ways. For example, an artist could use it to generate images of a specific subject in various contexts, allowing them to experiment with different compositions and settings without having to physically recreate the scene. This could be especially useful for digital artists, who could quickly and easily generate a range of images to choose from without relying on photographs or other reference material. The text-guided view synthesis feature allows them to specify the perspective from which they want to view the subject. This could be particularly useful for artists working in 3D or virtual reality, who could use this to generate images from different viewpoints to help visualize their work.

The ability to modify the appearance of a subject and generate artistic renderings could also be useful to artists. The appearance modification feature allows artists to experiment with different colors, textures, and other visual elements without physically altering the subject. The artistic rendering feature allows users to generate images in a range of styles, from realistic to abstract, enabling them to explore different creative possibilities and find the look that works best for their work.

Overall, DreamBooth could be a powerful tool, providing users with the ability to quickly and easily generate a range of images and experiment with different visual elements.

Lensa

The viral AI avatar app Lensa has been generating digital portraits of people from their selfies using its Magic Avatars feature. However, some users have raised concerns about its effectiveness and have suggested alternatives like Google Colab, Draw Things, Astria.ai, Drip.art, and Maple Diffusion on GitHub as ways to create persistent avatars or characters without needing high-powered video cards or special hardware. These alternatives are either free or low cost, but may require some time to learn and practice. The main downsides with Lensa are that it’s not free and you have limited control over the generation process. IMHO, selling works generated by a model trained on the output of living artists, albeit legal (supposedly this model was created with works marked Creative Commons), is unsettling when those living artists don’t get a cut. To be clear, this seems to be a problem with Stable Diffusion in particular, and to be fair, versions 2.0+ seem to take these ethical considerations much more seriously.

A woman who used the app, Melissa Heikkilä, discovered that out of 100 avatars generated for her, 16 were topless, and 14 were in extremely skimpy clothes and overly sexualized poses. Heikkilä is of Asian heritage and found that the app only seemed to pick up on this aspect of her appearance, generating images of generic Asian women based on anime or video-game characters, or possibly pornography. Heikkilä’s white female colleague received significantly fewer sexualized images, while another colleague of Chinese heritage received similar results to Heikkilä. Stable Diffusion is built using the LAION-5B data set, compiled by scraping images from the internet. Because the internet contains a high number of images of naked or partially dressed women and racist, sexist stereotypes, the data set is also skewed toward these types of images. This leads to AI models that sexualize women, particularly those who have been historically disadvantaged, regardless of whether they want to be depicted this way.

Screen shot of the app Lensa showing the virtual avatars feature
lensa magic avatars

In addition to the Lensa AI app, there are various methods for creating persistent avatars or characters, including DreamBooth, hypernetworks, and textual embedding. Hardware optimization can also be used to improve performance. Ultimately, the most effective method will depend on the individual’s needs and preferences.

“quadtrych banner for an article called the future of art : artificial intelligence” (midjourney Aug 22)

TheLastBen / fast-DreamBooth

Fast-DreamBooth is a tool for training and fine-tuning AI models from the https://github.com/TheLastBen/fast-stable-diffusion repository. It is designed to make it easy for users to create or load a session, upload instance images, and begin training without too much memory or overhead. It also includes a Colab notebook with a user-friendly interface for easily accessing the tool.

It comes with a more than 65% speed increase, support for T4, P100, and V100 GPUs, and requires less than 12GB of VRAM. Even better, the 8-bit Adam optimizer can be enabled to reduce VRAM further and make it possible to run on even an 8GB GPU (you may have to explore various forks and configurations to get this to work). Quality won’t be lost with this optimization, as the newer memory-efficient attention algorithm is far better than the previous one, reducing memory usage from O(n²) down to O(n) or even O(log n).
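For the curious, switching to 8-bit Adam is essentially a one-line change in code when using the bitsandbytes library. This is only a sketch, assuming a CUDA environment with bitsandbytes installed; the nn.Linear module is a hypothetical stand-in for the UNet being fine-tuned.

```python
import torch.nn as nn
import bitsandbytes as bnb  # assumes a CUDA environment with bitsandbytes installed

model = nn.Linear(1024, 1024)  # hypothetical stand-in for the UNet being fine-tuned

# AdamW8bit keeps optimizer state in 8 bits, substantially reducing the VRAM
# needed during training compared to a full-precision Adam optimizer.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=2e-6)
```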

To use Fast-DreamBooth, users must first install the necessary dependencies and download a model. Once the setup is complete, users can choose the model version, Hugging Face token, and custom model version that they want to use. The path to Hugging Face or a CKPT (checkpoint) path/link can also be specified.

During training, users can specify the number of steps and learning rate for the UNet model, as well as the number of steps for the text encoder model. Training can be resumed if desired, and users can save the trained model and the session for future use. The UNet model is a neural network architecture that is commonly used for image segmentation tasks, while the text encoder is a type of neural network used to process and encode natural language input.

Fast-DreamBooth also allows users to generate images using the trained model. Users can specify a source image and the desired resolution, and Fast-DreamBooth will generate the images using the trained model.
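Outside the Colab UI, the same generation step can be reproduced with the diffusers library if the trained model is exported in diffusers format. The sketch below assumes a hypothetical local model path and an example prompt; the instance token shown is just a placeholder for whatever token you trained with.

```python
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to a DreamBooth-fine-tuned model saved in diffusers format
model_path = "./my-dreambooth-model"

pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# "llmcbrrrqqdpj" is a placeholder for your scrambled instance token
prompt = "portrait of llmcbrrrqqdpj person, dramatic studio lighting, 35mm photo"
negative_prompt = "watermark, text, blurry, deformed, extra limbs"

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=30,
    guidance_scale=7.5,
    height=768,  # Stable Diffusion 2.1 is a 768x768 model
    width=768,
).images[0]
image.save("album_art_candidate.png")
```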

With this Colab, make sure you are using high-quality reference pictures for your training; the appropriate number of steps gave the most accurate likeness. The method also allows for model retraining, which is a great way to fine-tune the results. It is recommended to use the latest version of the Colab notebook from the repository, and retraining the model is a better option than piling more steps into a single session.

It is important to note that instance names should be long and scrambled without vowels, like “llmcbrrrqqdpj”, to avoid tokenizer issues. It is also recommended to check the results with AUTOMATIC1111’s X/Y plot before training again. This can help identify potential issues, such as traits of trained people leaking into other subjects.
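If you want a quick way to produce such a token, a throwaway snippet like this does the job (purely illustrative; any long, vowel-free string works):

```python
import random
import string

# Build a long, vowel-free instance name (e.g. something like "llmcbrrrqqdpj")
# so the tokenizer is unlikely to map it onto an existing concept.
consonants = [c for c in string.ascii_lowercase if c not in "aeiou"]
print("".join(random.choices(consonants, k=13)))
```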

4 images that are very similar to the original instance image
example of overfit model (note it’s a slight variation of my passport photo)

The steps needed for training scale with the number of images uploaded, so 50 steps per picture was recommended as a good rule of thumb. That being said, things are continually being optimized; read the notebook docs and check GitHub for updates. Additionally, Aesthetic Gradients can be used for style training without the need for instance images. It is also possible to batch train multiple concepts with varying instance images, using a lower step count per concept and retraining them afterwards.

When training a single person, it is better to use more instance images than fewer. Using too few can lead to poor training results, as can be seen with one subject that had only 7 images. 30 instance images is not a magic number, and it is recommended to adjust the steps and learning rate according to the number of images used. For example, number of steps = number of instance images × 10.
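As a trivial illustration of those two rules of thumb (the image count below is hypothetical, and both formulas are only starting points to experiment from):

```python
num_instance_images = 20  # hypothetical training set size

steps_heavier_rule = num_instance_images * 50  # older "50 steps per picture" guideline
steps_lighter_rule = num_instance_images * 10  # "images x 10" starting point

print(steps_heavier_rule, steps_lighter_rule)  # 1000 500
```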

For multiple-subject training, it is best to adjust the steps and learning rate according to the number of images used. It is also possible to train one subject, then upload new pictures with a different naming scheme and retrain just by pressing the training cell again with resume checked. However, if the free Colab tier stops working mid-training, any progress made would be lost.

It is also possible to upload multiple pictures straight to the session’s instance_images folder on Google Drive as an alternative to using the cell. The recommended number of steps for training with TheLastBen’s DreamBooth (FAST) is 650 (as of 12/28/22), but this can vary depending on the number of instance images used. It is best to experiment with different learning rates and samplers, as well as retraining the model, to achieve the best results.

With its increased speed, less than 12GB VRAM, and support for T4, P100, and V100 GPUs, the fast-dreambooth colab is a great choice for those looking for an easy and efficient way to train their models. With its memory efficient attention and customizable prompts, you can quickly get up and running with the latest technology.

  • Fast-DreamBooth is a tool for training AI models from the https://github.com/TheLastBen/fast-stable-diffusion repository.
  • Fast-DreamBooth provides options for fine-tuning the trained model, including the ability to specify the number of steps, learning rate, and the source image for fine-tuning.
  • Once training is complete, users can generate images using the trained model.
  • Fast-DreamBooth allows users to save the trained model and the session for future use.
  • Fast-DreamBooth supports Stable Diffusion 2.1
TheLastBen fast-DreamBooth (12/28/22)

Unet/Text Encoder

A UNet model is a type of neural network architecture that is often used for image segmentation tasks. It was first introduced by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in their paper “U-Net: Convolutional Networks for Biomedical Image Segmentation.”

The UNet model consists of a contracting path (downsampling) and an expanding path (upsampling). The contracting path is made up of a series of convolutional and max pooling layers that reduce the spatial resolution of the input image, while the expanding path is made up of a series of convolutional layers that increase the spatial resolution of the image. The contracting and expanding paths are connected by a series of “skip connections,” which allow the model to retain spatial information from the contracting path as it upsamples the image in the expanding path.
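For readers who like to see the shape of the thing, below is a deliberately tiny, self-contained U-Net in PyTorch showing the contracting path, expanding path, and skip connections. The channel counts and depth are arbitrary, and unlike the UNet inside Stable Diffusion it takes no timestep or text conditioning; it is only a sketch of the architecture pattern.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in the original U-Net paper
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """A minimal U-Net: contracting path, expanding path, skip connections."""
    def __init__(self, in_ch=3, out_ch=1):
        super().__init__()
        self.down1 = conv_block(in_ch, 32)
        self.down2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        d1 = self.down1(x)                                     # contracting path
        d2 = self.down2(self.pool(d1))
        b = self.bottleneck(self.pool(d2))
        u2 = self.dec2(torch.cat([self.up2(b), d2], dim=1))    # skip connection
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))   # skip connection
        return self.head(u1)                                   # expanding path output

x = torch.randn(1, 3, 64, 64)
print(TinyUNet()(x).shape)  # torch.Size([1, 1, 64, 64])
```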

The UNet model is commonly used for tasks such as medical image segmentation, where it can be used to identify and segment specific structures or objects in an image. In Stable Diffusion, and therefore in Fast-DreamBooth, the UNet plays a different role: it is the denoising backbone that predicts the noise to remove from the latent image at each diffusion step, conditioned on the text embeddings, and it is the main network that DreamBooth fine-tunes.

A text encoder is a type of neural network architecture that is used to process and encode natural language input, such as text. Text encoders are often used in natural language processing (NLP) tasks, such as language translation, question answering, and text classification.

In Stable Diffusion, the text encoder converts the prompt into a sequence of numerical embeddings that condition the UNet during denoising. In the context of Fast-DreamBooth, the text encoder can optionally be fine-tuned alongside the UNet (hence the separate text encoder training steps), which helps the model associate the unique instance token with the new subject.
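To make that concrete, here is a small sketch of a CLIP-style text encoder turning a prompt into the embedding sequence that conditions the UNet. The checkpoint name is illustrative: Stable Diffusion 1.x uses a very similar CLIP text encoder, while 2.x uses an OpenCLIP variant, but the flow is the same.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Illustrative checkpoint; SD 1.x uses a CLIP text encoder much like this one
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "album cover of a lone astronaut, oil painting",
    padding="max_length", truncation=True, return_tensors="pt",
)
with torch.no_grad():
    embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768]) for this encoder
```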

generated image showing a man wearing an ornate hat
I don’t own this hat

Album Art

The main goal of this project is to infuse as much of my personal expression into the album art as possible. I want to go beyond simply selecting a beautiful image and instead consider curating unique and original images for the cover. Originality is important to me because it adds value and challenges the status quo, leading to new insights and understanding. In order to achieve this, I will have clear intentions about what I am synthesizing and be mindful of not appropriating someone else’s style. By understanding the capabilities of the model and exploring common terms for describing images, I can create something truly original and authentic.

  • Infuse personal expression into album art
  • Explore curating unique and original images for the cover
  • Maintain focus on originality and authenticity
  • Have clear intentions about what is being synthesized
  • Avoid appropriating someone else’s style
  • Understand model capabilities and explore common terms for describing images
  • If you have the budget to hire an artist, you should do it; maybe use AI art to prototype ideas rapidly and hand those off to a human artist

Instance Images

instance images

Generations:

Stable Diffusion 1.5 Observations

  • 1.5 is a 512x512 model; 2.1 is 768x768
  • 1.5 has more artistic hacks and responds to things like ‘trending on artstation’; personally I don’t recommend using artist names and instead rely on better descriptions.
  • anecdotally I think 1.5 has more interesting images, but the low res and ethical concerns make the model controversial
  • Lensa is based on Stable Diffusion 1.5
Examples from 1.5 custom model (fast-dreambooth)

Stable Diffusion 2.1 Observations

  • 768x768 notably higher res
  • make sure to use 768x768 instance images
  • follow the fast-dreambooth instructions in the Colab; they change as the model is tweaked, and now you can train a 2.1 model very quickly
  • can’t use artist hacks, so it’s harder to get beautiful results; you have to work harder in the prompt, which I appreciate. Plagiarizing styles is not good.
  • 2.1 has all the same issues with appendages that Stable Diffusion models suffer from; there don’t seem to be major architectural differences
  • the 512x512 depth model is fantastic, but I don’t know how to fine-tune that
  • negative prompts seem to have a bigger effect
  • unfortunately the 2.1 model seems to generate a lot of watermarked images; I seem to be able to mitigate them with negative prompts
  • here is a good set of negative inputs:
center label, full-body, blur, cut-out, vinyl, out of focus, mid-body, ugly, duplicate, morbid, beard, horse shoe, tattoos, old, hands, teeth, pixelated, pixelation, blocky, jpeg compression, mounted, frame, framed, kaleidoscope, kaleidoscopic, diagonal watermark and/or faint, translucent image/pattern superimposed, complex, camera, pattern overlay, dreamstime, adobe stock, Shutterstock, stock photo, city, landscape, interior, hanging from wall, icon, faint logo, superimposed text, partially transparent, nature, hands, feet, watermark, caption, wide, bland, dark, Reflected Glare, Veiling Reflection, poster, gallery, border, text, writing, signed, text logo, signature, mutilated, [out of frame], extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, ugly, blurry, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, out of frame, ugly, extra limbs, bad anatomy, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, mutated hands, fused fingers, too many fingers, long neck
Examples from 2.1 custom model (fast-dreambooth)

Tips for training models

  • A large number of instance images doesn’t necessarily mean better results; it takes some experimentation to figure out what to include. It’s advised to include a mixture of full body, mid body, and head shots. I did 8 iterations of my personal DreamBooth model and got pretty good results from a starting set of 9 images. However, I noticed a lot of generated images had a larger than normal head that resembled one particular picture in the set. My head was merged with a glare in the image, so I removed it from the training set and the artifact no longer occurred in the outputs.
  • When I tried a large number of images with less regard to quality, I found that gave less reliable results. For example, more outputs that looked nothing like me. I came to realize that the more images there are, the more options for inference, and particularly if blurry images are included, I seem to see more variation in my face. So it has been better to remove these from the training set. Likewise, I’ve tried to build DreamBooth models from just 5 low quality images, but that was too limited; Stable generated many images that were like the ones in the set, but almost none of the outputs looked like the subject. I needed clearer images and slightly more variety, but still an overall consistency in the subject. The best sets I’ve created have been 12–25 images, and I prune the sets based on the images generated. I have a model that seems to favor younger images of me, another that only gets similar people, and another that fairly accurately captures me today.
  • It can be difficult to get the right images for full body shots, since you’ve only got 768x768, and the internal aperture of the model is probably even smaller. I’ve found most of my full body outputs have a distorted face; this is probably due to the relatively small set of pixels in the face cluster. So I would suggest trying to get well-lit, well-defined faces in long shots, to keep the subject in frame. Otherwise, the model will replicate objects that are cut out of frame.
  • To understand different sampling methods, it can be useful to experiment with a fixed seed + X/Y matrix (a minimal sampler-comparison sketch follows after this list). Euler needs very few steps to get good results, while adaptive sampling styles use more steps, which could be useful for higher resolutions.
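Here is a minimal sketch of that kind of fixed-seed sampler comparison using diffusers schedulers; the model path, prompt, and step counts are hypothetical, and in the AUTOMATIC1111 web UI the X/Y plot script does the same job interactively.

```python
import torch
from diffusers import (StableDiffusionPipeline, EulerDiscreteScheduler,
                       DPMSolverMultistepScheduler)

# Hypothetical fine-tuned model; the point is the fixed seed + sampler swap.
pipe = StableDiffusionPipeline.from_pretrained(
    "./my-dreambooth-model", torch_dtype=torch.float16
).to("cuda")
prompt = "portrait of llmcbrrrqqdpj person as an astronaut, film grain"

for name, scheduler_cls, steps in [("euler", EulerDiscreteScheduler, 20),
                                   ("dpm_multistep", DPMSolverMultistepScheduler, 30)]:
    pipe.scheduler = scheduler_cls.from_config(pipe.scheduler.config)
    generator = torch.Generator("cuda").manual_seed(1234)  # same seed each run
    image = pipe(prompt, num_inference_steps=steps, generator=generator).images[0]
    image.save(f"sampler_{name}.png")
```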
“quadtrych banner for an article called the future of art : artificial intelligence” (midjourney Aug 22)

Human Agency

Artificial intelligence (AI) can provide many benefits for independent artists when generating album cover art, including increased efficiency and the ability to access a wider range of styles and techniques. However, there are also potential drawbacks to consider, such as legal issues, ethical concerns, and a lack of originality or authenticity. It is important for artists to be aware of these potential pitfalls and ensure that they are using AI in a responsible and ethical manner.

One potential benefit of using AI is the cost-saving aspect, as it can provide a more affordable alternative for creating album cover art for independent artists who may not have the resources to hire a professional artist or design agency. However, there is also the risk that the use of AI may devalue the work of human artists, leading to a decrease in demand for human-created art. The opposite could also happen, with original human works commanding higher value.

Another potential benefit is the ability to generate a variety of cover art options quickly, potentially saving artists time and effort in the creative process. However, there is also the risk that the AI-generated art may lack the same level of depth and complexity as art created by human artists, lacking nuance and emotion. This could be a concern for artists looking to create art that resonates emotionally with their audience.

Artists should also be aware of the potential for bias in AI, as it may reproduce or amplify existing biases and stereotypes if trained on biased data. This could be a problem for artists trying to create diverse and inclusive artwork.

Overall, while using AI to generate album cover art can provide many benefits, it is important for artists to carefully consider the potential drawbacks and ensure that they are using AI in a responsible and ethical manner. This may involve seeking legal advice, being mindful of the potential for bias, and considering the impact that AI may have on their artistic brand and style.

Conclusion

The use of AI to create art is a complex and controversial topic that continues to be debated by researchers and artists alike, with some arguing that it lacks the creativity and emotion of human-created art and others claiming that its ability to produce unique and original works is proof of its creative capabilities. Despite this debate, AI art has continued to evolve and gain popularity, with tools like DreamBooth and Stable Diffusion 2.1 becoming popular for synthesizing subjects with personalized styles and creating album art. While there are still challenges and controversies surrounding AI art, it is clear that it will continue to be a significant part of the future of art.


Sean Dearnaley

I have worked on different applications for the music industry, government, education and agriculture.