List of info and utilities for Stable Diffusion

Textual Inversion Guide w/ E621 Content

Assumptions

This guide assumes the following basics:

  • you have a working installation of Voldy's Web UI set up
  • you already have the furry/yiffy models downloaded
  • you have node.js installed (!TODO! Port script to python, as everyone should have python already)

You can also extend this into any other booru-oriented model, but you'll have to modify the pre-processing script according to the site images were pulled from. The general concepts still apply.

Preface

I've burnt through seven or so models trying to train two of my husbandos, each attempt with a different method, so the "best" method still needs to be found, and my advice is all just extrapolation from those attempts.

What works for you will differ from what works for me. Do not be discouraged if output during training looks decent but real output in txt2img and img2img fails: try different, well-constructed prompts, and try increasing the resolution. I've thought embeddings had failed when all it took was a higher resolution and a better prompt.

Acquiring Source Material

The first step of training against a subject (or art style) is to acquire source content. Hugging Face's instructions specify having three to five images, cropped to 512x512, but there's no hard upper limit on how many, nor does having more images have any bearing on the final output size or performance. However, there is a practical limit on how many pictures to provide: past a certain point it becomes harder to converge (and although converging implies overfitting, I think that's fine in the context of textual inversion) and many more iterations are needed to train well. 50 to 100 images is a good target.

If you're lacking material, the web UI's pre-processing tools to flip and split should work decently enough to cover the gap for low content. Flipping will duplicate images and flip them across the Y axis, (presumably) adding more symmetry to the final embedding, while splitting will help deal with non-square content and provide good coverage for partially generating your subject.

If you would rather have finely-crafted material, you're more than welcome to manually crop and square images. A compromise for cropping is to expand the canvas size to square the image off, fill the new empty space with colors that crudely blend with the background, and crudely add color blobs to extend limbs outside the frame. It's not imperative to do so, but it helps.

These tips can also apply to training an artist's art style instead, but I've yet to try it myself.

Pre-Processing Script

!TODO!: actually test the script, and port it to Python

Below is a quick hack job from my server of the script. You're required to already have node.js and node-fetch version 2.x (npm install node-fetch@2 in the folder with the script). It is imperative you install version 2, as later versions moved to needing import (YUCK) over require.

You are not required to actually run this, as this script is just a shortcut to manually renaming files and curating the tags, but it cuts the bulk work of it.

The generalized procedure is as follows:

  • load a list of tags associated with the SD model
  • grab a list of filenames
  • for every file, yank out the MD5 hash from the filename
  • query e621 with md5:${hash}
  • parse the tag lists, filtering out any tag that isn't in the model's tag list
  • sort the tags based on how many times that tag shows up in the model's training data
  • yank out the artist and content rating, and prepend the list of tags
  • copy the source file with the name being the processed list of tags
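The tag-processing core of the steps above can be sketched like this (function names and the exact filename format are my own assumptions, not the script's; the real script also handles the e621 fetching and file I/O):

```javascript
// Extract the 32-character MD5 hash from an e621 filename like
// "aabbccddeeff00112233445566778899.png"; returns null if it doesn't match.
function md5FromFilename(filename) {
  const match = filename.match(/^([0-9a-f]{32})\./i);
  return match ? match[1] : null;
}

// Keep only tags the model knows about, sorted by how often each tag
// showed up in the model's training data (most frequent first).
// `modelTagCounts` is a Map built from tags.csv: tag -> count.
function filterAndSortTags(postTags, modelTagCounts) {
  return postTags
    .filter((tag) => modelTagCounts.has(tag))
    .sort((a, b) => modelTagCounts.get(b) - modelTagCounts.get(a));
}

// Build the output filename: artist and content rating first,
// then the sorted tag list (format here is a guess at the layout).
function buildFilename(artist, rating, sortedTags, ext) {
  return [`by ${artist}`, rating, ...sortedTags].join(' ') + ext;
}
```

For example, a post by "chunie" rated explicit with tags `male, solo` would come out as `by chunie explicit male solo.png`.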

Pre-Requisites

There are few safety checks or error messages, so triple check you have:

  • the below script in any folder of your choosing
  • a folder called in, filled with images from e621 with the filenames intact (should be 32 alphanumeric characters of "gibberish")
  • a folder called out, where you'll get copies of the files from in, but with the tags in the filename
  • a file called tags.csv, you can get it from here (Tag counts)
  • patience, e621 allows a request every half second, so the script has to rate limit itself

Clone this repo, open a command prompt/terminal at ./utils/renamer/, and invoke it with node preprocess.js
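The rate limiting the script has to do amounts to spacing each query out by half a second. A minimal sketch, where `queryHash` is a hypothetical stand-in for the real md5:${hash} fetch against e621:

```javascript
// Resolve after `ms` milliseconds; used to space out API requests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Query each hash in sequence, waiting `delayMs` between requests so we
// stay under e621's limit of roughly one request every half second.
async function queryAll(hashes, queryHash, delayMs = 500) {
  const results = [];
  for (const hash of hashes) {
    results.push(await queryHash(hash));
    await sleep(delayMs);
  }
  return results;
}
```

With node-fetch v2, `queryHash` would be something along the lines of fetching `https://e621.net/posts.json?tags=md5:${hash}` and parsing the JSON, but the sequential loop and the sleep are the important parts.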

Tune-ables

You can also add in tags to be ignored in the filename, and adjust the character limit for the filename.
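Both tune-ables boil down to a filter and a length cap, roughly like this (names here are assumptions for illustration, not the script's actual variables):

```javascript
// Drop ignored tags, then trim trailing tags until the joined
// filename fits within the character limit.
function applyTunables(tags, ignoredTags, charLimit) {
  const kept = tags.filter((tag) => !ignoredTags.has(tag));
  while (kept.length && kept.join(' ').length > charLimit) kept.pop();
  return kept.join(' ');
}
```

Since the tags are sorted by frequency beforehand, trimming from the end discards the rarest (least useful) tags first.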

!TODO! Gut out Voldy's web UI's method of calculating tokens, and limit from there

Training Prompt Template

The final piece of the puzzle is providing a decent template to train against. Under ./stable-diffusion-webui/textual_inversion_templates/ are text files for these templates. The Web UI provides rudimentary keywords ([name] and [filewords]) to help provide better crafted prompts used during training. The pre-processing script handles the [filewords] requirement, while [name] will be where you want the embedding's name to plop in the prompt.

An adequate starting point is simply:

uploaded on e621, [name], [filewords]

I've had decent results with just that for training subjects. I've had mixed results with expanding that with filling in more artists to train against, for example:

uploaded on e621, [name] by motogen, [filewords]
uploaded on e621, [name] by oaks16, [filewords]
uploaded on e621, [name] by jumperbear, [filewords]

would theoretically help keep the embedding from "learning" the art style of your subject, but again, your mileage may vary. I still need to test an embedding trained with one template against one trained with the other.

If you really want to be safe, you can add some flavor to the template like:

a photo of [name], uploaded on e621, [filewords]
an oil painting of [name], uploaded on e621, [filewords]
a picture of [name], uploaded on e621, [filewords]

I've yet to test results when training like that, so I don't have much anecdotal advice.
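As a rough sketch, the keyword substitution during training amounts to something like this (a simplification of what the web UI actually does, not its real code):

```javascript
// Fill in the template's [name] and [filewords] keywords: [name] becomes
// the embedding's name, [filewords] becomes the tags from the image's filename.
function expandTemplate(template, embeddingName, filewords) {
  return template
    .replaceAll('[name]', embeddingName)
    .replaceAll('[filewords]', filewords);
}
```

So the template `uploaded on e621, [name], [filewords]` paired with an image named `by chunie explicit solo.png` would train against the prompt `uploaded on e621, <your embedding>, by chunie explicit solo`.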

Once you've managed to bang out your training template, make sure to note where you put it, so you can reference it later in the UI.

Training

Now that everything is set up, it's time to start training. For systems with adequate VRAM (whatever "adequate" entails), you're free to run the web UI with --no-half --precision full. You'll take a very slight performance hit, but quality improves just enough that I was able to notice.

Run the Web UI, and click the Textual Inversion tab.

Create your embedding to train on by providing:

  • a name
    • can be changed later, it's just the filename, and the way to access your embedding in prompts
  • the initialization text
    • can be left *
    • it's only relevant for the very beginning training
    • an embed with zero training is effectively the same as its initialization text. For example, you can create embeds as shortcut keywords for other keywords. (The original documentation used this to """diversify""" doctors with a shortcut keyword)
  • vectors per token
    • this governs how much "data" can be trained to the token
    • these do eat into how many tokens are left for the prompt; for example, setting this to 16 means you have 16 fewer tokens available for prompts
    • a good range is 12 to 16 (I've only trained my last attempt on my husbando at 12, but I've also cut the training iterations and material source count in half, so I don't have much to extrapolate)

Click create, and the starting file will be created.

Afterwards, you can pre-process your source material further by duplicating to flip (this will remove the tag filenames if you already preprocessed them, so beware) or by splitting (which presumably will also eat your filenames).

Next:

  • select your embedding to train on in the dropdown
  • if you're adventurous, adjust the learning rate. The default of 0.005 is fine enough, and shouldn't cause learning/loss problems, but if you're erring on the side of caution, you can set it to 0.0005, but more training will be needed.
  • pass in the path to the folder of your source material to train against
  • put in the path to the prompt file you created earlier. If you put it in the same folder as the web UI's default prompts, just change the filename in the default path
  • set how many steps you want training to run before terminating. Paperspace seems to let me do ~70000 steps on an A6000 before shutting down after 6 hours. An 80GB A100 lets me get just shy of the full 100000 before auto-shutting down after 6 hours.
  • the last two values are creature comforts and have no real effect on training, values are up to player preference

Afterwards, hit Train, and wait and watch your creation come to life.

If you didn't pre-process your images with flipped copies, I suggest midway through to pause training, then use ImageMagick's mogrify to flip your images with mogrify -flop * in the directory of your source material. I feel I've gotten nicer quality pictures because of it over an embedding I trained without it (but with a different prompt template).

Using the Embedding

Using your newly trained embedding is as simple as putting the name of the file in the prompt. Before, you would need to signal to the prompt parser with <token>, but it seems now you do not. I don't know if still using <> has any bearing on output, but take note you do not need it anymore.

Do not be discouraged if your initial output looks disgusting. I've found you need a nicely crafted prompt, and increasing the resolution a few notches will get something decent out of it. Play around with prompts in the thread, but I've found this one to finally give me decent output (credits to anon and anon for letting me shamelessly steal it for my perverted kemobara needs):

e621, explicit , by Pino Daeni, (chunie), wildlife photography, sharp details, <TOKEN>, solo, [:bulky:.6], detailed fur, hairy body, detailed eyes, penis, balls, <FLAVORS>
Negative prompt: blurry, out of frame, deformed, (bad anatomy), disfigured, bad hands, poorly drawn face, mutation, mutated, extra limb, amputee, messy, blurry, tiling, dark, human, text, watermark, copyright
Steps: 40, Sampler: Heun, CFG scale: 7, Seed: 1239293657, Size: 512x704, Model hash: 50ad914b

where <TOKEN> is the name of the embedding you used, and <FLAVORS> are additional tags you want to put in.

After Words

I've mentioned adding in a drop-in replacement for dataset.py, with fancier stuff, like an easier way to grab tags, and to shuffle during training, but so far I don't think it's necessary. It also messes with git pulls, as any future updates will need intervention if that file updates. The initial need to "fix" it was just to not use commas, but it also updated to accept booru strings. I will try later to see if the grandeur of shuffling tags has an effect, but I imagine it's minor at most.