Textual Inversion Guide w/ E621 Content
An up-to-date repo with all the necessary files can be found here: https://git.coom.tech/mrq/stable-diffusion-utils
!WARNING! !CAUTION! DO NOT POST THE REPO'S URL ON 4CHAN !CAUTION! !WARNING!
coom.tech is an automatic 30-day ban if posted. I am not responsible if you share that URL. Share the rentry instead.
Assumptions
This guide assumes the following basics:
- you have a working installation of Voldy's Web UI
- you already have the furry/yiffy models downloaded
- you can run a python script
You can also extend this to any other booru-oriented model, but you'll have to modify the pre-processing script according to the site the images were pulled from. The general concepts still apply.
Glossary
Below is a list of terms, clarified. I notice I'll use some terms interchangeably with other concepts. These do not necessarily cover everything that's generally related to Stable Diffusion, but are moreso about Textual Inversion and terms I'll use that need disambiguation:
Textual Inversion: the method of "training" your embedding; comparable to training a model, but not entirely accurate
training, learning: running Textual Inversion to improve your embedding
subject: a character / object / noun of what you're trying to train against. For e621 (or another booru) applications, it's extremely likely it's a character. Textual Inversion excels at training against subjects.
style: an artist's style. Textual Inversion can also incorporate subjects in a style.
source content/material: the images you're using to train against; pulled from e621 (or another booru)
embedding: the trained "model" of the subject or style in question. "Model" would be wrong to call the trained output, as Textual Inversion isn't true training.
hypernetwork: a different way to train custom content against a model; almost all of the same principles here apply for hypernetworks
epoch: a term derived from typical neural network training
- normally, it refers to a full training cycle over your source material
- in this context, it's the above times the number of repeats per single image
Preface
I've burnt through seven or so models trying to train three of my hazubandos, each try with different methods. I've found my third attempt to have very strong results, yet I don't recall exactly what I did to get it. My later subjects failed to yield such strong results, so your mileage will greatly vary depending on the subject/style you're training against.
What works for you will differ from what works for me, but do not be discouraged if output during training looks decent, but real output in txt2img and img2img fails. Just try different, well-constructed prompts, change where you place your subject, and also try increasing the size a smidge (such as 512x704, or 704x512). I've thought I had embeddings fail, when it just took some clever tweaking to get decent output.
Acquiring Source Material
The first step of training against a subject (or art style) is to acquire source content. Hugging Face's instructions specify having three to five images, cropped to 512x512, but there's no hard upper limit on how many, nor does having more images have any bearing on the final output size or performance. However, the more images you use, the longer it'll take to converge (even though, in typical neural network training, convergence would imply overfitting).
I cannot imagine a scenario where you should stick with low image counts, such as selecting from a pool and pruning for the "best of the best". If you can get lots of images, do it. While the test outputs during training may appear better with a smaller pool, when it comes to real image generation, embeddings trained on big image pools (140-190) yielded far better results than later embeddings trained on half that size (50-100).
If you're lacking material, the web UI's pre-processing tools to flip and split should work decently enough to cover the gap for low content. Flipping will duplicate images and flip them across the Y axis, (presumably) adding more symmetry to the final embedding, while splitting will help deal with non-square content and provide good coverage for partially generating your subject (for example, bust shots, waist below, chest only, etc.).
If you would rather have finely-crafted material, you're more than welcome to manually crop and square images. A compromise to cropping an image is to expand the canvas size to square it off, then fill the new empty space with colors that crudely blend with the background, and crudely add color blobs to extend limbs cut off by the frame. It's not that imperative to do so, but it helps.
Lastly, for Textual Inversion, your results will vary greatly depending on the character you're trying to train against. A character with features you could easily describe in a prompt will yield good results, while characters with hard/impossible to describe attributes will make it very tough for the embedding to learn and replicate.
Fetch Script
If you want to accelerate acquiring your source content, consult the fetch script under `./utils/renamer/`. It's a """simple but powerful""" script that can scrape and download from e621 given a search query.
All you need to do is invoke the script with `python3 fetch.py "search query"`. For example: `python3 fetch.py "zangoose -female score:>0"`.
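For reference, the gist of what a fetch script like that does can be sketched in a few lines of Python. This is not the repo's `fetch.py`, just a hedged sketch: it assumes e621's current `/posts.json` endpoint (which returns a `posts` list whose entries carry `file.url`, `file.md5`, and `file.ext`) and a descriptive `User-Agent`, which e621 requires:

```python
import sys, time, requests
from pathlib import Path

# Minimal sketch, not the repo's fetch.py. Assumes e621's /posts.json API shape.
HEADERS = {"User-Agent": "ti-guide-fetch-sketch/1.0 (by yourname)"}  # e621 rejects blank UAs

def fetch(query: str, out_dir: str = "in", limit: int = 320) -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    resp = requests.get(
        "https://e621.net/posts.json",
        params={"tags": query, "limit": limit},
        headers=HEADERS,
    )
    resp.raise_for_status()
    for post in resp.json()["posts"]:
        f = post["file"]
        if not f["url"]:  # deleted or login-gated posts have no public URL
            continue
        dest = out / f"{f['md5']}.{f['ext']}"  # keep the MD5 filename for the renamer
        dest.write_bytes(requests.get(f["url"], headers=HEADERS).content)
        time.sleep(0.5)  # stay under e621's rate limit

if __name__ == "__main__":
    fetch(sys.argv[1])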
Source Material For A Style
The above tips all also apply to training a style, but some additional care needs to be taken:
- Avoid having a recurring subject. Textual Inversion excels at training against a recurring element, especially a subject. It's very easy for your embedding to associate with a particular character moreso than a particular style. Minimize your training material having recurring subjects.
- If you already have an embedding trained for a subject, and the artist you're training against has art including that character, use that character's trained embedding. I've found it gives very promising results during training, rather than using one after the fact. It's very, very hard to get txt2img to generate an image using a subject embedding and a style embedding without having to compromise one for the other.
- Use the automatic pre-processing script in the web UI to flip and split your source material, as you don't have to focus on a particular subject for training. You can get very strong results by introducing style traits that aren't tied to a specific orientation.
Pre-Processing Script
You are not required to actually run this, as this script is just a shortcut for manually renaming files and curating the tags, but it cuts out the bulk of the work.
Included in the repo under `./utils/renamer/` is a script for tagging images from e621 in the filename for later use in the web UI.
You can also have multiple variations of the same image, which is useful if you're splitting an image into multiple parts. For example, the following is valid:
ef331a09e313914aa0bcb2c5310660ec.jpg
aacb4870a669b0fc7e1ede0c1652fa8c (1).jpg // manually sliced top half of an image
aacb4870a669b0fc7e1ede0c1652fa8c (2).jpg // manually sliced bottom half on an image
554982d3498e67a50f768e6e18088072.jpg
554982d3498e67a50f768e6e18088072 (1).jpg // manually sliced left half of an image
554982d3498e67a50f768e6e18088072 (2).jpg // manually sliced right half of an image
00001-0 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
00001-1 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
00001-2 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
00001-3 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
The generalized procedure is as follows (a rough Python sketch follows the list):
- load a list of tags associated with the SD model
- grab a list of filenames
- for every file, yank out the MD5 hash from the filename
- query e621 with `md5:${hash}`
- parse the tag lists, filtering out any tag that isn't in the model's tag list
- sort the tags based on how many times each tag shows up in the model's training data
- yank out the artist and content rating, and prepend them to the list of tags
- copy the source file with the name being the processed list of tags
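As a rough illustration of that procedure (and only an illustration, not the repo's `preprocess.py`), a sketch might look like the following. It assumes `tags.csv` is laid out as simple `tag,count` rows and that e621's `/posts.json` response groups tags by category and exposes a `rating` field; defer to the actual script for the real behavior:

```python
import csv, re, shutil, time, requests
from pathlib import Path

# Rough sketch of the procedure above; not the repo's preprocess.py.
HEADERS = {"User-Agent": "ti-guide-preprocess-sketch/1.0 (by yourname)"}  # e621 rejects blank UAs

# 1. load the model's known tags and their counts (tags.csv assumed to be tag,count rows)
with open("tags.csv", newline="") as f:
    known = {row[0]: int(row[1]) for row in csv.reader(f) if len(row) > 1 and row[1].isdigit()}

# 2. grab the filenames
for src in Path("in").iterdir():
    # 3. yank the MD5 hash out of the filename
    md5 = re.search(r"[0-9a-f]{32}", src.stem)
    if not md5:
        continue
    # 4. query e621 for that exact post
    resp = requests.get("https://e621.net/posts.json",
                        params={"tags": f"md5:{md5.group(0)}"}, headers=HEADERS)
    posts = resp.json().get("posts", [])
    if not posts:
        continue
    post = posts[0]
    # 5./6. keep only tags the model knows about, sorted by how common they are
    #       (the real script also carries over species tags; see the caveats below)
    tags = sorted((t for t in post["tags"]["general"] if t in known),
                  key=lambda t: known[t], reverse=True)
    # 7. yank out the artist and content rating, and prepend them
    artist = next(iter(post["tags"]["artist"]), "unknown_artist")
    rating = {"s": "safe", "q": "questionable", "e": "explicit"}[post["rating"]]
    # 8. copy the file with the processed tag list as its name
    shutil.copy(src, Path("out") / f"by {artist}, {rating}, {', '.join(tags)}{src.suffix}")
    time.sleep(0.5)  # e621 wants requests rate limited
```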
Additional information about the scripts can be found in the README at `./utils/renamer/README.md`.
Pre-Requisites
There are few safety checks or error messages, so triple check you have the following (a quick sanity-check sketch follows the list):
- the below script in any folder of your choosing
- a folder called `in`, filled with images from e621 with the filenames intact (they should be 32 alphanumeric characters of "gibberish")
- a folder called `out`, where you'll get copies of the files from `in`, but with the tags in the filename
- a file called `tags.csv`; you can get it from here (Tag counts)
- patience: e621 allows a request every half second, so the script has to rate limit itself
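If you want to sanity-check that layout before running anything, a quick sketch (folder and file names taken from the list above; the MD5-style filename check is just a regex):

```python
import re
from pathlib import Path

# Quick sanity check for the prerequisites listed above.
root = Path(".")                                   # the folder holding the script
assert (root / "in").is_dir(), "missing ./in folder of e621 images"
assert (root / "tags.csv").is_file(), "missing tags.csv"
(root / "out").mkdir(exist_ok=True)                # the output folder can simply be created

md5_like = re.compile(r"^[0-9a-f]{32}")            # e621 filenames are the file's MD5 hash
bad = [p.name for p in (root / "in").iterdir() if not md5_like.match(p.name)]
if bad:
    print("these files don't look like untouched e621 downloads:", bad)
```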
Clone this repo, open a command prompt/terminal at `./utils/renamer/`, and invoke it with `python3 preprocess.py`.
Consult the script if you want to adjust its behavior. I tried my best to explain what each setting does, and to make them easy to edit.
Caveats
There's some "bugs" with the script, be it limitations with interfacing with web UI, or oversights in processing tags:
- commas do not carry over to the training prompt, as this is a matter of how the web UI re-assembles tokens passed from the prompt template/filename. There's functionally no difference with having
,
, or - tags with parentheses, such as
boxers_(clothing)
, orcurt_(animal_crossing)
, the web UI will decide whatever it wants to when it comes to processing parentheses. The script can overcome this problem by simply removing anything in parentheses, as you can't really escape them in the filename without editing the web UI's script. - Species tags seemed to not be included in the
tags.csv
, yet they OBVIOUSLY affect the output. I haven't taken close note of it, but your results may or may not improve if you manually tag your species, either in the template or the filenames (whether the """pedantic""" reddit taxonomy term likeursid
that e621 uses or the normal term likebear
is prefered is unknown). The pre-process script will include them by default, but be warned that it will include any of the pedantic species tags (stuff likesuina sus boar pig
) - filtering out common tags like
anthro, human, male, female
, could have negative effects with training either a subject or a style. I've definitely noticed I had to add negative terms for f*moid parts or else my hazubando will have a cooter that I need to inpaint some cock and balls over. I've also noticed during training a style (that both has anthros and humans), a prompt associated with something anthro will generate something human. Just take notice if you don't foresee yourself ever generating a human with an anthro embedding, or anthro with a human embedding. (This also carries to ferals, but I'm sure that can be assumed) - the more images you do use, the longer it will take for the web UI to load and process them, and presumably more VRAM needed. 200 images isn't too bad, but 9000 will take 10 minutes on an A100-80G.
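To make the parentheses workaround and the common-tag filtering concrete, here's a small sketch; it's illustrative only, not the script's actual logic:

```python
import re

# Illustration of the two tag adjustments mentioned in the caveats above.
COMMON = {"anthro", "human", "male", "female"}  # filter these at your own risk

def clean_tag(tag: str) -> str:
    # boxers_(clothing) -> boxers, curt_(animal_crossing) -> curt
    return re.sub(r"_?\(.*?\)", "", tag)

def clean_tags(tags: list[str], drop_common: bool = False) -> list[str]:
    out = [clean_tag(t) for t in tags]
    if drop_common:
        out = [t for t in out if t not in COMMON]
    return [t for t in out if t]

print(clean_tags(["boxers_(clothing)", "curt_(animal_crossing)", "male"]))
# ['boxers', 'curt', 'male']
```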
Training Prompt Template
The final piece of the puzzle is providing a decent template to train against. Under `./stable-diffusion-webui/textual_inversion_templates/` are text files for these templates. The Web UI provides rudimentary keywords (`[name]` and `[filewords]`) to help provide better crafted prompts used during training. The pre-processing script handles the `[filewords]` requirement, while `[name]` will be where you want the embedding's name to plop in the prompt.
The recommended starting point is simply:
uploaded on e621, [name], [filewords]
or for the pedantic:
uploaded on e621, [filewords], [name]
I've had decent results with just the first one for training subjects. I imagine the second, more pedantic one can help too, but it places your training token at the very end. It's a bit more correct, as I can rarely ever actually have my trained token in the early part of the prompt without it compromising other elements.
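As a purely illustrative example of how those keywords get filled in (this mimics the idea, not the web UI's actual code), take an embedding named `my-husbando` (hypothetical) and a source file named `by someartist, explicit, anthro male solo.png`:

```python
# Illustration only; the web UI performs this substitution internally.
template = "uploaded on e621, [name], [filewords]"
name = "my-husbando"                                     # hypothetical embedding name
filewords = "by someartist, explicit, anthro male solo"  # taken from the image's filename

prompt = template.replace("[name]", name).replace("[filewords]", filewords)
print(prompt)
# uploaded on e621, my-husbando, by someartist, explicit, anthro male solo
# (per the caveats above, the commas from the filename don't meaningfully survive
#  into the real training prompt; this just shows where the pieces land)
```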
Once you've managed to bang out your training template, make sure to note where you put it to reference later in the UI.
Alternative Training Prompt Templates
I've had mixed results with expanding that by filling in more artists to train against, for example:
uploaded on e621, [name] by motogen, [filewords]
uploaded on e621, [name] by oaks16, [filewords]
uploaded on e621, [name] by jumperbear, [filewords]
This would theoretically help keep the embedding from "learning" the art style of your subject's source material, but again, your mileage may vary, and I wouldn't use this first. I still need more tests comparing an embedding trained with one template against the other.
If you really want to be safe, you can add some flavor to the template like:
a photo of [name], uploaded on e621, [filewords]
an oil painting of [name], uploaded on e621, [filewords]
a picture of [name], uploaded on e621, [filewords]
I've yet to test results when training like that, so I don't have much anecdotal advice, but only use this if you're getting output with little variation between different prompts.
For Training A Style
A small adjustment is needed if you're training on a style. Your template will be:
uploaded on e621, by [name], [filewords]
Training
Now that everything is set up, it's time to start training. For systems with adequate enough VRAM, you're free to run the web UI with `--no-half --precision full` (whatever "adequate" entails). You'll take a very slight performance hit, but quality improves just barely enough for me to notice.
Make sure you're using the correct model you want to train against, as training uses the currently selected model.
Run the Web UI, and click the `Training` sub-tab.
Create your embedding to train on by providing the following under `Create embedding`:
- a name
  - can be changed later; it's just the filename, and the way to access your embedding in prompts
- the initialization text
  - can be left as `*`
  - it's only relevant for the very beginning of training
  - for embeds with zero training, it's effectively the same as the initialization text. For example, you can create embeds as shortcut keywords for other keywords. (The original documentation used this to """diversify""" doctors with a shortcut keyword)
- vectors per token
  - this governs how much "data" can be trained into the token (see the sketch after this list)
  - these do eat into how many tokens are left for the prompt; for example, setting this to 16 means you have 16 fewer tokens available for prompts
  - a good range is 12 to 16, but the more you can afford the better. Given the recent change to the prompt limitation, you could easily set this to 24 or 32 without worry, but I haven't personally tested the additional caveats that apply when going beyond the initial 75-token limit.
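For some intuition on what "vectors per token" actually allocates, here's a rough sketch. The 768 comes from the SD 1.x CLIP text encoder's embedding width; how the web UI actually stores the file is its own business, so treat this purely as illustration:

```python
import torch

# Illustration of what "vectors per token" allocates, not the web UI's actual code.
vectors_per_token = 12
embedding_dim = 768                                        # SD 1.x CLIP text encoder width
embedding = torch.zeros(vectors_per_token, embedding_dim)  # the trainable parameters

print(embedding.numel(), "trainable values")   # 12 * 768 = 9216
# Every use of the embedding in a prompt consumes `vectors_per_token` slots
# of the prompt's token budget (historically 75 usable tokens).
```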
Click create, and the starting file will be created.
Afterwards, you can click the `Preprocess images` sub-tab to pre-process your source material further, by duplicating flipped copies or by splitting.
Next, under the `Train` sub-tab:
- `embedding` or `hypernetwork`: select your embedding/hypernetwork to train on in the dropdown
- `learning rate`: if you're adventurous, adjust the learning rate. The default of `0.005` is fine enough, and shouldn't cause learning/loss problems, but if you're erring on the side of caution, you can set it to `0.0005`, though more training will be needed.
  - similar to prompt editing, you can also specify when to change the learning rate. For example, `0.000005:2500,0.0000025:20000,0.0000001:40000,0.00000001:-1` will use the first rate until 2500 steps, the second one until 20000 steps, the third until 40000 steps, then hold the last one for the rest of the training (a sketch of this schedule's semantics follows this list)
- `dataset directory`: pass in the path to the folder of your source material to train against
- `log directory`: player preference, the default is sane enough
- `prompt template file`: put in the path to the prompt file you created earlier. If you put it in the same folder as the web UI's default prompts, just change the filename there.
- `width` and `height`: I assume this determines the size of the image to generate when requested; I'd leave it at the default 512x512 for now
- `max steps`: adjust how long you want the training to run before terminating. Paperspace seems to let me do ~70000 on an A6000 before shutting down after 6 hours. An 80GB A100 will let me get shy of the full 100000 before auto-shutting down after 6 hours.
- `epoch length`: this value (allegedly) governs the learning rate correction during training by defining how long an epoch is. For larger training sets, you would want to decrease this. I don't see any difference with it at the moment.
- `save an image/copy`: these two values are creature comforts and have no real effect on training; values are up to player preference
- `preview prompt`: the prompt to use for the preview training image. If left empty, it'll use the last prompt used for training. It's useful for accurately measuring coherence between generations.
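To make the learning-rate schedule syntax concrete, here's a hedged sketch of how such a schedule can be read; it's a rough reading of the format described above, not the web UI's actual parser:

```python
def lr_at_step(schedule: str, step: int) -> float:
    """Rough reading of schedules like '0.000005:2500,...,0.00000001:-1' (rate:until pairs)."""
    rate = None
    for part in schedule.split(","):
        rate_str, until = part.split(":")
        rate = float(rate_str)
        if int(until) == -1 or step <= int(until):
            return rate
    return rate  # past every boundary: hold the last rate

schedule = "0.000005:2500,0.0000025:20000,0.0000001:40000,0.00000001:-1"
print(lr_at_step(schedule, 100))     # 5e-06  (first rate until step 2500)
print(lr_at_step(schedule, 30000))   # 1e-07  (third rate until step 40000)
print(lr_at_step(schedule, 90000))   # 1e-08  (last rate held for the rest)
```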
Afterwards, hit `Train Embedding`, and wait and watch your creation come to life.
If you didn't pre-process your images with flipped copies, I suggest pausing training midway through, then using ImageMagick's `mogrify` to flip your images with `mogrify -flop *` in the directory of your source material. I feel I've gotten nicer quality pictures because of it, compared to an embedding I trained without it (but with a different prompt template).
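If ImageMagick isn't handy, a rough Pillow equivalent of that in-place `mogrify -flop *` might look like this (it overwrites the files in place, so keep a backup):

```python
from pathlib import Path
from PIL import Image, ImageOps

# Rough stand-in for `mogrify -flop *`: mirror every image in place.
source_dir = Path("./in")  # wherever your training images live
for path in source_dir.glob("*"):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    with Image.open(path) as img:
        flipped = ImageOps.mirror(img)   # loads the pixels into a new, mirrored image
    flipped.save(path)                   # overwrite the original (JPEGs get re-encoded)
```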
Lastly, if you're training this on a VM in the "cloud", or through the shared gradio URL, I've noticed the web UI will desync and stop updating from the actual server. You can lazily resync by opening the gradio URL in a new window, navigating back to the Training tabs, and clicking Train again without touching any settings. It'll re-grab the training progress.
For Training a Hypernetwork
As an alternative to Textual Inversion, the web UI also provides training a hypernetwork (effectively an overlay for the last few layers of a model to re-tune it). This is very, very experimental, and I'm not finding success close to being comparable to Textual Inversion, so be aware that this is pretty much conjecture until I can nail some decent results.
I highly suggest waiting for more developments around training hypernetworks. If you want something headache free, stick to using a Textual Inversion. Despite most likely being overhyped, hypernetworks still seem promising for quality improvements and for anons with lower VRAM GPUs.
The very core concepts are the same for training one, with the main difference being the learning rate is very, very sensitive, and needs to be reduced as more steps are run. I've seen my hypernetworks quickly dip into some incoherent noise, and I've seen some slowly turn into some schizo's dream where the backgrounds and edges are noisy.
The official documentation lazily suggests a learning rate of either `0.000005` or `0.0000005`, but I find it to be inadequate. For the meantime, I suggest using `0.000000025` to get started. I'll provide a better value that makes use of the learning rate editing feature when I find a good range.
Caveats
Please, please, please be aware that training a hypernetwork also uses any embeddings from textual inversion. You will get false results if you use a hypernetwork trained with a textual inversion embedding. This is very easy to do if you have your hypernetwork named the same as an embedding you have, especially if you're using the `[name]` keyword in your training template.
You're free to use an embedding in your hypernetwork training, but some caveats I've noticed:
- any image generation without your embedding will get terrible output
- using a hypernetwork + embedding of the same concept doesn't seem to give very much of a difference, although my test was with an embedding I didn't have very great results from anyways
- if you wish to share your hypernetwork, and you in fact did train it with an embedding, it's important the very same embedding is included
- like embeddings, hypernetworks are still bound to the model you trained against. unlike an embedding, using this on a different model will absolutely not work.
I'm also not too sure whether you need to have a `[name]` token in your training template, as hypernetworks apply more on a model level than a token level.
Using the Hypernetwork
To be discovered later. As of now, you just have to go into Settings, scroll to the bottom, and select your newly trained hypernetwork in the dropdown.
I can assume that you do not need any additional keywords if you trained with a template that did not include the `[name]` keyword. I also feel like you don't need them even if you did, but I'll come back and edit my findings after I re-train a hypernetwork.
Using the Embedding
Using your newly trained embedding is as simple as putting the name of the file in the prompt. Before, you would need to signal to the prompt parser with `<token>`, but it seems now you do not. I don't know if still using `<>` has any bearing on output, but take note that you do not need it anymore.
Do not be discouraged if your initial output looks disgusting. I've found you need a nicely crafted prompt, and increasing the resolution a few notches will get something decent out of it. Play around with prompts in the thread, but I've found this one to finally give me decent output (credits to anon and anon for letting me shamelessly steal it for my perverted kemobara needs):
e621, explicit , by Pino Daeni, (chunie), wildlife photography, sharp details, <TOKEN>, solo, [:bulky:.6], detailed fur, hairy body, detailed eyes, penis, balls, <FLAVORS>
Negative prompt: blurry, out of frame, deformed, (bad anatomy), disfigured, bad hands, poorly drawn face, mutation, mutated, extra limb, amputee, messy, blurry, tiling, dark, human, text, watermark, copyright
Steps: 40, Sampler: Heun, CFG scale: 7, Seed: 1239293657, Size: 512x704, Model hash: 50ad914b
And an adjusted one of the above that I found to yield very tasteful results:
uploaded on e621, explict content, by [Pino Daeni:__e6_artist__:0.75] and [chunie:__e6_artist__:0.75], (photography, sharp details, detailed fur, detailed eyes:1.0), <TOKEN>, hairy body, <FLAVORS>
where `<TOKEN>` is the name of the embedding you used, `<FLAVORS>` are additional tags you want to put in, and `__e6_artist__` is used with the Wildcards third-party script (you can manually substitute them with other artists of your choosing for subtle nuances in your output).
Ordering really matters when it comes to your embedding, and so does the weight of your embedding. Too early in the prompt, and the weight for other terms will greatly fall off; too late in the prompt, and your embedding will lose its influence. Too much weight applied to your embedding, and you'll deepfry your output.
If you're using an embedding primarily focused on an artstyle, and you're also using an embedding trained on a subject, take great care in your weights on your additional embedding. Too much, even the smallest amount, and you'll destroy your style's embedding in the final output.
Lastly, when you do use your embedding, make sure you're using the same model you trained against. You can use embeddings on different models, and you'll definitely get usable results, but don't expect them to be stellar.
After Words
Despite being very wordy, I do hope that it's digestible and easy to process for even the most inexperienced of users. Everything in here is pretty much from my own observations and tests, so I can get (You), anon, closer to generating what you love.
Lastly, the following section has no bearing on training, but serves as a place to put my observations:
The Nature of Textual Inversion embeddings
I'm definitely no expert on this, and I could definitely just try and read the source code to confirm whether I'm right or wrong, but keep in mind this is just from my observations on training and using embeddings.
Textual Inversion embeddings serve as mini-"models" to extend a current one. When the prompt is parsed, the keyword taps into the embedding to figure out which tokens to pull from and their associated weights. Training is just figuring out the right tokens necessary to represent the source material. This is evident through:
- "vectors per token" consumes how many tokens from the prompt
- subjects that are easy to describe in a prompt (vintage white fur, a certain shape and colored glasses, eye color, fur shagginess, three toes, etc.) give far better results
- subjects that are nigh impossible to describe in a prompt (four ears, half are shaped one way, the other half another, middle eye, tusks, neckbeard tufts, etc. // brown fur, vintage white muzzle and chest marking) are very hard for an embedding to output
- using an embedding trained on a different model will still give the concepts that it was trained against (using an embedding of a species of animal will generate something somewhat reminiscent of a real live version of that species of animal)
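To make that concrete, here's a conceptual sketch (simplified, with made-up token ids; the 49408x768 table matches the SD 1.x CLIP text encoder) of how an embedding's learned vectors get spliced into the prompt's token embeddings:

```python
import torch

# Conceptual sketch only: an embedding is a small set of learned vectors that get
# spliced into the prompt's token embeddings wherever its keyword appears.
vocab_embeddings = torch.randn(49408, 768)   # frozen text-encoder token table (SD 1.x sizes)
learned = torch.randn(12, 768)               # a 12-vector Textual Inversion embedding

prompt_token_ids = [320, 1125, -1, 267]      # made-up ids; -1 stands in for the embedding's keyword
rows = []
for tok in prompt_token_ids:
    if tok == -1:
        rows.append(learned)                 # contributes 12 vectors, eating 12 token slots
    else:
        rows.append(vocab_embeddings[tok:tok + 1])
conditioning = torch.cat(rows)               # the per-token embeddings the text model then processes
print(conditioning.shape)                    # torch.Size([15, 768])
```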
Contrarily, hypernetworks are another variation of extending the model with a mini-"model". They apply to the entire model as a whole, rather than to tokens, allowing them to target a subsection of the model.