Added some notes regarding style training, refined some ambiguous nerd-language

master
mrq 2022-10-10 02:34:48 +07:00
parent 6d78379883
commit 57f2eb0cf8
2 changed files with 70 additions and 20 deletions

@@ -9,16 +9,27 @@ An up-to-date repo with all the necessary files can be found here: https://git.c
## Assumptions
This guide assumes the following basics:
* you have a working installation of Voldy's Web UI
* you already have the furry/yiffy models downloaded
* you can run a python script
You can also extend this to any other booru-oriented model, but you'll have to modify the pre-processing script according to the site the images were pulled from. The general concepts still apply.
## Glossary
Below is a list of terms, clarified. I notice I'll use some terms interchangeably with other concepts. These do not necessarily cover everything generally related to Stable Diffusion, but rather Textual Inversion and the terms I'll use that need disambiguation:
* `Textual Inversion`: the method of "training" your embedding; comparable to training a model, but not entirely accurate.
* `training`, `learning`: running Textual Inversion to improve your embedding
* `subject`: the character / object / noun you're trying to train against. For e621 (or another booru) applications, it's extremely likely a character. Textual Inversion excels at training against subjects.
* `style`: an artist's style. Textual Inversion can also incorporate subjects in a style.
* `source content/material`: the images you're using to train against; pulled from e621 (or another booru)
* `embedding`: the trained "model" of the subject or style in question. Calling the trained output a "model" would be wrong, as Textual Inversion isn't true training
## Preface
I've burnt through seven or so models trying to train two of my hazubandos, each try with different methods. I've found my third attempt to have very strong results, yet I don't recall exactly what I did to get it. My later subjects failed to yield such strong results, so your mileage will greatly vary depending on the subject/style you're training against.
What works for you will differ from what works for me, but do not be discouraged if output during training looks decent, yet real output in txt2img and img2img fails. Just try different, well-constructed prompts, change where you place your subject, and also try increasing the size a smidge (such as 512x704, or 704x512). I've thought I've had embeddings fail, when it just took some clever tweaking for decent output.
## Acquiring Source Material
@@ -30,7 +41,15 @@ If you're lacking material, the web UI's pre-processing tools to flip and split
If you would rather have finely-crafted material, you're more than welcome to manually crop and square images. A compromise for cropping an image is to expand the canvas size to square it off, then fill the new empty space with colors that crudely blend with the background, and crudely add color blobs to extend limbs outside the frame. It's not that imperative to do so, but it helps.
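If you'd rather script the canvas-expansion step, below is a minimal sketch using Pillow; filling with the top-left pixel is just one crude guess at "blending with the background", and the file paths are placeholders:
```
from PIL import Image

def pad_to_square(path_in, path_out):
    img = Image.open(path_in).convert("RGB")
    w, h = img.size
    side = max(w, h)
    # use the top-left pixel as a crude guess at the background color;
    # sampling more of the border would blend better, this is just the lazy version
    fill = img.getpixel((0, 0))
    canvas = Image.new("RGB", (side, side), fill)
    # center the original image on the new square canvas
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))
    canvas.save(path_out)

pad_to_square("./in/example.jpg", "./in/example_square.jpg")
```
The color blobs for limbs that run out of frame you'd still have to paint in by hand.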
### Source Material For A Style
The above tips also all apply to training a style, but some additional care needs to be taken:
***Avoid*** having a recurring subject. Textual Inversion excels at training against a recurring element, especially a subject. It's very easy for your embedding to associate with a particular character rather than a particular style. Minimize recurring subjects in your training material.
If you already have an embedding trained for a subject, and the artist you're training against has art that includes that character, use that character's trained embedding. I've found it gives very promising results during training, rather than trying to combine the two after the fact. It's very, very hard to get txt2img to generate an image using both a subject embedding and a style embedding without having to compromise one for the other.
Use the automatic pre-processing script in the web UI to flip and split your source material, as you don't have to focus on a particular subject for training. You can get very strong results by introducing style traits that aren't tied to a specific orientation.
## Pre-Processing Script
@@ -38,16 +57,19 @@ You are not required to actually run this, as this script is just a shortcut to
Included in the repo under [`./utils/renamer/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer) is a script for tagging images from e621 in the filename for later use in the web UI.
With little additional configuration needed, use the Python variant: `preprocess.py` (credits to [anon](https://boards.4chan.org/trash/thread/51463059#p51472156)). Just put your images in the `./utils/renamer/in/` folder, then run the script.
You can also have multiple variations of the same image, which is useful if you're splitting an image into multiple parts. For example, the following is valid:
```
ef331a09e313914aa0bcb2c5310660ec.jpg
aacb4870a669b0fc7e1ede0c1652fa8c (1).jpg // manually sliced top half of an image
aacb4870a669b0fc7e1ede0c1652fa8c (2).jpg // manually sliced bottom half of an image
554982d3498e67a50f768e6e18088072.jpg
554982d3498e67a50f768e6e18088072 (1).jpg // manually sliced left half of an image
554982d3498e67a50f768e6e18088072 (2).jpg // manually sliced right half of an image
00001-0 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
00001-1 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
00001-2 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
00001-3 554982d3498e67a50f768e6e18088072.jpg // automatically preprocessed image
```
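Every variant above still carries the original e621 md5 somewhere in its name, which is all the script needs to look the post back up. A small illustration of the idea (not the script itself; the patterns mirror the ones in `preprocess.py` shown at the bottom of this page):
```
import re

names = [
    "aacb4870a669b0fc7e1ede0c1652fa8c (1).jpg",
    "00001-0 554982d3498e67a50f768e6e18088072.jpg",
]
for name in names:
    # an md5 is 32 hex characters, either at the start of the name
    # or sitting right before the file extension
    m = re.match(r"^([a-f0-9]{32})", name) or re.search(r"([a-f0-9]{32})\.(jpe?g|png)$", name)
    if m:
        print(name, "->", m.group(1))
```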
The generalized procedure is as follows:
@@ -69,13 +91,21 @@ There's little safety checks or error messages, so triple check you have:
* a file called `tags.csv`, you can get it from [here](https://rentry.org/sdmodels#yiffy-e18ckpt-50ad914b) (Tag counts)
* patience, e621 allows a request every half second, so the script has to rate-limit itself (a sketch of this is below)
Clone [this repo](https://git.coom.tech/mrq/stable-diffusion-utils), open a command prompt/terminal at `./utils/renamer/`, and invoke it with `python3 preprocess.py`
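While it runs, the script simply sleeps between requests to stay under e621's limit. A rough sketch of the pattern, where `fetch_tags_for_md5` stands in for whatever actually queries the booru (the real script takes its delay from the `rateLimit` config value shown further down):
```
import time

RATE_LIMIT_MS = 500  # e621 allows roughly two requests per second

def fetch_all(md5_hashes, fetch_tags_for_md5):
    results = {}
    for md5 in md5_hashes:
        results[md5] = fetch_tags_for_md5(md5)  # whatever actually queries e621 for this hash
        time.sleep(RATE_LIMIT_MS / 1000.0)      # wait half a second before the next request
    return results
```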
### Tune-ables
You can also add in tags to be ignored in the filename, and adjust the character limit for the filename.
**!**TODO**!** Gut out Voldy's web UI's method of calculating tokens, and limit from there
I've yet to actually test this, but seeing that the prompt limit has been lifted for normal generation, I can assume it's also lifted for training. If you're feeling adventurous, you can adjust the character limit in the script to 240.
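If you do raise the limit, both tune-ables live in the config block near the top of `preprocess.py`; a sketch of what that might look like (the key for the ignored-tags list is an assumption here, check the script's actual config for its real name):
```
config = {
    # ...other settings left untouched...
    'filenameLimit': 240,  # raised from the default 192, assuming the lifted prompt limit also applies to training
    'filter': True,
    'filters': [ 'tag_to_ignore', 'another_tag' ],  # hypothetical ignore list
}
```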
### Caveats
There's some "bugs" with the script, be it limitations with interfacing with web UI, or oversights in processing tags:
* commas are dropped both to save on filename/token count, and because the web UI will drop them anyway. This shouldn't matter so much, as commas only give nuanced output when used, but for the strictest of users, be wary of this problem
* for tags with parentheses, such as `boxers_(clothing)` or `curt_(animal_crossing)`, the web UI will decide whatever it wants to do when processing the parentheses. I've seen some retain just the opening `(`, some just the closing `)`, and some with multiple dangling `)`. At the absolute worst, it'll leak and emphasize some tokens during training (a possible workaround is sketched after this list)
* species tags seem to not be included in the `tags.csv`, yet they OBVIOUSLY affect the output. I haven't taken close note of it, but your results may or may not improve if you manually tag your species, either in the template or the filenames (whether the """pedantic""" reddit taxonomy term like `ursid` that e621 uses, or the normal term like `bear`, is preferred is unknown).
* filtering out common tags like `anthro, human, male, female` could have negative effects when training either a subject or a style. I've definitely noticed I had to add negative terms for f\*moid parts, or else my hazubando will have a cooter that I need to inpaint some cock and balls over. I've also noticed, while training a style (one that has both anthros and humans), that a prompt associated with something anthro will generate something human. Just take note if you don't foresee yourself ever generating a human with an anthro embedding, or an anthro with a human embedding. (This also carries over to ferals, but I'm sure that can be assumed)
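For the parentheses issue above, one possible mitigation (not something the script currently does, just an idea) is to normalize the parentheses out of a tag before it lands in the filename, at the cost of slightly altering the tag:
```
def strip_parens(tag):
    # turns "curt_(animal_crossing)" into "curt_animal_crossing" so the prompt
    # parser has no stray parentheses to mangle into emphasis
    return tag.replace("(", "").replace(")", "")

print(strip_parens("boxers_(clothing)"))  # boxers_clothing
```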
## Training Prompt Template
@@ -93,7 +123,7 @@
```
uploaded on e621, [name] by motogen, [filewords]
uploaded on e621, [name] by oaks16, [filewords]
uploaded on e621, [name] by jumperbear, [filewords]
```
would theoretically help keep the embedding from "learning" the art style of your subject itself, but again, your mileage may vary, and I wouldn't use this first. I still need to run more tests comparing embeddings trained with one template over the other.
If you really want to be safe, you can add some flavor to the template like:
@@ -101,9 +131,16 @@
```
a photo of [name], uploaded on e621, [filewords]
an oil painting of [name], uploaded on e621, [filewords]
a picture of [name], uploaded on e621, [filewords]
```
I've yet to test results when training like that, so I don't have much anecdotal advice, but only use this if you're getting output with little variation between different prompts.
Once you've managed to bang out your training template, make sure to note where you put it to reference later in the UI.
### For Training A Style
A small adjustment is needed if you're training on a style. Your template will be:
```
uploaded on e621, by [name], [filewords]
```
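During training, the web UI substitutes `[name]` with your embedding's name and `[filewords]` with the tags baked into each image's filename, so a line from the style template above ends up looking roughly like this (embedding name and tags are made up):
```
uploaded on e621, by my-style-embedding, anthro male bear solo sitting beach
```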
## Training
@@ -121,7 +158,7 @@ Create your embedding to train on by providing:
* vectors per token
- this governs how much "data" can be trained to the token
- these do eat into how many tokens are left for the prompt; for example, setting this to 16 means you have 16 fewer tokens available for prompts
- a good range is 12 to 16, but the more you can afford the better. Given the recent change to the prompt limitation, you *could* easily just set this to 24 or 32 without worry, but I haven't personally tested the additional caveats that apply when going beyond the initial 75-token limit (see the quick arithmetic below).
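To put numbers on the trade-off: under the old 75-token cap, an embedding made with 16 vectors per token leaves roughly 75 - 16 = 59 tokens for the rest of the prompt, while one made with 32 leaves only about 43, so bigger embeddings squeeze out room for flavor tags.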
Click create, and the starting file will be created.
@@ -143,13 +180,21 @@ If you didn't pre-process your images with flipped copies, I suggest midway thro
Using your newly trained embedding is as simple as putting the name of the file in the prompt. Before, you would need to signal to the prompt parser with `<token>`, but it seems now you do not. I don't know if still using `<>` has any bearing on output, but take note that you no longer need it.
Do not be discouraged if your initial output looks disgusting. I've found you need a nicely crafted prompt, and increasing the resolution a few notches will get something decent out of it. Play around with prompts in the thread, but I've found this one to finally give me [decent output](https://desuarchive.org/trash/thread/51397474/#51400626) (credits to [anon](https://desuarchive.org/trash/thread/51387852/#51391540) and [anon](https://desuarchive.org/trash/thread/51397474/#51397741) for letting me shamelessly steal it for my perverted kemobara needs):
```
e621, explicit , by Pino Daeni, (chunie), wildlife photography, sharp details, <TOKEN>, solo, [:bulky:.6], detailed fur, hairy body, detailed eyes, penis, balls, <FLAVORS>
Negative prompt: blurry, out of frame, deformed, (bad anatomy), disfigured, bad hands, poorly drawn face, mutation, mutated, extra limb, amputee, messy, blurry, tiling, dark, human, text, watermark, copyright
Steps: 40, Sampler: Heun, CFG scale: 7, Seed: 1239293657, Size: 512x704, Model hash: 50ad914b
```
where `<TOKEN>` is the name of the embedding you used, and `<FLAVORS>` are additional tags you want to put in.
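For example, with an embedding named `kemobara-husbando` (a made-up name) and some extra scene tags as flavors, the first line becomes:
```
e621, explicit, by Pino Daeni, (chunie), wildlife photography, sharp details, kemobara-husbando, solo, [:bulky:.6], detailed fur, hairy body, detailed eyes, penis, balls, sitting on beach, sunset
```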
And an adjusted version of the above that I found to yield very tasteful results:
```
uploaded on e621, explicit content, by [Pino Daeni:__e6_artist__:0.75] and [chunie:__e6_artist__:0.75], (photography, sharp details, detailed fur, detailed eyes:1.0), <TOKEN>, hairy body, <FLAVORS>
```
where `<TOKEN>` is the name of the embedding you used, `<FLAVORS>` are additional tags you want to put in, and `__e6_artist__` is used with the Wildcards third-party script (you can manually substitute them with other artists of your choosing for subtle nuances in your output).
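If you haven't used the Wildcards script before: every `__name__` token gets replaced with a random line from a matching `name.txt` in the script's wildcards folder, so `__e6_artist__` assumes an `e6_artist.txt` along the lines of (names below are just pulled from the prompts in this guide):
```
Pino Daeni
chunie
motogen
oaks16
jumperbear
```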
Ordering ***really*** matters when it comes to your embedding, as does the weight of your embedding. Too early in the prompt, and the weight for other terms will greatly fall off; too late in the prompt, and your embedding will lose its influence. Too much weight applied to your embedding, and you'll deep-fry your output.
If you're using an embedding primarily focused on an art style, and you're also using an embedding trained on a subject, take great care with the weight on your additional embedding. Too much, even by the smallest amount, and you'll destroy your style embedding in the final output.
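As a minimal example of reining in a subject embedding next to a style embedding, using the same `(term:weight)` syntax the prompts above already use (both embedding names are made up):
```
uploaded on e621, by my-style-embedding, (my-husbando-embedding:0.8), solo, detailed fur, <FLAVORS>
```
Nudge that weight in small steps; as noted above, even a little too much will wash the style out.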
## After Words

@@ -14,7 +14,8 @@ config = {
    'cache': './cache.json', # JSON file of cached tags, will speed up processing if re-running
    'rateLimit': 500, # time to wait between requests, in milliseconds, e621 imposes a rate limit of 2 requests per second
    'filenameLimit': 192, # maximum characters to put in the filename, necessary to abide by filesystem limitations
    # you can set this to 250, as the web UI has uncapped the prompt limit, but I have yet to test whether that limit was also lifted for textual inversion
    'filter': True,
    # fill it with whatever tags you don't want making it into the filename
@@ -59,9 +60,13 @@ def parse():
        files.append(file)
    for i in range(len(files)):
        file = files[i]
        # try filenames like "83737b5e961b594c26e8feaed301e7a5 (1).jpg" (duplicated copies from a file manager)
        md5 = re.match(r"^([a-f0-9]{32})", file)
        if not md5:
            # try filenames like "00001-83737b5e961b594c26e8feaed301e7a5.jpg" (output from voldy's web UI preprocessing)
            md5 = re.search(r"([a-f0-9]{32})\.(jpe?g|png)$", file)
            if not md5:
                continue
        md5 = md5.group(1)
        print(f"[{(100.0 * i / len(files)):3.0f}%]: {md5}")