cleaned repo layout, added more hypernetwork notes

master
mrq 2022-10-12 18:35:05 +07:00
parent ea16688b5c
commit c0249cb3f8
11 changed files with 94 additions and 89 deletions

.gitignore

@@ -1,12 +1,13 @@
# ---> Node
utils/renamer/in/*.jpg
utils/renamer/in/*.png
utils/renamer/in/*.gif
utils/renamer/out/*.png
utils/renamer/out/*.jpg
utils/renamer/out/*.gif
utils/renamer/cache.json
images/downloaded/*.jpg
images/downloaded/*.png
images/downloaded/*.gif
images/tagged/*.png
images/tagged/*.jpg
images/tagged/*.gif
data/cache.json
package-lock.json

@@ -1,10 +1,12 @@
# Textual Inversion Guide w/ E621 Content
# Textual Inversion/Hypernetwork Guide w/ E621 Content
An up-to-date repo with all the necessary files can be found here: https://git.coom.tech/mrq/stable-diffusion-utils
**!**WARNING**!** **!**CAUTION**!** ***DO NOT POST THE REPO'S URL ON 4CHAN*** **!**CAUTION**!** **!**WARNING**!**
`coom.tech` is an automatic 30-day ban if posted. I am not responsible if you share that URL. Share the [rentry](https://rentry.org/sd-e621-textual-inversion/) instead.
`coom.tech` is an automatic 30-day ban if posted. I am not responsible if you share that URL. Share the [rentry](https://rentry.org/sd-e621-textual-inversion/) instead, as this is effectively a copy of the README.
This guide has been stitched together over the past few days from different trains of thought as I learn the ins and outs of effectively training concepts. Please keep this in mind if the guide seems to shift around a bit or sounds confusing. I intend to do a clean rewrite to make things more to-the-point and concise.
## Assumptions
@@ -13,7 +15,7 @@ This guide assumes the following basics:
* you already have the furry/yiffy models downloaded
* you can run a python script
You can also extend this into any other booru-oriented model, but you'll have to modify the pre-processing script according to the site images were pulled from. The general concepts still apply.
You can also extend this into any other booru-oriented model, but you'll have to modify the fetch and pre-processing scripts according to the site the images were pulled from. The general concepts still apply.
## Glossary
@@ -25,9 +27,10 @@ Below is a list of terms clarified. I notice I'll use some terms interchangeably
* `source content/material`: the images you're using to train against; pulled from e621 (or another booru)
* `embedding`: the trained "model" of the subject or style in question. "Model" would be wrong to call the trained output, as Textual Inversion isn't true training
* `hypernetwork`: a different way to train custom content against a model; almost all of the same principles here apply to hypernetworks
* `loss rate`: a calculated value determining how close the actual output is to the expected output. A value between `0.1` and `0.15` typically seems to be a good sign
* `epoch`: a term derived from typical neural network training
- normally, it's referred to as a full training cycle over your source material
- in this context, it's the above times the number of repeats per single image
- in this context, it's the above times the number of repeats per single image.
## Preface
@@ -35,9 +38,49 @@ I've burnt through seven or so models trying to train three of my hazubandos, ea
What works for you will differ from what works for me, but do not be discouraged if output during training looks decent but real output in txt2img and img2img fails. Just try different, well-constructed prompts, change where you place your subject, and also try increasing the size a smidge (such as 512x704, or 704x512). I've thought embeddings had failed when it just took some clever tweaking to get decent output.
This guide also aims to document the best way to go about training a hypernetwork. If you're not sure whether you should use an embedding or a hypernetwork:
### Are Hypernetworks Right For Me?
Hypernetworks are a different flavor of extending models. Where Textual Inversion trains the best representation of a concept to use during generation, a hypernetwork re-tunes the outer layers of a model to better fit what you want. However, hypernetworks aren't a magic bullet to replace Textual Inversion. Below are some pros and cons between the two:
* Embedding (Textual Inversion):
- Pros:
+ trained embeddings are very small, can even be embedded in the outputs that use them
+ excel really well at concepts you can represent in a prompt
+ easy to use, just put the keyword in the prompt, and when you don't want it, don't include it
+ *can* be used with little training if your concept is pretty simple
+ *can* be used in other models that it wasn't necessarily trained on
+ simple to train, the default learning rate is "just good enough"
- Cons:
+ take quite a bit of VRAM
+ takes quite a lot of time to train for fantastic results
+ grows in size the longer you train them (still pretty small)
+ consumes tokens in the prompt
+ *can* be used with other embeddings, but attributes usually leak between them
+ very sensitive in a prompt, usually needs to be placed in the right order and will break if the weights aren't "just right"
* Hypernetworks:
- Pros:
+ *can* work wonders on concepts you can't really represent in a prompt
+ (theoretically) works better for learning bigger concepts, like art styles, *certain* niches (fetishes), or species, but works fine on subjects
+ very quick to see results, and can get by with even fewer training steps, making it easier for anyone to train
+ (apparently) does not need very much VRAM to train, making it easier for anyone to train
+ appears to better generate trained concepts that an embedding has trouble generating; for example: `penis_through_fly`, as I've had terrible luck at best with getting that from an embedding
+ can train rapidly if you use an already-trained embedding in the training prompt, or if training on concepts the model is already well familiar with
- Cons:
+ fixed, large size of ~87MiB, will eat space during training with frequent copies
+ very, *very*, **very**, ***very*** sensitive to "high" learning rates; the rate will need to be adjusted during training
* remedied with a well-tuned set of stepping learning rates
+ quick to fry, will either slowly degrade in quality into a noisy mess, or rapidly turn into noise.
+ finicky to swap; you have to go into Settings to enable/disable it
+ can be very error-prone if you're using an embedding
+ very xenophobic to other models, as the weights greatly depend on the rest of the model
If you're still unsure, just stick with Textual Embeds for now. I don't have a concrete way of getting consistent training results with Hypernetworks at the moment. Hypernetworks seem to require less training overall, so it's not going to be too tough to swap over to train one.
## Acquiring Source Material
The first step of training against a subject (or art style) is to acquire source content. Hugging Face's instructions specify having three to five images, cropped to 512x512, but there's no hard upper limit on how many, nor does having more images have any bearings on the final output size or performance. However, the more images you use, the harder it'll take for it to converge (despite convergence in typical neural network model training means overfitment).
The first step of training against a subject (or art style or concept) is to acquire source content. Hugging Face's instructions specify having three to five images, cropped to 512x512, while the web UI just requires a 1:1 square image, but there's no hard upper limit on how many, nor does having more images have any bearing on the final output size or performance. However, the more images you use, the longer it'll take to converge (even though, in typical neural network training, converging on your training set tends to mean overfitting).
I cannot imagine a scenario where you should stick with low image counts, such as selecting from a pool and pruning for the "best of the best". If you can get lots of images, do it. While the test outputs during training may look better with a smaller pool, when it comes to real image generation, embeddings from big image pools (140-190) yielded far better results than later embeddings trained on pools half that size (50-100).
@@ -49,9 +92,9 @@ Lastly, for Textual Inversion, your results will vary greatly depending on the c
### Fetch Script
If you want to accelerate your ~~scraping~~ content acquisition, consult the fetch script under [`./utils/renamer/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer/). It's a """simple but powerful""" script that can ~~scrape~~ download from e621 given a search query.
If you want to accelerate your ~~scraping~~ content acquisition, consult the fetch script under [`./src/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/src/). It's a """simple but powerful""" script that can ~~scrape~~ download from e621 given a search query.
All you simply need to do is invoke the script with `python3 fetch.py "search query"`. For example: `python3 fetch.py "zangoose -female score:>0"`.
All you need to do is invoke the script with `python3 ./src/fetch.py "search query"`. For example: `python3 ./src/fetch.py "zangoose -female score:>0"`.
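If you're curious what the fetch script does under the hood, or need to adapt it to another booru, here's a rough sketch of the general idea. This is **not** the actual `./src/fetch.py`; the endpoint, response fields, and user agent below are assumptions based on e621's public API and this repo's folder defaults, so consult the real script before relying on any of it.

```
# Hypothetical sketch of the fetch flow, NOT the repo's actual ./src/fetch.py.
# Assumes e621's posts.json endpoint and the ./images/downloaded/ output folder.
import json
import sys
import time
import urllib.parse
import urllib.request

query = sys.argv[1] if len(sys.argv) > 1 else "zangoose -female score:>0"
params = urllib.parse.urlencode({"tags": query, "limit": 320})
req = urllib.request.Request(
    "https://e621.net/posts.json?" + params,
    headers={"User-Agent": "sd-utils-example/1.0"},  # e621 wants a real user agent
)
posts = json.loads(urllib.request.urlopen(req).read())["posts"]

for post in posts:
    file = post["file"]
    if not file.get("url"):
        continue  # some posts hide their file URL from anonymous requests
    dest = f"./images/downloaded/{file['md5']}.{file['ext']}"
    urllib.request.urlretrieve(file["url"], dest)
    time.sleep(0.5)  # e621 asks for no more than ~2 requests per second
```

Adapting this to another booru is mostly a matter of swapping the URL and the JSON fields you read.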
### Source Material For A Style
@@ -67,7 +110,7 @@ Use the automatic pre-processing script in the web UI to flip and split your sou
You are not required to actually run this, as this script is just a shortcut to manually renaming files and curating the tags, but it cuts out the bulk of the work.
Included in the repo under [`./utils/renamer/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer) is a script for tagging images from e621 in the filename for later user in the web UI.
Included in the repo under [`./src/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/src/) is a script for tagging images from e621 in the filename for later use in the web UI.
You can also have multiple variations of the same image, which is useful if you're splitting an image into multiple parts. For example, the following is valid:
```
@@ -94,18 +137,17 @@ The generalized procedure is as follows:
* yank out the artist and content rating, and prepend them to the list of tags
* copy the source file with the name being the processed list of tags
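To make the above procedure concrete, here's a minimal sketch of the renaming step. It is **not** the actual preprocess script; the `tags.csv` layout, tag separator, and filename limit here are assumptions based on the repo's config defaults, so consult `./src/preprocess.py` for the real behavior.

```
# Minimal sketch of the tag-to-filename step, NOT the actual ./src/preprocess.py.
# Assumes ./data/tags.csv lists the model's known tags in its first column.
import csv
import shutil

def load_known_tags(path="./data/tags.csv"):
    with open(path, newline="") as f:
        return {row[0] for row in csv.reader(f)}

def build_filename(artist, rating, tags, known, limit=245):
    kept = [t for t in tags if t in known]  # drop tags the model wasn't trained on
    name = f"uploaded on e621, {rating} content, by {artist}, " + " ".join(kept)
    return name[:limit]  # stay under filesystem filename limits

known = load_known_tags()
name = build_filename("someartist", "explicit", ["anthro", "solo", "standing"], known)
# shutil.copy("./images/downloaded/<md5>.png", f"./images/tagged/{name}.png")
```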
Additional information about the scripts can be found under the README under [`./utils/rename/README.md`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer/).
### Pre-Requisites
There are few safety checks or error messages, so triple-check you have:
* the below script in any folder of your choosing
* a folder called `in`, filled with images from e621 with the filenames intact (should be 32 alphanumeric characters of "gibberish")
* a folder called `out`, where you'll get copies of the files fron `in`, but with the tags in the filename
* a file called `tags.csv`, you can get it from [here](https://rentry.org/sdmodels#yiffy-e18ckpt-50ad914b) (Tag counts)
* patience, e621 allows a request every half second, so the script has to rate limit itself
Clone [this repo](https://git.coom.tech/mrq/stable-diffusion-utils), open a command prompt/terminal at `./utils/renamer/`, and invoke it with `python3 preprocess.py`
* downloaded/cloned [this repo](https://git.coom.tech/mrq/stable-diffusion-utils)
* open a command prompt/terminal where you downloaded/cloned this repo
* fill the `./images/downloaded/` folder with the images you want to use
- if you're manually supplying your images, make sure they retain the original filenames from e621
- if you're using the fetch script, no additional step is needed
- if you intend to use the web UI's preprocessing functions (flip/split), do that now, empty this folder, then move the files back into this folder
* invoke it with `python3 ./src/preprocess.py`
* you're done!
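Since the script gives you so little feedback, a quick hypothetical sanity check like the one below (not part of the repo) can save you a headache before invoking it:

```
# Hypothetical pre-flight check, not part of the repo's scripts.
import os
import sys

required = ["./images/downloaded/", "./images/tagged/", "./data/tags.csv"]
missing = [path for path in required if not os.path.exists(path)]
if missing:
    sys.exit("missing: " + ", ".join(missing))
if not os.listdir("./images/downloaded/"):
    sys.exit("no source images found in ./images/downloaded/")
print("layout looks sane, run: python3 ./src/preprocess.py")
```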
Consult the script if you want to adjust its behavior. I tried my best to explain what each setting does and to make them easy to edit.
@@ -153,7 +195,6 @@ a picture of [name], uploaded on e621, [filewords]
```
I've yet to test results when training like that, so I don't have much anecdotal advice, but only use this if you're getting output with little variation between different prompts.
### For Training A Style
A small adjustment is needed if you're training on a style. Your template will be:
@@ -161,9 +202,11 @@ A small adjustment is needed if you're training on a style. Your template will b
uploaded on e621, by [name], [filewords]
```
I'm not quite clear on the difference made by including the `by`, but the yiffy model was trained on an `uploaded on e621, [rating] content, by [artist], [tags]` format, so we can only get so close to the master-trained format.
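For reference, this is roughly how I understand the web UI expands the template during training: `[name]` becomes the name you gave the embedding/hypernetwork and `[filewords]` becomes the tags parsed out of the image's filename. The snippet below only illustrates that substitution; the names and tags are made up.

```
# Illustration of template expansion as I understand it; names/tags are made up.
template = "uploaded on e621, by [name], [filewords]"
name = "my-artist-style"  # the name you gave the embedding/hypernetwork
filewords = "explicit content, anthro, solo, standing"  # parsed from the filename
prompt = template.replace("[name]", name).replace("[filewords]", filewords)
print(prompt)
# uploaded on e621, by my-artist-style, explicit content, anthro, solo, standing
```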
## Training
Now that everything is set up, it's time to start training. For systems with adequate enough VRAM, you're free to run the web UI with `--no-half --precision full` (whatever "adequate entails"). You'll take a very slight performance hit, but quality improves barely enough I was able to notice.
Now that everything is set up, it's time to start training. For systems with "enough" VRAM (I don't have a number on what counts as adequate), you're free to run the web UI with `--no-half --precision full`. You'll take a very slight performance hit, but quality improves just barely enough for me to notice. The xformers feature seems to get disabled during training, but it appears to make preview generations faster, so don't worry about getting xformers configured.
Make sure you're using the correct model you want to train against, as training uses the currently selected model.
@@ -192,11 +235,11 @@ Next, under the `Train` sub-tab:
* `dataset directory`: pass in the path to the folder of your source material to train against
* `log directory`: player preference, the default is sane enough
* `prompt template file`: put in the path to the prompt file you created earlier. If you put it in the same folder as the web UI's default prompts, you only need to change the filename portion of the default path
* `width` and `height`: I assume this determines the size of the image to generate when requested, I'd leave it to the default 512x512 for now
* `width` and `height`: I assume these determine the size of the image to generate when requested. Or they could actually allow training at different aspect ratios. I'd leave them at the default 512x512 for now.
* `max steps`: adjust how many steps you want training to run before terminating. Paperspace seems to let me do ~70000 on an A6000 before shutting down after 6 hours. An 80GB A100 will let me get shy of the full 100000 before auto-shutting down after 6 hours.
* `epoch length`: this value (*allegedly*) governs learning rate correction during training by defining how long an epoch is. For larger training sets, you would want to decrease this. I don't see any difference from it at the moment.
* `save an image/copy`: the last two values are creature comforts and have no real effect on training, values are up to player preference.
* `preview prompt`: the prompt to use for the preview training image. if left empty, it'll use the last prompt used for training. it's useful for accurately measuring coherence between generations.
* `save an image/copy`: these two values are creature comforts and have no real effect on training, values are up to player preference.
* `preview prompt`: the prompt to use for the preview training image. If left empty, it'll use the last prompt used for training. It's useful for accurately measuring coherence between generations. I highly recommend using this with a prompt you want to use later. It takes the same `[name]` and `[filewords]` keywords passed through to the template
Afterwards, hit `Train Embedding`, and wait and watch your creation come to life.
@@ -206,17 +249,17 @@ Lastly, if you're training this on a VM in the "cloud", or through the shared gr
### For Training a Hypernetwork
As an alternative to Textual Inversion, the web UI also provied training a hypernetwork (effectively an overlay for the last few layers of a model to re-tune it). This is very, very experimental, and I'm not finding success close to being comparable to Textual Inversion, so be aware that this is pretty much conjecture until I can nail some decent results.
As an alternative to Textual Inversion, the web UI also provides training a hypernetwork. This is very, very experimental, and I'm not finding success close to being comparable to Textual Inversion, so be aware that this is pretty much conjecture until I can nail some decent results.
I ***highly*** suggest waiting for more developments around training hypernetworks. If you want something headache-free, stick to using Textual Inversion. Despite most likely being overhyped, hypernetworks still seem promising for quality improvements and for anons with lower-VRAM GPUs.
The very core concepts are the same for training one, with the main difference being that the learning rate is very, very sensitive and needs to be reduced as more steps are run. I've seen my hypernetworks quickly dip into incoherent noise, and I've seen some slowly turn into some schizo's dream where the backgrounds and edges are noisy.
The official documentation lazily suggests a learning rate of either `0.000005` or `0.0000005`, but I find it to be inadequate. For the mean time, I suggest using `0.000000025` to get started. I'll provide a better value that makes use of the learning rate editing feature when I find a good range.
The official documentation lazily suggests a learning rate of either `0.000005` or `0.0000005`, but I find it to be inadequate. For the time being, I suggest starting with `0.000000025` if you're fine with babysitting it, or use the stepped schedule `0.000005:2500,0.0000025:20000,0.0000001:30000,0.000000075:-1`.
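As I understand it, the stepped schedule syntax is a comma-separated list of `rate:step` pairs, where each rate is used until training passes that step and `-1` means "until the end". A rough sketch of how such a schedule resolves (this is an approximation, not the web UI's actual parser):

```
# Rough sketch of how a stepped learning rate schedule resolves; an
# approximation, not the web UI's actual parser.
def lr_at_step(schedule: str, step: int) -> float:
    last_rate = 0.0
    for chunk in schedule.split(","):
        rate, _, until = chunk.partition(":")
        last_rate = float(rate)
        if until.strip() in ("", "-1") or step <= int(until):
            return last_rate
    return last_rate  # past the final boundary, keep the last rate

schedule = "0.000005:2500,0.0000025:20000,0.0000001:30000,0.000000075:-1"
print(lr_at_step(schedule, 1000))   # 5e-06
print(lr_at_step(schedule, 25000))  # 1e-07
```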
#### Caveats
Please, please, ***please*** be aware that training a hypernetwork also uses any embeddings from textual inversion. You ***will*** get false results if you use a hypernetwork trained with a textual inversion embedding. This is very easy to do if you have your hypernetwork named the same as an embedding you have, especially if you're using the `[name]` keyword in your training template.
Please, please, ***please*** be aware that training a hypernetwork also uses any embeddings from textual inversion. You ***will*** get misrepresented results if you use a hypernetwork trained with a textual inversion embedding. This is very easy to do if you have your hypernetwork named the same as an embedding you have, especially if you're using the `[name]` keyword in your training template.
You're free to use an embedding in your hypernetwork training, but some caveats I've noticed:
* any image generation without your embedding will get terrible output
@@ -224,13 +267,11 @@ You're free to use a embedding in your hypernetwork training, but some caveats I
* if you wish to share your hypernetwork, and you in fact did train it with an embedding, it's important the very same embedding is included
* like embeddings, hypernetworks are still bound to the model you trained against. unlike an embedding, using this on a different model will absolutely not work.
I'm also not too keen whether you need to have a `[name]` token in your training template, as hypernetworks apply more on a model level than a token level.
### Using the Hypernetwork
I'm also not too sure whether you need to have a `[name]` token in your training template, as hypernetworks apply more at the model level than at the token level. I suggest leaving the training template alone and keeping it in.
To be discovered later. As of now, you just have to go into Settings, scroll at the bottom, and select your newly trained hypernetwork in the dropdown.
#### Using the Hypernetwork
I can *assume* that you do not need any additional keywords if you trained with a template that did not include the `[name]` keyword. I also *feel* like you don't need them even if you did, but I'll come back and edit my findings after I re-train a hypernetwork.
It's as simple as selecting it under Settings in the Hypernetworks drop-down box. Hit save after selecting. Afterwards, happy prompting, happy fapping.
## Using the Embedding
@@ -268,6 +309,7 @@ Textual Inversion embeddings serve as mini-"models" to extend a current one. Whe
* "vectors per token" consumes how many tokens from the prompt
* subjects that are easy to describe in a prompt (vintage white fur, a certain shape and colored glasses, eye color, fur shagginess, three toes, etc.) give far better results
* subjects that are nigh impossible to describe in a prompt (four ears, half are shaped one way, the other half another, middle eye, tusks, neckbeard tufts, etc. // brown fur, vintage white muzzle and chest marking) are *very* hard for an embedding to output
* attributes associated with the embedding can leak onto other subjects in the output; for example: `[...] <TOKEN> and anthro cat [...]` will give you two of your trained subject with cat attributes. Whether this is more of a side-effect of Textual Inversion itself, or a symptom of attribute leaking in general with how the web UI parses prompts, is unknown.
* using an embedding trained on a different model will still give the concepts that it was trained against (using an embedding of a species of animal will generate something somewhat reminiscent of a real live version of that species of animal)
Contrarily, hypernetworks are another variation of extending the model with another mini-"model". They apply to the entire model as whole, rather than tokens, allowing it to target a subsection of the model.
Contrarily, hypernetworks are another variation of extending the model with another mini-"model". They apply to the last outer layers as a whole, allowing you to effectively re-tune the model.
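To give a rough mental model of what "re-tune the last outer layers" means: as far as I can tell, a hypernetwork in the web UI is a set of small extra networks that transform the context going into the key/value projections of the model's cross-attention layers, leaving the base weights untouched. A conceptual sketch (not the web UI's actual implementation) is below.

```
# Conceptual sketch only; NOT the web UI's actual hypernetwork implementation.
import torch.nn as nn

class HypernetworkModule(nn.Module):
    """A small residual MLP applied to the context fed into a cross-attention
    layer's key/value projections."""
    def __init__(self, dim: int, mult: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult),
            nn.ReLU(),
            nn.Linear(dim * mult, dim),
        )

    def forward(self, x):
        # residual, so an untrained module starts out close to the identity
        return x + self.net(x)

# Conceptually, inside cross-attention:
#   k = to_k(hyper_k(context))
#   v = to_v(hyper_v(context))
# where hyper_k / hyper_v are HypernetworkModule instances trained per layer.
```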


@@ -41,8 +41,8 @@ let config = {
query: ``, // example query if no argument is passed, kept empty so the script can scream at you for not having it tagged
output: `./in/`, // directory to save your files
cache: `./cache.json`, // JSON file of cached tags, will speed up processing when used for the renamer script
output: `./images/downloaded/`, // directory to save your files
cache: `./data/cache.json`, // JSON file of cached tags, will speed up processing when used for the renamer script
limit: 320, // how many posts to pull in one go
concurrency: 4, // how many file requests to keep in flight at the same time

@@ -46,8 +46,8 @@ config = {
'query': '', # example query if no argument is passed, kept empty so the script can scream at you for not having it tagged
'output': './in/', # directory to save your files
'cache': './cache.json', # JSON file of cached tags, will speed up processing when used for the renamer script
'output': './images/downloaded/', # directory to save your files
'cache': './data/cache.json', # JSON file of cached tags, will speed up processing when used for the renamer script
'limit': 320, # how many posts to pull in one go

@@ -2,10 +2,10 @@ let FS = require("fs")
let Fetch = require("node-fetch")
let config = {
input: `./in/`, // files to process
output: `./out/`, // files to copy files to
tags: `./tags.csv`, // csv of tags associated with the yiffy model (replace for other flavor of booru's taglist associated with the model you're training against)
cache: `./cache.json`, // JSON file of cached tags, will speed up processing if re-running
input: `./images/downloaded/`, // files to process
output: `./images/tagged/`, // files to copy files to
tags: `./data/tags.csv`, // csv of tags associated with the yiffy model (replace for other flavor of booru's taglist associated with the model you're training against)
cache: `./data/cache.json`, // JSON file of cached tags, will speed up processing if re-running
rateLimit: 500, // time to wait between requests, in milliseconds, e621 imposes a rate limit of 2 requests per second
filenameLimit: 243, // maximum characters to put in the filename, necessary to abide by filesystem limitations, and to "limit" token count for the prompt parser

@@ -9,10 +9,10 @@ import math
import urllib.request
config = {
'input': './in/', # files to process
'output': './out/', # files to copy files to
'tags': './tags.csv', # csv of tags associated with the yiffy model (replace for other flavor of booru's taglist associated with the model you're training against)
'cache': './cache.json', # JSON file of cached tags, will speed up processing if re-running
'input': './images/downloaded/', # files to process
'output': './images/tagged/', # files to copy files to
'tags': './data/tags.csv', # csv of tags associated with the yiffy model (replace for other flavor of booru's taglist associated with the model you're training against)
'cache': './data/cache.json', # JSON file of cached tags, will speed up processing if re-running
'rateLimit': 500, # time to wait between requests, in milliseconds, e621 imposes a rate limit of 2 requests per second
'filenameLimit': 245, # maximum characters to put in the filename, necessary to abide by filesystem limitations

@@ -1,33 +0,0 @@
# E621 Scripts
Included are the utilities provided for ~~scraping~~ acquiring your source content to train on.
If you're targeting another booru, the same principles apply, but you'll need to adjust your repo URL and processing your booru's JSON output. Doing so is left as an exercise to the reader.
Lastly, feature parity between the two scripts may not be up to par, as I'm a sepples programmer, not a Python dev. The initial `preprocess.py` was graciously written by an anon, and I've cobbled together the `fetch.py` one myself. The node.js version will definitely have more features, as I'm better at node.js.
## Dependencies
The python scripts have no additional dependencies, while the node.js scripts require running `npm install node-fetch@2` (v2.x because I'm old and still using `require` for my includes).
## Fetch
This script is responsible for ~~scraping~~ downloading from e621 all requested files for your target subject/style.
To run, simply invoke the script with `python fetch.py [search query]`. For example: `python fetch.py "kemono -dog"` to download all non-dog posts tagged as kemono.
In the script are some tune-ables, but the defaults are sane enough not to require any additional configuration.
If you're using another booru, extending the script to support your booru of choice is easy, as the script was configured to allow for additional booru definitions. Just reference the provided one for e621 if you need a starting point.
The python script is nearly at feature-parity with the node.js script, albeit missing the concurrency option. Please understand, not a Python dev.
## Pre-Process
The bread and butter of this repo is the preprocess script, responsible for associating your images from e621 with tags to train against during Textual Inversion.
The output from the fetch script seamlessly integrates with the inputs for the preprocess script. The `cache.json` file should also have all the necessary tags to further accelerate this script.
For the python version, simply place your source material into the `./in/` folder, invoke the script with `python3 preprocess.py`, then get your processed files from `./out/`. For the node.js version, do the same thing, but with `node preprocess.js`.
This script *should* gracefully support files already pre-processed through the web UI, as long as they were processed with their original filenames (the MD5 hash booru filenames).

@@ -1,5 +0,0 @@
{
"dependencies": {
"node-fetch": "^2.6.7"
}
}