From ea16688b5cfefac5c2f18a38877fad7712a6e60d Mon Sep 17 00:00:00 2001
From: mrq
Date: Wed, 12 Oct 2022 15:53:12 +0000
Subject: [PATCH] slight fix, added more notes on hypernetworks

---
 README.md                   | 71 +++++++++++++++++++++++--------------
 utils/renamer/README.md     |  2 +-
 utils/renamer/preprocess.js |  7 ++--
 utils/renamer/preprocess.py |  5 +--
 4 files changed, 53 insertions(+), 32 deletions(-)

diff --git a/README.md b/README.md
index f09584d..2a09a45 100755
--- a/README.md
+++ b/README.md
@@ -31,13 +31,12 @@ Below is a list of terms clarified. I notice I'll use some terms interchangably
 
 ## Preface
 
-I've burnt through seven or so models trying to train two of my hazubandos, each try with different methods. I've found my third attempt to have very strong results, yet I don't recall exactly what I did to get it. My later subjects failed to yield such strong results, so your mileage will greatly vary depending on the subject/style you're training against.
+I've burnt through seven or so models trying to train three of my hazubandos, each try with different methods. I've found my third attempt to have very strong results, yet I don't recall exactly what I did to get it. My later subjects failed to yield such strong results, so your mileage will greatly vary depending on the subject/style you're training against.
 
 What works for you will differ from what works for me, but do not be discouraged if output during training looks decent, but real output in txt2img and img2img fails. Just try different, well constructed prompts, change where you place your subject, and also try and increase the size a smidge (such as 512x704, or 704x512). I've thought I've had embeddings failed, when it just took some clever tweaking for decent output.
 
 ## Acquiring Source Material
-
 
 The first step of training against a subject (or art style) is to acquire source content. Hugging Face's instructions specify having three to five images, cropped to 512x512, but there's no hard upper limit on how many, nor does having more images have any bearings on the final output size or performance. However, the more images you use, the harder it'll take for it to converge (despite convergence in typical neural network model training means overfitment).
 
 I cannot imagine a scenario where you should stick with low image counts, such as selecting from a pool and pruning for the "best of the best". If you can get lots of images, do it. While it may appear the test outputs during training looks better with a smaller pool, when it comes to real image generation, embeddings from big image pools (140-190) yieled far better results over later embeddings trained on half the size of the first one (50-100).
@@ -46,8 +45,14 @@ If you're lacking material, the web UI's pre-processing tools to flip and split
 
 If you rather would have finely-crafted material, you're more than welcome to manually crop and square images. A compromise for cropping an image is to expand the canvas size to square it off, and then fill the new empty space with colors to crudely blend with the background, and crudly adding color blobs to expand limbs outside the frame. It's not that imperative to do so, but it helps.
 
+Lastly, for Textual Inversion, your results will vary greatly depending on the character you're trying to train against. A character with features you could easily describe in a prompt will yield good results, while characters with hard- or impossible-to-describe attributes will make it very tough for the embedding to learn and replicate.
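On the note of squaring images off: if you'd rather script that step than fiddle with it by hand, a rough sketch with Pillow (untested, the helper name and paths are made up) could look something like this:
```
# Rough sketch: fit an image onto a square canvas, filling the new space with
# the image's average color as a crude stand-in for blending with the background.
# Assumes Pillow is installed; tweak paths and size to taste.
from PIL import Image, ImageOps, ImageStat

def square_off(src: str, dst: str, size: int = 512) -> None:
    img = Image.open(src).convert("RGB")
    fill = tuple(int(c) for c in ImageStat.Stat(img).mean)  # average color of the image
    ImageOps.pad(img, (size, size), color=fill).save(dst)   # scale to fit, pad the rest

square_off("./in/some_image.png", "./out/some_image.png")   # hypothetical file names
```
`ImageOps.pad` also scales the image to fit the square, which shouldn't matter much since you're targeting 512x512 anyways.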
+
+### Fetch Script
+
 If you want to accelerate your ~~scraping~~ content acquisition, consult the fetch script under [`./utils/renamer/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer/). It's a """simple but powerful""" script that can ~~scrape~~ download from e621 given a search query.
+
+All you need to do is invoke the script with `python3 fetch.py "search query"`. For example: `python3 fetch.py "zangoose -female score:>0"`.
 
 ### Source Material For A Style
 
 The above tips all also apply to training a style, but some additional care needs to be taken:
@@ -102,31 +107,37 @@ There's little safety checks or error messages, so triple check you have:
 
 Clone [this repo](https://git.coom.tech/mrq/stable-diffusion-utils), open a command prompt/terminal at `./utils/renamer/`, and invoke it with `python3 preprocess.py`
 
-### Tune-ables
-
-You can also add in tags to be ignored in the filename, and adjust the character limit for the filename.
-
-I've yet to actually test this, but seeing that the prompt limit has been lifted for normal generation, I can assume it's also lifted for training. If you're feeling adventurous, you can adjust the character limit in the script to 240.
+Consult the script if you want to adjust its behavior. I tried my best to explain what each setting does, and to make them easy to edit.
 
 ### Caveats
 
 There's some "bugs" with the script, be it limitations with interfacing with web UI, or oversights in processing tags:
-* commas are dropped both to save on filename/token count, and because the web UI will drop them anyways. This shouldn't matter so much, as commas only gives nuanced output when used, but for the strictest of users, be wary of this problem
-* tags with parentheses, such as boxers_(clothing), or curt_(animal_crossing), the web UI will decide whatever it wants to when it comes to processing parentheses. I've seen some retain just the opening `(`, and some just the closing `)`, and some with multiple dangling `)`. At the absolute worst, it'll leak and emphasis some tokens during training.
-* Species tags seemed to not be included in the `tags.csv`, yet they OBVIOUSLY affect the output. I haven't taken close note of it, but your results may or may not improve if you manually tag your species, either in the template or the filenames (whether the """pedantic""" reddit taxonomy term like `ursid` that e621 uses or the normal term like `bear` is prefered is unknown).
+* commas do not carry over to the training prompt, as this is a matter of how the web UI re-assembles the tokens passed from the prompt template/filename. There's functionally no difference between using `,` or ` ` as your delimiter in this preprocess script.
+* for tags with parentheses, such as `boxers_(clothing)` or `curt_(animal_crossing)`, the web UI will decide whatever it wants to do when it comes to processing the parentheses. The script can overcome this problem by simply removing anything in parentheses, as you can't really escape them in the filename without editing the web UI's script.
+* Species tags seemed to not be included in the `tags.csv`, yet they OBVIOUSLY affect the output. I haven't taken close note of it, but your results may or may not improve if you manually tag your species, either in the template or the filenames (whether the """pedantic""" reddit taxonomy term like `ursid` that e621 uses or the normal term like `bear` is preferred is unknown). The pre-process script will include them by default, but be warned that it will include any of the pedantic species tags (stuff like `suina sus boar pig`).
 * filtering out common tags like `anthro, human, male, female`, could have negative effects with training either a subject or a style. I've definitely noticed I had to add negative terms for f\*moid parts or else my hazubando will have a cooter that I need to inpaint some cock and balls over. I've also noticed during training a style (that both has anthros and humans), a prompt associated with something anthro will generate something human. Just take notice if you don't foresee yourself ever generating a human with an anthro embedding, or anthro with a human embedding. (This also carries to ferals, but I'm sure that can be assumed)
+* the more images you use, the longer it will take for the web UI to load and process them, and presumably the more VRAM you'll need. 200 images isn't too bad, but 9000 will take 10 minutes to load on an A100-80G.
 
 ## Training Prompt Template
 
-The final piece of the puzzle is providing a decent template to train against. Under `./stable-diffusion-webui/textual_inversion_templates/` are text files for these templates. The Web UI provides rudimentary keywords (\[name\] and \[filewords\]) to help provide better crafted prompts used during training. The pre-processing script handles the \[filewords\] requirement, while \[name\] will be where you want the embedding's name to plop in the prompt.
+The final piece of the puzzle is providing a decent template to train against. Under `./stable-diffusion-webui/textual_inversion_templates/` are text files for these templates. The Web UI provides rudimentary keywords (`[name]` and `[filewords]`) to help provide better-crafted prompts used during training. The pre-processing script handles the `[filewords]` requirement, while `[name]` will be where you want the embedding's name to plop in the prompt.
 
 The ~~adequate~~ ***recommended*** starting point is simply:
-
 ```
 uploaded on e621, [name], [filewords]
 ```
+or for the pedantic:
+```
+uploaded on e621, [filewords], [name]
+```
 
-I've had decent results with just that for training subjects. I've had mixed results with expanding that by filling in more artists to train against, for example:
+I've had decent results with just the first one for training subjects. I imagine the second, more pedantic one can help too, but it places your training token at the very end. It's arguably a bit *more* correct, as I can rarely ever have my trained token early in the prompt without it compromising other elements.
+
+Once you've managed to bang out your training template, make sure to note where you put it so you can reference it later in the UI.
+
+### Alternative Training Prompt Templates
+
+I've had mixed results with expanding that by filling in more artists to train against, for example:
 ```
 uploaded on e621, [name] by motogen, [filewords]
 uploaded on e621, [name] by oaks16, [filewords]
@@ -142,7 +153,6 @@ a picture of [name], uploaded on e621, [filewords]
 ```
 
 I've yet to test results when training like that, so I don't have much anecdotal advice, but only use this if you're getting output with little variation between different prompts.
-Once you've managed to bang out your training template, make sure to note where you put it to reference later in the UI.
 
 ### For Training A Style
 
@@ -157,9 +167,9 @@ Now that everything is set up, it's time to start training.
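Before you kick off training, it doesn't hurt to sanity-check what your template will actually expand to for a given image. Roughly, as I understand it (this is just a sketch of the idea, not the web UI's actual code), `[name]` gets swapped for your embedding's name and `[filewords]` for the tags sitting in the image's filename:
```
# Toy illustration of how the template keywords get filled in (my understanding,
# not the web UI's implementation; the names and filename here are made up).
import os, re

def expand(template: str, embedding_name: str, image_filename: str) -> str:
    stem = os.path.splitext(os.path.basename(image_filename))[0]
    stem = re.sub(r"^[\d-]+\s*", "", stem)  # strip any leading index number on the filename
    return template.replace("[name]", embedding_name).replace("[filewords]", stem)

print(expand("uploaded on e621, [name], [filewords]",
             "my-character",
             "00012-0 anthro male zangoose solo standing.png"))
# uploaded on e621, my-character, anthro male zangoose solo standing
```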
 For systems with ade
 Make sure you're using the correct model you want to train against, as training uses the currently selected model.
 
-Run the Web UI, and click the `Textual Inversion` tab.
+Run the Web UI, and click the `Training` sub-tab.
 
-Create your embedding to train on by providing:
+Create your embedding to train on by providing the following under the `Create embedding` sub-tab:
 
 * a name - can be changed later, it's just the filename, and the way to access your embedding in prompts
 * the initialization text
@@ -173,38 +183,47 @@ Create your embedding to train on by providing:
 
 Click create, and the starting file will be created.
 
-Afterwards, you can pre-process your source material further by duplicating to flip (will remove the filenames if you preprocessed them already, so beware), or split (presumably will also eat your filenames).
+Afterwards, you can click the `Preprocess images` sub-tab to pre-process your source material further by creating flipped duplicates, or by splitting images.
 
-Next:
+Next, under the `Train` sub-tab:
 * `embedding` or `hypernetwork`: select your embedding/hypernetwork to train on in the dropdown
 * `learning rate`: if you're adventurous, adjust the learning rate. The default of `0.005` is fine enough, and shouldn't cause learning/loss problems, but if you're erring on the side of caution, you can set it to `0.0005`, but more training will be needed.
- - If you're training a hypernetwork, use `0.000005` or `0.0000005` for a learning rate.
+ - similar to prompt editing, you can also specify when to change the learning rate. For example: `0.000005:2500,0.0000025:20000,0.0000001:40000,0.00000001:-1` will use the first rate until 2500 steps, the second one until 20000 steps, the third until 40000 steps, then hold the last one for the rest of the training (there's a small sketch of how I read this format a little further down).
 * `dataset directory`: pass in the path to the folder of your source material to train against
 * `log directory`: player preference, the default is sane enough
 * `prompt template file`: put in the path to the prompt file you created earlier. if you put it in the same folder as the web UI's default prompts, just rename the filename there
 * `width` and `height`: I assume this determines the size of the image to generate when requested, I'd leave it to the default 512x512 for now
 * `max steps`: adjust how long you want the training to be done before terminating. Paperspace seems to let me do ~70000 on an A6000 before shutting down after 6 hours. An 80GB A100 will let me get shy of the full 100000 before auto-shutting down after 6 hours.
-* `epoch length`: this value governs the learning rate correction when training based on defining how long an epoch is. for larger training sets, you would want to decrease this.
+* `epoch length`: this value (*allegedly*) governs the learning rate correction during training by defining how long an epoch is. for larger training sets, you would want to decrease this. I don't see any difference from it at the moment.
 * `save an image/copy`: the last two values are creature comforts and have no real effect on training, values are up to player preference.
+* `preview prompt`: the prompt to use for the preview training image. if left empty, it'll use the last prompt used for training. it's useful for accurately measuring coherence between generations.
 
-Afterwards, hit Train, and wait and watch your creation come to life.
+Afterwards, hit `Train Embedding`, and wait and watch your creation come to life.
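As promised above, here's how I read that learning rate schedule format (a toy illustration, not the web UI's actual parser):
```
# Toy reading of the "rate:until_step" schedule format (not the web UI's parser).
def rate_at_step(schedule: str, step: int) -> float:
    rate = 0.0
    for part in schedule.split(","):
        value, until = part.split(":")
        rate = float(value)
        if int(until) == -1 or step <= int(until):
            break   # this entry covers the current step
    return rate     # past the last boundary, the final rate is held

schedule = "0.000005:2500,0.0000025:20000,0.0000001:40000,0.00000001:-1"
for step in (1000, 10000, 30000, 60000):
    print(step, rate_at_step(schedule, step))
# 1000 5e-06 / 10000 2.5e-06 / 30000 1e-07 / 60000 1e-08
```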
 If you didn't pre-process your images with flipped copies, I suggest midway through to pause training, then use ImageMagick's `mogrify` to flip your images with `mogrify -flop *` in the directory of your source material. I feel I've gotten nicer quality pictures because of it over an embedding I trained without it (but with a different prompt template).
 
+Lastly, if you're training this on a VM in the "cloud", or through the shared gradio URL, I've noticed the web UI will desync and stop updating from the actual server. You can lazily resync by opening the gradio URL in a new window, navigating back to the Training tabs, and clicking Train again *without touching any settings*. It'll re-grab the training progress.
+
 ### For Training a Hypernetwork
 
-Please, please, ***please*** be aware that training a hypernetwork also uses any embeddings from textual inversion. You ***will*** get false results if you use a hypernetowrk trained with a textual inversion embedding. This is very easy to do if you have your hypernetwork named the same as an embedding you have, especially if you're using the `[name]` keyword in your training template.
+As an alternative to Textual Inversion, the web UI also provides training a hypernetwork (effectively an overlay for the last few layers of a model to re-tune it). This is very, very experimental, and I'm not finding success anywhere close to what Textual Inversion gives, so be aware that this is pretty much conjecture until I can nail some decent results.
+
+I ***highly*** suggest waiting for more developments around training hypernetworks. If you want something headache-free, stick to using a Textual Inversion embedding. Despite most likely being overhyped, hypernetworks still seem promising for quality improvements and for anons with lower-VRAM GPUs.
+
+The very core concepts are the same for training one, with the main difference being that the learning rate is very, very sensitive and needs to be reduced as more steps are run. I've seen my hypernetworks quickly dip into incoherent noise, and I've seen some slowly turn into some schizo's dream where the backgrounds and edges are noisy.
+
+The official documentation lazily suggests a learning rate of either `0.000005` or `0.0000005`, but I find it to be inadequate. For the time being, I suggest using `0.000000025` to get started. I'll provide a better value that makes use of the learning rate editing feature when I find a good range.
+
+#### Caveats
+
+Please, please, ***please*** be aware that training a hypernetwork also uses any embeddings from textual inversion. You ***will*** get false results if you use a hypernetwork trained with a textual inversion embedding. This is very easy to do if you have your hypernetwork named the same as an embedding you have, especially if you're using the `[name]` keyword in your training template.
 You're free to use a embedding in your hypernetwork training, but some caveats I've noticed:
-* it is imperative to use a really low learning rate, or you'll fry the hypernetwork and get garbage output after 2200, 4400, or 5200 steps
 * any image generation without your embedding will get terrible output
 * using a hypernetwork + embedding of the same concept doesn't seem to give very much of a difference, although my test was with a embedding I didn't have very great results from anyways
 * if you wish to share your hypernetwork, and you in fact did train it with an embedding, it's important the very same embedding is included
-* embedding files are orders of magnitude larger than an embedding, but doesn't *seem* to grow in size as you train it, unlike an embedding where it's still pretty small, but grows in size as you train it.
 * like embeddings, hypernetworks are still bound to the model you trained against. unlike an embedding, using this on a different model will absolutely not work.
 
-Now that you understand the caveats, training a hypernetwork is (almost) the same as training an embedding through Textual Inversion. The only real difference in training seems to be needing a very much lower learning rate of either `0.000005` or `0.0000005`. As of 2022.10.11, it seems voldy's web UI also can adjust your learning rate based on how many epochs have passed (as a refresher, it's how many times you processed your source material, times whatever value you set in the web UI). I'm not too keen on how to adjust it, but there seems to be commits involving it being added in.
-
 I'm also not too keen whether you need to have a `[name]` token in your training template, as hypernetworks apply more on a model level than a token level.
 
 ### Using the Hypernetwork
diff --git a/utils/renamer/README.md b/utils/renamer/README.md
index 8e528ad..ebb9a8d 100755
--- a/utils/renamer/README.md
+++ b/utils/renamer/README.md
@@ -30,4 +30,4 @@ The output from the fetch script seamlessy integrates with the inputs for the pr
 
 For the python version, simply place your source material into the `./in/` folder, invoke the script with `python3 preprocess.py`, then get your processed files from `./out/`. For the node.js version, do the same thing, but with `node preprocess.js`.
 
-This script *should* also support files already pre-processed through the web UI, as long as they were processed with their original filenames (the MD5 hash booru filenames). Pre-processing in the web UI after running this script might prove tricky, as I've had files named something like `00001-0anthro[...]`, and had to use a clever rename command to break it apart.
\ No newline at end of file
+This script *should* gracefully support files already pre-processed through the web UI, as long as they were processed with their original filenames (the MD5 hash booru filenames).
\ No newline at end of file
diff --git a/utils/renamer/preprocess.js b/utils/renamer/preprocess.js
index 486a229..adb3d69 100755
--- a/utils/renamer/preprocess.js
+++ b/utils/renamer/preprocess.js
@@ -8,7 +8,7 @@ let config = {
     cache: `./cache.json`, // JSON file of cached tags, will speed up processing if re-running
 
     rateLimit: 500, // time to wait between requests, in milliseconds, e621 imposes a rate limit of 2 requests per second
-    filenameLimit: 245, // maximum characters to put in the filename, necessary to abide by filesystem limitations, and to "limit" token count for the prompt parser
+    filenameLimit: 243, // maximum characters to put in the filename, necessary to abide by filesystem limitations, and to "limit" token count for the prompt parser
 
     filter: true,
     // fill it with tags of whatever you don't want to make it into the filename
@@ -35,7 +35,8 @@ let config = {
     // if you're cautious (paranoid), include species you want, but I found I don't really even need to include specis
     // you can also include character names / series names if you're using this for hypernetworks
     // you can also use this to boost a tag already defined to max priority
-    tagsOverride: ["character", "species", "copyright"], // useful for hypernetwork training
+    tagsOverride: ["character", "copyright"], // useful for hypernetwork training
+//  tagsOverride: ["character", "species", "copyright"], // useful for hypernetwork training
     tagsOverrideCategories: true, // override categories
     tagsOverrideStart: 1000000, // starting score that your overriden tags will start from, for sorting purposes
 
@@ -163,7 +164,7 @@ let parse = async () => {
         let joined = filtered.join(config.tagDelimiter)
 
         // NOOOOOO YOU'RE SUPPOSE TO DO IT ASYNCHRONOUSLY SO IT'S NOT BLOCKING
-        FS.copyFileSync(`${config.input}/${file}`, `${config.output}/${file.replace(md5, joined)}`)
+        FS.copyFileSync(`${config.input}/${file}`, `${config.output}/${file.replace(md5, " "+joined).trim()}`)
 
         if ( rateLimit && config.rateLimit ) await new Promise( (resolve) => {
             setTimeout(resolve, config.rateLimit)
diff --git a/utils/renamer/preprocess.py b/utils/renamer/preprocess.py
index cf8845e..3085bb3 100755
--- a/utils/renamer/preprocess.py
+++ b/utils/renamer/preprocess.py
@@ -41,7 +41,8 @@ config = {
     # treat these tags as already being included in the
     # if you're cautious (paranoid), include species you want, but I found I don't really even need to include specis
     # you can also include character names / series names if you're using this for hypernetworks
-    'tagsOverride': ["species", "character", "copyright"], # useful for hypernetwork training
+    'tagsOverride': ["character", "copyright"], # useful for textual inversion training
+#   'tagsOverride': ["character", "species", "copyright"], # useful for hypernetwork training
     'tagsOverrideStart': 1000000, # starting score that your overriden tags will start from, for sorting purposes
 
     # tags to always include in the list
@@ -174,7 +175,7 @@ def parse():
             filtered.append(i)
 
         joined = config['tagDelimiter'].join(filtered)
-        shutil.copy(os.path.join(config['input'], file), os.path.join(config['output'], file.replace(md5, joined)))
+        shutil.copy(os.path.join(config['input'], file), os.path.join(config['output'], file.replace(md5, " "+joined).strip()))
 
         if rateLimit and config['rateLimit']:
             time.sleep(config['rateLimit'] / 1000.0)
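For clarity on that last change to both scripts: the `" " + joined` plus `trim()`/`strip()` tweak keeps the tags separated from any prefix the web UI may have already glued onto the filename, while leaving plain MD5 filenames untouched. A quick illustration (the hash and tag string here are made up):
```
# Made-up hash and joined tag string, just to show what the rename produces.
md5 = "d41d8cd98f"
joined = "anthro male zangoose solo"

for file in (md5 + ".png", "00001-0" + md5 + ".png"):   # raw download vs. web UI pre-processed
    print(file, "->", file.replace(md5, " " + joined).strip())
# d41d8cd98f.png -> anthro male zangoose solo.png
# 00001-0d41d8cd98f.png -> 00001-0 anthro male zangoose solo.png
```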