more sloppy quick notes but for CLIP aesthetics
parent 3bec2624fb
commit bd93df445f

README.md | 45
@@ -6,7 +6,11 @@ An up-to-date repo with all the necessary files can be found [here](https://git.
`coom.tech` is an automatic 30-day ban if posted. I am not responsible if you share that URL. Share the [rentry](https://rentry.org/sd-e621-textual-inversion/) instead, as this is effectively a copy of the README.

This guide has been stitched together with different trains of thought as I learn the ins and outs of effectively training concepts. Please keep this in mind if the guide seems to shift a bit, sound confusing, or feel like it's covering unnecessary topics. I intend to do a clean rewrite to make things more to-the-point and concise.
Also, as new features get added, they have to find room among the details for Textual Inversion, so bear in mind if something seems rather forced to be included. As examples:
* hypernetworks were released a week or two after textual inversion training was added to the web UI
* the CLIP Aesthetic feature was also just released, and, while it requires little set-up, it has a hard time finding a home in this guide

Unlike a guide for getting Voldy's Web UI up and running, a good majority of this guide is focused on getting the right content, and feeding the trainer that content correctly, rather than on running commands.
@@ -28,6 +32,7 @@ Below is a list of terms clarified. I notice I'll use some terms interchangeably
* `style`: an artist's style. Textual Inversion can also incorporate subjects in a style.
* `source content/material`: the images you're using to train against; pulled from e621 (or another booru)
* `embedding`: the trained "model" of the subject or style in question. "Model" would be the wrong term for the trained output, as Textual Inversion isn't true training
* `aesthetic image embedding`/`CLIP aesthetic`: a collection of images of an aesthetic you're trying to capture, used by the CLIP Aesthetic feature
* `hypernetwork`: a different way to train custom content against a model; almost all of the same principles here apply to hypernetworks
* `loss rate`: a calculated value measuring how close the actual output is to the expected output. Typically, a value between `0.1` and `0.15` seems to be a good sign
* `epoch`: a term derived from typical neural network training; normally it refers to one full training cycle over your source material (total iterations / training set size), but the web UI doesn't actually do anything substantial with it.
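To make `epoch` concrete, the arithmetic is just total steps over dataset size; a quick illustration (the numbers here are made up for the example, not recommended settings):

```python
# Illustrative numbers only -- not recommended training settings.
total_steps  = 20000   # the max step count you set for training
dataset_size = 40      # how many source images you're training on

# One "epoch" = one full training cycle over your source material
# (total iterations / training set size, assuming the default batch size of 1).
epochs = total_steps / dataset_size
print(f"~{epochs:.0f} epochs over the source material")   # -> ~500 epochs
```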
@@ -38,9 +43,9 @@ I've burnt through countless models trying to train three of my hazubandos and a
What works for you will differ from what works for me, but do not be discouraged if output during training looks decent while real output in txt2img and img2img fails to live up to expectations. Just try different, well-constructed prompts, change where you place your subject, and try increasing the size a smidge (such as 512x704, or 704x512). I've thought embeddings had failed, when it just took some clever tweaking to get decent output.

This guide also aims to document the best way to go about training a hypernetwork and using a CLIP aesthetic embedding. If you're not sure which to use:

### What Is Right For Me?

Hypernetworks are a different flavor of extending models. Where Textual Inversion trains the best concepts to use during generation, a hypernetwork re-tunes the outer layers of a model to better fit what you want. However, hypernetworks aren't a magic bullet to replace Textual Inversion. I propose a short answer and a long answer:
@@ -92,6 +97,17 @@ If you're not satisfied with such a short query, I present some pros and cons be
+ requires trying not to deviate so hard from the prompt you trained it against
+ very xenophobic to other models, as the weights greatly depend on the rest of the model
+ doesn't seem to do any better than an embedding at representing hard-to-describe concepts
* Embedding (CLIP Aesthetic):
- this is another recent tech that I need to put more time into, but from a very quick glance:
- Pros:
+ very, very quick to set up: just gather your source content, create the embedding, and it's good to use
+ (theoretically) appears to work really well for art styles
+ """training""" is done JIT (just in time) rather than beforehand, making the investment practically zero-cost
+ gives decent results (if you're comfortable really tuning the settings)
- Cons:
+ the web UI's defaults are overkill and will ruin the image
+ requires more finesse with tuning the settings
+ (theoretically) appears to only really work for art styles

If you're still unsure, just stick with Textual Embeds for now. Despite the *apparent* upsides of hypernetworks in training performance compared to an embedding, until better learning rates are found, I can't bring myself to suggest them.
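As an aside on the hypernetwork option above: as I understand the web UI's implementation, a hypernetwork is a set of small residual MLPs that rewrite the conditioning fed into each cross-attention layer, which is also why it's so "xenophobic" to other models. A rough PyTorch sketch of the idea (not the web UI's actual code; names and sizes are made up):

```python
import torch
import torch.nn as nn

class HypernetworkModule(nn.Module):
    """A small residual MLP; a pair of these (one for keys, one for values)
    gets attached to each cross-attention layer of the model."""
    def __init__(self, dim: int, strength: float = 1.0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * 2),
            nn.ReLU(),
            nn.Linear(dim * 2, dim),
        )
        self.strength = strength  # how strongly the learned tweak is applied

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The original conditioning passes through untouched, plus a learned offset.
        return x + self.net(x) * self.strength

# At generation time, the prompt conditioning that feeds the attention keys/values
# is run through these modules -- re-tuning how the model "reads" the prompt,
# rather than adding new tokens the way Textual Inversion does.
context = torch.randn(1, 77, 768)                  # e.g. CLIP text conditioning
hn_k, hn_v = HypernetworkModule(768), HypernetworkModule(768)
k_in, v_in = hn_k(context), hn_v(context)          # what the model now attends over
```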
@@ -255,6 +271,29 @@ Click create, and the starting file will be created.
There's only one thing you need to do, and that's giving it a name. Afterwards, click create.

### Training for a CLIP Aesthetic
Getting started with a CLIP Aesthetic embedding is very, very easy. Just:
* navigate to `Train` > `Create aesthetic images embedding`
* name your embedding whatever you want
* put in the path to the folder of images to be "trained" on
* if you know what you're doing, adjust the batch size; it *could* speed up the process
* click `Create images embedding`

And you're done! You can now use your CLIP Aesthetic for generating images.

You ***do not*** need to train the embedding.
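If you're curious why no training is needed: as far as I can tell, creating the embedding just encodes each of your images with CLIP's image encoder and averages the results into a single vector. A minimal sketch of that idea using the `transformers` CLIP implementation (not the web UI's exact code; the model name and paths are placeholders):

```python
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-large-patch14"    # the CLIP used by SD v1.x models
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image_dir = Path("training/my-aesthetic")       # placeholder: your source content folder
images = [Image.open(p).convert("RGB") for p in sorted(image_dir.glob("*.png"))]

with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)          # one CLIP vector per image
    feats = feats / feats.norm(dim=-1, keepdim=True)    # normalize each vector
    aesthetic = feats.mean(dim=0)                       # the average is the whole "embedding"

torch.save(aesthetic, "my-aesthetic.pt")                # ready to select in the UI
```

That's the entire up-front cost, which is why the """training""" is effectively free.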
#### Using the CLIP Aesthetic

When generating an image with txt2img or img2img, a drop-down labeled `Open for Clip Aesthetic!` will be available for using your new CLIP Aesthetic.

Adjust the `Aesthetic weight` to something low first, as the default is not sane and will overbake your image.

Under `Aesthetic imgs embedding`, select the CLIP aesthetic embedding you want to use.

Generate your image, and keep adjusting the weight until you get something you like. Then you can play around with the steps.
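For intuition on why the weight and steps matter so much (this is my understanding of the aesthetic gradients technique, heavily simplified; the real extension works on the text encoder rather than the conditioning tensor directly): at generation time the prompt's conditioning is nudged for a few gradient steps toward your aesthetic embedding, and the weight controls how much of that nudge is kept. A hypothetical sketch:

```python
import torch
import torch.nn.functional as F

def apply_aesthetic(cond, aesthetic, weight=0.2, steps=5, lr=1e-4):
    """Nudge the prompt conditioning toward an aesthetic embedding.

    cond:      prompt conditioning, shape (tokens, dim)
    aesthetic: averaged CLIP image embedding, shape (dim,)
    """
    tuned = cond.clone().requires_grad_(True)
    opt = torch.optim.Adam([tuned], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Pull the (pooled) conditioning toward the aesthetic direction.
        sim = F.cosine_similarity(tuned.mean(dim=0), aesthetic, dim=0)
        (-sim).backward()
        opt.step()
    # Blend back with the original prompt; too high a weight "overbakes" the image.
    return (1 - weight) * cond + weight * tuned.detach()

cond = torch.randn(77, 768)       # stand-in prompt conditioning
aesthetic = torch.randn(768)      # stand-in aesthetic embedding
new_cond = apply_aesthetic(cond, aesthetic, weight=0.2, steps=5)
```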
## Training
Under the `Training` tab, click the `Train` sub-tab. You'll be greeted with a slew of settings: