added more notes involving hypernetworks, added arguments to specify folders in the node.js scripts (need to backport later)

master
mrq 2022-10-13 20:07:25 +07:00
parent caa1a1707b
commit 7676d56de5
5 changed files with 127 additions and 36 deletions

# Textual Inversion/Hypernetwork Guide w/ E621 Content
An up-to-date repo with all the necessary files can be found [here](https://git.ecker.tech/mrq/stable-diffusion-utils): ([mirror](https://git.coom.tech/mrq/stable-diffusion-utils))
**!**WARNING**!** **!**CAUTION**!** ***DO NOT POST THE MIRROR REPO'S URL ON 4CHAN*** **!**CAUTION**!** **!**WARNING**!**
`coom.tech` is an automatic 30-day ban if posted. I am not responsible if you share that URL. Share the [rentry](https://rentry.org/sd-e621-textual-inversion/) instead, as this is effectively a copy of the README.
This guide has been stitched together over the past few days with different trains of thought as I learn the ins and outs of effectively training concepts. Please keep this in mind if the guide seems to shift a bit, sound confusing, or feel like it's covering unnecessary topics. I intend to do a clean rewrite to make things more to-the-point and concise.
Unlike any guide for getting Voldy's Web UI up, a good majority of this guide is focused on getting the right content, and feeding it the right content.
## Assumptions
This guide assumes the following basics:
* you already have the furry/yiffy models downloaded
* you can run a python script
You can also extend this into any other booru-oriented model (I doubt anyone reading this cares; the normalfags seem well content in their own circles), but you'll have to modify the fetch and pre-processing scripts according to the site the images were pulled from. The general concepts still apply.
## Glossary
Below is a list of terms clarified. I notice I'll use some terms interchangeably…
* `embedding`: the trained "model" of the subject or style in question. "Model" would be wrong to call the trained output, as Textual Inversion isn't true training
* `hypernetwork`: a different way to train custom content against a model; almost all of the same principles here apply for hypernetworks
* `loss rate`: a calculated value determining how close the actual output is to the expected output. Typically a value between `0.1` and `0.15` seems to be a good sign
* `epoch`: a term derived from typical neural network training; normally it refers to a full training cycle over your source material (total iterations / training set size), but the web UI doesn't actually do anything substantial with it.
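That epoch formula can be sketched directly. `epochs` below is a hypothetical helper, not something the web UI exposes:

```javascript
// Derive the (cosmetic) epoch count the UI would report from the total
// number of training iterations and the size of your training set.
// e.g. ~52k iterations over a 184-image dataset is roughly 282 epochs.
function epochs(totalSteps, datasetSize) {
    return Math.floor(totalSteps / datasetSize);
}
```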
## Preface
I've burnt through countless models trying to train three of my hazubandos and an artist (with implied consent), each try with different methods. I've found my third attempt to have very strong results, yet I don't recall exactly what I did to get it. My later subjects failed to yield such strong results, so your mileage will greatly vary depending on the subject/style you're training against.
What works for you will differ from what works for me, but do not be discouraged if output during training looks decent but real output in txt2img and img2img fails to live up to expectations. Just try different, well-constructed prompts, change where you place your subject, and also try increasing the size a smidge (such as 512x704, or 704x512). I've thought I've had embeddings fail, when it just took some clever tweaking for decent output.
This guide also aims to try and document the best way to go about training a hypernetwork. If you're not sure whether you should use an embedding or a hypernetwork:
Hypernetworks are a different flavor of extending models. Where Textual Inversion…
* Embedding (Textual Inversion):
- Pros:
+ trained embeddings are very small, the file can even be embedded in the outputs that use them
+ excel really well at concepts you can represent in a prompt
+ easy to use, just put the name you gave it in the prompt, and when you don't want it, don't include it
+ *can* be used with little training if your concept is pretty simple
+ *can* be used in other models that it wasn't necessarily trained on
+ simple to train, the default learning rate is "just good enough"
+ can very easily deviate from the prompts you trained it on
- Cons:
+ takes quite a bit of VRAM to train
+ takes quite a lot of time to train for fantastic results
+ grows in size the longer you train them (still pretty small)
+ consumes tokens in the prompt
+ *can* be used with other embeddings, but attributes usually leak, or the embeddings draw weight away from each other
+ very sensitive in a prompt, usually needs to be placed in the right order and will break if the weights aren't "just right"
* Hypernetworks:
- Pros:
+ *can* work wonders on concepts you can't really represent in a prompt, as there's a ton more room to learn concepts
+ (theoretically) works better for learning bigger concepts, like art styles, *certain* niches (fetishes), or species, but works fine on subjects
+ very quick to see *some* results, can get by with even lower training steps, making it easier for anyone to train
+ (apparently) does not need very much VRAM to train, making it easier for anyone to train
+ appears to better generate trained concepts that an embedding has trouble generating; for example: `penis_through_fly`, as I've had terrible luck at best with getting that from an embedding
+ can rapidly train if you use an embedding already trained in the training prompt, or on concepts the model is already well familiar with
+ *theoretically* can carry over and enhance other concepts if there's some overlap
- Cons:
+ fixed, large size of ~87MiB, will eat space during training with frequent copies
+ very, *very*, **very**, ***very*** sensitive to "high" learning rates; will need to have it adjusted during training
* remedied with a well-tuned set of stepping learning rates
+ quick to fry, will either slowly degrade in quality into a noisy mess, or rapidly turn into noise.
+ finicky to swap; you have to go into Settings to enable/disable ("solved" by adding it as a quick setting)
+ can be very error-prone if you're using an embedding
+ requires trying not to deviate so hard from the prompt you trained it against
+ very xenophobic to other models, as the weights greatly depend on the rest of the model
+ doesn't seem to solve any better the problem of embeddings failing to represent hard-to-describe concepts
If you're still unsure, just stick with Textual Embeds for now.
## Acquiring Source Material
The first step of training against a subject (or art style, or concept) is to acquire source content. Hugging Face's instructions specify having three to five images cropped to 512x512, while the web UI just requires a 1:1 square image, but there's no hard upper limit on how many, nor does having more images have any bearing on the final output size or performance. However, the more images you use, the longer it'll take to converge (even though convergence in typical neural network training means overfitting). For the common user, just stick with 512x512 images.
I cannot imagine a scenario where you should intentionally stick with low image counts over large ones, such as selecting from a pool and pruning for the "best of the best". If you can get lots of images, do it. While the test outputs during training may look better with a smaller pool, when it comes to real image generation, embeddings from big image pools (140-190) yielded far better results over later embeddings trained on half that size (50-100).
If you're lacking material, the web UI's pre-processing tools to flip and split should work decently enough to cover the gap for low content. Flipping will duplicate images and flip them across the Y axis, (presumably) adding more symmetry to the final embedding, while splitting will help deal with non-square content and provide good coverage for partially generating your subject (for example, bust shots, waist below, chest only, etc.). It does an *okay* job compared to manually curating, but it's perfectly fine if you're training an art style.
If you would rather have finely-crafted material, you're more than welcome to manually crop and square images. A compromise for cropping an image is to expand the canvas size to square it off, fill the new empty space with colors that crudely blend with the background, and crudely add color blobs to extend limbs outside the frame. It's not that imperative to do so, but it helps.
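The canvas-squaring compromise boils down to simple geometry. `squareCanvas` below is a hypothetical helper for computing the padding; the actual color filling/blending is still manual work in an image editor:

```javascript
// For a WxH image, compute the square canvas size and the offsets that
// center the original image inside it, leaving equal bands to fill in.
function squareCanvas(width, height) {
    const size = Math.max(width, height);
    return {
        size,
        offsetX: Math.floor((size - width) / 2),
        offsetY: Math.floor((size - height) / 2),
    };
}
```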
Consult the script if you want to adjust its behavior. I tried my best to explain…
### Caveats
There's some "bugs" with the script, be it limitations with interfacing with web UI, or oversights in processing tags:
* commas do not carry over to the training prompt, as this is a matter of how the web UI re-assembles tokens passed from the prompt template/filename. There's functionally no difference with having `,`, or ` ` as your delimiter in this preprocess script.
* for tags with parentheses, such as `boxers_(clothing)` or `curt_(animal_crossing)`, the web UI will do whatever it wants when processing the parentheses. The script can overcome this problem by simply removing anything in parentheses, as you can't really escape them in the filename without editing the web UI's script.
* without setting the "Filename regex string", no additional parsing of the filename will be done (except for removing the prefix the web UI's preprocessing adds)
* species tags seem not to be included in the `tags.csv`, yet they OBVIOUSLY affect the output. I haven't taken close note of it, but your results may or may not improve if you manually tag your species, either in the template or the filenames (whether the """pedantic""" reddit taxonomy term like `ursid` that e621 uses or the normal term like `bear` is preferred is unknown). The pre-process script will include them by default, but be warned that it will include any of the pedantic species tags (stuff like `suina sus boar pig`).
* filtering out common tags like `anthro, human, male, female` could have negative effects when training either a subject or a style. I've definitely noticed I had to add negative terms for f\*moid parts, or else my hazubando will have a cooter that I need to inpaint some cock and balls over. I've also noticed during training a style (that has both anthros and humans), a prompt associated with something anthro will generate something human. Just take notice if you don't foresee yourself ever generating a human with an anthro embedding, or an anthro with a human embedding. (This also applies to ferals, but I'm sure that can be assumed)
* the more images you do use, the longer it will take for the web UI to load and process them for training, and presumably more VRAM needed. 200 images isn't too bad, but 9000 will take 10 minutes on an A100-80G.
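The parenthesis and delimiter caveats above can be sketched as a standalone tag-cleaning pass. `cleanTags` is a hypothetical helper mirroring the behavior described, not the pre-process script itself:

```javascript
// Strip parenthesized qualifiers from booru tags so the web UI's prompt
// parsing doesn't mangle them (boxers_(clothing) => boxers), then join
// with a delimiter. Note that "," and " " are functionally identical
// here, since the web UI re-assembles the tokens itself anyway.
function cleanTags(tags, delimiter = " ") {
    return tags
        .map(tag => tag.replace(/_?\([^)]*\)/g, ""))
        .filter(tag => tag.length)
        .join(delimiter);
}
```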
## Training Prompt Template
```
uploaded on e621, by [name], [filewords]
```
I'm not quite clear on the difference made by including the `by`, but the yiffy model was trained on an `uploaded on e621, [rating] content, by [artist], [tags]` format, and we can only get so close to the master-trained format.
## Preparing for Training
Now that everything is set up, it's time to start training. For systems with "enough" VRAM (I don't have a number on what's adequate), you're free to run the web UI with `--no-half --precision full`. You'll take a very slight performance hit, but quality improves just barely enough for me to notice. The xformers feature seems to get disabled during training, but appears to make preview generations faster, so don't worry about getting xformers configured.
Make sure you're using the correct model you want to train against, as training uses the currently selected model.
**!**NOTE**!**: If you're using a `Filename regex`, make sure to go into the Settings tab, find the `Training` section, then under `Filename join string`, set it to `, `, as this will keep your training prompts comma separated. This doesn't make *too* big of a difference, but it's another step for correctness. This is not relevant if you left the `Filename regex` blank.
Run the Web UI, and click the `Training` sub-tab.
After creating your embedding/hypernetwork base file, you can click the `Preprocess images` sub-tab to pre-process your source material further by duplicating to flip, or split.
### Training for Textual Inversion
Create your embedding to train on by providing the following under the `Create embedding`:
* a name
- can be changed later, it's just the filename, and the way to access your embedding in prompts
Click create, and the starting file will be created.
### Training for a Hypernetwork
There's only one thing you need to do, and that's giving it a name. Afterwards, click create.
## Training
Under the `Training` tab, click the `Train` sub-tab. You'll be greeted with a slew of settings:
* `embedding` or `hypernetwork`: select your embedding/hypernetwork to train on in the dropdown
* `learning rate`: if you're adventurous, adjust the learning rate. The default of `0.005` is fine enough, and shouldn't cause learning/loss problems, but if you're erring on the side of caution, you can set it to `0.0005`, but more training will be needed.
- similar to prompt editing, you can also specify when to change the learning rate. For example: `0.000005:2500,0.0000025:20000,0.0000001:40000,0.00000001:-1` will use the first rate until 2500 steps, the second one until 20000 steps, the third until 40000 steps, then hold with the last one for the rest of the training.
* `dataset directory`: pass in the path to the folder of your source material to train against
* `log directory`: player preference, the default is sane enough (the hierarchy it uses afterwards, however, is not)
* `prompt template file`: put in the path to the prompt template file you created earlier. If you put it in the same folder as the web UI's default prompts, you can just enter its filename there
* `width` and `height`: I assume this determines the size of the image to generate when requested. Or it could actually work for training at different aspect ratios. I'd leave it to the default 512x512 for now.
* `max steps`: adjust how long you want the training to be done before terminating. Paperspace seems to let me do ~70000 on an A6000 before shutting down after 6 hours. An 80GB A100 will let me get shy of the full 100000 before auto-shutting down after 6 hours.
* `epoch length`: this value is only cosmetic, and doesn't actually fulfill the dream of correcting the learning rate per epoch. Don't even bother with this.
* `save an image/copy`: these two values are creature comforts and have no real effect on training, values are up to player preference.
* `preview prompt`: the prompt to use for the preview training image. If left empty, it'll use the last prompt used for training. It's useful for accurately measuring coherence between generations, so I highly recommend using this with a prompt you want to use later, to gauge quality over time. ***Does not*** take the same `[name]` and `[fileword]` keywords passed through to the template
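The stepped learning-rate syntax from the `learning rate` setting above (`rate:step,rate:step,…`, ending in `-1` to hold until the end) can be sketched as a tiny parser. This only illustrates the format; `learningRateAt` is a hypothetical helper, and the web UI's exact boundary handling may differ:

```javascript
// Pick the learning rate for a given step from a schedule string like
// "0.000005:2500,0.0000025:20000,0.0000001:40000,0.00000001:-1".
// Each rate applies until its step count is reached; -1 holds forever.
function learningRateAt(schedule, step) {
    for (const pair of schedule.split(",")) {
        const [rate, until] = pair.split(":");
        if (parseInt(until) === -1 || step <= parseInt(until)) return parseFloat(rate);
    }
    return null; // malformed schedule with no trailing -1 entry
}
```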
Afterwards, hit `Train Embedding`, and wait and watch your creation come to life.
I ***highly*** suggest waiting for more developments around training hypernetworks.
The very core concepts are the same for training one, with the main difference being that the learning rate is very, very sensitive and needs to be reduced as more steps are run. I've seen my hypernetworks quickly dip into incoherent noise, and I've seen some slowly turn into some schizo's dream where the backgrounds and edges are noisy.
The official documentation lazily suggests a learning rate of either `0.000005` or `0.0000005`, but I find both to be inadequate. For the meantime, I suggest using `0.000000025` to get started if you're fine babysitting it, or, if you're overcautious, `0.000005:2500,0.0000025:20000,0.0000001:30000,0.000000075:-1`. I find this value too slow, but it appears to wrangle the hypernetwork in enough to be somewhat comparable to Textual Inversion's training progression in the long run.
#### Caveats
I'm also not too keen on whether you need to have a `[name]` token in your training…
It's as simple as selecting it under Settings in the Hypernetworks drop-down box. Hit save after selecting. Afterwards, happy prompting, happy fapping.
The big caveat with using a hypernetwork is that you should try and avoid deviating from the training prompt you used. Hypernetworks excel if you use the terms you trained it on, and gets flaccid when you do not. ***Please*** keep this in mind, as this is not a caveat when using an embedding.
## Using the Embedding
Using your newly trained embedding is as simple as putting the name of the file in the prompt. You do not need to wrap it in `<>` like you used to. Unlike hypernetworks, you're not required to use the terms associated with your embedding. You *can*, as it seems to further amplify the attributes associated with it.
Do not be discouraged if your initial output looks disgusting. I've found you need a nicely crafted prompt, and increasing the resolution a few notches will get something decent out of it. Play around with prompts in the thread, but I've found this one to finally give me [decent output](https://desuarchive.org/trash/thread/51397474/#51400626) (credits to [anon](https://desuarchive.org/trash/thread/51387852/#51391540) and [anon](https://desuarchive.org/trash/thread/51397474/#51397741) for letting me shamelessly steal it for my perverted kemobara needs):
```
And an adjusted one of the above that I found to yield very tasteful results:
```
uploaded on e621, explicit content, by [Pino Daeni:__e6_artist__:0.75] and [chunie:__e6_artist__:0.75], (photography, sharp details, detailed fur, detailed eyes:1.0), <TOKEN>, hairy body, <FLAVORS>
```
where `<TOKEN>` is the name of the embedding you used, `<FLAVORS>` are additional tags you want to put in, and `__e6_artist__` is used with the Wildcards third-party script (you can manually substitute them with other artists of your choosing for subtle nuances in your output).
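Filling in those placeholders is plain string substitution. The sketch below is a hypothetical stand-in for illustration only; the real Wildcards script pulls `__e6_artist__` entries from a wordlist file rather than an in-memory array:

```javascript
// Fill in a prompt template: replace the <TOKEN>/<FLAVORS> placeholders
// and substitute each __e6_artist__ wildcard with a random pick from a
// list of artist tags.
function buildPrompt(template, token, flavors, artists) {
    return template
        .replace(/<TOKEN>/g, token)
        .replace(/<FLAVORS>/g, flavors.join(", "))
        .replace(/__e6_artist__/g, () => artists[Math.floor(Math.random() * artists.length)]);
}
```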
Ordering ***really*** matters when it comes to your embedding, as does its weight. Too early in the prompt, and the weight for other terms will greatly fall off; too late in the prompt, and your embedding will lose its influence. Too much weight applied to your embedding, and you'll deep-fry your output.
If you're using an embedding primarily focused on an art style, and you're also u…
Lastly, when you do use your embedding, make sure you're using the same model you trained against. You *can* use embeddings on different models, as you'll definitely get usable results, but don't expect it to give stellar ones.
## Testimonials
Here I'll try to catalog my results, and results I've caught from other anons (without consent):
* Katt from Breath of Fire: https://desuarchive.org/trash/thread/51599762/#51607820
- Hypernetwork named `2bofkatt`
- 40,000 iterations, learning rate of 0.000000025
## After Words
Despite being very wordy, I do hope that it's digestible even for the most inexperienced of users. Everything in here is pretty much from my own observations and tests, so I can get (You), anon, closer to generating what you love.
Lastly, the following sections have no bearing on training, but serve as a place to put my observations:
### The Nature of Textual Inversion embeddings
Textual Inversion embeddings serve as mini-"models" to extend a current one. When…
* subjects that are nigh impossible to describe in a prompt (four ears, half are shaped one way, the other half another, middle eye, tusks, neckbeard tufts, etc. // brown fur, vintage white muzzle and chest marking) are *very* hard for an embedding to output
* attributes associated with the embedding can leak onto other subjects in the output; for example: `[...] <TOKEN> and anthro cat [...]` will give you two of your trained subject with cat attributes. Whether this is more of a side-effect of Textual Inversion itself, or a symptom of attribute leaking in general with how the web UI parses prompts, is unknown.
* using an embedding trained on a different model will still give the concepts it was trained against (using an embedding of a species of animal will generate something somewhat reminiscent of a real-life version of that species)
Contrarily, hypernetworks are another variation of extending the model with a mini-"model". They apply to the last outer layers as a whole, allowing you to effectively re-tune the model. They modify what comes out of the prompt and into the image, amplifying/modifying its effects. This is evident through:
* using a verbose prompt with one enabled will give your output more detail in what you prompted
* in the context of NovelAI, you're still somewhat required to prompt what you want, but the associated hypernetwork will strongly bring about what you want.
### Hiccups With Assessing Training A Hypernetwork
I don't have a concrete way of getting consistent training results with Hypernetworks at the moment. Most of the headache seems to be from:
* working around a very sensitive learning rate, and finding the sweet spot between "too high, it'll fry" and "too low, it's so slow"
* figuring out what exactly is the best way to try and train it, and the best *thing* to train it on, such as:
- should I train it with tags like I do for Textual Inversion (the character + descriptor tags), or use more generalized tags (like all the various species, very generic tags like anthro male)
- should I train it the same as my best embedding of a character, to try and draw comparisons between the two?
- should I train it on a character/art style I had a rough time getting accurate results from, to see if it's better suited for it?
+ given the preview training output at ~52k iterations w/ 184 images, I found it to not have any advantages over a regular Textual Inversion embedding
- should I train it on a broader concept, like a series of characters or a specific tag (fetish), to go ahead and recommend it quicker for anyone interested in it, then train to draw conclusions of the above after?
+ given the preview training output at ~175k iterations w/ 9322 images, I found it to be *getting there* in looking like the eight or so characters I'm group batching for a "series of characters", but this doesn't really seem to be the way to go.
+ as for training it on a specific tag (fetish), I'd have to figure out which one I'd want to train it on, as I don't necessarily have any specific fetishes (at least, any that would be substantial to train against)
* it takes a long, long time to get to ~150k iterations, the sweet spot I found Textual Inversions to sit at. I feel it's better to just take the extra half hour to keep training it rather than waste it fiddling with the output.
There doesn't seem to be a good resource for broader concepts like the above.
A rentry I found for hypernetwork training in the /g/ thread is low quality.
The other resources seems to be "lol go to the training discord".
The discussion on it on the Web UI github is pretty much just:
* *"I want to do face transfers onto Tom Cruise / a woman / some other thing"*
* *"habibi i want this art style please sir help"*
* dead end discussion about learning rates
* hopeless conjecture about how quick it is to get decent results, but it failing to actually apply to anything for e621-related applications
I doubt anyone else can really give some pointers in the right direction, so I have to bang my head against the wall to figure out the best path, as I feel that if it works for even me, it'll work for (You).

@ -0,0 +1,5 @@
{
"dependencies": {
"node-fetch": "^2.6.7"
}
}

@ -73,12 +73,24 @@ try {
let args = process.argv;
args.shift();
args.shift();
if ( args[0] ) config.query = args[0];
if ( args[1] ) config.output = args[1];
// require a query, without it you effectively have a script to download the entirety of e621
if ( !config.query ) {
console.error("No arguments passed; example: `node fetch.js 'kemono -dog'`")
return;
}
try {
if ( !FS.lstatSync(config.output).isDirectory() ) {
console.error(`specified path for output is not a directory: ${config.output}`)
return;
}
} catch ( e ) {
console.error(`specified path for output is not found: ${config.output}`)
return;
}
// clamp concurrency
if ( !config.concurrency || config.concurrency < 1 ) config.concurrency = 1;
// fetch options to use for each request
@ -89,6 +101,8 @@ let options = {
}
}
console.log(`Downloading images of tags "${config.query}" to folder ${config.output}`)
let parse = async () => {
let posts = [];
let last = ''; // last ID used, used for grabbing the next page

@ -75,7 +75,8 @@ try:
except:
pass
args = sys.argv[1:] # keep the list slice; sys.argv[1] would raise IndexError before the length check below when no arguments are passed
if len(args) == 0:
print('No arguments passed, example: `python3 fetch.py \"kemono -dog\"`')
quit()

@ -76,6 +76,27 @@ try {
cache = {};
}
let args = process.argv;
args.shift();
args.shift();
if ( args[0] ) config.input = args[0];
if ( args[1] ) config.output = args[1];
for ( let k in {"input":null, "output":null} ) {
try {
if ( !FS.lstatSync(config[k]).isDirectory() ) {
console.error(`specified path for ${k} is not a directory: ${config[k]}`)
return;
}
} catch ( e ) {
console.error(`specified path for ${k} is not found: ${config[k]}`)
return;
}
}
console.log(`Parsing ${files.length} files from ${config.input} => ${config.output}`)
let parse = async () => {
for ( let i in files ) {
let file = files[i];