generalized the fetch script to also allow other boorus, README 2

This commit is contained in:
mrq 2022-10-10 18:34:56 +00:00
parent da9e6e442e
commit d7e163e0e6
4 changed files with 150 additions and 25 deletions

View File

@ -33,6 +33,7 @@ What works for you will differ from what works for me, but do not be discouraged
## Acquiring Source Material
The first step of training against a subject (or art style) is to acquire source content. Hugging Face's instructions specify having three to five images, cropped to 512x512, but there's no hard upper limit on how many, nor does having more images have any bearing on the final output size or performance. However, the more images you use, the longer it'll take to converge (even though convergence in typical neural network training means overfitment).
I cannot imagine a scenario where you should stick with low image counts, such as selecting from a pool and pruning for the "best of the best". If you can get lots of images, do it. While the test outputs during training may look better with a smaller pool, when it comes to real image generation, embeddings trained on big image pools (140-190) yielded far better results than later embeddings trained on pools half that size (50-100).
@ -41,6 +42,8 @@ If you're lacking material, the web UI's pre-processing tools to flip and split
If you would rather have finely-crafted material, you're more than welcome to manually crop and square images. A compromise when cropping an image is to expand the canvas size to square it off, then fill the new empty space with colors that crudely blend with the background, and crudely add color blobs to extend limbs outside the frame. It's not that imperative to do so, but it helps.
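If you want to script that square-off step, here's a minimal sketch using the `sharp` npm package (not a dependency of this guide, and it pads with a flat color, which is cruder than hand-blending the background):

```js
// pad an image out to 512x512 instead of cropping it, using a flat background color
// assumes `npm install sharp`; swap the color for something close to the image's own background
const sharp = require("sharp");

sharp("./in/source.png")
	.resize(512, 512, {
		fit: "contain", // keep the whole image and pad the leftover canvas
		background: { r: 128, g: 128, b: 128, alpha: 1 },
	})
	.toFile("./out/source.png")
	.then(() => console.log("padded to 512x512"))
	.catch(console.error);
```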
If you want to accelerate your ~~scraping~~ content acquisition, consult the fetch script under [`./utils/renamer/`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer/).
### Source Material For A Style
The above tips all also apply to training a style, but some additional care needs to be taken:
@ -82,6 +85,8 @@ The generalized procedure is as follows (a rough sketch follows this list):
* yank out the artist and content rating, and prepend the list of tags
* copy the source file with the name being the processed list of tags
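A very rough sketch of the mechanical part of that flow; this is illustrative only, and how the artist and rating tags get yanked out and reordered is left to the real preprocess script:

```js
// rough sketch only: look each image's MD5 hash up in cache.json, flatten its
// tags, and copy the file under a tag-based name
let FS = require("fs");
let cache = JSON.parse( FS.readFileSync("./cache.json") );

for ( let file of FS.readdirSync("./in/") ) {
	let md5 = file.match(/^([a-f0-9]{32})/);
	if ( !md5 || !cache[md5[1]] ) continue;
	let post = cache[md5[1]];
	let tags = [];
	for ( let cat in post.tags ) for ( let k in post.tags[cat] ) tags.push( post.tags[cat][k] );
	FS.copyFileSync( `./in/${file}`, `./out/${tags.join(" ")}${file.slice(32)}` ); // keep the original extension
}
```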
Additional information about the scripts can be found in the README under [`./utils/renamer/README.md`](https://git.coom.tech/mrq/stable-diffusion-utils/src/branch/master/utils/renamer/).
### Pre-Requisites
There are few safety checks or error messages, so triple check that you have:

View File

@ -0,0 +1,33 @@
# E621 Scripts
Included are the utilities provided for ~~scraping~~ acquiring your source content to train on.
If you're targeting another booru, the same principles apply, but you'll need to adjust the API URL and the processing of your booru's JSON output. Doing so is left as an exercise for the reader.
## Dependencies
While I strive to keep dependencies minimal, only the pre-processing script is available in Python; the e621 downloading script is only available in node.js, as I'm not that strong of a python dev. It's reasonable to assume everyone has python, as it's a hard dependency for using voldy's web UI.
Python scripts have no additional dependencies, while node.js scripts require running `npm install node-fetch@2` (v2.x because I'm old and still using `require` for my includes).
## Fetch
**!**TODO**!** Rewrite in python, currently only available in node.js
This script is responsible for ~~scraping~~ downloading from e621 all requested files for your target subject/style.
To run, simply invoke the script with `node fetch.js [search query]`. For example: `node fetch.js "kemono -dog"` to download all non-dog posts tagged as kemono.
In the script are some tunables, but the defaults are sane enough not to require any additional configuration.
If you're using another booru, extending the script to support your booru of choice is easy, as the script was configured to allow for additional booru definitions. Just reference the provided one for e621 if you need a starting point.
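As a purely illustrative example (not something shipped here), a Danbooru-flavored definition might look like the following; the endpoint and JSON field names are recalled from Danbooru's API and should be verified against your booru's docs before use:

```js
// hypothetical extra entry for the `boorus` object in fetch.js; double-check the
// field names against your booru's API before relying on this
"danbooru": {
	urls: {
		api: "https://danbooru.donmai.us/posts.json", // endpoint to grab post info from
		posts: "https://danbooru.donmai.us/posts/", // url to show post page, only for console logging
	},
	config: {
		rateLimit: 500, // stay polite; danbooru rate limits requests too
		cookie: null,
	},
	posts: ( json ) => { return json; }, // danbooru returns the array of posts directly
	post: ( json ) => {
		return {
			id: json.id,
			url: json.file_url,
			md5: json.md5,
			filename: `${json.md5}.${json.file_ext}`,
			tags: json.tag_string.split(" "), // tags come back as one space-delimited string
		};
	}
}
```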
## Pre-Process
The bread and butter of this repo is the preprocess script, responsible for associating your images from e621 with tags to train against during Textual Inversion.
The output from the fetch script seamlessly integrates with the inputs for the preprocess script. The `cache.json` file should also have all the necessary tags to further accelerate this script.
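For reference, `cache.json` is just a map from a file's MD5 hash to the raw post JSON the booru returned, so a (heavily trimmed, placeholder) entry looks roughly like:

```json
{
	"d41d8cd98f00b204e9800998ecf8427e": {
		"id": 12345,
		"file": { "url": "https://static1.e621.net/...", "md5": "d41d8cd98f00b204e9800998ecf8427e", "ext": "png" },
		"tags": { "general": ["..."], "species": ["..."], "artist": ["..."] }
	}
}
```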
For the python version, simply place your source material into the `./in/` folder, invoke the script with `python3 preprocess.py`, then get your processed files from `./out/`. For the node.js version, do the same thing, but with `node preprocess.js`.
This script *should* also support files already pre-processed through the web UI, as long as they were processed with their original filenames (the MD5 hash booru filenames). Pre-processing in the web UI after running this script might prove tricky, as I've had files named something like `00001-0anthro[...]`, and had to use a clever rename command to break it apart.

View File

@ -1,16 +1,60 @@
let FS = require("fs")
let Fetch = require("node-fetch")
let boorus = {
"e621": {
urls: {
api: "https://e621.net/posts.json", // endpoint to grab tag info from
posts: "https://e621.net/posts/", // url to show post page, only for console logging
},
config: {
rateLimit: 500, // time to wait between requests, in milliseconds, e621 imposes a rate limit of 2 requests per second
cookie: null, // put your account cookie here if for whatever reason you need it
},
posts: ( json ) => { return json.posts; }, // way to process API output into an array of posts
post: ( json ) => { // way to process JSON into a format this script uses
let tags = [];
for ( let cat in json.tags ) {
for ( let k in json.tags[cat] ) {
tags.push(json.tags[cat][k])
}
}
return {
id: json.id,
url: json.file.url,
md5: json.file.md5,
filename: `${json.file.md5}.${json.file.ext}`,
tags
};
}
}
}
let config = {
	booru: "e621", // booru definition to use from the above object, currently only supports e621
	query: ``, // query used if no argument is passed; kept empty so the script can scream at you for not passing one
output: `./in/`, // directory to save your files
cache: `./cache.json`, // JSON file of cached tags, will speed up processing when used for the renamer script
limit: 10, // how many posts to pull in one go
rateLimit: 500, // time to wait between requests, in milliseconds, e621 imposes a rate limit of 2 requests per second
concurrency: 4, // how many file requests to keep in flight at the same time
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
filter: true, // sort of redundant given you can "filter" with the query, but e621 has a query limit, so here you can bypass it
filters: [ // consult the preprocess.js script for examples
"animated", // training only supports static images
],
}
let booru = boorus[config.booru];
// merge booru and config
for ( let k in booru.config ) if ( !config[k] ) config[k] = booru.config[k];
// load tag cache to merge into later
let cache;
try {
cache = JSON.parse( FS.readFileSync(config.cache) )
@ -18,57 +62,97 @@ try {
cache = {};
}
// grab requested query from arguments
let args = process.argv;
args.shift();
args.shift();
if ( args.length ) config.query = args.join(" ");
// require a query, without it you effectively have a script to download the entirety of e621
if ( !config.query ) {
console.error("No arguments passed; example: `node fetch.js 'kemono -dog'")
return;
}
// clamp concurrency
if ( !config.concurrency || config.concurrency < 1 ) config.concurrency = 1;
// fetch options to use for each request
let options = {
headers: {
'user-agent': config.userAgent,
'cookie': config.cookie,
}
}
let parse = async () => {
console.log(`Fetching: ${config.query}`)
let posts = [];
	let last = ''; // last ID used, used for grabbing the next page
do {
let query = [`tags=${config.query}`]
if ( config.limit ) query.push(`limit=${config.limit}`)
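		// e621's `page=b<id>` asks for posts *before* that ID, which is how the script steps to the next page of results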
if ( last ) query.push(`page=b${last}`)
query = encodeURI(query.join("&"));
console.log(`Querying: ${query}`)
		let r = await Fetch( `${booru.urls.api}?${query}`, options );
		posts = booru.posts(JSON.parse(await r.text()));
let promises = [];
for ( let i in posts ) {
			let post = booru.post(posts[i]);
			last = `${post.id}`
			cache[post.md5] = posts[i];
			if ( FS.existsSync(`${config.output}${post.filename}`) ) {
				console.log(`Skipping existing file: ${booru.urls.posts}${post.id}`)
				continue;
			}
			if ( config.filter ) {
				let filtered = null; // set to the offending tag if any filter matches
				// nasty nested loops, dying for a go-to
				for ( let j in post.tags ) {
					let tag = post.tags[j];
					for ( let k in config.filters ) {
						let filter = config.filters[k];
						if ( filter === tag || ( filter instanceof RegExp && tag.match(filter) ) ) {
							filtered = tag;
							break;
						}
					}
					if ( filtered ) break;
				}
				if ( filtered ) {
					console.log(`Skipping filtered post: ${booru.urls.posts}${post.id}`, filtered)
					continue; // skip just this post rather than bailing on the whole page
				}
			}
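			// crude throttling: once `concurrency` downloads are in flight, wait for the batch to finish before queuing more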
if ( promises.length >= config.concurrency ) {
for ( let i in promises ) await promises[i];
promises = [];
}
promises.push(Fetch(post.url, options).then(res => new Promise((resolve, reject) => {
const dest = FS.createWriteStream(`${config.output}${post.filename}`);
res.body.pipe(dest);
dest.on('close', () => {
					console.log(`Downloaded: ${booru.urls.posts}${post.id}`)
resolve()
});
dest.on('error', reject);
})).catch((err)=>{
console.error(`Error while fetching: ${post.id}`, err);
}));
}
if ( config.rateLimit ) await new Promise( (resolve) => {
setTimeout(resolve, config.rateLimit)
} )
		FS.writeFileSync(config.cache, JSON.stringify( cache, null, "\t" ))
	} while ( posts.length );
}
parse();

View File

@ -50,7 +50,10 @@ let parse = async () => {
for ( let i in files ) {
let file = files[i];
let md5 = file.match(/^([a-f0-9]{32})/);
		if ( !md5 ) {
			md5 = file.match(/([a-f0-9]{32})\.(jpe?g|png)$/);
			if ( !md5 ) continue;
		}
md5 = md5[1];
console.log(`[${(100.0 * i / files.length).toFixed(3)}%]: ${md5}`);