Dataset documentation

2020-11-26 11:58:39 -07:00 · 0c6d7971b9
commit 0c6d7971b9
parent 45a489110f
1 changed file with 78 additions and 0 deletions
--- a/codes/data/README.md
+++ b/codes/data/README.md
@@ -0,0 +1,78 @@
+# DLAS Datasets
+
+## Quick Overview
+
+DLAS uses the standard Torch Dataset infrastructure. Datasets are expected to be constructed using an "options" dict,
+which is fed directly from the configuration file. They are also expected to output a dict, where the keys are injected
+directly into the trainer state.
+
+Datasets conforming to the above expectations must be registered in `__init__.py` to be used by a configuration.
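As a rough illustration of the contract described in the overview, here is a minimal sketch of a dataset that is built from an options dict and returns a dict per item. `ToyDataset`, the option keys `size` and `scale`, and the output keys `hq` and `lq` are hypothetical names chosen for this example, not names taken from DLAS. The registration step in `__init__.py` is not shown because its mechanism is not described here.

```python
try:
    from torch.utils.data import Dataset  # standard Torch base class
except ImportError:  # keeps the sketch runnable without PyTorch installed
    Dataset = object


class ToyDataset(Dataset):
    """Hypothetical dataset: built from an options dict, returns a dict per item."""

    def __init__(self, opt):
        # `opt` stands in for the dataset section of the configuration file.
        self.size = opt['size']           # hypothetical option key
        self.scale = opt.get('scale', 1)  # hypothetical option key

    def __len__(self):
        return self.size

    def __getitem__(self, index):
        if not 0 <= index < self.size:
            raise IndexError(index)
        # The keys of this dict are what would end up in the trainer state.
        hq = [float(index)] * 4
        lq = [value / self.scale for value in hq]
        return {'hq': hq, 'lq': lq}


ds = ToyDataset({'size': 3, 'scale': 2})
item = ds[1]
```

The point of the sketch is only the shape of the interface: a dict goes in at construction time, and a dict comes out for every index.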
+
+## Reference Datasets
+
+This directory contains several reference datasets which I have used in building DLAS. They include:
+
+1. Stylegan2Dataset - Reads a set of images from a directory, performs some basic augmentations on them and injects
+   them directly into the state. LQ = HQ in this dataset.
+1. SingleImageDataset - Reads image patches from a 'chunked' format along with the reference image and metadata about
+   how the patch was originally computed. The 'chunked' format is described below. Includes built-in ImageCorruption
+   features provided by `image_corruptor.py`.
+1. MultiframeDataset - Similar to SingleImageDataset, but infers a temporal relationship between images based on their
+   filenames: the last 12 characters before the file extension are assumed to be a frame counter. Images from this 
+   dataset are grouped together with a temporal dimension for working with video data.
+1. MultiscaleDataset - Reads full images from a directory and builds a tree of images constructed by cropping squares
+   from the source image and resizing them to the target size recursively until the native resolution is hit. Each
+   recursive step decreases the crop size by a factor of 2.
+1. FullImageDataset - An image patch dataset where the patches are dynamically extracted from full-size images. I have
+   generally stopped using this in favor of SingleImageDataset for performance reasons, but it is still useful for
+   validation and testing, so I keep it around.
+   
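The filename convention described for MultiframeDataset (the last 12 characters before the extension are a frame counter) can be sketched as follows. This is an illustration only: the helper names are hypothetical, the counter is assumed to consist of decimal digits, and the way the real dataset groups frames may differ.

```python
import os
from collections import defaultdict

COUNTER_LEN = 12  # per the description: last 12 characters before the extension


def split_frame_name(filename, counter_len=COUNTER_LEN):
    """Return (sequence_prefix, frame_number) for a frame filename."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    # Assumption: the counter is made of decimal digits.
    return stem[:-counter_len], int(stem[-counter_len:])


def group_frames(filenames):
    """Group filenames by prefix and order each group by frame number."""
    groups = defaultdict(list)
    for name in filenames:
        prefix, frame = split_frame_name(name)
        groups[prefix].append((frame, name))
    return {prefix: [n for _, n in sorted(items)] for prefix, items in groups.items()}


# Hypothetical filenames, built so that the counter is exactly 12 digits wide.
names = [f"clipA_{n:012d}.png" for n in (2, 1)] + [f"clipB_{7:012d}.png"]
groups = group_frames(names)
```

Here `groups` maps each prefix to its filenames in frame order, which is the kind of temporal grouping the description implies.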
+## Information about the "chunked" format
+
+This is the main format I have used in my experiments with image super-resolution. It is fast to read and provides
+rich metadata on the images that the patches are derived from, including a downsized "reference" full-size image and
+information on where the crop was taken from in the original image.
+
+### Creating a chunked dataset
+
+The file format for 'chunked' datasets is very particular. I recommend using `scripts/extract_subimages_with_ref.py`
+to build these datasets from raw images. Here is how you would do that:
+
+1. Edit `scripts/extract_subimages_with_ref.py` to set these configuration options:
+    ```
+    opt['input_folder'] = <path to raw images>
+    opt['save_folder'] = <where your chunked dataset will be stored>
+    opt['crop_sz'] = [256, 512]  # A list, the size of each sub-image that will be extracted and turned into patches.
+    opt['step'] = [128, 256]  # The pixel distance the algorithm will step for each sub-image. If this is < crop_sz, patches will share image content.
+    opt['thres_sz'] = 128  # Amount of space that must be present on the edges of an image for it to be included in the image patch. Generally should be equal to the lowest step size.
+    opt['resize_final_img'] = [1, .5] # Reduction factor that will be applied to image patches at this crop_sz level. TODO: infer this.
+    opt['only_resize'] = False # If true, disables the patch-removal algorithm and just resizes the input images.
+    opt['vertical_split'] = False # Used for stereoscopic images. Not documented.
+    ```
+   Note: the defaults should work fine for many applications.
+1. Execute the script: `python scripts/extract_subimages_with_ref.py`. If you are having issues with imports, make sure
+   you set `PYTHONPATH` to the repo root.
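To make the relationship between `crop_sz` and `step` concrete, here is a simplified sketch of sliding-window placement along one image axis. It deliberately ignores `thres_sz` and is not the algorithm used by `scripts/extract_subimages_with_ref.py`; the function name `window_starts` and the 1024-pixel axis length are assumptions made for the example.

```python
def window_starts(length, crop_sz, step):
    """Start offsets of windows of size crop_sz placed every `step` pixels along one axis."""
    return list(range(0, length - crop_sz + 1, step))


# One axis of a hypothetical 1024-pixel image, using the first crop_sz/step pair from the options.
starts = window_starts(1024, crop_sz=256, step=128)

# With step < crop_sz, neighbouring windows share this many pixels along the axis.
overlap = 256 - 128
```

With these values, consecutive windows overlap by half their width, which is what "patches will share image content" means in the `step` comment.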
+
+### Chunked cache
+
+To make trainer startup fast, the chunked datasets perform some preprocessing the first time they are loaded. The entire
+dataset is scanned and the result is cached in `cache.pth`. Later runs only need to load `cache.pth`, which greatly
+speeds up trainer startup when you are debugging issues.
+
+There is an important caveat here: this cache will not be recomputed unless you delete it. This means if you add new
+images to your dataset, you must delete the cache for them to be picked up! Likewise, if you copy your dataset to a
+new file path or a different computer, `cache.pth` must be deleted for it to work. If you skip this after moving the
+dataset, you'll likely run into some weird errors.
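Deleting the cache can be scripted. The sketch below assumes that `cache.pth` sits directly in the dataset's root folder, which is not stated above, so adjust the path if your copy lives elsewhere. `clear_chunked_cache` is a hypothetical helper, not part of DLAS.

```python
from pathlib import Path


def clear_chunked_cache(dataset_root, cache_name="cache.pth"):
    """Delete the cache file if present. Returns True if a file was removed."""
    # Assumption: the cache file lives directly under the dataset root.
    cache = Path(dataset_root) / cache_name
    if cache.is_file():
        cache.unlink()
        return True
    return False
```

Calling `clear_chunked_cache("/path/to/dataset")` after adding images or moving the dataset would force the cache to be rebuilt on the next load.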
+
+### Details about the dataset format
+
+If you look inside a dataset folder produced by the script above, you'll see a list of folders. Each folder represents
+a single image that was found by the script.
+
+Inside of that folder, you will see 3 different types of files:
+
+1. Image patches, each of which has a unique ID within the given set. These IDs do not need to be unique across the
+   entire dataset.
+1. `centers.pt`: a PyTorch pickle containing a dict of metadata about the patches, such as where they were located
+   in the source image and their original width/height.
+1. `ref.jpg`: a square version of the original image, downsampled to the patch size.
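A quick way to sanity-check one per-image folder against the layout described above is to sort its files into the three kinds. This is a sketch under assumptions: `summarize_chunk_folder` is a hypothetical helper, and every file other than `centers.pt` and `ref.jpg` is simply treated as a patch. Reading the metadata itself would presumably go through `torch.load`, which is left out so the sketch runs without PyTorch.

```python
from pathlib import Path


def summarize_chunk_folder(folder):
    """Classify the files of one per-image folder of a chunked dataset."""
    files = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    return {
        'has_centers': 'centers.pt' in files,
        'has_ref': 'ref.jpg' in files,
        # Assumption: everything else in the folder is an image patch.
        'patches': [f for f in files if f not in ('centers.pt', 'ref.jpg')],
    }
```

A folder that reports no `centers.pt` or no `ref.jpg` was probably not produced by the extraction script and is worth inspecting before training.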