Dataset documentation

This commit is contained in:
James Betker 2020-11-26 11:58:39 -07:00
parent 45a489110f
commit 0c6d7971b9

View File

@ -0,0 +1,78 @@
# DLAS Datasets
## Quick Overview
DLAS uses the standard Torch Dataset infrastructure. Datasets are expected to be constructed using an "options" dict,
which is fed directly from the configuration file. They are also expected to output a dict, where the keys are injected
directly into the trainer state.
Datasets conforming to the above expectations must be registered in `__init__.py` to be used by a configuration.
## Reference Datasets
This directory contains several reference datasets which I have used in building DLAS. They include:
1. Stylegan2Dataset - Reads a set of images from a directory, performs some basic augmentations on them and injects
them directly into the state. LQ = HQ in this dataset.
1. SingleImageDataset - Reads image patches from a 'chunked' format along with the reference image and metadata about
how the patch was originally computed. The 'chunked' format is described below. Includes built-in ImageCorruption
features actuated by `image_corruptor.py`.
1. MultiframeDataset - Similar to SingleImageDataset, but infers a temporal relationship between images based on their
filenames: the last 12 characters before the file extension are assumed to be a frame counter. Images from this
dataset are grouped together with a temporal dimension for working with video data.
1. MultiscaleDataset - Reads full images from a directory and builds a tree of images constructed by cropping squares
from the source image and resizing them to the target size recursively until the native resolution is hit. Each
recursive step decreases the crop size by a factor of 2.
1. FullImageDataset - An image patch dataset where the patches are dynamically extracted from full-size images. I have
generally stopped using this for performance reasons in favor of SingleImageDataset but it is useful for validation
and test so I keep it around.
## Information about the "chunked" format
This is the main format I have used in my experiments with image super resolution. It is fast to read and provides
rich metadata on the images that the patches are derived from, including a downsized "reference" fullsize image and
information on where the crop was taken from in the original image.
### Creating a chunked dataset
The file format for 'chunked' datasets is very particular. I recommend using `scripts/extract_subimages_with_ref.py`
to build these datasets from raw images. Here is how you would do that:
1. Edit `scripts/extract_subimages_with_ref.py` to set these configuration options:
```
opt['input_folder'] = <path to raw images>
opt['save_folder'] = <where your chunked dataset will be stored>
opt['crop_sz'] = [256, 512] # A list, the size of each sub-image that will be extracted and turned into patches.
opt['step'] = [128, 256] # The pixel distance the algorithm will step for each sub-image. If this is < crop_sz, patches will share image content.
opt['thres_sz'] = 128 # Amount of space that must be present on the edges of an image for it to be included in the image patch. Generally should be equal to the lowest step size.
opt['resize_final_img'] = [1, .5] # Reduction factor that will be applied to image patches at this crop_sz level. TODO: infer this.
opt['only_resize'] = False # If true, disables the patch-removal algorithm and just resizes the input images.
opt['vertical_split'] = False # Used for stereoscopic images. Not documented.
```
Note: the defaults should work fine for many applications.
1. Execute the script: `python scripts/extract_subimages_with_ref.py`. If you are having issues with imports, make sure
you set `PYTHONPATH` to the repo root.
### Chunked cache
To make trainer startup fast, the chunked datasets perform some preprocessing the first time they are loaded. The entire
dataset is scanned and a cache is built up and saved in cache.pth. Future invocations only need to load cache.pth on
startup, which greatly speeds up trainer startup when you are debugging issues.
There is an important caveat here: this cache will not be recomputed unless you delete it. This means if you add new
images to your dataset, you must delete the cache for them to be picked up! Likewise, if you copy your dataset to a
new file path or a different computer, cache.pth must be deleted for it to work. In the latter case, you'll likely run
into some weird errors.
### Details about the dataset format
If you look inside of a dataset folder output by above, you'll see a list of folders. Each folder represents a single
image that was found by the script.
Inside of that folder, you will see 3 different types of files:
1. Image patches, each of which have a unique ID within the given set. These IDs do not necessarily need to be unique
across the entire dataset.
1. `centers.pt` A pytorch pickle which is just a dict that describes some metadata about the patches, like: where they
were located in the source image and their original width/height.
1. `ref.jpg` Is a square version of the original image that is downsampled to the patch size.