I'd like to try some different (newer) transformer variants. The way to get there is to loosely decouple the transformer portion of this architecture from GPT, so a different block implementation can be dropped in without touching the rest of the model. This should actually be fairly easy.
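A minimal sketch of what that decoupling could look like, assuming a PyTorch-style setup; the names here (`GPT`, `VanillaBlock`, `block_cls`) are illustrative, not the actual classes in this codebase. The idea is just that the GPT shell (embeddings, final norm, head) stays fixed while the transformer block class is injected, so a variant block only has to match the block's constructor and forward signature.

```python
import torch
import torch.nn as nn


class VanillaBlock(nn.Module):
    """A standard pre-norm transformer block (attention + MLP)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)
        # Causal mask so each position only attends to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out
        x = x + self.mlp(self.ln2(x))
        return x


class GPT(nn.Module):
    """GPT shell: embeddings and head stay fixed; the block class is injected."""

    def __init__(self, vocab_size: int, d_model: int, n_heads: int,
                 n_layers: int, max_len: int, block_cls=VanillaBlock):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Any transformer variant with the same (d_model, n_heads) interface fits here.
        self.blocks = nn.ModuleList(
            [block_cls(d_model, n_heads) for _ in range(n_layers)]
        )
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))


# Trying a newer variant then becomes a one-argument change, e.g.:
# model = GPT(vocab_size=50257, d_model=256, n_heads=8, n_layers=4,
#             max_len=128, block_cls=SomeNewerBlock)
```

With the block factored out like this, the training loop and tokenizer don't need to know which variant is running; only the block class changes between experiments.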