
PRX Part 3 — Training a Text-to-Image Model in 24h!

https://huggingface.co/blog/Photoroom/prx-part3
A text-to-image diffusion model was trained in just 24 hours on 32 H200 GPUs for a budget of approximately $1500. This speedrun combines several advanced techniques to maximize performance under a strict compute constraint. The training recipe uses x-prediction to train directly in pixel space, eliminating the need for a VAE and enabling perceptual losses such as LPIPS and DINO. To further improve efficiency, token routing with TREAD reduces the computational load of the transformer blocks. The model is trained directly at 512px resolution and then fine-tuned at 1024px, demonstrating a modern and efficient approach to generative model training.
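To make the TREAD idea concrete, here is a minimal sketch of token routing in NumPy: a random subset of tokens is routed through the transformer blocks while the rest bypass them and are merged back afterwards. The function names, shapes, and keep ratio are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def tread_route(tokens, keep_ratio=0.5, rng=None):
    """Select a random subset of tokens to process through the blocks.

    Hypothetical sketch of TREAD-style routing; the bypassed tokens skip
    the transformer blocks entirely, cutting per-step compute.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    idx = rng.permutation(n)[:k]  # indices routed through the blocks
    return tokens[idx], idx

def tread_merge(tokens, processed, idx):
    """Scatter the processed tokens back into the full sequence."""
    out = tokens.copy()
    out[idx] = processed
    return out

# Toy usage: 8 tokens of dim 4; doubling stands in for the transformer blocks.
tokens = np.arange(32, dtype=np.float32).reshape(8, 4)
routed, idx = tread_route(tokens, keep_ratio=0.5)
merged = tread_merge(tokens, routed * 2.0, idx)  # bypassed rows stay unchanged
```

With `keep_ratio=0.5`, the transformer blocks only see half the tokens per step, which is where the compute saving comes from; the merge step restores the full sequence so later layers (or the output head) still operate on every token.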
0 points by ogg, 2 hours ago
