It's a bit weird to talk about steps but not about the sampler (20 steps with Euler vs 20 steps with DPM++ 2M Karras are pretty different beasts, both in terms of speed and quality).
I also see compiling but no AITemplate, which seems to be among the hottest ways to speed up SD recently.
This could save a lot of money on Replicate.ai.
Especially if you are charging your users the same 1,000% markup while your own costs have been cut to a third and you deliver results faster.
I don’t know, man, out of the box on SD-Next it’s about 3-4 secs for a picture at 1024 with UniPC and 20 steps on a 4090.
Playing around with the CFG technique, I'm finding that turning off guidance at the 40% mark causes requested fine details to not appear in the final image. This sorta implies that switching CFG midway and/or switching prompt vectors might be interesting from a prompting standpoint, but it kinda kills it as a performance optimization.
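For anyone wondering where the speedup from cutting guidance would come from, here is a rough sketch of the bookkeeping. It assumes the usual setup where a guided step costs two UNet passes (conditional plus unconditional) and an unguided step costs one. The function names `guidance_schedule` and `unet_passes` are made up for illustration and are not from any library, and the 7.5 scale is just a common default, not something from the article.

```python
def guidance_schedule(num_steps: int, cfg_scale: float, cutoff: float) -> list[float]:
    """Per-step guidance scale: cfg_scale before the cutoff fraction, 1.0 after.

    A scale of 1.0 makes uncond + scale * (cond - uncond) collapse to cond,
    so the unconditional pass can be skipped on those steps.
    """
    cutoff_step = round(num_steps * cutoff)
    return [cfg_scale if i < cutoff_step else 1.0 for i in range(num_steps)]


def unet_passes(schedule: list[float]) -> int:
    """Count UNet forward passes: 2 per guided step, 1 per unguided step."""
    return sum(2 if scale != 1.0 else 1 for scale in schedule)


# 20 steps, guidance switched off at the 40% mark (hypothetical settings)
schedule = guidance_schedule(num_steps=20, cfg_scale=7.5, cutoff=0.4)
print(unet_passes(schedule), "passes with cutoff vs", 2 * len(schedule), "without")
```

So the saving is real on paper, but it is only worth anything if the image after the cutoff still has the details you asked for, which is exactly what I'm not seeing.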