Hey HN, we ran an ML consultancy for a year that helped companies build and host models in production. We learned how tedious and expensive hosting ML can be: customer models had to run on a fleet of always-on GPUs that often saw <10% utilization, which felt like a big money sink.
Over time we built infrastructure to improve GPU utilization. Six months ago we pivoted to focus solely on productizing that infra into a hosting platform for ML teams, one that removes the pain of deployment and cuts the cost of hosting models.
We deploy on A100 GPUs, and you pay per second of inference. If you aren’t running inferences, you pay nothing. A couple of points to clarify: yes, the models are actually cold-booted; we aren’t just running them in the background. We boot models faster because of how we manage OS memory. And yes, there is still cold-boot time. It’s not instant, but it’s significantly faster (e.g., 15 seconds instead of 10 minutes for some transformers like GPT-J).
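To make the workflow concrete, here’s a rough sketch of what calling a hosted model looks like from Python. The endpoint URL, key names, and payload shape below are illustrative placeholders, not our exact API:

    import time
    import requests

    API_URL = "https://api.example-host.dev/v1/infer"  # placeholder endpoint
    API_KEY = "your-api-key"      # placeholder credential
    MODEL_KEY = "your-model-key"  # identifies your deployed model

    def run_inference(prompt: str) -> dict:
        # A single HTTPS call; you're billed only for the seconds of GPU
        # time the inference consumes, cold boot included.
        resp = requests.post(
            API_URL,
            json={
                "apiKey": API_KEY,
                "modelKey": MODEL_KEY,
                "modelInputs": {"prompt": prompt},
            },
            timeout=120,  # leave headroom for a cold boot (~15s on GPT-J)
        )
        resp.raise_for_status()
        return resp.json()

    start = time.time()
    out = run_inference("The capital of France is")
    print(f"done in {time.time() - start:.1f}s: {out}")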
Lastly, model quality is not lost on Banana, because we aren’t doing traditional weight quantization or network pruning, which make networks smaller and faster but sacrifice quality. You can think of Banana more as a compiler + hosting platform: we break down your code to run faster on GPUs.
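For anyone curious what quantization actually does to weights, here’s a toy illustration of the rounding error naive int8 quantization introduces. This is a generic sketch of post-training quantization, shown for contrast; it isn’t something Banana does to your model:

    import numpy as np

    # Generate weights with a spread typical of a trained layer.
    rng = np.random.default_rng(0)
    weights = rng.normal(0, 0.02, size=10_000).astype(np.float32)

    # Symmetric int8 quantization: squeeze floats onto 256 levels.
    scale = np.abs(weights).max() / 127
    quantized = np.round(weights / scale).astype(np.int8)
    dequantized = quantized.astype(np.float32) * scale

    # The reconstruction error is what shows up as lost model quality.
    err = np.abs(weights - dequantized)
    print(f"mean abs error: {err.mean():.2e}, max: {err.max():.2e}")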
Try it out and let us know what you think!