Ask HN: How are LLMs served at scale?

  • the session is tied to a GPU cluster. It would actually be quite inefficient to switch from one GPU cluster to another mid-session, but it's needed in a failure scenario
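A minimal sketch of what that session affinity could look like: hash the session id onto a fixed list of clusters and only reassign on failure. All names here (`route`, `CLUSTERS`) are illustrative, not from any real serving stack.

```python
import hashlib

CLUSTERS = ["gpu-cluster-a", "gpu-cluster-b", "gpu-cluster-c"]

def route(session_id: str, healthy: set) -> str:
    """Pick a cluster deterministically; fall back only on failure."""
    candidates = [c for c in CLUSTERS if c in healthy]
    if not candidates:
        raise RuntimeError("no healthy clusters")
    # Stable hash so the same session keeps hitting the same cluster
    # (keeping its KV cache warm); a mid-session move loses that cache.
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return candidates[h % len(candidates)]

healthy = set(CLUSTERS)
first = route("session-42", healthy)
assert route("session-42", healthy) == first  # sticky while healthy
healthy.discard(first)                        # simulate a cluster failure
assert route("session-42", healthy) != first  # rerouted only on failure
```

The sticky part is just the deterministic hash; the failover cost the parent mentions is that the new cluster starts with a cold KV cache for that session.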

  • probably good batching and tensor parallelism
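A toy sketch of the tensor-parallelism idea, assuming column-wise sharding of one linear layer: each "device" holds a slice of the weight matrix, does its local matmul over the batch, and the slices are concatenated (an all-gather in a real system). The shapes and shard count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # a batch of 4 requests, hidden size 8
W = rng.standard_normal((8, 16))   # full weight matrix of one layer

shards = np.split(W, 4, axis=1)          # one column shard per "device"
partials = [x @ w for w in shards]       # each device's local matmul
y = np.concatenate(partials, axis=1)     # all-gather the output slices

assert np.allclose(y, x @ W)  # parallel result matches the single-device one
```

Batching shows up here too: the same sharded weights serve all 4 rows of `x` in one pass, which is where most of the throughput comes from.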