Ask HN: How do you make sure all your batch jobs ran?

  • At work we use Icinga, and our batch jobs submit passive check results through a wrapper script that sends each job's status and output. If a job fails or doesn't check in within a reasonable time, Icinga triggers an alert. It's probably more than you want to set up, but it might give you some ideas.

    Alternatively, you could make the jobs quiet on success and noisy on failure, and just have cron mail you their output? Or write a wrapper script that saves each job's exit status, plus a status job that mails you if any of them failed?
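    A rough sketch of that last idea, in Python. Everything here is made up for illustration: the names run_job and find_problems, the one-JSON-file-per-job layout, and the "maximum age in seconds" table. It is not how Icinga or any particular tool does it. The wrapper records each job's exit code and finish time, and the status check reports jobs that failed, are overdue, or never reported at all.

```python
import json
import subprocess
import time
from pathlib import Path


def run_job(name, cmd, status_dir):
    """Run cmd, then record its exit code, finish time, and output under status_dir."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    record = {
        "name": name,
        "exit_code": proc.returncode,
        "finished_at": time.time(),
        "output": proc.stdout[-4000:],  # keep only the tail of long output
    }
    status_dir = Path(status_dir)
    status_dir.mkdir(parents=True, exist_ok=True)
    (status_dir / f"{name}.json").write_text(json.dumps(record))
    return record


def find_problems(expected, status_dir, now=None):
    """Return (job, reason) pairs for jobs that failed, are overdue, or never reported.

    expected maps job name -> maximum allowed seconds since the last finished run
    (a hypothetical config format chosen for this sketch).
    """
    now = time.time() if now is None else now
    problems = []
    for name, max_age in expected.items():
        path = Path(status_dir) / f"{name}.json"
        if not path.exists():
            problems.append((name, "never reported"))
            continue
        record = json.loads(path.read_text())
        if record["exit_code"] != 0:
            problems.append((name, f"exit code {record['exit_code']}"))
        elif now - record["finished_at"] > max_age:
            problems.append((name, "overdue"))
    return problems
```

    If the status job prints one line per problem and nothing otherwise, cron's normal behaviour of mailing any output gives you the "quiet on success, noisy on failure" effect. The "never reported" and "overdue" cases matter as much as the exit code, because a job that silently stopped running never gets the chance to report a failure.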

  • If the batch jobs have few or simple interdependencies, scheduling is the easy part. Are your tasks too complex for cron/at/batch? For example, do they require coordination across machines? That might suggest looking at Slurm, LSF, or another distributed job scheduler, or running them on Kubernetes. That sounds like overkill in your case, though.

    It doesn't sound like a scheduling problem - it sounds like a noticing problem. You have to figure out what to do on failure: email, text, retry, log, etc. (Hence the suggestion of Kubernetes, or another declarative automation system like Ansible or Puppet, which can act to restore the desired state rather than only report on it.) If the requirement is "daemon X should be running", checking for it and sending an email is the easiest and least useful response.
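    To make the "what to do on failure" point concrete, here is a minimal Python sketch of one possible policy: retry a few times with backoff, log every failed attempt, and only notify a human once the retries are used up. The function name run_with_policy and the notify/sleep parameters are hypothetical, chosen for this example; notify could be anything that sends an email or a text. It assumes retries is zero or more.

```python
import logging
import subprocess
import time

log = logging.getLogger("batch")


def run_with_policy(cmd, retries=2, delay=1.0, notify=print, sleep=time.sleep):
    """Run cmd; retry on non-zero exit, then notify once if every attempt failed.

    Returns True if some attempt succeeded, False otherwise.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
        if proc.returncode == 0:
            return True
        log.warning("attempt %d/%d of %s failed with exit code %d",
                    attempt, attempts, cmd, proc.returncode)
        if attempt < attempts:
            sleep(delay * 2 ** (attempt - 1))  # exponential backoff between attempts
    # Every attempt failed: escalate once, including the tail of the last output.
    notify(f"{cmd} failed after {attempts} attempts; last output:\n{proc.stdout[-2000:]}")
    return False
```

    The idea is that the message a person receives means "this needs you", not "something blipped", which is roughly what separates a useful alert from the check-and-email response.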