Enforce TTL on Jobs¶
- Status: accepted
- Deciders: arch meeting
- Date: 2023-03-30
Context and Problem Statement¶
Lingering finished Jobs are annoying, put unnecessary pressure on the API server, and bloat our metrics. I fail to see any point in keeping finished Jobs around in Kubernetes forever. Others agree, and since Kubernetes 1.23 the TTL-after-finished feature for Jobs is part of the stable API.
So, should we enforce the usage of TTL on Jobs?
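For reference, the field in question is `spec.ttlSecondsAfterFinished` on the Job itself; a minimal sketch (the Job name and image are illustrative only):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job  # illustrative name
spec:
  # delete this Job (and its Pods) 7 days after it finishes
  ttlSecondsAfterFinished: 604800
  template:
    spec:
      containers:
        - name: main
          image: busybox
          command: ["true"]
      restartPolicy: Never
```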
Decision Drivers¶
- We want to maintain platform security and stability.
- We want to find a solution which is scalable and minimizes MSE burden.
- We want to best serve our customers.
- We want to make the operator life easier.
Considered Options¶
- Do nothing
- Enforce TTL via Gatekeeper validation, aka reject a Job if no TTL is set in its spec.
- Enforce TTL via Gatekeeper mutation, aka update the Job spec if no TTL is set.
- Enforce TTL via Gatekeeper mutation, aka update the Job spec if no TTL is set, and lower the value if it is set too high.
Decision Outcome¶
Chosen option: 3 - "Enforce TTL via Gatekeeper mutation, aka update the Job spec if no TTL is set." We decided on a default Job TTL of 7 days, as this is a good compromise between being able to inspect finished Pods -- e.g., check exit status and logs -- and not keeping Pods around for too long.
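A sketch of what the chosen option could look like as a Gatekeeper `Assign` mutation (the policy name is hypothetical; the `pathTests` condition makes Gatekeeper fill in the field only when it is missing, matching option 3 rather than option 4):

```yaml
apiVersion: mutations.gatekeeper.sh/v1
kind: Assign
metadata:
  name: job-default-ttl  # hypothetical policy name
spec:
  applyTo:
    - groups: ["batch"]
      versions: ["v1"]
      kinds: ["Job"]
  location: "spec.ttlSecondsAfterFinished"
  parameters:
    # only assign when the field is absent, so user-provided values win
    pathTests:
      - subPath: "spec.ttlSecondsAfterFinished"
        condition: MustNotExist
    assign:
      value: 604800  # 7 days, the agreed default
```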
Positive Consequences¶
- Customers don't need to worry about setting any value.
- We don't have to worry about setting any value for our own Jobs (assuming we will run Gatekeeper in SC; in the meantime we can simply update our Job specs in Apps).
- There won't be any lingering Jobs in Kubernetes (assuming that no bad finalizers are set)
- The TTL of Jobs will be visible on our IO site and possibly stated in our ToS.
Negative Consequences¶
- Some customers might not agree with our chosen TTL (not likely though, IMO).