feat: allow graceful shutdowns #546
Add a `Shutdown(context.Context) error` method to the `Poller`. Calling this method will first shut down all active polling, preventing any new jobs from spawning. It will then wait for either all jobs to finish or for the context to be cancelled. If the context is cancelled, it will force all jobs to end and then exit.

Fixes #107
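A minimal sketch of the shape such a method could take; the `Poller` fields below are illustrative assumptions, not the actual act_runner internals:

```go
package poller

import "context"

// Poller fields here are assumptions for illustration only.
type Poller struct {
	stopPolling context.CancelFunc // cancels the polling loop; no new jobs spawn
	jobsDone    chan struct{}      // closed once all running jobs have finished
	killJobs    context.CancelFunc // force-cancels any still-running jobs
}

// Shutdown stops polling, then waits for running jobs to finish or for ctx
// to be cancelled, whichever happens first.
func (p *Poller) Shutdown(ctx context.Context) error {
	p.stopPolling()
	select {
	case <-p.jobsDone:
		return nil // all jobs finished gracefully
	case <-ctx.Done():
		p.killJobs() // context cancelled: force all remaining jobs to end
		return ctx.Err()
	}
}
```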
```diff
@@ -128,0 +127,4 @@
+	<-ctx.Done()
+	log.Infof("runner: %s gracefully shutting down", resp.Msg.Runner.Name)
+	return poller.Shutdown(context.Background())
```
Not sure about this: since it's `context.Background()`, how do we make it work like "If the context is cancelled, it will then force all jobs to end, and then exit"?

I think it could be like: the first signal tells the runner to stop taking new tasks and wait for the current task to finish; the second signal tells it to force-stop the current tasks but let them report errors. (Not ideal; see the new comment.)
The specific context here can be changed depending on the end design. The shutdown logic here was meant to mirror the API provided by `http.Server.Shutdown`. Providing a `context.Background()` here essentially just makes the `Poller` wait indefinitely for running jobs to finish.
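For reference, the standard-library pattern being mirrored; the choice of context passed to `Shutdown` is what controls how long it waits (usage below is illustrative):

```go
package main

import (
	"context"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}
	go srv.ListenAndServe()

	// srv.Shutdown(context.Background()) would wait indefinitely for
	// in-flight requests to drain. A timeout context bounds the wait:
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		srv.Close() // deadline hit with requests still active: force-close
	}
}
```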
One more thing: many runners are started by Docker, and `docker stop` will send `SIGTERM` first, then send `SIGKILL` a few seconds later if the container doesn't stop. So things would go like this:

1. `docker stop [the_runner]` is run.
2. `SIGTERM` is sent; the runner stops taking new tasks but keeps waiting for the current running tasks to finish.
3. `SIGKILL` is sent a few seconds later, and the runner is killed immediately; it cannot even report errors to Gitea.

So maybe this could be a better design:
1. `docker stop`, `kill`, or `Ctrl+C` will stop the current running tasks but wait for them to report their errors to Gitea, then exit.
2. `docker kill --signal SIGUSR1` or `kill -SIGUSR1` will stop taking new tasks and wait for the current tasks to finish. It's not the default behavior but an advanced, optional operation; users may have to wait for a long time.
3. `docker kill` or `kill -SIGKILL` will stop the runner immediately.
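A sketch of how that two-signal design could be wired; `Poller.Shutdown` is the method proposed in this PR, while the rest (signal choice, 5-second window) is assumed for illustration:

```go
import (
	"context"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// run blocks until a shutdown signal arrives, then shuts the poller down.
func run(p *Poller) error {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM, syscall.SIGUSR1)

	switch <-sigs {
	case syscall.SIGUSR1:
		// Advanced, opt-in: stop taking new tasks and wait indefinitely
		// for the current tasks to finish.
		return p.Shutdown(context.Background())
	default: // SIGINT / SIGTERM
		// Default: stop running tasks, but give them a short window to
		// report their errors to Gitea before exiting.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return p.Shutdown(ctx)
	}
}
```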
I think relying on `SIGUSR1` to initiate a graceful shutdown will complicate running this in several container orchestrators. Neither Kubernetes nor ECS provides a way for end users to customize the signal sent to a container, instead relying on the `STOPSIGNAL` Dockerfile directive to provide correct signaling. This means that if a user wants to deploy a runner with graceful shutdowns, they will need to build a new image overriding the `STOPSIGNAL` directive to send `SIGUSR1`.

These orchestrators do provide a way to specify the timeout period for stopping (the equivalent of `docker stop -t <timeout in seconds>`).
We are currently using this for rolling updates to our clusters, setting the container's stop timeout to a value longer than the `runner.timeout` provided in the config file. This causes the runner process to immediately stop polling for new jobs upon receiving the `SIGTERM`, and then wait for the individual jobs to either finish successfully or time out per the specified `runner.timeout` period.
I agree that `SIGUSR1` would be a complex operation, but it's OK for users not to use it. And `SIGTERM` will stop the current running tasks and let them report errors to Gitea, so it's a kind of graceful shutdown.

Let's discuss the different types of shutdown:

1. Stop taking new tasks and wait for the current running tasks to finish; this could take a long time.
2. Stop the current running tasks, but wait for them to report their errors to Gitea before exiting.
3. Stop the runner immediately; running tasks are killed and cannot even report errors to Gitea.
And I think most users don't really want kind 1 to wait for hours, so they will run `docker stop` without a timeout, and that leads to kind 3. So my idea is to avoid kind 3, make kind 2 easy to reach for most users, and keep kind 1 as an option for advanced users. If `SIGUSR1` is not convenient enough, the runner could also provide more ways to reach kind 1, like accepting HTTP requests, executing a subcommand, or reading a unix socket. But not `SIGTERM`; leave that to kind 2.

Would a new runner config option, `ShutdownTimeout`, that defaults to 5 seconds work? Proposed behavior:
1. `SIGTERM` is sent, triggering polling to stop immediately.
2. `Poller` waits up to `ShutdownTimeout` for running jobs to finish gracefully.
3. `Poller` then kills any still-running jobs and reports their status to Gitea.

The default of 5 seconds is chosen to be less than the default 10-second timeout for `docker stop`. This provides a default that splits the difference between the shutdown models #1 and #2 that you described, while avoiding #3. It also allows either model #1 or #2 to be implemented by setting `ShutdownTimeout` to the same value as `Timeout` or to `0`, respectively. And it enables more flexible configurations, such as a `ShutdownTimeout` less than `Timeout`, in order to bound deployment time while avoiding killing running jobs.

👍 Yes, I believe `ShutdownTimeout` is a good idea. However, the default value could be set to zero, which means don't wait: stop running jobs and have them report their errors immediately. A timeout of 5 seconds would be insignificant for most use cases, and a zero default makes it easier to stay compatible with old config files that don't have the new option: just treat a missing value as zero.
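A sketch of how this could plug into the shutdown path from the diff above; `cfg.Runner.ShutdownTimeout` is the proposed option, and the wiring is illustrative:

```go
<-ctx.Done() // SIGTERM received: polling stops, no new jobs will spawn

// ShutdownTimeout == 0 yields an already-expired context, so running jobs
// are cancelled and report their errors to Gitea right away; a positive
// value gives them that long to finish gracefully first.
shutdownCtx, cancel := context.WithTimeout(context.Background(), cfg.Runner.ShutdownTimeout)
defer cancel()
return poller.Shutdown(shutdownCtx)
```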
Updated to use the `ShutdownTimeout` logic described above, with a default of 0s.
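One nice side effect of the 0s default is that compatibility with old config files falls out of Go's zero values (struct shape assumed for illustration):

```go
// An old config file with no shutdown_timeout key simply unmarshals to the
// zero value, 0s, which means "cancel running jobs and report immediately".
type Runner struct {
	Timeout         time.Duration `yaml:"timeout"`
	ShutdownTimeout time.Duration `yaml:"shutdown_timeout"`
}
```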