feat: allow graceful shutdowns #546

Merged
lunny merged 3 commits from rowan-allspice/act_runner:graceful-shutdown into main 2024-05-27 07:38:56 +00:00
Contributor

Add a `Shutdown(context.Context) error` method to the `Poller`. Calling this method will first shut down all active polling, preventing any new jobs from spawning. It will then wait for either all jobs to finish, or for the context to be cancelled. If the context is cancelled, it will then force all jobs to end, and then exit.

Fixes #107

rowan-allspice added 1 commit 2024-05-13 20:19:17 +00:00
feat: allow graceful shutdowns
All checks were successful
checks / check and test (pull_request) Successful in 1m0s
b350719527
Add a Shutdown(context.Context) method to the Poller. This will first
shutdown all active polling, preventing any new jobs from spawning. It
will then wait for either all jobs to finish, or for the context to
finish. If the context finishes, it will then force all jobs to end,
and then exit.
lunny requested review from appleboy 2024-05-14 14:34:47 +00:00
rowan-allspice added 1 commit 2024-05-17 17:22:46 +00:00
fix: exec act_runner
All checks were successful
checks / check and test (pull_request) Successful in 27s
96b9b922cd
This passes signals for graceful shutdown.
wolfogre reviewed 2024-05-21 02:20:33 +00:00
@ -128,0 +127,4 @@
<-ctx.Done()
log.Infof("runner: %s gracefully shutting down", resp.Msg.Runner.Name)
return poller.Shutdown(context.Background())
Owner

Not sure about this: since it's `context.Background()`, how do we make it work like "If the context is cancelled, it will then force all jobs to end, and then exit"?

~~I think it could be like: the first signal tells the runner to stop taking new tasks and wait for the current task to finish; the second signal tells it to force-stop the current tasks but let them report errors.~~
(Not a good idea, see the new comment.)

Author
Contributor

The specific context here can be changed depending on the end design. The shutdown logic here was meant to mirror the API provided by [`http.Server.Shutdown`](https://pkg.go.dev/net/http#Server.Shutdown). Providing a `context.Background()` here essentially just makes the `Poller` wait indefinitely for running jobs to finish.

Owner

One more thing: many runners are started by Docker, and `docker stop` will send `SIGTERM` first, then send `SIGKILL` a few seconds later if the container doesn't stop.

And things would go like:

  1. `docker stop [the_runner]`
  2. The runner receives `SIGTERM`, stops taking new tasks, and waits for the current running tasks to finish.
  3. But the time is not enough, so `SIGKILL` is sent and the runner is killed immediately, without even being able to report errors to Gitea.

So maybe this could be a better design:

  • `docker stop`, `kill`, or Ctrl+C will stop the current running tasks but wait for them to report errors to Gitea, then exit.
  • `docker kill --signal SIGUSR1` or `kill -SIGUSR1` will stop taking new tasks and wait for the current task to finish. So it's not the default behavior but an optional advanced operation; users may have to wait a long time.
  • `docker kill` or `kill -SIGKILL` will stop the runner immediately.
Author
Contributor

I think relying on `SIGUSR1` to initiate a graceful shutdown will complicate running this in several container orchestrators. Neither [Kubernetes](https://github.com/kubernetes/kubernetes/issues/30051) nor [ECS](https://github.com/aws/containers-roadmap/issues/359) provides a way for end users to customize the signal sent to a container, instead relying on the [`STOPSIGNAL` Dockerfile directive](https://docs.docker.com/reference/dockerfile/#stopsignal) to provide correct signaling. This means that if a user wants to deploy a runner with graceful shutdowns, they will need to build a new image overriding the `STOPSIGNAL` directive to send `SIGUSR1`.

These orchestrators do provide a way to specify the timeout period for stopping (the equivalent of `docker stop -t <timeout in seconds>`). We are currently using this for rolling updates to our clusters, by setting the timeout for the container to a value longer than the `runner.timeout` provided in the config file. This causes the runner process to immediately stop polling for new jobs upon receiving the `SIGTERM`, and then wait for the individual jobs to either finish successfully or time out per the specified `runner.timeout` period.

Owner

I agree `SIGUSR1` would be a complex operation, but it's OK for users not to use it. `SIGTERM` will stop the current running tasks and let them report errors to Gitea; it's a kind of graceful shutdown.

Let's discuss the different types of shutdown.

  1. Stop taking new tasks and wait for the current task to finish. The most graceful, but it may require a long wait, maybe hours.
  2. Stop the current running tasks and let them report errors to Gitea. Kind of graceful, and it only requires seconds of waiting.
  3. Stop immediately; Gitea doesn't know what happened and waits on a zombie task.

And I think most users don't really want kind 1 with its hours of waiting, so they will run `docker stop` without a timeout, and that leads to kind 3.

So my idea is to avoid kind 3, make kind 2 easy to reach for most users, and make kind 1 optional for advanced users. If `SIGUSR1` is not convenient enough, runners could also provide other ways to reach kind 1, like accepting HTTP requests, executing a subcommand, or reading a unix socket. But not `SIGTERM`; leave that for kind 2.

Author
Contributor

Would a new runner config option, `ShutdownTimeout`, that defaults to 5 seconds work? Proposed behavior:

  • `SIGTERM` is sent, triggering polling to stop immediately
  • The `Poller` waits up to `ShutdownTimeout` for running jobs to finish gracefully
  • The `Poller` then kills any still-running jobs and reports their status to Gitea
  • The process then exits

The default of 5 seconds is chosen to be less than the default 10-second timeout for `docker stop`. This provides a default that splits the difference between shutdown models #1 and #2 that you described, while avoiding #3. It also allows either model #1 or #2 to be implemented by setting `ShutdownTimeout` to the same value as `Timeout` or to `0`, respectively. And it enables more flexible configurations, such as a `ShutdownTimeout` less than `Timeout` in order to bound deployment time while avoiding killing running jobs.

Owner

👍 Yes, I believe `ShutdownTimeout` is a good idea. However, the default value could be set to zero, which means not to wait and instead stop running jobs and report errors immediately. A timeout of 5 seconds would be insignificant for most use cases. And it would be easier to stay compatible with old config files which do not have the new option: just treat it as zero.

rowan-allspice added 1 commit 2024-05-23 16:21:09 +00:00
feat: add Runner.ShutdownTimeout config option
All checks were successful
checks / check and test (pull_request) Successful in 55s
0d79479484
This controls the amount of time the runner will wait for running jobs
to finish before cancelling them. Defaults to 0s in order to maintain
backwards compatibility with previous behavior.
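In config-file terms, this might look like the following fragment. The key name and placement are assumptions based on the existing snake_case options under `runner:` in `config.example.yaml`, not verified against the merged change:

```yaml
runner:
  # Existing option: maximum duration a single job may run.
  timeout: 3h
  # Option added by this PR (key name assumed): how long to wait for running
  # jobs to finish before cancelling them on shutdown.
  # Defaults to 0s, i.e. cancel and report immediately.
  shutdown_timeout: 30s
```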
Author
Contributor

Updated to use the `ShutdownTimeout` logic described above with a default of `0s`.

wolfogre approved these changes 2024-05-27 04:23:14 +00:00
lunny approved these changes 2024-05-27 07:38:47 +00:00
lunny merged commit d1d3cad4b0 into main 2024-05-27 07:38:56 +00:00