feat: allow graceful shutdowns #546

Merged
lunny merged 3 commits from rowan-allspice/act_runner:graceful-shutdown into main 2024-05-27 07:38:56 +00:00
Contributor

Add a `Shutdown(context.Context) error` method to the `Poller`. Calling this method will first shut down all active polling, preventing any new jobs from spawning. It will then wait for either all jobs to finish, or for the context to be cancelled. If the context is cancelled, it will then force all jobs to end, and then exit.

Fixes #107

rowan-allspice added 1 commit 2024-05-13 20:19:17 +00:00
feat: allow graceful shutdowns
All checks were successful
checks / check and test (pull_request) Successful in 1m0s
b350719527
Add a Shutdown(context.Context) method to the Poller. This will first
shutdown all active polling, preventing any new jobs from spawning. It
will then wait for either all jobs to finish, or for the context to
finish. If the context finishes, it will then force all jobs to end,
and then exit.
lunny requested review from appleboy 2024-05-14 14:34:47 +00:00
rowan-allspice added 1 commit 2024-05-17 17:22:46 +00:00
fix: exec act_runner
All checks were successful
checks / check and test (pull_request) Successful in 27s
96b9b922cd
This passes signals for graceful shutdown.
wolfogre reviewed 2024-05-21 02:20:33 +00:00
@ -128,0 +127,4 @@
<-ctx.Done()
log.Infof("runner: %s gracefully shutting down", resp.Msg.Runner.Name)
return poller.Shutdown(context.Background())
Owner

Not sure about this: since it's `context.Background()`, how do we make it work like "If the context is cancelled, it will then force all jobs to end, and then exit"?

~~I think it could be like: the first signal tells the runner to stop taking new tasks and wait for the current task to finish; the second signal tells it to force-stop the current tasks but let them report errors.~~
(Not a good idea, see the new comment.)

Author
Contributor

The specific context here can be changed depending on the end design. The shutdown logic here was meant to mirror the API provided by [`http.Server.Shutdown`](https://pkg.go.dev/net/http#Server.Shutdown). Providing a `context.Background()` here essentially just makes the `Poller` wait indefinitely for running jobs to finish.

Owner

One more thing: many runners are started by Docker, and `docker stop` will send `SIGTERM` first, then send `SIGKILL` a few seconds later if the container doesn't stop.

And things would go like:

  1. `docker stop [the_runner]`
  2. The runner receives `SIGTERM`, stops taking new tasks, and waits for the current running tasks to finish.
  3. But the time is not enough, so `SIGKILL` is sent and the runner is killed immediately, without even being able to report errors to Gitea.

So maybe this could be a better design:

  • `docker stop`, `kill`, or Ctrl+C will stop the current running tasks but wait for them to report errors to Gitea, then exit.
  • `docker kill --signal SIGUSR1` or `kill -SIGUSR1` will stop taking new tasks and wait for the current task to finish. So it's not the default behavior but an optional advanced operation; users may have to wait a long time.
  • `docker kill` or `kill -SIGKILL` will stop the runner immediately.
Author
Contributor

I think relying on `SIGUSR1` to initiate a graceful shutdown will complicate running this in several container orchestrators. Neither [Kubernetes](https://github.com/kubernetes/kubernetes/issues/30051) nor [ECS](https://github.com/aws/containers-roadmap/issues/359) provides a way for end users to customize the signal sent to a container, instead relying on the [`STOPSIGNAL` Dockerfile directive](https://docs.docker.com/reference/dockerfile/#stopsignal) to provide correct signaling. This means that if a user wants to deploy a runner with graceful shutdowns, they will need to build a new image overriding the `STOPSIGNAL` directive to send `SIGUSR1`.

These orchestrators do provide a way to specify the timeout period for stopping (the equivalent of `docker stop -t <timeout in seconds>`). We are currently using this for rolling updates to our clusters, by setting the timeout for the container to a value longer than the `runner.timeout` provided in the config file. This causes the runner process to immediately stop polling for new jobs upon receiving the `SIGTERM`, and then wait for the individual jobs to either finish successfully or time out per the specified `runner.timeout` period.

Owner

I agree `SIGUSR1` would be a complex operation, but it's OK for users not to use it. `SIGTERM` will stop the current running tasks and let them report errors to Gitea; it's a kind of graceful shutdown.

Let's discuss the different types of shutdown.

  1. Stop taking new tasks and wait for the current task to finish. The most graceful, but it may require a long wait, maybe hours.
  2. Stop the current running tasks and let them report errors to Gitea. Kind of graceful, and it only requires seconds of waiting.
  3. Stop immediately; Gitea doesn't know what happened and waits on a zombie task.

And I think most users don't really want kind 1 with its hours of waiting, so they will run `docker stop` without a timeout, and that leads to kind 3.

So my idea is to avoid kind 3, make kind 2 easy to reach for most users, and make kind 1 optional for advanced users. If `SIGUSR1` is not convenient enough, runners could also provide other ways to reach kind 1, like accepting HTTP requests, executing a subcommand, or reading a unix socket. But not `SIGTERM`; leave that for kind 2.

Author
Contributor

Would a new runner config option, `ShutdownTimeout`, that defaults to 5 seconds work? Proposed behavior:

  • `SIGTERM` is sent, triggering polling to stop immediately
  • The `Poller` waits up to `ShutdownTimeout` for running jobs to finish gracefully
  • The `Poller` then kills any still-running jobs and reports their status to Gitea
  • The process then exits

The default of 5 seconds is chosen to be less than the default 10-second timeout for `docker stop`. This provides a default that splits the difference between shutdown models #1 and #2 that you described, while avoiding #3. It also allows either model #1 or #2 to be implemented by setting `ShutdownTimeout` to the same value as `Timeout` or to `0`, respectively. And it enables more flexible configurations, such as a `ShutdownTimeout` less than `Timeout` in order to bound deployment time while avoiding killing running jobs.

Owner

👍 Yes, I believe `ShutdownTimeout` is a good idea. However, the default value could be set to zero, which means not to wait and instead stop running jobs and report errors immediately. A timeout of 5 seconds would be insignificant for most use cases. And it would be easier to stay compatible with old config files which do not have the new option: just treat it as zero.

rowan-allspice added 1 commit 2024-05-23 16:21:09 +00:00
feat: add Runner.ShutdownTimeout config option
All checks were successful
checks / check and test (pull_request) Successful in 55s
0d79479484
This controls the amount of time the runner will wait for running jobs
to finish before cancelling them. Defaults to 0s in order to maintain
backwards compatibility with previous behavior.
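In config-file terms, this might look like the following fragment. The key name and placement are assumptions based on the existing snake_case options under `runner:` in `config.example.yaml`, not verified against the merged change:

```yaml
runner:
  # Existing option: maximum duration a single job may run.
  timeout: 3h
  # Option added by this PR (key name assumed): how long to wait for running
  # jobs to finish before cancelling them on shutdown.
  # Defaults to 0s, i.e. cancel and report immediately.
  shutdown_timeout: 30s
```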
Author
Contributor

Updated to use the `ShutdownTimeout` logic described above with a default of `0s`.

wolfogre approved these changes 2024-05-27 04:23:14 +00:00
lunny approved these changes 2024-05-27 07:38:47 +00:00
lunny merged commit d1d3cad4b0 into main 2024-05-27 07:38:56 +00:00