Why does act_runner daemon not exit on error? #460

Closed
opened 2024-01-09 20:56:35 +00:00 by davidfrickert · 7 comments
Contributor

Why does the daemon stay running when this happens:

time="2024-01-08T15:00:03Z" level=info msg="Starting runner daemon"
time="2024-01-08T15:00:03Z" level=error msg="fail to invoke Declare" error="unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host"
Error: unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host
2024-01-08 15:00:03,524 WARN exited: act_runner (exit status 1; not expected)
2024-01-08 15:00:04,526 INFO spawned: 'act_runner' with pid 345
time="2024-01-08T15:00:04Z" level=info msg="Starting runner daemon"
time="2024-01-08T15:00:04Z" level=error msg="fail to invoke Declare" error="unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host"
Error: unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host
2024-01-08 15:00:04,597 WARN exited: act_runner (exit status 1; not expected)
2024-01-08 15:00:06,601 INFO spawned: 'act_runner' with pid 355
time="2024-01-08T15:00:06Z" level=info msg="Starting runner daemon"
time="2024-01-08T15:00:06Z" level=error msg="fail to invoke Declare" error="unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host"
Error: unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host
2024-01-08 15:00:06,673 WARN exited: act_runner (exit status 1; not expected)
2024-01-08 15:00:09,709 INFO spawned: 'act_runner' with pid 365
time="2024-01-08T15:00:09Z" level=info msg="Starting runner daemon"
time="2024-01-08T15:00:10Z" level=error msg="fail to invoke Declare" error="unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host"
Error: unavailable: dial tcp: lookup gitea-http.tools.svc.cluster.local on 10.43.0.10:53: no such host
2024-01-08 15:00:10,221 WARN exited: act_runner (exit status 1; not expected)
2024-01-08 15:00:11,223 INFO gave up: act_runner entered FATAL state, too many start retries too quickly

Should it not exit and force a container restart?
In the current state, it keeps happening that act_runner fails but the container's main process is still running, so the container doesn't restart automatically.

Author
Contributor

If anyone has a suggestion for a Kubernetes health check that could be introduced here to work around this issue, that would also be appreciated.
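
In the meantime, a liveness probe on the pod might paper over it. A rough sketch only, assuming pgrep is available inside the act_runner image and that act_runner is the process name to watch; the timings are placeholders:

```
# Sketch: restart the pod once no act_runner process is left,
# which is exactly the zombie state described above.
livenessProbe:
  exec:
    command: ["pgrep", "-x", "act_runner"]
  initialDelaySeconds: 15
  periodSeconds: 30
  failureThreshold: 2
```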

Author
Contributor

After reviewing the Dockerfile, it seems the issue is caused by the configuration of supervisord.
Even if act_runner enters a failed state, supervisord does not exit, because dockerd is still running fine.
I submitted a PR with a working solution: essentially, it adds a section to supervisord.conf that makes supervisord exit if it detects that any of the supervised processes has exited.
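
For context, the kind of supervisord.conf addition I mean is an event listener that asks supervisord to shut down when a supervised process goes FATAL (the section in the actual PR may be named and configured differently):

```
[eventlistener:exit_on_fatal]
; Hypothetical section name. The shell loop speaks the minimal
; eventlistener protocol (print READY, wait for an event line) and then
; signals supervisord - its parent process - to shut down, so the
; container exits and can be restarted by Docker/Kubernetes.
command=sh -c 'printf "READY\n"; while read line; do kill -s QUIT "$PPID"; done'
events=PROCESS_STATE_FATAL
```

With a restart policy set on the container, supervisord exiting is what turns the FATAL state into an automatic restart instead of a zombie container.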

Author
Contributor

I have also published a build of this PR here if anyone is interested:
https://hub.docker.com/r/davidfrickert/act_runner/tags

Owner

Because it may be a temporary error? Sometimes the network may be unstable. I don't think act_runner should exit.

Author
Contributor

> Because it may be a temporary error? Sometimes the network may be unstable. I don't think act_runner should exit.

Sure, it definitely is a temporary error, but the container does not recover from it.
As you can see from the logs, act_runner enters a FATAL state (i.e. the act_runner process is no longer running) but the container stays up, doing nothing. This means that if I want the runner to resume processing jobs, I have to restart the container manually. That is obviously not good; containers should self-heal as much as possible.

I'm not super familiar with supervisord, but from what I saw, a simple fix for this issue is to exit supervisord if any process exits, which will trigger a container restart. This is better than having a zombie container, but a better option might exist!


Hi,
Just confirming that this is a real issue I'm hitting too.

> Because it may be a temporary error? Sometimes the network may be unstable. I don't think act_runner should exit.

Even if the error is temporary, act_runner enters a FATAL state, which means it no longer tries to connect, so it does not recover from temporary errors if they go on for long enough.

I think there are a few issues to fix here.

  • The Docker image for act_runner should terminate on a FATAL error. Any error that ends in an infinite loop/zombie state should be avoided.
  • The conditions under which FATAL errors occur should be configurable (so you can tell it to retry indefinitely if that's desired).
  • A liveness probe endpoint is needed for act_runner if it's to be deployed in production k8s.

I'll go check out the PR now and comment, but unfortunately I have no authority here. Thank you for the docker build!

Author
Contributor

Thanks for merging the fix, @lunny!
