[Breaking] HA-support via Deployment #437

Merged
pat-s merged 75 commits from deployment into main 2023-07-17 19:09:46 +00:00
Member

Changes

A big shoutout to @luhahn for all his work in #205 which served as the base for this PR.

Documentation

  • After thinking for some time about it, I still prefer the distinct option (as started in #350), i.e. having a standalone "HA" doc under docs/ha-setup.md to not have a very long README (which is already quite long).
    Most of the information below should go into it with more details and explanations behind all of the individual components.

Chart deps

  • ~~Adds meilisearch as a chart dependency for a HA-ready issue indexer. Only works with >= Gitea 1.20~~
  • ~~Adds redis as a chart dependency for a HA-ready session and queue store.~~
  • Adds redis-cluster as a chart dependency for a HA-ready session and queue store (alternative to redis). Only works with >= Gitea 1.19.2.
  • Removes memcached in favor of redis-cluster
  • Adds postgresql-ha as the default DB dependency in place of postgres

Adds smart HA chart logic

The goal is to set smart config values that result in a HA-ready Gitea deployment if replicaCount > 1.

  • If replicaCount > 1,
    • gitea.config.session.PROVIDER is automatically set to redis-cluster
    • gitea.config.indexer.REPO_INDEXER_ENABLED is automatically set to false unless the indexer type is elasticsearch or meilisearch
    • redis-cluster is used for [queue], [cache] and [session], whether running in HA mode or not (see the template sketch below this list)
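To illustrate the idea (not the exact implementation in this PR), such a guard could live in a named helper template roughly like the sketch below. The helper name is made up here, and it assumes the session and indexer sections already exist in the merged values:

```
{{- /* illustrative sketch: force HA-safe settings when more than one replica is requested */ -}}
{{- define "gitea.ha-defaults" -}}
{{- if gt (int .Values.replicaCount) 1 -}}
  {{- /* sessions must live in a shared store instead of on the local filesystem */ -}}
  {{- $_ := set .Values.gitea.config.session "PROVIDER" "redis-cluster" -}}
  {{- /* the default bleve repo indexer writes to local disk and is not HA-safe */ -}}
  {{- if not (has (.Values.gitea.config.indexer.REPO_INDEXER_TYPE | default "bleve") (list "elasticsearch" "meilisearch")) -}}
    {{- $_ := set .Values.gitea.config.indexer "REPO_INDEXER_ENABLED" false -}}
  {{- end -}}
{{- end -}}
{{- end -}}
```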

Configuration of external instances of meilisearch and minio is documented in a new markdown doc.

Deployment vs Statefulset

Given all the discussions about this lately (#428), I think we could use both.
In the end, we do not have the requirement for a sequential pod scale up/scale down as it would happen in statefulsets.
On the other hand, we do not have actual stateless pods as we are attaching a RWX to the deployment.
Yet I think because we do not have a leader-election requirement, spawning the pods as a deployment makes "Rolling Updates" easier and also signals users that there is no "leader election" logic and each pod can just be "destroyed" at any time without causing interruption.

Hence I think we should be able to switch from a statefulset to a deployment, even in the single-replica case.

This change also brought up a templating/linting issue: the definition of .Values.gitea.config.server.SSH_LISTEN_PORT in ssh-svc.yaml just "luckily" worked so far due to naming-related lint processing. Due to the change from "statefulset" to "deployment", the processing queue changed and caused a failure complaining about config.server.SSH_LISTEN_PORT not being defined yet.
The only way I could see to fix this was to "properly" define the value in values.yaml instead of conditionally defining it in helpers.tpl. Maybe there's a better way?
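For reference, the statically defined defaults in values.yaml then look roughly like this (matching the values shown in the review excerpt further down; the inline comments are mine):

```yml
gitea:
  config:
    server:
      SSH_PORT: 22          # SSH port advertised for the rootful image
      SSH_LISTEN_PORT: 2222 # port the rootless image actually listens on
```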

Chart PVC Creation

I've adapted the automated PVC creation from another chart to be able to provide the storageClassName as I couldn't get dynamic provisioning for EFS going with the current implementation.
In addition, the naming and approach within the Gitea chart for PV creation are a bit unusual and aligning them might be beneficial.

This is a semi-unrelated change which will be breaking for existing users, but this PR already includes a lot of breaking changes, so including another one might not make it much worse...

  • New persistence.mount: whether to mount an existing PVC (via persistence.existingClaim)
  • New persistence.create: whether to create a new PVC (see the sketch below for how the two combine)
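A minimal sketch of how the two new keys could be combined when reusing a pre-provisioned RWX claim; the key names follow the description above, and the claim name is a placeholder:

```yml
persistence:
  enabled: true
  create: false                        # do not let the chart create a new PVC
  mount: true                          # mount the claim referenced below
  existingClaim: gitea-shared-storage  # placeholder: name of the pre-provisioned RWX PVC
  accessModes:
    - ReadWriteMany
```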

Testing

As this PR does a lot of things, we need proper testing.
The helm chart can be installed from the Git branch via helm-git as follows:

helm repo add gitea-charts git+https://gitea.com/gitea/helm-chart@/?ref=deployment
helm install gitea gitea-charts/gitea --version 0.0.0

It is highly recommended to test the chart in a dedicated namespace.

I've tested this myself with both redis and redis-cluster and it seemed to work fine.
I just did some basic operations though and we should do more niche testing before merging.

Exemplary values.yaml for testing (only needs a valid RWX storage class):

values.yaml
image:
  tag: "dev"
  pullPolicy: "Always"
  rootless: true

replicaCount: 2

persistence:
  enabled: true
  accessModes:
    - ReadWriteMany
  storageClass: FIXME

redis-cluster:
  enabled: false
  global:
    redis:
      password: gitea

gitea:
  config:
    indexer:
      ISSUE_INDEXER_ENABLED: true
      REPO_INDEXER_ENABLED: false

Preferred setup

The preferred HA setup with respect to performance and stability might currently be as follows:

  • Repos: RWX (e.g. EFS or Azurefiles NFS)
  • Issue indexer: Meilisearch (HA)
  • Session and cache: Redis Cluster (HA)
  • Attachments/Avatars: Minio (HA)

This will result in a ~ 10-pod HA setup overall.
All pods have very low resource requests.

fix #98

pat-s added the kind/breaking and status/needs-reviews labels 2023-04-14 09:24:44 +00:00
pat-s added 27 commits 2023-04-14 09:24:45 +00:00
add deployment
Some checks failed
continuous-integration/drone/push Build is failing
715ab4531c
pvc
Some checks failed
continuous-integration/drone/push Build is failing
62bec269ed
readd deployment:
Some checks failed
continuous-integration/drone/push Build is failing
4e9a19ff8b
add strategy block
Some checks failed
continuous-integration/drone/push Build is failing
61afcdf9d9
pvc
Some checks failed
continuous-integration/drone/push Build is failing
0cf80a9a2c
volumeClaim in deployment
Some checks failed
continuous-integration/drone/push Build is failing
ac83772720
add redis clsuter
Some checks failed
continuous-integration/drone/push Build is failing
314bf48755
add redis support
Some checks failed
continuous-integration/drone/push Build is failing
62917fc77e
add redis
Some checks failed
continuous-integration/drone/push Build is failing
fd02f8c7b1
use redis 6
Some checks failed
continuous-integration/drone/push Build is failing
69bf773407
use redis instead of redis-cluster
Some checks failed
continuous-integration/drone/push Build is failing
a8e71dd442
account for bleve repo indexers for multiple replicas
Some checks failed
continuous-integration/drone/push Build is failing
339c3f725f
add meilisearch
Some checks failed
continuous-integration/drone/push Build is failing
6f687ce6dc
docs
Some checks failed
continuous-integration/drone/push Build is failing
2c69a18c3d
meilisearch dns
Some checks failed
continuous-integration/drone/push Build is failing
17e0251c50
dynamic ISSUE_INDEXER_TYPE
Some checks failed
continuous-integration/drone/push Build is failing
efc8fe23d0
add minio
Some checks failed
continuous-integration/drone/push Build is failing
cad2d85cc7
meilisearch pvc rwx
Some checks failed
continuous-integration/drone/push Build is failing
23ca79ab6a
add redis-cluster support
Some checks failed
continuous-integration/drone/push Build is failing
9c8fab280c
use explicit get commands
Some checks failed
continuous-integration/drone/push Build is failing
1c7ea75151
remove redis.port and redis.host
Some checks failed
continuous-integration/drone/push Build is failing
6898dea6ea
docs
Some checks failed
continuous-integration/drone/push Build is failing
4213486988
Merge branch 'main' into deployment
Some checks failed
continuous-integration/drone/push Build is failing
2c00cc331b
solve lint issues
Some checks failed
continuous-integration/drone/push Build is failing
1b48c14963
fix unittests
All checks were successful
continuous-integration/drone/push Build is passing
continuous-integration/drone/pr Build is passing
2bf762f04e
pat-s added 1 commit 2023-04-15 09:30:10 +00:00
pat-s added 1 commit 2023-04-15 09:31:30 +00:00
Merge branch 'main' into deployment
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 47s
723291f4cd
pat-s force-pushed deployment from cd47456198 to 29cdb678c1 2023-04-15 10:19:18 +00:00 Compare
pat-s force-pushed deployment from 29cdb678c1 to b816ff2abc 2023-04-15 11:29:09 +00:00 Compare
pat-s force-pushed deployment from b816ff2abc to dab8a53fb9 2023-04-15 11:39:51 +00:00 Compare
Contributor

Your sample values seem wrong. Indexer settings below minio persistence?

Author
Member

~~I don't understand what you mean. Can you be more specific or comment directly in the diff?~~

Ah got it now! Indeed, a c/p mistake.

pat-s force-pushed deployment from dab8a53fb9 to 55486c8459 2023-04-15 17:56:38 +00:00 Compare
pat-s force-pushed deployment from 55486c8459 to b83e5030a4 2023-04-15 18:01:36 +00:00 Compare
pat-s force-pushed deployment from b83e5030a4 to 558e204840 2023-04-15 18:08:15 +00:00 Compare
pat-s force-pushed deployment from 558e204840 to 5a5c3ad720 2023-04-15 18:10:26 +00:00 Compare
pat-s added 1 commit 2023-04-15 20:52:34 +00:00
docs
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 43s
ab7eb4bd1f
Member

We could drop the memcached dependency and use redis by default. I mean, this PR adds both redis and redis-cluster as dependencies. For non-HA setups, the redis dependency can be used where memcached is used. What's your opinion, @pat-s?

Member

> We could drop the memcached dependency and use redis by default. I mean, this PR adds both redis and redis-cluster as dependency. For non-HA redis dependency can be used where memcached is used. What's your opinion, @pat-s?

Dropping memcached is a good idea; redis can do everything memcached is currently doing. And we would have to manage fewer dependencies. (Especially since this introduces redis + redis-ha as dependencies)

Author
Member

> We could drop the memcached dependency and use redis by default. I mean, this PR adds both redis and redis-cluster as dependency. For non-HA redis dependency can be used where memcached is used. What's your opinion, @pat-s?

Yeah, I guess so. Also, having fewer decisions to make makes things easier for users.
The only "downside" is that the default deployment would then have 5-6 pods instead of 2 (gitea + memcached).

> Especially since this introduces redis + redis-ha as dependency)

I am not 100% sure, but atm I think both are HA-ready - the second option, i.e. redis-cluster, is just a different way of how the redis components/workers talk to each other and are set up in the first place. I'm only mentioning it because you labeled the second one as redis-ha, thereby implying that the other one would not be HA-ready 🙂 (but please correct me if I'm wrong!)

We should probably use the "normal" redis as the default as redis-cluster was not functional until lately and just got fixed in Gitea HEAD. I did a backport but this means one needs at least 1.19.2 - and to use 1.18 or older, one would need to use redis and not redis-cluster.

pat-s added 1 commit 2023-04-18 20:10:45 +00:00
remove memcached
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 42s
b460f99fe5
Member

> I am not 100% sure but atm I think both are HA-ready - the second option, i.e. redis-cluster, is just a different way of how the redis components/workers talk to each other and are set up in the first place. Just because you labeled the second one as redis-ha and by this implying that the other one would not be HA-ready 🙂 (but please correct me if I'm wrong!)

Oops. I meant redis-cluster instead of redis-ha. Both redis variants can have multiple replicas. But when it comes to true high availability, I experienced downtime with the non-cluster redis when updating the deployment and replacing the pods. With the cluster setup I did not experience that. That's why I drew the thin line between redis and redis-cluster in my previous comment.

For non-HA Gitea the common redis would work fine. Both dependencies have their use cases.

The only "downside" is that the default deployment would then have 5-6 pods by default instead of 2 (gitea + memcached).

Non-HA would have 3: 1xGitea, 1xPostgres, 1xRedis
HA would have: Gitea, Postgres, Redis(-cluster), MinIO. All replicated by at least 2. MinIO must have at least 3 or 4 to be stable. Not sure about the recommended replicas for the others.

Is that what you mean by 5-6 Pods?

Author
Member

> I experienced a downtime with the non-cluster redis when updating the deployment and replacing the pods. With the cluster setup I did not experienced that.

Ah, that's interesting! I will double-check that, as I had it in my mind that I had no disconnects with either of them. But if this is the case, we could also think of only keeping redis-cluster, backporting the Gitea-core support to 1.18, and keeping the dependency list "small"?

EDIT: just tested again and did not face any disconnects when using redis with 1 master and 2 replicas while destroying/deleting one of the two Gitea replicas. Was your setup different in some way?

> Is that what you mean by 5-6 Pods?

I left minio out here as it's optional when a RWX holds the storage (probably more performant but not mandatory). It's certainly recommended, but that's why I left it at 5-6 (Gitea + redis). In addition, I forgot to count a PG-HA deployment as I always work with an external Postgres.
So I'd see the following pod counts:

  • All: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Minio (4) + Meilisearch (2) = 14

  • Minimum: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Meilisearch (2) = 10

Member

> EDIT: just tested again and did not face any disconnects when using redis with 1 master and 2 replicas while destroying/deleting one of the two Gitea replicas. Was your setup different in some way?

I had disconnects to Redis when actually upgrading Redis, not replacing Gitea pods. IIRC, this was not the case when using redis-cluster. Although, I could not test Gitea itself due to missing cluster support back then. So I tested it otherwise. But maybe I did something wrong.

> So I'd see the following pod counts:
>
>   • All: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Minio (4) + Meilisearch (2) = 14
>   • Minimum: Gitea HA (2) + PG HA (2) + Redis Cluster (4) + Meilisearch (2) = 10

It would be great if we kept the simplicity, even though Gitea is HA-capable. Consider a new user, ready to get Gitea up and running in a Kubernetes cluster, trying to get their code self-hosted. What you need for that:

  • Gitea (1x)
  • Database (1x)
  • Redis (1x)

Everything else is optional, IMO. Especially repository code search.

Therefore, we should keep both redis dependencies, IMO. Or is it possible to "misuse" redis-cluster as a single-instance? In that case we could drop the redis dependency.

Additionally I don't think we should add MinIO as a dependency. If I see it right, attachments and avatars would work within the RWX storage that is shared across the Deployment replicas. If someone wants to use MinIO for that, they should deploy it outside of this Helm Chart. MinIO itself is a massive setup - for a truly HA setup, you have to run at least 4 replicas in distributed mode (the absolute minimum requirement for leader election). The non-default production values set for that would be ~130 lines (no blank lines). This includes separate buckets for avatars and attachments. That's probably nothing we want to have in a Gitea Helm Chart that should be much more lightweight than those of unnamed alternatives 😉. Currently the proposed setup does not look like it's simpler.

Author
Member

> I had disconnects to Redis when actually upgrading Redis, not replacing Gitea pods. IIRC, this was not the case when using redis-cluster.

Ah! Did you have redis running in HA with multiple replicas? I just tried it and the problem seems to be the redis-master-0 pod. One can destroy the replicas without issues but once the master pod goes down, Gitea has an issue.

There is experimental support for multiple masters in the chart (https://github.com/bitnami/charts/blob/main/bitnami/redis/README.md#multiple-masters-experimental).

Overall I guess it's just easier to go with redis-cluster only then and remove redis -> fewer choices and less work for both HA and non-HA.

> Everything else is optional, IMO. Especially repository code search.

Definitely agree. The question is only: do we want to provide a skeleton in the chart to "easily" add these additional components or do we "just" provide documentation on how this can be done?

> Or is it possible to "misuse" redis-cluster as a single-instance? In that case we could drop the redis dependency.

I guess we could just go with redis-cluster even for single-instance. Yes it requires some additional pods but users can then also easily go HA if they want to. And don't need to switch between redis and redis-cluster.

> Additionally I don't think we should add MinIO as a dependency. If I see it right, attachments and avatars would work within the RWX storage that is shared across the Deployment replicas.

I see your point. I am fine with not adding it and only documenting it as an alternative to the RWX storage.

> for a truly HA setup, you have to run at least 4 replicas in distributed mode (absolute minimal requirement for leader election).

That's true, yes.

> The non-default production values set for that would be ~130 lines (no blank lines). This includes different buckets for avatars and attachments each.

Not sure if you always need distinct buckets for each and in my test install right now I don't have 130 config lines but again, I am fine with leaving minio out and just documenting it.

> That's probably nothing we want to have in a Gitea Helm Chart that should be much lightweight than those of unnamed alternatives 😉.

I think this is also a question of how you look at it: "lightweight" in terms of defaults and/or resources but rich in terms of expansion options - or lightweight in terms of general capabilities and chart dependencies.

Overall I think we might make our lives easier by not adding too many dependencies to the chart and just documenting which services one needs to provision in addition to go fully HA. And only adding redis-cluster as a required dependency, which would then also directly work for HA. Sounds like we all agree on this? If so, I would strip the PR down to only redis-cluster and just document all other optional HA deps.

(for testing purposes, it was still good/nice to be able to bootstrap a HA setup with all deps in this PR.)

Member

> Ah! Did you have redis running in HA with multiple replicas? I just tried it and the problems seems to be the redis-master-0 pod. One can destroy the replicas without issues but once the master pod goes down, Gitea has an issue.

I tested both master/replica and sentinel approach. In both cases the k8s service did not properly switch from the downscaled leader to a replica or sentinel, causing the disconnect.

> I guess we could just go with redis-cluster even for single-instance. Yes it requires some additional pods but users can then also easily go HA if they want to. And don't need to switch between redis and redis-cluster.

Sounds great. Let's go this way. 👍

>> Additionally I don't think we should add MinIO as a dependency....
>
> I see your point. I am fine with not adding it and only documenting it as an alternative to the RWX storage.

+1 for documenting MinIO as an alternative.

> Overall I think we might make our life easier with not adding too many dependencies into the chart but just documenting which services one needs to provision in addition to fully go HA. And only adding redis-cluster as a required dependency which would also directly work for HA then. Sounds like we all agree on this? If so, I would strip the PR down to only redis-cluster and just documenting all other optional HA deps.

A yes from my side. Stripping it down to redis-cluster with a simple templating skeleton to easily toggle HA and scale the necessary components properly will be a huge improvement for the Chart.

> The question is only: do we want provide a skeleton in the chart to "easily" add these additional components or do we "just" provide documentation on how this can be done?

Such a large-scale skeleton would require the other dependencies, but would be neat in the long run. Maybe we should start with documenting them first. As soon as we have more stability in the chart (e.g. tests, automated e2e deploys, values integrity validation, renovate automation), it will become much easier to add new dependencies and handle such complexity.

> (for testing purposes, it was still good/nice to be able to bootstrap a HA setup with all deps in this PR.)

💯. I appreciate the time you already spent on this PR. It's awesome to see Gitea being a solid competitor to the well-known big players.
In other Charts I've seen example values for various use cases. Those, paired with documentation of the MinIO and Meilisearch options, may help users find the right setup for their needs. But such example values are nothing that needs to be done in a first step.

pat-s force-pushed deployment from 7530dce879 to f5de5ceb53 2023-04-22 07:27:53 +00:00 Compare
pat-s added 2 commits 2023-04-22 07:56:14 +00:00
remove meilisearch chart dep
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 33s
a01473ff79
pat-s added 2 commits 2023-04-22 08:05:29 +00:00
docs
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 31s
5eb64a9726
pat-s added 1 commit 2023-04-22 08:10:12 +00:00
update redis-cluster
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 30s
6bace91af7
pat-s force-pushed deployment from cc7ad8cfda to c6f7f003cf 2023-04-22 09:09:48 +00:00 Compare
pat-s force-pushed deployment from c6f7f003cf to 64639934e9 2023-04-22 09:17:42 +00:00 Compare
pat-s force-pushed deployment from 64639934e9 to f640b5242f 2023-04-22 09:23:10 +00:00 Compare
pat-s force-pushed deployment from f640b5242f to a4c8ef39df 2023-04-22 09:28:46 +00:00 Compare
pat-s force-pushed deployment from a4c8ef39df to 441a85db36 2023-04-22 09:39:49 +00:00 Compare
pat-s force-pushed deployment from 441a85db36 to aeb5feb0cc 2023-04-22 17:28:33 +00:00 Compare
pat-s force-pushed deployment from aeb5feb0cc to 3dd793b337 2023-04-22 17:42:40 +00:00 Compare
pat-s force-pushed deployment from 3dd793b337 to 2e4a5092e4 2023-04-22 17:45:40 +00:00 Compare
pat-s force-pushed deployment from 2e4a5092e4 to d3254d4a62 2023-04-22 18:31:18 +00:00 Compare
pat-s force-pushed deployment from d3254d4a62 to 61144467f1 2023-04-22 18:34:54 +00:00 Compare
pat-s force-pushed deployment from 61144467f1 to d5a1f91860 2023-04-22 18:39:59 +00:00 Compare
pat-s force-pushed deployment from d5a1f91860 to ee196d2c1a 2023-04-22 18:42:30 +00:00 Compare
pat-s force-pushed deployment from ee196d2c1a to 0fa4c34def 2023-04-22 18:46:24 +00:00 Compare
pat-s force-pushed deployment from 0fa4c34def to a5c15ee421 2023-04-22 19:48:17 +00:00 Compare
pat-s force-pushed deployment from a5c15ee421 to 8727684f00 2023-04-22 19:50:34 +00:00 Compare
pat-s force-pushed deployment from 8727684f00 to a5f4437e8e 2023-04-22 19:52:47 +00:00 Compare
pat-s force-pushed deployment from a5f4437e8e to 364936df4e 2023-04-22 20:45:36 +00:00 Compare
pat-s added 1 commit 2023-04-22 21:08:20 +00:00
downgrade redis-cluster
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 30s
9af0281c94
pat-s added 2 commits 2023-05-01 19:57:32 +00:00
remove REPO_INDEXER_ENABLED hardcode in favor of assertion
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 38s
6451ddf766
pat-s added 1 commit 2023-05-01 21:28:58 +00:00
enable redis-cluster
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 36s
3f1775416a
pat-s added 1 commit 2023-05-02 13:22:10 +00:00
redis-cluster 8.4.4
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 34s
41d60f111d
pat-s added 1 commit 2023-05-02 13:32:05 +00:00
update assertion
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 34s
133e3625ca
pat-s added 2 commits 2023-05-02 13:45:24 +00:00
Merge branch 'main' into deployment
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 34s
69a83c6aa6
pat-s reviewed 2023-05-02 13:48:03 +00:00
@@ -3,2 +3,4 @@
Expand the name of the chart.
*/}}
{{- /* multiple replicas assertions */ -}}
Author
Member

@justusbunsi I've found a neat (but hacky) solution to apply some assertions. Unfortunately helm does not provide a native way to check for certain incompatibilities of settings - or at least I couldn't find one.

I've tested these and they work as expected. I think in the long term these checks can save quite a few users from faulty HA deployments, as they specifically look for settings which are incompatible in a HA setup.

What do you think?
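For illustration only, such an assertion can be expressed with Helm's fail function. The concrete check shown here (rejecting the bleve repo indexer with more than one replica) is just an example of the pattern, not the exact rule set added in this PR:

```
{{- if gt (int .Values.replicaCount) 1 -}}
  {{- /* sketch: assumes the indexer section exists in the merged values */ -}}
  {{- if eq (.Values.gitea.config.indexer.REPO_INDEXER_TYPE | default "bleve") "bleve" -}}
    {{- fail "replicaCount > 1 is not compatible with the 'bleve' repo indexer. Use 'elasticsearch'/'meilisearch' or disable the repo indexer." -}}
  {{- end -}}
{{- end -}}
```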

pat-s reviewed 2023-05-02 13:51:45 +00:00
values.yaml Outdated
@@ -351,0 +358,4 @@
# RUN_MODE: dev
server:
SSH_PORT: 22 # rootful image
SSH_LISTEN_PORT: 2222 # rootless image
Author
Member

@justusbunsi I've exposed these settings in values.yaml as these are their actual defaults. These are applied conditionally in helpers.tpl (https://gitea.com/gitea/helm-chart/src/branch/main/templates/_helpers.tpl#L262) without clear exposure to the user.

With these settings being defined, we could also remove the condition in helpers.tpl?

pat-s added 1 commit 2023-05-02 13:55:34 +00:00
values
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 37s
ca7d588f12
pat-s added 1 commit 2023-05-02 14:09:53 +00:00
update ha-doc
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 37s
d5dfb1ea61
Author
Member

@justusbunsi @luhahn I wonder if we also should swap postgres for postgres-ha, similar as we are switching to redis-cluster instead of redis.

This would make a functional HA deployment easier right from the start for new users as they would not need to migrate from a single-instance PG deployment to a HA-ready PG one (and only move storage).

The downside is that it would be a hard break for all current users as they would be forced to migrate their DB. We could also have both side by side but then we are again entering "maintenance hell" to some degree.

This PR is breaking in so many ways (cache dependency, PVC creation, statefulset -> deployment) that introducing another breaking change would arguably not be too disruptive.

What are your thoughts?


Overall I am pretty much done (aside from the PG discussion above) and everything works "fine" so far in all of my tests. The biggest challenge will most likely be to provide migration instructions to users and also test this ourselves.

pat-s added 1 commit 2023-05-13 14:39:31 +00:00
pat-s force-pushed deployment from 57af71c7c6 to b67f90f559 2023-05-13 14:44:11 +00:00 Compare
pat-s force-pushed deployment from b67f90f559 to c1072b3b5c 2023-05-13 14:46:35 +00:00 Compare
pat-s force-pushed deployment from c1072b3b5c to 7e162e0865 2023-05-13 14:52:48 +00:00 Compare
pat-s force-pushed deployment from 7e162e0865 to e19c90f8dd 2023-05-13 14:54:31 +00:00 Compare
pat-s force-pushed deployment from 507a821700 to b526fee2be 2023-05-13 15:05:20 +00:00 Compare
pat-s force-pushed deployment from b526fee2be to 7c34edfbe5 2023-05-13 15:08:19 +00:00 Compare
pat-s force-pushed deployment from 7c34edfbe5 to b14f5c363a 2023-05-13 15:22:11 +00:00 Compare
pat-s added 1 commit 2023-05-13 16:22:21 +00:00
pat-s force-pushed deployment from 8adaee9e44 to a9f3c9afb0 2023-05-13 16:51:12 +00:00 Compare
pat-s force-pushed deployment from a9f3c9afb0 to ae1f980f44 2023-05-13 16:53:47 +00:00 Compare
pat-s force-pushed deployment from ae1f980f44 to 59e4cda7d5 2023-05-13 17:33:28 +00:00 Compare
pat-s force-pushed deployment from 59e4cda7d5 to 547974ac16 2023-05-13 17:36:12 +00:00 Compare
pat-s force-pushed deployment from 547974ac16 to 600d47b44d 2023-05-13 17:39:23 +00:00 Compare
pat-s added 3 commits 2023-05-13 19:51:28 +00:00
add postgres-ha dep
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 35s
9eca0177f7
Author
Member

@justusbunsi

I made a brave move and switched our production instance to this branch (ofc after having tested in dev before) and I can report that everything seems to work well so far.

I am running with the following setup now

  • AWS EKS 1.26
  • 3 replicas
  • AWS RDS Postgres 14.x
  • AWS Elasticache Redis 7.x
  • AWS EFS RWX
  • Issue indexer: 'db'
  • Repo indexer: disabled

The transition was not that complicated; I only had to back up and move /data to the new RWX PVC.

This PR also adds topologySpreadConstraints, which help distribute the replicas across nodes (see the example below).
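For reference, such a constraint typically looks like the following; the exact values key and the pod labels used by the chart are assumptions on my side:

```yml
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname   # spread replicas across nodes
    whenUnsatisfiable: ScheduleAnyway     # prefer spreading, but still schedule if impossible
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: gitea     # assumption: label applied to the Gitea pods
```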

Site operations are slower now due to EFS compared to the former RWO but that was expected. And is nothing Gitea can do anything about.

I have added a note about the required PV transition. Users will need to specify existingClaim, as otherwise the chart will (probably) create a new PVC due to the new resource.

Last, I'd like to point out again that I added a few assertions in config.yaml which apply some checks during startup which should prevent erroneous configurations. AFAICS there is no other, more elegant way to do this in charts at the moment.

Currently there are no blockers for merging this. Meilisearch will be supported with > 1.20 but 'db' for the ISSUE_INDEXER works.

The only open question is whether we should switch from postgres to postgres-HA as the former might cause issues when going HA. And given that we have already opted for redis-cluster by default, I think we should also switch to postgres-HA.
For backward compatibility I think we need to keep the normal postgres chart as a dependency for a bit so that users do not have to make a forced DB transition.
My suggestion would be to have postgres-ha as the default and remove the deprecated postgres dependency (which would then be off by default) at some point in the future; see the values sketch below.
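A sketch of how that could look in values.yaml, assuming both dependencies are toggled via their usual enabled flags:

```yml
postgresql-ha:
  enabled: true    # proposed new default: HA-ready Postgres
postgresql:
  enabled: false   # deprecated single-instance dependency, kept temporarily for backward compatibility
```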

pat-s added 1 commit 2023-05-13 19:58:33 +00:00
minor
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 35s
688d06ad0b
pat-s changed title from WIP: [Breaking] HA-support via `Deployment` to [Breaking] HA-support via `Deployment` 2023-05-13 19:58:42 +00:00
pat-s added 1 commit 2023-05-13 21:28:29 +00:00
add podDisruptionBudget
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 35s
5dd5f009af
Owner

Ref: https://github.com/go-gitea/gitea/issues/13791

There are still some problems when running in HA mode.

Author
Member

Good to know but without checking how it goes in a running instance, we won't know if they are "real" issues or just "would be good to have/clean up" 🙂

The good news is that the first day(s) have been without issues in my org - haven't checked on the CRON things yet though.

pat-s added 2 commits 2023-06-17 20:06:00 +00:00
fail if garbage collector is enabled
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 39s
0516725030
pat-s added 1 commit 2023-06-17 20:12:54 +00:00
Merge branch 'main' into deployment
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 28s
b7e99778e3
pat-s added 1 commit 2023-06-17 20:16:02 +00:00
statefulset -> deployment
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 34s
87b69a54ba
pat-s added 1 commit 2023-07-12 07:22:17 +00:00
Merge branch 'main' into deployment
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 34s
69d73063ca
pat-s added 1 commit 2023-07-12 07:36:45 +00:00
fix GIT_GC_REPOS assertion
Some checks failed
check-and-test / check-and-test (pull_request) Failing after 34s
aa1cce0761
pat-s force-pushed deployment from de41932107 to 1f63c02706 2023-07-12 08:13:54 +00:00 Compare
pat-s force-pushed deployment from 1f63c02706 to 9d1c3f2aaf 2023-07-12 08:33:27 +00:00 Compare
pat-s force-pushed deployment from 9d1c3f2aaf to 47980bd845 2023-07-12 08:46:05 +00:00 Compare
pat-s force-pushed deployment from 01d97cc557 to fdc2adda14 2023-07-12 16:27:55 +00:00 Compare
pat-s force-pushed deployment from 9c1c3db7ca to ada1cbf082 2023-07-12 17:24:20 +00:00 Compare
pat-s force-pushed deployment from 1450afa3cc to 9ca6b0f8c0 2023-07-14 20:46:15 +00:00 Compare
pat-s added 4 commits 2023-07-17 06:15:03 +00:00
update README
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 37s
75cfbcf7ae
Author
Member

Merging after confirmation from within Discord.

techknowlogick approved these changes 2023-07-17 06:24:25 +00:00
lunny approved these changes 2023-07-17 13:07:31 +00:00
pat-s added 1 commit 2023-07-17 19:03:34 +00:00
readme
All checks were successful
check-and-test / check-and-test (pull_request) Successful in 35s
d54d5d6331
pat-s force-pushed deployment from 70e6a9273a to 22cffe5fa8 2023-07-17 19:07:48 +00:00 Compare
pat-s merged commit 8e27bb9bae into main 2023-07-17 19:09:46 +00:00
pat-s deleted branch deployment 2023-07-17 19:09:47 +00:00