Gatewayapi Namespaced Mode #4690

Open
radixo wants to merge 2 commits into tigera:master from radixo:gatewayapi-deployment-enterprise

Contributor

@radixo radixo commented Apr 14, 2026

Description

Add a gatewayDeploymentMode field to the GatewayAPI CR that controls where Envoy Gateway deploys managed proxy workloads, plus the operator-side plumbing to support both modes.

  • ControllerNamespace (default, existing behaviour): Envoy Gateway controller and all Gateway proxies live in tigera-gateway.
  • GatewayNamespace (new): controller lives in calico-system (to concentrate Calico components in one namespace); proxies are deployed into each Gateway's own namespace so namespace admins can own their Gateways' NetworkPolicies.
  • The field is immutable once set (CEL self == oldSelf).
  • Embed the Envoy Gateway Helm chart (gateway-helm.tgz) and render at runtime via the Helm SDK — matches the Istio chart pattern, drops a 3.3MB pre-rendered YAML from git. Render results are cached per deployment mode; the release namespace is derived from the mode so every downstream resource (Deployment, Service, Role/RoleBinding, MutatingWebhookConfiguration, certgen Job, EnvoyProxy) lands in the right namespace automatically.
  • In GatewayNamespace + Enterprise mode, per-namespace resources are provisioned for each namespace containing an operator-managed Gateway: waf-http-filter ServiceAccount, waf-http-filter-gateway-resources RoleBinding (scoped to that namespace's Gateway-API resources), tigera-operator-secrets RoleBinding, and a tigera-pull-secret copy. Cluster-scoped permissions (licensekeys, tokenreviews) go through a single shared waf-http-filter-gateway-namespaces ClusterRoleBinding with one subject per Gateway namespace, avoiding N cluster-scoped objects.
  • Split the existing waf-http-filter ClusterRole into waf-http-filter-cluster-scoped and waf-http-filter-gateway-resources. Deprecated combined ClusterRole/ClusterRoleBinding are cleaned up on every reconcile for upgrade safety.
  • Reserved namespaces (calico-system, tigera-operator) are excluded from tigera-operator-secrets RoleBinding + pull-secret creation/deletion — the core Installation controller owns those there.
  • Cleanup invariants: when a namespace no longer contains any operator-managed Gateway, the operator removes its per-namespace resources on the next reconcile. The delete order is Secret → SA → RoleBinding → tigera-operator-secrets RoleBinding; the tigera-operator-secrets RoleBinding is what grants the operator delete on Secrets, so removing it first yields a 403 and aborts the reconcile. Shared ClusterRoleBinding subjects are recomputed every reconcile, and the CRB is removed entirely when no Gateway namespaces remain.
  • Render three Calico v3 NetworkPolicies under the calico-system tier to keep gateway components functional under a default-deny posture: calico-system.default-deny in tigera-gateway (ControllerNamespace mode only; calico-system already has one via core Installation), calico-system.envoy-gateway allow policy for the controller + certgen Job, and calico-system.envoy-proxy allow policy for Envoy Proxy workloads (ControllerNamespace mode only, since in GatewayNamespace mode proxies live in user namespaces and are out of scope for our default-deny). Policy rendering is gated on the calico-system Tier existing (via WaitToAddTierWatch + Get(Tier)), matching apiserver, manager, logstorage/*, and monitor controllers.
  • Tests: UT coverage for both deployment modes, per-namespace resource lifecycle, reserved-namespace guards, cleanup ordering, and the policy-gated-off path. FV coverage for both modes (GatewayClass lifecycle, custom EnvoyProxy watch, l7-log-collector env wiring, EnvoyGateway ConfigMap behaviour), plus two GatewayNamespace-mode-specific FVs for controller-namespace routing and per-namespace resource provisioning/cleanup on Gateway create/delete.
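
The cleanup ordering in the list above can be sketched as a small stable sort. This is an illustration only, not the operator's code: `staleObject`, `deletePriority`, and `orderForDeletion` are hypothetical names, and treating unknown kinds like an ordinary RoleBinding is an assumption. Only the resource names and the Secret, SA, RoleBinding, tigera-operator-secrets RoleBinding order come from the description.

```go
package main

import (
	"fmt"
	"sort"
)

// staleObject is a hypothetical stand-in for a per-namespace resource queued
// for deletion; a real controller would hold typed Kubernetes objects.
type staleObject struct {
	Kind string
	Name string
}

// deletePriority maps an object to its slot in the delete order
// Secret -> ServiceAccount -> RoleBinding -> tigera-operator-secrets RoleBinding.
func deletePriority(o staleObject) int {
	switch {
	case o.Kind == "RoleBinding" && o.Name == "tigera-operator-secrets":
		// Deleted last: this binding is what allows deleting the Secret.
		return 3
	case o.Kind == "Secret":
		return 0
	case o.Kind == "ServiceAccount":
		return 1
	default:
		// Other RoleBindings (and, as an assumption, any other kind).
		return 2
	}
}

// orderForDeletion returns a sorted copy and leaves the input untouched.
func orderForDeletion(objs []staleObject) []staleObject {
	out := make([]staleObject, len(objs))
	copy(out, objs)
	sort.SliceStable(out, func(i, j int) bool {
		return deletePriority(out[i]) < deletePriority(out[j])
	})
	return out
}

func main() {
	stale := []staleObject{
		{Kind: "RoleBinding", Name: "tigera-operator-secrets"},
		{Kind: "RoleBinding", Name: "waf-http-filter-gateway-resources"},
		{Kind: "ServiceAccount", Name: "waf-http-filter"},
		{Kind: "Secret", Name: "tigera-pull-secret"},
	}
	for _, o := range orderForDeletion(stale) {
		fmt.Printf("delete %s/%s\n", o.Kind, o.Name)
	}
}
```

Keeping the ordering in one pure function makes the invariant easy to unit test independently of any API calls.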

Security

GatewayNamespace mode copies tigera-pull-secret into every user namespace that owns a Gateway — permissive RBAC on those namespaces can expose the pull secret. Operator-owned shared resources in calico-system / tigera-operator are not touched by the gateway feature, protecting against accidental deletion.
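
The reserved-namespace guard mentioned above could look roughly like the following. This is a minimal sketch under assumptions: `reservedNamespaces` and `ownsSharedResources` are hypothetical names, and `team-a` is an invented user namespace; the two reserved namespace names are taken from the description.

```go
package main

import "fmt"

// reservedNamespaces lists namespaces whose tigera-operator-secrets
// RoleBinding and pull secret belong to the core Installation controller.
var reservedNamespaces = map[string]bool{
	"calico-system":   true,
	"tigera-operator": true,
}

// ownsSharedResources reports whether the gateway feature may create or
// delete the shared RoleBinding and the tigera-pull-secret copy in ns.
func ownsSharedResources(ns string) bool {
	return !reservedNamespaces[ns]
}

func main() {
	for _, ns := range []string{"calico-system", "tigera-operator", "team-a"} {
		fmt.Printf("%s: gateway feature manages shared resources = %v\n", ns, ownsSharedResources(ns))
	}
}
```

Applying the same check on both the create and the delete path is what would prevent a Gateway in a reserved namespace from triggering removal of resources another controller owns.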

Upgrade / compatibility

gatewayDeploymentMode is immutable; switching modes requires deleting and recreating the GatewayAPI resource. GatewayNamespace mode has not shipped in any release yet, so there is no in-place migration path from old state. Deprecated combined waf-http-filter ClusterRole/ClusterRoleBinding are cleaned up on every reconcile.
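
For readers unfamiliar with how such immutability is usually declared, here is a hedged sketch of an API type carrying the CEL rule as kubebuilder markers. The `self == oldSelf` rule, the `ControllerNamespace` default, and the pointer-typed field appear in this PR; the `Enum` marker, the error message text, the `GatewayDeploymentModeControllerNamespace` constant name, and the `modeUpdateAllowed` helper are assumptions added for illustration.

```go
package main

import "fmt"

// GatewayDeploymentMode selects where Envoy Gateway deploys proxy workloads.
// +kubebuilder:validation:Enum=ControllerNamespace;GatewayNamespace
type GatewayDeploymentMode string

const (
	GatewayDeploymentModeControllerNamespace GatewayDeploymentMode = "ControllerNamespace"
	GatewayDeploymentModeGatewayNamespace    GatewayDeploymentMode = "GatewayNamespace"
)

// GatewayAPISpec is a trimmed, illustrative version of the CR spec.
type GatewayAPISpec struct {
	// The transition rule below is evaluated by the API server on updates,
	// when both an old and a new value exist for the field.
	// +kubebuilder:default=ControllerNamespace
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="gatewayDeploymentMode is immutable"
	// +optional
	GatewayDeploymentMode *GatewayDeploymentMode `json:"gatewayDeploymentMode,omitempty"`
}

// modeUpdateAllowed mirrors the CEL rule in plain Go for illustration only;
// the real check runs in the API server, not in the operator.
func modeUpdateAllowed(oldMode, newMode GatewayDeploymentMode) bool {
	return oldMode == newMode
}

func main() {
	fmt.Println(modeUpdateAllowed(GatewayDeploymentModeControllerNamespace, GatewayDeploymentModeControllerNamespace))
	fmt.Println(modeUpdateAllowed(GatewayDeploymentModeControllerNamespace, GatewayDeploymentModeGatewayNamespace))
}
```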

Release Note

GatewayAPI now supports a `gatewayDeploymentMode` field that controls where
Envoy Gateway deploys managed proxy workloads:

- `ControllerNamespace` (default) keeps both the controller and proxies in
  `tigera-gateway`, matching the previous behaviour.
- `GatewayNamespace` deploys the Envoy Gateway controller into
  `calico-system` and each Gateway's proxy workloads into that Gateway's
  own namespace, letting namespace admins own their Gateways' network
  policies.

The field is immutable once set on a GatewayAPI resource.
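
The namespace mapping described in this release note can be summarised as two small functions. This is a sketch, not the operator's implementation: `controllerNamespace`, `proxyNamespace`, and the local type and constants are hypothetical, and `team-a` is an invented Gateway namespace.

```go
package main

import "fmt"

// GatewayDeploymentMode is a local stand-in for the operator API type.
type GatewayDeploymentMode string

const (
	ControllerNamespace GatewayDeploymentMode = "ControllerNamespace"
	GatewayNamespace    GatewayDeploymentMode = "GatewayNamespace"
)

// controllerNamespace returns where the Envoy Gateway controller runs.
func controllerNamespace(mode GatewayDeploymentMode) string {
	if mode == GatewayNamespace {
		return "calico-system"
	}
	return "tigera-gateway"
}

// proxyNamespace returns where the proxies for a Gateway living in
// gatewayNS are deployed.
func proxyNamespace(mode GatewayDeploymentMode, gatewayNS string) string {
	if mode == GatewayNamespace {
		return gatewayNS
	}
	return "tigera-gateway"
}

func main() {
	for _, m := range []GatewayDeploymentMode{ControllerNamespace, GatewayNamespace} {
		fmt.Printf("%s: controller=%s proxies=%s\n", m, controllerNamespace(m), proxyNamespace(m, "team-a"))
	}
}
```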

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

Member

@electricjesus electricjesus left a comment

Good work on the runtime Helm rendering migration — dropping 53K lines of pre-rendered YAML is a huge win. The GatewayNamespace mode looks solid with good test coverage. A few observations below.

Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
CurrentGatewayClasses: set.New[string](),
}

if gatewayAPI.Spec.GatewayDeploymentMode == nil {
Member

Nit: The CRD already has +kubebuilder:default=ControllerNamespace, so any persisted GatewayAPI resource will have this field populated by the API server. This runtime defaulting only matters for in-memory objects that were never persisted (tests?). Not a problem, just noting the redundancy — if the CRD default is the source of truth, a comment here explaining why you also default in code would help future readers.

Contributor Author

It is used by the tests
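
A commented version of the in-code default discussed in this thread might read as follows. It is a sketch with local stand-in types: `applyDefaults` is a hypothetical helper name, and the `ControllerNamespace` constant name is inferred from the `GatewayNamespace` one seen in the diff.

```go
package main

import "fmt"

// Local stand-ins for the operator API types.
type GatewayDeploymentMode string

const (
	GatewayDeploymentModeControllerNamespace GatewayDeploymentMode = "ControllerNamespace"
	GatewayDeploymentModeGatewayNamespace    GatewayDeploymentMode = "GatewayNamespace"
)

type GatewayAPISpec struct {
	GatewayDeploymentMode *GatewayDeploymentMode
}

// applyDefaults fills in the deployment mode when it is unset.
// The CRD default already covers every persisted object; this fallback only
// matters for objects built in memory (for example in tests) that never
// passed through the API server.
func applyDefaults(spec *GatewayAPISpec) {
	if spec.GatewayDeploymentMode == nil {
		mode := GatewayDeploymentModeControllerNamespace
		spec.GatewayDeploymentMode = &mode
	}
}

func main() {
	spec := GatewayAPISpec{}
	applyDefaults(&spec)
	fmt.Println(*spec.GatewayDeploymentMode)
}
```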

Contributor Author

radixo commented Apr 15, 2026

> Good work on the runtime Helm rendering migration — dropping 53K lines of pre-rendered YAML is a huge win. The GatewayNamespace mode looks solid with good test coverage. A few observations below.

All good catches man, all sorted

Comment thread pkg/render/gatewayapi/gateway_api.go Outdated
Member

@electricjesus electricjesus left a comment

LGTM!

// Gateway resources using operator-managed GatewayClasses. These namespaces need
// per-namespace Enterprise resources (SA, CRB, pull secrets).
if *gatewayAPI.Spec.GatewayDeploymentMode == operatorv1.GatewayDeploymentModeGatewayNamespace &&
variant.IsEnterprise() {
Member

Why does the Variant matter here at all?

radixo and others added 2 commits April 23, 2026 22:09
- Swap the checked-in gateway_api_resources.yaml for the embedded gateway-helm.tgz rendered via the helm SDK at startup; K8SGatewayAPICRDs/GatewayAPICRDs now take a runtime.Scheme and return an error (istio_controller updated for the new signature)
- Deploy two envoy-gateway controllers: legacy in tigera-gateway (user-declared classes via Spec.GatewayClasses) and a new one in calico-system with deploy.type=GatewayNamespace; auto-provision the tigera-gateway-class-ns GatewayClass bound to the new controller
- Group the tigera-gateway install behind legacyObjects/legacyTeardownObjects so the eventual deprecation is a single delete
- HasLegacyGateways classifier in the controller: build a className -> controllerName map seeded from Spec.GatewayClasses + existing GatewayClass resources, classify every live Gateway; when no Gateway targets the tigera-gateway controller, the install is torn down; during the teardown-then-redeploy race the legacy render is deferred to avoid a "Namespace is terminating, skipping creation" log flood
- Legacy teardown queues only the Namespace + cluster-scoped objects + the Deployment (for status.RemoveDeployments); in-namespace RBAC/Secrets ride the cascade to avoid the tigera-operator-secrets RoleBinding race
- Move the shared waf-http-filter ClusterRoles out of the legacy bundle so the calico-system-side proxies keep their cluster-scoped perms after tigera-gateway is retired
- Per-namespace Enterprise resources (SA, RoleBindings, pull secret, shared CRB subject) for namespaces hosting a namespaced-class Gateway; reserved namespaces skip shared resource create/delete; Secret goes before RoleBinding on cleanup to avoid 403
- Gate v3 NetworkPolicies on the calico-system Tier; render calico-system.envoy-gateway allow for the controller and certgen
- Update unit tests and Makefile/docs accordingly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Cover the calico-system envoy-gateway controller lifecycle, per-namespace resource provisioning and cleanup, custom EnvoyProxy and EnvoyGateway ConfigMap watches, owning-gateway env vars in l7-log-collector, and the legacy-class teardown path
- Teardown sequencing for tigera-gateway cascading

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
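
The HasLegacyGateways classifier from the first commit message can be pictured as a map lookup over live Gateways. The sketch below is an assumption-heavy illustration: the `gatewayClass` and `gateway` structs, `buildClassIndex`, `hasLegacyGateways`, and both controller name strings are hypothetical, and letting existing GatewayClass resources override the declared seed is a guess at the precedence.

```go
package main

import "fmt"

// Minimal stand-ins for the Gateway API objects the controller inspects.
type gatewayClass struct {
	Name           string
	ControllerName string
}

type gateway struct {
	Namespace string
	Name      string
	ClassName string
}

// buildClassIndex maps className -> controllerName. Declared classes (from
// Spec.GatewayClasses) are seeded as legacy; existing GatewayClass resources
// then record their actual controller (assumed to take precedence).
func buildClassIndex(declared []string, existing []gatewayClass, legacyController string) map[string]string {
	idx := make(map[string]string, len(declared)+len(existing))
	for _, name := range declared {
		idx[name] = legacyController
	}
	for _, gc := range existing {
		idx[gc.Name] = gc.ControllerName
	}
	return idx
}

// hasLegacyGateways reports whether any live Gateway targets the legacy
// tigera-gateway controller; if none does, that install can be torn down.
func hasLegacyGateways(idx map[string]string, gateways []gateway, legacyController string) bool {
	for _, gw := range gateways {
		if c, ok := idx[gw.ClassName]; ok && c == legacyController {
			return true
		}
	}
	return false
}

func main() {
	const legacy = "example.invalid/legacy-controller"   // hypothetical name
	const namespaced = "example.invalid/ns-controller"   // hypothetical name
	idx := buildClassIndex(
		[]string{"tigera-gateway-class"},
		[]gatewayClass{{Name: "tigera-gateway-class-ns", ControllerName: namespaced}},
		legacy,
	)
	gws := []gateway{{Namespace: "team-a", Name: "web", ClassName: "tigera-gateway-class-ns"}}
	fmt.Println("legacy install still needed:", hasLegacyGateways(idx, gws, legacy))
}
```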
@radixo radixo force-pushed the gatewayapi-deployment-enterprise branch from 04d49c6 to 8c6f64c Compare April 23, 2026 21:22