Skip to content

feat(webhook): proxy-aware error handling for forward proxies#899

Merged
alexluong merged 22 commits into
mainfrom
feat/webhook-envoy-proxy
May 13, 2026
Merged

feat(webhook): proxy-aware error handling for forward proxies#899
alexluong merged 22 commits into
mainfrom
feat/webhook-envoy-proxy

Conversation

@alexluong
Copy link
Copy Markdown
Collaborator

@alexluong alexluong commented May 12, 2026

Summary

Webhook deliveries through a forward proxy (DESTINATIONS_WEBHOOK_PROXY_URL) now distinguish proxy infrastructure failures from destination failures, and translate proxy-reported destination failures into the right destination-attributed error code. No new config; Envoy-specific signals are auto-detected from response headers when present. Implementation is scoped to destwebhook/; destregistry/ exposes only a generic WrapTransport hook.

Full feature doc: docs/content/features/webhook-proxy.mdoc.

Why

Two equally-load-bearing problems today:

  • Misattributed retry budget. A proxy outage records as a destination failure, burning the destination's retry budget and DLQ slot for a problem it can't fix.
  • Wrong error surface to customers. Upstream errors that the proxy reports (DNS, connection refused, timeout) reach the delivery record as proxy-flavored generic errors instead of the underlying destination problem, and server: envoy / x-envoy-* headers leak into the response stored against the destination.

Behavior

  • Proxy auth failure or dial-to-proxy failure → nack. The queue redelivers; no customer-visible attempt is recorded.
  • Any other proxy-reported failure (CONNECT 5xx, Envoy-synthesized 5xx on plain-HTTP) → destination attempt with a refined code (UFconnection_refused, DCdns_error, UTtimeout, etc.; unrecognized → network_error).
  • Plain-HTTP success pathx-envoy-* and server: envoy headers are stripped before storage. Envoy-synthesized 5xx bodies are dropped.
  • HTTPS path — byte-transparent. TLS is end-to-end after CONNECT, so the wrapper never inspects or modifies the response. Destinations behind their own Envoy keep their real status, body, and headers.

Nuances

  • Spoof-resistance on the plain-HTTP path requires the operator's Envoy to emit x-envoy-response-flags with OVERWRITE_IF_EXISTS_OR_ADD (must live on RouteConfiguration, Envoy rejects it on HCM). Without this, a destination can fake the flag header and force its real body to be dropped.
  • Plain-HTTP destinations behind their own Envoy lose some observability headers (we strip them). Attribution stays correct because the forward Envoy overwrites the flag. HTTPS destinations are unaffected.
  • Generic (non-Envoy) proxies get the nack-on-auth-failure / nack-on-dial-failure paths. The Envoy header logic is a no-op when the headers aren't present.

Pub/Sub redelivery runway

Outpost's default Pub/Sub provisioning gives roughly 5 minutes of redelivery before DLQ (see the docs page for the exact knobs). A proxy outage under that window is transparent to destinations; longer ones land in the DLQ and need manual replay.

Test plan

  • Wrapper unit tests against fake proxy: CONNECT 407 → nack; CONNECT 5xx → destination record; synthesized response with each Envoy flag → correct mapped code; real upstream pass-through; HTTPS response with envoy headers → pass-through untouched
  • ExecuteHTTPRequest returns Delivery: nil only on the infra-error sentinel
  • Pin test on Go's proxyconnect error wording (guards stdlib regression)
  • MapEnvoyResponseFlag table test
  • Manual smoke test through real Envoy in QA: HTTPS dest, HTTP dest, proxy auth failure, proxy unreachable, destination DNS failure (verify recorded as destination), destination 500 (verify pass-through), destination behind Envoy returning 503 (verify customer sees the 503, not a synthesized error)

🤖 Generated with Claude Code

alexluong and others added 2 commits May 12, 2026 18:58
Adds a Features page covering the existing DESTINATIONS_WEBHOOK_PROXY_URL
configuration and introduces DESTINATIONS_WEBHOOK_PROXY_TYPE=envoy as an
opt-in for Envoy-aware error handling: infra-error detection via
x-envoy-response-flags and the proxyconnect error path, transparent
redelivery via nack, and x-envoy-*/server header sanitization. Includes
required Envoy configuration, a minimal reference envoy.yaml, and queue
retry policy guidance for sufficient redelivery runway during proxy
outages.

Default behavior for existing users is unchanged.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds an explicit error-attribution table. Nack is reserved for cases
where the proxy itself is the proximate cause (auth failure, proxy
unreachable). Destination failures that Envoy merely reports (upstream
DNS / connect / timeout) are recorded as normal failed delivery attempts
with destination-attributed codes; the wrapper sanitizes the response
data so Envoy is not exposed, but the customer still sees a destination
failure.

This avoids the infinite-nack failure mode for genuinely broken
destinations.

Co-Authored-By: Claude <noreply@anthropic.com>
@alexbouchardd
Copy link
Copy Markdown
Contributor

Is DESTINATIONS_WEBHOOK_PROXY_TYPE=envoy necessary? Can't we infer that envoy is used by the response headers in case of an error?

"DLQ runway" replaced with an explicit statement of what the number
means: the maximum time a single nacked message is retried before
landing in the dead-letter queue.

Co-Authored-By: Claude <noreply@anthropic.com>
@alexluong
Copy link
Copy Markdown
Collaborator Author

Is DESTINATIONS_WEBHOOK_PROXY_TYPE=envoy necessary? Can't we infer that envoy is used by the response headers in case of an error?

hmm good idea, yeah we can do that I think

Removes DESTINATIONS_WEBHOOK_PROXY_TYPE. Envoy-specific signals
(x-envoy-response-flags, x-envoy-* / server: envoy headers) are detected
automatically from response headers when present; nothing to configure.

The proxy-infra nack path (proxyconnect 407, dial to proxy fails) is
generic and applies to any forward proxy, since the discriminator is
"the proxy is broken or misconfigured," not "the proxy is Envoy."

Co-Authored-By: Claude <noreply@anthropic.com>
@alexluong alexluong changed the title feat(webhook): Envoy forward proxy support with transparent infra-error handling feat(webhook): proxy-aware error handling for forward proxies May 12, 2026
alexluong and others added 15 commits May 12, 2026 19:11
Reorganizes the error handling section so generic-proxy behavior and
Envoy-specific signals are in separate tables. Adding support for other
proxy implementations (Squid, HAProxy, nginx) in the future would
extend the Envoy support pattern.

Co-Authored-By: Claude <noreply@anthropic.com>
Two clarifications:

1. Response sanitization is best-effort and currently complete only for
   Envoy. For arbitrary proxies, Outpost can rewrite error messages but
   cannot reliably strip proxy-identifying content from response bodies
   or headers.

2. Production proxy auth + TLS recommendation now depends on topology:
   not required if Outpost and the proxy share a private network;
   strongly recommended if the proxy is reachable over the internet to
   prevent open-relay abuse.

Co-Authored-By: Claude <noreply@anthropic.com>
Trim the production-deployment guidance to a single sentence on the
VPC-vs-internet trade-off, instead of a subsection with specific
configuration suggestions.

Co-Authored-By: Claude <noreply@anthropic.com>
Defines ErrProxyInfra (signals nack-on-infra-error) and
ErrProxyDestination (signals record-as-destination-error with mapped
code) along with MapEnvoyResponseFlag for translating Envoy response
flags into existing destination error codes (connection_refused,
timeout, dns_error, network_unreachable).

No consumers yet; subsequent commits add the transport wrapper and
integration into the webhook delivery path.

Co-Authored-By: Claude <noreply@anthropic.com>
Wraps the proxy transport when a proxy URL is configured. Detects two
classes of proxy-originated failures:

  - Proxy infrastructure (407 / 401 / 403 on CONNECT, dial-to-proxy
    failure): wrapped as ErrProxyInfra. ExecuteHTTPRequest returns
    Delivery: nil so the message handler nacks via the existing
    nil-attempt path (registry.go:179-195), preserving the destination's
    retry budget across proxy outages.

  - Destination errors that the proxy reports (5xx on CONNECT, etc.):
    wrapped as ErrProxyDestination with a classification code mapped
    from the response. Falls through to the standard failed-attempt path
    with the explicit code instead of substring-matching the underlying
    error.

CONNECT-time detection uses http.Transport.OnProxyConnectResponse which
exposes the full response (status + headers) before Go discards it,
sidestepping the fragility of parsing internal error strings. The
wrapper still detects dial-to-proxy failures via the "proxyconnect"
prefix that Go's net.OpError keeps for that case.

Co-Authored-By: Claude <noreply@anthropic.com>
Extends proxyTransport with Envoy-specific signals layered on top of the
generic proxy handling:

  - Plain-HTTP forwarding path: responses carrying a non-empty/non-"-"
    x-envoy-response-flags header are treated as Envoy-synthesized
    failures, converted to ErrProxyDestination with the flag mapped to a
    destination error code, and their bodies dropped so they don't reach
    the delivery record.

  - CONNECT path: onProxyConnectResponse reads the same flag header on
    non-200 responses to refine the generic "connection_refused"
    classification (e.g. flag DC -> dns_error, UT -> timeout).

  - Successful responses are stripped of x-envoy-* headers and a
    "server: envoy" header before being returned, so the proxy is not
    fingerprinted in the destination's recorded response.

Co-Authored-By: Claude <noreply@anthropic.com>
Adds a snapshot test asserting that net/http still emits "proxyconnect"
in dial-to-proxy failure errors. The proxyTransport detector relies on
this internal stdlib convention; if a Go upgrade changes the wording,
this test fails in CI rather than letting the detector silently stop
matching and causing proxy outages to be misclassified as destination
errors.

Co-Authored-By: Claude <noreply@anthropic.com>
Two small adjustments after the code landed:

  - Drop the implementation-specific "Outpost sees" column from the
    generic table; describe observed behavior instead. Proxy auth-style
    failures cover 407, 401, and 403, not just 407.

  - Frame the Envoy table around the response-flag signal directly,
    rather than listing every flag-to-code combination (those are
    documented by the MapEnvoyResponseFlag mapping).

Co-Authored-By: Claude <noreply@anthropic.com>
Envoy rejects response_headers_to_add at the HttpConnectionManager
level (confirmed on Envoy 1.31). The field belongs on
RouteConfiguration. Updates both the standalone snippet and the
reference envoy.yaml.

Co-Authored-By: Claude <noreply@anthropic.com>
Destinations commonly sit behind their own Envoy at the edge, which
emits server: envoy and x-envoy-response-flags on the response. For
HTTPS destinations, that response travels through an established CONNECT
tunnel — the forward proxy is byte-blind to TLS-encrypted bytes and
cannot have synthesized or modified anything. Previously the wrapper
treated those headers as if our forward proxy had set them, which:

  - converted a real destination 503 from a downed app into a misleading
    "connection_refused" with no body
  - stripped the destination's own envoy observability headers

After this change, HTTPS responses (post-successful-CONNECT) are
returned unchanged. Proxy-originated HTTPS failures all happen at
CONNECT time and remain handled in onProxyConnectResponse.

The plain-HTTP forwarding path still applies envoy detection and header
stripping because the forward proxy is in the byte path on the return.
The residual case where this over-strips — a plain-HTTP destination
that is itself behind Envoy — is documented as a limitation;
attribution remains correct because x-envoy-response-flags is
overwritten by the forward Envoy via OVERWRITE_IF_EXISTS_OR_ADD.

Top-of-file doc comment on proxyTransport now describes the two
surfaces (CONNECT-time vs plain-HTTP-forwarding) so the dual-path
design is discoverable.

Co-Authored-By: Claude <noreply@anthropic.com>
- httphelper: skip ClassifyNetworkError when ErrProxyDestination sentinel matches
- proxytransport: drop strings.ToLower; http.Header keys are canonicalized
- proxytransport: drop impossible nil-guard on req/req.URL in RoundTrip
- proxytransport: note unhandled Envoy flags fall through to network_error
- proxytransport: cross-reference pin test from proxyconnect detector
- proxytransport_test: drop unreachable empty-flag row from map test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MakeHTTPClient never touched BaseProvider state — it was a free helper
pretending to be a method. Move it to httpclient.go and update the three
HTTP-based providers (desthookdeck, destwebhook, destwebhookstandard) to
call destregistry.NewHTTPClient directly. BaseProvider keeps only the
metadata/validation concerns it actually owns.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move proxytransport.go, the Envoy flag mapping, and all sentinels
(ErrProxyInfra, ErrProxyDestination, IsProxyInfraError) into the
destwebhook package — the only place they're actually used. destregistry
keeps a generic WrapTransport hook on HTTPClientConfig; destwebhook
exports a WrapTransport function and the two webhook providers
(destwebhook, destwebhookstandard) plug it in when constructing their
HTTP client.

destregistry/ no longer has any proxy-specific code. The wrapper, the
Envoy flag table, the CONNECT-response callback, and all four test files
move with the sentinels.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the "set 5s/600s/10 attempts for ~30min runway" recommendation
with the actual behavior of the default provisioning in
internal/mqinfra/gcppubsub.go (10s/120s/6 attempts ~ 5min) and note that
operators expecting longer proxy outages should tune
MinRetryBackoff/MaxRetryBackoff/RetryLimit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexluong alexluong marked this pull request as ready for review May 12, 2026 18:24
Copy link
Copy Markdown
Contributor

@alexbouchardd alexbouchardd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would get @mkherlakian review here as well since he knows envoy's quirks well

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it common practice in Go to put tests in the same directory as the code it's testing?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkherlakian yes, especially for unit test it's sort of the only option. Putting it in a different directory would be an anti-pattern.

Also keep in mind Golang enforce an implicit package boundary by directory, so there's that.

alexluong and others added 3 commits May 13, 2026 19:57
DC (Downstream Connection Termination) is the wrong flag — it means the
client closed the connection, not DNS resolution. The actual flag emitted
by Envoy's dynamic_forward_proxy when a host fails to resolve is DF (DNS
Failure). Verified against a local Envoy reference proxy
(build/dev/envoy/envoy.yaml): nonexistent.invalid → DF → dns_error.

Also adds a local-dev Envoy service (build/dev/compose.yml + envoy.yaml)
matching the documented config — `%RESPONSE_FLAGS%` on RouteConfiguration
with OVERWRITE_IF_EXISTS_OR_ADD — so the QA suite at qa/suites/outpost/
webhook-proxy.md can be run against a real proxy locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address review feedback that the recorded error code is too coarse and
gives operators nothing to diagnose with.

Mapping: align with ClassifyNetworkError vocabulary so customers see the
same codes whether or not a proxy is in path.
  - UC, UR, LR → connection_reset (split out of connection_refused)
  - UPE, DPE → protocol_error (new bucket; previously network_error)
  - UMSDR → timeout
  - network_error stays as the catch-all; operator-visible signal that
    a new flag has shown up and the table should be expanded.

Diagnostics: capture raw x-envoy-response-flags and the new
x-envoy-response-code-details (stage{reason}) on ErrProxyDestination
and attach them to the publish-attempt error payload as proxy_flag /
proxy_details. Logged operator-side; never written to the customer-
visible attempt response_data — same model EG uses.

Ref Envoy config emits both headers with OVERWRITE_IF_EXISTS_OR_ADD so
destinations can't spoof them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace Flag/Details fields with a free-form Diagnostics map[string]string.
The previous shape baked Envoy-specific terminology into a struct that
classifies errors from any forward proxy — even though only Envoy
populates the fields today.

Generic map keeps the contract honest: whoever populates a key owns its
naming (envoy_flag, envoy_details for Envoy; future Squid/HAProxy paths
would add their own). Error() sorts keys for deterministic log output.
httphelper splats the map into the publish-attempt Data payload.

Behavior unchanged — same log strings, same customer-invisible surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexluong alexluong merged commit 121bd95 into main May 13, 2026
2 checks passed
@alexluong alexluong deleted the feat/webhook-envoy-proxy branch May 13, 2026 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants