Summary
The AWS ALB default idle timeout is 60 seconds, and the Mattermost server sends WebSocket ping frames every 60 seconds. Because the two intervals are equal, they race: the ALB can close an idle WebSocket connection just as the next ping is due, and the client then sees a ~10s TCP timeout on its next write.
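To make the race concrete, the sketch below computes the margin between a proxy's idle timeout and the ping interval for the three configurations mentioned in this ticket. The function name `idle_margin` is hypothetical and exists only for illustration; a margin of 0 means keepalive traffic is not guaranteed to arrive before the proxy considers the connection idle.

```shell
#!/bin/sh
# Margin (in seconds) between a proxy's idle timeout and the WebSocket
# ping interval. A margin of 0 or less means the proxy may close the
# connection before, or just as, the next ping arrives.
idle_margin() {
  idle_timeout=$1
  ping_interval=$2
  echo $(( idle_timeout - ping_interval ))
}

idle_margin 60 60    # ALB default idle timeout
idle_margin 90 60    # nginx proxy_read_timeout used in the lab test
idle_margin 300 60   # recommended ALB idle timeout
```

With the default ALB setting the margin is zero, while the nginx and recommended ALB settings leave 30 and 240 seconds of headroom respectively.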
Evidence
A lab load test with the default 60s ALB idle timeout showed a P99 latency of 8.9s and a 61% timeout error rate, which prevented the coordinator from scaling past ~600 users. The same workload through nginx (proxy_read_timeout 90s) ran to 1,500 users without issue. The Mattermost WebSocket client source confirms that the server-side ping interval is exactly 60s.
Recommendation
Add a configuration note to the AWS deployment / load balancer documentation recommending that the ALB idle timeout be set to 300 seconds (5 minutes):
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <alb-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
Or via AWS Console: EC2 → Load Balancers → select ALB → Attributes → Edit → Idle timeout → 300.
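After applying the change by either route, the new value can be read back with the AWS CLI. The following is a minimal sketch, not part of the original ticket; the wrapper name `alb_idle_timeout` is hypothetical, and it assumes the AWS CLI is installed and has credentials that allow it to describe the load balancer.

```shell
# Print the current idle timeout (in seconds) of the ALB whose ARN is
# passed as the first argument.
alb_idle_timeout() {
  aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$1" \
    --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
    --output text
}

# Usage (replace <alb-arn> with the real ARN):
#   alb_idle_timeout "<alb-arn>"
```

If the change has taken effect, the command should print 300.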
Additional Investigation Note
Pull CloudWatch metrics during the next high-load event (e.g. the 5am EST spike):
ActiveConnectionCount
ConsumedLCUs
RejectedConnectionCount
TargetResponseTime
If RejectedConnectionCount > 0 or ConsumedLCUs is near limits, an NLB may be a better long-term fit for this WebSocket-heavy workload.
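One possible way to check the RejectedConnectionCount condition from the command line is sketched below. The wrapper name `alb_rejected_connections` is hypothetical, and the load balancer dimension value and time window are placeholders that must be supplied by the caller. The sketch assumes the AWS CLI has permission to read CloudWatch metrics.

```shell
# Total rejected connections for one ALB over a time window.
#   $1 = LoadBalancer dimension value (the "app/<name>/<id>" part of the ARN)
#   $2 = start time, $3 = end time (ISO 8601, UTC)
alb_rejected_connections() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name RejectedConnectionCount \
    --dimensions "Name=LoadBalancer,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 60 \
    --statistics Sum \
    --query 'sum(Datapoints[].Sum)' \
    --output text
}

# Usage (all three arguments are placeholders):
#   alb_rejected_connections "app/<name>/<id>" "<start-utc>" "<end-utc>"
```

The same call shape should work for the other metrics listed above by changing `--metric-name`, though for TargetResponseTime a statistic such as Average is likely more meaningful than Sum.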
Workstream
Implementation & Onboarding — validated during Midmarket HA Postgres load testing.