Hi,

After having run DUNE for several days, I tried to exit it with Ctrl-C. The process took forever to exit; something was hanging. GDB showed that it was Transports::HTTP. With excellent help from claude, I was able to diagnose and reproduce the issue (on both armv7 and x86_64) by sending an incomplete HTTP request to DUNE (( printf 'GET / HTTP/1.0\r\n'; sleep 999999 ) | nc 127.0.0.1 8080). Then, when trying to quit, DUNE hangs until the HTTP request is canceled (quit nc). This was tested with our version of DUNE, but I could not find any upstream fixes that address this issue.

Below is the full report from claude, with some alternative fixes. Feel free to use or discard, as you see fit 😄
(gdb) thread apply all bt
Thread 4 (Thread 0xb5a473e0 (LWP 17825) "Transports.HTTP"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb6746cc8 in recv () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb69b19ea in __interceptor_recv (fd=<optimized out>, buf=0xb5a46300, len=1, flags=<optimized out>)
at ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:6763
#3 0x01e0fc60 in DUNE::Network::TCPSocket::doRead(unsigned char*, unsigned int) ()
#4 0x01977cf0 in Transports::HTTP::RequestHandler::handleRequest(DUNE::Network::TCPSocket*) ()
#5 0x01981988 in Transports::HTTP::Handler::run() ()
#6 0x00ad1d2e in dune_concurrency_thread_entry_point ()
#7 0xb66ede62 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Thread 3 (Thread 0xb5aff3e0 (LWP 17821) "Transports.HTTP"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb6746cc8 in recv () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb69b19ea in __interceptor_recv (fd=<optimized out>, buf=0xb5afe300, len=1, flags=<optimized out>)
at ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:6763
#3 0x01e0fc60 in DUNE::Network::TCPSocket::doRead(unsigned char*, unsigned int) ()
#4 0x01977cf0 in Transports::HTTP::RequestHandler::handleRequest(DUNE::Network::TCPSocket*) ()
#5 0x01981988 in Transports::HTTP::Handler::run() ()
#6 0x00ad1d2e in dune_concurrency_thread_entry_point ()
#7 0xb66ede62 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Thread 2 (Thread 0xb5cf73e0 (LWP 17816) "Transports.HTTP"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb66eb502 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb66eb5e0 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#3 0xb66ef40c in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Thread 1 (Thread 0xb668f4e0 (LWP 17810) "dune"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb66eb502 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb66eb5e0 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#3 0xb66ef40c in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
All DUNE tasks have shut down except the main thread and the HTTP handler pool. Three of the four remaining threads are sitting in recv() inside DUNE::Network::TCPSocket::doRead, called from Transports::HTTP::RequestHandler::handleRequest.
Threads 1 and 2 have unresolved symbols, but their top frame (0xb66af654) matches the top frame of threads 3 and 4, so the main thread is almost certainly parked in the destructor chain, waiting on stopAndJoin() for the HTTP pool, which in turn is waiting on the threads stuck in recv().
Summary
When DUNE is shut down (SIGINT / Ctrl-C), the process can hang indefinitely if any
Transports.HTTP handler thread is currently blocked in a read on an accepted client
socket. The Handler::run() loop only checks isStopping() between requests, and the
per-request read loop in RequestHandler::handleRequest() uses a blocking
sock->read(..., 1) with no timeout and no external wake-up mechanism. If a client
keeps a TCP connection open but does not send the bytes needed to finish the request
line / headers (\r\n\r\n), the handler thread cannot make progress and never checks
the stop flag; Server::~Server() then blocks in stopAndJoin() waiting for a thread
that can no longer exit.
In practice the probability of hitting this grows with uptime and with the number of
HTTP clients that have ever connected (Neptus instances, browsers left open, scripts,
health probes, etc.). After several days of continuous operation it becomes a reliable
shutdown hang in our setup.
Environment
- Platform: Linux, armv7l (ARM hard-float, arm-linux-gnueabihf) — embedded AutoNaut USV.
- DUNE branch: a downstream fork rebased on lsts/master; the affected source files (src/Transports/HTTP/Server.cpp, src/Transports/HTTP/RequestHandler.cpp) are byte-identical to lsts/master at the time of reporting (verified with git diff lsts/master -- src/Transports/HTTP/... → empty).
- A sweep of every branch on lsts (643 remote refs) was performed: git log --all --oneline -- ':(top)src/Transports/HTTP/Server.cpp' ':(top)src/Transports/HTTP/RequestHandler.cpp' returns only copyright-year updates and one unrelated 2014 commit (8b96faaa0 — "Removed err() in unharmful exception"). No branch contains a functional change to these files related to shutdown, hanging, timeouts, or blocking reads.
- git log --all --oneline --grep='closeSocket' finds four commits (all by mariacosta, 2025-11-10) that add the closeSocket() method. These appear in ~27 feature branches including lsts/master, but a sweep of every ref with git cat-file -p <ref>:src/Transports/HTTP/Server.cpp | grep closeSocket shows that no branch calls closeSocket() from the HTTP server — the primitive is orphaned.
- Task configured with Threads = 5 (default) and a long uptime (several days) with at least one HTTP client connected during that period.
Root cause
1. Handler::run() only checks isStopping() between requests
src/Transports/HTTP/Server.cpp (current lsts/master):
void
run(void)
{
while (!isStopping()) // checked only here
{
if (!m_queue.waitForItems(1.0))
continue;
if (m_queue.closed())
break;
TCPSocket* sock = m_queue.pop();
if (!sock)
continue;
try
{
m_handler.handleRequest(sock); // <-- can block indefinitely
}
catch (...)
{ }
delete sock;
}
}
Once handleRequest() is called, the thread will only observe isStopping() if/when
that call returns — there is no periodic check, no cancellation, and no way for the
thread to notice that shutdown has been requested.
2. RequestHandler::handleRequest() uses a blocking byte-by-byte recv()
src/Transports/HTTP/RequestHandler.cpp (current lsts/master, the header-read loop
around line 250):
unsigned idx = 0;
unsigned didx = 0;
bool eor = false;
while (!eor && (idx < (c_max_request_size - 1)))
{
int rv = sock->read(bfr + idx, 1); // blocking recv, no timeout
if (rv <= 0)
throw ConnectionClosed();
// ... look for \r\n\r\n to mark end of request ...
}
TCPSocket::doRead ultimately calls recv(2) with no timeout and no poll, so the
thread will sleep in the kernel until either a byte arrives, the peer closes, or the
socket is forcibly torn down by another thread. Any client that has opened the
connection but not (yet) sent all of the request-line and headers will keep this read
blocked indefinitely. Long-lived connections from interactive tools (Neptus, browsers)
or half-open connections left behind by client crashes / network outages are enough to
produce this.
3. Server::~Server() joins those threads unconditionally
Server::~Server(void)
{
m_queue.close();
for (unsigned i = 0; i < m_pool.size(); ++i)
{
try
{
m_pool[i]->stopAndJoin(); // waits forever if handler is in recv()
}
catch (...)
{ }
delete m_pool[i];
}
// ...
}
stopAndJoin() only sets the stop flag and pthread_join()s the thread. Since the
thread is wedged in recv() with no external wake path, the join never returns and
the entire DUNE process hangs. The stop flag never gets a chance to be observed.
4. TCPSocket has no exposed "abort a blocked read" operation that the server uses
Until recently there was no public method on DUNE::Network::TCPSocket to forcibly
close an accepted client socket from a thread other than the one reading it.
closeSocket() was added by commit 12dc93f99f404c550d5b25eb420ec8c496d9df90
("DUNE/Network/TCPSocket: Added closeSocket() method", 2025-11-10):
void TCPSocket::closeSocket()
{
#if defined(DUNE_OS_POSIX)
shutdown(m_handle, SHUT_RDWR);
close(m_handle);
#elif defined(DUNE_OS_WINDOWS)
closesocket(m_handle);
#endif
}
However, closeSocket() has no callers on lsts/master or on any other branch
in the lsts remote (verified by scanning every remote ref with
git cat-file -p <ref>:src/Transports/HTTP/Server.cpp). The primitive that would
unblock the wedged recv() is present but is not yet wired into the HTTP server
shutdown path on any upstream branch.
Why the hang gets worse with uptime
Every accepted connection produces a TCPSocket* that ends up on the handler queue;
each connection where the peer does not complete its request-line+headers puts one
handler thread into a permanent recv() sleep. The server has a fixed thread pool
(configurable via Threads, default 5), so after enough such connections accumulate
the pool is exhausted and even new, well-behaved clients do not get serviced. At
shutdown, every wedged thread is one guaranteed indefinite wait in stopAndJoin().
Reproduction
1. Configure Transports.HTTP normally and start DUNE.
2. From another machine, open a TCP connection to the HTTP port and send less than a complete request header (do not send the final \r\n\r\n). For example:
( printf 'GET / HTTP/1.0\r\n'; sleep 999999 ) | nc <dune-ip> 8080
Any client that holds a connection open without finishing the request works — a paused browser tab, a crashed Neptus, or a network partition between client and DUNE all produce the same condition.
3. Wait for DUNE to accept the connection (it will, and a handler thread will enter handleRequest → blocking sock->read(..., 1)).
4. Send SIGINT to DUNE (Ctrl-C or kill -INT).
5. Observe that DUNE does not exit. gdb -p <pid> shows the handler thread(s) in recv() inside TCPSocket::doRead, and the main thread in stopAndJoin().
Proposed fixes
Three independent options, in order of preference. All of them are standalone; the
choice is a trade-off between invasiveness and how well the request-handling code is
preserved.
Option A (recommended): shut down in-flight sockets in Server::~Server()
Track every accepted socket in the Server (e.g. a std::set<TCPSocket*> guarded by
a Mutex). poll() inserts into the set on accept; Handler::run() removes from
the set right before delete sock. In Server::~Server(), after closing the queue,
iterate the set under the mutex and call TCPSocket::closeSocket() on every entry
before calling stopAndJoin() on the handler pool. The forced shutdown(SHUT_RDWR)
makes the blocked recv() return 0, which causes RequestHandler::handleRequest()
to throw ConnectionClosed, which lets Handler::run() observe isStopping() and
exit cleanly.
Pros:
- Smallest change to correct behavior.
- Uses the closeSocket() primitive already in upstream (12dc93f99), no new API.
- No change to request parsing, no risk to HTTP correctness.
- Works for every client-side cause (slow, silent, crashed, partitioned).
Cons:
- Adds a small amount of bookkeeping (set + mutex) in Server.
- Race between Handler::run() deleting a socket and the destructor iterating the set needs a straightforward lock + "remove before delete" rule.
Sketch:
class Server
{
// ...existing members...
Concurrency::Mutex m_live_mtx;
std::set<TCPSocket*> m_live_socks;
void
registerSocket(TCPSocket* s)
{
ScopedMutex l(m_live_mtx);
m_live_socks.insert(s);
}
void
unregisterSocket(TCPSocket* s)
{
ScopedMutex l(m_live_mtx);
m_live_socks.erase(s);
}
};
// In Server::poll(...):
TCPSocket* nc = m_sock.accept();
registerSocket(nc);
m_queue.push(nc);
// In Handler::run(), right before delete sock:
m_server.unregisterSocket(sock);
delete sock;
// In Server::~Server(), before the stopAndJoin() loop:
{
ScopedMutex l(m_live_mtx);
for (TCPSocket* s : m_live_socks)
s->closeSocket(); // unblocks any recv() on that socket
}
// ...then stopAndJoin as before.
(Ownership note: the Handler can keep deleting the TCPSocket* as today; the destructor only needs to unblock the read, not free the memory.)
Option B: add a receive timeout to the request-read loop
In RequestHandler::handleRequest(), either call Poll::poll(sock, T) with a short
timeout (say 500 ms) before each sock->read(bfr+idx, 1), or set SO_RCVTIMEO on the
socket at entry. When the poll times out / recv returns EAGAIN, check a
stop-requested flag (either isStopping() via a back-pointer to the Thread, or a
new flag on the Server / RequestHandler). If stopping, close the socket and return.
Pros:
- Self-contained within handleRequest; no socket-lifetime bookkeeping in Server.
- No dependency on closeSocket() from 12dc93f99.
Cons:
- Wakes up periodically while idle connections are held open (minor CPU cost, grows linearly with the number of stalled connections).
- Changes the hot path of request parsing; higher chance of regressing a corner case.
- Also requires covering reads inside handleGET / handlePOST / handlePUT body paths, or at least any that can block on the peer, otherwise the same class of hang is possible mid-request.
Option C: drain-and-bypass in Server::~Server()
Instead of joining the handler threads, mark the server as shutting down, drop the
queue, optionally detach the threads, and let the process exit without waiting for
them. This is fast but leaks threads and their accepted sockets; only viable if the
Server is definitely the last thing to go before the whole process exits.
Pros:
- Minimal code; shutdown returns immediately instead of waiting on wedged threads.
Cons:
- Leaks resources and is hostile to unit tests / re-entrant initialisation.
- Does not address the "thread is wedged while DUNE is still supposed to be running"
case (e.g. a stuck handler reduces the usable pool from 5 to 4 until restart).
- Generally not what you want in a long-lived daemon.
Recommendation
Adopt Option A on upstream. It is the smallest correctness fix, it leverages the
closeSocket() primitive that is already in lsts/master (commit 12dc93f99) without
needing to change request parsing, and it generalises across every client-side cause of
the stall (slow client, silent client, crashed client, network partition). Option B can
be added later as defence in depth if desired, but is not necessary to resolve the
shutdown hang.
References
- src/Transports/HTTP/Server.cpp — Handler::run() (per-request loop) and Server::~Server() (unconditional stopAndJoin()).
- src/Transports/HTTP/RequestHandler.cpp — handleRequest() blocking byte-by-byte header read.
- src/DUNE/Network/TCPSocket.cpp / .hpp — closeSocket() added by commit 12dc93f99f404c550d5b25eb420ec8c496d9df90, currently unused.