Hi,

After having run DUNE for several days, I tried to exit it with Ctrl-C. The process took forever to exit; something was hanging. GDB showed that it was Transports::HTTP. With excellent help from claude, I was able to diagnose and reproduce the issue (on both armv7 and x86_64) by sending an incomplete HTTP request to DUNE (( printf 'GET / HTTP/1.0\r\n'; sleep 999999 ) | nc 127.0.0.1 8080). Then, when trying to quit, DUNE hangs until the HTTP request is canceled (quit nc). This was tested with our version of DUNE, but I could not find any upstream fixes that address this issue.

Below is the full report from claude, with some alternative fixes. Feel free to use or discard, as you see fit 😄
(gdb) thread apply all bt
Thread 4 (Thread 0xb5a473e0 (LWP 17825) "Transports.HTTP"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb6746cc8 in recv () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb69b19ea in __interceptor_recv (fd=<optimized out>, buf=0xb5a46300, len=1, flags=<optimized out>)
at ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:6763
#3 0x01e0fc60 in DUNE::Network::TCPSocket::doRead(unsigned char*, unsigned int) ()
#4 0x01977cf0 in Transports::HTTP::RequestHandler::handleRequest(DUNE::Network::TCPSocket*) ()
#5 0x01981988 in Transports::HTTP::Handler::run() ()
#6 0x00ad1d2e in dune_concurrency_thread_entry_point ()
#7 0xb66ede62 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Thread 3 (Thread 0xb5aff3e0 (LWP 17821) "Transports.HTTP"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb6746cc8 in recv () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb69b19ea in __interceptor_recv (fd=<optimized out>, buf=0xb5afe300, len=1, flags=<optimized out>)
at ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:6763
#3 0x01e0fc60 in DUNE::Network::TCPSocket::doRead(unsigned char*, unsigned int) ()
#4 0x01977cf0 in Transports::HTTP::RequestHandler::handleRequest(DUNE::Network::TCPSocket*) ()
#5 0x01981988 in Transports::HTTP::Handler::run() ()
#6 0x00ad1d2e in dune_concurrency_thread_entry_point ()
#7 0xb66ede62 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Thread 2 (Thread 0xb5cf73e0 (LWP 17816) "Transports.HTTP"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb66eb502 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb66eb5e0 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#3 0xb66ef40c in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Thread 1 (Thread 0xb668f4e0 (LWP 17810) "dune"):
#0 0xb66af654 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb66eb502 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb66eb5e0 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#3 0xb66ef40c in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
All DUNE tasks have shut down except the main thread and the HTTP handler pool. Three of the four remaining threads are sitting in recv() inside DUNE::Network::TCPSocket::doRead, called from Transports::HTTP::RequestHandler::handleRequest.
Threads 1 and 2 have unresolved symbols, but their top frame (0xb66af654) matches the top frame of threads 3 and 4, so the main thread is almost certainly parked in the destructor chain, waiting on stopAndJoin() for the HTTP pool, which in turn is waiting on the threads stuck in recv().
Summary
When DUNE is shut down (SIGINT / Ctrl-C), the process can hang indefinitely if any
Transports.HTTP handler thread is currently blocked in a read on an accepted client
socket. The Handler::run() loop only checks isStopping() between requests, and the
per-request read loop in RequestHandler::handleRequest() uses a blocking
sock->read(..., 1) with no timeout and no external wake-up mechanism. If a client
keeps a TCP connection open but does not send the bytes needed to finish the request
line / headers (\r\n\r\n), the handler thread cannot make progress and never checks
the stop flag; Server::~Server() then blocks in stopAndJoin() waiting for a thread
that can no longer exit.
In practice the probability of hitting this grows with uptime and with the number of
HTTP clients that have ever connected (Neptus instances, browsers left open, scripts,
health probes, etc.). After several days of continuous operation it becomes a reliable
shutdown hang in our setup.
Environment
- Platform: Linux, armv7l (ARM hard-float, arm-linux-gnueabihf) — embedded AutoNaut USV.
- DUNE branch: a downstream fork rebased on lsts/master; the affected source files (src/Transports/HTTP/Server.cpp, src/Transports/HTTP/RequestHandler.cpp) are byte-identical to lsts/master at the time of reporting (verified with git diff lsts/master -- src/Transports/HTTP/... → empty).
- A sweep of every branch on lsts (643 remote refs) was performed: git log --all --oneline -- ':(top)src/Transports/HTTP/Server.cpp' ':(top)src/Transports/HTTP/RequestHandler.cpp' returns only copyright-year updates and one unrelated 2014 commit (8b96faaa0 — "Removed err() in unharmful exception"). No branch contains a functional change to these files related to shutdown, hanging, timeouts, or blocking reads.
- git log --all --oneline --grep='closeSocket' finds four commits (all by mariacosta, 2025-11-10) that add the closeSocket() method. These appear in ~27 feature branches including lsts/master, but a sweep of every ref with git cat-file -p <ref>:src/Transports/HTTP/Server.cpp | grep closeSocket shows that no branch calls closeSocket() from the HTTP server — the primitive is orphaned.
- Task configured with Threads = 5 (default) and a long uptime (several days) with at least one HTTP client connected during that period.
Root cause
1. Handler::run() only checks isStopping() between requests
src/Transports/HTTP/Server.cpp (current lsts/master):
void
run(void)
{
while (!isStopping()) // checked only here
{
if (!m_queue.waitForItems(1.0))
continue;
if (m_queue.closed())
break;
TCPSocket* sock = m_queue.pop();
if (!sock)
continue;
try
{
m_handler.handleRequest(sock); // <-- can block indefinitely
}
catch (...)
{ }
delete sock;
}
}
Once handleRequest() is called, the thread will only observe isStopping() if/when
that call returns — there is no periodic check, no cancellation, and no way for the
thread to notice that shutdown has been requested.
2. RequestHandler::handleRequest() uses a blocking byte-by-byte recv()
src/Transports/HTTP/RequestHandler.cpp (current lsts/master, the header-read loop
around line 250):
unsigned idx = 0;
unsigned didx = 0;
bool eor = false;
while (!eor && (idx < (c_max_request_size - 1)))
{
int rv = sock->read(bfr + idx, 1); // blocking recv, no timeout
if (rv <= 0)
throw ConnectionClosed();
// ... look for \r\n\r\n to mark end of request ...
}
TCPSocket::doRead ultimately calls recv(2) with no timeout and no poll, so the
thread will sleep in the kernel until either a byte arrives, the peer closes, or the
socket is forcibly torn down by another thread. Any client that has opened the
connection but not (yet) sent all of the request-line and headers will keep this read
blocked indefinitely. Long-lived connections from interactive tools (Neptus, browsers)
or half-open connections left behind by client crashes / network outages are enough to
produce this.
3. Server::~Server() joins those threads unconditionally
Server::~Server(void)
{
m_queue.close();
for (unsigned i = 0; i < m_pool.size(); ++i)
{
try
{
m_pool[i]->stopAndJoin(); // waits forever if handler is in recv()
}
catch (...)
{ }
delete m_pool[i];
}
// ...
}
stopAndJoin() only sets the stop flag and pthread_join()s the thread. Since the
thread is wedged in recv() with no external wake path, the join never returns and
the entire DUNE process hangs. The stop flag never gets a chance to be observed.
4. TCPSocket has no exposed "abort a blocked read" operation that the server uses
Until recently there was no public method on DUNE::Network::TCPSocket to forcibly
close an accepted client socket from a thread other than the one reading it.
closeSocket() was added by commit 12dc93f99f404c550d5b25eb420ec8c496d9df90
("DUNE/Network/TCPSocket: Added closeSocket() method", 2025-11-10):
void TCPSocket::closeSocket()
{
#if defined(DUNE_OS_POSIX)
shutdown(m_handle, SHUT_RDWR);
close(m_handle);
#elif defined(DUNE_OS_WINDOWS)
closesocket(m_handle);
#endif
}
However, closeSocket() has no callers on lsts/master or on any other branch
in the lsts remote (verified by scanning every remote ref with
git cat-file -p <ref>:src/Transports/HTTP/Server.cpp). The primitive that would
unblock the wedged recv() is present but is not yet wired into the HTTP server
shutdown path on any upstream branch.
Why the hang gets worse with uptime
Every accepted connection produces a TCPSocket* that ends up on the handler queue;
each connection where the peer does not complete its request-line+headers puts one
handler thread into a permanent recv() sleep. The server has a fixed thread pool
(configurable via Threads, default 5), so after enough such connections accumulate
the pool is exhausted and even new, well-behaved clients do not get serviced. At
shutdown, every wedged thread is one guaranteed indefinite wait in stopAndJoin().
Reproduction
1. Configure Transports.HTTP normally and start DUNE.
2. From another machine, open a TCP connection to the HTTP port and send less than a complete request header (do not send the final \r\n\r\n). For example:
( printf 'GET / HTTP/1.0\r\n'; sleep 999999 ) | nc <dune-ip> 8080
Any client that holds a connection open without finishing the request works — a paused browser tab, a crashed Neptus, or a network partition between client and DUNE all produce the same condition.
3. Wait for DUNE to accept the connection (it will, and a handler thread will enter handleRequest → blocking sock->read(..., 1)).
4. Send SIGINT to DUNE (Ctrl-C or kill -INT).
5. Observe that DUNE does not exit. gdb -p <pid> shows the handler thread(s) in recv() inside TCPSocket::doRead, and the main thread in stopAndJoin().
Proposed fixes
Three independent options, in order of preference. All of them are standalone; the
choice is a trade-off between invasiveness and how well the request-handling code is
preserved.
Option A (recommended): shut down in-flight sockets in Server::~Server()
Track every accepted socket in the Server (e.g. a std::set<TCPSocket*> guarded by
a Mutex). poll() inserts into the set on accept; Handler::run() removes from
the set right before delete sock. In Server::~Server(), after closing the queue,
iterate the set under the mutex and call TCPSocket::closeSocket() on every entry
before calling stopAndJoin() on the handler pool. The forced shutdown(SHUT_RDWR)
makes the blocked recv() return 0, which causes RequestHandler::handleRequest()
to throw ConnectionClosed, which lets Handler::run() observe isStopping() and
exit cleanly.
Pros:
- Smallest change to correct behavior.
- Uses the closeSocket() primitive already in upstream (12dc93f99), no new API.
- No change to request parsing, no risk to HTTP correctness.
- Works for every client-side cause (slow, silent, crashed, partitioned).
Cons:
- Adds a small amount of bookkeeping (set + mutex) in Server.
- Race between Handler::run() deleting a socket and the destructor iterating the set needs a straightforward lock + "remove before delete" rule.
Sketch:
class Server
{
// ...existing members...
Concurrency::Mutex m_live_mtx;
std::set<TCPSocket*> m_live_socks;
void
registerSocket(TCPSocket* s)
{
ScopedMutex l(m_live_mtx);
m_live_socks.insert(s);
}
void
unregisterSocket(TCPSocket* s)
{
ScopedMutex l(m_live_mtx);
m_live_socks.erase(s);
}
};
// In Server::poll(...):
TCPSocket* nc = m_sock.accept();
registerSocket(nc);
m_queue.push(nc);
// In Handler::run(), right before delete sock:
m_server.unregisterSocket(sock);
delete sock;
// In Server::~Server(), before the stopAndJoin() loop:
{
ScopedMutex l(m_live_mtx);
for (TCPSocket* s : m_live_socks)
s->closeSocket(); // unblocks any recv() on that socket
}
// ...then stopAndJoin as before.
(Ownership note: the Handler can keep deleting the TCPSocket* as today; the destructor only needs to unblock the read, not free the memory.)
Option B: add a receive timeout to the request-read loop
In RequestHandler::handleRequest(), either call Poll::poll(sock, T) with a short
timeout (say 500 ms) before each sock->read(bfr+idx, 1), or set SO_RCVTIMEO on the
socket at entry. When the poll times out / recv returns EAGAIN, check a
stop-requested flag (either isStopping() via a back-pointer to the Thread, or a
new flag on the Server / RequestHandler). If stopping, close the socket and return.
Pros:
- Self-contained within handleRequest; no socket-lifetime bookkeeping in Server.
- No dependency on closeSocket() from 12dc93f99.
Cons:
- Wakes up periodically while idle connections are held open (minor CPU cost, grows linearly with the number of stalled connections).
- Changes the hot path of request parsing; higher chance of regressing a corner case.
- Also requires covering reads inside handleGET / handlePOST / handlePUT body paths, or at least any that can block on the peer, otherwise the same class of hang is possible mid-request.
Option C: drain-and-bypass in Server::~Server()
Instead of joining the handler threads, mark the server as shutting down, drop the
queue, optionally detach the threads, and let the process exit without waiting for
them. This is fast but leaks threads and their accepted sockets; only viable if the
Server is definitely the last thing to go before the whole process exits.
Pros:
- Minimal code; shutdown returns immediately instead of waiting on wedged threads.
Cons:
- Leaks resources and is hostile to unit tests / re-entrant initialisation.
- Does not address the "thread is wedged while DUNE is still supposed to be running"
case (e.g. a stuck handler reduces the usable pool from 5 to 4 until restart).
- Generally not what you want in a long-lived daemon.
Recommendation
Adopt Option A on upstream. It is the smallest correctness fix, it leverages the
closeSocket() primitive that is already in lsts/master (commit 12dc93f99) without
needing to change request parsing, and it generalises across every client-side cause of
the stall (slow client, silent client, crashed client, network partition). Option B can
be added later as defence in depth if desired, but is not necessary to resolve the
shutdown hang.
References
- src/Transports/HTTP/Server.cpp — Handler::run() (per-request loop) and Server::~Server() (unconditional stopAndJoin()).
- src/Transports/HTTP/RequestHandler.cpp — handleRequest() blocking byte-by-byte header read.
- src/DUNE/Network/TCPSocket.cpp / .hpp — closeSocket() added by commit 12dc93f99f404c550d5b25eb420ec8c496d9df90, currently unused.