Hello guys, I’m using Apache Ignite 2.16.0/2.17.0 in a production environment with a cluster of 15 server nodes.
A deadlock occurred while one of the nodes (referred to as ip1 below) was executing org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy#query(org.apache.ignite.cache.query.SqlFieldsQuery).
The thread stack is as follows:
"xxx" Id=317 TIMED_WAITING on java.util.concurrent.CountDownLatch$Sync@9342695
at java.base@21.0.8/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.CountDownLatch$Sync@9342695
at java.base@21.0.8/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
at java.base@21.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
at java.base@21.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(Unknown Source)
at java.base@21.0.8/java.util.concurrent.CountDownLatch.await(Unknown Source)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:8228)
at org.apache.ignite.internal.processors.query.h2.twostep.ReduceQueryRun.tryMapToSources(ReduceQueryRun.java:218)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1065)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:448)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$5.iterator(IgniteH2Indexing.java:1447)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iter(QueryCursorImpl.java:102)
at org.apache.ignite.internal.processors.query.h2.RegisteredQueryCursor.iter(RegisteredQueryCursor.java:91)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:92)
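For context, the stuck query is an ordinary SqlFieldsQuery issued through the cache API, roughly like the sketch below. The SQL text, field names, and the cache variable are illustrative, and the per-query timeout shown is optional (it simply bounds how long the query may wait for remote results):
// Illustrative two-step (map/reduce) SQL query against the AlarmRecord cache.
SqlFieldsQuery qry = new SqlFieldsQuery("SELECT _key, _val FROM AlarmRecord WHERE _key > ?")
    .setArgs(1000);
// Optional: bound how long the query may wait for remote (map) results.
qry.setTimeout(30, TimeUnit.SECONDS);

try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor) {
        // application-specific row handling
    }
}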
Checking the logs, I found that one of the nodes in the cluster had restarted while the query was executing:
reboot system boot 5.10.0-136.12.0. Mon Mar 4 19:51 - 15:10 (3+19:19)
At that point, when I checked the latest baseline topology, it contained only the node where the thread was stuck (my own IP):
globalState=DiscoveryDataClusterState [state=ACTIVE, lastStateChangeTime=xxx, baselineTopology=BaselineTopology [id=0, branchingHash=-708844738, branchingType='New BaselineTopology', baselineNodes=[ip1:port1]]
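(For reference, the same baseline check can also be done programmatically; a minimal sketch, assuming a running Ignite instance named ignite:)
// Print the consistent IDs of the nodes currently in the baseline topology.
Collection<BaselineNode> baseline = ignite.cluster().currentBaselineTopology();
if (baseline != null)
    baseline.forEach(node -> System.out.println(node.consistentId()));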
My Ignite configuration is as follows (addressList, cacheName, and affFunc are defined elsewhere in our code):
IgniteConfiguration igniteCfg = new IgniteConfiguration();

// Static IP discovery over the 15 server node addresses.
TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
ipFinder.setAddresses(addressList).setShared(false); // addressList: the 15 server node IPs
TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setIpFinder(ipFinder);

// Pure in-memory mode: persistence disabled for the default data region.
DataRegionConfiguration dataRegionCfg = new DataRegionConfiguration();
dataRegionCfg.setPersistenceEnabled(false);
DataStorageConfiguration dataStorageCfg = new DataStorageConfiguration();
dataStorageCfg.setDefaultDataRegionConfiguration(dataRegionCfg);

igniteCfg.setDiscoverySpi(spi).setDataStorageConfiguration(dataStorageCfg);

CacheConfiguration<Integer, AlarmRecord> cacheCfg = new CacheConfiguration<>(cacheName);
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
    .setBackups(0)
    .setIndexedTypes(Integer.class, AlarmRecord.class)
    .setSqlFunctionClasses(ExtIgniteFunctions.class)
    .setRebalanceDelay(-1)
    .setOnheapCacheEnabled(false)
    .setSqlOnheapCacheEnabled(false)
    .setQueryParallelism(2)
    .setRebalanceMode(CacheRebalanceMode.NONE)
    .setAffinity(affFunc); // affFunc: custom affinity function, defined elsewhere
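For completeness, the node is then started and the cache created roughly like this (a minimal sketch, not the exact production startup code):
// Start the server node with the above configuration and register the SQL-enabled cache.
Ignite ignite = Ignition.start(igniteCfg);
IgniteCache<Integer, AlarmRecord> cache = ignite.getOrCreateCache(cacheCfg);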
Finally, I would appreciate guidance on:
Recommended production configuration
Any known limitations or best practices to ensure cluster stability and avoid full outages
How to configure the cluster so that queries already in flight when some nodes restart do not get stuck as described above
Thank you for your guidance.