Hello guys, I’m using Apache Ignite 2.16.0/2.17.0 in a production environment with a cluster of 15 server nodes.
A deadlock occurred while one of the nodes (referred to as ip1 below) was executing org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy#query(org.apache.ignite.cache.query.SqlFieldsQuery).
The thread stack is as follows:
"xxx" Id=317 TIMED_WAITING on java.util.concurrent.CountDownLatch$Sync@9342695
at java.base@21.0.8/jdk.internal.misc.Unsafe.park(Native Method)
- waiting on java.util.concurrent.CountDownLatch$Sync@9342695
at java.base@21.0.8/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
at java.base@21.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
at java.base@21.0.8/java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(Unknown Source)
at java.base@21.0.8/java.util.concurrent.CountDownLatch.await(Unknown Source)
at org.apache.ignite.internal.util.IgniteUtils.await(IgniteUtils.java:8228)
at org.apache.ignite.internal.processors.query.h2.twostep.ReduceQueryRun.tryMapToSources(ReduceQueryRun.java:218)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.awaitAllReplies(GridReduceQueryExecutor.java:1065)
at org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor.query(GridReduceQueryExecutor.java:448)
at org.apache.ignite.internal.processors.query.h2.IgniteH2Indexing$5.iterator(IgniteH2Indexing.java:1447)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iter(QueryCursorImpl.java:102)
at org.apache.ignite.internal.processors.query.h2.RegisteredQueryCursor.iter(RegisteredQueryCursor.java:91)
at org.apache.ignite.internal.processors.cache.QueryCursorImpl.iterator(QueryCursorImpl.java:92)
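For context, the stuck query is an ordinary SqlFieldsQuery issued through the cache API, roughly like the sketch below. The SQL text, field names, and the cache variable are illustrative, and the per-query timeout shown is optional (it simply bounds how long the query may wait for remote results):
// Illustrative two-step (map/reduce) SQL query against the AlarmRecord cache.
SqlFieldsQuery qry = new SqlFieldsQuery("SELECT _key, _val FROM AlarmRecord WHERE _key > ?")
    .setArgs(1000);
// Optional: bound how long the query may wait for remote (map) results.
qry.setTimeout(30, TimeUnit.SECONDS);

try (FieldsQueryCursor<List<?>> cursor = cache.query(qry)) {
    for (List<?> row : cursor) {
        // application-specific row handling
    }
}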
Checking the logs, I found that one of the nodes in the cluster had restarted while the query was executing:
reboot system boot 5.10.0-136.12.0. Mon Mar 4 19:51 - 15:10 (3+19:19)
At that point, when I checked the latest baseline topology, it contained only the node where the thread was stuck (my own IP):
globalState=DiscoveryDataClusterState [state=ACTIVE, lastStateChangeTime=xxx, baselineTopology=BaselineTopology [id=0, branchingHash=-708844738, branchingType='New BaselineTopology', baselineNodes=[ip1:port1]]
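(For reference, the same baseline check can also be done programmatically; a minimal sketch, assuming a running Ignite instance named ignite:)
// Print the consistent IDs of the nodes currently in the baseline topology.
Collection<BaselineNode> baseline = ignite.cluster().currentBaselineTopology();
if (baseline != null)
    baseline.forEach(node -> System.out.println(node.consistentId()));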
My Ignite configuration is as follows (addressList, cacheName, and affFunc are defined elsewhere in our code):
IgniteConfiguration igniteCfg = new IgniteConfiguration();

// Static IP discovery over the 15 server node addresses.
TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
ipFinder.setAddresses(addressList).setShared(false); // addressList: the 15 server node IPs
TcpDiscoverySpi spi = new TcpDiscoverySpi();
spi.setIpFinder(ipFinder);

// Pure in-memory mode: persistence disabled for the default data region.
DataRegionConfiguration dataRegionCfg = new DataRegionConfiguration();
dataRegionCfg.setPersistenceEnabled(false);
DataStorageConfiguration dataStorageCfg = new DataStorageConfiguration();
dataStorageCfg.setDefaultDataRegionConfiguration(dataRegionCfg);

igniteCfg.setDiscoverySpi(spi).setDataStorageConfiguration(dataStorageCfg);

CacheConfiguration<Integer, AlarmRecord> cacheCfg = new CacheConfiguration<>(cacheName);
cacheCfg.setCacheMode(CacheMode.PARTITIONED)
    .setBackups(0)
    .setIndexedTypes(Integer.class, AlarmRecord.class)
    .setSqlFunctionClasses(ExtIgniteFunctions.class)
    .setRebalanceDelay(-1)
    .setOnheapCacheEnabled(false)
    .setSqlOnheapCacheEnabled(false)
    .setQueryParallelism(2)
    .setRebalanceMode(CacheRebalanceMode.NONE)
    .setAffinity(affFunc); // affFunc: custom affinity function, defined elsewhere
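For completeness, the node is then started and the cache created roughly like this (a minimal sketch, not the exact production startup code):
// Start the server node with the above configuration and register the SQL-enabled cache.
Ignite ignite = Ignition.start(igniteCfg);
IgniteCache<Integer, AlarmRecord> cache = ignite.getOrCreateCache(cacheCfg);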
Finally, I would appreciate guidance on:
Recommended production configuration
Any known limitations or best practices to ensure cluster stability and avoid full outages
How to configure the cluster so that queries already in flight when some nodes restart do not get stuck as described above
Thank you for your guidance.