fix: json scan performance on local files by ariel-miculas · Pull Request #21478 · apache/datafusion

ariel-miculas · 2026-04-08T16:53:02Z

Which issue does this PR close?

Closes #21450 (Regression in json performance for local files)

Rationale for this change

The into_stream() implementation of GetResult (from arrow-rs-objectstore) fetches every 8 KiB chunk in a separate spawn_blocking() task, which results in a lot of scheduling overhead.

Fix this by reading the data directly from the async context, using a buffer size of 8 KiB. This avoids the context switch for each chunk.
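
To make the difference concrete, here is a minimal std-only sketch, not the code in this PR. It assumes a plain std::io::Read source and uses one std::thread::spawn per chunk as a stand-in for a spawn_blocking() task; the function names read_chunks_direct and read_chunks_per_task are hypothetical. Both functions return the same chunks, but the second pays a hand-off to another thread for every 8 KiB read.

```rust
use std::io::{Cursor, Read};
use std::sync::{Arc, Mutex};
use std::thread;

/// Chunk size used for every read (8 KiB, as in the description above).
const CHUNK_SIZE: usize = 8 * 1024;

/// Reads `reader` to the end in chunks of at most CHUNK_SIZE bytes,
/// entirely on the calling thread (no hand-off per chunk).
fn read_chunks_direct<R: Read>(mut reader: R) -> std::io::Result<Vec<Vec<u8>>> {
    let mut chunks = Vec::new();
    let mut buf = vec![0u8; CHUNK_SIZE];
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // end of input
        }
        chunks.push(buf[..n].to_vec());
    }
    Ok(chunks)
}

/// Reads the same data, but hands every single chunk read to a freshly
/// spawned thread and waits for it. This is only a stand-in (assumption)
/// for scheduling one blocking task per chunk.
fn read_chunks_per_task<R: Read + Send + 'static>(reader: R) -> std::io::Result<Vec<Vec<u8>>> {
    let reader = Arc::new(Mutex::new(reader));
    let mut chunks = Vec::new();
    loop {
        let reader = Arc::clone(&reader);
        let handle = thread::spawn(move || -> std::io::Result<Vec<u8>> {
            let mut buf = vec![0u8; CHUNK_SIZE];
            let n = reader.lock().unwrap().read(&mut buf)?;
            buf.truncate(n);
            Ok(buf)
        });
        let chunk = handle.join().expect("reader thread panicked")?;
        if chunk.is_empty() {
            break; // end of input
        }
        chunks.push(chunk);
    }
    Ok(chunks)
}

fn main() -> std::io::Result<()> {
    let data = vec![7u8; 20_000];
    let direct = read_chunks_direct(Cursor::new(data.clone()))?;
    let handed_off = read_chunks_per_task(Cursor::new(data))?;
    // Both strategies yield identical chunks; only the scheduling cost differs.
    assert_eq!(direct, handed_off);
    let sizes: Vec<usize> = direct.iter().map(Vec::len).collect();
    println!("{} chunks, sizes: {:?}", direct.len(), sizes);
    Ok(())
}
```

With a small chunk size the per-chunk hand-off can dominate the cost of the read itself, which is the overhead described above.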

What changes are included in this PR?

Are these changes tested?

Validated that the initially reported overhead is now much smaller:
Comparing json-test-on-main and test-json-improvement
--------------------
Benchmark clickbench_2.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query     ┃ json-test-on-main ┃ test-json-improvement ┃       Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0  │        2421.62 ms │            2521.19 ms │    no change │
│ QQuery 1  │        2584.29 ms │            2729.98 ms │ 1.06x slower │
│ QQuery 2  │        2662.11 ms │            2782.29 ms │    no change │
│ QQuery 3  │              FAIL │                  FAIL │ incomparable │
│ QQuery 4  │        2764.78 ms │            2896.46 ms │    no change │
│ QQuery 5  │        2676.46 ms │            2758.01 ms │    no change │
│ QQuery 6  │              FAIL │                  FAIL │ incomparable │
│ QQuery 7  │        2684.50 ms │            2752.37 ms │    no change │
│ QQuery 8  │        2781.21 ms │            2827.46 ms │    no change │
│ QQuery 9  │        3039.17 ms │            3165.29 ms │    no change │
│ QQuery 10 │        2791.32 ms │            2843.44 ms │    no change │
│ QQuery 11 │        2839.05 ms │            3011.84 ms │ 1.06x slower │
│ QQuery 12 │        2691.51 ms │            2839.97 ms │ 1.06x slower │
│ QQuery 13 │        2768.57 ms │            2860.68 ms │    no change │
│ QQuery 14 │        2712.50 ms │            2856.80 ms │ 1.05x slower │
│ QQuery 15 │        2807.64 ms │            2888.94 ms │    no change │
│ QQuery 16 │        2774.87 ms │            2875.44 ms │    no change │
│ QQuery 17 │        2797.28 ms │            2850.17 ms │    no change │
│ QQuery 18 │        3017.75 ms │            3111.64 ms │    no change │
│ QQuery 19 │        2801.30 ms │            2927.25 ms │    no change │
│ QQuery 20 │        2743.43 ms │            2862.10 ms │    no change │
│ QQuery 21 │        2811.41 ms │            2906.42 ms │    no change │
│ QQuery 22 │        2953.66 ms │            3038.23 ms │    no change │
│ QQuery 23 │              FAIL │                  FAIL │ incomparable │
│ QQuery 24 │        2862.27 ms │            2940.31 ms │    no change │
│ QQuery 25 │        2763.40 ms │            2848.55 ms │    no change │
│ QQuery 26 │        2840.39 ms │            2950.47 ms │    no change │
│ QQuery 27 │        2886.70 ms │            2921.28 ms │    no change │
│ QQuery 28 │        3145.39 ms │            3221.27 ms │    no change │
│ QQuery 29 │        2821.87 ms │            2869.85 ms │    no change │
│ QQuery 30 │        2953.55 ms │            2990.15 ms │    no change │
│ QQuery 31 │        2997.81 ms │            3049.28 ms │    no change │
│ QQuery 32 │        2969.14 ms │            3126.79 ms │ 1.05x slower │
│ QQuery 33 │        2764.80 ms │            2866.63 ms │    no change │
│ QQuery 34 │        2828.77 ms │            2848.54 ms │    no change │
│ QQuery 35 │        2812.55 ms │            2793.79 ms │    no change │
│ QQuery 36 │              FAIL │                  FAIL │ incomparable │
│ QQuery 37 │              FAIL │                  FAIL │ incomparable │
│ QQuery 38 │              FAIL │                  FAIL │ incomparable │
│ QQuery 39 │              FAIL │                  FAIL │ incomparable │
│ QQuery 40 │              FAIL │                  FAIL │ incomparable │
│ QQuery 41 │              FAIL │                  FAIL │ incomparable │
│ QQuery 42 │              FAIL │                  FAIL │ incomparable │
└───────────┴───────────────────┴───────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (json-test-on-main)       │ 92771.07ms │
│ Total Time (test-json-improvement)   │ 95732.89ms │
│ Average Time (json-test-on-main)     │  2811.24ms │
│ Average Time (test-json-improvement) │  2901.00ms │
│ Queries Faster                       │          0 │
│ Queries Slower                       │          5 │
│ Queries with No Change               │         28 │
│ Queries with Failure                 │         10 │
└──────────────────────────────────────┴────────────┘

and with SIMULATE_LATENCY:

--------------------
Benchmark clickbench_2.json
--------------------
┏━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query     ┃ json-test-on-main ┃ test-json-improvement ┃        Change ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0  │        2795.68 ms │            2687.68 ms │     no change │
│ QQuery 1  │        2880.50 ms │            2768.30 ms │     no change │
│ QQuery 2  │        2960.75 ms │            2826.89 ms │     no change │
│ QQuery 3  │              FAIL │                  FAIL │  incomparable │
│ QQuery 4  │        3140.38 ms │            2963.15 ms │ +1.06x faster │
│ QQuery 5  │        2926.66 ms │            2830.43 ms │     no change │
│ QQuery 6  │              FAIL │                  FAIL │  incomparable │
│ QQuery 7  │        3026.29 ms │            2858.30 ms │ +1.06x faster │
│ QQuery 8  │        4302.35 ms │            2954.96 ms │ +1.46x faster │
│ QQuery 9  │        4439.83 ms │            3200.43 ms │ +1.39x faster │
│ QQuery 10 │        3028.32 ms │            2969.32 ms │     no change │
│ QQuery 11 │        3147.81 ms │            3040.74 ms │     no change │
│ QQuery 12 │        4169.45 ms │            2886.59 ms │ +1.44x faster │
│ QQuery 13 │        3839.01 ms │            2997.80 ms │ +1.28x faster │
│ QQuery 14 │        4086.30 ms │            2907.42 ms │ +1.41x faster │
│ QQuery 15 │        4308.07 ms │            3025.22 ms │ +1.42x faster │
│ QQuery 16 │        3084.89 ms │            2984.34 ms │     no change │
│ QQuery 17 │        4287.89 ms │            2984.27 ms │ +1.44x faster │
│ QQuery 18 │        3542.80 ms │            3144.98 ms │ +1.13x faster │
│ QQuery 19 │        4388.70 ms │            3014.37 ms │ +1.46x faster │
│ QQuery 20 │        3149.54 ms │            2986.73 ms │ +1.05x faster │
│ QQuery 21 │        3250.81 ms │            2906.60 ms │ +1.12x faster │
│ QQuery 22 │        3265.98 ms │            3122.25 ms │     no change │
│ QQuery 23 │              FAIL │                  FAIL │  incomparable │
│ QQuery 24 │        3066.52 ms │            2997.55 ms │     no change │
│ QQuery 25 │        4289.31 ms │            2884.22 ms │ +1.49x faster │
│ QQuery 26 │        4223.03 ms │            2933.16 ms │ +1.44x faster │
│ QQuery 27 │        3156.86 ms │            3001.17 ms │     no change │
│ QQuery 28 │        4831.42 ms │            3318.89 ms │ +1.46x faster │
│ QQuery 29 │        3252.45 ms │            4375.90 ms │  1.35x slower │
│ QQuery 30 │        4460.06 ms │            3153.77 ms │ +1.41x faster │
│ QQuery 31 │        4235.85 ms │            3171.58 ms │ +1.34x faster │
│ QQuery 32 │        3435.14 ms │            3202.64 ms │ +1.07x faster │
│ QQuery 33 │        3147.21 ms │            3031.54 ms │     no change │
│ QQuery 34 │        4378.41 ms │            3008.79 ms │ +1.46x faster │
│ QQuery 35 │        4224.36 ms │            2897.53 ms │ +1.46x faster │
│ QQuery 36 │              FAIL │                  FAIL │  incomparable │
│ QQuery 37 │              FAIL │                  FAIL │  incomparable │
│ QQuery 38 │              FAIL │                  FAIL │  incomparable │
│ QQuery 39 │              FAIL │                  FAIL │  incomparable │
│ QQuery 40 │              FAIL │                  FAIL │  incomparable │
│ QQuery 41 │              FAIL │                  FAIL │  incomparable │
│ QQuery 42 │              FAIL │                  FAIL │  incomparable │
└───────────┴───────────────────┴───────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                    ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (json-test-on-main)       │ 120722.63ms │
│ Total Time (test-json-improvement)   │ 100037.48ms │
│ Average Time (json-test-on-main)     │   3658.26ms │
│ Average Time (test-json-improvement) │   3031.44ms │
│ Queries Faster                       │          21 │
│ Queries Slower                       │           1 │
│ Queries with No Change               │          11 │
│ Queries with Failure                 │          10 │
└──────────────────────────────────────┴─────────────┘
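
As a quick sanity check on the two summary tables, the overall ratios can be recomputed from the reported total times. This is a small sketch; the helper name speedup is hypothetical and the inputs are copied from the tables above. It works out to roughly 1.21x faster in total with SIMULATE_LATENCY, and about 3% slower in total without it.

```rust
/// Ratio of baseline time to candidate time.
/// Values above 1.0 mean the candidate branch is faster.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

fn main() {
    // Total times copied from the two "Benchmark Summary" tables above:
    // (json-test-on-main, test-json-improvement).
    let without_latency = speedup(92771.07, 95732.89);
    let with_latency = speedup(120722.63, 100037.48);

    println!("without SIMULATE_LATENCY: {:.3}x", without_latency);
    println!("with SIMULATE_LATENCY:    {:.3}x", with_latency);
}
```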

For the tests I used a c7a.16xlarge EC2 instance and a version of hits.json trimmed down to 51G (the original is 217 GiB), with a warm page cache (warmed by running cat hits_50.json > /dev/null).

Are there any user-facing changes?

No


ariel-miculas · 2026-04-08T16:53:42Z

@Weijun-H please take a look

github-actions bot added the datasource Changes to the datasource crate label Apr 8, 2026

ariel-miculas wants to merge 1 commit into apache:main from ariel-miculas:fix-local-json-performance
