The traceback of test_bd_serving #32

@AIxyz

I am using dInfer at commit a8b4a06. Running test_bd_serving.py produces the traceback below. @zheng-da

# dllm-dinfer:v-260106 (de602d2bc8f9) (26.4GB)
docker run -it --gpus='"device=4"' --entrypoint=/bin/bash -v /bigdata/shared/models/huggingface/LLaDA2.0-mini--572899f-C8:/model de602d2bc8f9

sed -i 's#/mnt/infra/dulun.dl/models/dllm-mini/block-diffusion-sft-2k-v2-full-bd/LLaDA2-mini-preview-ep4-v0#/model#g' /code/dInfer/tests/test_bd_serving.py # point model_path at the mounted /model directory
sed -i 's#import pytest##g' /code/dInfer/tests/test_bd_serving.py # pytest is not used
sed -i 's#  model = init_sglang_dist()#  #g' /code/dInfer/tests/test_bd_serving.py # the global model is already initialized at line 97, so the call at line 196 should be removed

date && python3 /code/dInfer/tests/test_bd_serving.py && date
INFO 01-08 01:23:35 [__init__.py:216] Automatically detected platform cuda.
WARNING:sglang.srt.layers.moe.utils:MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 287, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/code/dInfer/tests/test_bd_serving.py", line 97, in <module>
    model = init_sglang_dist()
            ^^^^^^^^^^^^^^^^^^
  File "/code/dInfer/tests/test_bd_serving.py", line 69, in init_sglang_dist
    distributed.init_distributed_environment(1, 0, 'env://', 0, 'nccl')
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/distributed/parallel_state.py", line 1408, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1757, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/rendezvous.py", line 278, in _env_rendezvous_handler
    store = _create_c10d_store(
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 40399, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
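
The traceback enters through multiprocessing/spawn.py, which re-runs test_bd_serving.py and reaches the module-level `model = init_sglang_dist()` at line 97 again, so it looks like a second process is trying to listen on a port that is already held. The snippet below is only a minimal sketch of that EADDRINUSE condition using plain sockets; it does not use torch or sglang, and `try_listen` is a hypothetical helper written for this illustration.

```python
import errno
import socket


def try_listen(port, host="127.0.0.1"):
    """Try to listen on host:port; return (socket, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


# Port 0 asks the OS for any free port; this stands in for the first process.
first, _ = try_listen(0)
port = first.getsockname()[1]

# A second listener on the same port stands in for the re-run of line 97.
second, err = try_listen(port)
print(second is None, err == errno.EADDRINUSE)

first.close()
```

If this is what happens in the test, moving the module-level `model = init_sglang_dist()` under an `if __name__ == "__main__":` guard might be worth trying; that is an assumption and has not been verified against dInfer.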
