Duplicate text profiles from ddprofiler #76

@snowgy

Description

The text profiles produced by ddprofiler contain duplicate column profiles, so dindex_builder needs extra time and disk space to build the full-text search index.

To reproduce this issue:

  1. Download the Chicago open data: https://uchicago.box.com/s/ecmb69h874qwedj19ebncvu0qvd4n97h
  2. Follow the quick start guide to index the data.
  3. Inspect the files in output_profiles_json/text.

For example, in 0.csv the month_name column of x2vd-qke7.csv is written twice:

"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"

Since dindex_builder reads the text profiles to build the full-text search index, every duplicated row is indexed again, which costs extra indexing time and space.
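In case it helps with triage, here is a minimal sketch for confirming the duplicates and, as a stopgap, removing them before running dindex_builder. It is not part of ddprofiler or dindex_builder: the function names `find_duplicate_rows`, `dedupe_rows`, and `dedupe_file` are hypothetical, and it assumes the text profile files are plain CSV with every field quoted (as in the rows above) and that duplicates are exact row-for-row copies.

```python
import csv
from collections import Counter


def find_duplicate_rows(lines):
    """Return {row_tuple: count} for each row that occurs more than once.

    `lines` is any iterable of CSV text lines, e.g. an open file object.
    """
    counts = Counter(tuple(row) for row in csv.reader(lines) if row)
    return {row: n for row, n in counts.items() if n > 1}


def dedupe_rows(lines):
    """Yield each distinct row once, keeping first-seen order."""
    seen = set()
    for row in csv.reader(lines):
        key = tuple(row)
        if not row or key in seen:
            continue  # skip blank lines and exact repeats
        seen.add(key)
        yield row


def dedupe_file(src_path, dst_path):
    """Write a copy of src_path without repeated rows (hypothetical helper).

    QUOTE_ALL is an assumption chosen to mirror the fully quoted rows
    shown in this issue.
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        csv.writer(dst, quoting=csv.QUOTE_ALL).writerows(dedupe_rows(src))
```

Calling `find_duplicate_rows(open("output_profiles_json/text/0.csv", newline=""))` should list the repeated month_name row if the file matches the excerpt above. This only works around the symptom; the duplicate rows would still need to be fixed where ddprofiler writes them.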
