The text profiles produced by ddprofiler contain duplicate column profiles, so dindex_builder takes extra time and disk space to create the full-text search index.
To reproduce this issue:
- Download the Chicago open data: https://uchicago.box.com/s/ecmb69h874qwedj19ebncvu0qvd4n97h
- Follow the quick start guide to index the data.
- Check the output in output_profiles_json/text.
For example, in 0.csv, the month_name column of x2vd-qke7.csv is indexed twice:
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
"1507119095","demo","/Users/yuegong/Desktop/chicago_open_data_all_tbls/","x2vd-qke7.csv","month_name","JUNE MAY OCTOBER AUGUST JULY SEPTEMBER NOVEMBER APRIL"
Since dindex_builder reads the text profiles to build the full-text search index, these duplicates increase both indexing time and disk usage.
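To check how widespread the duplication is, a small script along these lines could count repeated rows in each text profile file. This is only a sketch: it assumes the files under output_profiles_json/text are plain CSV with quoted fields, as in the rows above, and that a duplicate means an exact repeat of the whole row. The function names `find_duplicate_rows` and `report` are hypothetical and not part of ddprofiler or dindex_builder.

```python
import csv
from collections import Counter
from pathlib import Path


def find_duplicate_rows(csv_path):
    """Return {row_tuple: count} for every row that appears more than once."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        counts = Counter(tuple(row) for row in csv.reader(f))
    return {row: n for row, n in counts.items() if n > 1}


def report(profile_dir):
    """Print each duplicated row found in the *.csv files of profile_dir."""
    for path in sorted(Path(profile_dir).glob("*.csv")):
        for row, n in find_duplicate_rows(path).items():
            # Show only the leading fields (assumed to be the id, source,
            # path, file name and column name); the text values can be long.
            print(f"{path.name}: {row[:5]} appears {n} times")
```

Calling `report("output_profiles_json/text")` from the directory that holds the profiler output would list every repeated profile, which may help tell whether the duplication affects all columns or only some.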