Data pre-processing is one of the most time-consuming and important steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of available operators and find it challenging to identify those that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. PRESISTANT assists non-expert users by recommending data pre-processing operators ranked according to their impact on the final analysis. It uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
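The ranking idea described above can be illustrated with a minimal Python sketch. It is not PRESISTANT's actual implementation: the function `rank_operators`, the hand-written `toy_predictor` (which stands in for a learned Random Forest model), the operator names, and the two meta-features are all assumptions made for illustration. It only shows how candidate operators could be ordered once some model predicts their impact on a classifier's accuracy.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (operator name, dataset meta-features) -> predicted
# change in predictive accuracy. In the approach described above a learned
# Random Forest model would fill this role; here any callable works.
ImpactPredictor = Callable[[str, Dict[str, float]], float]


def rank_operators(
    operators: List[str],
    meta_features: Dict[str, float],
    predict_impact: ImpactPredictor,
) -> List[Tuple[str, float]]:
    """Return (operator, predicted impact) pairs, highest impact first."""
    scored = [(op, predict_impact(op, meta_features)) for op in operators]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_predictor(operator: str, meta_features: Dict[str, float]) -> float:
    """Hand-written stand-in for a learned model (illustrative rules only)."""
    numeric_ratio = meta_features.get("numeric_ratio", 0.0)
    missing_ratio = meta_features.get("missing_ratio", 0.0)
    if operator == "Discretize":
        return 0.05 * numeric_ratio       # assumed positive impact
    if operator == "ReplaceMissingValues":
        return 0.2 * missing_ratio        # assumed positive impact
    if operator == "NominalToBinary":
        return -0.03 * numeric_ratio      # assumed negative impact
    return 0.0                            # assumed zero impact otherwise


if __name__ == "__main__":
    # Assumed meta-features of a hypothetical dataset.
    dataset = {"numeric_ratio": 0.8, "missing_ratio": 0.1}
    candidates = ["Normalize", "NominalToBinary", "Discretize", "ReplaceMissingValues"]
    for name, impact in rank_operators(candidates, dataset, toy_predictor):
        print(f"{name}: {impact:+.3f}")
```

Keeping the predictor behind a plain callable means the ranking step stays the same whether the impact estimates come from a toy rule, as here, or from a model trained on experiments over many datasets.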
The full article can be found here!
The datasets used in the experiments (detailed information about them can be found at the following link):
abalone_1, abalone_2, abalone_3, acute-inflammations_1, acute-inflammations_2, ada_agnostic_1, ada_prior_1, analcatdata_apnea1_2, analcatdata_apnea2_2, analcatdata_apnea3_2, analcatdata_asbestos_1, analcatdata_authorship_1, analcatdata_authorship_2, analcatdata_bankruptcy_1, analcatdata_birthday_2, analcatdata_bondrate_1, analcatdata_bondrate_2, analcatdata_boxing1_1, analcatdata_boxing2_1, analcatdata_broadway_1, analcatdata_broadway_2, analcatdata_broadwaymult_1, analcatdata_broadwaymult_2, analcatdata_challenger_1, analcatdata_challenger_2, analcatdata_chlamydia_2, analcatdata_creditscore_1, analcatdata_cyyoung8092_1, analcatdata_cyyoung9302_1, analcatdata_dmft_1, analcatdata_dmft_2, analcatdata_draft_1, analcatdata_draft_2, analcatdata_election2000_2, analcatdata_germangss_2, analcatdata_gsssexsurvey_2, analcatdata_gviolence_2, analcatdata_halloffame_1, analcatdata_halloffame_2, analcatdata_homerun_1, analcatdata_impeach_1, analcatdata_lawsuit_1, analcatdata_marketing_1, analcatdata_marketing_2, analcatdata_michiganacc_2, analcatdata_neavote_2, analcatdata_negotiation_2, analcatdata_olympic2000_2, analcatdata_reviewer_1, analcatdata_reviewer_2, analcatdata_runshoes_2, analcatdata_seropositive_2, analcatdata_supreme_2, analcatdata_uktrainacc_2, analcatdata_vehicle_2, analcatdata_vineyard_2, analcatdata_whale_1, analcatdata_wildcat_2, anneal_1, anneal_114, anneal_2, anneal_241, anneal_3, anneal_30, anneal_31, anneal_32, anneal_51, appendicitis_1, ar1_1, ar3_1, ar4_1, ar5_1, ar6_1, arrhythmia_1, arrhythmia_2, arsenic-female-bladder_2, arsenic-female-lung_2, arsenic-male-bladder_2, arsenic-male-lung_2, artificial-characters_1, audiology_1, audiology_2, auto_price_2, auto93_2, autoHorse_2, autoMpg_2, autoPrice_2, autos_1, autos_2, autoUniv-au1-1000_1, autoUniv-au4-2500_1, autoUniv-au6-1000_1, autoUniv-au6-400_1, autoUniv-au6-750_1, autoUniv-au7-1100_1, autoUniv-au7-500_1, autoUniv-au7-700_1, BachChoralHarmony_1, backache_1, badges2_1, balance-scale_1, 
balance-scale_2, balloon_2, banana_1, bank8FM_2, bank-marketing_2, banknote-authentication_1, baseball_1, baskball_2, biomed_1, blogger_1, blood-transfusion-service-center_1, bodyfat_2, bolts_2, boston_2, boston_corrected_2, braziltourism_1, braziltourism_2, breast-cancer_1, breast-cancer-dropped-missing-attributes-values_1, breast-tissue_1, breast-tissue_2, breastTumor_2, breast-w_1, bridges_1, bridges_2, bridges_3, bridges_4, bridges_5, cal_housing_1, car_1, car_2, cardiotocography_1, cardiotocography_2, cars_1, cars_2, CastMetal1_1, chatfield_4_2, chess_1, cholesterol_2, chscase_adopt_2, chscase_census2_2, chscase_census3_2, chscase_census4_2, chscase_census5_2, chscase_census6_2, chscase_funds_2, chscase_geyser1_2, chscase_health_2, chscase_vine1_2, chscase_vine2_2, chscase_whale_2, cjs_2, cleveland_2, climate-model-simulation-crashes_1, cloud_2, cm1_req_1, cmc_1, cmc_2, colic_1, colic_2, colleges_aaup_2, colleges_usnews_2, collins_1, collins_2, confidence_2, contact-lenses_1, CostaMadre1_2, cpu_2, cpu_act_3, cpu_small_3, credit-a_1, credit-g_1, cylinder-bands_1, cylinder-bands_2, data_1, datatrieve_1, dbworld-subjects_1, dbworld-subjects-stemmed_1, delta_ailerons_1, delta_elevators_3, dermatology_1, dermatology_2, desharnais_1, desharnais_2, diabetes_1, diabetes_numeric_2, diggle_table_a1_2, diggle_table_a2_2, disclosure_x_bias_2, disclosure_x_noise_2, disclosure_x_tampered_2, disclosure_z_2, dresses-sales_1, dresses-sales_2, echoMonths_2, ecoli_1, ecoli_2, elusage_2, energy-efficiency_1, Engine1_1, eucalyptus_1, eucalyptus_2, fertility_1, fishcatch_2, fl2000_2, flags_1, flags_2, fri_c0_100_10_2, fri_c0_100_25_2, fri_c0_100_5_2, fri_c0_100_50_2, fri_c0_1000_10_2, fri_c0_1000_25_2, fri_c0_1000_5_2, fri_c0_1000_50_2, fri_c0_250_10_2, fri_c0_250_25_2, fri_c0_250_5_2, fri_c0_250_50_2, fri_c0_500_10_2, fri_c0_500_25_2, fri_c0_500_5_2, fri_c0_500_50_2, fri_c1_100_10_2, fri_c1_100_25_2, fri_c1_100_5_2, fri_c1_100_50_2, fri_c1_1000_10_2, fri_c1_1000_25_2, 
fri_c1_1000_5_2, fri_c1_1000_50_2, fri_c1_250_10_2, fri_c1_250_25_2, fri_c1_250_5_2, fri_c1_250_50_2, fri_c1_500_10_2, fri_c1_500_25_2, fri_c1_500_5_2, fri_c1_500_50_2, fri_c2_100_10_2, fri_c2_100_25_2, fri_c2_100_5_2, fri_c2_100_50_2, fri_c2_1000_10_2, fri_c2_1000_25_2, fri_c2_1000_5_2, fri_c2_1000_50_2, fri_c2_250_10_2, fri_c2_250_25_2, fri_c2_250_5_2, fri_c2_250_50_2, fri_c2_500_10_2, fri_c2_500_25_2, fri_c2_500_5_2, fri_c2_500_50_2, fri_c3_100_10_2, fri_c3_100_25_2, fri_c3_100_5_2, fri_c3_100_50_2, fri_c3_1000_10_2, fri_c3_1000_25_2, fri_c3_1000_5_2, fri_c3_1000_50_2, fri_c3_250_10_2, fri_c3_250_25_2, fri_c3_250_5_2, fri_c3_250_50_2, fri_c3_500_10_2, fri_c3_500_25_2, fri_c3_500_5_2, fri_c3_500_50_2, fri_c4_100_10_2, fri_c4_100_100_2, fri_c4_100_25_2, fri_c4_100_50_2, fri_c4_1000_10_2, fri_c4_1000_100_2, fri_c4_1000_25_2, fri_c4_1000_50_2, fri_c4_250_10_2, fri_c4_250_100_2, fri_c4_250_25_2, fri_c4_250_50_2, fri_c4_500_10_2, fri_c4_500_100_2, fri_c4_500_25_2, fri_c4_500_50_2, fruitfly_2, glass_1, glass_2, grub-damage_1, grub-damage_2, haberman_1, hayes-roth_1, hayes-roth_2, heart-c_1, heart-c_2, heart-h_1, heart-h_2, heart-h_3, heart-long-beach_1, heart-statlog_1, heart-switzerland_1, hepatitis_1, hill-valley_1, hip_2, houses_2, housing_1, humandevel_2, hungarian_2, hutsof99_child_witness_2, hutsof99_logis_2, hypothyroid_1, hypothyroid_2, ilpd_1, ionosphere_1, iris_1, iris_3, IRIS_4, iris_5, iris-example_1, iris-example_2, irish_1, jEdit_4.0_4.2_1, jEdit_4.2_4.3_1, jm1_1, kc1_1, kc1-binary_1, kc1-top5_1, kc2_1, kc3_1, kdd_el_nino-small_2, kdd_synthetic_control_1, kidney_2, kin8nm_2, KnuggetChase3_1, kropt_1, kr-vs-k_1, kr-vs-kp_1, KungChi3_1, labor_1, leaf_1, LED-display-domain-7digit_1, letter_1, letter_2, letter-challenge-unlabeled_1, lowbwt_2, lsvt_1, lung-cancer_1, lymph_1, lymph_2, machine_cpu_2, mammography_1, mbagrade_2, mc1_1, mc2_1, MeanWhile1_1, MegaWatt1_1, meta_2, meta_all_1, meta_batchincremental_1, meta_ensembles_1, meta_instanceincremental_1, 
mfeat-morphological_1, mfeat-morphological_2, mfeat-pixel_1, mfeat-pixel_2, mfeat-zernike_2, MindCave2_1, molecular-biology_promoters_1, molecular-biology_promoters_2, monks-problems-1_1, monks-problems-2_1, monks-problems-3_1, mozilla4_1, mu284_2, mushroom_1, mushroom_2, mw1_1, newton_hema_2, no2_2, nursery_2, nursery_3, one-hundred-plants-margin_1, one-hundred-plants-shape_1, one-hundred-plants-texture_1, optdigits_1, optdigits_2, ozone-level-8hr_1, page-blocks_1, page-blocks_2, parkinsons_1, pasture_1, pasture_2, pbc_3, pbcseq_2, pc1_1, pc1_req_1, pc2_1, pc3_1, pc4_1, pendigits_1, pendigits_2, pharynx_2, PhishingWebsites_1, phoneme_1, PieChart1_1, PieChart2_1, PieChart3_1, PieChart4_1, PizzaCutter1_1, PizzaCutter3_1, planning-relax_1, plasma_retinol_2, pm10_2, pollen_2, pollution_2, postoperative-patient-data_1, postoperative-patient-data_2, primary-tumor_1, primary-tumor_2, prnn_cushings_1, prnn_fglass_1, prnn_fglass_2, prnn_synth_1, prnn_viruses_1, puma8NH_2, pwLinear_2, pyrim_2, qsar-biodeg_1, quake_3, quake_4, qualitative-bankruptcy_1, rabe_131_2, rabe_148_2, rabe_166_2, rabe_176_2, rabe_265_2, rabe_266_2, rabe_97_2, ringnorm_1, rmftsa_ctoarrivals_2, rmftsa_ladata_2, rmftsa_sleepdata_2, robot-failures-lp1_1, robot-failures-lp2_1, robot-failures-lp3_1, robot-failures-lp4_1, robot-failures-lp5_1, sa-heart_1, schizo_1, schlvote_2, seeds_1, segment_1, segment_2, seismic-bumps_1, semeion_1, sensory_2, servo_1, shuttle-landing-control_1, sick_1, sleep_3, sleuth_case1102_2, sleuth_case1201_2, sleuth_case1202_2, sleuth_case2002_2, sleuth_ex1221_2, sleuth_ex1605_2, sleuth_ex1714_2, sleuth_ex2015_2, sleuth_ex2016_2, socmob_2, solar-flare_1, solar-flare_2, sonar_1, soybean_1, soybean_2, space_ga_2, spambase_1, SPECT_1, SPECTF_2, spectrometer_2, splice_1, splice_2, sponge_1, sponge_2, squash-stored_1, squash-stored_2, squash-unstored_1, squash-unstored_2, steel-plates-fault_1, stock_2, strikes_2, synthetic_control_1, tae_1, tae_2, teachingAssistant_1, tecator_2, 
thoracic-surgery_1, thyroid_sick_1, thyroid-allbp_1, thyroid-allhyper_1, thyroid-allhypo_1, thyroid-allrep_1, thyroid-ann_1, thyroid-dis_1, tic-tac-toe_1, trains_1, transplant_2, triazines_2, user-knowledge_1, usp05_1, usp05-ft_1, vehicle_1, vehicle_2, vertebra-column_1, vertebra-column_2, veteran_2, vineyard_2, vinnie_2, visualizing_environmental_2, visualizing_ethanol_2, visualizing_galaxy_2, visualizing_hamster_2, visualizing_livestock_2, visualizing_slope_2, visualizing_soil_2, volcanoes-a1_1, volcanoes-a2_1, volcanoes-a3_1, volcanoes-a4_1, volcanoes-b1_1, volcanoes-b2_1, volcanoes-b3_1, volcanoes-b4_1, volcanoes-b5_1, volcanoes-b6_1, volcanoes-c1_1, volcanoes-d1_1, volcanoes-d2_1, volcanoes-d3_1, volcanoes-d4_1, volcanoes-e1_1, volcanoes-e2_1, volcanoes-e3_1, volcanoes-e4_1, volcanoes-e5_1, vote_1, vowel_2, vowel_3, wall-robot-navigation_1, wall-robot-navigation_2, wall-robot-navigation_3, water-treatment_2, wdbc_1, white-clover_1, white-clover_2, wholesale-customers_1, wilt_1, wind_2, wind_correlations_2, wine_2, wine-quality-white_1, wisconsin_2, witmer_census_1980_2, yeast_1, zoo_1, zoo_2.
This work is supported by the European Commission through the Erasmus Mundus Joint Doctorate "Information Technologies for Business Intelligence - Doctoral College" (IT4BI-DC).
Last update: 2018/01/08