PRESISTANT: Learning based assistant for data pre-processing

Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative, or zero impact on the final result of the analysis. Expert users have the knowledge required to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. PRESISTANT assists non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. It uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
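The ranking idea described above can be illustrated with a minimal sketch. It is not PRESISTANT's implementation: the operator names, the `numeric_ratio` dataset characteristic, and the hand-written `stub_predictor` are all hypothetical stand-ins. In PRESISTANT the predictor's role is played by a model learned with Random Forests; here a fixed rule is used only so the ranking step can be shown on its own.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical type: maps (dataset characteristics, operator name) to an
# estimated change in predictive accuracy. In PRESISTANT this role is played
# by a learned model; the stub below is NOT that model.
ImpactPredictor = Callable[[Dict[str, float], str], float]


def rank_operators(
    characteristics: Dict[str, float],
    operators: List[str],
    predict_impact: ImpactPredictor,
) -> List[Tuple[str, float]]:
    """Return (operator, predicted impact) pairs, highest impact first."""
    scored = [(op, predict_impact(characteristics, op)) for op in operators]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def stub_predictor(characteristics: Dict[str, float], operator: str) -> float:
    """Made-up rule, for illustration only; the values carry no empirical meaning."""
    numeric_ratio = characteristics.get("numeric_ratio", 0.0)
    if operator == "discretize":
        return 0.05 if numeric_ratio > 0.5 else -0.01
    if operator == "normalize":
        return 0.02 * numeric_ratio
    return 0.0  # any other operator is treated as having no predicted impact


if __name__ == "__main__":
    dataset = {"numeric_ratio": 0.8}  # hypothetical dataset description
    candidates = ["normalize", "discretize", "no_op"]
    for op, impact in rank_operators(dataset, candidates, stub_predictor):
        print(f"{op}: {impact:+.3f}")
```

Separating the predictor from the ranking function means a learned model can replace the stub without changing how recommendations are ordered.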

The full article can be found here!


Results

This section shows the results of evaluating the recommendations provided by PRESISTANT, following the procedure described in the paper submitted to KBS.

Datasets

The datasets used in the experiments are listed below (detailed information about them can be found at the following link):

abalone_1, abalone_2, abalone_3, acute-inflammations_1, acute-inflammations_2, ada_agnostic_1, ada_prior_1, analcatdata_apnea1_2, analcatdata_apnea2_2, analcatdata_apnea3_2, analcatdata_asbestos_1, analcatdata_authorship_1, analcatdata_authorship_2, analcatdata_bankruptcy_1, analcatdata_birthday_2, analcatdata_bondrate_1, analcatdata_bondrate_2, analcatdata_boxing1_1, analcatdata_boxing2_1, analcatdata_broadway_1, analcatdata_broadway_2, analcatdata_broadwaymult_1, analcatdata_broadwaymult_2, analcatdata_challenger_1, analcatdata_challenger_2, analcatdata_chlamydia_2, analcatdata_creditscore_1, analcatdata_cyyoung8092_1, analcatdata_cyyoung9302_1, analcatdata_dmft_1, analcatdata_dmft_2, analcatdata_draft_1, analcatdata_draft_2, analcatdata_election2000_2, analcatdata_germangss_2, analcatdata_gsssexsurvey_2, analcatdata_gviolence_2, analcatdata_halloffame_1, analcatdata_halloffame_2, analcatdata_homerun_1, analcatdata_impeach_1, analcatdata_lawsuit_1, analcatdata_marketing_1, analcatdata_marketing_2, analcatdata_michiganacc_2, analcatdata_neavote_2, analcatdata_negotiation_2, analcatdata_olympic2000_2, analcatdata_reviewer_1, analcatdata_reviewer_2, analcatdata_runshoes_2, analcatdata_seropositive_2, analcatdata_supreme_2, analcatdata_uktrainacc_2, analcatdata_vehicle_2, analcatdata_vineyard_2, analcatdata_whale_1, analcatdata_wildcat_2, anneal_1, anneal_114, anneal_2, anneal_241, anneal_3, anneal_30, anneal_31, anneal_32, anneal_51, appendicitis_1, ar1_1, ar3_1, ar4_1, ar5_1, ar6_1, arrhythmia_1, arrhythmia_2, arsenic-female-bladder_2, arsenic-female-lung_2, arsenic-male-bladder_2, arsenic-male-lung_2, artificial-characters_1, audiology_1, audiology_2, auto_price_2, auto93_2, autoHorse_2, autoMpg_2, autoPrice_2, autos_1, autos_2, autoUniv-au1-1000_1, autoUniv-au4-2500_1, autoUniv-au6-1000_1, autoUniv-au6-400_1, autoUniv-au6-750_1, autoUniv-au7-1100_1, autoUniv-au7-500_1, autoUniv-au7-700_1, BachChoralHarmony_1, backache_1, badges2_1, balance-scale_1, 
balance-scale_2, balloon_2, banana_1, bank8FM_2, bank-marketing_2, banknote-authentication_1, baseball_1, baskball_2, biomed_1, blogger_1, blood-transfusion-service-center_1, bodyfat_2, bolts_2, boston_2, boston_corrected_2, braziltourism_1, braziltourism_2, breast-cancer_1, breast-cancer-dropped-missing-attributes-values_1, breast-tissue_1, breast-tissue_2, breastTumor_2, breast-w_1, bridges_1, bridges_2, bridges_3, bridges_4, bridges_5, cal_housing_1, car_1, car_2, cardiotocography_1, cardiotocography_2, cars_1, cars_2, CastMetal1_1, chatfield_4_2, chess_1, cholesterol_2, chscase_adopt_2, chscase_census2_2, chscase_census3_2, chscase_census4_2, chscase_census5_2, chscase_census6_2, chscase_funds_2, chscase_geyser1_2, chscase_health_2, chscase_vine1_2, chscase_vine2_2, chscase_whale_2, cjs_2, cleveland_2, climate-model-simulation-crashes_1, cloud_2, cm1_req_1, cmc_1, cmc_2, colic_1, colic_2, colleges_aaup_2, colleges_usnews_2, collins_1, collins_2, confidence_2, contact-lenses_1, CostaMadre1_2, cpu_2, cpu_act_3, cpu_small_3, credit-a_1, credit-g_1, cylinder-bands_1, cylinder-bands_2, data_1, datatrieve_1, dbworld-subjects_1, dbworld-subjects-stemmed_1, delta_ailerons_1, delta_elevators_3, dermatology_1, dermatology_2, desharnais_1, desharnais_2, diabetes_1, diabetes_numeric_2, diggle_table_a1_2, diggle_table_a2_2, disclosure_x_bias_2, disclosure_x_noise_2, disclosure_x_tampered_2, disclosure_z_2, dresses-sales_1, dresses-sales_2, echoMonths_2, ecoli_1, ecoli_2, elusage_2, energy-efficiency_1, Engine1_1, eucalyptus_1, eucalyptus_2, fertility_1, fishcatch_2, fl2000_2, flags_1, flags_2, fri_c0_100_10_2, fri_c0_100_25_2, fri_c0_100_5_2, fri_c0_100_50_2, fri_c0_1000_10_2, fri_c0_1000_25_2, fri_c0_1000_5_2, fri_c0_1000_50_2, fri_c0_250_10_2, fri_c0_250_25_2, fri_c0_250_5_2, fri_c0_250_50_2, fri_c0_500_10_2, fri_c0_500_25_2, fri_c0_500_5_2, fri_c0_500_50_2, fri_c1_100_10_2, fri_c1_100_25_2, fri_c1_100_5_2, fri_c1_100_50_2, fri_c1_1000_10_2, fri_c1_1000_25_2, 
fri_c1_1000_5_2, fri_c1_1000_50_2, fri_c1_250_10_2, fri_c1_250_25_2, fri_c1_250_5_2, fri_c1_250_50_2, fri_c1_500_10_2, fri_c1_500_25_2, fri_c1_500_5_2, fri_c1_500_50_2, fri_c2_100_10_2, fri_c2_100_25_2, fri_c2_100_5_2, fri_c2_100_50_2, fri_c2_1000_10_2, fri_c2_1000_25_2, fri_c2_1000_5_2, fri_c2_1000_50_2, fri_c2_250_10_2, fri_c2_250_25_2, fri_c2_250_5_2, fri_c2_250_50_2, fri_c2_500_10_2, fri_c2_500_25_2, fri_c2_500_5_2, fri_c2_500_50_2, fri_c3_100_10_2, fri_c3_100_25_2, fri_c3_100_5_2, fri_c3_100_50_2, fri_c3_1000_10_2, fri_c3_1000_25_2, fri_c3_1000_5_2, fri_c3_1000_50_2, fri_c3_250_10_2, fri_c3_250_25_2, fri_c3_250_5_2, fri_c3_250_50_2, fri_c3_500_10_2, fri_c3_500_25_2, fri_c3_500_5_2, fri_c3_500_50_2, fri_c4_100_10_2, fri_c4_100_100_2, fri_c4_100_25_2, fri_c4_100_50_2, fri_c4_1000_10_2, fri_c4_1000_100_2, fri_c4_1000_25_2, fri_c4_1000_50_2, fri_c4_250_10_2, fri_c4_250_100_2, fri_c4_250_25_2, fri_c4_250_50_2, fri_c4_500_10_2, fri_c4_500_100_2, fri_c4_500_25_2, fri_c4_500_50_2, fruitfly_2, glass_1, glass_2, grub-damage_1, grub-damage_2, haberman_1, hayes-roth_1, hayes-roth_2, heart-c_1, heart-c_2, heart-h_1, heart-h_2, heart-h_3, heart-long-beach_1, heart-statlog_1, heart-switzerland_1, hepatitis_1, hill-valley_1, hip_2, houses_2, housing_1, humandevel_2, hungarian_2, hutsof99_child_witness_2, hutsof99_logis_2, hypothyroid_1, hypothyroid_2, ilpd_1, ionosphere_1, iris_1, iris_3, IRIS_4, iris_5, iris-example_1, iris-example_2, irish_1, jEdit_4.0_4.2_1, jEdit_4.2_4.3_1, jm1_1, kc1_1, kc1-binary_1, kc1-top5_1, kc2_1, kc3_1, kdd_el_nino-small_2, kdd_synthetic_control_1, kidney_2, kin8nm_2, KnuggetChase3_1, kropt_1, kr-vs-k_1, kr-vs-kp_1, KungChi3_1, labor_1, leaf_1, LED-display-domain-7digit_1, letter_1, letter_2, letter-challenge-unlabeled_1, lowbwt_2, lsvt_1, lung-cancer_1, lymph_1, lymph_2, machine_cpu_2, mammography_1, mbagrade_2, mc1_1, mc2_1, MeanWhile1_1, MegaWatt1_1, meta_2, meta_all_1, meta_batchincremental_1, meta_ensembles_1, meta_instanceincremental_1, 
mfeat-morphological_1, mfeat-morphological_2, mfeat-pixel_1, mfeat-pixel_2, mfeat-zernike_2, MindCave2_1, molecular-biology_promoters_1, molecular-biology_promoters_2, monks-problems-1_1, monks-problems-2_1, monks-problems-3_1, mozilla4_1, mu284_2, mushroom_1, mushroom_2, mw1_1, newton_hema_2, no2_2, nursery_2, nursery_3, one-hundred-plants-margin_1, one-hundred-plants-shape_1, one-hundred-plants-texture_1, optdigits_1, optdigits_2, ozone-level-8hr_1, page-blocks_1, page-blocks_2, parkinsons_1, pasture_1, pasture_2, pbc_3, pbcseq_2, pc1_1, pc1_req_1, pc2_1, pc3_1, pc4_1, pendigits_1, pendigits_2, pharynx_2, PhishingWebsites_1, phoneme_1, PieChart1_1, PieChart2_1, PieChart3_1, PieChart4_1, PizzaCutter1_1, PizzaCutter3_1, planning-relax_1, plasma_retinol_2, pm10_2, pollen_2, pollution_2, postoperative-patient-data_1, postoperative-patient-data_2, primary-tumor_1, primary-tumor_2, prnn_cushings_1, prnn_fglass_1, prnn_fglass_2, prnn_synth_1, prnn_viruses_1, puma8NH_2, pwLinear_2, pyrim_2, qsar-biodeg_1, quake_3, quake_4, qualitative-bankruptcy_1, rabe_131_2, rabe_148_2, rabe_166_2, rabe_176_2, rabe_265_2, rabe_266_2, rabe_97_2, ringnorm_1, rmftsa_ctoarrivals_2, rmftsa_ladata_2, rmftsa_sleepdata_2, robot-failures-lp1_1, robot-failures-lp2_1, robot-failures-lp3_1, robot-failures-lp4_1, robot-failures-lp5_1, sa-heart_1, schizo_1, schlvote_2, seeds_1, segment_1, segment_2, seismic-bumps_1, semeion_1, sensory_2, servo_1, shuttle-landing-control_1, sick_1, sleep_3, sleuth_case1102_2, sleuth_case1201_2, sleuth_case1202_2, sleuth_case2002_2, sleuth_ex1221_2, sleuth_ex1605_2, sleuth_ex1714_2, sleuth_ex2015_2, sleuth_ex2016_2, socmob_2, solar-flare_1, solar-flare_2, sonar_1, soybean_1, soybean_2, space_ga_2, spambase_1, SPECT_1, SPECTF_2, spectrometer_2, splice_1, splice_2, sponge_1, sponge_2, squash-stored_1, squash-stored_2, squash-unstored_1, squash-unstored_2, steel-plates-fault_1, stock_2, strikes_2, synthetic_control_1, tae_1, tae_2, teachingAssistant_1, tecator_2, 
thoracic-surgery_1, thyroid_sick_1, thyroid-allbp_1, thyroid-allhyper_1, thyroid-allhypo_1, thyroid-allrep_1, thyroid-ann_1, thyroid-dis_1, tic-tac-toe_1, trains_1, transplant_2, triazines_2, user-knowledge_1, usp05_1, usp05-ft_1, vehicle_1, vehicle_2, vertebra-column_1, vertebra-column_2, veteran_2, vineyard_2, vinnie_2, visualizing_environmental_2, visualizing_ethanol_2, visualizing_galaxy_2, visualizing_hamster_2, visualizing_livestock_2, visualizing_slope_2, visualizing_soil_2, volcanoes-a1_1, volcanoes-a2_1, volcanoes-a3_1, volcanoes-a4_1, volcanoes-b1_1, volcanoes-b2_1, volcanoes-b3_1, volcanoes-b4_1, volcanoes-b5_1, volcanoes-b6_1, volcanoes-c1_1, volcanoes-d1_1, volcanoes-d2_1, volcanoes-d3_1, volcanoes-d4_1, volcanoes-e1_1, volcanoes-e2_1, volcanoes-e3_1, volcanoes-e4_1, volcanoes-e5_1, vote_1, vowel_2, vowel_3, wall-robot-navigation_1, wall-robot-navigation_2, wall-robot-navigation_3, water-treatment_2, wdbc_1, white-clover_1, white-clover_2, wholesale-customers_1, wilt_1, wind_2, wind_correlations_2, wine_2, wine-quality-white_1, wisconsin_2, witmer_census_1980_2, yeast_1, zoo_1, zoo_2.


People


Publications


Acknowledgements

This work is supported by the European Commission through the Erasmus Mundus Joint Doctorate "Information Technologies for Business Intelligence - Doctoral College" (IT4BI-DC).


Last update: 2018/01/08