

Tips and tricks on writing experiments

Posted on October 13th, 2019 in Research by Sergi Nadal

Presenting sound and convincing experimental results is a must in any paper. Besides providing practical evidence that your assumptions are correct, writing experiments is a great way to find bugs and to make you rethink an algorithm that made sense on the blackboard. The objective of this blog post is to share some tips and tricks on the process of writing experiments, drawn from my own personal experience. I will focus specifically on my main line of research, which is devoted to query rewriting algorithms for data integration systems; however, I believe many of the ideas apply to other areas of data management and computer science in general. I generally write Java code, hence I might discuss some specific tools that are only available for the JVM, but I assume equivalent ones exist for other languages. Naturally, any feedback is welcome, so don't hesitate to drop me an email if you have other interesting recommendations.

Ask yourself: do I need to run experiments?

Writing, running, analyzing and preparing experiments is an extremely tedious task, so the first question you should ask yourself is whether your paper really needs experiments at all. There is a long-standing discussion on whether every CS paper should provide experimental results or not. In a very interesting article, Jeffrey Ullman states the following:

We should not accept experiments as a substitute for a more careful and general analysis, unless there really is no way to parameterize the input space suitably. And we should not accept experiments on contrived, specialized data under almost any circumstances. As authors, we should stop thinking of experiments as a substitute for analysis and deep understanding of why our algorithms work better than others that have been proposed...

I tend to agree with this opinion, and as a rule of thumb I believe one should not write experiments only to observe the "trend" an algorithm follows and then place it into some computational class based on those observations. A much better approach is to carefully analyze your algorithms and try to find reductions to well-known problems that fall in P or NP (assuming P and NP are different). For the brave ones, you don't even need to stop there, as there is a whole research line on fine-grained reductions (see last year's ICDT keynote).

If a deep analysis of your approach does not suffice and experimental results such as execution time matter, then it makes sense to run experiments. An interesting example of that situation is Alon Halevy's paper presenting the bucket algorithm for answering queries using views. Although the problem of answering queries using views is well known to be NP-complete, the experiments in the paper show how the algorithm gracefully deals with up to 100 sources in a few seconds. This is a remarkable, and probably sufficient, result in the context of the WWW in 1996.


Generation of synthetic (yet realistic) workloads

In most cases, trying to cover the complete search space is infeasible. You should decide which parameters define your search space, and then generate workloads based on those parameters. It is very convenient to have methods that generate a single workload from the full set of input parameters that define your search space.

Additionally, there is a plethora of solutions out there that might be suitable for your case (e.g., a library implementing the Erdős–Rényi model for random graph generation, or a library to generate realistic time series data), so you can forget about manually creating test instances of the search space. Sometimes it is just a matter of converting such structures to your own; in other cases you might end up developing your own generators (as I had to), as sketched below.
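To make this concrete, a hand-rolled generator can be as simple as a class whose constructor fixes the random seed and whose generation method takes exactly the parameters that span the search space. The following is a minimal, hypothetical sketch (not my actual generator) that produces random graphs following the Erdős–Rényi G(n, p) model:

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class WorkloadGenerator {

    private final Random random;

    public WorkloadGenerator(long seed) {
        //fixing the seed makes every generated workload reproducible
        this.random = new Random(seed);
    }

    //generates a G(n, p) graph: each of the n*(n-1)/2 possible undirected
    //edges is included independently with probability p; n and p are two
    //of the parameters that span the search space
    public List<int[]> generateGraph(int n, double p) {
        List<int[]> edges = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (random.nextDouble() < p) {
                    edges.add(new int[]{i, j});
                }
            }
        }
        return edges;
    }
}

Because the seed is explicit, re-running a single configuration reproduces exactly the same workload, which is handy when you need to debug one particular data point.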


Profile your code

Even though you might be dealing with an NP-complete problem, that does not mean you should settle for "bad" numbers once parameters start growing. In many cases, the problem might lie in other parts of your code (or, even worse, someone else's) that are superfluous to the approach. A useful kind of tool I found to detect these problems in my code is a code profiler. In the case of Java (I use VisualVM, which nicely integrates with IntelliJ IDEA), a profiler keeps track of the memory usage (among other metrics) and the time devoted to each method of the code being run. You just need to run the experiment with large enough parameter values and let the profiler show you where time and memory are actually being spent.
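Once the profiler points at a suspect method, it is often worth cross-checking with a quick manual measurement before refactoring anything. A minimal sketch of such a check (the timed helper and the wrapped call are hypothetical, not part of any library):

//hypothetical helper that times a single code region and prints the result
static <T> T timed(String label, java.util.function.Supplier<T> body) {
    long start = System.nanoTime();
    T result = body.get();
    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
    System.out.println(label + " took " + elapsedMs + " ms");
    return result;
}

//usage: wrap only the call the profiler flagged as a hotspot, e.g.
//Set<Triple> triples = timed("loadTriples", () -> loadTriplesForWrapper(wrapper));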

In my case, I was able to detect that more than 90% of the time was devoted to running SPARQL queries over Apache Jena's storage engine TDB, which contained only small amounts of metadata. After detecting this, I easily solved the issue by defining in-memory data structures that contain all the information my algorithm needs. The beginning of your algorithm might then look something like this:

import java.util.Map;
import java.util.Set;
import com.google.common.collect.Maps;
import com.google.common.collect.Sets;
//Triple comes from Apache Jena; Wrapper is a project-specific class and Tuple2 a simple pair type

//contains all the triples from the global graph that a wrapper covers
private static Map<String,Set<Triple>> allTriplesPerWrapper = Maps.newHashMap();
//contains all IDs that a wrapper is covering in the query
private static Map<Wrapper,Set<String>> coveredIDsPerWrapperInQuery = Maps.newHashMap();
//set of all queried ID features
private static Set<String> queriedIDs = Sets.newHashSet();
//contains the relation attribute - (sameAs) -> feature
public static Map<String,String> featuresPerAttribute = Maps.newHashMap();
//given a feature and a wrapper, it returns the corresponding attribute
private static Map<Tuple2<Wrapper,String>,String> attributePerFeatureAndWrapper = Maps.newHashMap();
//contains the set of features per concept in the query
private static Map<String,Set<String>> featuresPerConceptInQuery = Maps.newHashMap();
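For completeness, the one-off initialization that fills these maps could look as follows. This is only a sketch under my own assumptions: the SPARQL query and the graph layout (one named graph per wrapper) are hypothetical, but the point is to hit TDB once at startup and never again while the algorithm runs.

import org.apache.jena.graph.Triple;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.tdb.TDBFactory;

//runs a single SPARQL query against TDB and caches the triples per wrapper
private static void loadMetadataIntoMemory(String tdbPath) {
    Dataset dataset = TDBFactory.createDataset(tdbPath);
    //hypothetical layout: one named graph per wrapper holding its triples
    String query = "SELECT ?wrapper ?s ?p ?o WHERE { GRAPH ?wrapper { ?s ?p ?o } }";
    try (QueryExecution qe = QueryExecutionFactory.create(query, dataset)) {
        ResultSet rs = qe.execSelect();
        while (rs.hasNext()) {
            QuerySolution sol = rs.next();
            allTriplesPerWrapper
                .computeIfAbsent(sol.getResource("wrapper").getURI(), k -> Sets.newHashSet())
                .add(Triple.create(sol.get("s").asNode(), sol.get("p").asNode(), sol.get("o").asNode()));
        }
    }
}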


Systematically running experiments

Running experiments is extremely time consuming, and during the process many things can go wrong. The main goal here is to avoid having to start the whole process over whenever you stumble upon an unexpected issue. My approach is to write scripts that systematically run single experiments under given combinations of parameters. This way, if anything goes wrong I can "easily" resume at the point I want. A typical script looks as follows.

for UPPER_BOUND_FEATURES_IN_G in 5 10 20
do
  for N_EDGES_IN_QUERY in 2 4 6 8 10 12
  do
    for N_WRAPPERS in 2 4 8 16 32 64 128
    do
      for N_EDGES_COVERED_BY_WRAPPERS in 2 4 6 8 10 12
      do
        if [ $N_EDGES_COVERED_BY_WRAPPERS -le $N_EDGES_IN_QUERY ]; then
          for COVERED_FEATURES_QUERY in 0.2 0.4 0.6 0.8 1
          do
            for COVERED_FEATURES_WRAPPER in 0.2 0.4 0.6 0.8 1
            do
              for RUN in 1 2 3 #three executions per configuration
              do
                timeout 10m java -jar build/libs/MetadataStorage-0.0.1-SNAPSHOT.jar 50 $UPPER_BOUND_FEATURES_IN_G $N_EDGES_IN_QUERY $N_WRAPPERS $N_EDGES_COVERED_BY_WRAPPERS $COVERED_FEATURES_QUERY $COVERED_FEATURES_WRAPPER
                sysctl -w vm.drop_caches=3 > /dev/null #free memory
                sync && echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null #free memory
              done
            done
          done
        fi
      done
    done
  done
done


Note that I execute each configuration three times to account for variability (the number of executions you need varies greatly among problems). Also, between executions I force the system to free all the memory and caches that the previous execution might have left behind. Finally, the command I use to keep my script running for days is the following:

nohup script.sh &
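The jar invoked by the script receives the parameters as positional arguments, and the easiest way to collect the results you will analyze later is to have it append one row per run to a CSV file. A minimal sketch of that last step (the file name and the measured value are placeholders, not my actual implementation):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ResultsWriter {

    //appends one CSV row containing the run's parameters and its measured time
    public static void appendResult(String[] args, long elapsedMs) throws IOException {
        String row = String.join(",", args) + "," + elapsedMs + System.lineSeparator();
        Files.write(Paths.get("results.csv"), row.getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }
}

Keeping one row per execution (rather than aggregating on the fly) lets you decide later, in the analysis phase, how to average or filter the repeated runs.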


Analyzing results

Once you have successfully executed all experiments, it is time to take a look at the results and draw conclusions from them. However, you might have hundreds or even thousands of lines of experimental data, so where do you start? My recommendation is to use your favorite BI tool to graphically explore the data. These tools make it easy to create tables and charts with dynamic filters, so you can vary the different parameters and see which results are the most meaningful for the paper. I generally use MicroStrategy Desktop; its free version is enough to load a CSV file and define the dimensions and metrics of analysis, and with a few clicks you can get visualizations like the following.

[Example chart. Source: https://apandre.files.wordpress.com/2013/10/e3area.jpg]

Finally, after deciding which data you will present in the paper, it is time to get the charts ready. You want your charts to look professional, so please do not paste cropped Excel charts (or even the ones generated in MicroStrategy). There are tools out there that generate plots that look far better on paper: if you are familiar with Python you can use matplotlib; otherwise you might use gnuplot.