Running time is measured as the log (base 2) of the ratio of the given algorithm's time on a dataset to the minimum time on that dataset over all algorithms. For example, if the fastest algorithm completed training / testing on a given dataset in a total of two minutes, then an algorithm completing training on the same dataset in eight minutes would receive a score of two. The final value is averaged over the 20 "slowest" datasets, to make sure the time isn't dominated by factors other than the actual learning.
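This score can be sketched in a few lines. The function below is illustrative, not the benchmark's actual code; the input layout (`{dataset: {algorithm: seconds}}`) and the use of the fastest time as the "slowness" of a dataset are assumptions.

```python
import math

def time_scores(times_by_dataset, n_slowest=20):
    """Hypothetical sketch of the running-time score described above.

    times_by_dataset: {dataset: {algorithm: seconds}} (assumed shape).
    Returns {algorithm: score}, where each score is log2(time / fastest
    time) averaged over the n_slowest datasets.
    """
    # Rank datasets by their fastest time (slowest first) and keep the top n.
    slowest = sorted(times_by_dataset,
                     key=lambda d: min(times_by_dataset[d].values()),
                     reverse=True)[:n_slowest]
    algos = set().union(*(times_by_dataset[d] for d in slowest))
    scores = {}
    for a in algos:
        # log2 of (this algorithm's time / fastest time), per dataset.
        per_ds = [math.log2(times_by_dataset[d][a] /
                            min(times_by_dataset[d].values()))
                  for d in slowest]
        scores[a] = sum(per_ds) / len(per_ds)
    return scores
```

On the two-minute / eight-minute example above, `time_scores({"d1": {"fast": 2.0, "slow": 8.0}}, n_slowest=1)` gives the slow algorithm a score of 2.0 and the fast one 0.0.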

Here, we use techniques from submodular optimization to select a sequence of algorithms that will capture value, as we've defined it, as quickly as possible. Put another way, we want to do as few experiments as possible and still get maximum performance out of our modeling process. The way to do this is to select high-performing methods whose performance is relatively uncorrelated with that of the methods already chosen. Luckily, submodular optimization gives strong guarantees of near-optimality for a sequence generated by a fairly simple greedy algorithm.
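The "fairly simple algorithm" in question is the standard greedy procedure for monotone submodular maximization: repeatedly add the item with the largest marginal gain. The sketch below is generic, and the toy value function (coverage of per-dataset "wins", which is submodular) is an illustration, not the benchmark's actual objective.

```python
def greedy_sequence(candidates, value_of_set, k):
    """Greedily pick k items, each maximizing marginal gain.

    For a monotone submodular value_of_set, the greedy solution is
    guaranteed to be within a (1 - 1/e) factor of the best size-k set.
    """
    chosen = []
    for _ in range(k):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: value_of_set(chosen + [c]) - value_of_set(chosen))
        chosen.append(best)
    return chosen

# Toy example: each algorithm "wins" on a set of datasets, and the value
# of a selection is how many datasets it covers. Note that after picking
# "A", the greedy step prefers "B" (adds dataset 4) over the individually
# comparable "C" (fully redundant with "A") -- high performance combined
# with uncorrelated strengths.
wins = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}
value = lambda s: len(set().union(*(wins[a] for a in s)))
```

With these definitions, `greedy_sequence(["A", "B", "C"], value, 2)` returns `["A", "B"]`.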

Here we replicate earlier results showing that some metrics disagree with most other metrics at different rates. We say a metric *m* has an "Error Case" if, given two algorithms evaluated on the same test data, *m* suggests that (w.l.o.g.) the former algorithm is better, but only a *minority* of size *k* or less of the other metrics agree with *m*. You can select a value for *k* in the field above. Obviously, the more often this happens, the less reliable a metric is in a concrete sense.

Note that this is based on a total of comparisons, and all metrics agree in only (%) of them. Thus, it is not difficult to make a spurious comparison of algorithms if one is only using a single metric.
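Counting Error Cases can be sketched as follows. The encoding of each comparison as a mapping from metric name to +1 (first algorithm better) or -1 (second better) is an assumption for illustration, not the benchmark's data format.

```python
def error_cases(preferences, k):
    """Sketch of counting "Error Cases" per metric, as defined above.

    preferences: list of comparisons; each comparison maps a metric name
    to +1 (first algorithm better) or -1 (second better) -- an assumed
    encoding. A metric m has an Error Case on a comparison when at most
    k of the other metrics agree with m's direction.
    """
    counts = {m: 0 for m in preferences[0]}
    for comp in preferences:
        for m, sign in comp.items():
            # How many of the *other* metrics point the same way as m?
            agree = sum(1 for other, s in comp.items()
                        if other != m and s == sign)
            if agree <= k:
                counts[m] += 1
    return counts
```

For example, with a single comparison `{"accuracy": 1, "f1": 1, "brier": -1}` and `k = 0`, only `"brier"` records an Error Case, since no other metric agrees with it.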

Here we generate a random paper abstract for your favorite
learning algorithm, backed by hard experimental data.
Obviously, this is tongue-in-cheek, but it illustrates a
valuable point: *Any* machine learning algorithm can
be made to look better than *any other*, no matter
their relative strengths, by choosing a dataset and metric
that have a favorable outcome. Even if one uses only
well-known metrics and datasets, this is, unfortunately,
still possible.

To further illustrate this, here is a heat map of the performances of all algorithms against all datasets used in this test, aggregated over all metrics. If algorithm testing were "dataset neutral", one would see vertical stripes of constant color (indicating that the performance of an algorithm on all datasets relative to other algorithms was always the same). We can see that this is not the case, especially with outlier datasets like "covertype", "occupancy", "mushroom", and "splice". Note that the numbers on this plot can be compared with the overall performance numbers.

Moreover, algorithm comparison isn't *metric neutral*
either. Here's the same chart comparing algorithms on
different metrics and aggregating across all datasets. Here
the expected vertical lines are more prominent, but there
are still metrics that appear to give spurious results,
especially the "Brier Score".

The upshot of all this is that one must evaluate an
algorithm on many different datasets and compare on many
different metrics if one wants to get a general idea of an
algorithm's performance. Here, we take bootstrap samples of
the available datasets and metrics and do a pairwise
comparison of two random algorithms on the bootstrapped
subsets. We do this 50000 times for each square in the
grid, then measure the percentage of the time the comparison
on these subsets of dataset and metric would lead us to
a *different* conclusion than using all metrics and
datasets.
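The bootstrap procedure above can be sketched as follows. The input layout (`{(dataset, metric): score}` per algorithm, higher is better) and the use of a simple mean difference to decide which algorithm "wins" are assumptions for illustration.

```python
import random

def disagreement_rate(perf_a, perf_b, n_datasets, n_metrics, trials=50000):
    """Sketch of the bootstrap disagreement estimate described above.

    perf_a, perf_b: {(dataset, metric): score} for two algorithms
    (assumed layout; higher is better). Returns the fraction of trials
    in which a bootstrap sample of datasets and metrics reverses the
    conclusion reached using all datasets and all metrics.
    """
    datasets = sorted({d for d, _ in perf_a})
    metrics = sorted({m for _, m in perf_a})

    def mean_diff(ds, ms):
        diffs = [perf_a[(d, m)] - perf_b[(d, m)] for d in ds for m in ms]
        return sum(diffs) / len(diffs)

    full_sign = mean_diff(datasets, metrics) > 0
    flips = 0
    for _ in range(trials):
        # Bootstrap: sample datasets and metrics with replacement.
        ds = random.choices(datasets, k=n_datasets)
        ms = random.choices(metrics, k=n_metrics)
        if (mean_diff(ds, ms) > 0) != full_sign:
            flips += 1
    return flips / trials
```

When one algorithm dominates on every dataset and metric, no subset can flip the conclusion and the rate is zero; disagreement arises only when the two algorithms trade wins across cells.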

It seems to take 10-15 datasets and at least three different metrics to get this probability down to even 10%, which is obviously still quite high.

We measure performance here using 10-fold cross validation repeated five times on each dataset. The datasets are taken mainly from the UCI, LIBSVM, and MULAN dataset repositories. For the multilabel datasets in the MULAN repository, we select a few labels from each dataset and use them to create several binary problems. We give a full list of all datasets used at the bottom of this page.
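The split-generation part of this protocol can be sketched in a few lines; the exact shuffling and fold assignment below are an assumption, not the benchmark's actual implementation.

```python
import random

def repeated_kfold(n, k=10, repeats=5, seed=0):
    """Sketch of repeated k-fold cross-validation index generation.

    Yields (train_idx, test_idx) pairs: each repeat reshuffles the n
    examples and partitions them into k disjoint test folds, so every
    example appears in exactly one test fold per repeat.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(k):
            test = idx[f::k]  # every k-th shuffled index forms this fold
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            yield train, test
```

With `k=10` and `repeats=5`, each algorithm is trained and evaluated 50 times per dataset.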

We evaluate each algorithm on each dataset using ten different metrics:

We do this to approximate the notion of different loss functions for the same data. That is, depending on the goals of a user, one classifier's performance might look better than another's, depending on the number and distribution of its mistakes. Check out this paper for examples of how much this can affect the notion of which classifier in a given pair is better.

The plotted values for each classifier are the total amount of value captured by that classifier. Specifically, for a given metric on a given dataset, the value captured is the fraction of the performance range that the classifier achieves, where the bottom of the range is the bottom quartile of all performances on that dataset and metric, and the top is the maximum performance. This is mostly in keeping with the analysis done in Caruana's benchmarking paper. The final value is an average over all datasets and metrics.
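The value-captured computation can be sketched as below. The input layout (`{(dataset, metric): {algorithm: score}}`), the clipping to [0, 1], and the specific quartile estimator are assumptions for illustration.

```python
import statistics

def value_captured(perf, algorithm):
    """Sketch of the "value captured" average described above.

    perf: {(dataset, metric): {algorithm: score}} (assumed layout;
    higher is better). For each (dataset, metric) cell, the captured
    value is the algorithm's position within [bottom quartile, max]
    of all scores, clipped to [0, 1]; the result is the average over
    all cells.
    """
    fractions = []
    for cell in perf.values():
        scores = sorted(cell.values())
        lo = statistics.quantiles(scores, n=4)[0]  # bottom quartile
        hi = max(scores)
        if hi == lo:
            fractions.append(1.0)
            continue
        frac = (cell[algorithm] - lo) / (hi - lo)
        fractions.append(min(1.0, max(0.0, frac)))
    return sum(fractions) / len(fractions)
```

Under this scheme, the best algorithm in a cell captures 1.0 of the value, and anything at or below the bottom quartile captures 0.0.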

Here is a full list of the datasets used for the benchmark: