% -----------------------------------------------
% Template for ISMIR 2009
% -----------------------------------------------

\documentclass{article}
\usepackage{ismir09-edited-with-authors,amsmath,cite}
\usepackage[dvips]{graphicx}
\usepackage{booktabs}
\usepackage{url}
\usepackage{ifthen}
\usepackage{verbatim}
\newcommand{\paren}[1]{\left ( #1 \right )}
\newcommand{\brac}[1]{\left [ #1 \right]}
\newcommand{\half}{\frac{1}{2}}
\newcommand{\s}[1]{
  \begin{equation*}
    \begin{split}
      #1
    \end{split}
  \end{equation*}
}
\newcommand{\sn}[2]{
  \begin{equation} \label{#1}
    \begin{split}
      #2
    \end{split}
  \end{equation}
}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pf}[2]{\paren{\frac{#1}{#2}}}
\newcommand{\giv}{\, \lvert \,}
\newcommand{\bhat}{\widehat{\beta}}
\newcommand{\ehat}{\widehat{\epsilon}}
\newcommand{\abs}[1]{\left\vert#1\right\vert}
\newcommand{\floor}[1]{\lfloor #1\rfloor}
\newcommand{\ceiling}[1]{\left\lceil#1\right\rceil}
\newcommand{\set}[1]{\left\{#1\right\}}

\newcommand{\techReport}{1} %Flip bit to include additional material for technical report
\newcommand{\report}[1]
{
\ifthenelse{\equal{\techReport}{1}}{#1}{}
}
\newcommand{\paper}[1]
{
\ifthenelse{\equal{\techReport}{0}}{#1}{}
}

\title{Using Regression to Combine Data Sources for Semantic Music Discovery}
  
 \oneauthor  { } { \\ 
 { }}  % see the ismir09-edited-with-authors.sty file for the author info

\begin{document} 

\sloppy
\maketitle
\begin{abstract}
In the process of automatically annotating songs with descriptive labels, multiple types of input information can be used. These include keyword appearances in web documents, acoustic features of the song's audio content, and similarity with other tagged songs. Given these individual data sources, we explore the question of how to aggregate them. We find that fixed-combination approaches like sum and max perform well but that trained linear regression models work better. Retrieval performance improves with more data sources. On the other hand, for large numbers of training songs, Bayesian hierarchical models that aim to share information across individual tag regressions offer no advantage.
\end{abstract}

\section{Introduction}\label{sec:introduction}

We are interested in developing a \emph{semantic music discovery engine} in which users enter text queries and receive a ranked list of relevant songs. This task requires a \emph{semantic music index}, i.e., a mapping between songs and associated tags. A \emph{tag}, such as ``afro-cuban roots," ``heavy metal," or ``steel-string guitar," is a short text token which describes some meaningful aspect of the music (e.g., genre, instrumentation, emotion, geographical origins). In this paper, our goal will be to compute a real-valued score $\widehat{y}_{st}$ that expresses how strongly tag $t$ applies to song $s$.

There are a number of ways to collect semantic annotations of music. \cite{turnbull08ismir} compare five such approaches: surveys, social tagging, games, web documents, and audio content. Each of these data sources offers a different perspective, and each has its own strengths and weaknesses (e.g., scalability, popularity bias, accuracy), so we may wish to collect information from several of them. The question then becomes how to combine that information into a single score for use in our semantic index.

In Section \ref{Data Sources}, we describe three sources of music information that we have collected: text mining web documents, content-based audio analysis, and collaborative filtering. Section \ref{Combining Methods} describes various approaches for combining these sources, including simple fixed rules, as well as a trained \emph{regression} model in which combination weights depend on the quality and sparsity of the input data. We explore both ordinary linear and logistic regression, as well as Bayesian hierarchical models that aim to share information across tags. Section \ref{Experimental Setup} describes our experimental setup, which includes a ground-truth corpus of 10,870 songs for two vocabularies (71 Genre tags and 151 Acoustic tags) collected from Pandora's Music Genome
Project.\footnote{See \url{http://www.pandora.com/mgp.shtml}} Section \ref{Conclusions} concludes.

\section{Music Information Sources}\label{Data Sources}

We collect semantic-annotation information from three sources: web documents (WD), content-based audio analysis (CB), and collaborative filtering (CF). For each song $s$ and tag $t$, we use these sources to generate scores---denoted $x_{st}^\text{WD}$, $x_{st}^{\text{CB}}$, and $x_{st}^\text{CF}$, respectively---indicating how well $t$ describes $s$.

\subsection{Web Documents}\label{Web Documents}

Tags that appropriately describe a song will tend to appear in association with the song's name in natural-language text documents. We exploit this fact by downloading from the web pages that describe the song and counting how often the proposed tag appears within them.

Given a song $s$, we generate a database $D_s$ of documents by querying Google for ``song name" ``artist name" in lower-case (e.g., ``enjoy the silence'' ``depeche mode''). We download all hits in the top 10
%We download the top 10 hits\footnote{In some cases, fewer than 10 pages were returned. The average was 9.8 with a standard deviation of 1.7.}
and clean the HTML files into raw text. This was done for a total of 9,359 songs. Then, for each tag $t$, we compute
\s{
x_{st}^\text{WD} = \sum_{d \in D_s} \frac{n_{td}}{N_{td}},
}
where $n_{td}$ is a measure that roughly expresses how many times $t$ actually appeared in document $d$, and $N_{td}$ is the number of times $t$ \emph{could have} appeared in $d$. $N_{td}$ is just $\abs{d} / \abs{t}$, the number of words in $d$ divided by the number of words in $t$. $n_{td}$ is a bit more complicated. For long tags, such as ``call and answer vocal harmony (antiphony)," positional searches for the entire phrase would not work well. On the other hand, searching for the appearance of any of the words in $t$ would yield too many hits. We compromise by computing $n_{td}$ as the minimum number of hits for any word, taken over all words in $t$. In the case when the words in $t$ appear in $d$ only in the correct order, $n_{td}$ will in fact be equal to the number of occurrences of the full phrase $t$.

\subsection{Content-Based Audio Analysis}\label{Content-Based Analysis}

A second potential source of semantic information about a song is the audio content itself. For this purpose we use the supervised multiclass labeling (SML) model recently proposed by \cite{turnbull08}. 

The audio track of a song is represented as a bag of feature vectors $\mathcal{X} = \{\mathbf{x}_{1}, \ldots, \mathbf{x}_{T}\}$, where each $\mathbf{x}_{i}$ is a feature vector that represents a short-time segment of audio, and
$T$ depends on the length of the song. We use the expectation maximization (EM) algorithm to learn a
\emph{song-specific} Gaussian mixture model (GMM) distribution over each $\mathcal{X}$. Then, for each tag in our vocabulary, we learn a \emph{tag-specific} GMM using the Mixture Hierarchies EM algorithm \cite{vasconcelos01}. This algorithm combines the set of song-specific GMMs for all the songs that have been associated with the tag. Given a novel song $s$, we compute the likelihood that its bag of feature vectors $\mathcal{X}_s$ would have been generated by each of the tag GMMs. Normalizing these likelihoods using the technique described in \cite{turnbull08} yields our set of scores $x_{st}^\text{CB}$, which can be interpreted as the parameters of a multinomial distribution over the vocabulary of tags.

\report{We use the popular Mel frequency cepstral coefficients (MFCCs) as our audio feature representation since it was incorporated into all of the top performing autotagging systems in the 2008 MIREX tag classification task \cite{downie, turnbull08, mandel08, eck07}. MFCCs are loosely associated with the musical notion of timbre (``color") of the music because they are a low-dimensional representation of the frequency spectrum of a a short-time audio sample. For each monaural song in the data set, sampled at 22,050 Hz, we compute the first 13 MFCCs for each half-overlapping short-time ($\sim$23 msec) window from 6 five-second clips spaced at uniform intervals over the length of the song. Over the time series of audio segments, we calculate the first and second instantaneous derivatives (referred to as \emph{deltas}) for each MFCC. This results in about 5,000 39-dimensional MFCC+delta feature vectors per 30 seconds of audio content. We summarize an entire song by modeling the distribution of its MFCC+delta features with a 4-component GMM.  We model each tag with an 8-component GMM.}


\subsection{Collaborative Filtering}

One additional source of semantic information is user playlists: If two songs appear together in a large number of listener collections, one possible reason is that the songs share certain attributes (say, ``punk influences") that the listeners enjoy. This suggests the idea of \emph{tag propagation}: Find songs that tend to co-occur in playlists, and transfer tags from one of them to the other. A more robust approach is to find the collection of $k$ songs ($k=32$ here) that have the strongest co-occurence score with a given song $s$. For each tag $t$, we take the association $x_{st}^\text{CF}$ of $s$ with $t$ to be the fraction of those 32 songs to which $t$ applies. We set this number to 0 if the fraction is below a threshold of 0.3. The reasons for these choices, as well as further details on the entire data-collection process and choice of tag sets, appear in \cite{kim2009}.

Our data consist of 400,000 user music libraries from last.fm, where a \emph{library} is taken to be the set of items that a user listens to at least 1\% of the time. It turns out that data at the song level is too sparse to generate meaningful co-occurence statistics, so we instead work at the artist level. We say that a tag applies to an artist if the tag applies to any of that artist's songs. At the end of the propagation process, we transfer an artist's score for a tag to each of its songs. We find the 32 closest artists using the following similarity score. Between artists $i$ and $j$, we take
\s{
\text{sim}(i,j) = \frac{p(i,j)}{\sqrt{ p(i)  p(j) }},
}
where $p(i,j)$ is the fraction of all artist co-occurrences represented by artists $i$ and $j$, and $p(i)$ is the fraction of all co-occurrences containing artist $i$.


\section{Combining Methods}\label{Combining Methods}

Given the data sources described in Section \ref{Data Sources}, how can we aggregate them? This general question has been well studied and is known variously as \emph{combining expert judgments} (e.g., \cite{morris1977cej, jacobs1995mce}), \emph{multi-sensor data fusion} (e.g., \cite{datafusion2007}), \emph{information fusion} (e.g., \cite{ross2003ifb}), or \emph{combining classifiers} (e.g., \cite{kittler1998cct, bayesianCombo2007}). Rather than reviewing the entire body of literature on the subject, we focus on two of the most basic approaches: Fixed-combination rules and trained combiners, specifically regression.

% Return to kittler1998cct, which says in the abstract: ``We also show that the two theoretical frameworks can be used for devising fusion strategies when the individual experts use features some of which are shared and the remaining ones distinct. We show that in both cases (distinct and shared representations), the expert fusion involves the computation of a linear or nonlinear function of thea posteriori class probabilities estimated by the individual experts. Classifier combination can therefore be viewed as a multistage classification process whereby thea posteriori class probabilities generated by the individual classifiers are considered as features for a second stage classification scheme. Most importantly, when the linear or nonlinear combination functions are obtained by training, the distinctions between the two scenarios fade away, and one can view classifier fusion in a unified way."

\subsection{Fixed Combiners}\label{Fixed Combiners}

\emph{Fixed combining rules} take the output score $\widehat{y}_{st}$ to be a simple function of the input scores: e.g., max, min, median, sum, or product \cite[sec. 3]{duin2002cct}. Usually the input scores $x^i_{st}$, with $i \in \set{\text{WD}, \text{CB}, \text{CF}}$, are calibrated so that they correspond to confidences or probabilities $p_{st}^i$ that $t$ applies to $s$ given the source. This can be done, for instance, by standardizing the input scores to have mean 0 and variance 1 and then taking
\s{
p_{st}^i = \frac{1}{1+ \exp\paren{ - \alpha  x_{st}^i } }
}
for some $\alpha$ \cite[sec. 4.1]{duin2002cct}. We use $\alpha=1$ in this paper.

A disadvantage of this technique, however, is that each source is treated on equal footing, when in fact, one of our sources may be far more trustworthy or better informed \cite[sec. 1]{duin2002cct}. One method that overcomes this limitation is Bayesian Model Averaging (BMA) (e.g., \cite{hoeting1999bma}), which assumes that one of the data sources is the ``correct" source and takes the final probability to be a weighted combination of the input probabilities:
\s{
p_{st}^{\text{all sources}} = \sum_{i \in \set{\text{WD}, \text{CB}, \text{CF}}} p_{st}^i p_i,
}
where $p_i$ is the probability that source $i$ is correct. As \cite[sec. 1]{ghahramani2003bcc} point out, this assumption is often unrealistic, as the truth about whether a tag applies to a song needn't be captured by exactly one of our data sources. Still, the idea of taking our final score $\widehat{y}_{st}$ to be a weighted combination of the input scores---
\sn{reg-eqn}{
\widehat{y}_{st} = \sum_{i \in \set{\text{WD}, \text{CB}, \text{CF}}} \beta_{t}^i x_{st}^i
}
for some weights $\beta_{t}^i$---does seem like a natural way to account for the differential predictive value of different inputs. The question is how to determine the weights.

\subsection{Trained Combiners}

If we have training data for a subset of songs,\footnote{Another possible scenario is that, rather than having ground-truth labels for a subset of our songs, we have data that applies to all of our songs but is weakly labeled, i.e., not every song that applies for a given tag is labeled as such. If our input data sources are less sparse, we can use them to ``fill in zeros" in the ground truth while preserving the labels that the ground truth had already.} the obvious answer is to use supervised learning. This is the \emph{trained combiners} approach advocated in \cite{duin2002cct}. Indeed, \eqref{reg-eqn} has the form of a linear-regression model, and we can determine the weights of the sources just by treating them as input features and computing their regression coefficients. %Sources that are more predictive of the ground truth $y_{st}$ will be more useful features and will be given larger weights.

We try both linear and logistic regression, predicting the ground truth $y_{st} \in \set{0,1}$ by the individual scores $x_{st}^i$, as well as an intercept and possibly other features of interest (see Section \ref{Setup: Regression Models}). We take our predicted values $\widehat{y}_{st}$ to be real-valued so that we can more finely rank-order songs than with 0/1 labels. Regression is a convenient combination approach because it potentially allows us to use a number of standard statistical tools: p-values for the significance of regression coefficients, prediction intervals for our output scores, model selection based on residual sum of squares, and many more advanced techniques.

\subsection{Hierarchical Regression Models}

One such technique is borrowing of information across tags. Each tag has its own regression model, but we might suspect that these models share significant structure: For instance, if collaborative filtering tends to be a highly predictive source, we would expect its coefficient to be consistently large. And the linear combination of sources that best predicts the tag ``traditional country" is probably similar to the one that best predicts ``contemporary country."

One way to capture this intuition is with a Bayesian hierarchical linear model (e.g., \cite{lindley1972bel}). We'll illustrate this concept in the case of a single regression coefficient $\beta_t$ for a single data source $x_{st}$ without an intercept, but similar equations apply in the multivariate setting. Independent regression across the $T$ tags assumes
\sn{level1}{
y_{st} = \beta_t x_{st} + \epsilon_{st}, \ \ \epsilon_{st} \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_t^2), \ \ t = 1, \ldots, T
}
for some variances $\sigma_t^2$, with no relationship among the $\beta_t$ values. We call this the \emph{Independent Linear} model. The \emph{Independent Logistic} model is the same, except that $y_{st}$ is replaced by the log-odds $\ln \paren{ \frac{p_{st}}{1-p_{st}}}$, where $p_{st}$ is the probability that $y_{st} = 1$.

In a hierarchical model, we assume in addition to \eqref{level1} that the $\beta_t$'s share a common structure:
\sn{level2}{
\beta_t = \overline{\beta} + v_t, \ \ \ v_t \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2), \ \ \ t = 1, \ldots, T.
}
For instance, if we had three tags with independent regression coefficients of 0.1, 0.2, and 0.3, it might be reasonable to suppose that $\overline{\beta} \approx 0.2$ with $\sigma \approx 0.1$. We can further assume a prior over $\overline{\beta}$ and perform Bayesian inference to estimate the parameters. The multivariate version of this model we call \emph{Hierarchical Linear}, and the corresponding version in which $y_{st}$ is replaced  by the log-odds that $y_{st} = 1$ we call \emph{Hierarchical Logistic}.

We might also assume that $v_t$ in \eqref{level2}, rather than being normally distributed, is drawn from a mixture of normal distributions. For instance, perhaps the web-document source is much better at predicting genre labels than acoustic ones, so that its $\beta_t$ values for genre tags cluster around 0.2, say, while its $\beta_t$ values for acoustic tags cluster around 0.05. In that case, $\beta_t$ could be modeled by taking $\overline{\beta} = 0.05$, with $v_t$ having peaks at 0 and 0.15. We call this model \emph{Mixture Linear$_k$}, where $k$ is the number of centers.\footnote{See Chapters 3 and 5 of \cite{rossi2003bsa} for details on each of these three hierarchical models in a more general setting.}

\subsection{Regression Models}\label{Regression Models}\label{Setup: Regression Models}

Equation \eqref{reg-eqn} suggests the basic regression model to use, although in practice we include an intercept, which we find always to be highly statistically significant. We can also regress on just one or two of the main sources at a time.

A nice aspect of using regression is that we can include extra features in our model (assuming we expect they'll contribute useful information rather than meaningless noise that will lead us to overfit). In particular, we include \emph{scrobble counts} from last.fm as a measure of the popularity of the artist who wrote the given song. If we suspected that more popular songs had more nonzero $y_{st}$ values in our ground-truth, we would expect this popularity term to have a high positive regression coefficient. Including the term could be seen as a way of controlling for popularity bias if we omit the popularity feature when we predict $\widehat{y}_{st}$ for novel songs. We can also include terms for the interaction of data sources with popularity. A positive interaction coefficient would indicate that the data source gives a more confident prediction that a tag applies to a song when the song's artist is popular.

\section{Experimental Setup}\label{Experimental Setup}

\subsection{Data Set}\label{Data Set}

Our data set consists of 10,870 songs representing 19 top-level genres (e.g., rock, classical, electronic) and 180 subgenres (e.g., grunge, romantic period opera, trance). We have approximately 60 songs per subgenre. Each song is associated with one or more genres and one or more subgenres. For each song, we also attempt to collect between 2 and 10 acoustic tags from Pandora's Music Genome Project vocabulary. This vocabulary consists of over 1,000 unique tags like ``dominant bass riff," ``gravelly male vocalist," and ``acoustic sonority." These acoustic tags can be thought to be \emph{objective} in that two trained experts can annotate a song using the same tags with high probability \cite{pandoraNotes}.

\subsection{Cross-Validation Setup}\label{Cross-Validation Setup}

We evaluate the retrieval performance of our combined scores using five-fold cross-validation on the Pandora data set. Ordinarily, this would involve training our regression  model on 4/5 of the data and testing on the remaining 1/5. However, we need to be careful here, because our content-based data source also trains on the Pandora data set. The danger is that the content-based system may overfit the training data, and because our regression model would be using the same training data, the model might overweight the content-based source. \cite[sec. 5]{duin2002cct} notes this problem and suggests that it be addressed by dividing the training set into two parts, which we do as follows.

We divide the songs into five partitions, each with roughly 2,000 songs. We apply an \emph{artist filter} to the partitions, with all of the songs by an artist appearing in a single fold, to avoid overfitting our model to the particular artists that appear in our training set. On three of the partitions we train the content-based system, using it to then obtain predictions for the songs in the remaining two. We use one of those partitions (roughly 2,000 songs) to train our regression model, which then makes its predictions on the final partition. We then cycle this process five times. The reason for the uneven split between the two training sets is that the content-based system needs to learn many more parameters than our regression model, which typically has at most five coefficients.

\subsection{Tag Pruning}\label{Tag Pruning}

Some tags are labeled with too few songs to be useful for training when we divide the songs into five partitions, so we prune them. In particular, the content-based training considers only tags that have at least 20 positive instances in the ground truth over each possible set of three partitions on which to train. In addition, our regression model requires that each single partition have at least one positive ground-truth song (since it would be trivial to train a model when the $y_{st}$'s are all 0) and at least one positive song in each of the three main data sources. After pruning we are left with 71 Genre tags and 151 Acoustic tags.

\subsection{Implementation Details}\label{Implementation Details}

Regression works best when the features are roughly normally distributed, so we transform some of the input scores for this purpose. For popularity counts, which range anywhere from 1 to over 15 million, we apply a log transformation. For the web-document source, which is based on count data, we apply a square-root transformation \cite[p. 84]{faraway2002pra}. We then standardize each data source by subtracting the mean and dividing by the standard deviation for a given tag. The $x_{st}^i$'s referred to in Section \ref{Fixed Combiners} are these standardized values.

For a small number of tags, $\beta_{t}^i$ was estimated as negative for one or two of the input data sources. Because we believe that our main three data sources, while potentially unhelpful, should not be anti-predictive of the ground truth, we eliminate negative coefficients by setting them to 0 when they occur. (Making this adjustment results in a small but statistically significant improvement in mean average precision and area under the ROC curve for both Genre and Acoustic tags.) % Note: There are proper ways to do nonnegative regression, but I didn't get a chance to try them. Google "nonnegative least-squares" for sources.
We do allow popularity to have a negative coefficient, and we remove this restriction entirely when considering models with interaction terms.


\subsection{Regression Types}\label{Regression Types}

We implement Independent Linear and Independent Logistic regression using the basic \texttt{lm} and \texttt{glm}  functions of the \texttt{R} language. For the hierarchical regressions, we use the \texttt{bayesm} package \cite{rossi2005bbi}, specifically the \texttt{rhierLinearModel}, \texttt{rhierBinLogit}, and \texttt{rhierLinearMixture} functions for Hierarchical Linear, Hierarchical Logistic, and Mixture Linear$_k$, respectively, with all optional parameters set to their default values. These methods use Markov chain Monte Carlo to sample the entire posterior distribution for the $\beta_t^i$'s given the data, but we simply take our $\beta_t^i$ estimate to be the average of these draws. Performance is good with as few as a few hundred samples, but we find that area under the ROC curve does not level off completely until 5,000 to 10,000 draws. For the results in this paper, we sample 15,000 draws, which takes on the order of 30 minutes with roughly 100 tags and 2,000 songs. A parameter sweep of the number $k$ of means in the Gaussian-mixture prior showed no appreciable differences over the range 2 to 50, so we use $k=2$ as the default.

% todo: I could try looking at the \overline{beta} hyperparameter and the mixture parameters....

\begin{table*}[htdp!]
\centering
\small
\caption{\small Area under the ROC curve, mean average precision, R-precision, and 10-precision for various settings described further in the text. Rows are ordered by average AUC for Genre tags. Means and standard errors are taken over the tags, applied to the averages of five-fold cross-validation. (To compute standard errors with respect to each individual CV fold, divide the reported standard errors by a further $\sqrt{5}$.) The data-source abbreviations are web documents (WD), collaborative filtering (CF), content-based analysis (CB), popularity (P), all three main sources in the model (All3), and interactions with each of the three main sources (I). }
% todo: Tukey test for all-pairs? or R's pairwise.t.test ? What I did isn't valid....!
\begin{tabular}{lcccc|cccc}
\toprule
\multicolumn{9}{c}{Regression Model}\\
\midrule
 &
\multicolumn{4}{c}{71 Genre Tags} &
\multicolumn{4}{c}{151 Acoustic Tags}\\
\midrule
 &
AUC &
MAP &
R-Prec &
10-Prec &
AUC &
MAP &
R-Prec &
10-Prec\\
\midrule
Random &
0.502$\pm$0.003 &
0.09$\pm$0.01 &
0.08$\pm$0.01 &
0.08$\pm$0.02 &
0.508$\pm$0.003 &
0.032$\pm$0.003 &
0.030$\pm$0.003 &
0.03$\pm$0.00\\
WD &
0.666$\pm$0.010 &
0.25$\pm$0.02 &
0.29$\pm$0.02 &
0.47$\pm$0.03 &
0.616$\pm$0.006 &
0.135$\pm$0.007 &
0.181$\pm$0.008 &
0.29$\pm$0.02\\
CF &
0.732$\pm$0.010 &
0.45$\pm$0.02 &
0.45$\pm$0.02 &
0.72$\pm$0.04 &
0.641$\pm$0.008 &
0.154$\pm$0.010 &
0.213$\pm$0.011 &
0.25$\pm$0.02\\
CB &
0.781$\pm$0.014 &
0.23$\pm$0.02 &
0.25$\pm$0.02 &
0.38$\pm$0.03 &
0.836$\pm$0.008 &
0.141$\pm$0.007 &
0.161$\pm$0.008 &
0.19$\pm$0.01\\
WD\&CF &
0.789$\pm$0.010 &
0.50$\pm$0.02 &
0.50$\pm$0.02 &
{\bf 0.74$\pm$0.04} &
0.724$\pm$0.007 &
0.231$\pm$0.010 &
0.280$\pm$0.011 &
0.40$\pm$0.02\\
CB\&WD &
0.819$\pm$0.010 &
0.32$\pm$0.02 &
0.34$\pm$0.02 &
0.53$\pm$0.03 &
0.870$\pm$0.006 &
0.220$\pm$0.009 &
0.246$\pm$0.009 &
0.36$\pm$0.02\\
CB\&CF &
0.853$\pm$0.009 &
0.49$\pm$0.02 &
0.48$\pm$0.02 &
0.73$\pm$0.04 &
0.861$\pm$0.007 &
0.213$\pm$0.010 &
0.244$\pm$0.010 &
0.29$\pm$0.01\\
All3\&P\&I &
0.856$\pm$0.007 &
{\bf 0.52$\pm$0.02} &
0.50$\pm$0.02 &
{\bf 0.74$\pm$0.04} &
0.860$\pm$0.006 &
0.262$\pm$0.010 &
0.288$\pm$0.010 &
0.40$\pm$0.02\\
All3 &
0.871$\pm$0.007 &
{\bf 0.52$\pm$0.02} &
0.50$\pm$0.02 &
{\bf 0.74$\pm$0.04} &
{\bf 0.888$\pm$0.006} &
0.276$\pm$0.010 &
0.298$\pm$0.010 &
{\bf 0.42$\pm$0.02}\\
All3\&P &
{\bf 0.876$\pm$0.007 } &
{\bf 0.52$\pm$0.02} &
{\bf 0.51$\pm$0.02} &
{\bf 0.74$\pm$0.04} &
0.887$\pm$0.006 &
{\bf 0.277$\pm$0.010} &
{\bf 0.299$\pm$0.010} &
{\bf 0.42$\pm$0.02}\\
\midrule
\multicolumn{9}{c}{Combination Method}\\
\midrule
 &
\multicolumn{4}{c}{71 Genre Tags} &
\multicolumn{4}{c}{151 Acoustic Tags}\\
\midrule
 &
AUC &
MAP &
R-Prec &
10-Prec &
AUC &
MAP &
R-Prec &
10-Prec\\
\midrule
Min &
0.658$\pm$0.015 &
0.27$\pm$0.02 &
0.27$\pm$0.02 &
0.60$\pm$0.04 &
0.654$\pm$0.009 &
0.121$\pm$0.006 &
0.161$\pm$0.008 &
0.26$\pm$0.01\\
Product &
0.826$\pm$0.009 &
0.42$\pm$0.03 &
0.41$\pm$0.02 &
0.67$\pm$0.04 &
0.814$\pm$0.006 &
0.197$\pm$0.008 &
0.232$\pm$0.009 &
0.32$\pm$0.01\\
Median &
0.826$\pm$0.009 &
0.43$\pm$0.02 &
0.43$\pm$0.02 &
0.68$\pm$0.04 &
0.820$\pm$0.006 &
0.219$\pm$0.009 &
0.261$\pm$0.009 &
0.35$\pm$0.02\\
Sum &
0.851$\pm$0.007 &
0.44$\pm$0.03 &
0.44$\pm$0.02 &
0.69$\pm$0.04 &
0.847$\pm$0.006 &
0.220$\pm$0.009 &
0.252$\pm$0.009 &
0.34$\pm$0.01\\
Max &
0.856$\pm$0.007 &
0.46$\pm$0.02 &
0.48$\pm$0.02 &
0.59$\pm$0.03 &
0.859$\pm$0.006 &
0.239$\pm$0.009 &
0.274$\pm$0.009 &
0.34$\pm$0.01\\
Ind Log &
0.866$\pm$0.006 &
0.51$\pm$0.03 &
0.50$\pm$0.02 &
0.72$\pm$0.04 &
0.875$\pm$0.005 &
0.266$\pm$0.010 &
0.293$\pm$0.010 &
0.40$\pm$0.02\\
Hier Log &
0.872$\pm$0.006 &
0.51$\pm$0.03 &
0.50$\pm$0.02 &
0.73$\pm$0.04 &
0.883$\pm$0.006 &
0.272$\pm$0.010 &
0.296$\pm$0.010 &
0.40$\pm$0.02\\
Hier Mix &
{\bf 0.876$\pm$0.007} &
{\bf 0.52$\pm$0.02} &
{\bf 0.51$\pm$0.02} &
{\bf 0.74$\pm$0.04} &
{\bf 0.887$\pm$0.006} &
{\bf 0.277$\pm$0.010} &
{\bf 0.299$\pm$0.010} &
{\bf 0.42$\pm$0.02}\\
Hier Lin &
{\bf 0.876$\pm$0.007} &
{\bf 0.52$\pm$0.02} &
{\bf 0.51$\pm$0.02} &
{\bf 0.74$\pm$0.04} &
{\bf 0.887$\pm$0.006} &
{\bf 0.277$\pm$0.010} &
{\bf 0.299$\pm$0.010} &
{\bf 0.42$\pm$0.02}\\
Ind Lin &
{\bf 0.876$\pm$0.007} &
{\bf 0.52$\pm$0.02} &
{\bf 0.51$\pm$0.02} &
{\bf 0.74$\pm$0.04} &
{\bf 0.887$\pm$0.006} &
{\bf 0.277$\pm$0.010} &
{\bf 0.299$\pm$0.010} &
{\bf 0.42$\pm$0.02}\\
\bottomrule
\end{tabular}
\vspace{-3mm}
\label{table}
\end{table*}


% From an email conversation, 6 July 2009:
% Doug: Rich brings up an interesting point about the last three rows. It
%is VERY
%strange that they are all identical give the variance of the
%evaluation
%metrics. Is this perhaps a cut-and-paste bug? Let's discuss.

% Brian: When I ran the original results, I remember noticing how similarly
%those methods performed. However, the scores are not identical in the
%less significant decimal places. For instance, here are the complete
%AUC results for the Pandora Genre:

%Hierarchical Mixture
%AUC = 0.8763721
%Hierarchical Linear
%AUC = 0.8763743
%Independent Linear
%AUC = 0.8763744

%The same is true for the other measures and for the Acoustic results.
%It just goes to show how much the prior has been overwhelmed by our
%large amounts of data.


\section{Results and Discussion}

We assess performance using the four standard information-retrieval metrics listed in Table \ref{table} (see \cite[sec. 8.4]{manning08} for explanation of each). We have also made available\footnote{See \url{http://www.sccs.swarthmore.edu/users/09/btomasi1/combiner/}} a list of the top 5 predicted songs for each tag for purposes of qualitative evaluation.

\subsection{Regression Models}\label{models}

The top half of Table \ref{table} reports the performance of the Independent Linear model on subsets of the data sources, as well as models that include popularity information. The Random method is a regression model in which all sources have coefficients of 0, so that the final ranking of songs is the same as the (randomized) order in which they were initially seen. Each source alone clearly performs better than random, and each addition of a new source results in a statistically significant improvement in AUC.\footnote{This is usually apparent from inspection of standard errors, but we verify it by checking that p-values are less than 0.05 for paired $t$-tests on the per-tag AUC values. In fact, the only pairs between which this fails to hold are (1) CB and WD\&CF for Genre tags, (2) All3 and All3\&P for Acoustic tags, and (3) CB\&CF and All3\&P\&I for both tag types. If we apply a conservative Bonferroni correction for the $\frac{10 \cdot 9}{2}$ pairs of tests, a few more pairs become not significant, including the transition from CB\&CF to All3 for Genre tags.} This is consistent with the fact that the data sources are relatively uncorrelated, having correlation coefficients typically less than 0.3 and often less than 0.1, depending on the tag.

According to the AUC measure, CB is the individually most predictive source, while according to precision, CF is. We suspect this reflects the fact that CB's input representation is dense, providing nonzero scores for 91.2\% of songs for each tag, while CF's input contains mostly zeros, with scores for only an average across tags of 2.4\% of songs. (WD falls in the middle, with nonzero scores for an across-tag average of 13.7\% of songs.) 
% CB: 9909 / 10870 = 91.2%; CF: 262 / 10870 = 2.4%; WD: 1489 / 10870 = 13.7%
When CF has a nonzero value, it really means something, so that CF's top results are very precise. Toward the later end of the ranked results list, however, CF is essentially random, while CB still provides useful information.

It is interesting to observe that CB's advantage over CF in terms of AUC is larger in the case of acoustic tags than genre tags, perhaps because acoustic tags are inherently more predictable by audio content alone.

Popularity data was not especially helpful. While its addition to the three main sources did result in a statistically significant AUC improvement for Genre tags (p-value 0.007), it did not for Acoustic tags (p-value 0.4), and the magnitude of difference was relatively small. In some sense, this is a welcome result, since it suggests that the Pandora labels are not biased very much by whether an artist is well-known. The interaction model contained too many features and tended to overfit, which is unsurprising given the modest usefulness of the main popularity term.
%The interaction model contained too many features and overfit the training data, causing poor generalization. Upon including interactions of popularity with each main data source, some of the coefficients for the main sources themselves became negative, and rarely were the coefficients statistically significant. In view of the modest usefulness of the main popularity term, this behavior is not surprising.


\subsection{Coefficient Magnitudes}\label{Coefficient Magnitudes}

Our default regression model was Independent Linear with the three main data sources, popularity, and an intercept. Averaging the $\beta_{t}^i$'s over all of the tags $t$ gives the following prediction equation for Genre tags (the one for Acoustic tags is similar):
\s{
\widehat{y}_{st} = 0.08 + 0.02 x_{st}^\text{WD} + 0.02 x_{st}^\text{CB} + 0.09 x_{st}^\text{CF} + 0.02 x_{s}^\text{pop}.
}
Because the $x_{st}^i$'s represent the transformed and standardized input values (see Section \ref{Implementation Details}), the standard error for each $\beta_{st}^i$ is roughly the same for a given tag,\footnote{This is only ``roughly" because of small inter-feature correlations. % Here's why: The SE of the beta vector is (X'X)^{-1} \sigma^2. If inter-feature correlations are 0, then the off-diagonals of X'X are 0. The diagonals are all n-1. If there is correlation, then the off-diagonals mess up the symmetry of the sources when the matrix is inverted.
} so that the $t$-statistic of each coefficient is roughly proportional to the coefficient's magnitude. % The fact that $\beta_{st}^\text{CF}$ is more statistically different from 0 than $\beta_{st}^\text{WD}$ and $\beta_{st}^\text{CB}$ can also be seen directly from the p-values of the coefficients; the median over the 71 Genre tags was 0.003 for each of WD and CB but $4 \cdot 10^{-57}$ for CF. % Why are the p-values more different than the average coeffs? I think it's because, while std err is constant for a given tag, it isn't across tags. So the betas aren't comparable across tags.
% Question: Okay, but then why don't the z-scores work exactly. pnorm(- median zscore) is not quite the same as median p-val.... Because p-values based on change in sum of squares, not coefficient itself?
It's worth noting, though, that statistical significance of a coefficient as different from zero is not identical with usefulness as a data source. Indeed, we saw in Section \ref{models} that CB was individually more predictive than CF, at least as measured by AUC, while CB's coefficient is 0.02 instead of 0.09. The reason may again be that CB provides a denser input representation than CF; CF can afford to have a large $\beta_{st}^\text{CF}$ because in the rare cases when its values are nonzero, they're strongly informative.

\begin{comment}
\report{The average equation for Acoustic tags is
\s{
\widehat{y}_{st} = 0.02 + 0.03 x_{st}^\text{WD} + 0.02 x_{st}^\text{CB} + 0.02 x_{st}^\text{CF} - 0.0006 x_{s}^\text{pop}.
}
The intercept is smaller than before, reflecting the lower average number of positive labels in the ground truth. CF's dominance is also lower, perhaps because acoustic tags propagate less well than genre tags. (It seems more plausible to say that two songs are in a user's playlist because they're both ``rock" than because they both have ``a repetitive song structure," for instance.)}
\end{comment}

\subsection{Regression Types}\label{Regression Types}

The bottom half of Table \ref{table} shows various combination techniques. The regression approaches use the model All3\&P, while the fixed-combination approaches use just the three main sources. All trained regression models outperform all fixed-combining methods.\footnote{Paired $t$-tests on the AUC values for individual tags give p-values less than 0.05 for all pairs except between (1) Product and Median and (2) Sum and Max for Genre tags, and (3) all three of Hier Mix, Hier Lin, and Ind Lin for both tag types. For Genre tags, five more pairs fail to reject the null hypothesis if we apply a Bonferroni correction on the $\frac{10 \cdot 9}{2}$ pairs of tests, including Sum vs. Independent Logistic (p-value 0.01).} This result contrasts with the finding by \cite{ross2003ifb} that the simple sum rule outperformed supervised linear-discriminant analysis (similar to logistic regression) and decision trees. Still, Sum and especially Max do not fare badly and would not be unreasonable choices for a simple combining system. That Max is close to Independent Logistic regression is perhaps unsurprising, because the fixed-combining methods apply the same sigmoid transformation to the input data that logistic regression uses. %The similar performance suggests that most of the signal captured by logistic regression comes from a single, salient data source with a peak value, rather than an even combination of all three. This is no doubt partly due to the sparsity of the WD and especially CF input sources.

While Hierarchical Logistic regression did slightly outperform Independent Logistic, the hierarchical and mixture models showed no apparent effect for linear regression. We suspect this is because the number of observations (songs) is so large (over 2,100 on average) that the Bayesian prior terms in those models wash out. To confirm this, we tried artificially restricting ourselves to 250 songs, and in that case, the hierarchical methods did slightly outperform their independent counterparts.

\section{Conclusions}\label{Conclusions}

We have shown that combining different sources of song-tag annotation information improves retrieval performance. Fixed-combining methods like Sum and Max do a fine job for simple systems, but retrieval improves when we use a trained combining method like linear or logistic regression. In settings where large numbers of songs are available, basic Independent Linear regression on each tag separately gives results just as good as more sophisticated hierarchical models, while allowing for easier implementation, faster computation, and greater parallelizability.% Independent regressions are also more flexible in allowing for the use of different numbers of features for different tags, whereas the straightforward hierarchical approach assumes that each tag has the same number of regression coefficients.% In addition to suggesting the weights to use in combining input data sources, the different regression coefficients for each tag may be of intrinsic interest---for instance, by allowing for clustering of tags based on which data sources are most predictive.

% predict new tags???

\small
\bibliography{ismir09}
\end{document}
% END DOCUMENT