by Brian Tomasik, Phyo Thiha, and Douglas Turnbull
SIGIR 2009

## Abstract

Associating labels with online products can be a labor-intensive task. We study the extent to which a standard 'bag of visual words' image classifier can be used to tag products with useful information, such as whether a sneaker has laces or velcro straps. Using Scale Invariant Feature Transform (SIFT) image descriptors at random keypoints, a hierarchical visual vocabulary, and a variant of nearest-neighbor classification, we achieve accuracies between 66% and 98% on 2- and 3-class classification tasks using several dozen training examples. We also increase accuracy by combining information from multiple views of the same product.

## Manuscripts

• 2-page SIGIR version (pdf)
• 8-page technical report (pdf)

## Data

The images in our dataset were downloaded from Amazon.com in 2008 and so are copyrighted. You can get our collection of ~3,500 images as a zip file here, but it should only be used for research purposes. I hope that providing these images is consistent with fair use, but if you think this poses a copyright problem, let me know.

Here are the meanings of the prefixes on the image-file names:

• bk == "back"
• btl == "bottom pointing towards left"
• btr == "bottom pointing towards right"
• fr == "front"
• la == "left angled"
• ll == "pointing left, viewed from the side"
• nn == "other" (sometimes worn by people, sometimes two shoes in the picture, etc.)
• ra == "right angled"
• rl == "pointing right, viewed from the side"
• tpl == "from top, pointing left"
• tpr == "from top, pointing right"
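
If you want to group the images by view programmatically, the prefix table above can be turned into a small lookup. This is only a sketch: it assumes each file name begins directly with one of the prefixes, and the function name `view_from_filename` and the example file names are hypothetical, not taken from the dataset.

```python
import os

# Prefix-to-view mapping copied from the list above.
VIEW_PREFIXES = {
    "bk": "back",
    "btl": "bottom pointing towards left",
    "btr": "bottom pointing towards right",
    "fr": "front",
    "la": "left angled",
    "ll": "pointing left, viewed from the side",
    "nn": "other",
    "ra": "right angled",
    "rl": "pointing right, viewed from the side",
    "tpl": "from top, pointing left",
    "tpr": "from top, pointing right",
}


def view_from_filename(path):
    """Return (prefix, description) for an image path, or None if no prefix matches.

    Assumption: the file name starts with the view prefix. Whatever follows
    the prefix (separator, ID, extension) is ignored.
    """
    name = os.path.basename(path)
    # Check longer prefixes first so a 3-letter code is tested before a 2-letter one.
    for prefix in sorted(VIEW_PREFIXES, key=len, reverse=True):
        if name.startswith(prefix):
            return prefix, VIEW_PREFIXES[prefix]
    return None


# Hypothetical file names, for illustration only.
print(view_from_filename("images/btl_0001.jpg"))
print(view_from_filename("images/unknown.jpg"))
```

Files that match no prefix return `None`, so they can be inspected by hand rather than silently assigned to a view.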


Feel free to contact me (Brian Tomasik) with questions.

## Errata

Page 4 of the 8-page technical report linked above presents a formula for combining standard errors in quadrature. I think the correct formula would instead be

$$\frac{1}{\sqrt{5}} \sqrt{\sum_i SE_i^2}.$$

That's because we want the average SE, which is $\sqrt{\text{average variance}}$, and

$$\text{average estimated variance} = \frac{1}{5} \sum_i SE_i^2.$$

As an example to demonstrate this, imagine that $SE_i = 1$ for all $i$. Then the average SE should also be 1, which requires that the normalizing constant in front be $1/\sqrt{5}$, not $1/5$.

Presumably this error made the standard errors we used in this paper too low.
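
The sanity check above is easy to reproduce numerically. The sketch below compares the two normalizing constants on five equal standard errors. It assumes, as the example in this erratum implies, that the report's version differs only in using 1/5 in front. The function names are mine and do not come from the report.

```python
import math


def average_se(standard_errors):
    """Square root of the average variance: (1/sqrt(n)) * sqrt(sum of SE_i^2)."""
    n = len(standard_errors)
    return math.sqrt(sum(se ** 2 for se in standard_errors) / n)


def one_over_n_version(standard_errors):
    """Same sum in quadrature, but normalized by 1/n instead of 1/sqrt(n).

    Assumption: this is the form the erratum is correcting.
    """
    n = len(standard_errors)
    return math.sqrt(sum(se ** 2 for se in standard_errors)) / n


ses = [1.0] * 5  # SE_i = 1 for all i, as in the example above
print(average_se(ses))          # 1.0
print(one_over_n_version(ses))  # sqrt(5)/5, about 0.447
```

With five inputs, the 1/5 version is smaller than the 1/sqrt(5) version by a factor of sqrt(5), which is why the resulting standard errors would come out too low.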