by Brian Tomasik, Phyo Thiha, and Douglas Turnbull
SIGIR 2009


Associating labels with online products can be a labor-intensive task. We study the extent to which a standard 'bag of visual words' image classifier can be used to tag products with useful information, such as whether a sneaker has laces or velcro straps. Using Scale Invariant Feature Transform (SIFT) image descriptors at random keypoints, a hierarchical visual vocabulary, and a variant of nearest-neighbor classification, we achieve accuracies between 66% and 98% on 2- and 3-class classification tasks using several dozen training examples. We also increase accuracy by combining information from multiple views of the same product.
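
To make the pipeline in the abstract concrete, here is a minimal sketch and not the code used in the paper. It makes several simplifying assumptions: descriptors are tiny 2-D tuples rather than 128-D SIFT vectors, the vocabulary is a flat list of centroids rather than a hierarchical one, the classifier is plain 1-nearest-neighbor rather than the variant the paper uses, and views are combined by majority vote, which is only one possible way to pool multiple views. All function names are hypothetical.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def quantize(descriptor, vocabulary):
    """Index of the visual word (centroid) closest to one descriptor."""
    return min(range(len(vocabulary)), key=lambda i: dist(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Normalized histogram of visual-word counts for one image."""
    counts = Counter(quantize(d, vocabulary) for d in descriptors)
    total = len(descriptors) or 1  # avoid dividing by zero for an empty image
    return [counts[i] / total for i in range(len(vocabulary))]


def classify(histogram, training):
    """1-nearest-neighbor over a list of (histogram, label) pairs."""
    return min(training, key=lambda pair: dist(histogram, pair[0]))[1]


def classify_product(view_histograms, training):
    """Assumed pooling rule: majority vote over per-view predictions."""
    votes = Counter(classify(h, training) for h in view_histograms)
    return votes.most_common(1)[0][0]


# Toy data: two visual words and one training histogram per class.
vocabulary = [(0.0, 0.0), (1.0, 1.0)]
training = [([1.0, 0.0], "laces"), ([0.0, 1.0], "velcro")]

image = bag_of_words([(0.1, 0.0), (0.9, 1.0), (0.2, 0.1)], vocabulary)
print(classify(image, training))
print(classify_product([[0.9, 0.1], [0.4, 0.6], [0.8, 0.2]], training))
```

In this toy run the middle view on its own would be labeled "velcro", but the vote across all three views gives "laces", which illustrates why pooling views of the same product can help.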



  • Two-page SIGIR version (pdf)
  • 8-page technical report (pdf)


The images in our dataset were downloaded from in 2008 and so are copyrighted. You can get our collection of ~3,500 images as a zip file here, but it should only be used for research purposes. I hope that providing these images is consistent with fair use, but if you think this poses a copyright problem, let me know.

Here are the meanings of the prefixes on the image-file names:

bk == "back"
btl == "bottom pointing towards left"
btr == "bottom pointing towards right"
fr == "front"
la == "left angled"
ll == "pointing left, viewed from the side"
nn == "other" (sometimes worn by people, sometimes two shoes in the picture, etc.)
ra == "right angled"
rl == "pointing right, viewed from the side"
tpl == "from top, pointing left"
tpr == "from top, pointing right"
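
If you want to group the images by view programmatically, the table above can be turned into a small lookup. This is a sketch that assumes each file name begins directly with one of the prefixes; the example names in it (such as `btl_0012.jpg`) are hypothetical and may not match the actual naming in the zip file.

```python
import os

# Prefix codes and meanings, copied from the list above.
VIEW_PREFIXES = {
    "bk": "back",
    "btl": "bottom pointing towards left",
    "btr": "bottom pointing towards right",
    "fr": "front",
    "la": "left angled",
    "ll": "pointing left, viewed from the side",
    "nn": "other",
    "ra": "right angled",
    "rl": "pointing right, viewed from the side",
    "tpl": "from top, pointing left",
    "tpr": "from top, pointing right",
}


def view_of(filename):
    """Return (prefix, meaning) for a file name, or None if no prefix matches."""
    name = os.path.basename(filename).lower()
    # Check longer codes first so a 3-letter code is never shadowed by a shorter one.
    for prefix in sorted(VIEW_PREFIXES, key=len, reverse=True):
        if name.startswith(prefix):
            return prefix, VIEW_PREFIXES[prefix]
    return None


print(view_of("btl_0012.jpg"))    # hypothetical file name
print(view_of("shoes/RA17.jpg"))  # directory part and letter case are ignored
```

Because the match is a simple starts-with check, any unrelated file whose name happens to begin with one of the codes would also match, so it is worth running this only on the extracted image files.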

Feel free to contact me (Brian Tomasik) with questions.