by Brian Tomasik, Phyo Thiha, and Douglas Turnbull
Associating labels with online products can be a labor-intensive task. We study the extent to which a standard 'bag of visual words' image classifier can be used to tag products with useful information, such as whether a sneaker has laces or velcro straps. Using Scale Invariant Feature Transform (SIFT) image descriptors at random keypoints, a hierarchical visual vocabulary, and a variant of nearest-neighbor classification, we achieve accuracies between 66% and 98% on 2- and 3-class classification tasks using several dozen training examples. We also increase accuracy by combining information from multiple views of the same product.
The images in our dataset were downloaded from Amazon.com in 2008 and so are copyrighted. You can get our collection of ~3,500 images as a zip file here, but it should only be used for research purposes. I hope that providing these images is consistent with fair use, but if you think this poses a copyright problem, let me know.
Here are the meanings of the prefixes on the image-file names:
bk == "back"
btl == "bottom pointing towards left"
btr == "bottom pointing towards right"
fr == "front"
la == "left angled"
ll == "pointing left, viewed from the side"
nn == "other" (sometimes worn by people, sometimes two shoes in the picture, etc.)
ra == "right angled"
rl == "pointing right, viewed from the side"
tpl == "from top, pointing left"
tpr == "from top, pointing right"
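If you want to group the images by view programmatically, the prefix table can be turned into a lookup. The following is only a minimal sketch: the function name `view_from_filename` and the example file names are hypothetical, and it assumes that the prefix is the leading run of lowercase letters in the file name (followed by a digit, underscore, or similar), which may not match the actual naming scheme in the zip file.

```python
import re

# Prefix meanings copied from the list above.
VIEW_PREFIXES = {
    "bk": "back",
    "btl": "bottom pointing towards left",
    "btr": "bottom pointing towards right",
    "fr": "front",
    "la": "left angled",
    "ll": "pointing left, viewed from the side",
    "nn": "other",
    "ra": "right angled",
    "rl": "pointing right, viewed from the side",
    "tpl": "from top, pointing left",
    "tpr": "from top, pointing right",
}


def view_from_filename(filename):
    """Return the view description for an image file name, or None if unknown.

    Assumption (not confirmed by the dataset description): the prefix is the
    leading run of lowercase letters in the file name.
    """
    match = re.match(r"[a-z]+", filename)
    if match is None:
        return None
    return VIEW_PREFIXES.get(match.group(0))


# Hypothetical file names, for illustration only.
print(view_from_filename("tpl_0001.jpg"))
print(view_from_filename("bk42.jpg"))
```

If the real file names separate the prefix differently, only the regular expression should need to change.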
Feel free to contact me (Brian Tomasik) with questions.
Page 4 of the 8-page technical report linked above presents a formula for combining standard errors in quadrature. I think the correct formula would instead be
$$\frac{1}{\sqrt{5}} \sqrt{\sum_i \mathrm{SE}_i^2}.$$
That's because we want the average SE, which is sqrt(average variance), and
$$\text{average estimated variance} = \frac{1}{5} \sum_i \mathrm{SE}_i^2.$$
As an example to demonstrate this, imagine that $\mathrm{SE}_i = 1$ for all $i$. Then the average SE should also be 1, which requires that the normalizing constant in front be $1/\sqrt{5}$, not $1/5$.
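Presumably this error made the standard errors we used in this paper too low: if the report's formula differs from the one above only in using 1/5 as the normalizing constant, they would be too low by a factor of sqrt(5), or about 2.24.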
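As a quick numerical check of the correction described above, here is a small sketch that computes the average SE as the square root of the mean squared SE, next to a version that uses a 1/n normalizing constant instead. The function names are made up for illustration, and the second function only stands in for the 1/5 normalization that the unit-SE example rules out; it is not copied from the report.

```python
import math


def average_se(standard_errors):
    """Square root of the mean of the squared standard errors.

    With n values this equals (1/sqrt(n)) * sqrt(sum of SE_i^2).
    """
    n = len(standard_errors)
    return math.sqrt(sum(se ** 2 for se in standard_errors) / n)


def se_with_one_over_n(standard_errors):
    """Same sum in quadrature, but normalized by 1/n instead of 1/sqrt(n)."""
    n = len(standard_errors)
    return math.sqrt(sum(se ** 2 for se in standard_errors)) / n


# The example from the text: five standard errors, all equal to 1.
ses = [1.0] * 5
print(average_se(ses))          # 1.0, as the example requires
print(se_with_one_over_n(ses))  # sqrt(5)/5, roughly 0.447
```

For five inputs, the two results always differ by a factor of sqrt(5), whatever the individual SE values are.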