Rating System

How I Assign Value to Products

Author: Philo Baer
Date: 2008-02-29
Copyright: 2008, all rights reserved

Short Description

Why do we rate? To tell others how we feel about products. How do we rate? With interesting paradoxes.

How I Rate

My rating system relies on letter grades (either A, B, C, D, or F) with an optional modifier (represented by a + or a -). Thus, there are 15 distinct categories. It is very similar to most grading systems used worldwide. There are lots of nuances about this system, but it can be approximately described as:

Grade  Description
A      Exceptional. Definitely worthwhile.
B      Well-liked.
C      Mildly liked.
D      Not liked (but not necessarily disliked).
F      Disliked.

The plus sign designates the product as "above average" and the minus sign as "below average" (relative to the products in the same category). As a general principle, any item with a grade of B- or above is recommended.

There are a few exceptions to the above classification:

  • The A+ rating is special, and designates a personal favorite.
  • The F+ rating designates an item as flawed (and possibly slightly disliked).
  • The F rating represents an item that is definitely disliked.
  • The F- rating represents a repulsive product.

Why I Rate

As a consumer of many goods and a subscriber to capitalist beliefs, I wish to catalog the products I've enjoyed and the ones that have failed me. The purpose is two-fold: first, to provide myself with a written history of experiences (I want to remember the things that I truly adored and why I adored them); second, to give guidance to others and ensure that free market forces reward quality and punish inferiority.

Why Rating is Complicated

Designing a rating system to satisfy these goals is rather difficult. There are several objectives a system should maximize, and many of these objectives are in direct conflict with each other.

The objectives that make a good review system (and compete with each other) are:

Categorical Accuracy
The chance that the reviewer mislabels a product is low.
Relative Accuracy
The difference between a product's true rating and its given rating should be minimal.
Differentiation
Products that are sufficiently different should be placed in distinct categories.
Meaningful
Products in higher categories should be preferred to products in worse categories.

At first glance, it seems odd that these qualities would be at odds with each other. However, the most important design aspect is the number of categories in the rating system. For example, the Siskel and Ebert "two thumbs up / two thumbs down / one thumb up, one thumb down" method is a 3-category system; the Amazon system of 1 to 5 stars per product is a 5-category system. Moreover, some qualities require many categories, while others need fewer.

Categorical Accuracy

Categorical accuracy is concerned with the accuracy of the reviewer as he relates to the review system; no matter how great the system may be, it is essentially worthless if reviewers cannot leverage it. As a generality, systems with few categories tend to have high categorical accuracy.

For example, consider a 1-category system. In such a scheme, every product is placed in the same category. The accuracy is 100%, since it is impossible to misclassify a product.

At the other extreme, consider a 1,000,000-category system. It is impossible for a human to pick exactly the right category for every product. The categorical accuracy of such a system is close to 0%.

Relative Accuracy

While categorical accuracy is absolute (it depends on whether or not the right category was picked), relative accuracy is concerned with differences. If a product is misclassified, a good system should ensure it still sits in a category that closely reflects its true rating. In general, systems with many categories tend to have high relative accuracy.

You may wonder, "Why does high relative accuracy matter? Why not just care about categorical accuracy?" The reason is simple: no matter how good a reviewer you are, you will (eventually) mislabel an item. No review system is perfect. If the mislabeled items land in wildly inaccurate categories, they degrade the effectiveness of the whole system. Most people would rather have a system that correctly classifies 90% of all reviews and slightly misclassifies the remaining 10% than a system that correctly classifies 95% of all reviews but wildly misclassifies the remaining 5%.

To illustrate this point, consider the 3-category system of "negative," "neutral," and "positive" labels. Imagine a product that is halfway between "neutral" and "positive," and slightly leans toward the positive side. It is likely that someone would misclassify it as "neutral." Even though the product is only one categorical label away from its true rating, the rating is (relatively) inaccurate -- neutral carries the connotation of no preference, when the product was actually (slightly) liked.

Examining the other side, consider a 100-category system. Imagine a great product that deserves a rating of 95. However, the reviewer gives it only a 91 because he was having a bad day. Although the product is placed four categories away, a 91 still carries the connotation of a great product. The more categories one has, the more alike adjacent categories are; the more alike they are, the less harm a misclassification does.
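The tension between categorical and relative accuracy can be illustrated with a rough simulation. This is only a sketch under stated assumptions: scores live on a 0-to-1 scale, the reviewer is always off by a fixed amount (the hypothetical `error` parameter), and relative accuracy is approximated by how far the midpoint of the chosen category sits from the true score. None of these modeling choices come from the essay itself.

```python
def categorize(score, k):
    """Map a score in [0, 1] to one of k equal-width categories (0 .. k-1)."""
    clamped = min(max(score, 0.0), 1.0)
    return min(int(clamped * k), k - 1)


def evaluate(k, error=0.04, steps=1000):
    """Return (categorical accuracy, mean distance from the true score)."""
    hits = 0
    distance = 0.0
    trials = 0
    for i in range(steps + 1):
        true_score = i / steps
        for offset in (-error, error):  # reviewer under- or over-rates
            given = categorize(true_score + offset, k)
            hits += given == categorize(true_score, k)
            # Assumption: a category "means" the score at its midpoint.
            distance += abs((given + 0.5) / k - true_score)
            trials += 1
    return hits / trials, distance / trials


for k in (1, 3, 100):
    accuracy, distance = evaluate(k)
    print(f"{k:>3} categories: accuracy={accuracy:.2f}, mean distance={distance:.3f}")
```

Under these assumptions, the 1-category system is never "wrong" but says the least about the true score, while the 100-category system is almost always "wrong" yet stays closest to it.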

Differentiation

Differentiation simply states that different products deserve different categories. It is clear that more categories allow for more differentiation, while fewer categories limit it. Differentiation is important since review systems are meaningless unless products are distinguished.

For example, consider the ludicrous 1-category system described above. The reason why it is ludicrous (and why no one uses it) is because it conveys no useful information. Likewise, in the 3-category system of "negative," "neutral," and "positive," we also see problems. "Positive" does not tell a person whether the item is at the borderline, is good, is great, or is exceptional.

Meaningful

The counterpart to differentiation, the meaningful principle states that we should not differentiate so much that the categories become meaningless. Products have many different aspects, and attempting to create a single label that describes all of them can be futile. Categories should therefore correspond to a particular feeling and be vague enough to account for the fact that not everything can be measured along a single axis. As such, meaningfulness requires a moderate number of categories.

Looking at a 100-category system, one has to wonder what the differences between two adjacent categories are. For example, can the reviewer really distinguish an "85" from an "86"? How much of the number is signal, and how much is noise?

Moreover, consider the case of three books: Cervantes's "Don Quixote," Douglas Adams's "The Hitchhiker's Guide to the Galaxy," and Kafka's "The Metamorphosis." For argument's sake, let us say that the reviewer likes "Don Quixote" more than "The Hitchhiker's Guide to the Galaxy," likes "The Hitchhiker's Guide to the Galaxy" more than "The Metamorphosis," and likes "The Metamorphosis" more than "Don Quixote," for various reasons. [1] We now have a situation where we cannot strictly order these books according to preference, since no book is strictly preferred to the others. If we had to rate them in a very fine-grained system, like a 100-category scheme, the three books might be assigned 91, 92, and 93, respectively. However, the ratings lie, since the book with a 92 ("The Hitchhiker's Guide to the Galaxy") is actually preferred to the one with a 93 ("The Metamorphosis"). The best solution would be to reduce the number of categories so all three books could receive the same classification -- we usually only encounter such bizarre scenarios when nitpicking over the fine details.

Likewise, if we consider the case of too few categories, we lose meaning because we lose differentiation.

Thus, too many categories run the risk of becoming both meaningless and misleading. Too few categories run the risk of meaninglessness. We want just enough categories to give meaning to our reviews, preventing randomness and intransitivity from ruining our system.

How I Developed My System

My very first scheme used a 101-category system, with products able to obtain a numerical value between 0.0 and 10.0. This system was similar to GameSpot's old system. However, I eventually discovered I had a lack of categorical accuracy and meaning. I also noticed that I tended to give ratings that ended in either .5 or .0, indicating that I was subconsciously ignoring many categories.

After examining different systems, I finally decided on my first general principle that should guide my system: each category should represent a distinct "feeling" that is sufficiently different from the others. Deciding on categories so products would cleanly fall into just one was rather difficult. I consulted my previous reviews and tried to re-classify them. I thought about this problem for quite some time, and eventually came across a set of labels that, although not perfect, worked well:

  1. Love
     Like "exceptional," but limited to my personal favorites.
  2. Exceptional
     Truly wonderful. The cream of the crop.
  3. Strong Like
     Enjoyable and recommendable, but with the caveat that there are (probably) better things out there.
  4. Mild Like
     Nothing special, but has some niceties.
  5. Neutral
     Neutral is a bit of a misnomer. This category encompasses items that are not liked, but at the same time, not offensive. It can also include products that might have been labeled as flawed or disliked, but had redeeming characteristics to balance these blemishes.
  6. Flawed
     A category encompassing many possibilities. Items in here are usually mildly disliked. They can also be equivalent to the "neutral" category, but with noticeably more flaws.
  7. Strong Dislike
     A product that makes you wish for your time and money back.
  8. Despicable
     Something so bad that it has scarred you for life. Once you experience it, you can't un-experience it!

However, I felt that such a system lacked sufficient differentiation. Moreover, with only eight categories, I felt I lacked some relative accuracy as well (most noticeably for products on the border between two categories). I eventually found a solution to the problem: an additional modifier.

Every review has a major category (listed above) and a minor category label. The minor category labels are: "Below Average," "Average," and "Above Average," representing the product's standing relative to all other products in the same category. I noticed it is extremely difficult to strictly order products that are loved, so I do not use modifiers in the "love" category. Also, I noted that "love" and "above average exceptional" are basically equivalent, and decided to merge them together (so "above average exceptional" and "love" are the same category). Thus, my final rating system is a 21-category scheme.
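The count of 21 follows from the rules in the paragraph above, and a short enumeration makes the arithmetic explicit. This is a sketch only; the names `FEELINGS`, `MODIFIERS`, and `internal_categories` are hypothetical and are not part of any actual review tooling described here.

```python
# Major categories, best to worst, as listed in the essay.
FEELINGS = ["Love", "Exceptional", "Strong Like", "Mild Like",
            "Neutral", "Flawed", "Strong Dislike", "Despicable"]
MODIFIERS = ["Above Average", "Average", "Below Average"]


def internal_categories():
    """Return every (feeling, modifier) pair the internal scheme allows."""
    pairs = []
    for feeling in FEELINGS:
        if feeling == "Love":
            # "Love" takes no modifier.
            pairs.append((feeling, None))
            continue
        for modifier in MODIFIERS:
            if feeling == "Exceptional" and modifier == "Above Average":
                # Merged into "Love", so it is not a separate category.
                continue
            pairs.append((feeling, modifier))
    return pairs


print(len(internal_categories()))  # 1 + 2 + 6 * 3 = 21
```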

This solution worked remarkably well for differentiation, and it actually increased relative accuracy without hurting categorical accuracy (after all, one can ignore the minor category label without affecting the major category). Remarkably, the system satisfied all of my requirements, despite their conflicts, with nice compromises between all the goals. I feel that it has been pushed as far as possible while satisfying all the constraints.

Why The Online Ratings are Different

As you may have noticed by now, I claim to use a system different from the one I initially described. In fact, I use the 21-category system internally for all my reviews. When I display the reviews on my web site, I convert each rating to the grade system using the following table:

Feeling         Modifier                                 Grade
Love            N/A (same as Above Average Exceptional)  A+
Exceptional     None                                     A
                Below Average                            A-
Strong Like     Above Average                            B+
                None                                     B
                Below Average                            B-
Mild Like       Above Average                            C+
                None                                     C
                Below Average                            C-
Neutral         Above Average                            D+
                None                                     D
                Below Average                            D-
Flawed          Above Average                            F+
                None                                     F+
                Below Average                            F+
Strong Dislike  Above Average                            F
                None                                     F
                Below Average                            F
Despicable      Above Average                            F-
                None                                     F-
                Below Average                            F-
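The conversion table above can be sketched as a small lookup function. This is an illustration rather than the actual internal process, which is not described here; the function name `to_grade` and the dictionaries are hypothetical. The unmodified case is accepted either as `None` (the table's "None") or as "Average" (the label used in the prose).

```python
# Feelings that keep a letter and take a +/- sign from the modifier.
LETTERS = {"Exceptional": "A", "Strong Like": "B",
           "Mild Like": "C", "Neutral": "D"}
# Feelings whose three modifiers all collapse into a single grade.
COLLAPSED = {"Flawed": "F+", "Strong Dislike": "F", "Despicable": "F-"}
SIGNS = {"Above Average": "+", "Average": "", None: "", "Below Average": "-"}


def to_grade(feeling, modifier=None):
    """Convert an internal (feeling, modifier) rating to a displayed grade."""
    if feeling == "Love":
        return "A+"
    if feeling == "Exceptional" and modifier == "Above Average":
        return "A+"  # treated as the same category as "Love"
    if feeling in COLLAPSED:
        return COLLAPSED[feeling]  # the modifier is discarded
    return LETTERS[feeling] + SIGNS[modifier]


print(to_grade("Strong Like", "Below Average"))  # B-
print(to_grade("Flawed", "Above Average"))       # F+
```

Because the three lowest feelings discard their modifiers, the mapping is many-to-one: a grade of F cannot be converted back to a specific internal category.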

The reason for presenting a 15-category grade system as opposed to a 21-category scheme is simplicity. When displaying my reviews online, where other people can read them, it is best to use a system that people already know. Moreover, since I hardly ever run across a product I dislike (as I am careful with what I buy), I can easily squish those categories together without lumping too many products into one grade.

Criticism

Some people argue that there is another criterion necessary for rating systems: somewhat uniform distributions. These people argue that systems lose meaning when not every category is used. For example, many people use a 10-star system, but never rate products below 4 stars. Why not remove the lower categories? These critics would tell me to ditch my 21-category system and just use the 15-category grade system.

I strongly disagree with this philosophy for two reasons. First, a rating system should encompass all discernible feelings, even if the reviewer does not run across items that trigger those emotions. Second, it is extremely easy to convert from the 21-category system to the 15-category system, but very difficult to meaningfully convert from a 15-category system to a 21-category system.

In sum, I have more flexibility and more expressiveness with the 21-category system. I do not see why I should needlessly sacrifice it.

Conclusion

When I review a product, I first look at the 8 categories, from love to despicable, and discern which label fits it. After placing the label, I then assign a minor category label to it, if applicable. Each label, both major and minor, conveys a unique feeling. When displaying the review online, the value is converted to a grade, which also carries these auras.

Although there has been much analysis about the design of a review system, it is equally important that the review system matches the reviewer. I designed my system for me, and I consult its categories and definitions when rating items. Applying theory and running experiments produced a system that satisfies my constraints.

I have no doubt that the system could be further improved, but I think such improvements would be minor. I feel that the minimum number of categories any system should have is 5, with the maximum number around 20. Any fewer categories and relative accuracy, meaningfulness, and differentiation suffer. Any more causes categorical accuracy and meaningfulness to diminish. With my unique 21-category design, I have pushed the system to the edge. The only way I can see an improvement is through a paradigm shift.

[1]

If you wonder how a reviewer might have such intransitive preferences, consider the following scenario:

  • The reviewer prefers reading the original, untranslated text (to preserve the flavor of the writing); his favorite languages are, in descending order, German, Spanish, and English.
  • He prefers longer books to shorter ones.
  • He prefers books that take place in more modern settings than books that take place in more medieval ones.

When ranking two books, he evaluates all three criteria and uses a majority vote. For example, when comparing "A Tale of Two Cities" to "Don Quixote," "A Tale of Two Cities" wins the time period vote, but loses the language and length votes, so "Don Quixote" wins. This decision process leads to intransitive preferences.
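The majority-vote procedure described above can be sketched in a few lines to show the cycle among the three books from the main text. The language ranks follow the stated order (German, then Spanish, then English); the length and setting ranks are illustrative orderings assumed for this sketch, not measurements, and the names `BOOKS` and `prefers` are hypothetical.

```python
from itertools import combinations

# Higher number = more preferred on that criterion.
# Language ranks follow the footnote: German > Spanish > English.
# Length and setting ranks are assumed orderings for illustration only.
BOOKS = {
    "Don Quixote": {"language": 2, "length": 3, "setting": 1},
    "The Hitchhiker's Guide to the Galaxy": {"language": 1, "length": 2, "setting": 3},
    "The Metamorphosis": {"language": 3, "length": 1, "setting": 2},
}


def prefers(a, b):
    """Return True if book a beats book b on at least two of three criteria."""
    votes = sum(BOOKS[a][c] > BOOKS[b][c] for c in BOOKS[a])
    return votes >= 2


for a, b in combinations(BOOKS, 2):
    winner, loser = (a, b) if prefers(a, b) else (b, a)
    print(f"{winner} beats {loser}")
```

Each book wins one pairwise comparison and loses another, so no ordering of the three is consistent with every vote.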

Although intransitive preferences are rare, they do happen, as humans are complicated. A good rating system (just like a good electoral system) accounts for this phenomenon.