Oct 3, 2007 1:44 PM
#1

Offline
Jun 2007
82
I have written a Greasemonkey script which calculates the similarity between two users using the cosine similarity formula. Those who are familiar with vector algebra may know the concept of cosine similarity. It's a better formula for finding the similarity between two vectors (in this case the score vectors of two users) and also the most widely used one. The script can be installed from http://userscripts.org/scripts/show/12746 and the details of cosine similarity can be seen at http://en.wikipedia.org/wiki/Cosine_similarity. You'll need Firefox with the Greasemonkey extension to use this script. After installing the script, go to any user's profile and click on the "All Shared" link to see the similarity.
Oct 3, 2007 1:58 PM
#2
Overlord

Offline
Nov 2004
5752
Seems accurate (I tried it out on about 15 people), but how is it any better than the current system?
Oct 3, 2007 2:08 PM
#3

Offline
Jun 2007
82
Well, it will require some knowledge of vector algebra to understand this.
Let A and B be two vectors. If they are in the same direction (that is, they are totally similar) then the angle between them is zero degrees and the cosine of the angle (which is what this script calculates) is 1, that is, 100%. Whereas if they are in opposite directions (that is, they are totally dissimilar) then the angle between them is 180 degrees and the cosine is -1, that is, -100%.
The closer the two vectors are to the same direction, the closer the cosine is to 100%; the closer they are to opposite directions, the closer the cosine is to -100%.
So the cosine gives the similarity between the two users.
The current system is based on the differences of scores and their average, which I don't think is as strongly correlated with similarity as the cosine formula.
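
For the programmer types, the whole idea fits in a few lines (a minimal sketch of the math, not the actual script code):

function cosineSimilarity(a, b) {
    // dot product divided by the product of the vector lengths
    var dot = 0, lenA = 0, lenB = 0;
    for (var i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        lenA += a[i] * a[i];
        lenB += b[i] * b[i];
    }
    // as a percentage: 100% = same direction, -100% = opposite
    return 100 * dot / Math.sqrt(lenA * lenB);
}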

Any math geeks out there to explain all this?
Oct 3, 2007 2:09 PM
#4

Offline
Jun 2007
82
Must sleep now. It's 2:40 am.
Oct 3, 2007 3:14 PM
#5

Offline
Mar 2005
3807
haha we were just going over the Fourier transform in physics about an hr ago so I'm so in the mood for this. nice

if ur not going to base the cosine similarity on the average difference in score, then what values will you be using for the vector parameters?
Oct 3, 2007 8:13 PM
#6

Offline
Jun 2007
82
Normalized score vectors of the users. Of course for the shared anime only.
Oct 3, 2007 8:54 PM
#7
Overlord

Offline
Nov 2004
5752
I'd like to see a comparison of this system vs. the current system with multiple subjects, and an explanation of why the 'cosine similarity' is better in all cases.
Oct 3, 2007 8:56 PM
#8

Offline
Jun 2007
82
I'll do that after I am back from college.
Oct 3, 2007 10:50 PM
#9

Offline
Mar 2005
3807
If I understand this correctly, this method would better account for diverging patterns in the way users score anime. For example, if one person tends to score lower in general and you compare them to someone who scores higher, the current system would invariably give them a low compatibility rating. By transforming the users' scores into vectors and comparing them mathematically, this new method might be able to smooth out discrepancies between differing scoring patterns. I think this might be possible given my limited knowledge of the vector math behind this, but then again I do not fully understand this method yet, so this will have to be confirmed.
Oct 4, 2007 12:44 AM

Offline
Jun 2007
82
So here is the explanation:

Cosine Similarity

[figure: pairs of vectors illustrating similar scores (small angle, cosine near 1), unrelated scores (angle near 90°, cosine near 0), and opposite scores (angle near 180°, cosine near -1)]
As seen from the figure above, if the two vectors are in the same direction then the angle between them is close to zero degrees and the cosine of the angle is near 1, or 100%. Similar arguments hold for unrelated scores and opposite scores. The point here is that the more the vectors point in the same direction, the closer the cosine is to 100%, and for opposite directions it is near -100%.

How cosine similarity is applied here

Say two users have following scores for the anime they share:

Anime Title        | User 1's Score | User 2's Score
Azumanga Daioh     | 10             | 10
Bleach             | 9              | 7
Bokura ga Ita      | 10             | -
Card Captor Sakura | 9              | 9
Death Note         | 8              | 7
Dragon Ball GT     | 6              | 4

* Neglect the series for which either or both users don't have a rating
* Normalize the scores to the range -1 to 1 (here, with respect to 5: normalized = (score - 5) / 5)

Now the user scores can be represented as vectors:
v1 = [1, 0.8, 0.8, 0.6, 0.2]
v2 = [1, 0.4, 0.8, 0.4, -0.2]

Now we just calculate the cosine between the two vectors and it gives the similarity between the two users. In this case the vectors are n-dimensional, but the cosine formula still gives the cosine of the angle between them. So vectors in the same direction have a smaller angle between them and hence a cosine near 1.
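
Plugging the vectors above in (a sketch of the calculation, not the script's exact code):

var v1 = [1, 0.8, 0.8, 0.6, 0.2];   // user 1, normalized as (score - 5) / 5
var v2 = [1, 0.4, 0.8, 0.4, -0.2];  // user 2, same normalization

var dot = 0, len1 = 0, len2 = 0;
for (var i = 0; i < v1.length; i++) {
    dot += v1[i] * v2[i];    // totals 2.16
    len1 += v1[i] * v1[i];   // totals 2.68
    len2 += v2[i] * v2[i];   // totals 2.00
}
var cosine = dot / Math.sqrt(len1 * len2);
// 2.16 / sqrt(2.68 * 2.00) ≈ 0.93, i.e. a similarity of about 93%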

Comparison with current system

The current system finds the differences in scores for two users and averages them. This does not make much sense for finding the similarity between users. As kei-clone pointed out, if one user tends to rate lower and the other higher, then the current system will show low similarity between them. The current system depends on the magnitudes of the scores, whereas the cosine similarity does not depend on magnitudes: it depends only on the direction of the vectors, which is a much better measure of similarity. If two score vectors are in the same direction - it does not matter what the score magnitudes, and hence the lengths of the vectors, are - the angle between them is small and hence the cosine is near 1.

Some examples:
Users                       | Diff | Cosine
Sakura_Blossoms and abhin4v | 1.73 | 89
DishCloth and abhin4v       | 1.71 | 82
DrkSasuke and abhin4v       | 1.80 | 92
FatalFlaw and abhin4v       | 2.38 | 81
slayer83 and xinil          | 0.90 | 95
izikiel and abhin4v         | 1.28 | 89
kei-clone and abhin4v       | 1.62 | 78
blissfire and abhin4v       | 2.30 | 53
Marco and abhin4v           | 1.80 | 74
lops-sama and abhin4v       | 1.86 | 71
Oct 4, 2007 9:58 AM

Offline
Aug 2006
386
I do like the methodology, however:

Shouldn't the scores be normalized based upon the user's mean score rather than based upon each other's scores? At the moment your normalization seems to merely make the numbers smaller, which creates a similar vector with smaller magnitude.

To truly ignore magnitude, the user's mean score should be used, and the numbers normalized based upon that.

Edit: this is based off of your little example and conversion of that to a vector.


Let me expand on this some more:

base the normalization off of three things: highest posted score, mean score, and lowest posted score.

Then use those numbers to create a vector for each user in [-1, 1], where 0 would be the mean score, 1 the highest, and -1 the lowest used score.
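
Something like this, I imagine (a hypothetical, untested sketch; the function name is made up):

function normalizeScore(score, lowest, mean, highest) {
    // assumes lowest < mean < highest; returns a value in [-1, 1]
    if (score >= mean) {
        return (score - mean) / (highest - mean);   // mean..highest -> 0..1
    }
    return (score - mean) / (mean - lowest);        // lowest..mean -> -1..0
}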
aisakku · Oct 4, 2007 10:23 AM
#dontcare
Oct 4, 2007 11:35 AM

Offline
Jun 2007
82
Shouldn't the scores be normalized based upon the user's mean score rather than based upon each other's scores? At the moment your normalization seems to merely make the numbers smaller, which creates a similar vector with smaller magnitude.


I thought of that too, but the thing is that each score has a quality associated with it, like Masterpiece, Great, Very Good, Good, etc. So when a user rates a series, his score is associated with a quality. And I think these qualitative terms are absolute: Masterpiece means masterpiece to everyone. So the normalization here is done with respect to the same score for all users (5, that is, Average). In effect, Average is taken as the mean. If the normalization is done with respect to the mean score (which is different for each user), the results will be erroneous.
Oct 4, 2007 11:40 AM

Offline
Mar 2005
3807
I'm gonna have to agree with aisakku here. The fact is that the qualities associated with the scores are definitely not absolute. So even though it says "average" for 5, quite a significant number of users don't consider that an "average" score, but rather a "low" score. That's why I think this normalization idea is good, because it scores more relatively, based on each user's own standard of scoring.
Oct 4, 2007 11:46 AM

Offline
Jun 2007
82
If you say so then I'll give it a try.
Oct 4, 2007 11:52 AM
Overlord

Offline
Nov 2004
5752
Awesome explanation abhin4v. I look forward to seeing what the normalization example gives us. This looks highly likely to be implemented.
Xinil · Oct 4, 2007 12:46 PM
Oct 4, 2007 11:55 AM

Offline
Jun 2007
82
domo ^^
Oct 4, 2007 12:51 PM

Offline
Jun 2007
82
I did the normalization with respect to the score 5 and also with respect to the mean score of each user. Here are the results:

Users                        | Norm 5 | Norm mean | Diff
Xinil and abhin4v            | 88     | 27        | 1.19
kei-clone and abhin4v        | 77     | 12        | 1.67
aisakku and abhin4v          | 80     | 29        | 1.58
selective_yellow and abhin4v | 91     | 36        | 1.04
windy and abhin4v            | 92     | 25        | 1.06

As you can see from the results, when the scores are normalized using the mean as the basis, the results are very erroneous. So I think my method of normalization with respect to a constant (5) is better.

The fact is that the qualities associated with the scores are definitely not absolute. So even though it says "average" for 5, quite a significant number of users don't consider that an "average" score, but rather a "low" score.


I think people don't think that way, and do consider "average" an "average" score.
Oct 4, 2007 5:15 PM

Offline
Aug 2006
386
abhin4v said:

The fact is that the qualities associated with the scores are definitely not absolute. So even though it says "average" for 5, quite a significant number of users don't consider that an "average" score, but rather a "low" score.


I think people don't think that way, and do consider "average" an "average" score.


I'm pretty sure you're wrong on this one. The majority of people consider 5 a bad score. You could even run a poll to figure it out, but the fact that I tend to score 2 points lower than other people, with my mean being a 5.8, shows that people tend to score higher than "average."

As for the "erroneous" information: erroneous based upon what? I have the most differing tastes from you compared to the others listed, and in general the numbers do reflect that.

Consider this. Logically thinking, when you base the numbers off of each other, you are in fact using the magnitude as a measurement. What your normalization system does is merely shrink the numbers down to between -1 and 1; it does not adjust for any difference in magnitude. Doing it based upon the mean score completely disregards the magnitude of the score and bases things on the magnitude of difference from the user's own personal scores, making it a closer estimation of "similar tastes."
That which is below the user's average score is going to be considered well below average, or negative. That which is above it will be considered more positive.

Now perhaps it is necessary to round to the closest whole number to do the calculations, I'm not sure, but pure logic dictates that a system based on scores compared to a user's personal average will better find similarity in scoring than one basing it off of the other person's individual scores.

The numbers here are obviously going to be different from your previous calculations, and most likely smaller, but not necessarily erroneous because of that.
#dontcare
Oct 4, 2007 8:44 PM

Offline
Jun 2007
82
aisakku said:

Consider this. Logically thinking, when you base the numbers off of each other, you are in fact using the magnitude as a measurement. What your normalization system does is merely shrink the numbers down to between -1 and 1; it does not adjust for any difference in magnitude. Doing it based upon the mean score completely disregards the magnitude of the score and bases things on the magnitude of difference from the user's own personal scores, making it a closer estimation of "similar tastes."
That which is below the user's average score is going to be considered well below average, or negative. That which is above it will be considered more positive.


Your argument seems valid to me. Xinil, can you create a poll to find out what score people give to a series they think is "average"?
Oct 4, 2007 9:12 PM

Offline
Oct 2006
1569
Probably something around a 6. It's hard to pinpoint exactly though. I mean, you could find the average of all the scores given, but most people don't keep watching shows they think aren't any better than average, so that'd still be off a bit.
Oct 4, 2007 9:56 PM
Overlord

Offline
Nov 2004
5752
Well based on the examples, using the norm mean seems to make the averages a lot lower than using norm 5.

Xinil and abhin4v | 88 | 27 | 1.19

27% as compared to 88%. Hmm, a bigger sample might be in order.
Oct 4, 2007 10:03 PM

Offline
Mar 2005
3807
according to the theory, shouldn't a negative percentage also be possible? I don't see any...
Oct 4, 2007 10:54 PM

Offline
Aug 2006
386
Wait, if those are percentages, then wouldn't the numbers based off of my method be a percentage of difference rather than a percentage of similarity? Hence the massive difference in the numbers.

Edit: I need to see the formula written out to see what's really going on.
aisakku · Oct 4, 2007 11:01 PM
#dontcare
Oct 5, 2007 12:40 AM

Offline
Jun 2007
82
Here are the formulas:

The first one is the standard correlation coefficient, or cosine coefficient, the one I have used.
The second is Pearson's correlation coefficient, the one aisakku proposed. It has the mean of the data subtracted from the data.
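
For reference, in the plain notation used elsewhere in this thread (x* and y* denote the means), the two well-known formulas are:

cosine = Sum i: x_i y_i / ( Sqrt(Sum i: x_i²) · Sqrt(Sum i: y_i²) )

Pearson's r = Sum i: (x_i - x*)(y_i - y*) / ( Sqrt(Sum i: (x_i - x*)²) · Sqrt(Sum i: (y_i - y*)²) )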
abhin4v · Oct 5, 2007 8:22 AM
Oct 5, 2007 3:58 AM
Offline
Oct 2007
9
How does the current system work? Difference in mean?

In that case the proposed method seems a lot better, but it doesn't feel quite right either. Does it not have the effect that the angle depends on the number of shared anime? This is not something we want.
The current rating system is probably all but linear. However, it seems that the top scores are stationary across users while the scale from there varies. If every user has lots of ratings, one could use, instead of the rating, the number of items a user rates a given item above as the measure, i.e. if they rate the i:th shared item with rating r, use

v_i = (#items rated r) / 2 + sum j=0 to r-1: #items rated j

Do we really want negative correlation?

Is it normal to use x_i² to denote sum_i x_i²? We tend to use x_. to denote the mean over an index, and occasionally use the horrible syntax of adding a 2 for the mean square.
Oct 5, 2007 4:00 AM

Offline
Jun 2007
82
Ok, I am very confused now. Take a look at this, and decide for yourself. The scores after the | are normalized about the mean.
Oct 5, 2007 4:56 AM
Offline
Oct 2007
9
Is the Sqrt(x_i² - x*) (which I presume means Sqrt(sum(x_i²) - x*)) a typo, or is that precisely what you use? It should be Sqrt(sum (x_i - x*)²). If you want to use a shorter notation, we use S_xy to denote the sum
Sum i: (x_i - x*)(y_i - y*)
and consequently S_xx denotes the sum of squares. It yields the neat S_xy / Sqrt(S_xx S_yy).

P.S. perhaps try with a smaller dataset to begin with?

The Mahalanobis metric is also interesting,
http://en.wikipedia.org/wiki/Mahalanobis_distance .
Oct 5, 2007 7:54 AM
Offline
Oct 2007
9
How does the current system work? Difference in mean?

In that case the proposed method seems a lot better, but it doesn't feel quite right either. Does it not have the effect that the angle depends on the number of shared anime? This is not something we want.
The current rating system is probably all but linear. However, it seems that the top scores are stationary across users while the scale from there varies. If every user has lots of ratings, one could use, instead of the rating, the number of items a user rates a given item above as the measure, i.e. if they rate the i:th shared item with rating r, use

v_i = (#items rated r) / 2 + sum j=0 to r-1: #items rated j

Do we really want negative correlation?

Is it normal to use x_i² to denote sum_i x_i²? We tend to use x_. to denote the mean over an index, and occasionally use the horrible syntax of adding a 2 for the mean square.

Sorry, obviously I meant the fraction of anime it is rated above, i.e.
v_i = ((#items rated r by U) / 2 + sum j=0 to r-1: #items rated j by U) / #items rated by U
where U is the user.
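
In code, it would be something like this (an untested sketch; the helper name is made up):

// value for a rating r: the fraction of the user's items rated below r,
// plus half the fraction rated exactly r
function rankValue(allRatings, r) {
    var below = 0, equal = 0;
    for (var i = 0; i < allRatings.length; i++) {
        if (allRatings[i] < r) below++;
        else if (allRatings[i] === r) equal++;
    }
    return (below + equal / 2) / allRatings.length;
}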
Oct 5, 2007 8:23 AM

Offline
Jun 2007
82
fuso said:
Is the Sqrt(x_i² - x*) (which I presume means Sqrt(sum(x_i²) - x*)) a typo, or is that precisely what you use? It should be Sqrt(sum (x_i - x*)²). If you want to use a shorter notation, we use S_xy to denote the sum
Sum i: (x_i - x*)(y_i - y*)
and consequently S_xx denotes the sum of squares. It yields the neat S_xy / Sqrt(S_xx S_yy).

P.S. perhaps try with a smaller dataset to begin with?

The Mahalanobis metric is also interesting,
http://en.wikipedia.org/wiki/Mahalanobis_distance .


Sorry for the mistake, I have corrected the formula.
Oct 5, 2007 9:23 AM

Offline
Jun 2007
82
fuso said:
v_i = ((#items rated r by U) / 2 + sum j=0 to r-1: #items rated j by U) / #items rated by U where U is the user.

What basis do you have for this method of normalization? Is this some standard formula?
Oct 5, 2007 10:14 AM

Offline
Jun 2007
82
I did some more digging, and I think the normalization about the mean, as suggested by aisakku, is the best method. For comparison, see the scores and the graphs.

So I am going to use Pearson's correlation coefficient formula for finding the similarity in my greasemonkey script.
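
For reference, the calculation would look roughly like this (a sketch of the formula, not the script's exact code):

function pearson(x, y) {
    // subtract each user's mean, then take the cosine of the centered vectors
    var n = x.length, mx = 0, my = 0;
    for (var i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= n; my /= n;
    var sxy = 0, sxx = 0, syy = 0;
    for (var j = 0; j < n; j++) {
        sxy += (x[j] - mx) * (y[j] - my);
        sxx += (x[j] - mx) * (x[j] - mx);
        syy += (y[j] - my) * (y[j] - my);
    }
    return sxy / Math.sqrt(sxx * syy);  // in [-1, 1]
}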
Oct 5, 2007 11:38 AM

Offline
Jun 2007
82
Here is how to interpret Pearson's correlation coefficient:

Correlation | Negative | Positive
Small | −0.29 to −0.10 | 0.10 to 0.29
Medium | −0.49 to −0.30 | 0.30 to 0.49
Large | −1.00 to −0.50 | 0.50 to 1.00

-0.09 to 0.09 is "No Correlation", that is, unrelated scores.
abhin4v · Oct 5, 2007 11:49 AM
Oct 5, 2007 11:46 AM
Offline
Oct 2007
9
What basis do you have for this method of normalization? Is this some standard formula?

It is not a question of normalization.
The method you have taken is to measure the similarity between users by the correlation between vectors associated with the users. Your choice of vectors is not motivated, but it is the most convenient and straightforward one given the ad hoc storage and user-input structure. To handle the problems mentioned in this topic, I'm merely suggesting a different choice of vectors which appears to stand a fair chance. The motivation would be that I believe the similarities better fit the underlying model with this choice.

I did some more digging, and I think the normalization about the mean, as suggested by aisakku, is the best method. For comparison, see the scores and the graphs.

How do you evaluate the different methods? Normalization about the mean has the effect of producing correlations closer to 0 and more evenly distributed between positive and negative values, for any domain, but why do you seek this? The only way to evaluate it, I believe, is to try different examples and see which one fits our own fuzzy similarity measure.
What are the x axes on the graphs?
fuso · Oct 5, 2007 11:51 AM
Oct 5, 2007 12:00 PM

Offline
Jun 2007
82
fuso said:
What basis do you have for this method of normalization? Is this some standard formula?

It is not a question of normalization.
The method you have taken is to measure the similarity between users by the correlation between vectors associated with the users. Your choice of vectors is not motivated, but it is the most convenient and straightforward one given the ad hoc storage and user-input structure. To handle the problems mentioned in this topic, I'm merely suggesting a different choice of vectors which appears to stand a fair chance. The motivation would be that I believe the similarities better fit the underlying model with this choice.

I did some more digging, and I think the normalization about the mean, as suggested by aisakku, is the best method. For comparison, see the scores and the graphs.

How do you evaluate the different methods? Normalization about the mean has the effect of producing correlations closer to 0 and more evenly distributed between positive and negative values, for any domain, but why do you seek this? The only way to evaluate it, I believe, is to try different examples and see which one fits our own fuzzy similarity measure.
What are the x axes on the graphs?


I agree with you. We need to evaluate more models to find the best one. The x axes on the graphs are just the serial numbers of the series (see the link to the scores). I'll try your formula and compare the results.
Oct 5, 2007 5:29 PM

Offline
Aug 2006
386
fuso said:

How do you evaluate the different methods? Normalization about the mean has the effect of producing correlations closer to 0 and more evenly distributed between positive and negative values, for any domain, but why do you seek this? The only way to evaluate it, I believe, is to try different examples and see which one fits our own fuzzy similarity measure.
What are the x axes on the graphs?


The reason I said to evaluate based upon normalization about the mean is because of the initial point made that similarity should not be determined on the basis of the magnitude of the score. If you are not normalizing numbers based upon the user's own mean score, then you are directly comparing the two scores together, and even if you are setting up a vector, you are comparing their scoring directly rather than searching for a similar scoring pattern.

Edit:
Using a mean score of 5 sets up a decent correlation of how closely two users score in general, but that is the same as the current scoring system, just a slightly more accurate reading. Normalizing numbers based upon the two users' mean scores finds how similar the users' scoring patterns are.

Keep in mind as well: 1 and 10 are not always the low and high ends.
Normalization about the mean should be done so that the user's lowest given score is the lowest possible score in their range, and the same goes for the top. Assume a top score of 9, a low score of 3, and a mean of 6. The numbers should be normalized into vector form such that 3 = -1, 6 = 0, and 9 = 1.
aisakku · Oct 5, 2007 5:34 PM
#dontcare
Oct 6, 2007 12:16 AM
Offline
Oct 2007
9

The reason I said to evaluate based upon normalization about the mean is because of the initial point made that similarity should not be determined on the basis of the magnitude of the score. If you are not normalizing numbers based upon the user's own mean score, then you are directly comparing the two scores together, and even if you are setting up a vector, you are comparing their scoring directly rather than searching for a similar scoring pattern.

Edit:
Using a mean score of 5 sets up a decent correlation of how closely two users score in general, but that is the same as the current scoring system, just a slightly more accurate reading. Normalizing numbers based upon the two users' mean scores finds how similar the users' scoring patterns are.

I well understand the reason why you suggested it, but that does not entail that it necessarily yields good results, since it's still far from perfect. I was merely curious how abhin4v evaluated the schemes and came to the conclusion that "normalization about mean" yields the most natural similarity measures.

Keep in mind as well: 1 and 10 are not always the low and high ends.
Normalization about the mean should be done so that the user's lowest given score is the lowest possible score in their range, and the same goes for the top. Assume a top score of 9, a low score of 3, and a mean of 6. The numbers should be normalized into vector form such that 3 = -1, 6 = 0, and 9 = 1.

If we inspect users, we find that a user's highest-score density is less variable (ranging from 9 to 10 for the best out of hundreds of anime) than the low-score density (ranging from 0 to 6 for the worst anime). Hence your assumption that it is evenly distributed about the mean and linear does not seem to hold.

Also note that it's not always 1 for the highest and -1 for the lowest as your example seems to indicate.
Let v = (3, 8, 10)
Ev, the mean = (3 + 8 + 10) / 3 = 7, hence
v^ = (v - Ev) / |v - Ev| = ((3, 8, 10) - 7) / sqrt((3-7)² + (8-7)² + (10-7)²)
= (-4, 1, 3) / sqrt(16 + 1 + 9) = (-4, 1, 3) / 5.10 = (-0.78, 0.20, 0.59)

Although I'm not certain this is what abhin4v uses.
If one sets the lowest to -1 and the highest to 1, then it has little to do with the mean. My guess is that it doesn't account well for cases where there are only a few extremely low points either.
fuso · Oct 6, 2007 12:41 AM
Oct 6, 2007 1:09 AM

Offline
Aug 2006
386
fuso said:

I well understand the reason why you suggested it, but that does not entail that it necessarily yields good results, since it's still far from perfect. I was merely curious how abhin4v evaluated the schemes and came to the conclusion that "normalization about mean" yields the most natural similarity measures.


If we inspect users, we find that a user's highest-score density is less variable (ranging from 9 to 10 for the best out of hundreds of anime) than the low-score density (ranging from 0 to 6 for the worst anime). Hence your assumption that it is evenly distributed about the mean and linear does not seem to hold.

Also note that it's not always 1 for the highest and -1 for the lowest as your example seems to indicate.
Let v = (3, 8, 10)
Ev, the mean = (3 + 8 + 10) / 3 = 7, hence
v^ = (v - Ev) / |v - Ev| = ((3, 8, 10) - 7) / sqrt((3-7)² + (8-7)² + (10-7)²)
= (-4, 1, 3) / sqrt(16 + 1 + 9) = (-4, 1, 3) / 5.10 = (-0.78, 0.20, 0.59)


I think you're slightly misunderstanding what I said.

The score normalization will not give equal value to things above and below the mean when converting them to normalized numbers for the vectors, or whatever you plan to use them for. I never said it would, nor attempted to make them even. There's no reason to do so.

If you wish, things can be measured in terms of the mode score, with the logic that the most frequently occurring number would be a better representation of that user's perception of "average."

Keep in mind: my earlier suggestions were merely that, suggestions, with no further testing on the subject. I was theorycrafting, as I still am at this point. Logically thinking, however, my idea gives a better representation of scoring patterns than directly comparing two users' scores.

I merely added the maximum and minimum range because it only makes sense. Most users will use the value of 10, but there may be some who never score an anime 1. If they never give a 1, the score of 1 should not be included in that part of the equation. If scores above or below a certain range are never used, they should not be included. This is for the sake of accuracy.

The numbers I gave were merely an example and were not plugged into any equation at all. I simply used them in line with the vector math example given by abhin4v a while ago. Don't overthink them.

I never said my method would be the perfect way to do things. It is, however, a more logical way to find similar patterns in scoring than the others suggested.
#dontcare
Oct 6, 2007 1:18 AM

Offline
Jun 2007
82
fuso said:
Hence your assumption that it is evenly distributed about the mean and linear does not seem to hold.

I agree with you on this point. The Pearson coefficient gives the best results for a normal distribution, but I guess the scores do not have a normal distribution.
fuso said:
Although I'm not certain this is what abhin4v uses.

That is exactly what I am doing. Do this and then take the dot product of the vectors.
aisakku said:
Using a mean score of 5 sets up a decent correlation of how closely two users score in general, but that is the same as the current scoring system, just a slightly more accurate reading. Normalizing numbers based upon the two users' mean scores finds how similar the users' scoring patterns are.

This is correct. As you can see in the graphs, normalization about the mean brings the score curves closest together, and then you can match the pattern of the scores most easily.

I agree that there may be better methods of finding the correlation, but I guess Pearson's coefficient gives a pretty good measure. It's the most used formula for finding correlation. And I think both fuso and aisakku agree that it's much better than the current system.

@fuso: Can you show some sample calculation based on the formula you proposed?
Oct 6, 2007 9:55 AM
Offline
Oct 2007
9
I think you're slightly misunderstanding what I said.

I'm sorry, it may have sounded a bit harsh. I was asking abhin4v for the reason normalization about the mean was better, and your response suggested you were trying to make an objective case for this, beyond the obvious intuition behind it. My reply thus had to be of a rather strict nature, suggesting that your arguments were merely motivations, and that there was not much of a guarantee it is as good as I had interpreted you to be claiming/suggesting.

This has left me slightly confused, though: could you give an equation for what it was you suggested, if it wasn't the equation I listed in the previous post? I was also uncertain whether you wanted to normalize about the mean or merely shift and rescale so that the lowest value receives the normalized value -1 and the highest value 1. Unless you suggest something at least partially nonlinear, these two cannot work together.

(1) v_i' = (v_i - min v_i) / (max v_i - min v_i)

alternatively,

(2) v_i' = 2(v_i - min v_i) / (max v_i - min v_i) - 1

= (2v_i - max v_i - min v_i) / (max v_i - min v_i)

ignoring degeneracies.
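
As a quick check of eq. (2) on the 3/6/9 example above (a throwaway snippet):

function rescale(v, min, max) {
    return (2 * v - max - min) / (max - min);
}
rescale(3, 3, 9);  // -1
rescale(6, 3, 9);  //  0  (only because the mean happens to lie midway)
rescale(9, 3, 9);  //  1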


The score normalization will not give equal value to things above and below the mean when converting them to normalized numbers for the vectors, or whatever you plan to use them for. I never said it would, nor attempted to make them even. There's no reason to do so.

This was to point out a deficiency in normalization about the mean.


Keep in mind: my earlier suggestions were merely that, suggestions, with no further testing on the subject. I was theorycrafting, as I still am at this point.

I'm not blaming you for anything; you've been very helpful and have made significant contributions to this thread. I'm not claiming that I have done anything of significance for this topic. My comments are merely, I hope, constructive criticism. Please do tell me, by PM or otherwise, if you do not agree.

Logically thinking, however, my idea gives a better representation of scoring patterns than directly comparing two users' scores.

Perhaps I'm being thickheaded here, but how so? It's surely motivated, but how can we know it's actually an improvement? I don't quite like the term "logical" in this and the last sentence's context.

I merely added the maximum and minimum range because it only makes sense. Most users will use the value of 10, but there may be some who never score an anime 1. If they never give a 1, the score of 1 should not be included in that part of the equation. If scores above or below a certain range are never used, they should not be included. This is for the sake of accuracy.

I do not think this is terribly robust when comparing a user who has rated 200 items, all in [4,10], with a user with exactly the same data except that he happened to rate one item 1.

The numbers I gave were merely an example and were not plugged into any equation at all. I simply used them in line with the vector math example given by abhin4v a while ago. Don't overthink them.

They happened to be values for which a rescaled normalization about the mean and the -1/1 min/max rescaling (eq. (2)) yield the same results. This obviously made me suspicious and confused.


abhin4v said:
@fuso: Can you show some sample calculation based on the formula you proposed?

It won't do any good till I know what you're looking for and can see similar calculations for the other schemes.

P.S. What is the quote syntax here? I tried lsbrace quote = abhin4v rsbrace, but then it obfuscated my post by replacing some lsbrace quote rsbrace with lsbrace quote= rsbrace and adding copies of certain text segments to others, such as adding the end of one fairly large text segment to the beginning of the next.
fuso · Oct 6, 2007 11:16 AM
Oct 6, 2007 12:17 PM

Offline
Jun 2007
82
Wow, such a long post, how much time did it take to write? :)
As I wrote, normalization about the mean may not be the best method, but it gives reasonably accurate results, is better than the current system, is easy to implement, and is one of the most used methods. There may be better methods, but I am not a Math/Statistics scholar (and I'd rather watch anime than research the net :) ).
Anyway, I have updated my greasemonkey script (which started all this) to use normalization about the mean rather than about 5. And after checking many user profiles, I think it is giving correct results. Here is the link to the script:
http://userscripts.org/scripts/show/12746

@fuso: to see the quote syntax, click on the "quote" link below any post.
Oct 6, 2007 1:10 PM
Offline
Oct 2007
9
Wow, such a long post, how much time did it take to write? :)

It looks like a lot but it isn't. I have a tendency to write long posts however.


As I wrote, normalization about the mean may not be the best method, but it gives reasonably accurate results, is better than the current system, is easy to implement, and is one of the most used methods. There may be better methods, but I am not a Math/Statistics scholar (and I'd rather watch anime than research the net :) ).

What makes you say these results are more accurate than the alternatives?


I used the same syntax as the quote button yields.
Nov 30, 2007 8:21 PM

Offline
Nov 2007
694
I'm bumping this because I'm really interested in this method of scoring, and I think that even if it's not the 100% best, most perfect way of comparing similar tastes in anime, we can all agree that it's superior to the current system we have.

Whether this gets implemented or not, I'll be using this system instead. I think it does a better job at comparison... I've already noticed a few instances where the old system said I have "medium" similarity with someone even though our scoring patterns were much different.

Anyway, that's my 2 cents on the matter :) I doubt I can help at all on the math end as I've only taken first year stats and no vector math yet.
Nov 30, 2007 8:30 PM

Offline
Jun 2007
82
CanadaAotS said:
I'm bumping this because I'm really interested in this method of scoring, and I think that even if it's not the 100% best, most perfect way of comparing similar tastes in anime, we can all agree that it's superior to the current system we have.

Whether this gets implemented or not, I'll be using this system instead. I think it does a better job at comparison... I've already noticed a few instances where the old system said I have "medium" similarity with someone even though our scoring patterns were much different.

Anyway, that's my 2 cents on the matter :) I doubt I can help at all on the math end as I've only taken first year stats and no vector math yet.

I am glad you liked it. As soon as I get some time on my hands, I'll look into the algorithm and try to make it more reliable.
Nov 30, 2007 11:02 PM

Offline
Feb 2007
5481
:D

way to go everyone.

just sort of throwing in my 'good luck!'
:3

I understood parts of the math, but got bored reading it >.> (I'm just a high school student so pah)
Dec 1, 2007 12:05 AM

Offline
Nov 2007
694
Well, it's funny because way back on the first page Xinil pretty much agreed to implementing this... but then the thread sorta veered off and died :P

Edit: Wow! I found someone who had 0.83 compatibility on the regular system, and 80% on the new one. Highest I've seen so far lol. I really like this :] lol.
Voltlighter · Dec 2, 2007 12:53 AM
Jan 5, 2008 8:27 PM

Offline
Nov 2007
694
Well still nothing. You there Xinil?

Xinil said:
Awesome explanation abhin4v. I look forward to seeing what the normalization example gives us. This looks highly likely to be implemented.
Jan 5, 2008 9:49 PM

Offline
Jun 2007
82
CanadaAotS said:
Well still nothing. You there Xinil?

Xinil said:
Awesome explanation abhin4v. I look forward to seeing what the normalization example gives us. This looks highly likely to be implemented.


Well, until Xinil finds time to implement it, you can install this greasemonkey userscript http://userscripts.org/scripts/show/12746 and see the CP similarity on shared anime pages.
Jan 5, 2008 9:55 PM
Overlord

Offline
Nov 2004
5752
I wasn't aware there was ever a consensus on any one method.
Jan 5, 2008 11:49 PM

Offline
Jun 2007
82
Xinil said:
I wasn't aware there was ever a consensus on any one method.

I suppose the consensus reached here is that, although this may not be the best method for finding similarity between two users, it is much better than the current method.
Jan 6, 2008 11:32 PM

Offline
Oct 2007
3266
Why not start a poll, and notify all the users about it?
