#1
Mar 8, 2012 11:28 AM

Offline
Joined: Nov 2010
Posts: 87
I wanted to elaborate a bit on the formulas we used, hence this topic.

First of all, here is an "explanation" of the common notions. This didn't make its way onto MALgraph itself, since it's rather self-explanatory, but I added it here for the sake of completeness. (Sorry for misaligned images.)

  • By we mean the set of all titles owned by a given user.
  • There are also a few functions such as , where ; all of them are pretty much self-explanatory.
  • By we mean that the user didn't rate a given title.



    Score distribution
    Involves: completed, completing, dropped, on-hold; rated and unrated.

    This is the most basic graph. It renders your score distribution in the form of a bar chart. Each bar's width is proportional to the following:



    ...where:



    So it is equal to how many entries have a given score. Technically it is then divided by a normalizing factor and multiplied by a constant so that the bars map to some pixel width.
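    To make that concrete, here is a minimal sketch of the counting step in Python; the entry field names, the set of counted statuses, the normalization against the tallest bar and the 200-pixel maximum width are illustrative assumptions, not the exact implementation:

        from collections import Counter

        def score_distribution(entries, max_width_px=200):
            """Count titles per score (0 = unrated) and scale counts to bar widths.
            Entry fields and the pixel constant are hypothetical."""
            counted = [e for e in entries
                       if e["status"] in ("completed", "completing", "dropped", "on-hold")]
            counts = Counter(e.get("score", 0) for e in counted)   # score -> number of entries
            tallest = max(counts.values()) if counts else 1
            widths = {s: counts.get(s, 0) / tallest * max_width_px for s in range(0, 11)}
            return counts, widths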

    Personally I think the more titles a user has watched, the more this graph should look like a Gaussian curve. (Then again, a lot of users, unlike myself, don't bother to add crap.) In any case, I thought about including the Shapiro-Wilk test, which is used to check whether a distribution differs significantly from the normal distribution. In the end we decided to drop the idea, since it could result in flame wars and wouldn't have a very meaningful result (just a boolean - I don't find booleans very exciting).



    Score to time distribution
    Involves: completed, completing, dropped, on-hold; rated and unrated.

    In my opinion this is the second most basic graph. It renders how much time you spent watching titles that you rated with a given score. What's the point of this? Perhaps our alpha-stage title for this graph will be helpful - "How much time do you waste on crap". As a rule of thumb, . So we use this formula:



    where:



    Whether this should or shouldn't look like a Gaussian curve when you force yourself to watch every is hard to answer and depends on too many factors for me to comprehend (for example, the 1990s were full of long-ass 50+ episode series, which can be interpreted in multiple ways). Probably yes. With 10% significance. :P
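    A sketch of the same idea in Python, assuming each entry carries its watched episode count and per-episode duration (field names are hypothetical):

        from collections import defaultdict

        def time_per_score(entries):
            """Total watch time in minutes, grouped by the score given to the title.
            Unrated titles end up in bucket 0; field names are hypothetical."""
            minutes = defaultdict(float)
            for e in entries:
                if e["status"] not in ("completed", "completing", "dropped", "on-hold"):
                    continue
                minutes[e.get("score", 0)] += e["episodes_seen"] * e["episode_minutes"]
            return dict(minutes)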



    Favorite decades
    Involves: completed, completing, dropped, on-hold; rated only; having aired year specified only.

    So I wanted to know what decade I like the most. (To be honest, I suspected the 1990s because of SM, NGE and DBZ. Nope.) So we made this line chart, which uses the following:



    ...where:



    In other words, we compute the mean score of titles from every decade the user has ever watched anything from.
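    A minimal sketch of that computation, assuming each entry stores its aired year (field names are hypothetical):

        from collections import defaultdict

        def decade_means(entries):
            """Mean score per decade, using only rated titles with a known air year.
            Field names are hypothetical."""
            scores = defaultdict(list)
            for e in entries:
                if e["status"] not in ("completed", "completing", "dropped", "on-hold"):
                    continue
                if not e.get("score") or not e.get("aired_year"):
                    continue
                decade = e["aired_year"] // 10 * 10        # e.g. 1997 -> 1990
                scores[decade].append(e["score"])
            return {d: sum(s) / len(s) for d, s in scores.items()}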

    At first we were also going to add some weighting. We thought about cumulative title duration or the number of titles the user has watched. But after a short while, we realized that there is at least twice as much English-subbed anime available from the 2000s as from the 1990s (not to mention the 1980s). That explains why we didn't like weighting by title count. As for show duration, there are very few shows from the 2000s with 50+ episodes (not counting never-ending singletons like One Piece)... at least compared to the 1990s and 1980s. There was also the idea of combining both of the above: a few lengthy titles from the 1990s vs. lots of short titles from the 2000s should result in a draw, right? Well, as much as it works in theory, it ridiculously violates the KISS principle. So we decided to stick with just mean scores.



    Status distribution
    Involves: completed, completing, dropped, planned, on-hold; rated and unrated.

    This one is rather self-contained and there's not much to talk about. We just count how many titles have a given status and present it in the form of a pie chart.



    Favorite genres
    Involves: completed, completing, dropped, on-hold; rated only.

    This is my favorite. Each genre has an internal score associated with it; we sort genres by that score and present the top and bottom 10 to the user. This is done using:



    ...where:



    So, we take the user's mean score over all titles (), the mean score of all titles from a given genre () and normalize the genre mean using the global mean. So far we have:



    But this is equal to .
    So why do we put it this way? I did it like this because I wanted to make the genre scores (so far equal to ) weighted; I'll get back to that in a moment. Obviously doing something like is bad, since it would scale with respect to 0. So what should I relate to when scaling my scores? The answer comes naturally and yep, it's .

    Now, why the weighting? Imagine that a user has watched 300 titles tagged "demon" and one tagged "angel". Obviously "demons" are more meaningful than "angels". Now, how should we interpret this? We took the following facts into consideration:

  • There are a few big genres like "comedy", "romance" and "sci-fi", and several small ones like "cars", "dementia", etc.
  • If a user keeps watching something, it might mean he still likes it, even if he rates it low.
  • If a user has watched only a few titles from a given genre, he might not be entitled to say it's his favorite genre.
  • On the other hand, if a user has watched only a few titles and rated them ultra-high, this could mean he likes that genre more than one he has 300 titles from.

    Taking into account everything above, we decided to stick with a simple approach: let's use only means (global and genre-specific); then let's make sure that the less the user has watched from a given genre, the less "autonomous" his "opinion" (the genre-specific mean score) is. That's how we ended up with the final formula. The rest was just fine-tuning of .

    (The difference between and is that treats unrated entries like they were rated with ).



    We also considered the following:


    This had many flaws, pointed out in the topic discussion.


    I don't remember exactly why, but in the end it turned out to be rather crappy. Possibly for similar reasons as above.


    This one behaved almost exactly like your everyday arithmetic mean. The part caused a loldivisionbyzero for the best genres.

    Finally, there was this:

    ...and this one:

    ...but these worked exactly opposite to what we wanted to achieve. So yeah, square root.

    As a final note, I'd like to add that even though our internal score distribution for these genres is non-linear, we still display it as if it were linear (imagine the best genre scored 300, the next one 20 and then 19 - they will be displayed successively in 16pt, 15pt and 14pt fonts).



    Episode count chart
    Involves: completed only; rated and unrated.

    Here we come up with a few length thresholds (1 episode, 2-5 episodes and so forth); we count how many titles fall into each group and then present it to the user in the form of a pie chart.
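    A sketch of the bucketing; the thresholds and field names below are illustrative, not the exact cut-off points we use:

        from collections import Counter

        # Illustrative thresholds; the real cut-off points may differ.
        BUCKETS = [(1, 1, "1 ep"), (2, 5, "2-5 eps"), (6, 13, "6-13 eps"),
                   (14, 26, "14-26 eps"), (27, 52, "27-52 eps"), (53, 10**9, "53+ eps")]

        def episode_buckets(entries):
            """Count completed titles per length bucket (field names hypothetical)."""
            counts = Counter()
            for e in entries:
                if e["status"] != "completed":
                    continue
                label = next((lbl for lo, hi, lbl in BUCKETS if lo <= e["episodes"] <= hi), "unknown")
                counts[label] += 1
            return counts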



    Favorite studios
    Involves: completed, completing, dropped, on-hold; rated only; having studio specified only.

    Everything in this chart is computed exactly like in favorite genres.



    Timeline
    Involves: completed only; rated and unrated; having start and finish year specified only.

    Like most of the graphs, here we create thresholds in the form of year-month and count how many titles were completed in a given month. There was an idea to add duration-based weighting, but it would be unintuitive. We may add it some day as another series rendered in the same chart. I considered removing the current/active month from this graph, but I figured some users may find it useful to compare their performance in the current month with the previous one, so I left it intact.
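    The grouping itself is trivial; a sketch, assuming each completed entry carries a finish date (field name hypothetical):

        from collections import Counter

        def timeline(entries):
            """Completed titles per year-month of completion (field names hypothetical)."""
            months = Counter()
            for e in entries:
                if e["status"] != "completed" or not e.get("finish_date"):
                    continue
                months[e["finish_date"].strftime("%Y-%m")] += 1    # e.g. "2012-03"
            return months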



    Suggestions
    Involves: completed only; rated and unrated.

    Not much to talk about. We look at the user's completed titles, sort them by rating and display all related titles that aren't on the user's list. This may change some day and we'll base it on user recommendations, but in that case we'd probably implement another chart instead of removing the existing feature.
  • Modified by rr-, Apr 3, 2012 12:07 PM
     
    #2
    Mar 31, 2012 3:31 AM

    Offline
    Joined: Nov 2010
    Posts: 87
    [edit]This is a response to kFYatek's post, which I accidentally deleted 。・゚・(ノД`)・゚・。. He suggested modifying the formulas for the genre and producer value calculation, namely replacing:

    with:

    and provided a short analysis of how this would work.[/edit]

    Your point is obviously correct: the difference gets amplified, not the score. This is intentional. But, de facto, your suggestion makes things kind of worse. Consider the following:






    According to your suggestion:



    Now this would be interpreted like this... the user loves demons and hates angels as well as humans. In fact, angels are even more hated than humans, even though they received a 7 in contrast to hundreds of 1s. In reality I see it like this: demons are somewhat liked since he keeps watching them, angels are rather meh (but might become a point of interest), humans are mega crap.

    The point is that 'least liked' genres would cease to make sense, since they would contain 'most meh' or 'least watched' genres alongside 'most hated' ones (<-- I added "humans" in order to demonstrate that).

    The first fix that comes to my mind is that we could use two functions, one for calculating the most liked and a second one for the most disliked genres... for example, like this:


    where...


    But this means using two separate distance functions for the same thing. Crazy.

    That's why I designed it the way I did. It isn't perfect, but IMO it's good enough. Disliked genres are below 'zero' (which is the global mean), liked genres are above zero, and we can scale it however we want.



    The more data you have, the better results you get.
    Modified by rr-, Apr 3, 2012 12:17 PM
     
    #3
    Mar 31, 2012 11:10 AM

    Offline
    Joined: Jul 2009
    Posts: 393
    I messaged a few people (too few for it to count as a representative group) asking which version of favorite genres seemed more accurate for them. It ended up in a perfect tie. >_>

    We'll stick with the "old" formula for now. Thanks for the suggestion.
     
    #4
    Mar 31, 2012 7:29 PM

    Offline
    Joined: Apr 2009
    Posts: 146
    Well, I know that the formula I wrote wasn't perfect either. But I already had another proposal in mind. How about this:



    The 10 is there to put it on the same scale as the first term, so that both terms contribute to the final score in equal halves.

    This way, we can also account for unrated titles (a person who watches tons of "demon" titles despite not rating them pretty much loves the genre, doesn't he?).

    I haven't tried it in practice, so maybe some fiddling with it (like adding weights or square-rooting something) would make it better.
     
    #5
    Apr 1, 2012 8:39 AM

    Offline
    Joined: Nov 2010
    Posts: 87
    Again, this greatly punishes groups that are small in size by boosting the others :( However, your post inspired me to think about it some more and this morning I came up with something new.

    So I asked a question: "What should we do when there isn't sufficient data?" The answer: extrapolate it! That's how I came up with this:

    1. Genres that have relatively few titles should tend to the global average score.
    2. Genres that have relatively many titles should tend to their own average score.
    3. [bonus] Treat unrated entries as if they were rated with the global average score.

    It more or less achieves these goals. It doesn't work extraordinarily well with lists containing vast amounts of unrated entries, but... oh well.

    Formally, it goes like this...



    ...where:





    Originally the weight was going to be just:



    ...but I wanted to have some kind of "logarithmic" scale (the more titles, the less difference each additional one makes), hence the square in the final equation. Note that since , we gotta reverse it, or it will have the opposite effect.
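    Written out as a quick sketch (assuming the final value is the weighted mix of the global and genre means with w = 1 - (1 - n/max)^2; the parameter names and example values are just illustrative):

        def genre_score(global_mean, genre_mean, genre_count, largest_genre_count):
            """Blend the genre mean with the global mean; the fewer titles a genre has,
            the more it falls back towards the global mean."""
            ratio = genre_count / largest_genre_count      # in [0, 1]
            weight = 1 - (1 - ratio) ** 2                  # reversed and squared, see note above
            return global_mean * (1 - weight) + genre_mean * weight

        # Example: 10 titles in a genre with mean 7.44, largest genre group of 266 titles,
        # global mean 6.0 -> roughly 6.11
        print(genre_score(6.0, 7.44, 10, 266))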

    After fri checked it extensively, he said it's good, so we're probably going to use it in the next release.
     
    #6
    Apr 1, 2012 1:20 PM

    Offline
    Joined: Jul 2009
    Posts: 270
    It might be a little bit off topic, but am I the only person in this club who has no idea what the people above are talking about? xD
     
    #7
    Apr 1, 2012 7:44 PM

    Offline
    Joined: Apr 2009
    Posts: 146
    @loskierek: Judging from how I'm the only one besides the admins who dared to speak in this topic, I think you're in the majority, in fact.

    @chrupky: First, about my proposal:
    chrupky said:
    Again, this greatly punishes groups that are small in size by boosting the others :(

    I thought this was the idea all along. If someone watches tons of, e.g., mecha despite rating it as crap, he must really love that genre, doesn't he?

    But anyway, thanks for getting inspired by me. We'll see how the updated formula works out in practice. My profile doesn't seem to have been updated yet, so I don't really know. But it certainly feels much better, judging from the calculations on that demon/angel example.

    [EDIT] My profile got updated, and I don't really see too much difference... but well, the results seem more or less reasonable when I look at the tooltips. [/EDIT]

    PS Why did I only just now realize that it's all fellow countrymen in here? ;)
    Modified by Balibuli, Apr 2, 2012 4:54 AM
     
    #8
    Apr 2, 2012 11:03 AM

    Offline
    Joined: Mar 2010
    Posts: 451
    OK, the results yielded by the new formulas seem a lot closer to what I was expecting now. Great job!
     
    #9
    Apr 2, 2012 4:52 PM

    Offline
    Joined: Mar 2009
    Posts: 512
    A clarification on the syntax: is max g' a binding of g' to the largest genre in the genre calculation?

    If that is the case, it's possible that there's an error in your calculation. Using the count and genre mean given in the hover text doesn't give me the same results when I follow your calculation.
    Also, don't you think it might be better to take the square root (or the power of 2/3) of the genre ratio in your weights? As lists grow in length, they'll tend towards common genres such as comedy rather than favourite genres, excessively punishing smaller, less common genres.

    Discard the above if my reading of max g' is incorrect.
     
    Apr 2, 2012 11:02 PM

    Offline
    Joined: Nov 2010
    Posts: 87
    Your reading of g' is correct; the whole max part gets the largest size of any genre group among the user's titles. If your calculations mismatch, it means you're doing it wrong. Note that the largest group might not even be visible (for example, it can have a perfectly average score).

    We have already tested changing the exponent to 0.5 (as well as a few other values). Also, your reading of the power of 2 is probably wrong, since the (1-x) part makes it work like a square root (only rotated and mirrored).
  • How it works now: http://www.wolframalpha.com/input/?i=plot+%281-%281-x%29%5E2%29%2C+x%3D0..1
  • Your suggestion: http://www.wolframalpha.com/input/?i=plot+%28x%5E0.5%29%2C+x%3D0..1
    The graphs above show how group size affects the weight, which affects the final score (x). If weight = 1, x will be equal to the mean score of the user's titles for the given genre; if weight = 0, x will be equal to the user's global average.
    Further special treatment of outliers is going to be awkward. I was mad at the previous models when I kept seeing 'cars' as 'hated' even though I had watched only 'Redline' and 'Tailenders', so yeah.
  • Modified by rr-, Apr 3, 2012 8:38 AM
     
    Apr 3, 2012 5:29 AM

    Offline
    Joined: Mar 2009
    Posts: 512
    That wasn't my intended suggestion.
    My intended suggestion was http://www.wolframalpha.com/input/?i=plot+1-%281-%28x%5E%282%2F3%29%29%29%5E2%2C+x%3D0..1

    Larger lists will always tend towards containing a large number of frequently occurring genres such as comedy and romance, which can be found in conjunction with almost all other genres. The existence of a particularly large genre is unreasonably punishing to other, smaller genres.
    An example from my graph: I have 266 series in the comedy genre. Even if this were not my largest genre, I would need to watch ~75 titles of a genre for its evaluated score to be the midpoint of the global and genre means.
    Considering the Game genre in my list, it has the highest mean of all genres, 1.44 points above my global mean. To get its evaluated score to be the midpoint, I would need to watch 75 of the 83 MAL entries with that genre.


    tl;dr Common genres unreasonably skew results on large lists.


    I'm still getting a calculation mismatch though, and it's irrelevant whether the largest group is visible; if the largest group is even bigger, the result is even more incorrect.
    If I chuck the following into a Haskell interpreter using the values given for the Game genre from my graph
    let w = 1-(1-(10/266))^2 in 6*(1-w) + 7.44*w
    I get 6.10623551359602 as the output, leaving me to consider that either the tooltips give incorrect results or your implementation is wrong, as this doesn't match your evaluated result.
     
    Apr 3, 2012 8:16 AM

    Offline
    Joined: Nov 2010
    Posts: 87
    Your point is? I'll write this again: we are already doing this, using a weighting similar to the one you proposed. See for yourself that it doesn't make much difference:


    There is also this, which takes the above approach to the extreme: http://www.wolframalpha.com/input/?i=plot+%281-%281-x%29%5E2%29%5E0.5%2C+x%3D0..1 (it is basically a circle centered at (1,0) with radius 1):


    We're gonna test them again...

    As for the mismatch... it happens because titles appear multiple times. That's why the value is off by 0.04. Fixed this as well.
    Modified by rr-, Apr 3, 2012 9:16 AM
     
    Apr 3, 2012 9:30 AM

    Offline
    Joined: Jul 2009
    Posts: 393
    So yeah, we tested it again on a few users, starting with 2/3 and going up to 1. It seems that the constant 8/9 yields the best results, boosting small groups a bit while not making it illogical (e.g. there were two series tagged as "Police", and (2/3) boosted it to 2nd place, while (8/9) kept it a little lower, but still higher than before, because those two series were rated with 8s).
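    For reference, a sketch of the weight with the tunable exponent, assuming the constant enters the formula the way suggested above, i.e. w = 1 - (1 - x^e)^2 with e = 8/9 (the group sizes in the example are made up):

        def weight(genre_count, largest_genre_count, exponent=8 / 9):
            """Group-size weight; exponent=1 is the old curve, 2/3 was the suggestion,
            8/9 is the constant we settled on. Smaller exponents boost small groups more."""
            x = genre_count / largest_genre_count
            return 1 - (1 - x ** exponent) ** 2

        # Two "Police" titles against a (made-up) largest group of 300 entries:
        for e in (1.0, 8 / 9, 2 / 3):
            print(e, round(weight(2, 300, e), 3))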
     
    Apr 3, 2012 10:26 AM

    Offline
    Joined: Mar 2009
    Posts: 512
    Thanks for the extra testing.

    Was the mismatch an actual error then? Or a mistake on my end regarding the mean?
     
    Apr 3, 2012 10:43 AM

    Offline
    Joined: Nov 2010
    Posts: 87
    It was an error on our end, but it wasn't very serious; the formula remained the same. (The way we calculated the average score was wrong.)
     
    Aug 29, 2014 6:11 AM

    Offline
    Joined: Mar 2014
    Posts: 18199
    So, what exactly is "Std dev.: 1.62"? (I imagine it stands for score deviation.) I can't make sense of the 1.62, since my mean is 7.19 and MAL's deviation doesn't show anything near that in my completed list (it's around -0.25 with a 7.3 mean; while the completed list isn't everything, the others are hardly significant).
     
    Sep 1, 2014 4:06 PM

    Offline
    Joined: Feb 2010
    Posts: 355
    Std dev is http://en.wikipedia.org/wiki/Standard_deviation
    Basically it shows how different (from your average) your scores are. If you use the full scale (1-10), your std. dev. will be high (close to 2), but if you mostly use only part of it, it will be low (closer to 1).
     
    Sep 2, 2014 2:38 PM

    Offline
    Joined: Mar 2014
    Posts: 18199
    vivan said:
    Std dev is http://en.wikipedia.org/wiki/Standard_deviation
    Basically it shows how different (from your average) your scores are. If you use the full scale (1-10), your std. dev. will be high (close to 2), but if you mostly use only part of it, it will be low (closer to 1).


    So it's only between 1 and 2? I don't see much point in that in the end, but okay.
    (Or maybe I'm just looking at it wrong; I didn't study math in English, so I get really confused about these things.)
     
    Sep 2, 2014 3:43 PM

    Offline
    Joined: Jul 2009
    Posts: 393
    Average value and standard deviation are often both displayed in summaries of statistical data. Std dev tells you how spread out the data is - if it's clustered near the average value, the std dev is small, and if it differs greatly from the mean, it gets bigger.

    If you have both the average and the std dev, you can tell much more about a user's rating habits than with the average alone. A higher std dev value tells you they probably use the whole scale, while a small one coupled with a high average tells you they only use scores 5-10. The std dev bounds for MAL's rating system are 0 for all scores equal, and 6.364 for having exactly two scores: one 1 and one 10.
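    If you want to check the numbers yourself, the plain sample formula is enough (Python's statistics.stdev uses the n-1 version, which is what the 6.364 bound above assumes):

        from statistics import mean, stdev

        print(mean([1, 10]), stdev([1, 10]))     # 5.5 and ~6.364, the extreme case above
        print(round(stdev([7, 7, 8, 8, 9]), 2))  # ~0.84 for a narrow-range list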

    For more visual comparison, look at these:
    http://mal.oko.im/Tipsyrools/rati,anime
    http://mal.oko.im/kvinxtri/rati,anime
    http://mal.oko.im/DraconisMarch/rati,anime
     
    Sep 2, 2014 7:11 PM

    Offline
    Joined: Mar 2014
    Posts: 18199
    fri said:
    Average value and standard deviation are often both displayed in summaries of statistical data. Std dev tells you how spread out the data is - if it's clustered near the average value, the std dev is small, and if it differs greatly from the mean, it gets bigger.

    If you have both the average and the std dev, you can tell much more about a user's rating habits than with the average alone. A higher std dev value tells you they probably use the whole scale, while a small one coupled with a high average tells you they only use scores 5-10. The std dev bounds for MAL's rating system are 0 for all scores equal, and 6.364 for having exactly two scores: one 1 and one 10.

    For more visual comparison, look at these:
    http://mal.oko.im/Tipsyrools/rati,anime
    http://mal.oko.im/kvinxtri/rati,anime
    http://mal.oko.im/DraconisMarch/rati,anime


    Right, thanks for clearing that up, my main problem with it is that I thought it had to do with my mean (as in, having a 1.5 std dev would be something like +-0.75 for my mean, like how MAL gives you score dev in each list) but that was just me confusing terms.
     
    Sep 3, 2014 2:17 AM

    Offline
    Joined: Feb 2010
    Posts: 355
    Paulo27 said:
    So it's only between 1 and 2? I don't see much point in that in the end, but okay.
    (Or maybe I'm just looking at it wrong; I didn't study math in English, so I get really confused about these things.)
    Another relevant link - http://en.wikipedia.org/wiki/Normal_distribution (you can switch the article to your language).
    Ratings should be close to a normal distribution, and that distribution can be described using only 2 numbers - the average and sigma (std. dev.).

    Paulo27 said:
    Right, thanks for clearing that up, my main problem with it is that I thought it had to do with my mean (as in, having a 1.5 std dev would be something like +-0.75 for my mean, like how MAL gives you score dev in each list) but that was just me confusing terms.
    Well, a 1.5 std dev means that about 68% of the values are in the range (average - 1.5 ~ average + 1.5), 95% are within +-3, and 99.7% within +-4.5. This is only exactly true when the ratings are close to a normal distribution and are real values (not integers), but it's still somewhere around that in our case.
     
    Dec 29, 2014 4:53 PM

    Offline
    Joined: Oct 2014
    Posts: 576
    Man, I wanted to know the formula for the weighted score, but the image has expired on your host :(

    Could you please re-upload it or something? :P
    Modified by iN3krO, Dec 29, 2014 5:00 PM
     
    Dec 31, 2014 11:56 AM

    Offline
    Joined: Jul 2009
    Posts: 393
    I don't have the image, but if you want the code, take a look at this: https://github.com/rr-/malgraph4/blob/master/src/ModelUtils/DistributionEvaluator.php
    The magic happens around line 48.
     