I'll start with a pro: It is a fact that if your win% for 3-player games is > 1/3, 4-player win % is > 1/4, and so on, your G Rating is > 1. Generally, we should expect players with high win percentages to have higher G Ratings. The description from the help file is remarkably accurate:
"G rating is a normalized rating based on how many games you expect to win on a game of a particular size." _of a particular size_ is required for the stat to maintain its meaning.
However, the calculation of the G rating mixes games of different sizes and here it goes very wrong. Winning 100% of your duels would be truly remarkable (from a percentage standpoint), yet it is equivalent to winning only 40% of your 5 player games, which many players achieve. 100% two player and 40% five player both yield a 2.0 G Rating.
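To make that concrete, here is a minimal sketch of the G Rating as the help file describes it (actual wins divided by the wins a random player would expect, 1/n per n-player game; the function name is mine, not the site's):

```python
# A minimal sketch of the G Rating as described: actual wins divided by the
# number of wins a random player would expect (1/n per n-player game).
# The function name is mine, not the site's.

def g_rating(games):
    """games: list of (game_size, won) pairs."""
    expected_wins = sum(1 / size for size, _ in games)
    actual_wins = sum(1 for _, won in games if won)
    return actual_wins / expected_wins

# Winning 100% of duels and 40% of 5-player games both come out at 2.0:
duels = [(2, True)] * 10                    # 10 wins in 10 duels
fives = [(5, True)] * 4 + [(5, False)] * 6  # 4 wins in 10 five-player games
print(round(g_rating(duels), 2))  # 2.0
print(round(g_rating(fives), 2))  # 2.0
```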
I think the a priori criticisms are too easy, so I decided to see what the skewings among the top 10 would be if you deleted their two player games. In this data set, EVERY SINGLE PLAYER'S G RATING WENT UP AFTER DELETING THEIR TWO PLAYER GAMES!!!!
The skewings: +.13, +.17, +.19, +.12, +.47, +.23, +.32, +.09, +.07, +.38
The large plusses usually occurred with players who play a high proportion of duels.
It was suggested that over time the skewing should go to zero. In fact, it will not if over time a player maintains the same proportion of duels being played. Of course it is not unique to duels. I confidently conjecture that if we deleted 6-10 player games from (good) players who play a lot of those games, that their G Rating would drop.
The intent of the G Rating was well expressed by BlackDog: A low G Rating with a high ranking means that I generally play higher rated players, while a high G Rating with a high ranking means that I play lower rated players.
I think this is the true intent of the G Rating: Attach a numerical value, something like a percentage, that ignores the difficulty of the opposition and yet makes sense across different sized games.
I have a real hard time saying what this _truly_ should be, but I do have a suggestion (to be posted later), that has predictive value within our current ranking system.
I sort of agree with you; let me see if I can restate it, and you tell me if I understand your objection.
The G-rating is there to give a 'normalized' rating with regard to the number of players in a game, and should give a value indicative of how much better a given person is doing compared to how many games they 'should' win if all players were dumb automatons who rolled randomly. Thus winning 50% of 2-player games is treated the same as winning 1/16th of 16 player games.
While that is true, your objection is based upon the fact that an elite player will win proportionally more games as the number of players increases, which I believe is true. Winning 1/8th of 16 player games is easily doable, while winning 100% of your duels is impossible. The reasons for this are many, some are more obvious than others, and ultimately the reasons don't matter.
Since players are ranked in part by G-rating, it 'punishes' those players who prefer duels and inflates the rankings of those players who prefer large games.
My first instinct would be to attempt to quantify the rate of increase per player added to a game in terms of what the elite players can actually accomplish. I'm thinking along the lines of averaging the actual winning percentage per players/game of the top X G-rated players on the site. From this you could normalize the G-rating to obtain a G-prime rating, which would serve to eliminate the skewings you mentioned.
I agree with 11s. At the very least, where the skewing comes into play is for those top of the line players who can more than double the expected win ratio in games with more than two players. This means you have to win better than 66% of your three player games to break the system. In cases like this, playing only 2 player games would hurt their g-ratings. But even 11's post suggests that it's a bit more pervasive than that.
So what if we topped out the G rating at 2?
Consider for simplicity's sake the kind of player that can win 75% of their two-player games. Could we say this is half-way between average and perfection? Let's give this excellent player a G rating of 1.5 (half-way between 1 and 2.)
Now let’s consider that, all things being equal, the same (half way to perfection kind of guy) should win not 10%, but rather 55% of all ten player games (i.e., somewhere around 11 out of every 20 games played).
Also consider the not-so-good 25% half-way-to-total-loser player who will win 1 out of 20 ten player games (1/2 the expected). His G rating should be .5 for that pathetic effort.
But how do you accomplish this? What would the math look like? What about using 11’s idea and somehow creating a straight percentage score where 50% would be expected and 100% would be perfect?
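Here is one reading of that proposal in code (my reconstruction, not anything official): below the random-play expectation, scale the win rate linearly from 0 to 1; above it, scale linearly from 1 at the expectation to 2 at a 100% win rate. It reproduces the examples above:

```python
# My reconstruction of the capped scale (not an official formula): below the
# random-play expectation 1/n, scale the win rate linearly from 0 to 1; above
# it, scale linearly from 1 at the expectation to 2 at a 100% win rate.

def capped_g(win_rate, game_size):
    expected = 1 / game_size
    if win_rate <= expected:
        return win_rate / expected
    return 1 + (win_rate - expected) / (1 - expected)

print(capped_g(0.75, 2))             # 1.5 -- 75% of duels, half-way to perfection
print(round(capped_g(0.55, 10), 2))  # 1.5 -- 55% of ten-player games
print(round(capped_g(0.05, 10), 2))  # 0.5 -- 1 win in 20 ten-player games
```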
Hugh wrote:The description from the help file is remarkably accurate:
"G rating is a normalized rating based on how many games you expect to win on a game of a particular size." _of a particular size_ is required for the stat to maintain its meaning.
I'm pretty certain I stole that from tom
IRoll11s wrote:
While that is true, your objection is based upon the fact that an elite player will win proportionally more games as the number of players increases, which I believe is true. Winning 1/8th of 16 player games is easily doable, while winning 100% of your duels is impossible. The reasons for this are many, some are more obvious than others, and ultimately the reasons don't matter.
Since players are ranked in part by G-rating, it 'punishes' those players who prefer duels and inflates the rankings of those players who prefer large games.
I'm glad you're here to help me clarify :) I'm stating something stronger. I believe the effect you speak of is real - whatever equivalent performance across game sizes is defined to be, elite players should perform better in games with more players. I believe G Rating to be more broken than that.
G Ratings for particular game sizes compare perfectly well across that game size. I contend that, except around the 1.0 mark, the G Rating does a poor job of comparing performances across game sizes.
It is easy to construct toy examples: Here is one - suppose a player wins 3 consecutive two player games. They have played 3 games, their G Rating is 2.0. As a performance, this is good, 1/8 is the probability of doing that if it's just coin flips. A different player plays an 8 player game and wins. 1/8 is also the probability of hitting that performance randomly. Now the 2nd player plays two more 8 player games (losing both), so that the weighting is the same as the first player and due to the losses, the 2nd player has performed worse. The first player has 2.0 weighted at 3 games, the second player has 2.67 weighted at 3 games, yet the first player clearly outperformed the second player.
It does a poor job across different game sizes is the summary of what I am saying. Examples similar to the above should be constructible even if we restricted to 4-player vs 5-player.
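For anyone who wants to check the arithmetic, here is the toy example under the same wins-over-expected-wins calculation (a sketch; names are mine):

```python
# Sketch of the wins-over-expected-wins calculation (function name is mine):
def g_rating(games):
    expected_wins = sum(1 / size for size, _ in games)
    return sum(1 for _, won in games if won) / expected_wins

p1 = [(2, True)] * 3                 # three straight duel wins
p2 = [(8, True)] + [(8, False)] * 2  # one 8-player win, then two losses
print(round(g_rating(p1), 2))  # 2.0
print(round(g_rating(p2), 2))  # 2.67 -- higher, despite the weaker performance
```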
Hugh wrote:It is easy to construct toy examples: Here is one - suppose a player wins 3 consecutive two player games. They have played 3 games, their G Rating is 2.0. As a performance, this is good, 1/8 is the probability of doing that if it's just coin flips. A different player plays an 8 player game and wins. 1/8 is also the probability of hitting that performance randomly. Now the 2nd player plays two more 8 player games (losing both), so that the weighting is the same as the first player and due to the losses, the 2nd player has performed worse. The first player has 2.0 weighted at 3 games, the second player has 2.67 weighted at 3 games, yet the first player clearly outperformed the second player.
I would say there could be an argument against "the first player clearly outperformed the second player". Player 1 defeated 3 players in the 3 games he played, while player 2 defeated 7 players in the one game he won and then played against another 14 that he lost to. There are times where winning an 8 player game is vastly more difficult than winning a 2 player game (and the same could potentially be said for the opposite, but with the 8 player game you do have "less" of a chance of winning).
I'll concede a point, but defend first. Random move making automata have a 1/8 probability of winning three consecutive two player games. Random move making automata have a 1/8 probability of winning a single 8-player game. There is a symmetry involved in an 8-player game - it isn't any harder for one player versus another (until you factor in skill, starting position, and those things).
The point I'll concede is that I used uniform probabilities as my point of comparison. I don't have a great answer to how best to compare across game sizes. I use uniform probabilities because I don't know where else to turn for comparison. However, no matter what the metric, a person winning 50/100 of their 5-player games for a G Rating of 2.5 has not outperformed a player who has won 1000/1000 two player games for a G Rating of 2.0.
I love this thread a lot.
Hugh wrote:
It was suggested that over time the skewing should go to zero. In fact, it will not if over time a player maintains the same proportion of duels being played. Of course it is not unique to duels. I confidently conjecture that if we deleted 6-10 player games from (good) players who play a lot of those games, that their G Rating would drop.
I did suggest something similar (not zero), but concede that the skewing will not go away (maybe this is the reason for my comparatively low g-rating - lots of spies).
I meant to say, and didn't, that if players play a lot of games of all sizes then the skewing is incorporated and not as important, but certainly a good player who only plays games with a lot of players will have a higher g-rating than a good player who only plays dueling maps. (This is well explained by Hugh above.)
Here is my solution, although it is not easy to implement: give each player a number of points on a somewhat logarithmic scale.
For two player games (50% wins is expected):
Win Percentage Points
<6.25% 0
6.25% .125 (12.5% of expected win percentage)
12.5% .25 (25% of expected win percentage)
25% .5 (50% of expected win percentage)
50% 1 (expected)
75% 1.5 (150% of expected win percentage)
87.5% 1.75 (175% of expected win percentage)
93.75% 1.875 (187.5% of expected win percentage)
>93.75% 2
For three player games (33% is expected):
Win Percentage Points
< 4.125% 0
4.125% .125
8.25% .25
16.5% .5
33% 1
49.5% 1.5
57.75% 1.75
61.875% 1.875
> 61.875% 2
For anyone who understands what is above, this should be averageable to give a meaningful stat, as long as you ignore cases like "played one 10 player game and won" (set the minimum number of games of a given size to be ___ before it is used in the stat).
If I am wrong with what I said above, please let me know.
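For what it's worth, here is one way to sketch this in code, under my reading that the break points are the same multiples of the expected win rate (1/n) for every game size, and that a win rate between break points snaps down to the listed one below it (names are mine):

```python
# A sketch of the scale above under one reading (mine): the break points are
# the same multiples of the expected win rate (1/n) for every game size, and
# a win rate between break points snaps down to the nearest listed one.

BREAKS = [(0.125, 0.125), (0.25, 0.25), (0.5, 0.5), (1.0, 1.0),
          (1.5, 1.5), (1.75, 1.75), (1.875, 1.875)]

def scale_points(win_rate, game_size):
    multiple = win_rate * game_size   # win rate as a multiple of expected 1/n
    if multiple > 1.875:              # top of the scale caps at 2
        return 2.0
    points = 0.0                      # below 12.5% of expected scores 0
    for threshold, value in BREAKS:
        if multiple >= threshold:
            points = value
    return points

print(scale_points(0.50, 2))   # 1.0   (expected for duels)
print(scale_points(0.875, 2))  # 1.75  (175% of expected)
print(scale_points(0.5, 3))    # 1.5   (150% of expected for 3-player)
```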
I did mention a suggestion that works towards the goal of having meaning within our ranking system. It is actually a very mild variant of G Rating. It is this:
Each loss is 0/1, but each win is regarded as going 1/1 against each opponent you beat. So, 1/1 in two player wins, 2/2 in three player wins, 3/3 in four player wins etc. We'll refer to the denominator as "effective games" (instead of total games). This may seem strange, but it corresponds precisely to how our rating system is structured.
Does it satisfy the property everyone loved about G Ratings? Yes - if you win 1/4 of your 4-player games, your score is 0.5, 1/5 of your 5-player games is also 0.5. Less gives a number under 0.5, more gives a number greater than 0.5.
It maxes out at 1.0 regardless of game size.
Lastly, equal scores with different game sizes corresponds to equal rating gains. A 0.6 score achieved solely with 4-player games has the same rating gain as a 0.6 score achieved solely with 2-player games provided the same number of "effective games" were played and the opponents were of equal rating to the player. Thus, if two players had the same Global Rating, we could tell who had the higher rated opponents by looking at this score. G rating does not share this property due to the game size skewing.
Disclaimer: The way in which this system compares performances across game sizes may not reflect "the truth" about which was better, but it is compatible with our current rating system.
Alpha - your suggestion looks like G Rating, but capped at 2? Right?
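For concreteness, a sketch of the effective-games score as I read the description (the function name is mine):

```python
# Sketch of the effective-games score (my reading of the description; the
# function name is mine): a win in an n-player game counts as n-1 wins over
# n-1 effective games, a loss as 0 wins over 1 effective game.

def effective_score(games):
    wins = effective = 0
    for size, won in games:
        if won:
            wins += size - 1
            effective += size - 1
        else:
            effective += 1
    return wins / effective

# Winning 1/4 of your 4-player games and 1/5 of your 5-player games both
# land on 0.5, and the score can never exceed 1.0:
print(effective_score([(4, True)] + [(4, False)] * 3))  # 0.5
print(effective_score([(5, True)] + [(5, False)] * 4))  # 0.5
```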
Testing by looking at edge cases confirms that this is a good metric. To wit:
Dumb-monkey expected wins:
game size - wins/losses - avg. - g-rating - h-rating
2 - 1/2 - 50.0% - 1 - .5 (1/2)
4 - 1/4 - 25.0% - 1 - .5 (3/6)
8 - 1/8 - 12.5% - 1 - .5 (7/14)
God of war wins:
game size - wins/losses - avg. - g-rating - h-rating
2 - 2/2 - 100% - 2 - 1 (2/2)
4 - 4/4 - 100% - 4 - 1 (12/12)
8 - 8/8 - 100% - 8 - 1 (56/56)
Average is a poor metric that cannot be combined across games of various sizes unless someone wins all or none of their games, which is why G-rating was established. However, G-rating can only be accurately combined as a comparison when someone wins exactly as many games as a dumb monkey would (G-rating = 1); it starts to skew badly both up and down the skill scale.
Once again I'm only clarifying for you Hugh. Alpha your log scale has merit but I believe this is more precise. The only part of Hugh's post I didn't understand was this:
"...and the opponents were of equal rating to the player. Thus, if two players had the same Global Rating, we could tell who had the higher rated opponents by looking at this score."
This might be because I really have no interest in rankings other than a passing mathematical interest, so I'm not even sure how the Global and Board rankings work. I'm not sure how the H-rating would give any indication of how tough your opponents were, since there is nothing in the calculation that uses the global or board rankings.
I support Alpha's logistic, Hugh's holistic, 11's heuristics and ASM's linguistics.
Mongrel wrote: I support Alpha's logistic, Hugh's holistic, 11's heuristics and ASM's linguistics.
As soon as I finish my exams (my last two ever!!!! °___°) and get my internet connection back, I'll be more into this than I am now (I'm trying to follow this thread, though).
Support the 'istics'!
(=
Ok, we’ve seen two solutions that even the playing field in terms of “capping” the limits of a rating. One ranges from 0-2 and the other 0-100%, both of which I like.
But they still have the disadvantage of not being very helpful when players have not played many games. E.g., monkey joins WG, monkey somehow wins all 2 or 3 games played, and monkey retires with the best possible g-rating.
Why not measure the likelihood of a certain performance in terms of deviation from the norm?
Lucky monkey goes 3/4 playing in 2-player games, putting him in the top 12.5% of all monkeys who play four 2-player games = +1.15 expressed as a standard deviation.
Same monkey then goes 3/4 playing in 3-player games, putting him in the top 1.2% for a SD of +2.26
From here it’s a matter of weighting these performances. Because monkey has played the same number of games in each category in this particular case I’m calling it the mean of these numbers for a composite SD of +1.70.
Consider that a player who goes 2/2 in a pair of 8-player games has a SD of +2.15.
It follows that these numbers could still be inflated for those who don't play many games, but I think it would be pretty hard to whip up a good rating after 10 or so games, so I would propose that a "G"-rating isn't valid (posted) until 10 games have been played.
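The SD idea can be sketched with a plain binomial model (my assumptions; an exact tail calculation or a different approximation would give somewhat different numbers than those quoted above):

```python
# Sketch of the SD idea under a binomial "dumb monkey" baseline (my
# assumptions; the thread's exact figures may come from a different
# approximation, so these values need not match the ones quoted above).
from math import sqrt

def performance_z(wins, games, game_size):
    """z-score of a win count against random play (p = 1/game_size)."""
    p = 1 / game_size
    mean = games * p
    sd = sqrt(games * p * (1 - p))
    return (wins - mean) / sd

# 3 wins out of 4 is a more surprising feat in 3-player games than in duels:
print(round(performance_z(3, 4, 2), 2))  # 1.0
print(round(performance_z(3, 4, 3), 2))  # 1.77
```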
M57 wrote:Ok, we’ve seen two solutions that even the playing field in terms of “capping” the limits of a rating. One ranges from 0-2 and the other 0-100%, both of which I like.
But they still have the disadvantage of not being very helpful when players have not played many games. E.g., monkey joins WG, monkey somehow wins all 2 or 3 games played, and monkey retires with the best possible g-rating.
Why not measure the likelihood of a certain performance in terms of deviation from the norm?
Lucky monkey goes 3/4 playing in 2-player games, putting him in the top 12.5% of all monkeys who play four 2-player games = +1.15 expressed as a standard deviation.
Same monkey then goes 3/4 playing in 3-player games, putting him in the top 1.2% for a SD of +2.26
From here it’s a matter of weighting these performances. Because monkey has played the same number of games in each category in this particular case I’m calling it the mean of these numbers for a combined SD of +1.70.
I like (and agree with) this too, but before we get much further, I'd just like to say that every ranking system will fail some "reasonable criterion". Somebody proved this. Alpha told me about it, which is why I was OK with G-rating.
Improvements to G rating? Sure. Will that rating be "wrong" in some way? Always.
The WF (gasp!) idea was to not count percentages until a certain number of games was reached. Seems to be a decent 4th solution.
In summary: Grip it, rip it, move on. We had a seemingly similar discussion about an aggressiveness stat that ended with.... well, it ended.
Mongrel wrote: every ranking system will fail some "reasonable criterion".
Fair enough. What reasonable criterion does the SD method not cover? We've already determined that the G-Rating can be gamed, which is a pretty strong strike against it.
I think we should ignore skewing at low game counts, and focus on eliminating skewing due to different game sizes at high game counts. Nobody cares about the G rating of someone who just joined the site.
M57 wrote:Fair enough.
HA! Exactly.
Not sure what example would exploit your formula over others, but it probably happens when you weight the deviations together. What the SD approach captures that the other two do not is "sustained dominance"- my guess is that the global ranking is also an attempt to do this. From Hugh's formula, one can extrapolate some crisp ancillary data about quality of opponents which I also like.
Coin flip.