  1. #21 / 211
    Standard Member RiskyBack
    Rank
    Colonel
    Rank Posn
    #104
    Join Date
    Nov 09
    Location
    Posts
    1190

    IRoll11s wrote: Testing by looking at edge cases confirms that this is a good metric. To wit:

    Dumb-monkey expected wins:

    game size - wins/games - win % - g-rating - h-rating
    2 - 1/2 - 50.0% - 1 - .5 (1/2)
    4 - 1/4 - 25.0% - 1 - .5 (3/6)
    8 - 1/8 - 12.5% - 1 - .5 (7/14)
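
    For reference, those columns appear to be computed like this (a rough sketch of my reading, not IRoll11s's actual script): g-rating is win percentage times game size, and h-rating is effective wins over effective games, where a win in an n-player game counts (n-1)/(n-1) and a loss counts 0/1.

    # Sketch only - reproduces the dumb-monkey table under the assumptions above.
    def dumb_monkey_row(game_size):
        wins = 1                               # a random player wins 1 of n games
        games = game_size
        losses = games - wins
        win_pct = wins / games                 # 50.0%, 25.0%, 12.5%
        g_rating = win_pct * game_size         # 1 for every game size
        eff_wins = wins * (game_size - 1)      # one "win" per player defeated
        eff_games = eff_wins + losses          # each loss adds one effective game
        h_rating = eff_wins / eff_games        # .5 (1/2), .5 (3/6), .5 (7/14)
        return game_size, f"{wins}/{games}", win_pct, g_rating, h_rating

    for n in (2, 4, 8):
        print(dumb_monkey_row(n))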

    Was this a RiskySlam?

    Cobra Commander + Larry - Mo * Curly = RiskyBack

  2. #22 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    I'm pretty sure that over time, both Hugh's system and the SD system will head towards a nominal point, and from there they will move very little because they represent a cumulative lifetime rating. Conversely, the artificial WG Point system can fluctuate quite a bit at any given time because it doesn't "remember" how long it took you to get to your "nominal" point. Also, at least at first, it favors players who play more games. On the other hand, H and SD ratings, and even G ratings, can get you "on the radar" quicker with reasonable accuracy regarding your ability.

    Because G-ratings, H-ratings and SD-ratings are all "lifetime" types of statistical ratings, I think it would be very interesting as more and more people rack up hundreds of games to see these types of ratings expressed graphically.  Something like a last 25 or 50 game moving average would be pretty cool. Of course, this wouldn't be relevant for the standard Point system, so it further points to the need for the types of statistical information we're talking about here.
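
    Just to sketch what I mean by a moving average (I'm making up the record format here, and taking the H-rating to be the effective-wins over effective-games percentage being discussed in this thread):

    from collections import deque

    # Sketch only: H-rating over the most recent `window` finished games.
    # Each record is (game_size, won); a win in an n-player game counts
    # (n-1)/(n-1) effective, a loss counts 0/1.
    def moving_h_rating(games, window=25):
        recent = deque(maxlen=window)
        for record in games:
            recent.append(record)
            eff_wins = sum(size - 1 for size, won in recent if won)
            eff_games = sum(size - 1 if won else 1 for size, won in recent)
            yield eff_wins / eff_games

    history = [(4, True), (2, False), (6, True), (2, True), (4, False)]
    print(list(moving_h_rating(history, window=25)))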

    I don't believe this conversation should result in no action.  Not when there is such clear evidence that over time, G-ratings can be gamed.   Unless someone can point out otherwise, this is not the case with the alternative systems that have been discussed.  I think we're on to something here.

    ..but we won't be happy until there is a "barren" designer feature.
    Edited Fri 2nd Jul 11:55

  3. #23 / 211
    Major General asm
    Standard Member asm
    Rank
    Major General
    Rank Posn
    #20
    Join Date
    Nov 09
    Location
    Posts
    1686

    BlackDog wrote:

    I think we should ignore skewing at low game counts, and focus on eliminating skewing due to different game sizes at high game counts.  Nobody cares about the G rating of someone who just joined the site.

    QFT

    IF YOU ARE SUGGESTING ASM IS A GOOD PLAYER YOU WILL STOP NOW, OR I WILL CALL HR AND I WILL PUT AN END TO IT, FOR THAT IS WHAT IT SOUNDS LIKE TO ME.

  4. #24 / 211
    Premium Member Yertle
    Rank
    Major General
    Rank Posn
    #21
    Join Date
    Nov 09
    Location
    Posts
    3997

    M57 wrote:Not when there is such clear evidence that over time, G-ratings can be gamed.  

    I don't understand how the G-rating can be "gamed", especially over time.


  5. #25 / 211
    Standard Member Vataro
    Rank
    Sergeant
    Rank Posn
    #437
    Join Date
    Nov 09
    Location
    Posts
    574

    I am in agreement with asm.

    Give a man fire and he's warm for a day... but set him on fire and he's warm for the rest of his life.

  6. #26 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    Yertle wrote:
    M57 wrote:Not when there is such clear evidence that over time, G-ratings can be gamed.  

    I don't understand how the G-rating can be "gamed", especially over time.

    I thought we arrived at a consensus that better players who play a disproportionate amount of many-player games (like 5 and 6+) will push their G-rating higher.

    ..but we won't be happy until there is a "barren" designer feature.

  7. #27 / 211
    Premium Member Yertle
    Rank
    Major General
    Rank Posn
    #21
    Join Date
    Nov 09
    Location
    Posts
    3997

    M57 wrote:
    Yertle wrote:
    M57 wrote:Not when there is such clear evidence that over time, G-ratings can be gamed.  

    I don't understand how the G-rating can be "gamed", especially over time.

    I thought we arrived at a consensus that better players who play a disproportionate amount of many-player games (like 5 and 6+) will push their G-rating higher.

    Hmmmm, better players will push their G-rating higher when they play more many-player games...that's right...right?  How's that then "gamed"? That makes sense right?

    I have the lowest G-rating in the Board CP top 10, but that makes sense, since I have played a significant number of 2-player games, which I normally have an advantage in anyhow, since most 2-player maps come down to how many times you have played the board and learned the strategy. If I wanted to increase my G-rating, I should be playing more many-player games, since that advantage normally decreases on boards that hold more people and in games with more people.


  8. #28 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    When I say "gamed", perhaps I'm mis-using the expression.  A better word might be "manipulated".

    While I agree there is logic to the argument that in two-player games the better player is more "in control", the fact remains that the lower the number of players, the more severely capped the potential g-rating is. I would venture that the various advantages of playing in high-#-of-player games outweigh the 1v1 "control" factor.

    I just thought of another reason that 2-player games can bring down a g-rating. This will not be true with all boards, because some are quite fair, but I think it's safe to say that most boards favor the player who moves first. The more severe the board's tilt, the more the better player has to overcome.

    ..but we won't be happy until there is a "barren" designer feature.
    Edited Fri 2nd Jul 13:31

  9. #29 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    M57 wrote:

     ..the various advantages of playing in high-#-of-player games outweigh the 1v1 "control" factor.

    Case in point:  poloquebec has the highest G-rating of any player with more than 100 games. 

    Take a look at those games and draw your own conclusions.

    ..but we won't be happy until there is a "barren" designer feature.

  10. #30 / 211
    Standard Member Hugh
    Rank
    Lieutenant General
    Rank Posn
    #13
    Join Date
    Nov 09
    Location
    Posts
    869

    Yay! Debate, discussion, Denny's new RiskySlam intestine destroyer! (my math threads never go this well!)

    @G-Rating manipulating: No one "manipulates" their G Rating because it isn't what gets you into the top X page. However, as M57 is correctly pointing out, the data is speaking louder than we are. Broad ranges of G Ratings may have people roughly in the right spots, but game sizes are introducing more noise than is necessary.

    @Standard Deviation: I don't have it in me to disagree with keeping variation statistics. In fact, Glicko is preferred by chess players over Elo because of the SD statistics employed in its rating calculations. For something like G Rating, I can see why people prefer one number to follow (simplicity). For percentages, G Ratings and the like, a minimum number of games solves this.

    @M57 even more: I like the idea of using percentiles; I'd like to figure out the best way to do this. Where it gets tricky is that if someone has a >50% win rate, their percentile pushes towards 100% as they play more games. This may not be an issue if you are trying to use the percentile data to calculate a normalized percentage (by comparing, say, 4-player percentiles to 2-player percentiles with the same number of games). The idea fascinates me, because it might be the right one.
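
    To make that concrete, a rough sketch of the percentile-matching idea (all names and data here are made up for illustration; the samples would come from existing player data, restricted to players with a comparable number of games):

    from bisect import bisect_left

    # Sketch only: map a 4-player win % to the 2-player win % at the same
    # percentile within the respective win-% samples.
    def percentile_of(value, sorted_sample):
        return bisect_left(sorted_sample, value) / len(sorted_sample)

    def quantile_of(p, sorted_sample):
        idx = min(int(p * len(sorted_sample)), len(sorted_sample) - 1)
        return sorted_sample[idx]               # nearest-rank, no interpolation

    def normalize_to_two_player(win_pct_4p, all_4p_pcts, all_2p_pcts):
        p = percentile_of(win_pct_4p, sorted(all_4p_pcts))
        return quantile_of(p, sorted(all_2p_pcts))

    # e.g. normalize_to_two_player(0.40, four_player_sample, two_player_sample)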

    @Alpha: I like the idea of a logarithmic scale, also because it might be the right idea, but your data set exactly matched G Rating (as far as I could tell), so I don't know what logarithmic formula you are using.

    @Mongrel: Similar to the aggressiveness thread in that we haven't "solved" the hard problem. However, rating systems can improve without hitting the ideal, and I see many good ideas towards this end.


  11. #31 / 211
    Standard Member Hugh
    Rank
    Lieutenant General
    Rank Posn
    #13
    Join Date
    Nov 09
    Location
    Posts
    869

    IRoll11s wrote: 

    Once again, I'm only clarifying for you, Hugh. Alpha, your log scale has merit, but I believe this is more precise. The only part of Hugh's post I didn't understand was this:

    "...and the opponents were of equal rating to the player. Thus, if two players had the same Global Rating, we could tell who had the higher rated opponents by looking at this score."

    Well, I know I need it!! (the alternative is the 7 page Hugh post that no one should ever read.)

    I should have been way more careful with that statement, since it is really hard to clarify, and its proof utilizes highly idealized conditions that don't occur in practice.  Anyway, here is an attempt:  Given two players who have reached an equilibrium Global rating against a type of opponent (for example, player 1 plays only 1000 rated players, player 2 plays only 1400 rated players), if their global ratings are equal, but their H ratings are not, you can read off who played the stronger opponents (the one with the lower H rating).   The H Rating advantage is that this remains true if you vary game size, but with a G Rating, game size will rear its ugly head.

    I conjecture that the idealization "plays only 1000 rated players" can be replaced by "the opponents' average rating is 1000" via some sort of law of large numbers argument.

    For more on equilibrium ratings, there is the Hugh post that did not go well:

    http://www.wargear.net/forum/showthread/474

    My definition/analysis of equilibrium is given in my very first post in that thread.


  12. #32 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    Hugh, I read your post and though some of the details go over my head, I have a question.

    Would it be possible, given that a g-rating equilibrium exists for each category of game (e.g., 2-player, 3-player, 4, etc.), to find a different multiplicative constant for each of these situations that normalizes them to a 2-player paradigm?

    For instance, let's say a 3 player g-rating gets a .95 constant applied (I'm just pulling numbers out of my @@$), and a 4 player g-rating gets a .90 constant.

    Then a 1.40 3-player rating and a 1.48 4-player rating would each "normalize" to a 1.33 (2-player) g-rating.

    I'm pretty sure it's not that simple because at some point a positive rating would become a negative normalized rating.  But hey, maybe there's some truth to that.

    If there was a "reasonable" constant, sure, it might be possible to achieve 2+ scores, but certainly it would be much less likely.
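
    Taken literally, the mechanics would be something like this (the constants are still numbers pulled out of my @@$, purely for illustration):

    # Sketch only - hypothetical per-game-size constants, not fitted to anything.
    SIZE_CONSTANTS = {2: 1.00, 3: 0.95, 4: 0.90}

    def normalize_g(g_rating, game_size):
        return g_rating * SIZE_CONSTANTS[game_size]

    print(round(normalize_g(1.40, 3), 2))   # 1.33
    print(round(normalize_g(1.48, 4), 2))   # 1.33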

    ..but we won't be happy until there is a "barren" designer feature.

  13. #33 / 211
    Standard Member Hugh
    Rank
    Lieutenant General
    Rank Posn
    #13
    Join Date
    Nov 09
    Location
    Posts
    869

    M57 wrote:

    Would it be possible, given that a g-rating equilibrium exists for each category of game (e.g., 2-player, 3-player, 4, etc.), to find a different multiplicative constant for each of these situations that normalizes them to a 2-player paradigm?

    About to leave - will post an example later, but if I understand you right, this is essentially what the H rating does.  A modified G Rating of the style you suggest would give a number between 0 and 2 that is exactly twice the H Rating.  I like the mindset of the modified G Rating, because then you don't have to base it on the current rating system - you could use, as 11s suggested, existing player data to scale the different game sizes.  Then you could interpret the metric as being "true" relative to the current data set, just as H Rating is "true" relative to the current rating system.


  14. #34 / 211
    Standard Member Hugh
    Rank
    Lieutenant General
    Rank Posn
    #13
    Join Date
    Nov 09
    Location
    Posts
    869

    I guess I didn't understand you: small experiments indicate that a single multiplicative constant translates very poorly. It works well around a certain pair of percentages and then quickly ceases to make any sense.

    I prefer something like what I previously proposed, which does give a correspondence between percentages from different game categories (2 player, 3 player, etc), but does not use a single scaling factor to do so.


  15. #35 / 211
    Standard Member Hugh
    Rank
    Lieutenant General
    Rank Posn
    #13
    Join Date
    Nov 09
    Location
    Posts
    869

    I'm going to post some simple sample calculations in the system I proposed on page 1 of the thread. Hopefully some clarification results. To summarize the system:

    A percentage is generated in which each loss is counted as 0/1, and each win is 1/1 per player defeated in the win. It mimics the calculation of our rating system. The percentage should be thought of as converting multiplayer data into something comparable to two player win percentages.

    Example: You win 2/4 four-player games. In each win you effectively went 3/3, and each loss counts 0/1, so overall you went 6/8, for a score of 0.75.

    The denominators are a bit strange. I refer to the 8 in that example as the number of "effective games".

    We see that my system views winning 50% of your 4-player games as being like winning 75% of your two player games. How should we interpret this?

    Suppose your four games were with opponents of Global Ratings equal to yours. Each player you beat gives you +20 points, each player you lose to subtracts 20 from your rating. In the 2/4 example, the wins gave you 120 points, the losses lost you 40 for a net of +80 points.

    Here is the point: if you play 8 effective 2-player games at a 75% win rate, you should gain +80 rating points. In duel-land this is going 6/8. That also gains you 120, also loses 40, for a net of +80.

    Does the example scale well? If I win 5/10 four player games, will that result in something as good as going 75% in duels? 5/10 converts to 15/20, so yes. In both cases (using 20 effective games), the rating gain is +200.

    Does it work with mixed game sizes? New example: Suppose you go 1/2 in 5-player games and 2/3 in two player games. Effectively, you went 4/5 and 2/3, so 6/8, yielding a score of 0.75. At 0.75 with 8 effective games, we saw that the rating should change +80. The 1/2 in five player gave you +60, and the 2/3 gave you +20, so it works. It is not hard to convince yourself that it will always work.

    We can also relax the requirement of same opponent rating (to same ratio), but the examples were easier this way. There _should_ be less skewing with this statistic, but I don't guarantee 0 skewing.
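
    If it helps, the whole bookkeeping fits in a small sketch (my own reading of the system above, not site code). It reproduces the 2/4 four-player example and the mixed-size example, including the rating check that assumes +20 per player defeated and -20 per loss against equally rated opponents:

    # Sketch only: the "effective games" score and the rating-change check.
    # A win in an n-player game counts (n-1)/(n-1), a loss counts 0/1.
    def effective_score(games):
        eff_wins = sum(size - 1 for size, won in games if won)
        eff_games = sum(size - 1 if won else 1 for size, won in games)
        return eff_wins / eff_games, eff_games

    def rating_change(games, points=20):
        return sum((size - 1) * points if won else -points for size, won in games)

    four_player = [(4, True), (4, True), (4, False), (4, False)]      # 2/4 wins
    print(effective_score(four_player), rating_change(four_player))   # (0.75, 8), +80

    mixed = [(5, True), (5, False), (2, True), (2, True), (2, False)] # 1/2 and 2/3
    print(effective_score(mixed), rating_change(mixed))               # (0.75, 8), +80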


  16. #36 / 211
    Pop. 1, Est. 1981 Alpha
    Rank
    Brigadier General
    Rank Posn
    #61
    Join Date
    Dec 09
    Location
    Posts
    991

    Hugh wrote: Alpha - your suggestion looks like G Rating, but capped at 2? Right?

    Yes, but my data set was wrong; I will correct this later.

    M57 wrote: It follows that these numbers could still be inflated for those who don't play many games, but I think it would be pretty hard to whip up a good rating after 10 or so games, so I would propose that a "G"-rating isn't valid (posted), until 10 games have been played.

    This is the reason I suggested that for any stat we come up with, a player must play _____ games before it is used (otherwise it is just blank); I would suggest >25.  G-rating works this way for individual boards.  You are not assigned a g-rating until you win, which is strange and possibly a bug.

    Hugh wrote: Yay! Debate, discussion, Denny's new RiskySlam intestine destroyer! (my math threads never go this well!)

    @Alpha: I like the idea of a logarithmic scale, also because it might be the right idea, but your data set exactly matched G Rating (as far as I could tell), so I don't know what logarithmic formula you are using.

     Yes, I had worked out different calculations, but apparently I copied the wrong data. I will recreate what I had and try posting again, but I have decided I like the H-rating as it is.  Here are the H-ratings of some sample players:

    Alpha:         1.3948
    Hugh:          1.2927
    ASM:            1.3229
    Norseman:  1.2414
    Yertle:         1.3149

    *edit
    Oops, here are the correct H-ratings:
    Alpha:          .6052
    Hugh:           .7073
    ASM:             .6771
    Norseman:   .7586
    Yertle:          .6851
    Poloquebec: .7460
    BlackDog:     .7773
    Waldo:         .7592
    *end edit

    H-rating is really easy to calculate (+1), more meaningful than G-rating (+1), independent of game size (+1), plus other things, and is good enough for me.

    The standard deviation stat of M57 also has merit, but I think that it will be computationally difficult to maintain.  (I could certainly be wrong, but since I couldn't come up with an easy way to calculate it for myself, I decided it was too computational).

    Edited Sun 4th Jul 22:54

  17. #37 / 211
    Commander In Chief tom
    WarGear Admin tom
    Rank
    Commander In Chief
    Rank Posn
    #763
    Join Date
    Jun 09
    Location
    Posts
    5651

    Not much differentiation between the players though... what's a good rating and what's a bad one?


  18. #38 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    Well, those are all top players. Their ratings should be similar. What would mine look like?

    ..but we won't be happy until there is a "barren" designer feature.

  19. #39 / 211
    Where's the armor? Mongrel
    Rank
    Brigadier General
    Rank Posn
    #53
    Join Date
    Nov 09
    Location
    Posts
    522

    tom wrote: Not much differentiation between the players though... what's a good rating and what's a bad one?

    One that maximizes gloatability.

    Longest innings. Most deadly.

  20. #40 / 211
    Brigadier General M57
    Standard Member M57
    Rank
    Brigadier General
    Rank Posn
    #73
    Join Date
    Apr 10
    Location
    Posts
    5083

    Alpha wrote:

    The standard deviation stat of M57 also has merit, but I think that it will be computationally difficult to maintain.  (I could certainly be wrong, but since I couldn't come up with an easy way to calculate it for myself, I decided it was too computational).

    I was thinking that myself.  There's no recursive function to make things go smoothly.  Every time a game is finished all the numbers for those players need to be re-crunched.  But it sure would be a nice way to assess your performance.

    What if you used your H-numbers, and then did a StDev using those? There wouldn't be as many numbers to crunch, only as many as there are players, and now you're not doing a StDev on expected outcome, but rather on the actual output of the machine. Does that make sense?
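
    One caveat to my "no recursive function" complaint: if the stat is just the mean and StDev of some per-game number, there is a standard incremental update that folds each finished game in without re-crunching the history. The hard part is deciding what per-game number to feed it. A sketch (the per-game scores below are made up):

    from math import sqrt

    # Sketch only: incremental (Welford-style) mean/StDev - three stored values
    # per player, updated once per finished game, no history re-crunch needed.
    class RunningStats:
        def __init__(self):
            self.n = 0
            self.mean = 0.0
            self.m2 = 0.0            # sum of squared deviations from the mean

        def add(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        def stdev(self):
            return sqrt(self.m2 / self.n) if self.n else 0.0   # population StDev

    stats = RunningStats()
    for per_game_score in (0.75, 0.5, 1.0, 0.25):   # made-up per-game numbers
        stats.add(per_game_score)
    print(stats.mean, stats.stdev())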

    ..but we won't be happy until there is a "barren" designer feature.
