7-Mar-2024, 21-Dec-2022, 27-Jun-2021, 28-Dec-2020 | Like this? Dislike this? Let me know |
Note: This is a rant-under-construction. Some parts may change significantly as more thought and work are put into the project.
List the archive:

    aws s3 ls s3://chess-bgn/archive/
    curl https://chess-bgn.s3-us-west-2.amazonaws.com/

Get one file:

    aws s3 cp s3://chess-bgn/archive/blitz/2019/1/G.8.blitz.json.bz2 .

Get a month of blitz games:

    aws s3 cp s3://chess-bgn/archive/blitz/2024/1 . --recursive

Get a month of ALL game types (note: aws s3 cp does not expand wildcards in the source path; use --exclude/--include):

    aws s3 cp s3://chess-bgn/archive/ . --recursive --exclude "*" --include "*/2016/1/*"

Get a year of blitz games:

    aws s3 cp s3://chess-bgn/archive/blitz/2016 . --recursive
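For scripted access, the object keys follow a predictable layout inferable from the commands above. A minimal sketch; the helper name is mine, and the layout is an assumption derived from the examples:

```python
# Sketch: build S3 object keys for the chess-bgn archive layout
#   s3://chess-bgn/archive/<type>/<year>/<month>/G.<n>.<type>.json.bz2
# The helper name is mine; the key shape is inferred from the CLI examples above.

BUCKET = "chess-bgn"

def archive_key(game_type: str, year: int, month: int, n: int) -> str:
    """Return the key for the n-th file of a given game type and month."""
    return f"archive/{game_type}/{year}/{month}/G.{n}.{game_type}.json.bz2"

print(archive_key("blitz", 2019, 1, 8))
# archive/blitz/2019/1/G.8.blitz.json.bz2
```

The returned key can be handed to any S3 client (boto3, the CLI, etc.) along with the bucket name.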
This is data converted from the archive at lichess.org. Data is broken out by game type since comparisons across types (e.g. blitz vs. classical) are unusual, although absolutely still possible. The Jan 2024 files contain almost 100 million games. I originally intended to convert each month but stopped due to zero download activity.
Content is broken out into many smaller, easily digestible bzip2-compressed JSON files of approx. 60MB apiece, each of which decompresses to approx. 1G, the target decompressed size.
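Because Spark's JSON reader expects one document per line (JSON Lines), a single file can also be inspected with nothing but the Python standard library. A sketch, assuming the one-game-per-line layout:

```python
import bz2
import json

def count_games(path):
    """Stream-decompress a G.*.json.bz2 file and count games.
    Assumes JSON Lines layout: one game document per line."""
    n = 0
    with bz2.open(path, "rt") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises if the line is not valid JSON
                n += 1
    return n
```

Streaming through `bz2.open` means the ~1G decompressed content never has to land on disk.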
The compression is effective:
# Wildcard for both game type and the individual files. This
# will pick up all 99m games for Jan 2024:
game_data = 's3://chess-bgn/archive/*/2024/1/G.*.json.bz2'

# ... or all games within just one game type:
game_data = 's3://chess-bgn/archive/blitz/2024/1/G.*.json.bz2'

# ... or all games in 2024:
game_data = 's3://chess-bgn/archive/*/2024/*/G.*.json.bz2'

# ... or just one game file, approx 200,000 to 210,000 games:
game_data = 's3://chess-bgn/archive/blitz/2024/1/G.0.blitz.json.bz2'

# ... or everything ever! Billions of games:
game_data = 's3://chess-bgn/archive/*/*/*/G.*.json.bz2'

df = spark.read.json(game_data)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, DateType)

# This would normally be in util.py or equiv, away from the specific analytic
# but we code it directly here for convenience/clarity:
def get_game_schema():
    a_schema = StructType([
        StructField("nag", IntegerType()),
        StructField("eval", ArrayType(StringType())),
        StructField("clk", ArrayType(StringType()))
    ])
    m_schema = StructType([
        StructField("p", StringType()),
        StructField("f", StringType()),
        StructField("t", StringType()),
        StructField("x", StringType()),
        StructField("castle", StringType()),
        StructField("promo", StringType()),
        StructField("a", ArrayType(a_schema))
    ])
    o_schema = StructType([
        StructField("verbose", StringType()),
        StructField("generic", StringType()),
        StructField("ECO", StringType())
    ])
    p_schema = StructType([
        StructField("handle", StructType([
            StructField("domain", StringType()),
            StructField("value", StringType())
        ])),
        StructField("ELO_blitz", IntegerType()),
        StructField("ELO_bullet", IntegerType()),
        StructField("ELO_rapid", IntegerType()),
        StructField("ELO_classic", IntegerType())
    ])
    g_schema = StructType([
        StructField("site", StringType()),
        StructField("event", StringType()),
        StructField("type", StringType()),
        StructField("result", StringType()),
        StructField("datetime", DateType()),
        StructField("opening", o_schema),
        StructField("moves", ArrayType(m_schema), False),
        StructField("players", ArrayType(p_schema))
    ])
    return g_schema

# This makes a "less than" cascade suitable for numerics, e.g.:
#   F.when(< 10).when(< 20).when(< 30).when(< 40)
# Since order of eval is left to right, we do not need each when expression
# to contain both a min and max value.  vals is an array of monotonically
# increasing values, not necessarily with same intervals, e.g.:
#   makeWhenBuckets(the_col, [ -10, 0, 10, 50, 100, 200, 300, 500, 1000 ])
def makeWhenBuckets(the_col, vals):
    cond = F.when(the_col < vals[0], F.lit("<" + str(vals[0])))
    for n in range(1, len(vals)):
        s = str(vals[n-1]) + "-" + str(vals[n])
        cond = cond.when(the_col < vals[n], F.lit(s))
    cond = cond.otherwise('other')
    return cond

def process(game_data, output_uri):
    with SparkSession.builder.appName("blunders").getOrCreate() as spark:
        game_schema = get_game_schema()
        dfg = spark.read.json(game_data, game_schema)

        # Remember, in BGN moves are "half moves" so a 100 move game
        # limit is an array of 200 in BGN. End point is 201 to pick up
        # 200 itself (as opposed to stopping at 190):
        bkt_list = list(range(10, 201, 10))

        # group by game type, UPPER case piece (black OR white), and bucket
        dfx = dfg\
            .select(dfg.type, F.posexplode(dfg.moves).alias("N", "move"))\
            .filter(F.element_at(F.col('move.a.nag'), 1) == 4)\
            .withColumn("B", makeWhenBuckets(F.col('N'), bkt_list))\
            .groupBy(F.col('type'),
                     F.upper(F.col('move.p')).alias('piece'),
                     F.col('B').alias('bucket'))\
            .agg(F.count(F.col('N')).alias('Nblunders'))\
            .sort(F.col('Nblunders').desc())

        # dfx.show(truncate=False)
        dfx.write.mode("overwrite").json(output_uri)
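The left-to-right `when` cascade is easy to misread in Spark's fluent API; the same bucketing semantics in plain Python make it concrete. A sketch for checking expectations, not part of the pipeline (the function name is mine):

```python
def bucket_label(n, vals):
    """Plain-Python mirror of makeWhenBuckets: return the label of the
    first bucket whose upper bound exceeds n, else 'other'."""
    if n < vals[0]:
        return "<" + str(vals[0])
    for i in range(1, len(vals)):
        if n < vals[i]:
            return str(vals[i-1]) + "-" + str(vals[i])
    return "other"

bkt_list = list(range(10, 201, 10))   # 10, 20, ..., 200
print(bucket_label(5, bkt_list))      # <10
print(bucket_label(37, bkt_list))     # 30-40
print(bucket_label(205, bkt_list))    # other
```

Note the half-open behavior: half-move 37 falls into "30-40" because 37 < 40 is the first true test in the cascade.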
Let's be honest: Portable Game Notation had a great run, but the time has come
for a better way to capture chess action.
Here we will outline Better Game Notation or BGN. BGN has these
design goals:
Moves are the heart of the thing. We will start there and back up into the more pedestrian data elements. Important concepts with moves:
1. e4 e5 2. Nf3 Nf6
0 { "p":"P", "f": "e2", "t": "e4" }
1 { "p":"p", "f": "e7", "t": "e5" }
2 { "p":"N", "f": "g1", "t": "f3" }
3 { "p":"n", "f": "g8", "t": "f6" }
moves = [
  { "p":"P", "f": "e2", "t": "e4" },
  { "p":"p", "f": "e7", "t": "e5" },
  { "p":"N", "f": "g1", "t": "f3" },
  { "p":"n", "f": "g8", "t": "f6" }
]
<moves>
  <move><p>P</p><f>e2</f><t>e4</t></move>
  <move><p>p</p><f>e7</f><t>e5</t></move>
  <move><p>N</p><f>g1</f><t>f3</t></move>
  <move><p>n</p><f>g8</f><t>f6</t></move>
</moves>
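Since the p/f/t triplet carries the whole move, a tiny structural validator is easy to write. A sketch; the function name and rules are mine, derived from the examples above:

```python
import re

SQUARE = re.compile(r"^[a-h][1-8]$")

def valid_move(m):
    """Minimal structural check of a BGN move dict: a piece letter plus
    from/to squares. 'f' may be None to allow drops (e.g. bughouse)."""
    if m.get("p") not in set("PNBRQKpnbrqk"):
        return False
    if m.get("f") is not None and not SQUARE.match(m["f"]):
        return False
    return m.get("t") is not None and bool(SQUARE.match(m["t"]))

print(valid_move({"p": "N", "f": "g1", "t": "f3"}))   # True
print(valid_move({"p": "N", "f": None, "t": "f3"}))   # True (a drop)
print(valid_move({"p": "Z", "f": "g1", "t": "f3"}))   # False
```

This is the kind of check that is awkward in PGN (where legality and syntax are entangled) but trivial in BGN.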
Castling is a king-touch move. In PGN:
n. O-O
n { "p":"K", "f": "e1", "t": "g1", "castle":"K" }
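Because castling is recorded as a king-touch move, the rook's movement is implied and can be reconstructed from the castle flag and the king's rank. A sketch; the helper name is mine:

```python
def implied_rook_move(move):
    """Derive the rook's (from, to) squares implied by a BGN castle move.
    'K' = kingside (rook h-file -> f-file); 'Q' = queenside (a-file -> d-file)."""
    rank = move["f"][1]                # '1' for white, '8' for black
    if move.get("castle") == "K":
        return ("h" + rank, "f" + rank)
    if move.get("castle") == "Q":
        return ("a" + rank, "d" + rank)
    return None                        # not a castle move

print(implied_rook_move({"p": "K", "f": "e1", "t": "g1", "castle": "K"}))
# ('h1', 'f1')
```

Storing only the king's movement keeps the move record single-piece while losing no information.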
BGN explicitly captures additional information for moves. In PGN we might see the following; assume a white bishop is on d4:
13. Qh7 Nxd4+
26 { "p":"Q", "f": "h4", "t": "h7" }
27 { "p":"N", "f": "b5", "t": "d4", "x":"B", "c":2 }
Promotions are fairly straightforward. In PGN we might see
13. f8=Q Bf7 14. Qf4 ...
26 { "p":"P", "f": "f7", "t": "f8", "promote":"Q" }
27 { "p":"B", "f": "d5", "t": "f7" }
28 { "p":"Q", "f": "f8", "t": "f4" }
There is "wiggle room" in the p/f/t fields to handle unusual situations like bughouse. In PGN we might see
14. N@f3 { drop a knight into the mix }
14 { "p":"N", "f": null, "t": "f3"}
The game play and data representations above are objective. But chess action has additional information that is subjective, most obviously the marking of a move as "questionable" (?) or "brilliant" (!!). All such annotations are captured in the a field:
26 { "p":"Q", "f": "h4", "t": "h7", "a": [{"nag":3} ] }
13. Qh7? Bf3?? 14. Nf3! Rc2!!
26 { "p":"Q", "f": "h4", "t": "h7", "a": [ {"id":"AA2","nag":2} ] }
27 { "p":"B", "f": "e4", "t": "f3", "a": [ {"id":"AA2","nag":4} ] }
28 { "p":"N", "f": "f5", "t": "f3", "a": [ {"id":"AA2","nag":1} ] }
29 { "p":"R", "f": "c1", "t": "c2", "a": [ {"id":"AA2","nag":3} ] }
13. Qh7 (AA2 ?, AA7 -)
13 { "p":"Q", "f": "h4", "t": "h7", "a": [ {"id":"AA2","nag":2}, {"id":"AA7","nag":0} ] }
13 { "p":"Q", "f": "h4", "t": "h7", "a": [
     {"id":"AA2","nag":1},
     {"id":"AA7","nag":0},
     {"id":"AA7","date":"2022-03-04", "q":1, "comment":"yeah..."}
   ] }
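The nag integers are most useful when rendered back into the familiar glyphs. The first few codes of the standard PGN Numeric Annotation Glyph table, plus a small renderer; the function name and the annotator-filter idea are mine:

```python
# First few entries of the standard PGN Numeric Annotation Glyph table.
NAG_GLYPHS = {
    0: "",      # null annotation
    1: "!",     # good move
    2: "?",     # mistake
    3: "!!",    # brilliant move
    4: "??",    # blunder
    5: "!?",    # interesting move
    6: "?!",    # dubious move
}

def render_san_suffix(annotations, annotator_id=None):
    """Return the glyph for the first NAG found in a BGN 'a' array,
    optionally restricted to a single annotator id."""
    for a in annotations:
        if annotator_id is not None and a.get("id") != annotator_id:
            continue
        if "nag" in a:
            return NAG_GLYPHS.get(a["nag"], "$" + str(a["nag"]))
    return ""

print(render_san_suffix([{"id": "AA2", "nag": 4}]))          # ??
print(render_san_suffix([{"id": "AA2", "nag": 4}], "AA7"))   # (empty string)
```

Unknown codes fall back to the raw `$n` form, which is how PGN prints NAGs it has no glyph for.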
14. Rd8 { [%clk "0:02:30"] [%eval #-3] } Bf2 { [%clk "0:02:20"] [%eval #-2] } 17. fxe3 Kd7 ( Kd6 { [%clk "0:02:30"] [%eval #-3] the better move...?} ) 18. ...
Per the Enhanced PGN spec, every command must have at least one operand (parameter). In BGN, we model all operands as an array of any type, even if only one operand is required. This keeps access and processing simple and free of "if type == array else ..." logic. For example, lichess.org now adds %clk and %eval commands to game output, which could be modeled this way:
13 { "p":"Q", "f": "h4", "t": "h7", "a": [ { "eval": ["#-3"], "clk": ["0:02:03"] } ] }
13 { "p":"Q", "f": "h4", "t": "h7", "a": [ {"eval": ["M", -3, 0.001], "rushFactor": [{"d1":0,"d2":-0.025,"d3":-1}]} ] }
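The always-an-array convention means operand access is uniform: `[0]` always works, with no type sniffing. For example, a %clk value can be pulled out and converted to seconds; a sketch, helper name mine:

```python
def clk_seconds(ann):
    """Extract the first %clk operand ('H:MM:SS') from a BGN annotation
    object and return it as integer seconds, or None if absent.
    Operands are always arrays, so clk[0] is always safe."""
    clk = ann.get("clk")
    if not clk:
        return None
    h, m, s = (int(x) for x in clk[0].split(":"))
    return h * 3600 + m * 60 + s

print(clk_seconds({"eval": ["#-3"], "clk": ["0:02:03"]}))   # 123
```

Contrast this with parsing `[%clk "0:02:03"]` out of a PGN comment string with a regex.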
13. Qh7 (AA2 alt1 would be Nxc5, followed almost certainly by Bxc5)
13 { "p":"Q", "f": "h4", "t": "h7", "a": [
     {"id":"AA2",
      "alt": {"name":"Knight press",
              "comment": "bla bla bla",
              "moves": [
                {"p":"N", "f":"a4", "t":"c5", "x":"P"},
                {"p":"B", "f":"f6", "t":"c5", "x":"N"}
              ]
             }
     }
   ] }
{
    // Mandatory fields. You need these to describe a complete game:
    "moves" : [ (array as described above) ],
    "result" : "W",

    // Optional but almost always there and super helpful for filtering and
    // performing analytics.
    "type": "Z",
    "players": [
        {"last":"Moschetti","first":"Paul","nickname":"Buzz"},
        {"handle":{"domain":"org.lichess","value":"Gruenblaugelb"},
         "ELO_blitz":1342, "ELO_classic":1654, "USchessID":"12807646"}
    ],
    "datetime" : "2022-09-04T12:30:00Z",   // ISO-8601 in Z time ONLY. No local time!
    "event" : "Descriptive name",
    "site" : "Somewhere on Earth",
    "opening" : {"ECO":"D40",
                 "verbose":"Sicilian Defense: Modern Variations, Anti-Qxd4 Move Order Accepted",
                 "generic":"SICILIAN"},
    "timeControl":"600+5",

    // If no default (id = _) that is OK. If no subinfo at all, it
    // just means the subinfo in the 'a' field cannot be explicitly attributed;
    // this is not the end of the world:
    "subinfo": [ {"id":"_", "info": { "name": {"last":"Hoover"}, "rating":1000}} ],

    // "Domain defined additional attributes"
    "ext": [ {"id": "org.chessvenue.com", "ts": "2022-09-04T12:30:00Z",
              "data": { "temp": 24, "O2_pct":20.6 }} ]
}
Notes on the fields:
How many games played where last name = Moschetti?

    // dotpath thru array yields new array of lastnames:
    MongoDB:  db.games.find({"players.last":"Moschetti"});

    // Same dotpath behavior here:
    SPARK:    df.filter(array_contains(col('players.last'), 'Moschetti'))

    // The [] operator "unwinds" the players array:
    JQ:       jq 'select(.players[] | .last == "Moschetti")' N.game.json

How many games did Moschetti win playing as white?

    // This time, pick *only* [0] from players to get white player:
    MongoDB:  db.games.find({"result":"W","players.0.last":"Moschetti"});

    // SPARK array functions are 1-based not 0-based:
    SPARK:    df.filter( (element_at(col('players.last'),1) == lit('Moschetti'))
                         & (col('result') == lit('W')) )

    // Nearly the same as MongoDB:
    JQ:       jq 'select(.result == "W" and (.players[0].last == "Moschetti"))' N.game.json
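The same two queries can be expressed over plain Python dicts, which is a handy way to sanity-check the MongoDB/SPARK/JQ results on a handful of games. A sketch; function names are mine:

```python
def games_by_lastname(games, last):
    """Any player slot matches (analog of the dotpath-through-array queries)."""
    return [g for g in games
            if any(p.get("last") == last
                   for p in g.get("players", []) if isinstance(p, dict))]

def wins_as_white(games, last):
    """Only index 0 (white) matches, and the game result must be 'W'."""
    return [g for g in games
            if g.get("result") == "W"
            and g.get("players")
            and isinstance(g["players"][0], dict)
            and g["players"][0].get("last") == last]

sample = [
    {"result": "W", "players": [{"last": "Moschetti"}, {"last": "Smith"}]},
    {"result": "B", "players": [{"last": "Jones"}, {"last": "Moschetti"}]},
]
print(len(games_by_lastname(sample, "Moschetti")))   # 2
print(len(wins_as_white(sample, "Moschetti")))       # 1
```

The `isinstance` guards anticipate the string-identifier special case for players described next.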
As a special case, for space and reference-data lookup efficiency, the players array can contain a string identifier instead of the object of rich information at either or both of index 0 and 1:
"players": [ "B76235R", "Gruenblaugelb" ]
db.chessdata.aggregate([
    // Step 1: Filter for only the dates we want, which should cut down a LOT of the material:
    {$match: {$and: [
        {"bgn.eventDate":{$gte:new ISODate("1960-01-01")}},
        {"bgn.eventDate":{$lt:new ISODate("1980-01-01")}}
    ]}}

    // Step 2: Use the $reduce function to "walk" the moves array and sniff out at what point,
    // if ever, the castle occurs. We only need to check up to the first 5 (or 10 or 15) moves
    // OR the max length of the moves array, whichever is shorter:
    ,{$project: {X: {$reduce: {
        input: {$range:[0, {$min:[ {$size:"$moves"}, 5 ]} ]},
        initialValue: [],
        in: {$let: {
            // $$this is the sequential int generated from $range in the input
            vars: { ee: {$arrayElemAt:["$moves","$$this"] } },
            // The following translates to: "if the castle field value is true, then append to the
            // ever-growing $$value array a new array of one containing the offset where it was
            // found, else append a ZERO length array -- essentially a noop":
            in: {$concatArrays: [ "$$value",
                {$cond: [ {$eq:["$$ee.castle",true]}, ["$$this"], [] ]}
            ]}
        }}
    }} }}

    // Step 3: The $reduce function can leave us with an empty -- but non-null! -- array, so
    // lastly filter those out:
    ,{$match: {$expr: {$ne:[0,{$size:"$X"}]} }}
]);
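Stripped of aggregation-pipeline machinery, the $reduce stage is just "find the index of the first castle within the first N half-moves." The same logic in plain Python makes the intent obvious (a sketch; note it tests the castle field for truthiness, covering both the `"castle":"K"` form shown earlier and a boolean flag):

```python
def first_castle_offset(moves, limit=5):
    """Return the index of the first castle within the first `limit`
    half-moves, or None -- the plain-Python analog of the $reduce stage."""
    for i, m in enumerate(moves[:limit]):
        if m.get("castle"):
            return i
    return None

moves = [
    {"p": "P", "f": "e2", "t": "e4"},
    {"p": "p", "f": "e7", "t": "e5"},
    {"p": "K", "f": "e1", "t": "g1", "castle": "K"},
]
print(first_castle_offset(moves))   # 2
```

In MongoDB the loop-with-early-exit has to be emulated with $range/$reduce/$concatArrays; in a general-purpose language it is three lines.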
Beyond MongoDB, BGN rendered as bz2 compressed JSON can be stored on AWS S3 and accessed by SPARK and other scalable subsystems to solve very large scale data analysis problems. AWS SPARK drivers are S3 (obviously) and bz2 optimized; no need to decompress to perform analytics! And in a recent test, the 33.2G of Jan-2022 zstd compressed lichess PGN data transformed into a set of bz2 compressed JSON files totaled 27G -- almost 19% smaller and much more practical to work with.
26 { "p":"Q", "f": "h4", "t": "h7", "x":"N", "et":1234 }
27 { "p":"B", "f": "e4", "t": "f3", "et":1256 }