BGN: A Better Chess Game Action Information Architecture

7-Mar-2024, 21-Dec-2022, 27-Jun-2021, 28-Dec-2020 Like this? Dislike this? Let me know

Note: This is a rant-under-construction. Some parts may change significantly as more thought and work are put into the project.

Now Available: BGN files on AWS S3! Updated 7-Mar-2024

    aws s3 ls s3://chess-bgn/archive/
    curl https://chess-bgn.s3-us-west-2.amazonaws.com/

    Get one file:
    aws s3 cp s3://chess-bgn/archive/blitz/2019/1/G.8.blitz.json.bz2 .

    Get a month of blitz games:
    aws s3 cp s3://chess-bgn/archive/blitz/2024/1 .  --recursive

    Get a month of ALL game types (the CLI does not glob S3 paths, so use include/exclude filters):
    aws s3 cp s3://chess-bgn/archive/ . --recursive --exclude "*" --include "*/2016/1/*"

    Get a year of blitz games:
    aws s3 cp s3://chess-bgn/archive/blitz/2016 .  --recursive    
  

This is data converted from the archive at lichess.org. Data is broken out by game type, since comparison between types (e.g. blitz vs. classical) is unusual, although still entirely possible. The Jan 2024 files contain almost 100 million games. I originally intended to convert each month, but stopped due to zero download activity.

Content is broken out into many smaller, easily digestible bzip2 compressed JSON files of approx. 60MB apiece, each of which decompresses to approx. 1GB, the target decompressed size. The compression is quite effective.

You can do big analytics with AWS EMR by using these URIs in your Spark code. Note that the Spark environment on AWS has built-in drivers for S3 which natively support reading bzip2 compressed JSON data!
Run your AWS EMR Spark jobs in region us-west-2 to eliminate data transfer costs!
      # Wildcard for both game type and the individual files.  This
      # will pick up all 99m games for Jan 2024:
      game_data = 's3://chess-bgn/archive/*/2024/1/G.*.json.bz2'

      # ... or all games within just one game type:
      game_data = 's3://chess-bgn/archive/blitz/2024/1/G.*.json.bz2'

      # ... or all games in 2024:
      game_data = 's3://chess-bgn/archive/*/2024/*/G.*.json.bz2'

      # ... or just one game file, approx 200,000 to 210,000 games:
      game_data = 's3://chess-bgn/archive/blitz/2024/1/G.0.blitz.json.bz2'

      # ... or everything ever!  Billions of games:
      game_data = 's3://chess-bgn/archive/*/*/*/G.*.json.bz2'

      df = spark.read.json(game_data)            
  
Here is a sample analytic that answers the question: which pieces experience blunders at which points of the game?

# Imports needed to run the snippets below as a standalone PySpark job:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, DateType)

# This would normally be in util.py or equivalent, away from the specific
# analytic, but we code it directly here for convenience/clarity:
def get_game_schema():
    a_schema = StructType([
        StructField("nag", IntegerType()),
        StructField("eval", ArrayType(StringType())),
        StructField("clk", ArrayType(StringType()))
    ])
    
    m_schema = StructType([
        StructField("p", StringType()),
        StructField("f", StringType()), 
        StructField("t", StringType()),
        StructField("x", StringType()),
        StructField("castle", StringType()),
        StructField("promo", StringType()),
        StructField("a", ArrayType(a_schema))
    ])
    
    o_schema = StructType([
        StructField("verbose", StringType()),
        StructField("generic", StringType()), 
        StructField("ECO", StringType())
    ])

    p_schema = StructType([
        StructField("handle", StructType([
            StructField("domain", StringType()),
            StructField("value", StringType())
            ])),
        StructField("ELO_blitz", IntegerType()),
        StructField("ELO_bullet", IntegerType()),
        StructField("ELO_rapid", IntegerType()),
        StructField("ELO_classic", IntegerType())
    ])    
        
    g_schema = StructType([
        StructField("site", StringType()),
        StructField("event", StringType()),        
        StructField("type", StringType()),
        StructField("result", StringType()),
        StructField("datetime", DateType()),
        StructField("opening", o_schema),

        StructField("moves", ArrayType(m_schema), False),

        StructField("players", ArrayType(p_schema))
    ])

    return g_schema

#  This makes a "less than" cascade suitable for numerics, e.g.:
#    F.when(< 10).when(< 20).when(< 30).when(< 40)
#  Since order of eval is left to right, we do not need each when expression
#  to contain both a min and max value.  vals is an array of monotonically
#  increasing values, not necessarily with same intervals e.g.:
#    makeWhenBuckets(the_col, [ -10, 0, 10, 50, 100, 200, 300, 500, 1000 ])
#
def makeWhenBuckets(the_col, vals):
    cond = F.when(the_col < vals[0], F.lit("<"+str(vals[0])))

    for n in range(1, len(vals)):
        s = str(vals[n-1])+"-"+str(vals[n])
        cond = cond.when(the_col < vals[n], F.lit(s))

    cond = cond.otherwise('other')
                    
    return cond

        
def process(game_data, output_uri):
    with SparkSession.builder.appName("blunders").getOrCreate() as spark:

        game_schema = get_game_schema()
        
        dfg = spark.read.json(game_data, game_schema)

        # Remember, in BGN moves are "half moves" so a 100 move game
        # limit is an array of 200 in BGN.  End point is 201 to pick up 
        # 200 itself (as opposed to stopping at 190):
        bkt_list = list ( range(10,201,10) )
        
        # group by game type, UPPER case piece (black OR white), and bucket
        dfx = dfg\
            .select( dfg.type, F.posexplode(dfg.moves).alias("N","move") )\
            .filter(F.element_at(F.col('move.a.nag'),1) == 4)\
            \
            .withColumn("B", makeWhenBuckets(F.col('N'), bkt_list))\
            \
            .groupBy( F.col('type'), F.upper(F.col('move.p')).alias('piece'), F.col('B').alias('bucket') ).agg( F.count(F.col('N')).alias('Nblunders'))\
            .sort(F.col('Nblunders').desc())
        
        #dfx.show(truncate=False)
        dfx.write.mode("overwrite").json(output_uri)
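Outside of Spark, the same left-to-right cascade can be checked in plain Python. This hypothetical bucket_label helper (illustrative only, not part of any BGN tooling) mirrors the column expression that makeWhenBuckets builds:

```python
# Plain-Python sketch of the left-to-right bucket cascade used by
# makeWhenBuckets above; bucket_label is a hypothetical helper.
def bucket_label(value, vals):
    # First branch: anything below the smallest boundary.
    if value < vals[0]:
        return "<" + str(vals[0])
    # Later branches: the first boundary the value falls under wins,
    # so each label only needs an upper bound.
    for i in range(1, len(vals)):
        if value < vals[i]:
            return str(vals[i - 1]) + "-" + str(vals[i])
    return "other"

bkt_list = list(range(10, 201, 10))
print(bucket_label(5, bkt_list))    # half-move 5  -> "<10"
print(bucket_label(37, bkt_list))   # half-move 37 -> "30-40"
print(bucket_label(250, bkt_list))  # beyond 200   -> "other"
```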
Let's be honest: Portable Game Notation (PGN) had a great run, but the time has come for a better way to capture chess action.
Here we will outline Better Game Notation, or BGN. BGN has these design goals:

  1. Rich-shape structured data instead of newline-dependent, simple key:value pairs and hard-to-parse movetext (SAN plus other items), enabling precise data capture and storage in a variety of formats and subsystems (e.g. MongoDB or Spark)
  2. Explicit piece from-to data completely eliminates ambiguity and (more importantly) permits analysis of moves without having to run a chess engine over the whole game to figure out what is moving. In other words, you can look at move 20 and know instantly what happened without replaying from move 1. This also easily accommodates alternate games like bughouse, or variants with different-sized boards such as Los Alamos Chess.
  3. Explicit piece capture and game events
  4. Flexible to permit extensible annotations for commentary (blunders, etc.) from multiple sources
  5. Ability to capture PGN V2 commands in a structured fashion, not as embedded comments
In describing the structure and values within, we will use JSON as a rendering example, but it is important to understand that BGN is an information architecture, not a rendering / storage specification. BGN structures can be easily implemented in all popular languages (Java, Python, etc.) and easily externalized as JSON or XML. BGN can likewise be read from and written to MongoDB, as well as to relational databases that support XML or JSON representations of columns, although the queryability of such a representation may be limited. It is possible (but much less easy) to convert BGN into a purely relational form, especially when considering alternate lines within alternate lines.
A formal specification of types within the BGN architecture is forthcoming, but assume at least scalar string, int64, double, and date, plus maps (objects) and arrays thereof. In this rant we'll explore:
  1. Basic BGN design details
  2. Practical Reasons For Exploring This At All

Moves

Moves are the heart of the thing, so we will start there and back up into the more pedestrian data elements afterward.

In PGN, a basic opening would be notated as:
1. e4 e5   2. Nf3 Nf6
  
In BGN the same basic opening would have this in the moves array. In our documentation here, we show the array offset for a little more context, but it is not part of the actual spec.
  
 0  { "p":"P", "f": "e2", "t": "e4" }
 1  { "p":"p", "f": "e7", "t": "e5" }
 2  { "p":"N", "f": "g1", "t": "f3" }
 3  { "p":"n", "f": "g8", "t": "f6" }
  
We call the "piece-from-to" construct a pft and the field names are made short on purpose. In addition, although pft is a strong recuring concept and could be modeled as an array of three elements, we deliberately use field names to avoid the confusion of nested arrays, e.g. array[4][2] = 'F4'. It is more workable like this: array[4]['t'] = 'F4'.
There is a lot more to the pft which we will see shortly but again, it is important to know that pft is not parsed in the same way as PGN. There is no whitespace, there is no explict numbering of the moves e.g. the "2." in 2. Nf3 Nf6. All data has real field names and a set of valid values including optional values. Here is sample JSON implementation of moves:
  
	moves = [
	  { "p":"P", "f": "e2", "t": "e4" },
	  { "p":"p", "f": "e7", "t": "e5" },
	  { "p":"N", "f": "g1", "t": "f3" },
	  { "p":"n", "f": "g8", "t": "f6" }
	]
And to prove the point, here it is in XML (although in 2024 XML is decidedly NOT recommended):
  
	<moves>
	  <move><p>P</p><f>e2</f><t>e4</t></move>
	  <move><p>p</p><f>e7</f><t>e5</t></move>
	  <move><p>N</p><f>g1</f><t>f3</t></move>
	  <move><p>n</p><f>g8</f><t>f6</t></move>	  
	</moves>
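To illustrate design goal 2 (no engine replay needed), here is a minimal Python sketch showing direct random access into the moves array from the example above; every element is fully self-describing:

```python
# The example moves array from above, as plain Python data.
moves = [
    {"p": "P", "f": "e2", "t": "e4"},
    {"p": "p", "f": "e7", "t": "e5"},
    {"p": "N", "f": "g1", "t": "f3"},
    {"p": "n", "f": "g8", "t": "f6"},
]

# Random access: half-move 2 tells us everything without replaying 0 and 1.
m = moves[2]
print(f"{m['p']} from {m['f']} to {m['t']}")  # N from g1 to f3
```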

Castling is a king-touch move. In PGN:

n. O-O
  
In BGN:
n  { "p":"K", "f": "e1", "t": "g1", "castle":"K" }
  
Kingside vs. queenside castling is disambiguated by the value of the castle field ("K" or "Q"). The rook moving from h1 to f1 is implicit; this is the only non-explicit piece movement in BGN. But when the rook moves later (maybe!) we will see the move from f1 (where it landed during the castle) to the new landing square.
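As a sketch of how a consumer might recover that implicit rook move, the following hypothetical Python helper maps the castle field to the rook's from/to squares on a standard board (ROOK_HOPS and implicit_rook_move are illustrative names, not part of any BGN spec):

```python
# Hypothetical: derive the implicit rook move of a BGN castle record.
# The moving side is implied by the case of "p" (upper = white).
ROOK_HOPS = {
    ("K", "K"): ("h1", "f1"),  # white kingside
    ("K", "Q"): ("a1", "d1"),  # white queenside
    ("k", "K"): ("h8", "f8"),  # black kingside
    ("k", "Q"): ("a8", "d8"),  # black queenside
}

def implicit_rook_move(move):
    """Return (from, to) for the rook implied by a castle pft, else None."""
    side = move.get("castle")
    if side is None:
        return None
    return ROOK_HOPS[(move["p"], side)]

m = {"p": "K", "f": "e1", "t": "g1", "castle": "K"}
print(implicit_rook_move(m))  # ('h1', 'f1')
```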

BGN explicitly captures additional information for moves. In PGN we might see the following; assume a white bishop is on d4:

13. Qh7 Nxd4+
  
In BGN this would be:
26  { "p":"Q", "f": "h4", "t": "h7" }
27  { "p":"N", "f": "b5", "t": "d4", "x":"B", "c":2 }
  
The "x" field explicitly identifies the piece captured and the "c" field identifies we have placed the opponent in check for the SECOND time. Note that we do not count final checkmate as being in check. There should be no moves after checkmate and the result will identify the winner. This does make forced checkmate versus simply conceding the game a little ambiguous. TBD: Change...?

Promotions are fairly straightforward. In PGN we might see

13. f8=Q Bf7
14. Qf4 ...
  
In BGN this would be:
26  { "p":"P", "f": "f7", "t": "f8": "promote":"Q" }
27  { "p":"B", "f": "d5", "t": "f7"}
28  { "p":"Q", "f": "f8", "t": "f4"}
  
Note that on move 26, the pawn reaching f8 was promoted to a queen, and it was subsequently moved as a queen on move 28.

There is "wiggle room" in the pft to handle unusual situations like bughouse. In PGN we might see

14. N@f3 { drop a knight into the mix }
  
In BGN this might be:
14  { "p":"N", "f": null, "t": "f3"}
  

Subjective Annotations

The game play and data representations above are objective. But chess action has additional information that is subjective, most obviously the marking of a move as "questionable" (?) or "brilliant" (!!). All such annotations are captured in the a field:

26  { "p":"Q", "f": "h4", "t": "h7", "a": [{"nag":3} ] }
  
The a field is an array because more than one provider can contribute subjective information ("subinfo") to a move. To save some space, the id field in subinfo is optional; attribution is carried in the BGN header fields. The management of these IDs is not a core requirement of BGN, so we will park id management for the moment.
As an example, consider this PGN:
13. Qh7? Bf3??
14. Nf3! Rc2!!
  
Someone has subjectively questioned white's queen move and called the bishop move a blunder (the annotator is probably named in the PGN headers). In the next exchange, apparently there is brilliance. In BGN this is represented through the nag field, using standard NAG codes for quality, e.g. 1 ("!", good move), 2 ("?", mistake), 3 ("!!", brilliant move), 4 ("??", blunder):
26  { "p":"Q", "f": "h4", "t": "h7", "a": [{"id":"AA2","nag":2} ] }
27  { "p":"B", "f": "e4", "t": "f3", "a": [{"id":"AA2","nag":4} ] }
28  { "p":"N", "f": "f5", "t": "f3", "a": [{"id":"AA2","nag":1} ] }
29  { "p":"R", "f": "c1", "t": "c2", "a": [{"id":"AA2","nag":3}] }
  
This allows multiple authors to opine subjectively on moves. For example, if such a thing was legitimate in PGN, meaning "AA2 thinks it is questionable but AA7 believes it is fine:"
13. Qh7 (AA2 ?, AA7 -)
  
Then in BGN we would have:
13  { "p":"Q", "f": "h4", "t": "h7", "a": [{"id":"AA2","nag":4},{"id":"AA7","nag":3}] }
  
Annotations optionally can have dates. Annotations without dates are assumed to be relevant to the timeframe of the game event itself. This means other subjective annotators can come in later and opine. For example, suppose author AA7 later on decided that it was a blunder. We could update the move as follows:
      13  { "p":"Q", "f": "h4", "t": "h7", "a": [
      {"id":"AA2","nag":1},
      {"id":"AA7","nag":0},
      {"id":"AA7","date":"2022-03-04", "q":1, "comment":"yeah..."}
      ] }
  
The permissioning of performing such an update, much like the physical persistence itself, is out of scope for this BGN data design doc, but there are at least 2 very practical, very fast ways this could be implemented in either MongoDB or a JSON column RDBMS.
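One plausible way a reader could resolve each annotator's current opinion from such an a array is sketched below; current_opinions is a hypothetical helper that assumes undated entries mean game-time and that later ISO dates supersede earlier ones:

```python
# Hypothetical helper: resolve each annotator's latest opinion from a
# move's "a" array.  Undated entries are treated as game-time; dated
# entries supersede them.  ISO-8601 dates compare correctly as strings.
def current_opinions(annotations):
    latest = {}  # id -> (date-or-"", subinfo)
    for sub in annotations:
        aid = sub.get("id", "_")    # attribution is optional per BGN
        when = sub.get("date", "")  # "" sorts before any real date
        if aid not in latest or when >= latest[aid][0]:
            latest[aid] = (when, sub)
    return {aid: sub for aid, (when, sub) in latest.items()}

a = [
    {"id": "AA2", "nag": 2},
    {"id": "AA7", "nag": 0},
    {"id": "AA7", "date": "2022-03-04", "nag": 4, "comment": "yeah..."},
]
print(current_opinions(a)["AA7"]["nag"])  # 4 -- the dated revision wins
```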

PGN "V2" Commands

PGN was not designed with richly structured data in mind, nor extensibility. As a result, additional information on moves called commands (detailed at https://www.enpassant.dk/chess/palview/enhancedpgn.htm) is tucked away inside the comment field, e.g.:
      14. Rd8 { [%clk "0:02:30"] [%eval #-3] } 14... Bf2 { [%clk "0:02:20"] [%eval #-2] } 
      17. fxe3 Kd7 ( Kd6 { [%clk "0:02:30"] [%eval #-3] the better move...?} )  18. ...
  
In BGN, these commands are captured as fields in subinfo. Some commands have an inherently objective definition particularly %clk but others like %eval are "similar" but not exactly the same in all situations. Because the a field can carry 2 or more subinfos with attribution, it makes sense to place commands there.

Per the enhancedpgn spec, all commands must have at least 1 or more operands (parameters). In BGN, we model all operands as an array of any type, even if only one operand is required. This keeps the access and processing simple and free of "if type == array else ..." logic. For example, lichess.org now adds %clk and %eval commands into game output and could be modeled this way:

      13  { "p":"Q", "f": "h4", "t": "h7", "a": {
            { "eval": ["#-3"], "clk": ["0:02:03"] }}
          }
  
The information architecture within a subinfo is completely at the discretion of the "owner" of the subinfo. Because BGN is inherently neutral to the data types, the values of keys in the operands are not restricted to simple scalar strings. An emerging popular treatment might be to carry a single struct as operand 1, creating a name:value set instead of an ordinally restricted arglist:
      13  { "p":"Q", "f": "h4", "t": "h7", "a": {
            {"eval": ["M", -3, 0.001], "rushFactor": [{"d1":0,"d2":-0.025,"d3":-1}]}}
          }
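As a sketch of how embedded PGN commands could be lifted into BGN subinfo operand arrays, here is a hypothetical Python converter; the regex covers the quoted and bare operand forms shown above, not the full enhancedpgn grammar:

```python
import re

# Sketch: lift enhanced-PGN commands like [%clk "0:02:30"] [%eval #-3]
# out of a comment string into a BGN-style operand map.  Illustrative,
# not a complete parser.
CMD_RE = re.compile(r'\[%(\w+)\s+([^\]]+)\]')

def commands_to_subinfo(comment):
    subinfo = {}
    for name, raw in CMD_RE.findall(comment):
        # Operands are always modeled as an array, even when single.
        ops = [op.strip().strip('"') for op in raw.split(",")]
        subinfo[name] = ops
    return subinfo

print(commands_to_subinfo('{ [%clk "0:02:30"] [%eval #-3] }'))
# {'clk': ['0:02:30'], 'eval': ['#-3']}
```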
  

Alternate Lines

The a field can carry complete alternate move structures. Suppose author AA2 thought that this would be better:
13. Qh7 (AA2 alt1 would be Nxc5, followed almost certainly by Bxc5)
  
In BGN, this is explicitly described using the same set of structures as in the mainline game:
13  { "p":"Q", "f": "h4", "t": "h7",
      "a": [ {"id":"AA2",
              "alt": {"name":"Knight press",
                      "comment": "bla bla bla",
                      "moves": [
                        {"p":"N", "f":"a4", "t":"c5", "x":"P"},
                        {"p":"B", "f":"f6", "t":"c5", "x":"N"}
                      ]
              }
        } ]
}
  
The hidden gem here is that the moves array in the alt structure is the same as the mainline's -- which means that alternate lines themselves can have alternate lines within! Any feature that is added to the move info architecture is automatically available, recursively, in the alternate lines.
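To illustrate that recursion, here is a hypothetical Python walker that counts every move in a game, including moves inside nested alternate lines; because alt reuses the moves structure, a single recursive call covers any depth:

```python
# Sketch: because an "alt" carries the same moves structure as the
# mainline, one recursive walker covers every nesting depth.
def count_all_moves(moves):
    total = 0
    for move in moves:
        total += 1
        for sub in move.get("a", []):
            alt = sub.get("alt")
            if alt:
                total += count_all_moves(alt["moves"])
    return total

game_moves = [
    {"p": "Q", "f": "h4", "t": "h7",
     "a": [{"id": "AA2",
            "alt": {"moves": [
                {"p": "N", "f": "a4", "t": "c5", "x": "P"},
                {"p": "B", "f": "f6", "t": "c5", "x": "N"},
            ]}}]},
    {"p": "n", "f": "g8", "t": "f6"},
]
print(count_all_moves(game_moves))  # 4: two mainline + two alternate
```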

Overall Game Structure

As promised earlier, we will back up to the overall game structure that holds moves. An example serves us well:
{
  // Mandatory fields.  You need these to describe a complete game:
  "moves" : [ (array as described above) ],
  "result" : "W"
  
  // Optional but almost always there and super helpful for filtering and
  // performing analytics.
  "type": "Z",
  "players": [
    {"last":"Moschetti","first":"Paul","nickname":"Buzz"},
    {"handle":{"domain":"org.lichess","value":"Gruenblaugelb"},
      "ELO_blitz":1342, "ELO_classic":1654, "USchessID":"12807646"}
  ],
  "datetime" : "2022-09-04T12:30:00Z",  // ISO-8601 in Z time ONLY.  No local time!
  "event" : "Descriptive name",
  "site" :  "Somewhere on Earth",
  "opening" : {"ECO":"D40",
               "verbose":"Sicilian Defense: Modern Variations, Anti-Qxd4 Move Order Accepted",
               "generic":"SICILIAN"},

  "timeControl":"600+5",


  // If no default (id = _) that is OK. If no subinfo at all, it
  // just means the subinfo in the 'a' field cannot be explicitly attributed;
  // this is not the end of the world:
  "subinfo": [
    {"id":"_", info: { "name": {"last":"Hoover"}, "rating":1000}}
  ],

  // "Domain defined additional attributes"
  "ext": [
    {"id": "org.chessvenue.com", "ts": "2022-09-04T12:30:00Z", "data": {
      "temp": 24, "O2_pct":20.6
    }}
  ]
}


What is The Real Point?

  1. BGN is a modern, database and software-friendly data design
    BGN is both explicit and expressive without syntactical shortcuts like ?? for blunders, and it is very digestible by rich-shape databases like MongoDB. Consider a data set of 100,000,000 games where we wish to ask: in how many games did a castle occur within the first 5 moves between 1960 and 1980? 10 moves? 15? In MongoDB we could query the hypothetical chessdata collection thusly:
    db.chessdata.aggregate([
    // Step 1: Filter for only the dates we want, which should cut down a LOT of the material:
       {$match: {$and: [ {"datetime":{$gte:new ISODate("1960-01-01")}} ,
    		      {"datetime":{$lt:new ISODate("1980-01-01")}}
    		    ] }}
    
    // Step 2:  Use the $reduce function to "walk" the moves array and sniff out at what point,
    // if ever, the castle occurs.  Remember that BGN moves are half-moves, so the first 5 (or
    // 10 or 15) full moves are the first 10 (or 20 or 30) array entries.  We only need to check
    // up to that point OR the length of the moves array, whichever is shorter:
        ,{$project: {X: {$reduce: {
    	input: {$range:[0, {$min:[ {$size:"$moves"},10 ]} ]},
    	initialValue: [],
    	in: {$let: {
    
              // $$this is the sequential int generated from $range in the input
              vars: { ee: {$arrayElemAt:["$moves","$$this"] } },
    
              // The following translates to:  "if the move carries a castle field ("K" or "Q"),
              // then append to the ever-growing $$value array a new one-element array containing
              // the offset where it was found, else append a ZERO length array -- essentially a noop":
              in: {$concatArrays: [ "$$value",
    				{$cond: [ {$in:["$$ee.castle",["K","Q"]]} , ["$$this"] , [] ]}
    			      ]}
    	}}
         }}
       }}		
    
    // Step 3:  The $reduce function can leave us with an empty -- but non-null! -- array, so
    // lastly filter those out:
    ,{$match: {$expr: {$ne:[0,{$size:"$X"}]} }}
    ]);
    
  2. BGN stored in MongoDB or AWS S3 or Hadoop, utilizing Spark, makes terabyte-sized analytics a possibility
    In MongoDB with appropriate indexing, such a query might take only seconds or a minute, as opposed to, say, hours spent running Python programs with the chess.pgn module over and over. The PGN archive at https://database.lichess.org is adding a nearly 20GB file of zstd compressed PGN, representing approx. 100 million new games, per month.

    Beyond MongoDB, BGN rendered as bz2 compressed JSON can be stored on AWS S3 and accessed by Spark and other scalable subsystems to solve very large scale data analysis problems. The AWS Spark drivers are optimized for S3 (obviously) and bz2; there is no need to decompress to perform analytics! And in a recent test, the 33.2G of Jan-2022 zstd compressed lichess PGN data transformed into a set of bz2 compressed JSON files totaling 27G -- almost 19% smaller and much more practical to work with.

  3. BGN -- especially the moves data design -- is highly extensible
    One of the biggest drawbacks of PGN is the difficulty -- never mind the lack of standardization -- of adding fields, simple or complex, to each move. It is trivial in BGN to do so because it is a modern structured data design requiring no special parsing; the chosen implementation format (by default, JSON) already has high-performance parsers in many different languages. For example, we can add a field et for elapsed time in seconds from the start of the game:
        26  { "p":"Q", "f": "h4", "t": "h7", "x":"N", "et":1234 }
        27  { "p":"B", "f": "e4", "t": "f3", "et": 1256}
    
    It now becomes easy to measure the pace of the game by comparing moves[n].et to moves[n+1].et. This can even be bucketed, e.g. % of moves executed in 1-10 seconds, % executed in 10-120 seconds, % taking more than 120 seconds, etc.
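    A minimal Python sketch of that pace measurement, assuming every move carries the hypothetical et field; the bucket boundaries are the ones suggested above:

```python
# Sketch of the pace measurement described above: per-move think time is
# the delta of consecutive "et" values, then bucketed into percentages.
def pace_buckets(moves):
    deltas = [b["et"] - a["et"] for a, b in zip(moves, moves[1:])]
    buckets = {"1-10s": 0, "10-120s": 0, ">120s": 0}
    for d in deltas:
        if d <= 10:
            buckets["1-10s"] += 1
        elif d <= 120:
            buckets["10-120s"] += 1
        else:
            buckets[">120s"] += 1
    n = max(len(deltas), 1)  # avoid div-by-zero on 0/1-move games
    return {k: round(100 * v / n) for k, v in buckets.items()}

moves = [{"et": 0}, {"et": 8}, {"et": 30}, {"et": 200}]
print(pace_buckets(moves))  # {'1-10s': 33, '10-120s': 33, '>120s': 33}
```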
  4. BGN is easily externalizable into highly digestible JSON
    After spending days struggling with 22GB PGN files, BGN externalized as newline-delimited JSON offers some interesting advantages:
    1. Finding anything with grep is as fast as, well, grep, and it will yield the complete game; no other lines (rows) are necessary
    2. Splitting a newline-delimited file is as easy as using split, again because each game is on one line/row
    3. You can use jq -- the de facto standard for command line hacking of JSON -- to filter, transform, and otherwise hack the JSON. Or any other JSON hacking tool you like.
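    A quick Python illustration of the one-game-per-line property: streaming a (stand-in) newline-delimited JSON source one json.loads at a time, with no multi-line parser state and constant memory:

```python
import json, io

# Stand-in for a real newline-delimited game file: one game per line.
ndjson = io.StringIO(
    '{"result":"W","moves":[{"p":"P","f":"e2","t":"e4"}]}\n'
    '{"result":"B","moves":[{"p":"P","f":"d2","t":"d4"}]}\n'
)

# Stream line-at-a-time; each line parses independently as a full game.
white_wins = 0
for line in ndjson:
    game = json.loads(line)
    if game["result"] == "W":
        white_wins += 1
print(white_wins)  # 1
```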

Like this? Dislike this? Let me know


Site copyright © 2013-2024 Buzz Moschetti. All rights reserved