7-Mar-2024, 21-Dec-2022, 27-Jun-2021, 28-Dec-2020 | Like this? Dislike this? Let me know |
Note: This is a rant-under-construction. Some parts may change significantly as more thought and work are put into the project.
List the archive:

    aws s3 ls s3://chess-bgn/archive/
    curl https://chess-bgn.s3-us-west-2.amazonaws.com/

Get one file:

    aws s3 cp s3://chess-bgn/archive/blitz/2019/1/G.8.blitz.json.bz2 .

Get a month of blitz games:

    aws s3 cp s3://chess-bgn/archive/blitz/2024/1 . --recursive

Get a month of ALL game types (note: aws s3 cp does not expand wildcards in the source path; use --exclude/--include):

    aws s3 cp s3://chess-bgn/archive/ . --recursive --exclude "*" --include "*/2016/1/*"

Get a year of blitz games:

    aws s3 cp s3://chess-bgn/archive/blitz/2016 . --recursive
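For scripted access, the object keys follow a predictable layout inferable from the commands above. A minimal sketch; the helper name is mine, and the layout is an assumption derived from the examples:

```python
# Sketch: build S3 object keys for the chess-bgn archive layout
#   s3://chess-bgn/archive/<type>/<year>/<month>/G.<n>.<type>.json.bz2
# The helper name is mine; the key shape is inferred from the CLI examples above.

BUCKET = "chess-bgn"

def archive_key(game_type: str, year: int, month: int, n: int) -> str:
    """Return the key for the n-th file of a given game type and month."""
    return f"archive/{game_type}/{year}/{month}/G.{n}.{game_type}.json.bz2"

print(archive_key("blitz", 2019, 1, 8))
# archive/blitz/2019/1/G.8.blitz.json.bz2
```

The returned key can be handed to any S3 client (boto3, the CLI, etc.) along with the bucket name.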
This is data converted from the archive at lichess.org. Data is broken out by game type since comparisons across types (e.g. blitz vs. classical) are unusual, although absolutely still possible. The Jan 2024 files contain almost 100 million games. I originally intended to convert each month but stopped due to zero download activity.
Content is broken out into many smaller, easily digestible bzip2-compressed JSON files of approx. 60MB apiece, each of which decompresses to approx. 1G, the target decompressed size.
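Because Spark's JSON reader expects one document per line (JSON Lines), a single file can also be inspected with nothing but the Python standard library. A sketch, assuming the one-game-per-line layout:

```python
import bz2
import json

def count_games(path):
    """Stream-decompress a G.*.json.bz2 file and count games.
    Assumes JSON Lines layout: one game document per line."""
    n = 0
    with bz2.open(path, "rt") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises if the line is not valid JSON
                n += 1
    return n
```

Streaming through `bz2.open` means the ~1G decompressed content never has to land on disk.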
The compression is effective:
# Wildcard for both game type and the individual files. This
# will pick up all 99m games for Jan 2024:
game_data = 's3://chess-bgn/archive/*/2024/1/G.*.json.bz2'

# ... or all games within just one game type:
game_data = 's3://chess-bgn/archive/blitz/2024/1/G.*.json.bz2'

# ... or all games in 2024:
game_data = 's3://chess-bgn/archive/*/2024/*/G.*.json.bz2'

# ... or just one game file, approx 200,000 to 210,000 games:
game_data = 's3://chess-bgn/archive/blitz/2024/1/G.0.blitz.json.bz2'

# ... or everything ever! Billions of games:
game_data = 's3://chess-bgn/archive/*/*/*/G.*.json.bz2'

df = spark.read.json(game_data)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, ArrayType, DateType)

# This would normally be in util.py or equiv, away from the specific analytic
# but we code it directly here for convenience/clarity:
def get_game_schema():
    a_schema = StructType([
        StructField("nag", IntegerType()),
        StructField("eval", ArrayType(StringType())),
        StructField("clk", ArrayType(StringType()))
    ])
    m_schema = StructType([
        StructField("p", StringType()),
        StructField("f", StringType()),
        StructField("t", StringType()),
        StructField("x", StringType()),
        StructField("castle", StringType()),
        StructField("promo", StringType()),
        StructField("a", ArrayType(a_schema))
    ])
    o_schema = StructType([
        StructField("verbose", StringType()),
        StructField("generic", StringType()),
        StructField("ECO", StringType())
    ])
    p_schema = StructType([
        StructField("handle", StructType([
            StructField("domain", StringType()),
            StructField("value", StringType())
        ])),
        StructField("ELO_blitz", IntegerType()),
        StructField("ELO_bullet", IntegerType()),
        StructField("ELO_rapid", IntegerType()),
        StructField("ELO_classic", IntegerType())
    ])
    g_schema = StructType([
        StructField("site", StringType()),
        StructField("event", StringType()),
        StructField("type", StringType()),
        StructField("result", StringType()),
        StructField("datetime", DateType()),
        StructField("opening", o_schema),
        StructField("moves", ArrayType(m_schema), False),
        StructField("players", ArrayType(p_schema))
    ])
    return g_schema

# This makes a "less than" cascade suitable for numerics, e.g.:
#   F.when(< 10).when(< 20).when(< 30).when(< 40)
# Since order of eval is left to right, we do not need each when expression
# to contain both a min and max value.  vals is an array of monotonically
# increasing values, not necessarily with same intervals, e.g.:
#   makeWhenBuckets(the_col, [ -10, 0, 10, 50, 100, 200, 300, 500, 1000 ])
def makeWhenBuckets(the_col, vals):
    cond = F.when(the_col < vals[0], F.lit("<" + str(vals[0])))
    for n in range(1, len(vals)):
        s = str(vals[n-1]) + "-" + str(vals[n])
        cond = cond.when(the_col < vals[n], F.lit(s))
    cond = cond.otherwise('other')
    return cond

def process(game_data, output_uri):
    with SparkSession.builder.appName("blunders").getOrCreate() as spark:
        game_schema = get_game_schema()
        dfg = spark.read.json(game_data, game_schema)

        # Remember, in BGN moves are "half moves" so a 100 move game
        # limit is an array of 200 in BGN. End point is 201 to pick up
        # 200 itself (as opposed to stopping at 190):
        bkt_list = list(range(10, 201, 10))

        # group by game type, UPPER case piece (black OR white), and bucket
        dfx = dfg\
            .select(dfg.type, F.posexplode(dfg.moves).alias("N", "move"))\
            .filter(F.element_at(F.col('move.a.nag'), 1) == 4)\
            .withColumn("B", makeWhenBuckets(F.col('N'), bkt_list))\
            .groupBy(F.col('type'),
                     F.upper(F.col('move.p')).alias('piece'),
                     F.col('B').alias('bucket'))\
            .agg(F.count(F.col('N')).alias('Nblunders'))\
            .sort(F.col('Nblunders').desc())

        # dfx.show(truncate=False)
        dfx.write.mode("overwrite").json(output_uri)
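The left-to-right `when` cascade is easy to misread in Spark's fluent API; the same bucketing semantics in plain Python make it concrete. A sketch for checking expectations, not part of the pipeline (the function name is mine):

```python
def bucket_label(n, vals):
    """Plain-Python mirror of makeWhenBuckets: return the label of the
    first bucket whose upper bound exceeds n, else 'other'."""
    if n < vals[0]:
        return "<" + str(vals[0])
    for i in range(1, len(vals)):
        if n < vals[i]:
            return str(vals[i-1]) + "-" + str(vals[i])
    return "other"

bkt_list = list(range(10, 201, 10))   # 10, 20, ..., 200
print(bucket_label(5, bkt_list))      # <10
print(bucket_label(37, bkt_list))     # 30-40
print(bucket_label(205, bkt_list))    # other
```

Note the half-open behavior: half-move 37 falls into "30-40" because 37 < 40 is the first true test in the cascade.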
Let's be honest: Portable Game Notation had a great run, but the time has come
for a better way to capture chess action.
Here we will outline Better Game Notation or BGN. BGN has these
design goals:
Moves are the heart of the thing. We will start there and back up into the more pedestrian data elements. Important concepts with moves:
1. e4 e5 2. Nf3 Nf6
0 { "p":"P", "f": "e2", "t": "e4" }
1 { "p":"p", "f": "e7", "t": "e5" }
2 { "p":"N", "f": "g1", "t": "f3" }
3 { "p":"n", "f": "g8", "t": "f6" }
moves = [
  { "p":"P", "f": "e2", "t": "e4" },
  { "p":"p", "f": "e7", "t": "e5" },
  { "p":"N", "f": "g1", "t": "f3" },
  { "p":"n", "f": "g8", "t": "f6" }
]
<moves>
  <move><p>P</p><f>e2</f><t>e4</t></move>
  <move><p>p</p><f>e7</f><t>e5</t></move>
  <move><p>N</p><f>g1</f><t>f3</t></move>
  <move><p>n</p><f>g8</f><t>f6</t></move>
</moves>
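Since the p/f/t triplet carries the whole move, a tiny structural validator is easy to write. A sketch; the function name and rules are mine, derived from the examples above:

```python
import re

SQUARE = re.compile(r"^[a-h][1-8]$")

def valid_move(m):
    """Minimal structural check of a BGN move dict: a piece letter plus
    from/to squares. 'f' may be None to allow drops (e.g. bughouse)."""
    if m.get("p") not in set("PNBRQKpnbrqk"):
        return False
    if m.get("f") is not None and not SQUARE.match(m["f"]):
        return False
    return m.get("t") is not None and bool(SQUARE.match(m["t"]))

print(valid_move({"p": "N", "f": "g1", "t": "f3"}))   # True
print(valid_move({"p": "N", "f": None, "t": "f3"}))   # True (a drop)
print(valid_move({"p": "Z", "f": "g1", "t": "f3"}))   # False
```

This is the kind of check that is awkward in PGN (where legality and syntax are entangled) but trivial in BGN.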
Castling is a king-touch move. In PGN:
n. O-O
n { "p":"K", "f": "e1", "t": "g1", "castle":"K" }
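Because castling is recorded as a king-touch move, the rook's movement is implied and can be reconstructed from the castle flag and the king's rank. A sketch; the helper name is mine:

```python
def implied_rook_move(move):
    """Derive the rook's (from, to) squares implied by a BGN castle move.
    'K' = kingside (rook h-file -> f-file); 'Q' = queenside (a-file -> d-file)."""
    rank = move["f"][1]                # '1' for white, '8' for black
    if move.get("castle") == "K":
        return ("h" + rank, "f" + rank)
    if move.get("castle") == "Q":
        return ("a" + rank, "d" + rank)
    return None                        # not a castle move

print(implied_rook_move({"p": "K", "f": "e1", "t": "g1", "castle": "K"}))
# ('h1', 'f1')
```

Storing only the king's movement keeps the move record single-piece while losing no information.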
BGN explicitly captures additional information for moves. In PGN we might see the following; assume a white bishop is on d4:
13. Qh7 Nxd4+
26 { "p":"Q", "f": "h4", "t": "h7" }
27 { "p":"N", "f": "b5", "t": "d4", "x":"B", "c":2 }
Promotions are fairly straightforward. In PGN we might see
13. f8=Q Bf7 14. Qf4 ...
26 { "p":"P", "f": "f7", "t": "f8", "promote":"Q" }
27 { "p":"B", "f": "d5", "t": "f7" }
28 { "p":"Q", "f": "f8", "t": "f4" }
There is "wiggle room" in the p/f/t fields to handle unusual situations like bughouse. In PGN we might see
14. N@f3 { drop a knight into the mix }
14 { "p":"N", "f": null, "t": "f3"}
The game play and data representations above are objective. But chess action has additional information that is subjective, most obviously the marking of a move as "questionable" (?) or "brilliant" (!!). All such annotations are captured in the a field:
26 { "p":"Q", "f": "h4", "t": "h7", "a": [{"nag":3} ] }
13. Qh7? Bf3?? 14. Nf3! Rc2!!
26 { "p":"Q", "f": "h4", "t": "h7", "a": [ {"id":"AA2","nag":2} ] }
27 { "p":"B", "f": "e4", "t": "f3", "a": [ {"id":"AA2","nag":4} ] }
28 { "p":"N", "f": "f5", "t": "f3", "a": [ {"id":"AA2","nag":1} ] }
29 { "p":"R", "f": "c1", "t": "c2", "a": [ {"id":"AA2","nag":3} ] }
13. Qh7 (AA2 ?, AA7 -)
13 { "p":"Q", "f": "h4", "t": "h7", "a": [ {"id":"AA2","nag":2}, {"id":"AA7","nag":0} ] }
13 { "p":"Q", "f": "h4", "t": "h7", "a": [
     {"id":"AA2","nag":1},
     {"id":"AA7","nag":0},
     {"id":"AA7","date":"2022-03-04", "q":1, "comment":"yeah..."}
   ] }
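The nag integers are most useful when rendered back into the familiar glyphs. The first few codes of the standard PGN Numeric Annotation Glyph table, plus a small renderer; the function name and the annotator-filter idea are mine:

```python
# First few entries of the standard PGN Numeric Annotation Glyph table.
NAG_GLYPHS = {
    0: "",      # null annotation
    1: "!",     # good move
    2: "?",     # mistake
    3: "!!",    # brilliant move
    4: "??",    # blunder
    5: "!?",    # interesting move
    6: "?!",    # dubious move
}

def render_san_suffix(annotations, annotator_id=None):
    """Return the glyph for the first NAG found in a BGN 'a' array,
    optionally restricted to a single annotator id."""
    for a in annotations:
        if annotator_id is not None and a.get("id") != annotator_id:
            continue
        if "nag" in a:
            return NAG_GLYPHS.get(a["nag"], "$" + str(a["nag"]))
    return ""

print(render_san_suffix([{"id": "AA2", "nag": 4}]))          # ??
print(render_san_suffix([{"id": "AA2", "nag": 4}], "AA7"))   # (empty string)
```

Unknown codes fall back to the raw `$n` form, which is how PGN prints NAGs it has no glyph for.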
14. Rd8 { [%clk "0:02:30"] [%eval #-3] } Bf2 { [%clk "0:02:20"] [%eval #-2] } 17. fxe3 Kd7 ( Kd6 { [%clk "0:02:30"] [%eval #-3] the better move...?} ) 18. ...
Per the Enhanced PGN spec, every command must have at least one operand (parameter). In BGN, we model all operands as an array of any type, even if only one operand is required. This keeps access and processing simple and free of "if type == array else ..." logic. For example, lichess.org now adds %clk and %eval commands to game output, which could be modeled this way:
13 { "p":"Q", "f": "h4", "t": "h7", "a": [ { "eval": ["#-3"], "clk": ["0:02:03"] } ] }
13 { "p":"Q", "f": "h4", "t": "h7", "a": [ {"eval": ["M", -3, 0.001], "rushFactor": [{"d1":0,"d2":-0.025,"d3":-1}]} ] }
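The always-an-array convention means operand access is uniform: `[0]` always works, with no type sniffing. For example, a %clk value can be pulled out and converted to seconds; a sketch, helper name mine:

```python
def clk_seconds(ann):
    """Extract the first %clk operand ('H:MM:SS') from a BGN annotation
    object and return it as integer seconds, or None if absent.
    Operands are always arrays, so clk[0] is always safe."""
    clk = ann.get("clk")
    if not clk:
        return None
    h, m, s = (int(x) for x in clk[0].split(":"))
    return h * 3600 + m * 60 + s

print(clk_seconds({"eval": ["#-3"], "clk": ["0:02:03"]}))   # 123
```

Contrast this with parsing `[%clk "0:02:03"]` out of a PGN comment string with a regex.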
13. Qh7 (AA2 alt1 would be Nxc5, followed almost certainly by Bxc5)
13 { "p":"Q", "f": "h4", "t": "h7", "a": [
     {"id":"AA2",
      "alt": {"name":"Knight press",
              "comment": "bla bla bla",
              "moves": [
                {"p":"N", "f":"a4", "t":"c5", "x":"P"},
                {"p":"B", "f":"f6", "t":"c5", "x":"N"}
              ]
             }
     }
   ] }
{
    // Mandatory fields. You need these to describe a complete game:
    "moves" : [ (array as described above) ],
    "result" : "W",

    // Optional but almost always there and super helpful for filtering and
    // performing analytics.
    "type": "Z",
    "players": [
        {"last":"Moschetti","first":"Paul","nickname":"Buzz"},
        {"handle":{"domain":"org.lichess","value":"Gruenblaugelb"},
         "ELO_blitz":1342, "ELO_classic":1654, "USchessID":"12807646"}
    ],
    "datetime" : "2022-09-04T12:30:00Z",   // ISO-8601 in Z time ONLY. No local time!
    "event" : "Descriptive name",
    "site" : "Somewhere on Earth",
    "opening" : {"ECO":"D40",
                 "verbose":"Sicilian Defense: Modern Variations, Anti-Qxd4 Move Order Accepted",
                 "generic":"SICILIAN"},
    "timeControl":"600+5",

    // If no default (id = _) that is OK. If no subinfo at all, it
    // just means the subinfo in the 'a' field cannot be explicitly attributed;
    // this is not the end of the world:
    "subinfo": [ {"id":"_", "info": { "name": {"last":"Hoover"}, "rating":1000}} ],

    // "Domain defined additional attributes"
    "ext": [ {"id": "org.chessvenue.com", "ts": "2022-09-04T12:30:00Z",
              "data": { "temp": 24, "O2_pct":20.6 }} ]
}
Notes on the fields:
How many games played where last name = Moschetti?

    // dotpath thru array yields new array of lastnames:
    MongoDB:  db.games.find({"players.last":"Moschetti"});

    // Same dotpath behavior here:
    SPARK:    df.filter(array_contains(col('players.last'), 'Moschetti'))

    // The [] operator "unwinds" the players array:
    JQ:       jq 'select(.players[] | .last == "Moschetti")' N.game.json

How many games did Moschetti win playing as white?

    // This time, pick *only* [0] from players to get white player:
    MongoDB:  db.games.find({"result":"W","players.0.last":"Moschetti"});

    // SPARK array functions are 1-based not 0-based:
    SPARK:    df.filter( (element_at(col('players.last'),1) == lit('Moschetti'))
                         & (col('result') == lit('W')) )

    // Nearly the same as MongoDB:
    JQ:       jq 'select(.result == "W" and (.players[0].last == "Moschetti"))' N.game.json
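The same two queries can be expressed over plain Python dicts, which is a handy way to sanity-check the MongoDB/SPARK/JQ results on a handful of games. A sketch; function names are mine:

```python
def games_by_lastname(games, last):
    """Any player slot matches (analog of the dotpath-through-array queries)."""
    return [g for g in games
            if any(p.get("last") == last
                   for p in g.get("players", []) if isinstance(p, dict))]

def wins_as_white(games, last):
    """Only index 0 (white) matches, and the game result must be 'W'."""
    return [g for g in games
            if g.get("result") == "W"
            and g.get("players")
            and isinstance(g["players"][0], dict)
            and g["players"][0].get("last") == last]

sample = [
    {"result": "W", "players": [{"last": "Moschetti"}, {"last": "Smith"}]},
    {"result": "B", "players": [{"last": "Jones"}, {"last": "Moschetti"}]},
]
print(len(games_by_lastname(sample, "Moschetti")))   # 2
print(len(wins_as_white(sample, "Moschetti")))       # 1
```

The `isinstance` guards anticipate the string-identifier special case for players described next.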
As a special case, for space and reference-data lookup efficiency, the players array can contain a string identifier instead of the object of rich information at either or both of index 0 and 1:
"players": [ "B76235R", "Gruenblaugelb" ]
db.chessdata.aggregate([
    // Step 1: Filter for only the dates we want, which should cut down a LOT of the material:
    {$match: {$and: [
        {"bgn.eventDate":{$gte:new ISODate("1960-01-01")}},
        {"bgn.eventDate":{$lt:new ISODate("1980-01-01")}}
    ]}}

    // Step 2: Use the $reduce function to "walk" the moves array and sniff out at what point,
    // if ever, the castle occurs. We only need to check up to the first 5 (or 10 or 15) moves
    // OR the max length of the moves array, whichever is shorter:
    ,{$project: {X: {$reduce: {
        input: {$range:[0, {$min:[ {$size:"$moves"}, 5 ]} ]},
        initialValue: [],
        in: {$let: {
            // $$this is the sequential int generated from $range in the input
            vars: { ee: {$arrayElemAt:["$moves","$$this"] } },
            // The following translates to: "if the castle field value is true, then append to the
            // ever-growing $$value array a new array of one containing the offset where it was
            // found, else append a ZERO length array -- essentially a noop":
            in: {$concatArrays: [ "$$value",
                {$cond: [ {$eq:["$$ee.castle",true]}, ["$$this"], [] ]}
            ]}
        }}
    }} }}

    // Step 3: The $reduce function can leave us with an empty -- but non-null! -- array, so
    // lastly filter those out:
    ,{$match: {$expr: {$ne:[0,{$size:"$X"}]} }}
]);
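Stripped of aggregation-pipeline machinery, the $reduce stage is just "find the index of the first castle within the first N half-moves." The same logic in plain Python makes the intent obvious (a sketch; note it tests the castle field for truthiness, covering both the `"castle":"K"` form shown earlier and a boolean flag):

```python
def first_castle_offset(moves, limit=5):
    """Return the index of the first castle within the first `limit`
    half-moves, or None -- the plain-Python analog of the $reduce stage."""
    for i, m in enumerate(moves[:limit]):
        if m.get("castle"):
            return i
    return None

moves = [
    {"p": "P", "f": "e2", "t": "e4"},
    {"p": "p", "f": "e7", "t": "e5"},
    {"p": "K", "f": "e1", "t": "g1", "castle": "K"},
]
print(first_castle_offset(moves))   # 2
```

In MongoDB the loop-with-early-exit has to be emulated with $range/$reduce/$concatArrays; in a general-purpose language it is three lines.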
Beyond MongoDB, BGN rendered as bz2 compressed JSON can be stored on AWS S3 and accessed by SPARK and other scalable subsystems to solve very large scale data analysis problems. AWS SPARK drivers are S3 (obviously) and bz2 optimized; no need to decompress to perform analytics! And in a recent test, the 33.2G of Jan-2022 zstd compressed lichess PGN data transformed into a set of bz2 compressed JSON files totaled 27G -- almost 19% smaller and much more practical to work with.
26 { "p":"Q", "f": "h4", "t": "h7", "x":"N", "et":1234 }
27 { "p":"B", "f": "e4", "t": "f3", "et":1256 }