Want To Go On A (java.util.)Date?

10-Sep-2013

Date handling in Java and other languages continues to cause problems even in new systems that have a platform otherwise capable of rich, precise, and lossless operations on them. The problem is rooted in three major issues:

1. An instantaneous point in time is not the same thing as a calendar date.
2. The datetime of an event is not the same thing as the "system processing date" (which, by the way, is also not the same thing as the system clock date).
3. Shortcuts in externalization (to- and from-string) end up being lossy or make it difficult and/or ambiguous to use the externalized form.

The good news is these issues are very easy to address. It simply takes some discipline and erring on the side of producing too much information instead of too little.

Time vs. Calendar

To begin, it is important to understand that a point in time is absolute. It is "the same" in every time zone. In Java, a point in time is carried in the java.util.Date object and is represented as the number of milliseconds since exactly midnight GMT on Jan 1, 1970, a date commonly referred to in the Unix world as the epoch. Consider a global system with two instances of the same software running in Tokyo (TK) and New York (NY) that processes 4 pieces of activity in the following order, where the absolute activity time is captured using a simple call to new Date():

#	Absolute Time	Which System	Local Time
1	1378746511036	TK	2013-09-10 02:08.441
2	1378746511037	NY	2013-09-09 13:08.442
3	1378750111170	NY	2013-09-09 14:08.651
4	1378753711055	TK	2013-09-10 04:08.085

The example above is a simple demonstration of how our human-imposed context of calendar and the earth's rotation confounds an otherwise straightforward sequence of 4 events. Without knowing which timezone each local time belongs to, items #2 and #3 sure look like they happen before item #1. Clearly, if we only ever used absolute time, all sorts of filtering, sorting, and other operations would be easy. It would be a great thing if our minds could easily grok globally absolute time and we could simply do:


    select * from table where tradeDate > 1378750111170

and "just know" that meant 2013-09-09 14:08 in New York.
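Since our minds can't do that conversion, the machine has to. Here is a minimal sketch of it, assuming plain java.util.Date plus a SimpleDateFormat whose timezone is set explicitly. The class and method names (AbsoluteTimeDemo, wallClock) are made up for illustration, and the zone IDs are the usual IANA names for the two cities:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class AbsoluteTimeDemo {
    // Render one absolute instant as the wall-clock time of a named zone.
    // The zone is passed in explicitly; nothing comes from the platform default.
    static String wallClock(long epochMillis, String zoneId) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone(zoneId));
        return fmt.format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        long t = 1378750111170L; // the value used in the query above
        System.out.println(wallClock(t, "America/New_York")); // 2013-09-09 14:08:31.170
        System.out.println(wallClock(t, "Asia/Tokyo"));       // 2013-09-10 03:08:31.170
        System.out.println(wallClock(t, "UTC"));              // 2013-09-09 18:08:31.170
    }
}
```

One number, three different answers to "what time was it?", and all three are correct.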

It's About Dayframe, not Timezone

But we and our systems lead lives structured around a day. More specifically, it is "activity occurring within some number of hours, 24 or fewer." The subtle point here is a system processing day may cross over into the next calendar day before the processing day is incremented. A day's worth of processing is not necessarily anchored to a calendar day, but by convention we almost always associate it with a date. This duration of time (herein called the "dayframe") is the important organizational element of a system, not timezones. Dayframes are high-level and business-oriented; timezones are low-level and system-oriented. The challenges are:

- There is only one absolute time, but there are likely several dayframes and several timezones within each dayframe.
- The interpretation of the absolute time in an individual dayframe is as important as the interpreted relationship between dayframes.
- (The big one) Interpretation of and functions on the date in a dayframe are done in objects outside of Date, usually in just a simple String. Programs, SQL, and flatfiles don't typically use absolute times; they use something like TO_DATE(YYYY-MM-DD).

A design that makes one or more of the following mistakes will make it difficult to work with dayframe data on a practical basis:

Mistake #1: Using activity date to imply dayframe date
Don't take the shortcut. The datetime captured for an activity does not have to occur on the dayframe date.
Mistake #2: Using timezone to imply dayframe
Among other things, several different timezones can be generating activity at the same time within the same dayframe. Timezone is too low-level for organizing activity; you must design and define an appropriate set of dayframes and identifiers for same.
Mistake #3: Converting everything to GMT/UTC
In order to capture cross-timezone relationships in a way that does not combinatorially explode, many systems choose instead to convert the absolute time to GMT, essentially eliminating all the relationships. This helps in sorting and performing global event rationalization but presents an immediate practical problem for processing data within a dayframe because all date functions will need an offset applied. Furthermore (and this is the longer-term, more insidious problem), it becomes difficult to recreate what happened at exactly what time in a dayframe given shifts in timezone and daylight saving time. The practical problem of dealing with an all-GMT date system in databases and code starts hard and gets harder.
Mistake #4: Overreliance + sloppiness with ISO 8601
ISO 8601 datetime strings are pretty good. No whitespace, able to capture to the millisecond, a method to capture offset from UTC+0 (GMT+0), and even a nice convenience symbol 'Z' as a substitute for +00:00. When a time and UTC offset are present, you get local and UTC time in one string.
But there are non-trivial fact-of-life issues with adopting ISO 8601:
- Sorting and filtering are essentially non-standardized when UTC offsets are included. This negative outweighs the positive of capturing local time and UTC offset (recall the cross-dayframe requirement above) in one encoded form.
- On a practical use case basis, operating on a set of records is easier when the functions can work directly with a datetime and not have to apply logic to a datetime to convert it to UTC. In other words, the offset from UTC is less useful than the actual UTC.
- UTC offsets make datetime field processing different from that of the other popular scalar types (string, double, int, BigDecimal, etc.). For example, a String can be created, emitted to a flatfile, and read into a String in another program with little effort and complete transparency around the resources and logic to do so. Even BigDecimal has a very straightforward object-to-string-to-object pathway. A Date, however, needs to be converted to an ISO 8601 string form with logic that properly adds the UTC offset, a non-trivial operation. The other program needs additional logic to properly and reliably convert the string back to a Date. Even if you choose to duck the issue (see below) and not use UTC offsets, you must ensure your code handles the case where they inadvertently show up and are parsed by existing logic to yield the wrong datetime!
- There is a lack of uniformity and a wealth of options in popular database engines (Oracle, MySQL, Postgres, etc.) around both column types and conversion functions and arguments (e.g. TO_DATE() vs. TO_TIMESTAMP() vs. DATE(), with or without timezone data, etc.). This variety hampers quick, clear understanding of how the data will behave in the RDBMS.
- Ducking the issue and dropping UTC offsets with no additional context is a lossy operation and dooms the consumer of the data to rely on assumptions about which dayframe was assumed in the interpretation.
- The ecosystem, especially in Java, around parsing ISO dates is complicated. Right in the JDK is javax.xml.bind.DatatypeConverter.parseDateTime(), but it accepts only the XML Schema dateTime form and so does not parse every ISO 8601 way of writing a UTC offset (for example, +0900 without the colon). SimpleDateFormat can be used. So can JodaTime, a very popular open source library. Each has nuances, particularly around parsing datetime strings, that can lead a team to settle into a locally-correct design that is inconsistent with a peer system which is itself locally-correct.
Mistake #5: Overreliance on platform LOCALE
A lot of datetime software is plumbed into things like default LOCALE and such. These often misunderstood or unknown environmental factors can change without notice. Environment-based functions also run the risk of being too broadly scoped in their use. A design that manages to use environment-based information correctly might find itself needing to process different dayframes of data in the same process space, requiring brittle hacks such as "switching out the environment under the program" to get the datetime functions to work properly for a new dayframe.
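The sorting problem from the first bullet under Mistake #4 is easy to reproduce. The sketch below assumes Java 8's java.time package (OffsetDateTime) for the parsing. The four strings are hypothetical ISO 8601 encodings of the four pieces of activity from the first table, with the seconds and milliseconds derived from the absolute times there. The class and method names are invented for illustration.

```java
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class IsoOffsetSortDemo {
    // Listed in true chronological order (items #1 through #4).
    static final List<String> EVENTS = Arrays.asList(
        "2013-09-10T02:08:31.036+09:00",  // #1 Tokyo
        "2013-09-09T13:08:31.037-04:00",  // #2 New York
        "2013-09-09T14:08:31.170-04:00",  // #3 New York
        "2013-09-10T04:08:31.055+09:00"); // #4 Tokyo

    // What a flatfile sort or a VARCHAR column does: plain text comparison.
    static List<String> sortedAsText(List<String> iso) {
        List<String> copy = new ArrayList<>(iso);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // What is actually wanted: parse every value, apply its offset, then compare.
    static List<String> sortedAsInstants(List<String> iso) {
        List<String> copy = new ArrayList<>(iso);
        copy.sort(Comparator.comparing((String s) -> OffsetDateTime.parse(s).toInstant()));
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sortedAsText(EVENTS));     // items come out as #2, #3, #1, #4
        System.out.println(sortedAsInstants(EVENTS)); // items come out as #1, #2, #3, #4
    }
}
```

The second sort is right, but only because every single comparison parsed a string and did offset arithmetic first. That is exactly the extra logic the bullets above complain about.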

The Solution

A practical solution involves the following for each record that is created:

- Externalize the event without timezone information in order to capture exactly what "the wall clock said" at the moment the event was generated as it would be seen by the most important actor in the activity. This could be a user pressing a purchase button or a system task generating a report.
- Define and capture a business dayframe identifier, e.g. a processing region enumeration.
- Capture the business dayframe date.
- Externalize the offset to UTC in minutes in a separate field. Alternatively, for maximum clarity, simply store a whole separate UTC datetime. Note this is not a datetime with offset Z or +00:00. The definition of the field itself is UTC time, and the datetime carries no UTC offset information.
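As a rough sketch of what such a record might look like in code, here is one possible shape, assuming Java 8's java.time types. The ActivityRecord class, its field names, and the capture() factory are all hypothetical. The dayframe identifier and dayframe date are deliberately passed in by the caller rather than derived from the clock, since deriving them is Mistake #1.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

final class ActivityRecord {
    final String dayframe;          // business dayframe identifier
    final LocalDate dayframeDate;   // business dayframe date, no time part
    final LocalDateTime localTime;  // what the actor's wall clock said, no offset
    final LocalDateTime utcTime;    // the same moment in UTC, no offset

    ActivityRecord(String dayframe, LocalDate dayframeDate,
                   LocalDateTime localTime, LocalDateTime utcTime) {
        this.dayframe = dayframe;
        this.dayframeDate = dayframeDate;
        this.localTime = localTime;
        this.utcTime = utcTime;
    }

    // The timezone is used once, at capture, and then thrown away.
    static ActivityRecord capture(String dayframe, LocalDate dayframeDate,
                                  Instant when, ZoneId actorZone) {
        return new ActivityRecord(dayframe, dayframeDate,
            LocalDateTime.ofInstant(when, actorZone),
            LocalDateTime.ofInstant(when, ZoneOffset.UTC));
    }

    // The "offset in minutes" alternative can always be recovered from the two datetimes.
    long offsetMinutes() {
        return Duration.between(utcTime, localTime).toMinutes();
    }

    public static void main(String[] args) {
        ActivityRecord r = capture("ASIA", LocalDate.of(2013, 9, 9),
            Instant.ofEpochMilli(1378746511036L), ZoneId.of("Asia/Tokyo"));
        System.out.println(r.localTime);       // 2013-09-10T02:08:31.036
        System.out.println(r.utcTime);         // 2013-09-09T17:08:31.036
        System.out.println(r.offsetMinutes()); // 540
    }
}
```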

Coming back to our 4 pieces of activity:

#	Business Dayframe	Dayframe date	Local Time	UTC Time
1	ASIA	2013-09-09	2013-09-10 02:08.441	2013-09-09 17:08.441
2	NORTHAM	2013-09-09	2013-09-09 13:08.442	2013-09-09 17:08.442
3	NORTHAM	2013-09-09	2013-09-09 14:08.653	2013-09-09 18:08.653
4	ASIA	2013-09-09	2013-09-10 04:08.085	2013-09-09 19:08.085

The value of the datetimes and how they are principally used is now unambiguous because we now have four pieces of information (business dayframe, dayframe date, local event time, and UTC time) instead of just two (local time and UTC offset). Notes:

- We have created two dayframe identifiers, ASIA and NORTHAM. These represent two contexts for processing as defined by the system. They are named in such a way as to provide a healthy clue about what they are used for but there is no specific relationship between the name and timezones that are implied by the name. This addresses major issue #2 above.
- The dayframe date has no time. It does not need time. Dayframe identifier + Dayframe date defines a business-relevant set of events.
- The local time is exactly what the user or the system "saw" at the time the event was created within the business dayframe. This is particularly important for systems interacting with financial markets and regulatory reporting.
- Since all local times are also stored as UTC, this takes care of global sorting and filtering and we no longer need to store absolute time.
- All sorting, filtering, and database functions can now be made very clear and transparent with no murky logic around time offsets.
- Datetimes are now convertible between String and Date without any extra information or processing, just like other popular scalars.
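To make the last few notes concrete, here is a small sketch, again assuming Java 8's java.time. The four rows are rebuilt from the absolute times in the first table and stored as plain strings, the way they would sit in a flatfile. The fixed-width pattern, the Row class, and the method names are illustrative choices, not a prescribed format.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class DayframeQueryDemo {
    // Fixed width and no offset, so text order is the same as time order.
    static final DateTimeFormatter FMT = DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSS");

    static final class Row {
        final int id;
        final String dayframe, dayframeDate, localTime, utcTime;

        Row(int id, String dayframe, String dayframeDate, long epochMillis, String zone) {
            Instant when = Instant.ofEpochMilli(epochMillis);
            this.id = id;
            this.dayframe = dayframe;
            this.dayframeDate = dayframeDate;
            this.localTime = FMT.format(LocalDateTime.ofInstant(when, ZoneId.of(zone)));
            this.utcTime = FMT.format(LocalDateTime.ofInstant(when, ZoneOffset.UTC));
        }
    }

    // The four pieces of activity, deliberately out of order.
    static List<Row> sample() {
        return Arrays.asList(
            new Row(4, "ASIA", "2013-09-09", 1378753711055L, "Asia/Tokyo"),
            new Row(2, "NORTHAM", "2013-09-09", 1378746511037L, "America/New_York"),
            new Row(1, "ASIA", "2013-09-09", 1378746511036L, "Asia/Tokyo"),
            new Row(3, "NORTHAM", "2013-09-09", 1378750111170L, "America/New_York"));
    }

    // Global ordering: a plain string sort on the UTC column.
    static List<Integer> idsInUtcOrder(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(Comparator.comparing((Row r) -> r.utcTime));
        List<Integer> ids = new ArrayList<>();
        for (Row r : copy) ids.add(r.id);
        return ids;
    }

    // Business filtering: dayframe identifier + dayframe date, no time arithmetic.
    static List<Integer> idsInDayframe(List<Row> rows, String dayframe, String dayframeDate) {
        List<Row> hits = new ArrayList<>();
        for (Row r : rows) {
            if (r.dayframe.equals(dayframe) && r.dayframeDate.equals(dayframeDate)) hits.add(r);
        }
        return idsInUtcOrder(hits);
    }

    public static void main(String[] args) {
        List<Row> rows = sample();
        System.out.println(idsInUtcOrder(rows));                          // [1, 2, 3, 4]
        System.out.println(idsInDayframe(rows, "NORTHAM", "2013-09-09")); // [2, 3]
        // String -> datetime -> String with no zone, offset, or locale involved.
        String s = rows.get(0).utcTime;
        System.out.println(LocalDateTime.parse(s, FMT).format(FMT).equals(s)); // true
    }
}
```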

If we want to get very crisp about dayframe handling, then instead of using a date and dayframe identifiers as a composite key, we would indirectly address date and other attributes through a bespoke key. A simple implementation might be D followed by an incremental integer, e.g.

#	Dayframe	Local Time	UTC Time
1	D771	2013-09-10 02:08.441	2013-09-09 17:08.441
2	D772	2013-09-09 13:08.442	2013-09-09 17:08.442
3	D772	2013-09-09 14:08.653	2013-09-09 18:08.653
4	D771	2013-09-10 04:08.085	2013-09-09 19:08.085

To find out what happened on a particular dayframe, one would first consult the dayframe resource and query to arrive at the proper dayframe key, which would then be supplied to a query against the transaction resource. This permits more flexibility in determining and capturing attributes that make a dayframe, such as the UTC time that it was activated, by which actor in the system, state, etc.
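The two-step lookup might be sketched like this. Two in-memory maps stand in for the dayframe resource and the transaction resource. The keys D771 and D772 come from the table above, and their pairing with ASIA and NORTHAM on 2013-09-09 is read off by lining that table up with the earlier one. Everything else (class name, method names, the "|" composite lookup string) is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DayframeLookupDemo {
    // Dayframe resource: "identifier|date" -> bespoke dayframe key.
    static final Map<String, String> DAYFRAMES = new LinkedHashMap<>();
    // Transaction resource: item number -> dayframe key.
    static final Map<Integer, String> TRANSACTIONS = new LinkedHashMap<>();

    static {
        DAYFRAMES.put("ASIA|2013-09-09", "D771");
        DAYFRAMES.put("NORTHAM|2013-09-09", "D772");
        TRANSACTIONS.put(1, "D771");
        TRANSACTIONS.put(2, "D772");
        TRANSACTIONS.put(3, "D772");
        TRANSACTIONS.put(4, "D771");
    }

    // Step 1: consult the dayframe resource to arrive at the key (null if unknown).
    static String keyFor(String dayframe, String dayframeDate) {
        return DAYFRAMES.get(dayframe + "|" + dayframeDate);
    }

    // Step 2: supply the key to a query against the transaction resource.
    static List<Integer> transactionsFor(String dayframeKey) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> e : TRANSACTIONS.entrySet()) {
            if (e.getValue().equals(dayframeKey)) ids.add(e.getKey());
        }
        return ids;
    }

    static List<Integer> activityOn(String dayframe, String dayframeDate) {
        String key = keyFor(dayframe, dayframeDate);
        return key == null ? Collections.<Integer>emptyList() : transactionsFor(key);
    }

    public static void main(String[] args) {
        System.out.println(keyFor("NORTHAM", "2013-09-09"));     // D772
        System.out.println(activityOn("NORTHAM", "2013-09-09")); // [2, 3]
        System.out.println(activityOn("ASIA", "2013-09-09"));    // [1, 4]
    }
}
```

In a real system the dayframe resource would carry the extra attributes mentioned above (activation time, actor, state); the transaction rows never need to know about them.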

In summary, don't use shortcuts to try to capture a local time, a UTC time, and a dayframe (a.k.a. system processing date) in fewer than the three separate pieces of information that they are.
