Want To Go On A (java.util.)Date?
Date handling in Java and other languages continues to cause problems
even in new systems whose platform is otherwise capable of rich, precise,
and lossless operations on dates. The problem is rooted in three major issues:
- An instantaneous point in time is not the same thing as a calendar date.
- The datetime of an event is not the same thing as the "system processing
date" (which, by the way, is also not the same thing as the system clock date).
- Shortcuts in externalization (to- and from-string) end up being lossy
or make it difficult and/or ambiguous to use the externalized form.
The good news is these issues are very easy to address. It simply
takes some discipline and erring on the side of producing too much information
instead of too little.
Time vs. Calendar
To begin, it is important to understand that a point in time is absolute.
It is "the same" in every time zone. In Java, a point in time is carried in
the java.util.Date object and is represented as the number of
milliseconds since exactly midnight GMT Jan 1, 1970, a date commonly referred to in the Unix world
as the epoch date. Consider a global system with two instances of the
same software running in Tokyo and New York that processes 4 pieces of
activity in the following order, where the absolute activity time is captured
using a simple call to new Date():
# | Absolute Time | Which System | Local Time           |
1 | 1378746511036 | TK           | 2013-09-10 02:08.441 |
2 | 1378746511037 | NY           | 2013-09-09 13:08.442 |
3 | 1378750111170 | NY           | 2013-09-09 14:08.653 |
4 | 1378753711055 | TK           | 2013-09-10 04:08.085 |
The example above is a simple demonstration of how our human-imposed context
of Calendar and earth's rotation confounds an otherwise straightforward
sequence of 4 events. In the absence of such context, items #2 and #3 sure look
like they happen before item #1.
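A minimal sketch of the effect described above, assuming the java.time API (Java 8 and later) instead of java.util.Date formatting. The class name EpochViews and the zone IDs Asia/Tokyo and America/New_York are illustrative assumptions, not part of any real system. It renders the four absolute times as each system's wall-clock time, then sorts the resulting strings as plain text.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
final class EpochViews {
    static final DateTimeFormatter WALL_CLOCK =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    // Render an absolute time as the wall-clock time seen in the given zone.
    static String localView(long epochMillis, String zoneId) {
        return Instant.ofEpochMilli(epochMillis)
                      .atZone(ZoneId.of(zoneId))
                      .format(WALL_CLOCK);
    }

    public static void main(String[] args) {
        long[] absolute = {1378746511036L, 1378746511037L, 1378750111170L, 1378753711055L};
        // Assumed zones for the TK and NY systems.
        String[] zones = {"Asia/Tokyo", "America/New_York", "America/New_York", "Asia/Tokyo"};

        List<String> local = new ArrayList<>();
        for (int i = 0; i < absolute.length; i++) {
            local.add(localView(absolute[i], zones[i]));
        }
        System.out.println("In event order: " + local);

        // Sorting the local strings moves both New York events ahead of event #1.
        List<String> sorted = new ArrayList<>(local);
        Collections.sort(sorted);
        System.out.println("Sorted as text: " + sorted);
    }
}
```

The rendered values carry the seconds and milliseconds held in the epoch figures. The absolute values never change; only the calendar view of them does.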
Clearly, if we were only ever using absolute time, all sorts of filtering,
sorting, and other operations would be easy. It would be a great thing if our
minds could easily grok globally absolute time and
we could simply do:
select * from table where tradeDate > 1378750111170
and "just know" that meant 2013-09-09 14:08 in New York.
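Since our minds do not work that way, the bound for such a query has to be computed. A small sketch, again assuming java.time and the America/New_York zone ID; the class and method names are hypothetical. It converts a wall-clock datetime in a named zone to the epoch milliseconds a filter on absolute time would need.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical helper; names are illustrative only.
final class QueryBound {
    // Convert a wall-clock datetime in a named zone to epoch milliseconds.
    static long toEpochMillis(LocalDateTime wallClock, String zoneId) {
        return wallClock.atZone(ZoneId.of(zoneId)).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        // 14:08:00.000 on the New York wall clock, 9 Sep 2013.
        long bound = toEpochMillis(LocalDateTime.of(2013, 9, 9, 14, 8), "America/New_York");
        System.out.println("select * from table where tradeDate > " + bound);
    }
}
```

Note that the conversion needs the zone as an input: the wall-clock value alone does not identify a point in time.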
It's About Dayframe, not Timezone
But we and our systems lead lives structured around a day. More specifically,
a day is "activity occurring within some number of hours, 24 or fewer." The
subtle point here is that a system processing day may cross over into the next
calendar day before the processing day is incremented. A day's worth of
processing is not necessarily anchored to a calendar day, but by convention
we almost always associate it with a date. This duration
of time (herein called the "dayframe") is
the important organizational element of a system, not timezones.
Dayframes are high level and business-oriented; timezones are low-level
and system-oriented.
The challenges are:
- There is only one absolute time, but there are likely several dayframes
and several timezones within each dayframe.
- The interpretation of the absolute time in an
individual dayframe is as important as the interpreted relationship
between dayframes.
- (The big one) Interpretation of, and functions on, the date in a
dayframe are done in objects outside of Date -- and usually in just a
simple String. Programs, SQL, and flatfiles don't typically use absolute
times; they use TO_DATE(YYYY-MM-DD).
A design that makes one or more of the following
mistakes will make it difficult to work with
dayframe data on a practical basis:
- Mistake #1: Using activity date to imply dayframe date
Don't take the shortcut. The datetime captured for an activity does
not have to occur on the dayframe date.
- Mistake #2: Using timezone to imply dayframe
Among other things, several different timezones can be generating activity
at the same time within the same dayframe. Timezone is too
low-level for organizing activity; you must design and define an appropriate
set of dayframes and identifiers for same.
- Mistake #3: Converting everything to GMT/UTC
In order to capture cross-timezone relationships in a way
that does not combinatorially explode, many systems choose instead to
convert the absolute time to GMT, essentially eliminating
all the relationships. This helps in sorting and performing
global event rationalization but presents an immediate practical problem for
processing data within a dayframe because all date functions
will need an offset applied. Furthermore (and this is the longer term,
more insidious problem), it becomes difficult to recreate what happened
at exactly what time in a dayframe given shifts in timezone and daylight
saving time.
The practical problem of dealing with an all-GMT date system in
databases and code starts hard and gets harder.
- Mistake #4: Overreliance + sloppiness with ISO 8601
ISO 8601 datetime strings are pretty good. No whitespace, able to capture
to the millisecond, a method to capture offset from UTC+0 (GMT+0), and
even a nice convenience symbol 'Z' as a substitute for +00:00. When
a time and UTC offset are present, you get local and UTC time in
one string.
But there are non-trivial fact-of-life issues with adopting ISO 8601:
- Sorting and filtering are essentially non-standardized when UTC offsets
are included. This negative outweighs the positive of capturing local
and UTC offset (recall the cross-dayframe requirement above) in one encoded
form.
- On a practical use case basis, operating on a set of records
is easier when the functions can work directly with a datetime and not
have to apply logic to a datetime to convert it to UTC. In other words,
the offset from UTC is less useful than the actual UTC.
- UTC offsets make datetime field processing different than the
other popular scalar types (string, double,
int, BigDecimal, etc.). For example, a String can be created, emitted
to a flatfile, and read into a String in another program with little
effort and complete transparency around the resources and logic to do so.
Even BigDecimal has a very straightforward object-to-string-to-object
pathway.
A Date, however, needs to be converted to an ISO 8601 string form with logic
that properly adds the UTC offset, a non-trivial operation.
The other program needs additional
logic to properly/reliably convert the string back to a Date. Even if
you choose to duck the issue (see below) and not use UTC offsets, you
must ensure your code handles the case where they inadvertently show
up and are parsed by existing logic to yield the wrong datetime!
- There is a lack of uniformity and a wealth of options in popular
database engines (Oracle, MySQL, Postgres, etc.) around both column types
and conversion
functions and arguments (e.g. TO_DATE() vs. TO_TIMESTAMP() vs. DATE(), with or without
timezone data, etc.). This variety hampers quick, clear understanding
of how the data will behave in the RDBMS.
- Ducking the issue and dropping UTC offsets
with no additional context is a lossy
operation and dooms the consumer of the data to rely on assumptions about
which dayframe was assumed in the interpretation.
- The ecosystem, especially in Java, around parsing ISO dates is
complicated. Right in the JDK is
javax.xml.bind.DatatypeConverter.parseDateTime() but it does not
parse UTC offsets! SimpleDateFormat can be used. So can
JodaTime, a very popular open source library. Each has nuances
particularly around parsing datetime strings that can lead a team to
settle into a locally-correct design that is inconsistent with a peer
system -- which itself is locally-correct.
- Mistake #5: Overreliance on platform LOCALE
A lot of datetime software is plumbed into things like default
LOCALE and such. These often misunderstood or unknown environmental
factors can change without notice. Environment-based functions also
run the risk of being too broadly scoped in their use. A design that
manages to use environment-based information correctly might find itself
needing to process different dayframes of data in the same process
space, requiring brittle hacks such as "switching out the environment under
the program" to get the datetime functions to work properly for a
new dayframe.
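To make the ISO 8601 sorting problem from Mistake #4 concrete, here is a hedged sketch assuming java.time's OffsetDateTime; the class name and the two sample strings are illustrative assumptions built from events #1 and #2. Sorting the strings as text gives one order; sorting by the instant each string denotes gives the other, and the second requires parsing every value first.

```java
import java.time.OffsetDateTime;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
final class IsoOffsetSorting {
    // Sort ISO 8601 strings as plain text, as a flatfile tool would.
    static List<String> sortAsText(List<String> iso) {
        List<String> copy = new ArrayList<>(iso);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sort by the instant each string denotes; every value must be parsed first.
    static List<String> sortByInstant(List<String> iso) {
        List<String> copy = new ArrayList<>(iso);
        copy.sort(Comparator.comparing((String s) -> OffsetDateTime.parse(s).toInstant()));
        return copy;
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        events.add("2013-09-10T02:08:31.036+09:00"); // Tokyo, happened first
        events.add("2013-09-09T13:08:31.037-04:00"); // New York, one millisecond later
        System.out.println("As text:    " + sortAsText(events));
        System.out.println("By instant: " + sortByInstant(events));
    }
}
```

The text sort places the New York event first because its calendar date is earlier, even though the Tokyo event occurred first.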
The Solution
A practical solution involves the following for each record that
is created:
- Externalize the event without
timezone information in order to capture
exactly what "the wall clock said" at the moment the event was generated
as it would be seen by the most important actor in the activity. This
could be a user pressing a purchase button or a system task generating
a report.
- Define and capture a business dayframe identifier e.g. a processing region enumeration.
- Capture the business dayframe date.
- Externalize the offset to UTC in minutes in a separate field. Alternatively,
for maximum clarity, simply store a whole new UTC datetime. Note this is not a
datetime with offset Z or +00:00. The definition of the field itself is
UTC Time and the datetime carries no UTC offset information.
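The four items above could be carried in a record shaped roughly like the following. This is only a sketch under stated assumptions: the class ActivityRecord, its field names, and the use of java.time types are all hypothetical choices, not a prescribed layout.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical record layout; class and field names are illustrative only.
final class ActivityRecord {
    final String dayframeId;       // business dayframe identifier, e.g. "ASIA"
    final LocalDate dayframeDate;  // business dayframe date, no time component
    final LocalDateTime localTime; // what the wall clock said, no offset
    final LocalDateTime utcTime;   // same moment in UTC; the field definition carries the meaning

    ActivityRecord(String dayframeId, LocalDate dayframeDate, Instant when, ZoneId actorZone) {
        this.dayframeId = dayframeId;
        this.dayframeDate = dayframeDate; // supplied by the caller, never derived from 'when'
        this.localTime = LocalDateTime.ofInstant(when, actorZone);
        this.utcTime = LocalDateTime.ofInstant(when, ZoneOffset.UTC);
    }

    @Override
    public String toString() {
        return dayframeId + " | " + dayframeDate + " | " + localTime + " | " + utcTime;
    }

    public static void main(String[] args) {
        ActivityRecord r = new ActivityRecord(
            "ASIA", LocalDate.of(2013, 9, 9),
            Instant.ofEpochMilli(1378746511036L), ZoneId.of("Asia/Tokyo"));
        System.out.println(r);
    }
}
```

The dayframe date is passed in explicitly on purpose: deriving it from the event time would repeat Mistake #1.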
Coming back to our 4 pieces of activity:
# | Business Dayframe | Dayframe date | Local Time           | UTC Time             |
1 | ASIA              | 2013-09-09    | 2013-09-10 02:08.441 | 2013-09-09 17:08.441 |
2 | NORTHAM           | 2013-09-09    | 2013-09-09 13:08.442 | 2013-09-09 17:08.442 |
3 | NORTHAM           | 2013-09-09    | 2013-09-09 14:08.653 | 2013-09-09 18:08.653 |
4 | ASIA              | 2013-09-09    | 2013-09-10 04:08.085 | 2013-09-09 19:08.085 |
The value of the datetimes and how they are principally used is
now unambiguous
because we now have four pieces of information (business dayframe, dayframe
date, local event time, and UTC time) instead of just two
(local time and UTC offset). Notes:
- We have created two dayframe identifiers, ASIA and NORTHAM.
These represent two contexts for processing as defined by the system. They
are named in such a way as to provide a healthy clue about what they are used
for but there is no specific relationship between the name and timezones that
are implied by the name. This addresses major issue #2 above.
- The dayframe date has no time. It does not need time. Dayframe
identifier + Dayframe date defines a business-relevant set of events.
- The local time is exactly what the user or the system "saw" at the time
the event was created within the business dayframe. This is particularly
important for systems interacting with financial markets and regulatory
reporting.
- Since all local times are also stored as UTC, this takes care of global
sorting and filtering and we no longer need to store absolute time.
- All sorting, filtering, and database functions can now be made very clear
and transparent with no murky logic around time offsets.
- Datetimes are
now convertible between String and Date without any extra information
or processing, just like other popular scalars.
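The last two notes can be illustrated with a short sketch, assuming java.time's LocalDateTime and a single fixed-width pattern chosen for this example; the class name and the sample UTC values (derived from the four epoch figures) are illustrative assumptions. With no offset in the text, the string-to-object round trip is one call each way, and a plain text sort of the UTC field is already chronological.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical demo class; names and the pattern are illustrative only.
final class PlainDatetimeFields {
    // One fixed-width pattern, no offset, used in both directions.
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS");

    static LocalDateTime fromText(String text) {
        return LocalDateTime.parse(text, FMT);
    }

    static String toText(LocalDateTime value) {
        return value.format(FMT);
    }

    // Fixed-width, offset-free text sorts chronologically as plain strings.
    static List<String> sortAsText(List<String> utcTimes) {
        List<String> copy = new ArrayList<>(utcTimes);
        Collections.sort(copy);
        return copy;
    }

    public static void main(String[] args) {
        List<String> utc = new ArrayList<>();
        utc.add("2013-09-09T19:08:31.055");
        utc.add("2013-09-09T17:08:31.037");
        utc.add("2013-09-09T18:08:31.170");
        utc.add("2013-09-09T17:08:31.036");
        System.out.println(sortAsText(utc));

        String local = "2013-09-10T02:08:31.036";
        System.out.println(toText(fromText(local)).equals(local));
    }
}
```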
If we want to get very crisp about dayframe handling, then instead of using
a date and dayframe identifiers as a composite key, we would indirectly
address date and other attributes through
a bespoke key. A simple implementation might be D followed by
an incremental integer, e.g.
# | Dayframe | Local Time           | UTC Time             |
1 | D771     | 2013-09-10 02:08.441 | 2013-09-09 17:08.441 |
2 | D772     | 2013-09-09 13:08.442 | 2013-09-09 17:08.442 |
3 | D772     | 2013-09-09 14:08.653 | 2013-09-09 18:08.653 |
4 | D771     | 2013-09-10 04:08.085 | 2013-09-09 19:08.085 |
To find out what happened on a particular dayframe, one would first
consult the dayframe resource and query to arrive at the proper dayframe key,
which would then be supplied to a query against the transaction resource.
This permits more flexibility in determining and capturing attributes that
make a dayframe, such as the UTC time that it was activated, by which
actor in the system, state, etc.
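The two-step lookup described above might look like the following in-memory sketch. Everything here is hypothetical: the class, the map and list standing in for the dayframe and transaction resources, and the pairing of D771 with ASIA and D772 with NORTHAM, which is inferred from the two tables. Time columns are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-ins for the two resources.
final class DayframeLookup {
    // Dayframe resource: "identifier|date" -> bespoke dayframe key.
    static final Map<String, String> DAYFRAME_KEYS = new HashMap<>();
    // Transaction resource: rows of {event number, dayframe key}.
    static final List<String[]> TRANSACTIONS = new ArrayList<>();

    static {
        DAYFRAME_KEYS.put("ASIA|2013-09-09", "D771");
        DAYFRAME_KEYS.put("NORTHAM|2013-09-09", "D772");
        TRANSACTIONS.add(new String[] {"1", "D771"});
        TRANSACTIONS.add(new String[] {"2", "D772"});
        TRANSACTIONS.add(new String[] {"3", "D772"});
        TRANSACTIONS.add(new String[] {"4", "D771"});
    }

    // Step 1: consult the dayframe resource to arrive at the key.
    static String findDayframeKey(String dayframeId, String dayframeDate) {
        return DAYFRAME_KEYS.get(dayframeId + "|" + dayframeDate);
    }

    // Step 2: supply the key to a query against the transaction resource.
    static List<String> eventsFor(String dayframeKey) {
        List<String> events = new ArrayList<>();
        for (String[] row : TRANSACTIONS) {
            if (row[1].equals(dayframeKey)) {
                events.add(row[0]);
            }
        }
        return events;
    }

    public static void main(String[] args) {
        String key = findDayframeKey("ASIA", "2013-09-09");
        System.out.println(key + " -> events " + eventsFor(key));
    }
}
```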
In summary, don't use shortcuts to try to capture a local time, a UTC time,
and a dayframe (a.k.a. system processing date) in fewer than the three separate pieces of
information that they are.
Site copyright © 2013-2024 Buzz Moschetti. All rights reserved