PQL: The Prefix Query Language
What is PQL?
PQL stands for (P)refix (Q)uery (L)anguage and is pronounced "pea-quill." It is
a specification for how a set of functions and their associated arguments can
be applied to a shape of data to yield a boolean match or no match. It is not
a database unto itself, nor the "house band" for a particular database engine.
It is simply ... software.
What makes PQL different from SQL and XPath and some other popular query languages
is that it is based on a physical data structure, not a stringified grammar
which is then parsed into some internal form.
In other words, there is no standard string equivalent to the full SQL statement:
String s = "select * from FOO where A = 'NY' and B < 12";
PQL has no intrinsic parser per se; you build a PQL statement by
assembling Maps of Maps in a way very similar to MongoDB but with no syntax
shortcuts. The PQL equivalent to the SQL above is:
Map m1 = new HashMap();
m1.put("A", "NY");
Map m2 = new HashMap();
m2.put("eq", m1);
Map m3 = new HashMap();
m3.put("B", 12); // autobox to Integer
Map m4 = new HashMap();
m4.put("lt", m3);
List l2 = new ArrayList();
l2.add(m2);
l2.add(m4);
Map m5 = new HashMap();
m5.put("and", l2);
We show the construction above implemented in Java but the concept of Map
of Maps is largely language-neutral; most popular languages are capable
of constructing such a structure.
Each Map contains exactly one function name and one argument (i.e. a single key-
value pair), although that argument can be of type scalar, Map, or List.
The topmost map above, m5, can then be consumed by utils such
as MapFilter, SQLRewrite, MBI/MBD, and other tools. PQL is mostly about
physical structure and a very small set of basic operators. Externalized in JSON
form (a convenient way to show it), the statement above would be:
{ "and": [
{ "eq": { "A": "NY" }},
{ "lt": { "B": 12 }}
]
}
Again, to be clear, although PQL is in fact nicely stringified in JSON, the
language specification is not dependent on JSON. Rather, the specification
is about standard function names and the physical structure of their
arguments (scalar, List, or Map).
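Because every function Map is a single key-value pair, a tiny helper can make
construction far less verbose. The following is a hypothetical convenience
layer, not part of the spec; the class and method names are illustrative only:

```java
import java.util.*;

// Hypothetical convenience layer -- not part of the PQL spec -- for building
// the single-entry function Maps the structure requires.
public class PqlBuilder {
    // Builds {"fn": arg}
    public static Map fn(String name, Object arg) {
        Map m = new HashMap();
        m.put(name, arg);
        return m;
    }

    // Builds {"fn": {"field": value}}
    public static Map fn(String name, String field, Object value) {
        Map inner = new HashMap();
        inner.put(field, value);
        return fn(name, inner);
    }

    public static void main(String[] args) {
        // Rebuild the A = 'NY' AND B < 12 example in one expression:
        Map q = fn("and", Arrays.asList(fn("eq", "A", "NY"), fn("lt", "B", 12)));
        System.out.println(q); // {and=[{eq={A=NY}}, {lt={B=12}}]}
    }
}
```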
Why PQL?
The following factors motivated the design and implementation of PQL:
- Systems increasingly use caches of data and maintain this data as
rich shapes (maps of maps of lists of maps, etc.) instead of the
traditional RDBMS database-centric approach
that vends ResultSets and flat rectangles. A framework that allows seamless,
consistent, and efficient rich shape filtering across a persistor and items
in-memory is very desirable because it eliminates the nagging problem of needing
to build a "paradigm bridge":
[SQL / flat rectangles] ==> [some other means of filtering / rich shapes]
- SQL and XPath are grammars heavily oriented toward human readability
and typeability, including infix notation, parentheses, whitespace, etc.
This is good for experimentation at the command line. It is also good
for "smallish" queries. When predicates get large, however, the ability
to quickly absorb the logical operators and parentheses and nested
conditions is lost, and it becomes no more difficult to "think like a
computer" when examining a prefix notation statement. Perhaps more
importantly, the readability of SQL and XPath actually
confounds the programmatic construction of queries, whether from a GUI or
from backend logic, especially with respect to to-string conversion of non-string
types. Software does not need whitespace, syntactic sugar, or English nouns
and verbs. And going in the other direction (parsing/deconstruction of SQL or
XPath) is positively laborious. PQL, however, is about low-level structure of
the query and enforcing prefix notation for all functions and is very
easily constructed -- and it is just as easily deconstructed.
The map-of-map paradigm means it can be externalized in a number of familiar
forms including JSON or XML which means it can be rehydrated easily
into the original map-of-maps object hierarchy.
- The low-level structural approach permits modularity in query construction.
Query fragments can be pre-constructed and dynamically assembled to
perform the desired query. Common and perhaps complex query logic can
be set up in one place and leveraged elsewhere in a runtime. SQL and XPath
have a harder time with this because they are non-modular. They are designed
around a full statement, not fragments. Even if you do fragment a query,
you typically run into the issue of variable substitution and so you would
likely have an implementation that held the data pre-formatting (i.e.
before creating the final string A = 'B' and C = 'D' ). And that
implementation starts to look a lot like ... PQL!
- Most SQL is not purely portable across database engines. In contrast,
most PQL queries can be implemented not only on all SQL-based RDBMS but
also on other database engine types like MongoDB, Cassandra, and Aerospike. PQL
is also easily converted to Ehcache Search API.
- PQL has a straightforward means to add user-defined functions, largely
because the spec itself is mostly about functions and arguments.
- Cross-language and multi-language leverage across Java, python, ruby, etc.
is increasingly useful and being able to define rich structure (and a query
expression) in the paradigm of the "host" language is very valuable.
The PQL map-of-map structure is natively "transportable" because essentially
every language knows how to make a Map, a List, or one of the core scalar
types. High performance
utils written in Java to consume Maps of data and PQL queries (as Maps)
are immediately and natively leveragable in JRuby and JPython; just build
the hashes in the familiar ruby or python way and pass them to the Java
layer.
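The modularity point above can be sketched concretely: a fragment is just a
Map built once and dropped into any number of enclosing queries, with no
string formatting or variable substitution. The fn helper and field names
below are illustrative assumptions:

```java
import java.util.*;

public class FragmentReuse {
    // Illustrative helper for a single-entry function Map (name assumed).
    public static Map fn(String name, Object arg) {
        Map m = new HashMap();
        m.put(name, arg);
        return m;
    }

    public static void main(String[] args) {
        // Build a common fragment once...
        Map inNewYork = fn("eq", Collections.singletonMap("region", "NY"));

        // ...and assemble it structurally into two different queries:
        Map q1 = fn("and", Arrays.asList(inNewYork, fn("lt", Collections.singletonMap("age", 30))));
        Map q2 = fn("or", Arrays.asList(inNewYork, fn("null", "region")));

        // Both queries share the very same fragment object:
        System.out.println(((List) q1.get("and")).get(0) == ((List) q2.get("or")).get(0)); // true
    }
}
```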
In general, most queries -- even complex ones -- can be expressed in PQL.
The simplicity of analyzing a PQL query combined with flexibility of
mating PQL to a backend persistor means that even persistor-specific
features and query capabilities can be exploited in the rewrite layer
while protecting the application from the details of that implementation.
For example, it is straightforward to programmatically walk a PQL AND
clause and check if two date or numeric fields are a "less-than-greater-than"
pair, implying BETWEEN. The rewrite would substitute a single BETWEEN
expression instead of the two separate "lt" and "gt" expressions.
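The walk just described might be sketched as follows. The method and class
names are assumptions for illustration, not part of any shipped tool; a real
rewrite layer would also compare the bound values and emit the BETWEEN:

```java
import java.util.*;

public class BetweenRewrite {
    // Sketch only: scan an "and" list for a gt/lt pair on the same field --
    // a SQL rewrite layer could collapse such a pair into a single BETWEEN.
    public static String findBetweenField(Map q) {
        List clauses = (List) q.get("and");
        if (clauses == null) return null;
        Set gtFields = new HashSet();
        for (Object c : clauses) {
            Map.Entry e = (Map.Entry) ((Map) c).entrySet().iterator().next();
            if (e.getKey().equals("gt")) gtFields.addAll(((Map) e.getValue()).keySet());
        }
        for (Object c : clauses) {
            Map.Entry e = (Map.Entry) ((Map) c).entrySet().iterator().next();
            if (e.getKey().equals("lt")) {
                for (Object f : ((Map) e.getValue()).keySet()) {
                    if (gtFields.contains(f)) return (String) f; // BETWEEN candidate
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map q = new HashMap();
        q.put("and", Arrays.asList(
            Collections.singletonMap("gt", Collections.singletonMap("B", 5)),
            Collections.singletonMap("lt", Collections.singletonMap("B", 12))));
        System.out.println(findBetweenField(q)); // B
    }
}
```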
PQL "Compliance"
Since PQL is merely a structure and a dictionary of well-known function
strings, PQL compliance means that tools consuming it must conform to these
rules:
- PQL function Maps contain a single key-value entry.
The key is the function name and the value is the argument to that function.
Functions requiring multiple scalars to perform their task can be passed a Map
and that Map (the argument Map) can contain more than a single entry.
Functions that operate on more than one unit can be passed a List.
Single-entry Maps are used rather than Pair because the Map interface is
much more widely used so familiarity and breadth of support and implementation
trumps minimal interface design here.
- A util digesting PQL must be capable of supporting the following
12 Basic Functions:
- Basic operators: lt, lte, ne, eq, gte, gt
For all of these, the argument is a Map containing the target field name
as the key and the comparison value as the value, e.g. { "lt": { "B": 12 }}
- Inlist (optimized equals): in
SQL: select * from FOO where field1 in ('NY','NJ')
PQL: { "in": { "field1": [ "NY", "NJ" ] }}
- Logicals: and, or, not
The backbone of complex expressions, the and and or functions take a List
of subexpressions to evaluate and can be nested arbitrarily deep. not
takes a single Map as an argument.
- Existence: null
Due to the vagaries of some Map implementations permitting entries with null
values, PQL avoids ambiguous terms like "exists." Instead, the null
operator clearly checks for whether a field -- or any parent component of
the field if it is a dotpath -- is null, regardless of how the underlying
Map might be carrying the value (if at all). Use logical not
to determine if a dotpath component is not null, e.g.
PQL: { "not": { "null": "a.b" }}
- Type: type
An equals operator for target field type instead of value. To aid in the
to- and from-string serialization, the following simple strings identify the
types: STRING, INT, LONG, DATE, DOUBLE, BIGDECIMAL, MAP, LIST.
Implementations may provide convenient macros or constants for these strings.
PQL: { "type": { "birthday": "DATE" }}
PQL challenges
It always comes down to Dates and BigDecimal and round tripping. PQL by
itself has zero issues with Dates and BigDecimal, but PQL as part of a larger
system design with externalization requires something else to manage and/or
define the Date and BigDecimal types. For example, PQL queries and the
Maps they operate on are trivially externalized and parsed via JSON if they
contain only strings and integers. Most JSON parsers have switch selectable
modes to parse floating point numbers as BigDecimal instead of
double so that's only slightly less trivial. In other words, JSON
out of the box can be used as the externalization solution!
But dates are a pain
because once externalized as a String representation (say, in ISO 8601),
an arbitrary consumer does not know upon rehydration if it is a real date,
a date deliberately made to look like a String, or something else. Sure,
you can sample-parse it to see if a successful conversion can be performed,
but that might not be the intention of the producer. SQL and other languages
are based around a string grammar and thus the standards for to- and from-string
conversion are baked in. That said, there is still enough "variety" in the
SQL standard such that externalizing something like DATE('2013-08-01')
from Postgres will not work in Oracle (although TO_DATE(data, format)
does in fact work in both).
SQL: select * from FOO where createDate < date('2013-08-01')
PQL:
Map m1 = new HashMap();
{
// An acceptable built-in string-to-date parser...
Calendar c2 = javax.xml.bind.DatatypeConverter.parseDate("2013-08-01");
m1.put("createDate", c2.getTime());
}
Map m2 = new HashMap();
m2.put("lt", m1);
// Up to here, PQL and all tools still working fine
// But here, a trivial externalizer might produce this:
{ "lt": { "createDate": "2013-08-01" }}
// ...and now that looks like a String, not a date...
In short, it comes down to defining a standard for externalization. One possibility
is to leverage the special Map function framework (that supports ind)
and add a function called class
{ "lt": { "createDate": { "class": { "type": "java.util.Date", "str": "2013-08-01" }}}}
Again, this representation is NOT used internally in PQL. Post rehydration
of the fragment just above, the function Map would look like this:
Map.Entry me1 = extractOneME(m); // assume m is the "lt" map
me1.getKey(); // "lt"
Map m2 = me1.getValue(); // the Map arg
Map.Entry me2 = extractOneME(m2);
me2.getKey(); // "createDate"
// ah HA! This is not the "class" thing above. The rehydrator has
// recognized the special function "class" and created a real Date object
// in place of the type info and string equivalent:
Object o = me2.getValue(); // o.getClass() is Date, value is 20130801
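A rehydrator along those lines might look like the sketch below. Only
java.util.Date is handled, and the "class"/"type"/"str" key names follow the
convention proposed above; they are an assumption, not part of the core spec:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.*;

public class ClassRehydrate {
    // Sketch: after JSON parsing, walk the structure and replace
    // {"class": {"type": ..., "str": ...}} values with real objects.
    public static Object rehydrate(Object v) {
        if (v instanceof Map) {
            Map m = (Map) v;
            if (m.size() == 1 && m.containsKey("class")) {
                Map info = (Map) m.get("class");
                if ("java.util.Date".equals(info.get("type"))) {
                    try {
                        return new SimpleDateFormat("yyyy-MM-dd").parse((String) info.get("str"));
                    } catch (ParseException e) {
                        throw new RuntimeException(e);
                    }
                }
            }
            // Recurse into ordinary maps in place:
            for (Object o : m.entrySet()) {
                Map.Entry e = (Map.Entry) o;
                e.setValue(rehydrate(e.getValue()));
            }
        }
        return v;
    }

    public static void main(String[] args) {
        Map info = new HashMap();
        info.put("type", "java.util.Date");
        info.put("str", "2013-08-01");
        Map cls = new HashMap();
        cls.put("class", info);
        System.out.println(rehydrate(cls).getClass().getName()); // java.util.Date
    }
}
```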
Aggregates
Standard PQL does not have a specification for aggregate functions because
it is oriented around boolean matching of single candidate shapes, not lists.
Aggregate functions have different semantics and although it is tempting
to create functions like agg:
{ "agg": {
"filter": { # regular PQL here },
"expr": { # commands to do something BESIDES filtering }
}}
these need to be set up in an engine that does something intelligent with
the output especially since agg functions typically generate new shapes
of data. The MongoDB aggregation framework and pipelining is a step
in that direction.
Extending PQL
The strict but simple nature of PQL means that adding functions is
easy. We do need, however, a convention for namespacing the functions.
The obvious choice would be using a colon just like XML:
{ "eq": { "field1": "NY" }} # eq is in the default Basic namespace
{ "ns1:myFunc": arg } # myFunc is in the ns1 namespace
The filter would provide the predictable handler interface to declare
new sets of functions:
MapFilter e = new TheMapFilterImpl();
class MyFunctions implements MapFilterFunctionSet {
// register function "myFunc"
}
e.registerFunctions("ns1", new MyFunctions(args));
At the point we tackle that, we should also craft a means to identify
namespaces of variables that can be addressed. Yes, ultimately we are
rewriting LISP, but we would stop well short of that.
It's not SQL
PQL is not a large, multi-decade mature, general purpose query language.
It is a simple framework to perform modest operations on a Map of data.
Although user defined functions can be declared, don't equate the
Map-of-Map design paradigm with SQL. SQL and its variants are much richer
and more complex.
Date is always a tricky type and PQL avoids some of the trickiness by not
complicating it with its own rules. Date timezone, daylight savings time,
etc., are completely up to the producer and consumer to negotiate. PQL
does not get into the middle of the discussion; it simply carries the Date
type around. A Date is a single instant in time and intrinsically has no
timezone or daylight savings time (DST) information. In other words,
operations like (Date)d1.compareTo((Date)d2) work correctly
independently of timezone and DST.
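A quick demonstration of that point: two Date objects for the same instant
compare equal no matter what the default timezone happens to be.

```java
import java.util.*;

public class DateCompare {
    public static void main(String[] args) {
        // A Date is just an instant; it carries no TZ/DST information,
        // so compareTo is unaffected by the default timezone.
        Date d1 = new Date(1375315200000L);
        TimeZone.setDefault(TimeZone.getTimeZone("Asia/Tokyo"));
        Date d2 = new Date(1375315200000L);
        System.out.println(d1.compareTo(d2)); // 0
        TimeZone.setDefault(TimeZone.getTimeZone("America/New_York"));
        System.out.println(d1.compareTo(d2)); // still 0
    }
}
```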
It is only when Date is externalized or parsed from a String that
the multitimezone interpretation of that instant in time becomes an issue.
The rubber hits the road when a PQL-compliant util actually
performs String operations on the date. This is where things like default
locale and DST settings start to muddy the waters. A pragmatic
recommendation is to NOT use TZ and DST offsets in string representation
of dates and instead specifically define the scope/purpose of the target date
item i.e. businessTradeDate will carry the datetime of activity as
observed on the clock in that business day context -- which is not necessarily
the clock that the "user" sees nor the UTC time. Other information can
be used later to take the point-in-time datetime and represent it in
different String-ified forms.
Does PQL "Work?"
Yes! I have crafted these compact utils that use PQL:
- MapFilter. Given a candidate data Map and a PQL map, return a
boolean match or no match. One might call MapFilter the reference
implementation for processing PQL.
- MapListFilter. Given a candidate List of Maps and a PQL map,
return a boolean array indicating which items in the list match or
not.
- SQLRewrite. An essential component for MBI/MBD, a hybridized relational/
JSON persistence layer. MBI/MBD sits in between the persistor and the
data access layer and allows a regular RDBMS like Oracle or MySQL or
Postgres to efficiently capture and perform queries on rich data. In
summary, fields -- arbitrarily deep into the rich data structure --
that are indexed in MBI/MBD are copied from the shape into "helper" columns
in the table which can be used as part of a SQL expression. MBI/MBD
manages consistency between the JSON BLOB/CLOB and the helper columns
so copy, indexing, update, etc. are all invisible to the DAL.
- PQLtoMongo. Probably obvious that this would exist. It creates the
shortcuts necessary to drive a MongoDB query, e.g.
PQL: { "eq": {"fld1": "val1" }}
MDB: { "fld1": "val1"}
PQL: { "lt": {"fld1": "val1" }}
MDB: { "fld1": {"$lt": "val1"}}
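The two rewrites shown above can be sketched in a few lines. This is an
illustrative sketch, not the actual PQLtoMongo util: eq collapses to
{field: value}, and the other basic operators become {field: {"$op": value}}.

```java
import java.util.*;

public class PqlToMongoSketch {
    // Sketch of the basic-operator rewrite only; logicals, in, etc. omitted.
    public static Map convert(Map pql) {
        Map.Entry fn = (Map.Entry) pql.entrySet().iterator().next();
        Map.Entry arg = (Map.Entry) ((Map) fn.getValue()).entrySet().iterator().next();
        Map out = new HashMap();
        if (fn.getKey().equals("eq")) {
            // MongoDB shortcut: equality needs no operator at all.
            out.put(arg.getKey(), arg.getValue());
        } else {
            // Other basic ops map to $lt, $gt, $ne, etc.
            out.put(arg.getKey(), Collections.singletonMap("$" + fn.getKey(), arg.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map lt = Collections.singletonMap("lt", Collections.singletonMap("fld1", "val1"));
        System.out.println(convert(lt)); // {fld1={$lt=val1}}
    }
}
```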
Is PQL "Fast?"
Given MapFilter as a reference implementation, here are some results
collected on a 2013 MacBookPro (Intel Core i7, 2.7GHz, 4 cores,
256KB L2 per core, 6MB L3 cache, 16GB total physical memory) using
Java 1.7.0_40 with no special tuning or -X startup options.
// Data shape looks like this:
// {
// "c1": "val1",
// "c2": "val2"
// }
// PQL query:
{ "and": [ { "eq": { "c1": "val1" }}, { "eq": { "c2": "val2" }} ] }
MapFilter.eval(data, query) achieves between 4.2 and 4.4 million
evals per second. Note: With a single "eq" operation (no "and"), you
get a whopping 15.2 million per second.
// Data shape looks like this:
// {
// "a": {
// "b": {
// "c": {
// "d1": "val1",
// "d2": "val2"
// }
// }
// }
// }
// PQL query:
{ "and": [ { "eq": { "a.b.c.d1": "val1" }}, { "eq": { "a.b.c.d2": "val2" }} ] }
MapFilter.eval(data, query) achieves between 2.6 and 2.7 million
evals per second.
In the context of AND (a common logical), basic operator tests can be performed
at a rate of about 7 million per second.
A big query -- say, an "and" list containing 20 basic operators going
only one level deep into a rich data structure -- runs at
about 450000 evals per second. In the reference implementation, performance
is governed largely by the number of basic operations. A performance
optimization for non-nested dotpaths keeps things fast for those queries.
Performance degrades only 2 or 3% per level as the depth of the dotpath increases.
"Pushing a frame" for and, or, and not has little effect.
In practice, there will be very few examples of deeply nested data combined
with deeply nested logical operators.
The in operator works particularly well even with very large lists.
// Put one million items in the list:
{ "in": { "name": [ "N1", "N2", ... "N1000000" ] }}
// Even with a trivial implementation (i.e. no internal optimizations
// like putting lists greater than 1000 in a hashmap) combined with a
// "worst case" input data map like this
// {
// "name": "N1000000"
// }
// it still only takes about 50ms, about 20,000,000/sec for String.equals().
Because PQL queries are essentially "modular" Map units, large inlists
can be constructed ahead of time and managed as a group. Later, as
appropriate, they can be added as ingredients into a new query. You have
the additional flexibility of managing the raw List of String
separately and "plugging it into" the value for "name" later on.
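A sketch of that pattern: the raw List is managed separately and the query
simply holds a reference to it, so the list can even be grown after the query
is assembled. Field names here are illustrative:

```java
import java.util.*;

public class InListDemo {
    public static void main(String[] args) {
        // Manage the raw List separately...
        List names = new ArrayList();
        for (int i = 1; i <= 1000000; i++) names.add("N" + i);

        // ...and plug it into an "in" query as a plain reference:
        Map arg = new HashMap();
        arg.put("name", names);
        Map q = new HashMap();
        q.put("in", arg);

        // Because the query holds a reference, later additions to the
        // managed list are visible without rebuilding the query:
        names.add("N1000001");
        System.out.println(((List) ((Map) q.get("in")).get("name")).size()); // 1000001
    }
}
```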
In general, it can be said that it is fast enough.
PQL Examples
Here are some examples (in JSONified representation of the qexpr) in
the context of a traditional select statement from SQL. Only the first
two examples have Java snippets but the approach to construction is the
same for all examples:
// The equiv of hello world:
//
// fld1 = 'value1'
The Java to construct this Map-of-Maps might look like this:
Map q = new HashMap();
Map q1 = new HashMap();
q1.put("fld1", "value1");
q.put("eq", q1);
and the string representation (easier to visualize it this way) would be:
{
"eq": { "fld1": "value1" }
}
// More than a single boolean op requires one or more logical functions.
// Here we see that the and function expects a list of expressions
// to eval, not a single map or scalar value. But the overall
// function-argument design remains consistent.
//
// fld1 = 'value1' and fld2 = 'value2'
{
"and": [
{ "eq": {"fld1": "value1"}},
{ "eq": {"fld2": "value2"}}
]
}
Map qexpr = new HashMap();
List q2 = new ArrayList();
{
Map q = new HashMap();
Map m7 = new HashMap();
m7.put("fld1", "val1");
q.put("eq", m7);
q2.add(q);
}
{
Map q = new HashMap();
Map m7 = new HashMap();
m7.put("fld2", "val2");
q.put("eq", m7);
q2.add(q);
}
qexpr.put("and", q2);
// "Roughly equiv" to:
// select where trade.header.sys = 'ABC'.
// The qexpr map semantics permit rich nested structure. Maps
// of maps in the document Map being queried can be "walked" using
// the familiar dot operator.
// In conventional SQL this would be some sort of join of foreign
// keys, or perhaps the data itself would be "flattened" to a column
// like : select where trade_header_sys = "ABC"
{
"eq": { "trade.header.sys": "ABC" }
}
// SQL is not oriented toward dealing with non-scalars in a structure
// and that's OK; we're not trying to make this an apples-to-apples
// SQL v. PQL comparison. But suppose we have a structure like this:
// {
// "authUsers": [ "buzz", "dave" ]
// }
// where there can be a variable number of authUsers. Suppose we want
// to find any record where size(authUsers) > 1. We exploit the number
// convention of the dotpath syntax to see which records have a NULL entry
// at offset 1 (meaning the list has only 1 element), and then NOT the
// result to return those records that have more than one element. This
// is assuming well-behaved lists without embedded nulls and such.
{
"not": { "null": "authUsers.1" }
}
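The dotpath resolution behind this trick, including the numeric list-offset
convention, might be sketched as follows. This is illustrative only, not the
MapFilter implementation:

```java
import java.util.*;

public class DotpathSketch {
    // Sketch of dotpath resolution: "authUsers.1" addresses element 1 of
    // the authUsers list. Returns null if any component is missing, which
    // is exactly the condition the "null" operator tests.
    public static Object resolve(Map doc, String dotpath) {
        Object cur = doc;
        for (String part : dotpath.split("\\.")) {
            if (cur instanceof Map) {
                cur = ((Map) cur).get(part);
            } else if (cur instanceof List) {
                int idx = Integer.parseInt(part);
                List l = (List) cur;
                cur = (idx < l.size()) ? l.get(idx) : null;
            } else {
                return null; // dead end: parent component is missing
            }
        }
        return cur;
    }

    public static void main(String[] args) {
        Map doc = new HashMap();
        doc.put("authUsers", Arrays.asList("buzz", "dave"));
        System.out.println(resolve(doc, "authUsers.1")); // dave
        doc.put("authUsers", Arrays.asList("buzz"));
        System.out.println(resolve(doc, "authUsers.1")); // null -> only one element
    }
}
```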
// fld1 = 'value1' and fld2 <= 20
{
"and": [
{ "eq": {"fld1": "value1"}},
{ "lte": {"fld2": 20}}
]
}
// fld1 in ('value1','value2', 'value3')
{
"in": {"fld1": [ "value1", "value2", "value3" ] }
}
// fld4 not in (10, 20, 30)
{
"not": { "in": {"fld4": [ 10, 20, 30 ] }}
}
// Typically, the AND and OR functions will be the entry in the
// topmost map because almost any nontrivial query will require them.
// Here, OR is the main construct:
//
// fld1 in ('value1','value2') or fld2 != 'bar'
{
"or": [
{ "in": {"fld1": [ "value1", "value2" ] } },
{ "ne": {"fld2": "bar"}}
]
}
// Getting fancy but the qexpr syntax is very simple so although
// it would be messier than SQL to type, it is very easy for
// program logic to construct.
// We show datetime strings here but note the conversion to a date
// type. java.util.Date type is one of the recognized scalar types.
//
// fld1 in ('value1','value2')
// or fld2 != 'bar'
// or (fld3 != 0 and fld3 = TO_DATE('2013-08-27T00:00:00')
//
{
"or": [
{ "in": {"fld1": [ "value1", "value2" ] } },
{ "ne": {"fld2": "bar"}},
{ "and": [
{ "ne": {"fld3": 0}},
{ "eq": {"fld3": str2date("2013-08-27T00:00:00")}}
]
}
]
}
// Careful with those parens:
// fld1 = 'A' and (fld2 = 'B' or fld3 = 'C')
{
"and": [
{ "eq": {"fld1": "A"}},
{ "or": [ { "eq": {"fld2": "B"}}, { "eq": {"fld3": "C"}} ] }
]
}
// Again, be careful with those parens!
// (fld1 = 'A' and fld2 = 'B') or fld3 = 'C'
{
"or": [
{ "and": [ { "eq": {"fld1": "A"}}, { "eq": {"fld2": "B"}} ] },
{ "eq": {"fld3": "C"}}
]
}
// All basic scalar booleans can make use of the special ind function.
// If a literal value is not provided for comparison, then a Map
// containing the ind (indirect) function and a target field name in the
// document can be specified instead:
//
// fld1 = fld2
{
"eq": { "fld1": {"ind": "fld2"} }
}
Site copyright © 2013-2024 Buzz Moschetti. All rights reserved