Pack musicbrainz -- prolog/lucene.pl
Right. First off, forget everything you know about Lucene's search syntax, especially the boolean operators, which do not work in any logical way. This library is based on a data type which, as far as I can determine, represents the internal structure of a Lucene query. Basically, a query is a triple of a modifier (Lucene +, -, or <none>), a numerical boost (Lucene ^ operator), and, for want of a better name, a 'part'. I could have called it a query 'component', but 'part' is a shorter word that means the same thing. A part is either a primitive term coupled with a field name ((:)/2 constructor), or a composite part consisting of a list of sub-queries (comp/1 constructor). So, we have:
    :- type query    ---> q(modifier,boost,part).
    :- type part     ---> comp(list(query)) ; field:prim.
    :- type modifier ---> plus, minus, none.
    :- type boost    == nonneg.
Note that the 'field' argument of the (:)/2 part constructor is inherently defaulty: if no field is specified, the search agent fills it in with an application-specific default.
The primitives cover all those obtainable using the Lucene syntax and are as follows:
    :- type prim ---> word(word)                  % bare, unquoted literal word
                    ; glob(pattern)               % word with wildcards * and ?
                    ; re(pattern)                 % regular expression /.../
                    ; fuzzy(word,integer)         % fuzzy word match <...>~N
                    ; range_inc(word,word)        % inclusive range [A TO B]
                    ; range_exc(word,word)        % exclusive range {A TO B}
                    ; phrase(list(word),integer)  % quoted multi word with slop
                    .
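As a rough illustration of the data type, here is a small structural checker for fully instantiated query terms. It is only a sketch: valid_query/1 and its helpers are hypothetical names, not predicates exported by this library, and it assumes that a 'word' or 'pattern' is any atomic value and that a field name is an atom.

```prolog
% valid_query(+Q): Q is a fully instantiated q(Modifier,Boost,Part) term.
% Hypothetical helper, not part of the library.
valid_query(q(M, B, P)) :-
    atom(M),
    memberchk(M, [plus, minus, none]),
    number(B), B >= 0,                 % boost == nonneg
    nonvar(P),
    valid_part(P).

% A part is either a composite of sub-queries or Field:Primitive.
valid_part(comp(Qs)) :-
    is_list(Qs),
    maplist(valid_query, Qs).
valid_part(Field:Prim) :-
    atom(Field),                       % assumption: field names are atoms
    nonvar(Prim),
    valid_prim(Prim).

% Assumption: 'word' and 'pattern' are any atomic value.
valid_prim(word(W))         :- atomic(W).
valid_prim(glob(P))         :- atomic(P).
valid_prim(re(P))           :- atomic(P).
valid_prim(fuzzy(W, N))     :- atomic(W), integer(N).
valid_prim(range_inc(A, B)) :- atomic(A), atomic(B).
valid_prim(range_exc(A, B)) :- atomic(A), atomic(B).
valid_prim(phrase(Ws, N))   :- is_list(Ws), maplist(atomic, Ws), integer(N).
```

Under these assumptions, the term q(plus, 2, artist:word(beatles)) passes the check; in Lucene syntax it would presumably be written +artist:beatles^2.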
Building queries out of these constructors is a bit of a chore, so next we have a term language and associated evaluator which takes an expression and produces a valid query term. This can be thought of as a set of functions which return queries. Every function in the language produces a value of type query. Some of them leave the field and modifier arguments unbound. If they are unbound at the end of the process, they take on default values.
The functions and literals are as follows:
    <any atomic literal> :: query           % primitive word with unbound modifier and field
    (@)  :: atomic -> query                 % wildcard pattern with unbound modifier and field
    (\)  :: atomic -> query                 % regular expression with unbound modifier and field
    (/)  :: atomic, number -> query         % fuzzy match with unbound modifier and field
    (//) :: list(atomic), number -> query   % quoted phrase with unbound modifier and field
    (+)  :: atomic, atomic -> query         % inclusive range with unbound modifier and field
    (-)  :: atomic, atomic -> query         % exclusive range with unbound modifier and field
    (+)  :: query -> query                  % unifies query modifier with plus
    (-)  :: query -> query                  % unifies query modifier with minus
    (:)  :: atom, query                     % unifies all field arguments recursively
    (^)  :: query, number                   % multiplies boost factor
    list(query) :: query                    % a list of queries evaluates to a composite query
                                            % with unbound modifier.
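To make the table above more concrete, here is a minimal sketch of how such an evaluator could be written. It is not the library's implementation: eval/2, set_field/2 and apply_defaults/2 are hypothetical names, the initial boost of 1 and the default modifier none are assumptions, and expressions are written in canonical form (for example @(P) and +(E)) so that the sketch does not depend on any operator declarations.

```prolog
% eval(+Expr, -Query): Query is q(Modifier,Boost,Part). Modifier and
% field names stay unbound unless Expr fixes them. Hypothetical sketch.
eval(q(M,B,P), q(M,B,P)).                        % a query is already a query
eval(Es, q(_,1,comp(Qs))) :-                     % list -> composite query
    is_list(Es), !,
    maplist(eval, Es, Qs).
eval(@(P),  q(_,1,_:glob(P))).                   % wildcard pattern
eval(\(P),  q(_,1,_:re(P))).                     % regular expression
eval(W/N,   q(_,1,_:fuzzy(W,N)))     :- atomic(W), number(N).
eval(Ws//N, q(_,1,_:phrase(Ws,N)))   :- is_list(Ws), number(N).
eval(A+B,   q(_,1,_:range_inc(A,B))) :- atomic(A), atomic(B).
eval(A-B,   q(_,1,_:range_exc(A,B))) :- atomic(A), atomic(B).
eval(+(E),  q(plus,B,P))  :- eval(E, q(plus,B,P)).    % modifier := plus
eval(-(E),  q(minus,B,P)) :- eval(E, q(minus,B,P)).   % modifier := minus
eval(F:E,   Q) :- atom(F), eval(E, Q), set_field(F, Q).
eval(E^N,   q(M,B,P)) :- number(N), eval(E, q(M,B0,P)), B is B0*N.
eval(W,     q(_,1,_:word(W))) :- atomic(W).      % bare word

% set_field(+Field, ?Query): unify every field name in Query with Field.
set_field(F, q(_,_,P)) :- set_part_field(P, F).
set_part_field(comp(Qs), F) :- maplist(set_field(F), Qs).
set_part_field(F:_, F).

% apply_defaults(+DefaultField, ?Query): bind whatever is still unbound.
% Assumption: the default modifier is none.
apply_defaults(F0, q(M,_,P)) :-
    ( var(M) -> M = none ; true ),
    part_defaults(P, F0).
part_defaults(comp(Qs), F0) :- maplist(apply_defaults(F0), Qs).
part_defaults(F:_, F0) :- ( var(F) -> F = F0 ; true ).
```

With this sketch, eval(+(artist:(beatles^2)), Q) followed by apply_defaults(text, Q) gives Q = q(plus,2,artist:word(beatles)), where text stands in for an application-specific default field. Because the modifier is set by unification, a contradictory expression such as +(-(foo)) simply fails.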
A few notes are in order.
Prolog reads -F:E as (-F):E, but my expression language needs to see -(F:E). Hence, there is a little kludge in the evaluator to catch such terms and re-group the operators.

    qexpr ---> @atomic ; \atomic ; atomic/number ; list(atomic)//number
             ; atomic + atomic ; atomic - atomic
             ; +qexpr ; -qexpr ; atom:qexpr ; qexpr^number
             ; list(qexpr).
    atomic :< qexpr. % any atomic is a qexpr
    query :< qexpr.  % any query is a qexpr
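The re-grouping of a modifier that has ended up attached to the field name might look something like the following. This is a guess at the idea rather than the evaluator's actual code: regroup/2 is a hypothetical name, and handling + in the same way as - is an assumption.

```prolog
% regroup(+Expr0, -Expr): if a modifier has been read as part of the
% field name, move it out so that it covers the whole F:E term.
% Hypothetical helper, not the library's own kludge.
regroup(F0:E, -(F:E)) :- nonvar(F0), F0 = -(F), !.
regroup(F0:E, +(F:E)) :- nonvar(F0), F0 = +(F), !.   % assumed symmetric case
regroup(E, E).                                       % anything else is unchanged
```

For example, regroup((-(artist)):beatles, X) gives X = -(artist:beatles), while a term such as artist:beatles passes through untouched.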
Ugh. Let us take the 4 different cases in turn:
So that's the basics of it. There might still be some problems in the DCG when it comes to handling character escapes. Somewhat surprisingly, the DCG seems to parse much of the Lucene query syntax more or less correctly, except for the boolean operators, which Lucene does not handle in any sensible way and are best avoided. Also, it does not parse field names applied to compound queries or the postfix '~' operator.
See (though I warn you it will not be especially enlightening) https://lucene.apache.org/core/4_3_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description
Takes a term of type query or an expression of type qexpr and produces a query as a list of character codes. It accepts the following options:
See lucene//1 for more details.
@throws failed(G) if an expression contains a type error or any contradictory operators. G is the failing type check.
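To give a feel for the conversion from a query term to a list of character codes, here is a deliberately naive sketch. It is not the library's DCG: query_codes/2 and its helpers are hypothetical names, no character escaping is done, no options are handled, and a boost of 1 is assumed to need no ^ suffix. The output forms follow the comments on the primitive types above.

```prolog
% query_codes(+Query, -Codes): render a fully instantiated query term.
% Hypothetical sketch, without escaping or options.
query_codes(Q, Codes) :-
    query_atom(Q, A),
    atom_codes(A, Codes).

query_atom(q(M, B, P), A) :-
    modifier_atom(M, MA),
    part_atom(P, PA),
    (   B =:= 1                        % assumption: boost 1 is not written
    ->  BA = ''
    ;   format(atom(BA), "^~w", [B])
    ),
    atomic_list_concat([MA, PA, BA], A).

modifier_atom(plus,  '+').
modifier_atom(minus, '-').
modifier_atom(none,  '').

part_atom(comp(Qs), A) :-              % composite: (q1 q2 ...)
    maplist(query_atom, Qs, As),
    atomic_list_concat(As, ' ', Body),
    format(atom(A), "(~w)", [Body]).
part_atom(F:Prim, A) :-                % primitive: field:term
    prim_atom(Prim, PA),
    format(atom(A), "~w:~w", [F, PA]).

prim_atom(word(W), W).
prim_atom(glob(P), P).
prim_atom(re(P), A)          :- format(atom(A), "/~w/", [P]).
prim_atom(fuzzy(W, N), A)    :- format(atom(A), "~w~~~d", [W, N]).
prim_atom(range_inc(X,Y), A) :- format(atom(A), "[~w TO ~w]", [X, Y]).
prim_atom(range_exc(X,Y), A) :- format(atom(A), "{~w TO ~w}", [X, Y]).
prim_atom(phrase(Ws, N), A)  :-
    atomic_list_concat(Ws, ' ', S),
    format(atom(A), "\"~w\"~~~d", [S, N]).
```

With this sketch, query_codes(q(plus,2,artist:word(beatles)), Cs) yields the codes of +artist:beatles^2.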