1:- module(lucene,
2 [ lucene//1
3 , lucene_codes/3
4 , op(200,fy,@)
5 ]).
130:- use_module(library(dcg/basics)). 131:- use_module(library(dcg_core)). 132:- use_module(library(dcg_codes)). 133:- use_module(library(sandbox)). 134 135:- set_prolog_flag(double_quotes, codes).
query
or an expression of type qexpr
and produces a query as a list
of character codes. It accepts the following options:
See lucene//1 for more details.
@throws failed(G)
If an expression contains a type errors, or any contradictory
operators, G is the failing type check.
153lucene_codes(E,Codes) :- lucene_codes(E,[],Codes). 154lucene_codes(E,Opts,Codes) :- 155 eval(E,Q), 156 (option(fields(F),Opts) -> check_fields(F,Q); true), 157 once(phrase(lucene(Q),Codes,[])). 158 159eval(W, q(_,1,_:word(W))) :- atomic(W). 160eval(@(G), q(_,1,_:glob(G))) :- insist(atomic(G)). 161eval(\(RE), q(_,1,_:re(RE))) :- insist(atomic(RE)). 162eval(W/S, q(_,1,_:fuzzy(W,S))) :- atomic(W), insist(number(S)). 163eval(Ws//D, q(_,1,_:phrase(Ws,D))) :- insist(maplist(atomic,Ws)), insist(number(D)). 164eval(Min-Max, q(_,1,_:range_exc(Min,Max))) :- insist(atomic(Min)), insist(atomic(Max)). 165eval(Min+Max, q(_,1,_:range_inc(Min,Max))) :- insist(atomic(Min)), insist(atomic(Max)). 166 167eval(+F:E, C) :- !, eval(+(F:E),C). % kludge to fix operator precedence 168eval(-F:E, C) :- !, eval(-(F:E),C). % kludge to fix operator precedence 169eval(F:E, C) :- eval(E,C), apply_field(F,C). 170eval(Es, q(_,1,comp(Cs2))) :- is_list(Es), maplist(eval,Es,Cs2). 171eval(E1^B, q(M,B2,Q)) :- eval(E1,q(M,B1,Q)), B2 is B1*B. 172eval(+E, q(plus,B,Q)) :- eval(E,q(M,B,Q)), insist(M=plus). 173eval(-E, q(minus,B,Q)) :- eval(E,q(M,B,Q)), insist(M=minus). 174eval(q(M,B,Q),q(M,B,Q)). 175 176apply_field(F,q(_,_,F:_)). 177apply_field(F,q(_,_,comp(Cs))) :- maplist(apply_field(F),Cs). 178 179 180check_fields(Fields,q(_,_,Part)) :- check_part(Part,Fields). 181check_part(comp(Queries),Fields) :- maplist(check_fields(Fields),Queries). 182check_part(Field:_, Fields) :- insist((var(Field);member(Field,Fields)),invalid_field(Field)). 183 184insist(G) :- insist(G,failed(G)). 185insist(G,Ex) :- call(G) -> true; throw(Ex).
196lucene(Top) --> query(Top). 197 198query(q(Mod,Boost,Part)) --> mod(Mod), part(Part), boost(Boost). 199 200mod(none) --> "". 201mod(plus) --> "+". 202mod(minus) --> "-". 203 204boost(Boost) --> {Boost=1}; "^", number(Boost). 205 206part(default(_):Prim) --> prim(Prim). 207part(Field:Prim) --> field(Field), ":", prim(Prim). 208part(comp(Clauses)) --> "(", seqmap_with_sep(" ",query,Clauses), ")". 209 210prim(word(W)) --> word(W). 211prim(glob(G)) --> bidi(G,esc(glob,C),string_codes(G,C)). 212prim(re(RE)) --> "/", bidi(RE,esc(regexp,C),string_codes(RE,C)), "/". 213prim(fuzzy(W,P)) --> word(W), "~", integer(P). 214prim(range_inc(Min,Max)) --> sqbr((word(Min), " TO ", word(Max))). 215prim(range_exc(Min,Max)) --> brace((word(Min), " TO ", word(Max))). 216prim(phrase(Words,D)) --> 217 "\"", seqmap_with_sep(" ",word,Words), "\"", 218 ( {D=0}; "~", integer(D)). 219 220word(W) --> bidi(W,esc(word,C),string_codes(W,C)). 221field(F) --> bidi(F,esc(field,C),string_codes(F,C)). 222 223% beginnings of parsing ability... 224% this orders the two goals depending on the instantiation state of Sem. 225bidi(Sem,Phrase,Unify) --> 226 {var(Sem)} -> , {}; {}, . 227 228% -- escape sequence parsers for different entities 229 230word([C|T],T) --> [0'\\,C], {member(C," /+-&|!(){}[]^\"~:\\*?")}. % auto escape these 231word([C|T],T) --> [C], {\+member(C," /+-&|!(){}[]^\"~:\\*?")}. % pass the rest 232 233glob([0'\\,C|T],T) --> [0'\\,C], {member(C,"*?\\")}. % manual escape these 234glob([C|T],T) --> [0'\\,C], {member(C," /+-&|!(){}[]^\"~:")}. % auto escape these 235glob([C|T],T) --> [C], {\+member(C," /+-&|!(){}[]^\"~:\\")}. 236 237% escape sequences preserved in string, \ must be escaped, 238% / auto escaped 239regexp([0'\\,C|T],T) --> [0'\\,C]. 240regexp([0'/|T],T) --> "\\/". 241regexp([C|T],T) --> [C], {\+member(C,"/\\")}. 242 243% no escape sequences, no funny characters allowed. 244field([C|T],T) --> [C], {\+member(C," /+-&|!(){}[]^\"~?:\\")}. 245 246sandbox:safe_primitive(lucene:lucene(_,_,_)). 247sandbox:safe_primitive(lucene:lucene_codes(_,_,_)). 248 249prologmessage(invalid_field(F)) --> 250 ['Fieldname ~w is not recognised in the current Lucene query context.'-[F]]
A DCG for generating Lucene searches
Right. First off, forget everything you know about Lucene's search syntax. Especially the boolean operators, which do not work in any logical way. This library is based on a data type which, as far as I can determine, represents the internal structure of a Lucene query. Basically, a query is a triple of a modifier (Lucene +, -, or <none>), a numerical boost (Lucene ^ operator), and a, for want of a better name, a 'part'. I could have called it a query 'component', but 'part' is a shorter word that means the same thing. A part is either a primitive term coupled with a field name( (:)/2 constructor), or a composite part consisting of a list of sub-queries (comp/1 constructor). So, we have:
Note that the 'field' argument of the (:)/2 part constructor is inherently defaulty: if no field is specified, the search agent fills it in with an application specific default.
The primitives cover all those obtainable using the Lucene syntax and are as follows:
Building queries out of these constructors is a bit of a chore, so next we have an term language and associated evaluator which takes an expression and produces a valid query term. This can be thought of as a set of functions which return queries. Every function in the language produces a value of type
query
. Some of them leave the field and modifier arguments unbound. If they are unbound at the end of the process, they take on default values. The functions and literals are as follows:A few notes are in order.
-F:E
as(-F):E
, but my expression language needs to see-(F:E)
. Hence, there is little kludge in the evaluator to catch such terms and re-group the operators.Quoting and escaping policy
Ugh. Let us take the 4 different cases in turn:
So that's the basics of it. There might still be some problems in the DCG when it comes to handling character escapes. Somewhat suprisingly, the DCG seems to parse much of the Lucene query syntax more or less correctly, except for the boolean operators, which Lucene does not handle in any sensible way and are best avoided. Also, it does not parse field names applied to componound queries or the postfix '~' operator.
See (though I warn you it will not be especially enlightening) https://lucene.apache.org/core/4_3_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description */