unicode_security.pl

/*  Part of SWI-Prolog

    Author:        Jan Wielemaker
    E-mail:        jan@swi-prolog.org
    WWW:           https://www.swi-prolog.org
    Copyright (c)  2026, SWI-Prolog Solutions b.v.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    1. Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.

    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in
       the documentation and/or other materials provided with the
       distribution.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
    FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
    COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
    INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
    BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
    ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/

:- module(unicode_security,
          [ unicode_script/2,             % +Code, -Script
            unicode_script_extensions/2,  % +Code, -Scripts
            unicode_identifier_status/2,  % +Code, -Status
            unicode_identifier_type/2,    % +Code, -Types

            unicode_skeleton/2,           % +Text, -Skeleton
            unicode_confusable/2,         % +T1, +T2
            unicode_confusable/3,         % +T1, +T2, +Options

            unicode_resolved_scripts/2,   % +Text, -Scripts
            unicode_restriction_level/2   % +Text, -Level
          ]).
:- use_foreign_library(foreign(unicode_security4pl)).

/** <module> Unicode security helpers (UTS #39, UAX #24)

This  library  implements  helpers  from   [UTS  #39  (Unicode  Security
Mechanisms)](https://www.unicode.org/reports/tr39/)   and   the   script
properties of [UAX #24](https://www.unicode.org/reports/tr24/).   It  is
intended for linters, identifier validators and   any code that needs to
reason about confusable look-alike text  or mixed-script identifiers. It
does **not** alter  the  Prolog  reader;   UTS  #39  is  deliberately  a
library-level facility.

The library ships its  own  UCD-derived   tables  and  is independent of
`library(unicode)`  (which  wraps  libutf8proc   for  normalisation  and
per-code-point  properties).  See  `etc/gen_uts39.pl`   in  the  package
directory to regenerate the tables on a Unicode-version bump.

Predicates fall into three groups:

  - Per-code-point lookups: unicode_script/2,
    unicode_script_extensions/2, unicode_identifier_status/2,
    unicode_identifier_type/2.
  - Skeleton and confusable test (UTS #39 §4): unicode_skeleton/2,
    unicode_confusable/2, unicode_confusable/3.
  - String-level identifier checks (UTS #39 §5):
    unicode_resolved_scripts/2, unicode_restriction_level/2.
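
As a minimal sketch of how the three groups might be combined, the
hypothetical helper `describe_text/1` below (not part of this library)
prints the script of each code point, followed by the skeleton and the
restriction level of the whole text. It assumes the library is loaded as
`library(unicode_security)` and that Text is an atom or string.

```prolog
:- use_module(library(unicode_security)).
:- use_module(library(lists)).

% describe_text(+Text): hypothetical helper, for illustration only.
describe_text(Text) :-
    atom_codes(Text, Codes),
    forall(member(Code, Codes),
           (   unicode_script(Code, Script)
           ->  format("~c: ~w~n", [Code, Script])
           ;   format("~c: (no script)~n", [Code])
           )),
    unicode_skeleton(Text, Skeleton),
    unicode_restriction_level(Text, Level),
    format("skeleton: ~w~nlevel:    ~w~n", [Skeleton, Level]).
```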
*/

%!  unicode_script(+Code:integer, -Script:atom) is semidet.
%
%   True when Script is the value of the UAX #24 Script property of
%   Code. Script is a lower-case atom of the long property value
%   (`latin`, `cyrillic`, `han`, `common`, `inherited`, ...). Fails for
%   code points outside the Unicode range or with no entry in
%   Scripts.txt.
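%
%   A minimal sketch of a per-text use: the hypothetical helper
%   `text_scripts/2` (not part of this library) collects the distinct
%   Script values of the code points in an atom or string, silently
%   skipping code points for which unicode_script/2 fails.
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%   :- use_module(library(lists)).
%
%   text_scripts(Text, Scripts) :-
%       atom_codes(Text, Codes),
%       findall(Script,
%               ( member(Code, Codes),
%                 unicode_script(Code, Script)
%               ),
%               Found),
%       sort(Found, Scripts).      % sort/2 also removes duplicates
%   ```
%
%   Under these assumptions, `text_scripts('a\u044F', S)` (Latin a
%   followed by Cyrillic ya) would be expected to give
%   `S = [cyrillic, latin]`.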

%!  unicode_script_extensions(+Code:integer, -Scripts:list(atom)) is semidet.
%
%   Scripts is the sorted list of UAX #24 Script_Extensions of Code. For
%   most code points this is  a   singleton  `[Script]`.  Fails for code
%   points outside the Unicode range and for   code points with no entry
%   in either ScriptExtensions.txt or Scripts.txt.
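%
%   As a sketch, the hypothetical helper `code_used_in_script/2` (not
%   part of this library) tests whether a code point is associated with
%   a given script through its Script_Extensions. It assumes that the
%   atoms in Scripts follow the same lower-case naming as
%   unicode_script/2.
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%   :- use_module(library(lists)).
%
%   code_used_in_script(Code, Script) :-
%       unicode_script_extensions(Code, Scripts),
%       memberchk(Script, Scripts).
%   ```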

%!  unicode_identifier_status(+Code:integer, -Status:atom) is semidet.
%
%   Succeeds, unifying Status with `allowed`, when Code is listed in UTS
%   #39 IdentifierStatus.txt with status ``Allowed``. Fails otherwise:
%   per UTS #39 every code point not listed there is Restricted by
%   default, so rather than returning `restricted` for everything else
%   this predicate simply fails.
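%
%   Callers that prefer a total mapping can wrap the predicate. A
%   minimal sketch, where `identifier_status_or_restricted/2` is a
%   hypothetical helper (not part of this library); note that it also
%   reports `restricted` for integers outside the Unicode range:
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%
%   identifier_status_or_restricted(Code, Status) :-
%       (   unicode_identifier_status(Code, Status0)
%       ->  Status = Status0
%       ;   Status = restricted     % failure means "not Allowed"
%       ).
%   ```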

%!  unicode_identifier_type(+Code:integer, -Types:list(atom)) is semidet.
%
%   Types is the sorted list of UTS   #39 Identifier_Type atoms for Code
%   (`recommended`, `inclusion`, `technical`, `obsolete`, `limited_use`,
%   `exclusion`,     `not_nfkc`,     `not_xid`,     `default_ignorable`,
%   `deprecated`, `uncommon_use`). Fails for  code   points  outside the
%   Unicode range or with no entry in IdentifierType.txt.
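%
%   A minimal sketch of filtering on the type list: the hypothetical
%   helper `code_recommended_or_inclusion/1` (not part of this library)
%   succeeds once when at least one of the types of Code is
%   `recommended` or `inclusion`.
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%   :- use_module(library(lists)).
%
%   code_recommended_or_inclusion(Code) :-
%       unicode_identifier_type(Code, Types),
%       member(Type, Types),
%       memberchk(Type, [recommended, inclusion]),
%       !.
%   ```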

%!  unicode_skeleton(+Text, -Skeleton:atom) is det.
%
%   Compute the UTS #39 §4 skeleton of  Text: apply NFD, substitute each
%   code point with its confusables.txt prototype string, then apply NFD
%   again. Two strings are confusable iff their skeletons compare equal.
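%
%   Because confusability reduces to skeleton equality, a set of names
%   can be checked in one pass by grouping on the skeleton. A sketch,
%   where `skeleton_groups/2` and `shared_skeleton/1` are hypothetical
%   helpers (not part of this library):
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%   :- use_module(library(lists)).
%   :- use_module(library(pairs)).
%   :- use_module(library(apply)).
%
%   % Groups is a list of Skeleton-Texts pairs for every skeleton that
%   % is shared by two or more members of Texts.
%   skeleton_groups(Texts, Groups) :-
%       findall(Skeleton-Text,
%               ( member(Text, Texts),
%                 unicode_skeleton(Text, Skeleton)
%               ),
%               Pairs),
%       keysort(Pairs, Sorted),
%       group_pairs_by_key(Sorted, Grouped),
%       include(shared_skeleton, Grouped, Groups).
%
%   shared_skeleton(_Skeleton-[_,_|_]).
%   ```
%
%   Names that collide only through look-alike characters, such as a
%   Latin and a Cyrillic spelling of the same word, would be expected
%   to end up in the same group.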

%!  unicode_confusable(+T1, +T2) is semidet.
%
%   True when the unicode_skeleton/2 results for T1 and T2 are equal.
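%
%   A sketch of a spoof check against a list of known names, where
%   `spoofed_name/3` is a hypothetical helper (not part of this
%   library). It enumerates each known name that is confusable with,
%   but not identical to, the candidate:
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%   :- use_module(library(lists)).
%
%   spoofed_name(Candidate, KnownNames, Target) :-
%       member(Target, KnownNames),
%       Candidate \== Target,       % an exact match is not a spoof
%       unicode_confusable(Candidate, Target).
%   ```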

%!  unicode_confusable(+T1, +T2, +Options) is semidet.
%
%   As unicode_confusable/2.  Options:
%
%   * ignore_intentional(+Bool)
%     If `true`, skip the per-character substitution when the source
%     and target form a pair listed in UTS #39 intentional.txt (e.g.
%     Latin A versus Greek capital Alpha).  Default `false`.
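%
%   A sketch of passing the option, where `confusable_unintentionally/2`
%   is a hypothetical wrapper (not part of this library). Going by the
%   description above, pairs that differ only in intentional.txt
%   look-alikes are expected not to be reported by it:
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%
%   confusable_unintentionally(T1, T2) :-
%       unicode_confusable(T1, T2, [ignore_intentional(true)]).
%   ```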

%!  unicode_resolved_scripts(+Text, -Scripts:list(atom)) is det.
%
%   Scripts is the UTS #39 §5.1 resolved augmented Script_Extensions set
%   of   Text:   the   intersection    of     `augscx(c)`    over    all
%   non-Common/non-Inherited characters, with the augmentation rules for
%   Han, Hiragana, Katakana, Hangul and Bopomofo applied. The empty list
%   signals a mixed-script string.
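%
%   A minimal sketch of turning the result into a verdict, where
%   `script_mix/2` is a hypothetical helper (not part of this library).
%   How a text consisting only of Common or Inherited characters is
%   reported is not covered by this sketch.
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%
%   script_mix(Text, Verdict) :-
%       unicode_resolved_scripts(Text, Scripts),
%       (   Scripts == []
%       ->  Verdict = mixed             % no script covers every character
%       ;   Verdict = single(Scripts)
%       ).
%   ```
%
%   Under these assumptions, `script_mix('a\u044F', V)` (Latin a
%   followed by Cyrillic ya) would be expected to give `V = mixed`.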

%!  unicode_restriction_level(+Text, -Level:atom) is det.
%
%   Classify Text under UTS #39 §5.2 at the most restrictive level for
%   which it qualifies. Level is one of:
%
%   * `ascii_only`: every code point is in U+0020..U+007E and Allowed.
%   * `single_script`: the augmented resolved script set is non-empty
%     and every code point is Allowed.
%   * `highly_restrictive`: covered by Latin plus one of ``Hanb``,
%     ``Jpan`` or ``Kore`` (UTS #39 §5.1 augmented profiles).
%   * `moderately_restrictive`: covered by Latin plus a single other
%     Recommended script, excluding ``Cyrl`` and ``Grek``.
%   * `minimally_restrictive`: every code point has Identifier_Type
%     in `{recommended, inclusion}`.
%   * `unrestricted`: otherwise.
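%
%   Since the levels are ordered from most to least restrictive, a
%   validator can compare the computed level against a policy limit. A
%   sketch, where `level_rank/2` and `acceptable_identifier/2` are
%   hypothetical helpers (not part of this library) and the ranks
%   simply follow the order of the list above:
%
%   ```prolog
%   :- use_module(library(unicode_security)).
%
%   level_rank(ascii_only,             1).
%   level_rank(single_script,          2).
%   level_rank(highly_restrictive,     3).
%   level_rank(moderately_restrictive, 4).
%   level_rank(minimally_restrictive,  5).
%   level_rank(unrestricted,           6).
%
%   % True when Text qualifies for MaxLevel or a more restrictive level.
%   acceptable_identifier(Text, MaxLevel) :-
%       unicode_restriction_level(Text, Level),
%       level_rank(Level, Rank),
%       level_rank(MaxLevel, MaxRank),
%       Rank =< MaxRank.
%   ```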
%
%   A linter that walks  source  clauses   and  reports  atoms  with the
%   confusability issues above is registered  in `library(check)` itself
%   (predicate      `list_confusable_identifiers/0`);      see       the
%   `library(check)` documentation for details.