/*  Part of SWI-Prolog

    Author:        Jan Wielemaker
    E-mail:        jan@swi-prolog.org
    WWW:           https://www.swi-prolog.org
    Copyright (c)  2026, SWI-Prolog Solutions b.v.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    1. Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.

    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in
       the documentation and/or other materials provided with the
       distribution.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
    FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
    COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
    INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
    BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
    ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/

:- module(unicode_security,
          [ unicode_script/2,             % +Code, -Script
            unicode_script_extensions/2,  % +Code, -Scripts
            unicode_identifier_status/2,  % +Code, -Status
            unicode_identifier_type/2,    % +Code, -Types

            unicode_skeleton/2,           % +Text, -Skeleton
            unicode_confusable/2,         % +T1, +T2
            unicode_confusable/3,         % +T1, +T2, +Options

            unicode_resolved_scripts/2,   % +Text, -Scripts
            unicode_restriction_level/2   % +Text, -Level
          ]).
:- use_foreign_library(foreign(unicode_security4pl)).

Unicode security helpers (UTS #39, UAX #24)

This library implements helpers from UTS #39 (Unicode Security Mechanisms) and the script properties of UAX #24. It is intended for linters, identifier validators and any code that needs to reason about confusable look-alike text or mixed-script identifiers. It does not alter the Prolog reader; UTS #39 is deliberately a library-level facility.

The library ships its own UCD-derived tables and is independent of library(unicode) (which wraps libutf8proc for normalisation and per-code-point properties). See etc/gen_uts39.pl in the package directory to regenerate the tables on a Unicode-version bump.

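Predicates fall into three groups, following the export list: per-code-point properties (unicode_script/2, unicode_script_extensions/2, unicode_identifier_status/2, unicode_identifier_type/2), confusable detection (unicode_skeleton/2, unicode_confusable/2, unicode_confusable/3) and mixed-script analysis (unicode_resolved_scripts/2, unicode_restriction_level/2).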

 unicode_script(+Code:integer, -Script:atom) is semidet
True when Script is the UAX #24 Script_Property of Code. Script is a lower-case atom of the long property value (latin, cyrillic, han, common, inherited, ...). Fails for code points outside the Unicode range or with no entry in Scripts.txt.
 unicode_script_extensions(+Code:integer, -Scripts:list(atom)) is semidet
Scripts is the sorted list of UAX #24 Script_Extensions of Code. For most code points this is a singleton [Script]. Fails for code points outside the Unicode range and for code points with no entry in either ScriptExtensions.txt or Scripts.txt.
 unicode_identifier_status(+Code:integer, -Status:atom) is semidet
Succeeds, unifying Status with allowed, when Code is listed in UTS #39 IdentifierStatus.txt with status Allowed. Fails otherwise. Per UTS #39 every code point not listed there is Restricted by default, so this predicate fails for such code points instead of returning restricted.
 unicode_identifier_type(+Code:integer, -Types:list(atom)) is semidet
Types is the sorted list of UTS #39 Identifier_Type atoms for Code (recommended, inclusion, technical, obsolete, limited_use, exclusion, not_nfkc, not_xid, default_ignorable, deprecated, uncommon_use). Fails for code points outside the Unicode range or with no entry in IdentifierType.txt.
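
The four per-code-point predicates above are all semidet, so callers need to handle failure explicitly. The following is a minimal sketch, not part of the library: describe_code/1 is a hypothetical helper, and it assumes the library is installed and loadable as library(unicode_security).

```prolog
:- use_module(library(unicode_security)).

%!  describe_code(+Code) is det.
%
%   Hypothetical helper (not part of the library): print what each
%   per-code-point predicate reports for Code. Every lookup is
%   semidet, so each one is wrapped in an if-then-else.
describe_code(Code) :-
    (   unicode_script(Code, Script)
    ->  format("script: ~w~n", [Script])
    ;   format("script: (no entry)~n", [])
    ),
    (   unicode_script_extensions(Code, Scripts)
    ->  format("script extensions: ~w~n", [Scripts])
    ;   format("script extensions: (no entry)~n", [])
    ),
    (   unicode_identifier_status(Code, Status)
    ->  format("identifier status: ~w~n", [Status])
    ;   % Failure means "not listed", which UTS #39 treats as Restricted.
        format("identifier status: restricted by default~n", [])
    ),
    (   unicode_identifier_type(Code, Types)
    ->  format("identifier type: ~w~n", [Types])
    ;   format("identifier type: (no entry)~n", [])
    ).
```

A query such as `?- describe_code(0'a).` exercises all four lookups for Latin small letter a.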
 unicode_skeleton(+Text, -Skeleton:atom) is det
Compute the UTS #39 §4 skeleton of Text: apply NFD, substitute each code point with its confusables.txt prototype string, then apply NFD again. Two strings are confusable iff their skeletons compare equal.
 unicode_confusable(+T1, +T2) is semidet
True when T1 and T2 have equal skeletons as computed by unicode_skeleton/2.
 unicode_confusable(+T1, +T2, +Options) is semidet
As unicode_confusable/2. Options:
ignore_intentional(+Bool)
If true, skip the per-character substitution when the source and target form a pair listed in UTS #39 intentional.txt (e.g. Latin A versus Greek capital Alpha). Default false.
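
As an illustration of the confusable predicates, the sketch below checks a candidate name against a list of known names. The helpers spoofs/3 and confusable_ignoring_intentional/2 are hypothetical, and the sketch assumes that Text arguments accept SWI-Prolog strings and that Options is an ordinary option list.

```prolog
:- use_module(library(unicode_security)).

%!  spoofs(+Candidate, +Known:list, -Target) is nondet.
%
%   Hypothetical helper: Target is a member of Known that differs
%   from Candidate but has the same UTS #39 skeleton.
spoofs(Candidate, Known, Target) :-
    member(Target, Known),
    Target \== Candidate,
    unicode_confusable(Candidate, Target).

%!  confusable_ignoring_intentional(+T1, +T2) is semidet.
%
%   Hypothetical helper: as unicode_confusable/2, but passes
%   ignore_intentional(true) so that pairs listed in
%   intentional.txt are not substituted.
confusable_ignoring_intentional(T1, T2) :-
    unicode_confusable(T1, T2, [ignore_intentional(true)]).
```

For example, `?- spoofs("p\u0430ypal", ["paypal", "example"], T).` asks whether a name containing Cyrillic small letter a (U+0430) collides with one of the known names.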
 unicode_resolved_scripts(+Text, -Scripts:list(atom)) is det
Scripts is the UTS #39 §5.1 resolved augmented Script_Extensions set of Text: the intersection of augscx(c) over all non-Common/non-Inherited characters, with the augmentation rules for Han, Hiragana, Katakana, Hangul and Bopomofo applied. The empty list signals a mixed-script string.
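
Since the empty list signals a mixed-script string, a single-script test reduces to checking that the resolved set is non-empty. A minimal sketch follows; single_script/1 is a hypothetical helper, and the sketch assumes that Text may be given as a SWI-Prolog string.

```prolog
:- use_module(library(unicode_security)).

%!  single_script(+Text) is semidet.
%
%   Hypothetical helper: true when Text is not mixed-script, i.e.
%   its resolved script set is non-empty.
single_script(Text) :-
    unicode_resolved_scripts(Text, Scripts),
    Scripts \== [].
```

An all-Latin identifier such as "paypal" passes this check, whereas "p\u0430ypal", which mixes Latin letters with Cyrillic small letter a (U+0430), does not.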
 unicode_restriction_level(+Text, -Level:atom) is det
Classify Text under UTS #39 §5.2 at the most restrictive level for which it qualifies. Level is one of: