/*  Part of SWI-Prolog

    Author:        Jan Wielemaker
    E-mail:        jan@swi-prolog.org
    WWW:           https://www.swi-prolog.org
    Copyright (c)  2026, SWI-Prolog Solutions b.v.
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    1. Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.

    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in
       the documentation and/or other materials provided with the
       distribution.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
    FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
    COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
    INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
    BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
    ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/

:- module(unicode_security,
          [ unicode_script/2,              % +Code, -Script
            unicode_script_extensions/2,   % +Code, -Scripts
            unicode_identifier_status/2,   % +Code, -Status
            unicode_identifier_type/2,     % +Code, -Types

            unicode_skeleton/2,            % +Text, -Skeleton
            unicode_confusable/2,          % +T1, +T2
            unicode_confusable/3,          % +T1, +T2, +Options

            unicode_resolved_scripts/2,    % +Text, -Scripts
            unicode_restriction_level/2    % +Text, -Level
          ]).
:- use_foreign_library(foreign(unicode_security4pl)).

/** <module> Unicode security helpers (UTS #39, UAX #24)

This library implements helpers from [UTS #39 (Unicode Security
Mechanisms)](https://www.unicode.org/reports/tr39/) and the script
properties of [UAX #24](https://www.unicode.org/reports/tr24/). It is
intended for linters, identifier validators and any code that needs to
reason about confusable look-alike text or mixed-script identifiers. It
does **not** alter the Prolog reader; UTS #39 is deliberately a
library-level facility.

The library ships its own UCD-derived tables and is independent of
`library(unicode)` (which wraps libutf8proc for normalisation and
per-code-point properties). See `etc/gen_uts39.pl` in the package
directory to regenerate the tables on a Unicode-version bump.

Predicates fall into three groups:

  - Per-code-point lookups: unicode_script/2,
    unicode_script_extensions/2, unicode_identifier_status/2,
    unicode_identifier_type/2.
  - Skeleton and confusable test (UTS #39 §4): unicode_skeleton/2,
    unicode_confusable/2, unicode_confusable/3.
  - String-level identifier checks (UTS #39 §5):
    unicode_resolved_scripts/2, unicode_restriction_level/2.
*/

%!  unicode_script(+Code:integer, -Script:atom) is semidet.
%
%   True when Script is the UAX #24 Script_Property of Code. Script is a
%   lower-case atom of the long property value (`latin`, `cyrillic`,
%   `han`, `common`, `inherited`, ...). Fails for code points outside
%   the Unicode range or with no entry in Scripts.txt.

%!  unicode_script_extensions(+Code:integer, -Scripts:list(atom)) is semidet.
%
%   Scripts is the sorted list of UAX #24 Script_Extensions of Code. For
%   most code points this is a singleton `[Script]`. Fails for code
%   points outside the Unicode range and for code points with no entry
%   in either ScriptExtensions.txt or Scripts.txt.

%!  unicode_identifier_status(+Code:integer, -Status:atom) is semidet.
%
%   Succeeds, unifying Status with `allowed`, when Code is listed in UTS
%   #39 IdentifierStatus.txt with status ``Allowed``. Fails otherwise:
%   per UTS #39 every code point not listed there is Restricted by
%   default; rather than return `restricted` for everything else, this
%   predicate simply fails.

%!  unicode_identifier_type(+Code:integer, -Types:list(atom)) is semidet.
%
%   Types is the sorted list of UTS #39 Identifier_Type atoms for Code
%   (`recommended`, `inclusion`, `technical`, `obsolete`, `limited_use`,
%   `exclusion`, `not_nfkc`, `not_xid`, `default_ignorable`,
%   `deprecated`, `uncommon_use`). Fails for code points outside the
%   Unicode range or with no entry in IdentifierType.txt.

%!  unicode_skeleton(+Text, -Skeleton:atom) is det.
%
%   Compute the UTS #39 §4 skeleton of Text: apply NFD, substitute each
%   code point with its confusables.txt prototype string, then apply NFD
%   again. Two strings are confusable iff their skeletons compare equal.

%!  unicode_confusable(+T1, +T2) is semidet.
%
%   True when unicode_skeleton/2 of T1 and T2 are equal.

%!  unicode_confusable(+T1, +T2, +Options) is semidet.
%
%   As unicode_confusable/2. Options:
%
%     * ignore_intentional(+Bool)
%       If `true`, skip the per-character substitution when the source
%       and target form a pair listed in UTS #39 intentional.txt (e.g.
%       Latin A versus Greek capital Alpha). Default `false`.

%!  unicode_resolved_scripts(+Text, -Scripts:list(atom)) is det.
%
%   Scripts is the UTS #39 §5.1 resolved augmented Script_Extensions set
%   of Text: the intersection of `augscx(c)` over all
%   non-Common/non-Inherited characters, with the augmentation rules for
%   Han, Hiragana, Katakana, Hangul and Bopomofo applied. The empty list
%   signals a mixed-script string.

%!  unicode_restriction_level(+Text, -Level:atom) is det.
%
%   Classify Text under UTS #39 §5.2 at the most restrictive level for
%   which it qualifies. Level is one of:
%
%     * `ascii_only`: every code point in U+0020..U+007E and Allowed.
%     * `single_script`: augmented resolved-script-set non-empty and
%       every code point Allowed.
%     * `highly_restrictive`: covered by Latin plus one of ``Hanb``,
%       ``Jpan`` or ``Kore`` (UTS #39 §5.1 augmented profiles).
%     * `moderately_restrictive`: covered by Latin plus a single
%       non-Latin Recommended script (``Cyrl`` or ``Grek``).
%     * `minimally_restrictive`: every code point has Identifier_Type
%       in `{recommended, inclusion}`.
%     * `unrestricted`: otherwise.
%
%   A linter that walks source clauses and reports atoms with the
%   confusability issues above is registered in `library(check)` itself
%   (predicate `list_confusable_identifiers/0`); see the
%   `library(check)` documentation for details.
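
/*  Usage sketch (illustrative only, not part of the exported API).

    The string-level checks documented above could be combined along the
    following lines. This is a minimal sketch under stated assumptions:
    it assumes the module can be loaded as library(unicode_security),
    and the helper name suspicious_atom/2 is hypothetical. It flags an
    atom when its resolved script set is empty (mixed-script) or when it
    only qualifies for one of the two loosest restriction levels.

```prolog
:- use_module(library(unicode_security)).   % assumed library alias

%  suspicious_atom(+Atom, -Reason) is nondet.
%
%  Hypothetical helper: enumerate reasons why Atom deserves a review.
suspicious_atom(Atom, mixed_script) :-
    unicode_resolved_scripts(Atom, []).      % [] signals mixed scripts
suspicious_atom(Atom, level(Level)) :-
    unicode_restriction_level(Atom, Level),
    memberchk(Level, [minimally_restrictive, unrestricted]).
```

    A confusable pair can be probed directly. Here the second atom
    replaces Latin `a` with Cyrillic U+0430, written with a character
    escape so that the difference is visible in the source:

```prolog
?- unicode_confusable(paypal, 'p\u0430ypal').
?- suspicious_atom('p\u0430ypal', Reason).
```
*/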