.. _library_classifier_protocols:
classifier_protocols
This library provides protocols used in the implementation of machine
learning classifier algorithms. Datasets are represented as objects
implementing the dataset_protocol protocol. Classifiers are
represented as objects implementing the classifier_protocol
protocol.
This library also provides test datasets. See below for details.
Logtalk currently provides several classifiers including c45,
knn, naive_bayes, nearest_centroid, and random_forest.
See these libraries documentation for details.
API documentation
Open the
`../../apis/library_index.html#classifier_protocols <../../apis/library_index.html#classifier_protocols>`__
link in a web browser.
Loading
To load all entities in this library, load the loader.lgt file:
::
| ?- logtalk_load(classifier_protocols(loader)).
Test datasets
Several sample datasets are included in the test_files directory:
- Play Tennis â The classic weather/tennis dataset with 14 examples
and 4 discrete attributes (outlook, temperature, humidity, wind).
Originally from Quinlan (1986) and widely used in machine learning
textbooks including Mitchell (1997). Also available from the UCI
Machine Learning Repository:
https://archive.ics.uci.edu/dataset/349/tennis+major+tournament+match+statistics
- Contact Lenses â A dataset with 24 examples and 4 discrete
attributes (age, spectacle prescription, astigmatism, tear production
rate) for deciding the type of contact lenses to prescribe. Originally
from Cendrowska, J. (1987). PRISM: An algorithm for inducing modular
rules. International Journal of Man-Machine Studies, 27(4), 349-370.
Available from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/dataset/58/lenses
- Iris â The classic Iris flower dataset with 150 examples and 4
continuous attributes (sepal length, sepal width, petal length, petal
width) for classifying iris species (setosa, versicolor, virginica).
Originally from Fisher, R.A. (1936). The use of multiple measurements
in taxonomic problems. Annals of Eugenics, 7(2), 179-188. Available
from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/dataset/53/iris
- Breast Cancer â A dataset with 286 examples and 9 discrete
attributes (age, menopause, tumor size, inv-nodes, node-caps, degree
of malignancy, breast, breast quadrant, irradiation) for predicting
breast cancer recurrence events. Contains missing values (9 examples
with missing values in the node-caps and breast-quad attributes,
represented using anonymous variables). Originally from the Institute
of Oncology, University Medical Centre, Ljubljana, Yugoslavia. Donors:
Ming Tan and Jeff Schlimmer. Available from the UCI Machine Learning
Repository: https://archive.ics.uci.edu/dataset/14/breast+cancer
- Gaussian Anomalies â A synthetic 2D anomaly detection dataset with
50 examples and 2 continuous attributes (x, y). Normal points are
sampled from a standard normal distribution centered at the origin.
Anomalous points are placed far from the cluster center. Inspired by
the canonical test case used in the Extended Isolation Forest paper by
Hariri et al. (2019).
- Shuttle Anomalies â A subset of the Statlog Shuttle dataset with
50 examples and 9 continuous attributes representing sensor readings
from the NASA Space Shuttle. Class 1 (Rad Flow) is the majority class
(normal), while all other classes are treated as anomalies. Originally
from Catlett, J. (1991). Available from the UCI Machine Learning
Repository: https://archive.ics.uci.edu/dataset/148/statlog+shuttle
- Water Potability â A water potability dataset with 50 examples and
9 continuous attributes (pH, hardness, solids, chloramines, sulfate,
conductivity, organic carbon, trihalomethanes, turbidity). Normal
instances represent potable water samples within acceptable ranges.
Anomalous instances represent water samples with hazardous
contamination levels. Based on the publicly available Water Quality
dataset (Kadiwal, A., 2020, Kaggle).
- Sensor Anomalies â A synthetic industrial sensor anomaly dataset
with 40 examples and 3 continuous attributes (temperature, pressure,
vibration). Contains missing values (14 examples with missing values,
represented using anonymous variables). Normal readings cluster around
typical operating ranges. Anomalous readings show extreme values
indicating equipment malfunction.