Learning

Pleasant machine-learning library inspired by ``requests``. Design goals: * The pandas / sklearn ``X``/``y`` split is hidden. Users work with two opaque types — ``DataFrame`` (all columns intact) and ``Dataset`` (features + a designated target column). * Classification and regression are distinguished at the type level (``Classifier`` vs ``Regressor``). Metrics are typed against the right kind of model — ``accuracy`` will not type-check against a regressor, ``mse`` will not type-check against a classifier. * A few dataset-consuming operations are linear (``(1 ds: Dataset)``) so callers who bind their source with ``let 1 df = ...`` get a type error if they keep using the pre-transformed dataset. * Refinements carry shape and class invariants. Common ML bugs are rejected at compile time: - ``target`` and ``target_at`` link the resulting dataset's row/feature counts to the source ``DataFrame``. - Training functions stamp the resulting model with ``clf_features`` / ``clf_classes`` / ``reg_features``. - Scaling, imputation and oversampling preserve the feature count and class count; scaling/imputation preserve balance. - ``random_undersample`` only ever shrinks the row count; oversamplers only grow it. - ``smote`` / ``random_oversample`` / ``random_undersample`` require ``ds_classes ds >= 2``; ``knn_classifier_k`` requires ``k <= ds_rows ds``; ``complement_nb`` requires an imbalanced dataset (its actual design intent). - Probabilities are refined to ``[0, 1]``; error metrics to ``>= 0.0``. Runtime witnesses (``is_multiclass``, ``is_splittable``, ``is_nonempty``, ``balanced``) refine their Bool return type so they can discharge the corresponding precondition inside an ``if ... then ... else ...`` guard. ─── Opaque types ───────────────────────────────────────────────────────
Imports
open List; open Tuple; open Pair;
Table of Contents

Types

DataFrame

(type)
type DataFrame

Dataset

(type)
type Dataset

Classifier

(type)
type Classifier

Regressor

(type)
type Regressor

Functions

np

def np : Unit

pd

def pd : Unit

skl

def skl : Unit

read_csv

Read a CSV file into a ``DataFrame`` (all columns kept).
def read_csv (path: String) : {df : DataFrame | df_cols df >= 1 && df_rows df >= 0}

target

Designate ``column`` as the target. Consumes the source DataFrame. Row count is preserved; feature count is one less than the column count of the source.
def target (df: DataFrame) (column: String) : {ds : Dataset | ds_rows ds == df_rows df && ds_features ds == df_cols df - 1}

target_at

Designate the column at position ``index`` (0-based) as the target.
def target_at (df: DataFrame) (index: {n : Int | n >= 0 && n < df_cols df}) : {ds : Dataset | ds_rows ds == df_rows df && ds_features ds == df_cols df - 1}

load

def load (path: String) : {ds : Dataset | ds_features ds >= 0 && ds_rows ds >= 0}

columns

def columns (df: DataFrame) : List

has_column

def has_column (df: DataFrame) (column: String) : Bool

n_columns

def n_columns (df: DataFrame) : {n : Int | n == df_cols df && n >= 0}

n_rows

def n_rows (ds: Dataset) : {n : Int | n == ds_rows ds && n >= 0}

n_features

def n_features (ds: Dataset) : {n : Int | n == ds_features ds && n >= 0}

balanced

Runtime witness for ``is_balanced``.
def balanced (ds: Dataset) : {b : Bool | b == is_balanced ds}

is_nonempty

Runtime witnesses that discharge precondition refinements inside an ``if ... then ... else ...`` branch.
def is_nonempty (ds: Dataset) : {b : Bool | b == ds_rows ds > 0}

is_splittable

def is_splittable (ds: Dataset) : {b : Bool | b == ds_rows ds >= 2}

is_multiclass

def is_multiclass (ds: Dataset) : {b : Bool | b == ds_classes ds >= 2}

class_counts

List of ``[label, count]`` pairs sorted by label.
def class_counts (ds: Dataset) : List

split

Linear split into ``(training, testing)``. Consumes the source so training on the un-split dataset becomes a type error. The two halves preserve the source's feature count, class count and balance flag — expressed via the auto-generated ``Pair_mk_fst`` / ``Pair_mk_snd`` projections so the relationship survives through downstream calls.
def split (ds: Dataset) (fraction: {f : Float | 0.0 < f && f < 1.0}) : {p : Pair Dataset Dataset | ds_features (Pair_mk_fst p) == ds_features ds && (ds_features (Pair_mk_snd p) == ds_features ds && (ds_classes (Pair_mk_fst p) == ds_classes ds && (ds_classes (Pair_mk_snd p) == ds_classes ds && (is_balanced (Pair_mk_fst p) == is_balanced ds && is_balanced (Pair_mk_snd p) == is_balanced ds))))}

train_of

Runtime accessors for the training / testing halves, refined so the returned dataset is linked to the source ``Pair`` via the auto- generated projection. ``let train = train_of parts`` makes ``train == Pair_mk_fst parts`` visible to the SMT solver, so refinements about ``Pair_mk_fst parts`` (from the split's output type) carry to ``train`` directly.
def train_of (p: Pair Dataset Dataset) : {d : Dataset | d == Pair_mk_fst p}

test_of

def test_of (p: Pair Dataset Dataset) : {d : Dataset | d == Pair_mk_snd p}

smote

SMOTE oversampling; requires multi-class input (guard with ``is_multiclass``). Output is balanced and preserves the feature count; row count only ever grows.
def smote (ds: {d : Dataset | ds_classes d >= 2}) : {out : Dataset | is_balanced out == True && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && ds_rows out >= ds_rows ds))}

random_oversample

def random_oversample (ds: {d : Dataset | ds_classes d >= 2}) : {out : Dataset | is_balanced out == True && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && ds_rows out >= ds_rows ds))}

random_undersample

def random_undersample (ds: {d : Dataset | ds_classes d >= 2}) : {out : Dataset | is_balanced out == True && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && ds_rows out <= ds_rows ds))}

standard_scale

def standard_scale (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

minmax_scale

def minmax_scale (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

robust_scale

def robust_scale (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

fill_mean

def fill_mean (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

fill_median

def fill_median (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

fill_most_frequent

def fill_most_frequent (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

fill_constant

def fill_constant (value: Float) (ds: Dataset) : {out : Dataset | ds_rows out == ds_rows ds && (ds_features out == ds_features ds && (ds_classes out == ds_classes ds && is_balanced out == is_balanced ds))}

random_forest_classifier

def random_forest_classifier (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

random_forest_classifier_n

def random_forest_classifier_n (n: {k : Int | k > 0}) (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

logistic_regression

def logistic_regression (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

decision_tree_classifier

def decision_tree_classifier (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

gradient_boosting_classifier

def gradient_boosting_classifier (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

knn_classifier

def knn_classifier (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

knn_classifier_k

``k`` must be positive and not exceed the number of training rows.
def knn_classifier_k (k: {n : Int | n > 0}) (ds: {d : Dataset | ds_rows d >= k}) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

svc

def svc (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

gaussian_nb

def gaussian_nb (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

multinomial_nb

def multinomial_nb (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

complement_nb

ComplementNB targets imbalanced data — refining out the balanced case guides callers toward the right tool. Discharge with ``if Learning.balanced ds then ... else Learning.complement_nb ds``.
def complement_nb (ds: {d : Dataset | is_balanced d == False}) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

mlp_classifier

def mlp_classifier (ds: Dataset) : {m : Classifier | clf_features m == ds_features ds && clf_classes m == ds_classes ds}

linear_regression

def linear_regression (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

ridge

def ridge (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

ridge_alpha

def ridge_alpha (alpha: {a : Float | a >= 0.0}) (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

lasso

def lasso (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

lasso_alpha

def lasso_alpha (alpha: {a : Float | a >= 0.0}) (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

elastic_net

def elastic_net (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

random_forest_regressor

def random_forest_regressor (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

random_forest_regressor_n

def random_forest_regressor_n (n: {k : Int | k > 0}) (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

decision_tree_regressor

def decision_tree_regressor (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

gradient_boosting_regressor

def gradient_boosting_regressor (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

knn_regressor

def knn_regressor (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

svr

def svr (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

mlp_regressor

def mlp_regressor (ds: Dataset) : {m : Regressor | reg_features m == ds_features ds}

predict

def predict (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : List

predict_proba

def predict_proba (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : List

predict_value

def predict_value (model: Regressor) (ds: {d : Dataset | reg_features model == ds_features d}) : List

accuracy

def accuracy (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : {f : Float | 0.0 <= f && f <= 1.0}

precision

def precision (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : {f : Float | 0.0 <= f && f <= 1.0}

recall

def recall (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : {f : Float | 0.0 <= f && f <= 1.0}

f1

def f1 (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : {f : Float | 0.0 <= f && f <= 1.0}

roc_auc

def roc_auc (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : {f : Float | 0.0 <= f && f <= 1.0}

confusion_matrix

def confusion_matrix (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : List

classification_report

def classification_report (model: Classifier) (ds: {d : Dataset | clf_features model == ds_features d}) : String

mse

def mse (model: Regressor) (ds: {d : Dataset | reg_features model == ds_features d}) : {f : Float | f >= 0.0}

rmse

def rmse (model: Regressor) (ds: {d : Dataset | reg_features model == ds_features d}) : {f : Float | f >= 0.0}

mae

def mae (model: Regressor) (ds: {d : Dataset | reg_features model == ds_features d}) : {f : Float | f >= 0.0}

r2

def r2 (model: Regressor) (ds: {d : Dataset | reg_features model == ds_features d}) : Float

explained_variance

def explained_variance (model: Regressor) (ds: {d : Dataset | reg_features model == ds_features d}) : Float

Uninterpreted

Functions declared as def f ... = uninterpreted: only their signature is known to the verifier; they have no body.

df_rows

uninterpreted
def df_rows : (df : DataFrame) -> Int

df_cols

uninterpreted
def df_cols : (df : DataFrame) -> Int

ds_rows

uninterpreted
def ds_rows : (ds : Dataset) -> Int

ds_features

uninterpreted
def ds_features : (ds : Dataset) -> Int

ds_classes

uninterpreted
def ds_classes : (ds : Dataset) -> Int

is_balanced

uninterpreted
Every class label appears the same number of times.
def is_balanced : (ds : Dataset) -> Bool

clf_features

uninterpreted
Shape of a fitted model.
def clf_features : (m : Classifier) -> Int

clf_classes

uninterpreted
def clf_classes : (m : Classifier) -> Int

reg_features

uninterpreted
def reg_features : (m : Regressor) -> Int