Learning from Discrete Data

All modules related to learning Bayesian Belief Networks BBNs from fully discrete data.

Data

The data.

class pysparkbbn.discrete.data.DiscreteData(sdf)

Bases: object

Discrete data.

__init__(sdf)

ctor

Parameters

sdf – Spark dataframe.

static compute_bd_score(X, Y, XY_vals, counts)

Computes the Bayesian dirichlet score.

Parameters
  • X – Child.

  • Y – Parent.

  • XY_vals – XY values.

  • counts – Counts.

Returns

Tuple of X, Y and score.

static compute_cmi(X, Y, Z, XYZ_vals, counts, n)

Computes the conditional mutual information of X and Y given Z.

Parameters
  • X – X variables.

  • Y – Y variables.

  • Z – Z variables.

  • XYZ_vals – XYZ values.

  • counts – Counts.

  • n – Number of data.

Returns

Tuple of X, Y, Z and conditional mutual information.

static compute_mi(X, Y, XY_vals, counts, n)

Computes the mutual information.

Parameters
  • X – X variables.

  • Y – Y variables.

  • XY_vals – XY values.

  • counts – Counts.

  • n – Total data points.

Returns

Tuple of X, Y and mutual information.

drop(columns)

Drops the specified columns.

Parameters

columns – List of columns.

Returns

New dataset without the dropped columns.

static get_bd(scoring_pairs, counts)

Computes the Bayesian dirichlet score.

Parameters
  • scoring_pairs – Scoring pairs.

  • counts – Counts.

Returns

List of Bayesian dirichlet scores.

get_bd_par(pairs)

Computes the Bayesian dirichlet for each pair of variable (parallel).

Parameters

pairs – Scoring pairs.

Returns

List of Bayesian dirichlet scores.

static get_cmi(triplets, counts, n)

Computes the conditional mutual information for each triplet of variables.

Parameters
  • triplets – List of triplet variables.

  • counts – Count dictionary.

  • n – Number of data.

Returns

List of conditional mutual information.

get_cmi_par(triplets)

Computes the conditional mutual information for each triplet of variables (parallel).

Parameters
  • triplets – List of triplets.

  • n – Number of data.

Returns

List of conditional mutual information.

get_counts_for_bd(scoring_pairs)

Gets the counts required to compute Bayesian dirichlet scoring.

Parameters

scoring_pairs – List of scoring pairs (child, parents).

Returns

get_counts_for_cmi(triplets)

Gets the counts required to compute conditional mutual information.

Parameters

triplets – List of triplets.

Returns

Dictionary of counts.

get_counts_for_mi(pairs)

Gets the counts required to compute pairwise mutual information.

Parameters

pairs – List of pairs.

Returns

Dictionary of counts.

static get_mi(pairs, counts, n)

Computes the mutual information for each pair of variable.

Parameters
  • pairs – List of pairs of variables.

  • counts – Count dictionary.

  • n – Number of data.

Returns

List of mutual information.

get_mi_par(pairs)

Computes the mutual information for each pair of variable (parallel).

Parameters

pairs – List of pairs of variables.

Returns

List of mutual information.

static get_pairs(data)

Gets a list of pairs of variables.

Parameters

data – Data.

Returns

List of pairs.

get_profile()

Gets the data profile.

Returns

Dictionary. Keys are variable names. Values are dictionary of value-frequency.

class pysparkbbn.discrete.data.Pair(X, Y, profile)

Bases: object

Pair. Useful for computing pairwise statistics.

__init__(X, Y, profile)

ctor

Parameters
  • X – List of X variables.

  • Y – List of Y variables.

  • profile – Profile of variables.

get_entries(r)

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r)
class pysparkbbn.discrete.data.ScoringPair(X, Y, profile)

Bases: pysparkbbn.discrete.data.Pair

Scoring pair X, Y. Useful for Bayesian dirichlet scoring.

__init__(X, Y, profile)

ctor

Parameters
  • X – Child.

  • Y – Parents.

  • profile – Variable profile.

get_entries(r)

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r)
class pysparkbbn.discrete.data.Triplet(X, Y, Z, profile)

Bases: object

Triplet X, Y, Z. Useful for computing conditional statistics.

__init__(X, Y, Z, profile)

ctor

Parameters
  • X – List of X variables.

  • Y – List of Y variables.

  • Z – List of Z variables.

  • profile – Profile of variables.

get_entries(r)

Gets the entries used for counting.

Parameters

r – Row.

Returns

Keys used for counting.

get_entries_par(r)

Structure Learning

There are two broad classes of structure learning: constraint-based (CB) and search-and-scoring (SS). CB structure learning typically uses independence and conditional independence tests to learn the network structure. SS structure learning uses a scoring measure over a search space to find the best scoring structures.

The CB structure learning algorithms are as follows. Some of these structure learning algorithms are appropriate for learning classification models.

  • Naive Bayes: Creates a naive Bayes model (classification).

  • Tree-Augmented Network (TAN): Creates a tree-augmented network (classification).

  • BN augmented naive Bayes (BAN): A modified BAN algorithm (classification).

  • Chow-Liu (aka Maximum Weight Spanning Tree, MWST): Creates a tree structured BBN.

  • Three-Phase Dependency Analysis (TPDA): Uses TPDA (draft, thicken and thin).

  • PC Algorithm: Uses the PC algorithm.

There is only one search-and-scoring based algorithm implemented, which uses genetic algorithms.

  • Genetic Algorithm (GA): Discovers a highly scoring network structure.

Constraint-Based

Constraint-based structure learning.

class pysparkbbn.discrete.scblearn.Ban(data, clazz, cmi_threshold=0.06, method='pc')

Bases: pysparkbbn.discrete.scblearn.BaseStructureLearner

Modified Bayesian network augmented naive bayes (BAN). See Bayesian Network Classifiers.

__init__(data, clazz, cmi_threshold=0.06, method='pc')

ctor

Parameters
  • data – Data.

  • clazz – Class variable.

  • cmi_threshold – Threshold (equal to above which) to consider conditionally dependent.

  • method – Either pc or tpda (default=pc).

get_network()

Gets the network structure.

class pysparkbbn.discrete.scblearn.BaseStructureLearner(data)

Bases: object

Base structure learner.

__init__(data)

ctor

Parameters

data – Data.

get_network()

Gets the network structure.

class pysparkbbn.discrete.scblearn.Mwst(data, cmi_threshold=0.06)

Bases: pysparkbbn.discrete.scblearn.BaseStructureLearner

Maximum weight spanning tree.

__init__(data, cmi_threshold=0.06)

ctor

Parameters
  • data – Data.

  • cmi_threshold – Threshold (equal to above which) to consider conditionally dependent.

get_network()

Gets the network structure.

class pysparkbbn.discrete.scblearn.Naive(data, clazz)

Bases: pysparkbbn.discrete.scblearn.BaseStructureLearner

Naive Bayesian network. The clazz variable/node is drawn with an arc to all other nodes.

__init__(data, clazz)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

get_network()

Gets the network structure.

class pysparkbbn.discrete.scblearn.Pc(data, cmi_threshold=0.06)

Bases: pysparkbbn.discrete.scblearn.BaseStructureLearner

PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.

__init__(data, cmi_threshold=0.06)

ctor

Parameters
  • data – Data.

  • depth – Maximum depth.

  • cmi_threshold – Threshold (equal to above which) to consider conditionally dependent.

get_network()

Gets the network structure.

learn_undirected_graph()

Learns an undirected graph.

Returns

Undirected graph.

class pysparkbbn.discrete.scblearn.Tan(data, clazz, cmi_threshold=0.06)

Bases: pysparkbbn.discrete.scblearn.Mwst

Tree-augmented network. See Comparing Bayesian Network Classifiers.

__init__(data, clazz, cmi_threshold=0.06)

ctor.

Parameters
  • data – Data.

  • clazz – The clazz node.

  • cmi_threshold – Threshold (equal to above which) to consider conditionally dependent.

get_network()

Gets the network structure.

class pysparkbbn.discrete.scblearn.Tpda(data, cmi_threshold=0.06)

Bases: pysparkbbn.discrete.scblearn.Mwst

Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.

__init__(data, cmi_threshold=0.06)

ctor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold (equal to above which) to consider conditionally dependent.

get_network()

Gets the network structure.

learn_undirected_graph()

Learns an undirected graph.

Returns

Undirected graph.

Search-and-Scoring

Search-and-scoring structure learning.

class pysparkbbn.discrete.ssslearn.Ga(data, sc, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)

Bases: object

Uses genetic algorithm to search-and-score candidate networks. The particular algorithm is actually a hybrid approach where the ordering of nodes is induced first by a constraint-based algorithm (MWST, PC or TPDA). The ordered nodes are then used to constrain the candidate parents of each node; later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental results.

__init__(data, sc, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.06}, seed=37)

ctor

Parameters
  • data – Data.

  • sc – Spark context.

  • max_parents – Maximum number of parents (default=4).

  • mutation_rate – Mutation rate (default=0.25).

  • pop_size – Population size (default=100).

  • crossover_prob – Crossover probability (default=0.5).

  • max_iters – Maximum iterations (default=20).

  • convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).

  • ordering – Ordering method: mwst, pc or tpda (default=mwst).

  • ordering_params – Ordering parameters to the ordering method.

  • seed – Seed for random number generation (default=37).

get_network()

Gets the network structure.

Parameter Learning

Parameter learning.

class pysparkbbn.discrete.plearn.ParamLearner(data, g)

Bases: object

Parameter learner.

__init__(data, g)

ctor

Parameters

data – Data.

G

Directed acyclic graph.

get_params()

Gets the parameters.

Returns

Dictionary of parameters.

Utilities

Utilities to make life easier.

pysparkbbn.discrete.util.get_triplets(g)

Gets all triplets (x, y, z) where x–y and y–z, but not x–z.

Parameters

g – Undirected graph.

Returns

List of triplets.

pysparkbbn.discrete.util.log_gamma(x)

Computes log gamma(x), where gamma(x) = (x - 1)!. If x=5, then gamma(5) = (5-1)! = 4! = 4 x 3 x 2 x 1, and log(gamma(5)) = log((5-1)!) = log(4!) = log(4) + log(3) + log(2) + log(1).

Parameters

x – Positive integer.

Returns

Log gamma(x).

pysparkbbn.discrete.util.log_gamma_ratio(numerator, denominator)

Computes the ratio of gammas in log-space.

Parameters
  • numerator – Numerator gamma.

  • denominator – Denominator gamma.

Returns

log(gamma(numerator) / gamma(denominator)).

pysparkbbn.discrete.bbn.get_bbn(g, p, profile)

Gets a Bayesian Belief Network based on the specified graph and parameters.

Parameters
  • g – Directed acyclic graph.

  • p – Parameters.

  • profile – Variable profiles.

Returns

BBN.