Learning from Continuous Data

Modules for learning Bayesian belief networks (BBNs) from multivariate Gaussian data.

Data

Classes for representing Gaussian data and the distributions derived from it.

class pysparkbbn.continuous.data.CondMvn(index_1, index_2, means, cov, zero=1e-06, cols_1=None, cols_2=None)

Bases: object

Conditional multivariate normal.

__init__(index_1, index_2, means, cov, zero=1e-06, cols_1=None, cols_2=None)

Constructor.

Parameters
  • index_1 – Index of dependent variables.

  • index_2 – Index of conditioning variables.

  • means – Means.

  • cov – Covariance matrix.

  • zero – Threshold below which to consider a probability as zero.

  • cols_1 – Names corresponding to index_1.

  • cols_2 – Names corresponding to index_2.

equals(other)

Checks if this is equal to other.

Parameters

other – CondMvn.

Returns

Boolean.

log_proba(x)

Estimate the log conditional probability of the specified data point.

Parameters

x – Data point.

Returns

Log probability.
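For the simplest case of one dependent and one conditioning variable, the conditional log density has a closed form. The sketch below illustrates that case only and is not the library's implementation; the function name cond_normal_logpdf and its scalar arguments are hypothetical.

```python
import math


def cond_normal_logpdf(x1, x2, mu1, mu2, s11, s12, s22):
    """Log density of x1 given x2 for a bivariate Gaussian (hypothetical helper)."""
    # Conditional mean: shift mu1 by the regression of x1 on x2.
    cond_mean = mu1 + (s12 / s22) * (x2 - mu2)
    # Conditional variance: Schur complement of s22 in the covariance matrix.
    cond_var = s11 - (s12 * s12) / s22
    return -0.5 * (math.log(2.0 * math.pi * cond_var)
                   + (x1 - cond_mean) ** 2 / cond_var)


# Here x1 sits exactly on the conditional mean, so only the normalizer remains.
log_p = cond_normal_logpdf(x1=1.0, x2=2.0, mu1=0.0, mu2=0.0,
                           s11=1.0, s12=0.5, s22=1.0)
```

Exponentiating log_p gives the conditional density itself.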

static partition_cov(cov, index_1, index_2)

Partitions the covariance matrix.

Parameters
  • cov – Covariance matrix.

  • index_1 – Index of dependent variables.

  • index_2 – Index of conditioning variables.

Returns

Partitioned covariance matrix.
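Partitioning a covariance matrix by two index lists can be sketched in plain Python as follows. The return layout of the library method is not documented here, so the four-block tuple and the name partition_cov_sketch are assumptions made for illustration.

```python
def partition_cov_sketch(cov, index_1, index_2):
    """Split a covariance matrix (list of lists) into four blocks (hypothetical layout)."""
    def block(rows, cols):
        return [[cov[r][c] for c in cols] for r in rows]

    # Blocks in the order S11, S12, S21, S22.
    return (block(index_1, index_1), block(index_1, index_2),
            block(index_2, index_1), block(index_2, index_2))


cov = [[4.0, 1.2, 0.5],
       [1.2, 9.0, 0.3],
       [0.5, 0.3, 1.0]]
# Variable 0 is the dependent variable; variables 1 and 2 are the conditioning ones.
s11, s12, s21, s22 = partition_cov_sketch(cov, [0], [1, 2])
```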

static partition_means(means, index_1, index_2)

Partitions the means.

Parameters
  • means – Means.

  • index_1 – Index of dependent variables.

  • index_2 – Index of conditioning variables.

Returns

Partitioned means.

static partition_x(x, index_1, index_2)

Partitions the data point.

Parameters
  • x – Data point.

  • index_1 – Index of dependent variables.

  • index_2 – Index of conditioning variables.

Returns

Tuple of the partitioned data point.

pdf(x)

Estimate the conditional probability of the specified data point.

Parameters

x – Data point.

Returns

Probability.

class pysparkbbn.continuous.data.GaussianData(sdf, n_samples=6, spark=None)

Bases: object

Gaussian data.

__init__(sdf, n_samples=6, spark=None)

Constructor.

Parameters
  • sdf – Spark data frame.

  • n_samples – Number of samples.

  • spark – Spark object (required to work around a bug).

drop(columns)

Drops specified columns.

Parameters

columns – List of columns.

Returns

Gaussian data.

get_cmi_par(triplets)

Computes conditional mutual information between triplets in parallel.

Parameters

triplets – List of triplets (of variables).

Returns

List of conditional mutual information.

get_covariance()

Gets the covariance matrix.

Returns

Covariance matrix.

get_means()

Gets the means.

Returns

List of means.

get_mi_par(pairs)

Computes mutual information between the pairs of variables in parallel.

Parameters

pairs – List of pairs (of variables).

Returns

List of mutual information.

get_min_max(columns)

Gets a dictionary of min/max values.

Parameters

columns – Variable names.

Returns

Dictionary of min/max associated with names.

get_min_max_for(column)

Gets the min/max values for the specified variable.

Parameters

column – Variable name.

Returns

Dictionary of min/max.

get_mvn(columns)

Gets a multivariate normal instance.

Parameters

columns – List of variable names.

Returns

Multivariate normal.

get_pair(x, y)

Gets a pair.

Parameters
  • x – X variables.

  • y – Y variables.

Returns

Pair.

get_pairs(col_pairs=None)

Gets list of pairs.

Parameters

col_pairs – List of column pairs.

Returns

List of pairs.

get_pairwise_columns()

Gets pairs of columns.

Returns

List of pairs of columns. Each column is inside an array.
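One way to picture the "each column is inside an array" layout is the sketch below. It assumes unordered pairs without repetition and a tuple of two single-element lists per pair; both the ordering and the exact container types are assumptions, and pairwise_columns_sketch is a hypothetical name.

```python
from itertools import combinations


def pairwise_columns_sketch(columns):
    """Return every unordered pair of columns, each column wrapped in its own list."""
    return [([a], [b]) for a, b in combinations(columns, 2)]


pairs = pairwise_columns_sketch(["x1", "x2", "x3"])
```

Wrapping each column in a list keeps the shape compatible with get_pair(x, y), whose arguments are described as sets of variables.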

get_profile()

Gets profile of variables.

Returns

Dictionary; keys are variable names and values are summary stats.

get_score_par(cmvns)

Computes the scores.

Parameters

cmvns – List of conditional multivariate Gaussian distributions.

Returns

List of scores.

get_triplet(x, y, z)

Gets a triplet.

Parameters
  • x – X variables.

  • y – Y variables.

  • z – Z variables.

Returns

Triplet.

get_triplets(col_triplets)

Gets list of triplets.

Parameters

col_triplets – List of column triplets.

Returns

List of triplets.

slice_covariance(columns)

Slices covariance matrix according to variables.

Parameters

columns – List of variables.

Returns

Covariance matrix.

slice_means(columns)

Slices means vector according to variables.

Parameters

columns – List of variables.

Returns

List of means.
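Slicing by variable name amounts to mapping names to positions and selecting the matching entries. The following is a minimal sketch with hypothetical helper names and plain lists; the library's actual return types are not documented here.

```python
def slice_means_by_name(columns, all_columns, means):
    """Pick the means of the requested columns, in the requested order."""
    idx = [all_columns.index(c) for c in columns]
    return [means[i] for i in idx]


def slice_cov_by_name(columns, all_columns, cov):
    """Pick the rows and columns of the covariance matrix for the requested columns."""
    idx = [all_columns.index(c) for c in columns]
    return [[cov[r][c] for c in idx] for r in idx]


all_columns = ["x1", "x2", "x3"]
means = [1.0, 2.0, 3.0]
cov = [[1.0, 0.2, 0.3],
       [0.2, 2.0, 0.4],
       [0.3, 0.4, 3.0]]

sub_means = slice_means_by_name(["x3", "x1"], all_columns, means)
sub_cov = slice_cov_by_name(["x3", "x1"], all_columns, cov)
```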

class pysparkbbn.continuous.data.Mvn(columns, means, cov, profile, n_samples=10)

Bases: object

Multivariate normal distribution.

__init__(columns, means, cov, profile, n_samples=10)

Constructor.

Parameters
  • columns – List of variable names.

  • means – Vector of means.

  • cov – Covariance matrix.

  • profile – Dictionary of min/max for each variable.

  • n_samples – Number of samples.

get_values()

Gets the sampled values.

Returns

Generator of values.

pdf(x)

Estimate the probability of the specified data point.

Parameters

x – Data point.

Returns

Probability.

class pysparkbbn.continuous.data.Pair(X, Y, XY)

Bases: object

Pair of variables.

__init__(X, Y, XY)

Constructor.

Parameters
  • X – X variables.

  • Y – Y variables.

  • XY – XY variables.

get_mi()

Computes the mutual information.

Returns

Mutual information.
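As a reference value, the mutual information of two jointly Gaussian scalar variables has the closed form I(X;Y) = -0.5 ln(1 - rho^2), where rho is their correlation. The sketch below computes it from variances and a covariance; it is independent of how the library estimates the quantity, and gaussian_mi is a hypothetical name.

```python
import math


def gaussian_mi(var_x, var_y, cov_xy):
    """Closed-form mutual information (in nats) of a bivariate Gaussian."""
    rho_sq = (cov_xy * cov_xy) / (var_x * var_y)  # squared correlation
    return -0.5 * math.log(1.0 - rho_sq)


mi = gaussian_mi(var_x=4.0, var_y=9.0, cov_xy=3.0)  # correlation is 0.5
```

Uncorrelated variables give zero, and the value grows without bound as the correlation approaches 1 in absolute value.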

get_partial_mi(dp)

Computes the partial mutual information.

Parameters

dp – Data point.

Returns

Partial mutual information.

get_values()

Gets the XY values.

Returns

Generator of XY values.

class pysparkbbn.continuous.data.Triplet(x_cols, y_cols, z_cols, Z, XZ, YZ, XYZ)

Bases: object

Triplet of variables.

__init__(x_cols, y_cols, z_cols, Z, XZ, YZ, XYZ)

Constructor.

Parameters
  • x_cols – X columns.

  • y_cols – Y columns.

  • z_cols – Z columns.

  • Z – Z variables.

  • XZ – XZ variables.

  • YZ – YZ variables.

  • XYZ – XYZ variables.

get_cmi()

Computes the conditional mutual information.

Returns

Conditional mutual information.
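The Z, XZ, YZ and XYZ arguments mirror the identity I(X;Y|Z) = h(X,Z) + h(Y,Z) - h(Z) - h(X,Y,Z). For jointly Gaussian variables this reduces to a ratio of covariance determinants. The sketch below handles scalar X, Y and Z only; it is a reference calculation with hypothetical helper names, not the library's estimator.

```python
import math


def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def gaussian_cmi(cov):
    """I(X;Y|Z) in nats for a 3x3 covariance matrix ordered as (X, Y, Z)."""
    var_z = cov[2][2]
    cov_xz = [[cov[0][0], cov[0][2]], [cov[2][0], cov[2][2]]]
    cov_yz = [[cov[1][1], cov[1][2]], [cov[2][1], cov[2][2]]]
    return 0.5 * math.log(det2(cov_xz) * det2(cov_yz) / (var_z * det3(cov)))


# X and Y are correlated only through Z, so the CMI is (numerically) zero.
cmi_chain = gaussian_cmi([[1.0, 0.64, 0.8],
                          [0.64, 1.0, 0.8],
                          [0.8, 0.8, 1.0]])
# Z is independent of X and Y, so the CMI equals the plain mutual information.
cmi_indep = gaussian_cmi([[1.0, 0.5, 0.0],
                          [0.5, 1.0, 0.0],
                          [0.0, 0.0, 1.0]])
```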

get_partial_mi(dp)

Computes the partial mutual information.

Parameters

dp – Data point.

Returns

Partial mutual information.

get_values()

Gets the XYZ values.

Returns

List of XYZ values.

Structure Learning

Constraint-Based

Constraint-based structure learning.

class pysparkbbn.continuous.scblearn.Ban(data, clazz, cmi_threshold=0.0001, method='pc')

Bases: pysparkbbn.continuous.scblearn.BaseStructureLearner

Modified Bayesian network augmented naive Bayes (BAN). See Bayesian Network Classifiers.

__init__(data, clazz, cmi_threshold=0.0001, method='pc')

Constructor.

Parameters
  • data – Data.

  • clazz – Class variable.

  • cmi_threshold – Threshold; values equal to or above it are considered conditionally dependent.

  • method – Either pc or tpda (default=pc).

get_network()

Gets the network structure.

class pysparkbbn.continuous.scblearn.BaseStructureLearner(data)

Bases: object

Base structure learner.

__init__(data)

Constructor.

Parameters

data – Data.

get_network()

Gets the network structure.

class pysparkbbn.continuous.scblearn.Mwst(data, cmi_threshold=0.01)

Bases: pysparkbbn.continuous.scblearn.BaseStructureLearner

Maximum weight spanning tree.

__init__(data, cmi_threshold=0.01)

Constructor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold; values equal to or above it are considered conditionally dependent.

get_network()

Gets the network structure.
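A maximum weight spanning tree can be built with Kruskal's algorithm over pairwise dependency weights such as mutual information. The sketch below is a generic illustration: how the library applies cmi_threshold is an assumption (here, edges below the threshold are simply skipped), and the function name and weight dictionary are hypothetical.

```python
def max_weight_spanning_tree(nodes, weights, threshold=0.01):
    """Kruskal's algorithm on edges sorted by descending weight."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if w < threshold:
            break  # All remaining edges are weaker than the threshold.
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # Adding the edge does not create a cycle.
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


weights = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.7, ("c", "d"): 0.001}
tree = max_weight_spanning_tree(["a", "b", "c", "d"], weights)
```

With a threshold in place the result may be a forest rather than a single tree; node "d" stays isolated in this example.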

class pysparkbbn.continuous.scblearn.Naive(data, clazz)

Bases: pysparkbbn.continuous.scblearn.BaseStructureLearner

Naive Bayesian network. An arc is drawn from the clazz node to every other node.

__init__(data, clazz)

Constructor.

Parameters
  • data – Data.

  • clazz – The clazz node.

get_network()

Gets the network structure.
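The naive structure is small enough to write down directly. This sketch lists the arcs as (parent, child) tuples, which is an assumed representation; the type returned by get_network() is not documented here, and naive_bayes_arcs is a hypothetical name.

```python
def naive_bayes_arcs(columns, clazz):
    """Arcs of a naive Bayes structure: the class node points at every other node."""
    if clazz not in columns:
        raise ValueError(f"unknown class variable: {clazz}")
    return [(clazz, c) for c in columns if c != clazz]


arcs = naive_bayes_arcs(["x1", "x2", "y"], clazz="y")
```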

class pysparkbbn.continuous.scblearn.Pc(data, cmi_threshold=0.0001)

Bases: pysparkbbn.continuous.scblearn.BaseStructureLearner

PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.

__init__(data, cmi_threshold=0.0001)

Constructor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold; values equal to or above it are considered conditionally dependent.

get_network()

Gets the network structure.

learn_undirected_graph()

Learns an undirected graph.

Returns

Undirected graph.
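The idea behind learning the undirected graph can be sketched as follows: start from a complete graph and drop an edge whenever some conditional mutual information falls below the threshold. This is a deliberately simplified illustration, not the library's algorithm. It assumes Gaussian variables described by a correlation matrix and conditions on at most one variable, whereas the full PC algorithm draws conditioning sets of growing size from the current neighbours. All names are hypothetical.

```python
import math
from itertools import combinations


def gaussian_cmi(corr, i, j, k=None):
    """CMI of variables i and j given at most one variable k, from a correlation matrix."""
    r = corr[i][j]
    if k is not None:
        # First-order partial correlation of i and j given k.
        r_ik, r_jk = corr[i][k], corr[j][k]
        r = (r - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))
    return -0.5 * math.log(1.0 - r * r)


def learn_skeleton(corr, cmi_threshold=0.0001):
    """Keep an edge only if no tested conditioning choice makes the pair independent."""
    n = len(corr)
    edges = {frozenset(pair) for pair in combinations(range(n), 2)}
    for i, j in combinations(range(n), 2):
        # None is the marginal test; the rest condition on one other variable.
        conditioning = [None] + [k for k in range(n) if k not in (i, j)]
        if any(gaussian_cmi(corr, i, j, k) < cmi_threshold for k in conditioning):
            edges.discard(frozenset((i, j)))
    return edges


# Chain 0 - 1 - 2: variables 0 and 2 are related only through variable 1.
corr = [[1.0, 0.8, 0.64],
        [0.8, 1.0, 0.8],
        [0.64, 0.8, 1.0]]
skeleton = learn_skeleton(corr)
```

In this example the edge between variables 0 and 2 is removed once variable 1 is conditioned on, leaving the chain.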

class pysparkbbn.continuous.scblearn.Tan(data, clazz, cmi_threshold=0.01)

Bases: pysparkbbn.continuous.scblearn.Mwst

Tree-augmented network. See Comparing Bayesian Network Classifiers.

__init__(data, clazz, cmi_threshold=0.01)

Constructor.

Parameters
  • data – Data.

  • clazz – The clazz node.

  • cmi_threshold – Threshold; values equal to or above it are considered conditionally dependent.

get_network()

Gets the network structure.
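In the usual formulation, a tree-augmented network orients a spanning tree over the feature variables away from a chosen root and then adds an arc from the class node to every feature. The sketch below shows that construction under stated assumptions: the tree edges are given, the root is the first feature unless specified, and arcs are (parent, child) tuples. The name tan_arcs is hypothetical and the library may differ in its details.

```python
from collections import deque


def tan_arcs(features, tree_edges, clazz, root=None):
    """Orient an undirected feature tree from a root, then add class-to-feature arcs."""
    adjacency = {f: [] for f in features}
    for u, v in tree_edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    root = features[0] if root is None else root
    arcs, seen, queue = [], {root}, deque([root])
    while queue:  # Breadth-first traversal directs each edge away from the root.
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                arcs.append((u, v))
                queue.append(v)

    arcs.extend((clazz, f) for f in features)
    return arcs


arcs = tan_arcs(["a", "b", "c"], [("a", "b"), ("b", "c")], clazz="y")
```

Only the features connected to the root receive tree arcs in this sketch; every feature still receives its arc from the class node.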

class pysparkbbn.continuous.scblearn.Tpda(data, cmi_threshold=0.006)

Bases: pysparkbbn.continuous.scblearn.Mwst

Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.

__init__(data, cmi_threshold=0.006)

Constructor.

Parameters
  • data – Data.

  • cmi_threshold – Threshold; values equal to or above it are considered conditionally dependent.

get_network()

Gets the network structure.

learn_undirected_graph()

Learns an undirected graph.

Returns

Undirected graph.

Search-and-Scoring

Search-and-scoring structure learning.

class pysparkbbn.continuous.ssslearn.Ga(data, sc, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)

Bases: object

Uses a genetic algorithm to search and score candidate networks. The algorithm is a hybrid approach: a constraint-based algorithm (MWST, PC or TPDA) first induces an ordering of the nodes, and that ordering then constrains the candidate parents of each node, so that later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental Results.
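The ordering constraint described above can be sketched as a function that enumerates the parent sets a node may take. It is an illustration of the constraint only, not the library's search code; candidate_parent_sets is a hypothetical name.

```python
from itertools import combinations


def candidate_parent_sets(ordering, node, max_parents=4):
    """All parent sets drawn from nodes that precede `node` in the ordering."""
    earlier = ordering[:ordering.index(node)]
    sets = []
    for size in range(min(max_parents, len(earlier)) + 1):
        sets.extend(combinations(earlier, size))
    return sets


ordering = ["a", "b", "c", "d"]
parents_of_c = candidate_parent_sets(ordering, "c", max_parents=1)
parents_of_a = candidate_parent_sets(ordering, "a")
```

Because parents always come from earlier positions, any combination of choices across nodes yields an acyclic graph, which is what makes the ordering useful for a search procedure.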

__init__(data, sc, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)

Constructor.

Parameters
  • data – Data.

  • sc – Spark context.

  • max_parents – Maximum number of parents (default=4).

  • mutation_rate – Mutation rate (default=0.25).

  • pop_size – Population size (default=100).

  • crossover_prob – Crossover probability (default=0.5).

  • max_iters – Maximum iterations (default=20).

  • convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).

  • ordering – Ordering method: mwst, pc or tpda (default=mwst).

  • ordering_params – Ordering parameters to the ordering method.

  • seed – Seed for random number generation (default=37).

get_network()

Gets the network structure.
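The interplay of max_iters and convergence_threshold can be pictured with a small stopping rule. The sketch below assumes that higher scores are better and that each generation reports its best score; both points are assumptions, and run_until_converged is a hypothetical helper rather than part of the library.

```python
def run_until_converged(best_scores, max_iters=20, convergence_threshold=3):
    """Consume per-generation best scores until the stall or iteration limit is hit."""
    best, stalled, generations = float("-inf"), 0, 0
    for score in best_scores:
        if generations >= max_iters or stalled >= convergence_threshold:
            break
        generations += 1
        if score > best:
            best, stalled = score, 0  # Improvement resets the stall counter.
        else:
            stalled += 1
    return generations, best


# The score stops improving after generation 2, so the run ends at generation 5.
generations, best = run_until_converged([-10.0, -8.0, -8.0, -8.0, -8.0, -5.0])
```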