Learning from Continuous Data¶
All modules related to learning Bayesian Belief Networks (BBNs)
from multivariate Gaussian data.
Data¶
The data.
-
class
pysparkbbn.continuous.data.
CondMvn
(index_1, index_2, means, cov, zero=1e-06, cols_1=None, cols_2=None)¶ Bases:
object
Conditional multivariate normal.
-
__init__
(index_1, index_2, means, cov, zero=1e-06, cols_1=None, cols_2=None)¶ ctor
- Parameters
index_1 – Index of dependent variables.
index_2 – Index of conditioning variables.
means – Means.
cov – Covariance matrix.
zero – Threshold below which to consider a probability as zero.
cols_1 – Names corresponding to index_1.
cols_2 – Names corresponding to index_2.
-
equals
(other)¶ Checks if this is equal to other.
- Parameters
other – CondMvn.
- Returns
Boolean.
-
log_proba
(x)¶ Estimate the log conditional probability of the specified data point.
- Parameters
x – Data point.
- Returns
Log probability.
-
static
partition_cov
(cov, index_1, index_2)¶ Partitions the covariance matrix.
- Parameters
cov – Covariance matrix.
index_1 – Index.
index_2 – Index.
- Returns
Partitioned covariance matrix.
-
static
partition_means
(means, index_1, index_2)¶ Partitions the means.
- Parameters
means – Means.
index_1 – Index.
index_2 – Index.
- Returns
Partitioned means.
-
static
partition_x
(x, index_1, index_2)¶ Partitions the data point.
- Parameters
x – Data point.
index_1 – Index.
index_2 – Index.
- Returns
Tuple of data point partitioned.
-
pdf
(x)¶ Estimate the conditional probability of the specified data point.
- Parameters
x – Data point.
- Returns
Probability.
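As a rough illustration of the math behind a conditional multivariate normal, the sketch below splits a covariance matrix into blocks by index and evaluates the conditional density for the simplest case of one dependent and one conditioning variable. It does not call pysparkbbn; the helper names `split_cov`, `conditional_params` and `conditional_pdf` are hypothetical, and the scalar-only restriction is an assumption made to keep the example free of matrix inversion.

```python
import math


def split_cov(cov, index_1, index_2):
    """Split cov into blocks (S11, S12, S21, S22) using two index lists."""
    def block(rows, cols):
        return [[cov[r][c] for c in cols] for r in rows]

    return (block(index_1, index_1), block(index_1, index_2),
            block(index_2, index_1), block(index_2, index_2))


def conditional_params(means, cov, i, j, x_j):
    """Mean and variance of X_i given X_j = x_j (both scalar variables)."""
    s11, s12, _, s22 = split_cov(cov, [i], [j])
    slope = s12[0][0] / s22[0][0]
    mean = means[i] + slope * (x_j - means[j])
    var = s11[0][0] - slope * s12[0][0]
    return mean, var


def conditional_pdf(x_i, x_j, means, cov, i=0, j=1):
    """Density of X_i = x_i given X_j = x_j under a bivariate normal."""
    mean, var = conditional_params(means, cov, i, j, x_j)
    return math.exp(-0.5 * (x_i - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


means = [1.0, 2.0]
cov = [[4.0, 1.2], [1.2, 1.0]]
cond_mean, cond_var = conditional_params(means, cov, 0, 1, 3.0)
density = conditional_pdf(cond_mean, 3.0, means, cov)
log_density = math.log(density)
```

With more than one dependent or conditioning variable, the same formulas apply with matrix blocks in place of scalars, and the division by the conditioning variance becomes a matrix inverse.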
-
-
class
pysparkbbn.continuous.data.
GaussianData
(sdf, n_samples=6, spark=None)¶ Bases:
object
Gaussian data.
-
__init__
(sdf, n_samples=6, spark=None)¶ ctor
- Parameters
sdf – Spark data frame.
n_samples – Number of samples.
spark – Spark object: Bug requirement.
-
drop
(columns)¶ Drops specified columns.
- Parameters
columns – List of columns.
- Returns
Gaussian data.
-
get_cmi_par
(triplets)¶ Computes conditional mutual information between triplets in parallel.
- Parameters
triplets – List of triplets (of variables).
- Returns
List of conditional mutual information.
-
get_covariance
()¶ Gets the covariance matrix.
- Returns
Covariance matrix.
-
get_means
()¶ Get means.
- Returns
List of means.
-
get_mi_par
(pairs)¶ Computes mutual information between the pairs of variables in parallel.
- Parameters
pairs – List of pairs (of variables).
- Returns
List of mutual information.
-
get_min_max
(columns)¶ Get dictionary of min/max.
- Parameters
columns – Variable names.
- Returns
Dictionary of min/max associated with names.
-
get_min_max_for
(column)¶ Get min/max value for specified variable.
- Parameters
column – Variable name.
- Returns
Dictionary of min/max.
-
get_mvn
(columns)¶ Gets a multivariate normal instance.
- Parameters
columns – List of variable names.
- Returns
Multivariate normal.
-
get_pair
(x, y)¶ Gets a pair.
- Parameters
x – X variables.
y – Y variables.
- Returns
Pair.
-
get_pairs
(col_pairs=None)¶ Gets list of pairs.
- Parameters
col_pairs – List of column pairs.
- Returns
List of pairs.
-
get_pairwise_columns
()¶ Gets pairs of columns.
- Returns
List of pairs of columns. Each column is inside an array.
-
get_profile
()¶ Gets profile of variables.
- Returns
Dictionary; keys are variable names and values are summary stats.
-
get_score_par
(cmvns)¶ Computes the scores.
- Parameters
cmvns – List of conditional multivariate gaussian distributions.
- Returns
List of scores.
-
get_triplet
(x, y, z)¶ Gets a triplet.
- Parameters
x – X variables.
y – Y variables.
z – Z variables.
- Returns
Triplet.
-
get_triplets
(col_triplets)¶ Gets list of triplets.
- Parameters
col_triplets – List of column triplets.
- Returns
List of triplets.
-
slice_covariance
(columns)¶ Slices covariance matrix according to variables.
- Parameters
columns – List of variables.
- Returns
Covariance matrix.
-
slice_means
(columns)¶ Slices means vector according to variables.
- Parameters
columns – List of variables.
- Returns
List of means.
-
-
class
pysparkbbn.continuous.data.
Mvn
(columns, means, cov, profile, n_samples=10)¶ Bases:
object
Multivariate normal distribution.
-
__init__
(columns, means, cov, profile, n_samples=10)¶ ctor
- Parameters
columns – List of variable names.
means – Vector means.
cov – Covariance matrix.
profile – Dictionary of min/max for each variable.
n_samples – Number of samples.
-
get_values
()¶ Gets the sampled values.
- Returns
Generator of values.
-
pdf
(x)¶ Estimate the probability of the specified data point.
- Parameters
x – Data point.
- Returns
Probability.
-
-
class
pysparkbbn.continuous.data.
Pair
(X, Y, XY)¶ Bases:
object
Pair of variables.
-
__init__
(X, Y, XY)¶ ctor
- Parameters
X – X variables.
Y – Y variables.
XY – XY variables.
-
get_mi
()¶ Computes the mutual information.
- Returns
Mutual information.
-
get_partial_mi
(dp)¶ Computes the partial mutual information.
- Parameters
dp – Data point.
- Returns
Partial mutual information.
-
get_values
()¶ Gets the XY values.
- Returns
Generator of XY values.
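For two jointly Gaussian scalar variables, mutual information has a closed form that can serve as a sanity check on an estimate. The sketch below is not how Pair computes its result (the get_values and get_partial_mi methods suggest an estimate accumulated over sampled data points); `gaussian_mi` is a hypothetical helper that assumes the variances and covariance are already known.

```python
import math


def gaussian_mi(var_x, var_y, cov_xy):
    """Mutual information (in nats) between two jointly Gaussian scalars."""
    rho_sq = cov_xy ** 2 / (var_x * var_y)  # squared correlation
    return -0.5 * math.log(1.0 - rho_sq)


mi_independent = gaussian_mi(1.0, 1.0, 0.0)  # zero covariance
mi_correlated = gaussian_mi(4.0, 1.0, 1.2)   # correlation of 0.6
```

Uncorrelated Gaussian variables give zero mutual information, and the value grows without bound as the correlation approaches 1 in absolute value.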
-
-
class
pysparkbbn.continuous.data.
Triplet
(x_cols, y_cols, z_cols, Z, XZ, YZ, XYZ)¶ Bases:
object
Triplet variables.
-
__init__
(x_cols, y_cols, z_cols, Z, XZ, YZ, XYZ)¶ ctor
- Parameters
x_cols – X columns.
y_cols – Y columns.
z_cols – Z columns.
Z – Z variables.
XZ – XZ variables.
YZ – YZ variables.
XYZ – XYZ variables.
-
get_cmi
()¶ Computes the conditional mutual information.
- Returns
Conditional mutual information.
-
get_partial_mi
(dp)¶ Computes the partial mutual information.
- Parameters
dp – Data point.
- Returns
Partial mutual information.
-
get_values
()¶ Gets the XYZ values.
- Returns
List of XYZ values.
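For jointly Gaussian variables with a single conditioning variable, conditional mutual information can be written in terms of the partial correlation. The sketch below is a closed-form check under that assumption, not the estimator Triplet uses; `partial_corr` and `gaussian_cmi` are hypothetical names and take pairwise correlations as input.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation of X and Y after removing the linear effect of Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def gaussian_cmi(r_xy, r_xz, r_yz):
    """Conditional mutual information I(X; Y | Z) in nats for Gaussians."""
    rho = partial_corr(r_xy, r_xz, r_yz)
    return -0.5 * math.log(1.0 - rho ** 2)


# Chain X - Z - Y: r_xy equals r_xz * r_yz, so X and Y are independent given Z.
cmi_chain = gaussian_cmi(0.48, 0.8, 0.6)
# X and Y remain dependent after conditioning on Z (partial correlation 0.6).
cmi_direct = gaussian_cmi(0.7, 0.5, 0.5)
```

A value near zero, as in the chain case, is what a constraint-based learner would compare against its threshold to decide that two variables are conditionally independent.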
-
Structure Learning¶
Constraint-Based¶
Constraint-based structure learning.
-
class
pysparkbbn.continuous.scblearn.
Ban
(data, clazz, cmi_threshold=0.0001, method='pc')¶ Bases:
pysparkbbn.continuous.scblearn.BaseStructureLearner
Modified Bayesian network augmented naive bayes (BAN). See Bayesian Network Classifiers.
-
__init__
(data, clazz, cmi_threshold=0.0001, method='pc')¶ ctor
- Parameters
data – Data.
clazz – Class variable.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent.
method – Either pc or tpda (default=pc).
-
get_network
()¶ Gets the network structure.
-
-
class
pysparkbbn.continuous.scblearn.
BaseStructureLearner
(data)¶ Bases:
object
Base structure learner.
-
__init__
(data)¶ ctor
- Parameters
data – Data.
-
get_network
()¶ Gets the network structure.
-
-
class
pysparkbbn.continuous.scblearn.
Mwst
(data, cmi_threshold=0.01)¶ Bases:
pysparkbbn.continuous.scblearn.BaseStructureLearner
Maximum weight spanning tree.
-
__init__
(data, cmi_threshold=0.01)¶ ctor
- Parameters
data – Data.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent.
-
get_network
()¶ Gets the network structure.
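The idea of a maximum weight spanning tree can be sketched with Kruskal's algorithm over pairwise weights such as mutual information. This is a generic illustration, not the Mwst implementation; the function name, the edge-weight dictionary and the way the threshold is applied are all assumptions.

```python
def max_weight_spanning_tree(nodes, weights, threshold=0.0):
    """Kruskal's algorithm taking edges in order of descending weight.

    weights maps (u, v) to a weight; edges below threshold are ignored.
    """
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the component root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    for (u, v), w in ranked:
        if w < threshold:
            break  # all remaining edges are weaker still
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding the edge does not create a cycle
            parent[root_u] = root_v
            tree.append((u, v))
    return tree


mi = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.4, ("c", "d"): 0.005}
edges = max_weight_spanning_tree(["a", "b", "c", "d"], mi, threshold=0.01)
```

In this example the edge between a and c is rejected because it would close a cycle, and the edge between c and d is dropped for falling below the threshold, so d stays disconnected.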
-
-
class
pysparkbbn.continuous.scblearn.
Naive
(data, clazz)¶ Bases:
pysparkbbn.continuous.scblearn.BaseStructureLearner
Naive Bayesian network. The clazz variable/node is drawn with an arc to all other nodes.
-
__init__
(data, clazz)¶ ctor.
- Parameters
data – Data.
clazz – The clazz node.
-
get_network
()¶ Gets the network structure.
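The naive structure described above amounts to one arc from the class node to every other node. A minimal sketch, with a hypothetical helper name and a plain list of (parent, child) tuples standing in for whatever network object the library returns:

```python
def naive_bayes_edges(columns, clazz):
    """Return directed (parent, child) arcs from clazz to every other column."""
    return [(clazz, column) for column in columns if column != clazz]


edges = naive_bayes_edges(["y", "x1", "x2"], "y")
```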
-
-
class
pysparkbbn.continuous.scblearn.
Pc
(data, cmi_threshold=0.0001)¶ Bases:
pysparkbbn.continuous.scblearn.BaseStructureLearner
PC algorithm. See A fast PC algorithm for high dimensional causal discovery with multi-core PCs.
-
__init__
(data, cmi_threshold=0.0001)¶ ctor
- Parameters
data – Data.
depth – Maximum depth.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent.
-
get_network
()¶ Gets the network structure.
-
learn_undirected_graph
()¶ Learns an undirected graph.
- Returns
Undirected graph.
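The skeleton phase of a PC-style learner can be sketched as follows: start from a complete undirected graph and remove an edge whenever the (conditional) mutual information of its endpoints falls below the threshold. This is a deliberately simplified illustration, not the Pc class: it assumes Gaussian data summarized by a correlation matrix, uses closed-form mutual information, and only tries conditioning sets of size 0 and 1. All function names are hypothetical.

```python
import math
from itertools import combinations


def cmi_from_corr(corr, x, y, z=None):
    """Gaussian (conditional) mutual information in nats; z is None or one index."""
    rho = corr[x][y]
    if z is not None:
        rho = (rho - corr[x][z] * corr[y][z]) / math.sqrt(
            (1 - corr[x][z] ** 2) * (1 - corr[y][z] ** 2))
    return -0.5 * math.log(1.0 - rho ** 2)


def remove_edge(adj, x, y):
    adj[x].discard(y)
    adj[y].discard(x)


def learn_skeleton(corr, cmi_threshold=0.0001):
    """Return sorted undirected edges (x, y) with x < y that survive the tests."""
    n = len(corr)
    adj = {i: set(range(n)) - {i} for i in range(n)}
    # Order 0: marginal independence tests.
    for x, y in combinations(range(n), 2):
        if cmi_from_corr(corr, x, y) < cmi_threshold:
            remove_edge(adj, x, y)
    # Order 1: condition on a single neighbour of either endpoint.
    for x, y in combinations(range(n), 2):
        if y not in adj[x]:
            continue
        for z in sorted((adj[x] | adj[y]) - {x, y}):
            if cmi_from_corr(corr, x, y, z) < cmi_threshold:
                remove_edge(adj, x, y)
                break
    return sorted((x, y) for x in adj for y in adj[x] if x < y)


# Correlations consistent with a chain 0 - 1 - 2 (0.48 = 0.8 * 0.6).
corr = [[1.0, 0.8, 0.48], [0.8, 1.0, 0.6], [0.48, 0.6, 1.0]]
skeleton = learn_skeleton(corr)
```

Here variables 0 and 2 are marginally dependent but become independent once variable 1 is conditioned on, so only the two chain edges remain. A full PC implementation would continue with larger conditioning sets and then orient the edges.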
-
-
class
pysparkbbn.continuous.scblearn.
Tan
(data, clazz, cmi_threshold=0.01)¶ Bases:
pysparkbbn.continuous.scblearn.Mwst
Tree-augmented network. See Comparing Bayesian Network Classifiers.
-
__init__
(data, clazz, cmi_threshold=0.01)¶ ctor.
- Parameters
data – Data.
clazz – The clazz node.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent.
-
get_network
()¶ Gets the network structure.
-
-
class
pysparkbbn.continuous.scblearn.
Tpda
(data, cmi_threshold=0.006)¶ Bases:
pysparkbbn.continuous.scblearn.Mwst
Three-phase dependency analysis (TPDA). See Learning Belief Networks from Data: An Information Theory Based Approach.
-
__init__
(data, cmi_threshold=0.006)¶ ctor.
- Parameters
data – Data.
cmi_threshold – Threshold at or above which variables are considered conditionally dependent.
-
get_network
()¶ Gets the network structure.
-
learn_undirected_graph
()¶ Learns an undirected graph.
- Returns
Undirected graph.
-
Search-and-Scoring¶
Search-and-scoring structure learning.
-
class
pysparkbbn.continuous.ssslearn.
Ga
(data, sc, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)¶ Bases:
object
Uses a genetic algorithm to search and score candidate networks. The algorithm is a hybrid approach: the ordering of nodes is first induced by a constraint-based algorithm (MWST, PC or TPDA), and the ordered nodes then constrain the candidate parents of each node, so later nodes cannot be parents of earlier ones. See Learning Bayesian Networks: Search Methods and Experimental results.
-
__init__
(data, sc, max_parents=4, mutation_rate=0.25, pop_size=100, crossover_prob=0.5, max_iters=20, convergence_threshold=3, ordering='mwst', ordering_params={'cmi_threshold': 0.0001}, seed=37)¶ ctor
- Parameters
data – Data.
sc – Spark context.
max_parents – Maximum number of parents (default=4).
mutation_rate – Mutation rate (default=0.25).
pop_size – Population size (default=100).
crossover_prob – Crossover probability (default=0.5).
max_iters – Maximum iterations (default=20).
convergence_threshold – Convergence threshold; terminate when no improvement is made after this many generations (default=3).
ordering – Ordering method: mwst, pc or tpda (default=mwst).
ordering_params – Ordering parameters to the ordering method.
seed – Seed for random number generation (default=37).
-
get_network
()¶ Gets the network structure.
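The ordering constraint described for Ga (later nodes cannot be parents of earlier ones) can be illustrated by deriving each node's candidate parents from a node ordering and sampling one candidate network that respects a maximum parent count. This is a sketch of the constraint only, not of the library's genetic operators or scoring; the function names and the dictionary representation of a network are assumptions.

```python
import random


def candidate_parents(ordering):
    """Map each node to the nodes that precede it in the ordering."""
    return {node: ordering[:i] for i, node in enumerate(ordering)}


def random_individual(ordering, max_parents, rng):
    """Sample one candidate network as {node: sorted list of parents}."""
    network = {}
    for node, candidates in candidate_parents(ordering).items():
        k = rng.randint(0, min(max_parents, len(candidates)))
        network[node] = sorted(rng.sample(candidates, k))
    return network


rng = random.Random(37)  # fixed seed for reproducibility
ordering = ["a", "b", "c", "d", "e", "f"]
individual = random_individual(ordering, max_parents=4, rng=rng)
```

Because every parent precedes its child in the ordering, any network sampled this way is acyclic by construction, which is what makes the ordering useful for narrowing the search space.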
-