| Title: | Computational Test for Conditional Independence |
|---|---|
| Description: | Tool for performing computational testing for conditional independence between variables in a dataset. 'CCI' implements permutation in combination with Monte Carlo Cross-Validation in generating null distributions and test statistics. For more details see Computational Test for Conditional Independence (2024) <doi:10.3390/a17080323>. |
| Authors: | Christian Thorjussen [aut, cre] (ORCID: <https://orcid.org/0009-0005-5006-6491>), Kristian Hovde Liland [aut] (ORCID: <https://orcid.org/0000-0001-6468-9423>) |
| Maintainer: | Christian Thorjussen <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.3.6.1 |
| Built: | 2026-05-15 18:17:56 UTC |
| Source: | https://github.com/khliland/cci |
<var1>_int_<var2> (e.g., Z1_int_Z2 for the product of Z1 and Z2).Creates interaction terms for specified variables in a data frame
Interaction terms are named as <var1>_int_<var2> (e.g., Z1_int_Z2 for the product of Z1 and Z2).
add_interaction_terms(data, Z, mode = c("numeric_only", "mixed"))add_interaction_terms(data, Z, mode = c("numeric_only", "mixed"))
data |
Data frame. The data frame containing the variables for which interaction terms are to be created. |
Z |
Character vector. The names of the variables for which interaction terms are to be created. |
mode |
Character. Specifies the type of interaction terms to create. Options are: numeric_only (only numeric-numeric interactions as products) or mixed (numeric-numeric as products, factor/character involved as categorical interactions). Default is "numeric_only". |
A list with two components:
data: The modified data frame with added interaction terms.
new_terms: A character vector of the names of the added interaction terms (e.g., Z1_int_2).
data_generator <- function(N){ Z1 <- rnorm(N,0,1) Z2 <- rnorm(N,0,1) X <- rnorm(N, Z1 + Z2, 1) Y <- rnorm(N, Z1 + Z2, 1) df <- data.frame(Z1, Z2, X, Y) return(df) } dat <- data_generator(250) interaction_terms <- add_interaction_terms(data = dat, Z = c("Z1", "Z2")) head(interaction_terms$data$Z1_int_Z2)data_generator <- function(N){ Z1 <- rnorm(N,0,1) Z2 <- rnorm(N,0,1) X <- rnorm(N, Z1 + Z2, 1) Y <- rnorm(N, Z1 + Z2, 1) df <- data.frame(Z1, Z2, X, Y) return(df) } dat <- data_generator(250) interaction_terms <- add_interaction_terms(data = dat, Z = c("Z1", "Z2")) head(interaction_terms$data$Z1_int_Z2)
<variable>_d_<degree> (e.g., Z1_d_2 for the square of Z1).Creates polynomial terms for specified variables in a data frame
Polynomial terms are named as <variable>_d_<degree> (e.g., Z1_d_2 for the square of Z1).
add_poly_terms(data, Z, degree = 3, poly = TRUE)add_poly_terms(data, Z, degree = 3, poly = TRUE)
data |
Data frame. The data frame containing the variables for which polynomial terms are to be created. |
Z |
Character vector. The names of the variables for which polynomial terms are to be created. |
degree |
Integer. The maximum degree of polynomial terms to be created. Default is 3. |
poly |
Logical. If TRUE, polynomial terms will be created. If FALSE, no polynomial terms will be created. Default is TRUE. |
A list with two components:
data: The modified data frame with added polynomial terms.
new_terms: A character vector of the names of the added polynomial terms (e.g., Z1_d_2).
#'
set.seed(123) data_generator <- function(N){ Z1 <- rnorm(N,0,1) Z2 <- rnorm(N,0,1) X <- rnorm(N, Z1 + Z2, 1) Y <- rnorm(N, Z1 + Z2, 1) df <- data.frame(Z1, Z2, X, Y) return(df) } dat <- data_generator(250) poly_terms <- add_poly_terms(data = dat, Z = c("Z1", "Z2"), degree = 3, poly = TRUE) print(poly_terms$new_terms)set.seed(123) data_generator <- function(N){ Z1 <- rnorm(N,0,1) Z2 <- rnorm(N,0,1) X <- rnorm(N, Z1 + Z2, 1) Y <- rnorm(N, Z1 + Z2, 1) df <- data.frame(Z1, Z2, X, Y) return(df) } dat <- data_generator(250) poly_terms <- add_poly_terms(data = dat, Z = c("Z1", "Z2"), degree = 3, poly = TRUE) print(poly_terms$new_terms)
Build an expanded formula with poly and interaction terms
build_formula(formula, poly_terms = NULL, interaction_terms = NULL)build_formula(formula, poly_terms = NULL, interaction_terms = NULL)
formula |
A base formula in the format Y ~ X | Z1 + Z2 |
poly_terms |
Character vector of polynomial term names |
interaction_terms |
Character vector of interaction term names |
A formula object combining all terms
poly_terms <- c("Z1_d_2", "Z2_d_2") interaction_terms <- c("Z1_int_Z2") formula <- Y ~ X | Z1 + Z2 final_formula <- build_formula(formula, poly_terms, interaction_terms) print(final_formula)poly_terms <- c("Z1_d_2", "Z2_d_2") interaction_terms <- c("Z1_int_Z2") formula <- Y ~ X | Z1 + Z2 final_formula <- build_formula(formula, poly_terms, interaction_terms) print(final_formula)
This function selects the best direction for the CCI test based on cross validation. For the condition Y || X | Z, the function return the recommended formula either Y ~ X | Z or X ~ Y | Z .
CCI.direction( formula, data, method = "rf", folds = 4, nrounds = 600, max_depth = 6, eta = 0.3, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1, poly = TRUE, degree = 3, interaction = TRUE, verbose = FALSE, ... )CCI.direction( formula, data, method = "rf", folds = 4, nrounds = 600, max_depth = 6, eta = 0.3, gamma = 0, colsample_bytree = 1, min_child_weight = 1, subsample = 1, poly = TRUE, degree = 3, interaction = TRUE, verbose = FALSE, ... )
formula |
A formula object specifying the model to be fitted. |
data |
A data frame containing the variables specified in the formula. |
method |
A character string specifying the method to be used for model fitting. Options include "rf" (random forest), "xgboost" (XGBoost), "nnet" (neural network), "gpr" (Gaussian process regression), and "svm" (support vector machine). |
folds |
An integer specifying the number of folds for cross-validation. Default is 4. |
nrounds |
Integer. The number of rounds (trees) for methods like xgboost, ranger, and lightgbm. Default is 600. |
max_depth |
Integer. The maximum depth of the trees for methods like xgboost. Default is 6. |
eta |
Numeric. The learning rate for methods like xgboost. Default is 0.3. |
gamma |
Numeric. The minimum loss reduction required to make a further partition on a leaf node of the tree for methods like xgboost. Default is 0. |
colsample_bytree |
Numeric. The subsample ratio of columns when constructing each tree for methods like xgboost. Default is 1. |
min_child_weight |
Numeric. The minimum sum of instance weight (hessian) needed in a child for methods like xgboost. Default is 1. |
subsample |
Numeric. The proportion of the data to be used for subsampling. Default is 1 (no subsampling). |
poly |
Logical. If TRUE, polynomial terms of the conditioning variables are included in the model. Default is TRUE. |
degree |
Integer. The degree of polynomial terms to include if |
interaction |
Logical. If TRUE, interaction terms of the conditioning variables are included in the model. Default is TRUE. |
verbose |
Logical. If TRUE, prints additional information during the execution. Default is FALSE. |
... |
Additional arguments to be passed to the model fitting function. |
A formula object specifying the selected model direction.
The CCI.tuner function performs a grid search over parameters for a conditional independence test using machine learning model supported by CCI.test. The tuner use the caret package for tuning.
CCI.pretuner( formula, data, method = "rf", metric = "RMSE", validation_method = "cv", folds = 4, training_share = 0.7, tune_length = 4, random_grid = TRUE, samples = 35, poly = TRUE, degree = 3, interaction = TRUE, verboseIter = FALSE, include_explanatory = FALSE, verbose = FALSE, parallel = FALSE, mtry = 1:10, nrounds = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000), eta = seq(0.01, 0.3, by = 0.05), max_depth = 2:6, gamma = c(0, 1, 2, 3), colsample_bytree = c(0.8, 0.9, 1), min_child_weight = c(1, 3), subsample = 1, sigma = seq(0.1, 2, by = 0.3), C = seq(0.1, 2, by = 0.5), ... )CCI.pretuner( formula, data, method = "rf", metric = "RMSE", validation_method = "cv", folds = 4, training_share = 0.7, tune_length = 4, random_grid = TRUE, samples = 35, poly = TRUE, degree = 3, interaction = TRUE, verboseIter = FALSE, include_explanatory = FALSE, verbose = FALSE, parallel = FALSE, mtry = 1:10, nrounds = c(100, 200, 300, 400, 500, 600, 700, 800, 900, 1000), eta = seq(0.01, 0.3, by = 0.05), max_depth = 2:6, gamma = c(0, 1, 2, 3), colsample_bytree = c(0.8, 0.9, 1), min_child_weight = c(1, 3), subsample = 1, sigma = seq(0.1, 2, by = 0.3), C = seq(0.1, 2, by = 0.5), ... )
formula |
Model formula specifying the relationship between dependent and independent variables. |
data |
A data frame containing the variables specified in the formula. |
method |
Character. Specifies the machine learning method to use. Supported methods are random forest "rf", extreme gradient boosting "xgboost" and Support Vector Machine "svm". |
metric |
Character. The performance metric to optimize during tuning. Default is "RMSE". |
validation_method |
Character. Specifies the resampling method. Default is "cv". |
folds |
Integer. The number of folds for cross-validation during the tuning process. Default is 10. |
training_share |
Numeric. For leave-group out cross-validation: the training percentage. Default is 0.7. |
tune_length |
Integer. The number of parameter combinations to try during the tuning process. Default is 10. |
random_grid |
Logical. If TRUE, a random grid search is performed. If FALSE, a full grid search is performed. Default is TRUE. |
samples |
Integer. The number of random samples to take from the grid. Default is 30. |
poly |
Logical. If TRUE, polynomial terms of the conditional variables are included in the model. Default is TRUE. |
degree |
Integer. The degree of polynomial terms to include if poly is TRUE. Default is 3. |
interaction |
Logical. If TRUE, interaction terms of the conditional variables are included in the model. Default is TRUE. |
verboseIter |
Logical. If TRUE, the function will print the tuning process. Default is FALSE. |
include_explanatory |
Logical. If TRUE, given the condition Y || X | Z, the function will include explanatory variable X in the model for Y. Default is FALSE |
verbose |
Logical. If TRUE, the function will print the tuning process. Default is FALSE.. |
parallel |
Logical. If TRUE, the function will use parallel processing. Default is TRUE. |
mtry |
Integer. The number of variables randomly sampled as candidates at each split for random forest. Default is 1:5. |
nrounds |
Integer. The number of rounds (trees) for methods such as xgboost and random forest. Default is seq(50, 200, by = 25). |
eta |
Numeric. The learning rate for xgboost. Default is seq(0.01, 0.3, by = 0.05). |
max_depth |
Integer. The maximum depth of the tree for xgboost. Default is 1:6. |
gamma |
Numeric. The minimum loss reduction required to make a further partition on a leaf node for xgboost. Default is seq(0, 5, by = 1). |
colsample_bytree |
Numeric. The subsample ratio of columns when constructing each tree for xgboost. Default is seq(0.5, 1, by = 0.1). |
min_child_weight |
Integer. The minimum sum of instance weight (hessian) needed in a child for xgboost. Default is 1:5. |
subsample |
Numeric. The subsample ratio of the training. Default is 1. |
sigma |
Numeric. The standard deviation of the Gaussian kernel for Gaussian Process Regression. Default is seq(0.1, 2, by = 0.3). |
C |
Numeric. The regularization parameter for Support Vector Machine. Default is seq(0.1, 2, by = 0.5). |
... |
Additional arguments to pass to the |
A list containing:
best_param: A data frame with the best parameters.
tuning_result: A data frame with all tested parameter combinations and their performance metrics.
warnings: A character vector of warnings issued during tuning.
CCI.test perm.test, print.summary.CCI, plot.CCI, QQplot
set.seed(123) data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), y = rnorm(100)) # Tune random forest parameters result <- CCI.pretuner(formula = y ~ x1 | x2 + x3, data = data, samples = 5, folds = 3, method = "rf")set.seed(123) data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), y = rnorm(100)) # Tune random forest parameters result <- CCI.pretuner(formula = y ~ x1 | x2 + x3, data = data, samples = 5, folds = 3, method = "rf")
The CCI.test function performs a conditional independence test using a specified machine learning model or a custom model provided by the user. It calculates the test statistic, generates a null distribution via permutations, computes p-values, and optionally generates a plot of the null distribution with the observed test statistic.
The 'CCI.test' function serves as a wrapper around the 'perm.test' function
CCI.test( formula = NULL, data, p = 0.5, nperm = 160, nrounds = 600, mtry = NULL, metric = "Auto", method = "rf", choose_direction = FALSE, parametric = FALSE, poly = TRUE, degree = 3, robust = TRUE, subsample = "Auto", subsample_set, min_child_weight = 1, colsample_bytree = 1, eta = 0.3, gamma = 0, max_depth = 6, interaction = TRUE, mode = "numeric_only", metricfunc = NULL, mlfunc = NULL, tail = NA, tune = FALSE, samples = 35, folds = 5, tune_length = 10, k = 15, center = TRUE, scale = TRUE, eps = 1e-15, positive = NULL, kernel = "optimal", distance = 2, seed = NA, random_grid = TRUE, nthread = 2, verbose = FALSE, progress = TRUE, ... )CCI.test( formula = NULL, data, p = 0.5, nperm = 160, nrounds = 600, mtry = NULL, metric = "Auto", method = "rf", choose_direction = FALSE, parametric = FALSE, poly = TRUE, degree = 3, robust = TRUE, subsample = "Auto", subsample_set, min_child_weight = 1, colsample_bytree = 1, eta = 0.3, gamma = 0, max_depth = 6, interaction = TRUE, mode = "numeric_only", metricfunc = NULL, mlfunc = NULL, tail = NA, tune = FALSE, samples = 35, folds = 5, tune_length = 10, k = 15, center = TRUE, scale = TRUE, eps = 1e-15, positive = NULL, kernel = "optimal", distance = 2, seed = NA, random_grid = TRUE, nthread = 2, verbose = FALSE, progress = TRUE, ... )
formula |
Model formula specifying the relationship between dependent and independent variables. (Ex: Y ~ X | Z1 + Z2 for Y || X | Z1, Z2) |
data |
A data frame containing the variables specified in the formula. |
p |
Numeric. Proportion of data used for training the model. Default is 0.5. |
nperm |
Integer. The number of permutations to perform. Default is 60. |
nrounds |
Integer. The number of rounds (trees) for methods 'xgboost' and 'rf' Default is 600. |
mtry |
Number of variables to possibly split at in each node for method 'rf'. Default is NULL (sqrt of number of variables). |
metric |
Character. Specifies the type of data: "Auto", "RMSE" or "Kappa". Default is "Auto". |
method |
Character. Specifies the machine learning method to use. Supported methods are random forest "rf", extreme gradient boosting "xgboost", support vector machine 'svm' and K-nearest neighbour 'KNN'. Default is "rf". |
choose_direction |
Logical. If TRUE, the function will choose the best direction for testing. Default is FALSE. |
parametric |
Logical, indicating whether to compute a parametric p-value instead of the empirical p-value. A parametric p-value assumes that the null distribution is gaussian. Default is FALSE. |
poly |
Logical. If TRUE, polynomial terms of the conditional variables are included in the model. Default is TRUE. |
degree |
Integer. The degree of polynomial terms to include if poly is TRUE. Default is 3. |
robust |
Logical. If TRUE, uses a robust method for permutation. Default is TRUE. |
subsample |
Character. Specifies whether to use automatic subsampling based on sample size ("Auto"), user-defined subsampling ("Yes"), or no subsampling ("No"). Default is "Auto" |
subsample_set |
Numeric. If |
min_child_weight |
Numeric. The minimum sum of instance weight (hessian) needed in a child for methods like xgboost. Default is 1. |
colsample_bytree |
Numeric. The subsample ratio of columns when constructing each tree for methods like xgboost. Default is 1. |
eta |
Numeric. The learning rate for methods like xgboost. Default is 0.3. |
gamma |
Numeric. The minimum loss reduction required to make a further partition on a leaf node of the tree for methods like xgboost. Default is 0. |
max_depth |
Integer. The maximum depth of the trees for methods like xgboost. Default is 6. |
interaction |
Logical. If TRUE, interaction terms of the conditional variables are included in the model. Default is TRUE. |
mode |
Character. Specifies the mode of operation: "numeric_only" or "mixed". Default is "numeric_only". |
metricfunc |
Optional the user can pass a custom function for calculating a performance metric based on the model's predictions. Default is NULL. |
mlfunc |
Optional the user can pass a custom machine learning wrapper function to use instead of the predefined methods. Default is NULL. |
tail |
Character. Specifies whether to calculate left-tailed or right-tailed p-values, depending on the performance metric used. Only applicable if using |
tune |
Logical. If TRUE, the function will perform hyperparameter tuning for the specified machine learning method. Default is FALSE. |
samples |
Integer. Number of hyperparameter combinations used in tuning. Default is 35. |
folds |
Integer. The number of folds for cross-validation during the tuning process. Default is 5. |
tune_length |
Integer. The number of parameter combinations to try during the tuning process. Default is 10. |
k |
Integer. The number of nearest neighbors to use for KNN method. Default is 15. |
center |
Logical. If TRUE, the data will be centered before fitting the model |
scale |
Logical. If TRUE, the data will be scaled before fitting the model. Default is TRUE. |
eps |
Numeric. A small value to avoid division by zero in some calculations. |
positive |
Character. The name of the positive class (KNN) in the data, used for classification tasks. Default is NULL. |
kernel |
Character. The kernel type to use for KNN method. Default is "optimal". |
distance |
Numeric. Parameter of Minkowski distance for the "KNN" method. Default is 2. |
seed |
Integer. Set the seed for reproducing results. Default is NA. |
random_grid |
Logical. If TRUE, a random grid search is performed. If FALSE, a full grid search is performed. Default is TRUE. |
nthread |
Integer. The number of threads to use for parallel processing. Default is 1. |
verbose |
Logical. If TRUE, additional information is printed during the execution of the function. Default is FALSE. |
progress |
Logical. If TRUE, a progress bar is displayed during the permutation process. Default is TRUE. |
... |
Additional arguments to pass to the |
Invisibly returns the result of perm.test, which is an object of class 'CCI' containing the null distribution, observed test statistic, p-values, the machine learning model used, and the data.
perm.test, print.summary.CCI, plot.CCI, CCI.pretuner, QQplot
set.seed(123) data <- data.frame(x1 = stats::rnorm(100), x2 = stats::rnorm(100), y = stats::rnorm(100)) result <- CCI.test(y ~ x1 | x2, data = data, nperm = 25, interaction = FALSE) summary(result)set.seed(123) data <- data.frame(x1 = stats::rnorm(100), x2 = stats::rnorm(100), y = stats::rnorm(100)) result <- CCI.test(y ~ x1 | x2, data = data, nperm = 25, interaction = FALSE) summary(result)
This function verifies that all variables specified in the formula are present in the provided data frame. If any variables are missing, the function will stop and return an error message listing the missing variables.
check_formula(formula, data)check_formula(formula, data)
formula |
Formula. The model formula that specifies the relationship between the dependent and independent variables. |
data |
Data frame. The data frame in which to check for the presence of variables specified in the formula. |
Invisibly returns NULL if all variables are present. Stops with an error if any variables are missing.
This function processes and reformats formula string to ensure it is in the correct format for conditional independence testing. The function checks if the formula uses the '+' operator for additive models and transforms it into a format that includes a conditioning variable separated by '|'.
clean_formula(formula)clean_formula(formula)
formula |
Formula. The model formula that specifies the relationship between the dependent and independent variables, and potentially the conditioning variables. The formula is expected to follow the format |
A reformatted formula in the correct format for conditional independence testing. The returned formula will either retain the original format or be transformed to include conditioning variables.
clean_formula(y ~ x | z + v) clean_formula(y ~ x + z + v)clean_formula(y ~ x | z + v) clean_formula(y ~ x + z + v)
A dataset containing simulated conditional independence test results.
ExponentialNoiseExponentialNoise
A data frame with 600 rows and 3 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Simulated data.
This function calculates p-values based on the comparison of a test statistic against a null distribution. It can perform either empirical or parametric p-value calculations and supports both left-tailed and right-tailed tests.
get_pvalues( dist, test_statistic, parametric = FALSE, tail = c("left", "right") )get_pvalues( dist, test_statistic, parametric = FALSE, tail = c("left", "right") )
dist |
Numeric vector. Represents the null distribution of the test statistic. |
test_statistic |
Numeric. The observed test statistic for which the p-value is to be calculated. |
parametric |
Logical. If TRUE, calculates parametric p-values assuming the null distribution is normal. If FALSE, calculates empirical p-values. Default is FALSE. |
tail |
Character. Specifies whether to calculate left-tailed or right-tailed p-values. Must be either "left" or "right". Default is "left". |
Numeric. The calculated p-value.
set.seed(123) null_dist <- rnorm(1000) observed_stat <- 1.5 p_value <- get_pvalues(null_dist, observed_stat, parametric = FALSE, tail = "right") print(p_value)set.seed(123) null_dist <- rnorm(1000) observed_stat <- 1.5 p_value <- get_pvalues(null_dist, observed_stat, parametric = FALSE, tail = "right") print(p_value)
Get the best parameters after tuning with CCI.tuner
get_tuned_params(tuned_model)get_tuned_params(tuned_model)
tuned_model |
A model object returned from the CCI.pretuner function. This object contains the tuned parameters and other relevant information. |
A named list of tuned parameters specific to the model method (e.g., mtry for random forest, eta, max_depth for xgboost). Returns NULL for unsupported methods.
A dataset containing simulated conditional independence test results.
HardCaseHardCase
A data frame with 500 rows and 3 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Simulated data.
Categorical is defined as factor (and optionally character and logical). This helper is intentionally strict about inputs and does not modify data.
is_categorical_Z_any(sub_data, Z, allow_character = TRUE, allow_logical = TRUE)is_categorical_Z_any(sub_data, Z, allow_character = TRUE, allow_logical = TRUE)
sub_data |
data.frame containing the Z columns. |
Z |
character vector of column names defining the conditioning set. |
allow_character |
logical; treat character as categorical. Default TRUE. |
allow_logical |
logical; treat logical as categorical. Default TRUE. |
logical scalar.
Uses interaction() on the categorical Z columns only. Characters (and logicals, if enabled) are coerced to factor for stable behavior.
make_strata_from_categorical_Z( sub_data, Z, allow_character = TRUE, allow_logical = TRUE )make_strata_from_categorical_Z( sub_data, Z, allow_character = TRUE, allow_logical = TRUE )
sub_data |
data.frame containing Z columns. |
Z |
character vector of Z column names. |
allow_character |
logical; treat character as categorical. Default TRUE. |
allow_logical |
logical; treat logical as categorical. Default TRUE. |
Rows with NA in any stratification column will receive NA strata.
A factor defining strata (same length as nrow(sub_data)).
A dataset containing simulated data from a non-linear transformation followed by categorization.
NonLinearCategorizationNonLinearCategorization
A data frame with 600 rows and 3 variables:
Numeric vector
Numeric vector
Conditioning variable
Simulated data.
A dataset containing simulated data from a non-linear transformation of a multivariate normal distribution.
NonLinNormalNonLinNormal
A data frame with 500 rows and 4 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Simulated data.
A dataset containing simulated data with uniform noise.
NonLinNormalZs_d0NonLinNormalZs_d0
A data frame with 1000 rows and 15 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Simulated data.
A dataset containing simulated data with uniform noise.
NonLinNormalZs_d05NonLinNormalZs_d05
A data frame with 1000 rows and 15 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Conditioning variable
Simulated data.
A dataset containing simulated data from a multivariate normal distribution.
NormalDataNormalData
A data frame with 400 rows and 4 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
@source Simulated data.
Permutation Test for Conditional Independence
perm.test( formula, data, p = 0.5, nperm = 160, subsample = 1, metric = "RMSE", method = "rf", nrounds = 600, mtry = NULL, parametric = FALSE, tail = NA, robust = TRUE, metricfunc = NULL, mlfunc = NULL, nthread = 1, progress = TRUE, k = 15, center = TRUE, scale = TRUE, eps = 1e-15, positive = NULL, kernel = "optimal", distance = 2, ... )perm.test( formula, data, p = 0.5, nperm = 160, subsample = 1, metric = "RMSE", method = "rf", nrounds = 600, mtry = NULL, parametric = FALSE, tail = NA, robust = TRUE, metricfunc = NULL, mlfunc = NULL, nthread = 1, progress = TRUE, k = 15, center = TRUE, scale = TRUE, eps = 1e-15, positive = NULL, kernel = "optimal", distance = 2, ... )
formula |
Model formula or DAGitty object specifying the relationship between dependent and independent variables. |
data |
A data frame containing the variables specified in the formula. |
p |
Proportion of data to use for training the model. Default is 0.5. |
nperm |
Number of permutations to perform. Default is 160. |
subsample |
The proportion of the data to be used. Default is 1 (no subsampling). |
metric |
Type of metric: "RMSE", "Kappa" or "LogLoss". Default is 'RMSE'. |
method |
The machine learning method to use for the learner. Supported methods include "rf", "xgboost", "KNN" and "svm". Default is "rf". |
nrounds |
Number of rounds (trees) for methods 'xgboost' and 'rf'. Default is 600. |
mtry |
Number of variables to possibly split at in each node for method 'rf'. Default is NULL (sqrt of number of variables). |
parametric |
Logical. If TRUE, a parametric p-value is calculated instead of an empirical p-value. Default is FALSE. |
tail |
Specifies whether the test is one-tailed ("left" or "right") or two-tailed. Default is NA. |
robust |
Logical. If TRUE, uses a robust method for permutation. Default is TRUE. |
metricfunc |
An optional custom function to calculate the performance metric based on the model's predictions. Default is NULL. |
mlfunc |
An optional custom machine learning function to use instead of the predefined methods. Default is NULL. |
nthread |
Integer. The number of threads to use for parallel processing for method 'rf' and 'xgboost'. Default is 1. |
progress |
Logical. If TRUE, a progress bar is displayed during the permutation process. Default is TRUE. |
k |
Integer. The number of nearest neighbors for the "KNN" method. Default is 15. |
center |
Logical. If TRUE, the data is centered before model fitting. Default is TRUE. |
scale |
Logical. If TRUE, the data is scaled before model fitting. Default is TRUE. |
eps |
Numeric. A small value added to avoid division by zero. Default is 1e-15. |
positive |
Character vector. Specifies which levels of a factor variable should be treated as positive class in classification tasks. Default is NULL. |
kernel |
Character string specifying the kernel type for method option "KNN" . Possible choices are "rectangular" (which is standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian" and "optimal". Default is "optimal". |
distance |
Numeric. Parameter of Minkowski distance for the "KNN" method. Default is 2. |
... |
Additional arguments to pass to the machine learning model fitting function. |
An object of class 'CCI' containing the null distribution, observed test statistic, p-values, the machine learning model used, and the data.
print.CCI, summary.CCI,
plot.CCI, QQplot
set.seed(123) dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100), y = rnorm(100)) perm.test(y ~ x1 | x2 + x3 + x4, data = dat, nperm = 25)set.seed(123) dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100), y = rnorm(100)) perm.test(y ~ x1 | x2 + x3 + x4, data = dat, nperm = 25)
Permutes values of 'x' within each stratum. By default, rows with NA in 'x' or 'strata' are left unchanged to preserve alignment.
permute_within_strata(x, strata, seed = NULL, na_action = c("keep", "drop"))permute_within_strata(x, strata, seed = NULL, na_action = c("keep", "drop"))
x |
vector to permute. |
strata |
factor-like vector defining strata. |
seed |
optional seed for reproducibility. |
na_action |
how to handle NA rows: "keep" (default) keeps them fixed, "drop" removes them (returns shorter vector). |
x_star permuted within strata.
set.seed(123) x <- 1:10 strata <- rep(letters[1:2], each = 5) x_permuted <- permute_within_strata(x, strata, seed = 123)set.seed(123) x <- 1:10 strata <- rep(letters[1:2], each = 5) x_permuted <- permute_within_strata(x, strata, seed = 123)
Plot for CCI testing
## S3 method for class 'CCI' plot( x, fill_color = "lightblue", title.size = 14, axis.text.x = 13, axis.text.y = 13, strip.text.x = 13, strip.text.y = 13, legend.text = 13, legend.title = 13, axis.title.x = 13, axis.title.y = 13, base_size = 13, ... )## S3 method for class 'CCI' plot( x, fill_color = "lightblue", title.size = 14, axis.text.x = 13, axis.text.y = 13, strip.text.x = 13, strip.text.y = 13, legend.text = 13, legend.title = 13, axis.title.x = 13, axis.title.y = 13, base_size = 13, ... )
x |
Object of class 'CCI' |
fill_color |
Color for the histogram fill |
title.size |
Size of the plot title |
axis.text.x |
Size of x-axis text |
axis.text.y |
Size of y-axis text |
strip.text.x |
Size of x-axis strip text |
strip.text.y |
Size of y-axis strip text |
legend.text |
Size of legend text |
legend.title |
Size of legend title |
axis.title.x |
Size of x-axis title |
axis.title.y |
Size of y-axis title |
base_size |
Base font size |
... |
Additional arguments to ggplot2 |
A plot of the null distribution and the test statistic in ggplot2 format.
print.CCI, summary.CCI,
plot.CCI, perm.test
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100)) cci <- CCI.test(y ~ x1 + x2, data = dat, interaction = FALSE, nperm = 30) plot(cci)dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100)) cci <- CCI.test(y ~ x1 + x2, data = dat, interaction = FALSE, nperm = 30) plot(cci)
A dataset containing simulated data from a Poisson distribution.
PoissonNoisePoissonNoise
A data frame with 1000 rows and 4 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Simulated data.
A dataset containing simulated data from a polynomial relationship.
PolyDataPolyData
A data frame with 600 rows and 4 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Simulated data.
Print and summary methods for the CCI class
## S3 method for class 'summary.CCI' print(x, ...)## S3 method for class 'summary.CCI' print(x, ...)
x |
Object of class 'CCI' |
... |
Additional arguments to print/summary |
The print methods have no return value, the summary methods return an object of class 'summary.CCI'.
QQ-plot for multiple testing in CCI
QQplot( object, title.size = 14, axis.text.x = 13, axis.text.y = 13, strip.text.x = 13, strip.text.y = 13, legend.text = 13, legend.title = 13, axis.title.x = 13, axis.title.y = 13, progress = TRUE, ... )QQplot( object, title.size = 14, axis.text.x = 13, axis.text.y = 13, strip.text.x = 13, strip.text.y = 13, legend.text = 13, legend.title = 13, axis.title.x = 13, axis.title.y = 13, progress = TRUE, ... )
object |
Object of class 'CCI' |
title.size |
Size of the plot title |
axis.text.x |
Size of x-axis text |
axis.text.y |
Size of y-axis text |
strip.text.x |
Size of x-axis strip text |
strip.text.y |
Size of y-axis strip text |
legend.text |
Size of legend text |
legend.title |
Size of legend title |
axis.title.x |
Size of x-axis title |
axis.title.y |
Size of y-axis title |
progress |
Logical indicating whether to show progress during computation |
... |
Additional arguments to pass to the |
A QQ-plot of the p-values in ggplot2 format.
print.CCI, summary.CCI,
plot.CCI, perm.test
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100)) cci <- CCI.test(y ~ x1 | x2, data = dat, nperm = 25, interaction = FALSE) QQplot(cci)dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rnorm(100)) cci <- CCI.test(y ~ x1 | x2, data = dat, nperm = 25, interaction = FALSE) QQplot(cci)
This function generates the test statistic or a null distribution through permutation for conditional independence testing. It supports various machine learning methods, including random forests, extreme gradient boosting, and allows for custom metric functions and model fitting functions.
test.gen( formula, data, method = "rf", metric = "RMSE", nperm = 160, subsample = 1, p = 0.5, nrounds = 600, mtry = NULL, nthread = 1, permutation = FALSE, robust = TRUE, metricfunc = NULL, mlfunc = NULL, progress = TRUE, center = TRUE, scale = TRUE, eps = 1e-15, k = 15, positive = NULL, kernel = "optimal", distance = 2, ... )test.gen( formula, data, method = "rf", metric = "RMSE", nperm = 160, subsample = 1, p = 0.5, nrounds = 600, mtry = NULL, nthread = 1, permutation = FALSE, robust = TRUE, metricfunc = NULL, mlfunc = NULL, progress = TRUE, center = TRUE, scale = TRUE, eps = 1e-15, k = 15, positive = NULL, kernel = "optimal", distance = 2, ... )
formula |
Formula specifying the relationship between dependent and independent variables. |
data |
Data frame. The data containing the variables used. |
method |
Character. The modeling method to be used. Options include "xgboost" for gradient boosting, or "rf" for random forests or "svm" for Support Vector Machine. |
metric |
Character. The type of metric: can be "RMSE", "Kappa" or "LogLoss". Default is 'RMSE' |
nperm |
Integer. The number of generated Monte Carlo samples. Default is 160. |
subsample |
Numeric. The proportion of the data to be used for subsampling. Default is 1 (no subsampling). |
p |
Numeric. The proportion of the data to be used for training. The remaining data will be used for testing. Default is 0.5. |
nrounds |
Integer. The number of rounds (trees) for methods like 'xgboost' and 'rf'. Default is 600. |
mtry |
Integer. The number of variables to possibly split at in each node for method 'rf'. Default is the rounded down square root of numbers of columns in data. |
nthread |
Integer. The number of threads to use for parallel processing. Only relevant for methods 'rf' and 'xgboost'. Default is 1. |
permutation |
Logical. Whether to perform permutation of the 'X' variable. Used to generate a null distribution. Default is FALSE. |
robust |
Logical. If TRUE, automatically performs stratified permutation if all conditional variables are factor or categorical. Default is TRUE. |
metricfunc |
Function. A custom metric function provided by the user. It must take arguments: |
mlfunc |
Function. A custom machine learning function provided by the user. The function must have the arguments: |
progress |
Logical. A logical value indicating whether to show a progress bar during when building the null distribution. Default is TRUE. |
center |
Logical. If TRUE, the data is centered before model fitting. Default is TRUE. |
scale |
Logical. If TRUE, the data is scaled before model fitting. Default is TRUE. |
eps |
Numeric. A small value added to avoid division by zero. Only relevant for method 'KNN'. Default is 1e-15. |
k |
Integer. The number of nearest neighbors for the "KNN" method. Default is 15. |
positive |
Character vector. Only relevant for method 'KNN'. Specifies which levels of a factor variable should be treated as positive class in classification tasks. Default is NULL. |
kernel |
Character. Only relevant for method 'KNN'. Specifies the kernel type for method option "KNN" . Possible choices are "rectangular" (which is standard unweighted knn), "triangular", "epanechnikov" (or beta(2,2)), "biweight" (or beta(3,3)), "triweight" (or beta(4,4)), "cos", "inv", "gaussian" and "optimal". Default is "optimal". |
distance |
Numeric. Parameter of Minkowski distance for the "KNN" method. Default is 2. |
... |
Additional arguments to pass to the machine learning wrapper functions |
A list containing the test distribution.
set.seed(123) data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100), y = rnorm(100)) result <- test.gen(formula = y ~ x1 | x2 + x3 + x4, metric = "RMSE", nperm = 50, data = data) hist(result$distribution)set.seed(123) data <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), x4 = rnorm(100), y = rnorm(100)) result <- test.gen(formula = y ~ x1 | x2 + x3 + x4, metric = "RMSE", nperm = 50, data = data) hist(result$distribution)
Convert CI-style formula Y ~ X | Z into regression-style Y ~ X + Z
unclean_formula(f)unclean_formula(f)
f |
A formula of the form Y ~ X | Z1 + Z2, or already Y ~ X + Z1 + Z2. |
A standard formula of the form Y ~ X + Z1 + Z2.
A dataset containing simulated data with uniform noise.
UniformNoise_largeUniformNoise_large
A data frame with 20000 rows and 4 variables:
Numeric vector
Numeric vector
Conditioning variable
Conditioning variable
Simulated data.
k-Nearest Neighbors (KNN) wrapper for CCI (kknn-based)
wrapper_knn( formula, data, train_indices, test_indices, metric, metricfunc = NULL, k = 15, eps = 1e-15, positive = NULL, kernel = "optimal", distance = 2, ... )wrapper_knn( formula, data, train_indices, test_indices, metric, metricfunc = NULL, k = 15, eps = 1e-15, positive = NULL, kernel = "optimal", distance = 2, ... )
formula |
Model formula |
data |
Data frame |
train_indices |
Indices for training rows |
test_indices |
Indices for test rows |
metric |
Performance metric: "RMSE" (regression), "Kappa" (classification), or "LogLoss" (classification) |
metricfunc |
Optional custom metric function: function(actual, predictions, ...) |
k |
Integer, number of neighbors (default 15) |
eps |
Small value to avoid log(0) in LogLoss calculations. Default is 1e-15. |
positive |
Character. The positive class label for binary classification (used in LogLoss). Default is NULL. |
kernel |
Character. Weighting kernel for kknn. Default "optimal". |
distance |
Numeric. Minkowski distance parameter. 2 = Euclidean. Default 2. |
... |
Additional arguments passed to kknn::kknn (e.g., ykernel, na.action) |
Numeric performance metric
Random Forest wrapper for CCI
wrapper_ranger( formula, data, train_indices, test_indices, metric, metricfunc = NULL, nthread = 1, mtry = NULL, num.trees, eps = 1e-15, ... )wrapper_ranger( formula, data, train_indices, test_indices, metric, metricfunc = NULL, nthread = 1, mtry = NULL, num.trees, eps = 1e-15, ... )
formula |
Model formula specifying the dependent and independent variables. |
data |
Data frame containing the dataset to be used for training and testing the model. |
train_indices |
A vector of indices specifying the rows in |
test_indices |
A vector of indices specifying the rows in |
metric |
Type of metric ("RMSE", "Kappa" or "Log Loss") |
metricfunc |
Optional user-defined function to calculate a custom performance metric. This function should take the arguments |
nthread |
Integer. The number of threads to use for parallel processing. Default is 1. |
mtry |
Integer. The number of variables to possibly split at in each node. Default is the square root of the number of columns in |
num.trees |
Integer. The number of trees to grow in the random forest. |
eps |
Small value to avoid log(0) in LogLoss calculations. Default is 1e-15. |
... |
Additional arguments passed to the |
A numeric value representing the performance metric of the model on the test set.
SVM wrapper for CCI
wrapper_svm( formula, data, train_indices, test_indices, metric, metricfunc = NULL, eps = 1e-15, ... )wrapper_svm( formula, data, train_indices, test_indices, metric, metricfunc = NULL, eps = 1e-15, ... )
formula |
Model formula |
data |
Data frame |
train_indices |
Indices for training data |
test_indices |
Indices for testing data |
metric |
Type of metric ("RMSE", "Kappa" or "Log Loss") |
metricfunc |
Optional user-defined function to calculate a custom performance metric. |
eps |
Small value to avoid log(0) in LogLoss calculations. Default is 1e-15. |
... |
Additional arguments passed to e1071::svm |
Performance metric (RMSE for continuous, Kappa for classification)
Extreme Gradient Boosting wrapper for CCI
wrapper_xgboost( formula, data, train_indices, test_indices, metric, nrounds = 500, metricfunc = NULL, nthread = 1, eps = 1e-15, subsample = 1, ... )wrapper_xgboost( formula, data, train_indices, test_indices, metric, nrounds = 500, metricfunc = NULL, nthread = 1, eps = 1e-15, subsample = 1, ... )
formula |
Model formula |
data |
Data frame |
train_indices |
Indices for training data |
test_indices |
Indices for training data |
metric |
Type of metric ("RMSE", "Kappa" or "Log Loss") |
nrounds |
Number of boosting rounds |
metricfunc |
A user specific metric function which have the arguments data, model test_indices and test_matrix and returns a numeric value |
nthread |
Integer. Number of threads to use for parallel computation during model training in XGBoost. Default is 1. |
eps |
Small value to avoid log(0) in LogLoss calculations. Default is 1e-15. |
subsample |
Numeric. The proportion of the data to be used for subsampling. Default is 1 (no subsampling). |
... |
Additional arguments passed to xgb.train |
Performance metric