ML_tools package

ML_tools.classifiers module

ML_tools.classifiers.RFPipeline_PCA(df1, df2, n_iter, cv)

Creates a pipeline that performs Random Forest classification on the data with Principal Component Analysis (PCA). The input data is split into training and test sets, then a Randomized Search (with cross-validation) is performed to find the best hyperparameters for the model.

Parameters:
  • df1 (pandas.DataFrame) – Dataframe containing the features.

  • df2 (pandas.DataFrame) – Dataframe containing the labels.

  • n_iter (int) – Number of parameter settings sampled by the Randomized Search.

  • cv (int) – Number of cross-validation folds to use.

Returns:

pipeline_PCA – A fitted pipeline (includes PCA, hyperparameter optimization using RandomizedSearchCV and a Random Forest Classifier model).

Return type:

sklearn.pipeline.Pipeline

ML_tools.classifiers.RFPipeline_noPCA(df1, df2, n_iter, cv)

Creates a pipeline that performs Random Forest classification on the data without Principal Component Analysis (PCA). The input data is split into training and test sets, then a Randomized Search (with cross-validation) is performed to find the best hyperparameters for the model.

Parameters:
  • df1 (pandas.DataFrame) – Dataframe containing the features.

  • df2 (pandas.DataFrame) – Dataframe containing the labels.

  • n_iter (int) – Number of parameter settings sampled by the Randomized Search.

  • cv (int) – Number of cross-validation folds to use.

Returns:

pipeline_simple – A fitted pipeline (includes hyperparameter optimization using RandomizedSearchCV and a Random Forest Classifier model).

Return type:

sklearn.pipeline.Pipeline
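
The two Random Forest pipelines documented above can be sketched with scikit-learn as follows. This is a minimal illustration, not the package's implementation: the helper name build_rf_pipeline, the scaling step, the PCA variance threshold, the search space and the synthetic data are all assumptions made for the example, and the labels are passed as a pandas Series for simplicity.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_rf_pipeline(use_pca, n_iter, cv, random_state=0):
    """Hypothetical helper: scaling, optional PCA, then a randomized search over a Random Forest."""
    # Illustrative search space; the real one is not given in the documentation.
    param_distributions = {
        "n_estimators": [10, 20, 50],
        "max_depth": [2, 4, None],
        "min_samples_leaf": [1, 2, 4],
    }
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=random_state),
        param_distributions=param_distributions,
        n_iter=n_iter,  # number of parameter settings sampled
        cv=cv,          # number of cross-validation folds
        random_state=random_state,
    )
    steps = [("scaler", StandardScaler())]
    if use_pca:
        # Keep enough components to explain 95% of the variance (assumed threshold).
        steps.append(("pca", PCA(n_components=0.95)))
    steps.append(("search", search))
    return Pipeline(steps)


# Synthetic stand-in data: 60 subjects, 6 features, two balanced classes.
rng = np.random.default_rng(0)
labels = pd.Series(np.repeat([0, 1], 30), name="group")
features = pd.DataFrame(
    rng.normal(size=(60, 6)) + labels.to_numpy()[:, None],
    columns=[f"region_{i}" for i in range(6)],
)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

pipeline_PCA = build_rf_pipeline(True, n_iter=3, cv=3).fit(X_train, y_train)
pipeline_simple = build_rf_pipeline(False, n_iter=3, cv=3).fit(X_train, y_train)

print(pipeline_PCA.named_steps["search"].best_params_)
print(pipeline_simple.score(X_test, y_test))
```

Placing the search object as the last pipeline step mirrors the documented return value, a fitted Pipeline that contains the hyperparameter optimization. Wrapping the whole pipeline inside the search would be an equally valid design, but it would return a RandomizedSearchCV object instead.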

ML_tools.classifiers.SVM_feature_reduction(df1, df2)

Performs SVM classification on the data. The input data is split into training and test sets, then a Grid Search (with cross-validation) is performed to find the best hyperparameters for the model. Feature reduction is implemented in this function.

Parameters:
  • df1 (pandas.DataFrame) – Dataframe containing the features.

  • df2 (pandas.DataFrame) – Dataframe containing the labels.

Returns:

grid – A fitted grid search object with the best parameters for the SVM model using the selected features.

Return type:

sklearn.model_selection.GridSearchCV

ML_tools.classifiers.SVM_simple(df1, df2, ker: str)

Performs SVM classification on the data. The input data is split into training and test sets, then a Grid Search (with cross-validation) is performed to find the best hyperparameters for the model. Feature reduction is not implemented in this function.

Parameters:
  • df1 (pandas.DataFrame) – Dataframe containing the features.

  • df2 (pandas.DataFrame) – Dataframe containing the labels.

  • ker (str) – Kernel type.

Returns:

grid – A fitted grid search object with the best parameters for the SVM model.

Return type:

sklearn.model_selection.GridSearchCV
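
The two SVM functions above both return a fitted GridSearchCV object. A minimal scikit-learn sketch of that pattern is shown below. The helper name build_svm_grid, the parameter grid, the scaling step and the use of SelectKBest as the feature-reduction method are assumptions for illustration; the documentation does not state which feature-reduction technique the package uses.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_svm_grid(ker, reduce_features=False, cv=3):
    """Hypothetical helper: grid search over an SVM, optionally with feature selection."""
    steps = [("scaler", StandardScaler())]
    param_grid = {"svc__C": [0.1, 1, 10]}  # illustrative values only
    if reduce_features:
        # Assumed reduction method: keep the k features with the best ANOVA F-score.
        steps.append(("select", SelectKBest(score_func=f_classif)))
        param_grid["select__k"] = [2, 4]
    steps.append(("svc", SVC(kernel=ker)))
    if ker == "rbf":
        param_grid["svc__gamma"] = [0.01, 0.1]
    return GridSearchCV(Pipeline(steps), param_grid=param_grid, cv=cv)


# Synthetic stand-in data: 60 subjects, 6 features, two balanced classes.
rng = np.random.default_rng(0)
labels = pd.Series(np.repeat([0, 1], 30), name="group")
features = pd.DataFrame(
    rng.normal(size=(60, 6)) + labels.to_numpy()[:, None],
    columns=[f"region_{i}" for i in range(6)],
)
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

grid_reduced = build_svm_grid("linear", reduce_features=True).fit(X_train, y_train)
grid_simple = build_svm_grid("rbf").fit(X_train, y_train)

print(grid_reduced.best_params_)
print(grid_simple.score(X_test, y_test))
```

Putting the selector inside the pipeline means the features are re-selected within each cross-validation fold, which avoids leaking information from the validation folds into the selection step.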

ML_tools.feature_extractor module

ML_tools.feature_extractor.feature_extractor(image_filepaths, masks_filepaths)

Uses the MATLAB Engine API to run the feature_extractor.m function. From the outputs of that function, it builds two dataframes containing the extracted features and a series containing the labels of the respective subjects.

Parameters:
  • image_filepaths (list) – Paths to the diffusion parameter maps.

  • masks_filepaths (list) – Paths to the segmentations in diffusion space.

Returns:

  • df_mean (pandas.DataFrame) – Mean of pixel values for each region (columns) and each subject (rows).

  • df_std (pandas.DataFrame) – Standard deviation of pixel values for each region (columns) and each subject (rows).

  • group (pandas.Series) – Subject labels.

ML_tools.feature_extractor.feature_extractor_par(image_filepaths, masks_filepaths)

Uses the MATLAB Engine API to run the feature_extractor_par.m function (parallelized version of feature_extractor.m). From the outputs of that function, it builds two dataframes containing the extracted features and a series containing the labels of the respective subjects.

Parameters:
  • image_filepaths (list) – Paths to the diffusion parameter maps.

  • masks_filepaths (list) – Paths to the segmentations in diffusion space.

Returns:

  • df_mean (pandas.DataFrame) – Mean of pixel values for each region (columns) and each subject (rows).

  • df_std (pandas.DataFrame) – Standard deviation of pixel values for each region (columns) and each subject (rows).

  • group (pandas.Series) – Subject labels.
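
The shape of the three outputs described above can be illustrated without MATLAB. The sketch below is a NumPy and pandas stand-in, not the package's code: the helper name regional_stats, the integer region labels, the column names and the tiny synthetic arrays are assumptions. It uses NumPy's default population standard deviation, which may differ from the normalization used by the MATLAB function.

```python
import numpy as np
import pandas as pd


def regional_stats(images, masks, region_labels):
    """Hypothetical stand-in: per-region mean and std of pixel values for each subject."""
    mean_rows, std_rows = [], []
    for image, mask in zip(images, masks):
        # Select the pixels that belong to each region, then reduce them.
        mean_rows.append([image[mask == r].mean() for r in region_labels])
        std_rows.append([image[mask == r].std() for r in region_labels])
    columns = [f"region_{r}" for r in region_labels]
    df_mean = pd.DataFrame(mean_rows, columns=columns)
    df_std = pd.DataFrame(std_rows, columns=columns)
    return df_mean, df_std


# Two tiny synthetic "subjects": 2x4 maps sharing one segmentation with regions 1 and 2.
mask = np.array([[1, 1, 2, 2], [1, 1, 2, 2]])
images = [
    np.array([[1.0, 1.0, 3.0, 5.0], [1.0, 1.0, 3.0, 5.0]]),
    np.array([[2.0, 4.0, 6.0, 6.0], [2.0, 4.0, 6.0, 6.0]]),
]

df_mean, df_std = regional_stats(images, [mask, mask], region_labels=[1, 2])
group = pd.Series([0, 1], name="group")  # made-up subject labels

print(df_mean)
print(df_std)
```

Each row corresponds to one subject and each column to one region, matching the layout described for df_mean and df_std.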

ML_tools.reading module

ML_tools.reading.data_path(dir, subdir)

Creates a list collecting absolute paths to the files contained in a sub-folder of a parent folder.

Parameters:
  • dir (str) – Name of the parent folder.

  • subdir (str) – Name of the sub-folder.

Returns:

filepaths – Paths to the files contained in the specified sub-folder.

Return type:

list
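
A function with the behaviour described above could look like the following standard-library sketch. The name list_file_paths, the sorted ordering and the choice to skip nested directories are assumptions for illustration; the documentation does not specify them.

```python
import os
import tempfile


def list_file_paths(parent_dir, sub_dir):
    """Hypothetical helper: absolute paths of the files inside parent_dir/sub_dir."""
    folder = os.path.abspath(os.path.join(parent_dir, sub_dir))
    # Keep regular files only and sort them so the order is reproducible.
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
    )


# Build a throwaway folder tree to demonstrate the call.
with tempfile.TemporaryDirectory() as parent:
    os.makedirs(os.path.join(parent, "maps"))
    for name in ("sub02.nii", "sub01.nii"):
        open(os.path.join(parent, "maps", name), "w").close()
    filepaths = list_file_paths(parent, "maps")

print([os.path.basename(path) for path in filepaths])
# prints ['sub01.nii', 'sub02.nii']
```

Sorting matters when a list of image paths is later paired by position with a list of mask paths, since os.listdir returns entries in arbitrary order.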

ML_tools.score_and_error module

ML_tools.score_and_error.performance_scores(y_test, y_predicted, y_probability, confidence_int=0.683)

Computes and displays various performance scores (including accuracy, precision, recall and AUC) with related errors for binary classification models.

Parameters:
  • y_test (numpy.ndarray) – True labels of test set.

  • y_predicted (numpy.ndarray) – Predicted labels of test set.

  • y_probability (numpy.ndarray) – Predicted label probabilities of test set.

  • confidence_int (float, optional) – Confidence level of the interval used for error estimation. Default value is 0.683 (approximately 1 sigma).

Returns:

scores – Dictionary containing various performance scores (and their related errors) including: Accuracy, Precision, Recall and AUC.

Return type:

dict
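
One common way to attach an error to scores such as accuracy, precision and recall is the normal-approximation (binomial) interval, sketched below with the standard library only. This is an assumption made for illustration: the documentation does not state which error estimate the package uses, the helper name binary_scores_with_errors is hypothetical, and AUC is left out of the sketch.

```python
from statistics import NormalDist


def binary_scores_with_errors(y_test, y_predicted, confidence_level=0.683):
    """Hypothetical helper: (score, error) pairs using a normal-approximation interval."""
    # Two-sided z value for the requested confidence level (about 1 for 0.683).
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    pairs = list(zip(y_test, y_predicted))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def proportion(successes, trials):
        # Binomial proportion and its half-width; assumes trials > 0.
        value = successes / trials
        error = z * (value * (1 - value) / trials) ** 0.5
        return value, error

    return {
        "Accuracy": proportion(tp + tn, len(pairs)),
        "Precision": proportion(tp, tp + fp),
        "Recall": proportion(tp, tp + fn),
    }


# Made-up labels: 4 true positives, 4 true negatives, 1 false positive, 1 false negative.
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_predicted = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

scores = binary_scores_with_errors(y_test, y_predicted)
scores_95 = binary_scores_with_errors(y_test, y_predicted, confidence_level=0.95)

for name, (value, error) in scores.items():
    print(f"{name}: {value:.3f} +/- {error:.3f}")
```

Raising the confidence level from 0.683 to 0.95 widens every interval by roughly a factor of two, because the z value grows from about 1 to about 1.96.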