QuickClus

class quickclus.QuickClus(random_state: Optional[int] = None, n_neighbors: int = 15, min_cluster_size: int = 15, min_samples: Optional[int] = None, threshold_combine_rare_levels: float = 0.0, n_components: Optional[int] = None, scaler_type_numerical: str = 'standard', imputer_strategy_numerical: str = 'mean', transformation_type_numerical: str = 'power', umap_combine_method: str = 'intersection', n_neighbors_intersection_union: Optional[int] = None, verbose: bool = False)[source]

Creates UMAP embeddings and HDBSCAN clusters from a pandas DataFrame with mixed (numerical and categorical) data.

Parameters

random_state (int, default = None) – Random state for both UMAP and numpy.random. If None, UMAP runs with Numba in multicore mode, but results may vary between runs. Setting a fixed seed makes the stochastic parts of UMAP reproducible.
n_neighbors (int, default = 15) – Number of neighbors for UMAP. Higher values generate higher densities at the expense of more computation.
min_cluster_size (int, default = 15) – Minimum cluster size for HDBSCAN: the minimum number of points needed to form a cluster.
min_samples (int, default = None) – Number of samples used by HDBSCAN. The larger this value, the more points are declared noise and the more clusters are restricted to dense areas. If None, min_samples = min_cluster_size.
threshold_combine_rare_levels (float, default = 0.0) – To avoid an excessive increase in dimensionality when one-hot encoding categorical variables, rare levels can be combined. Levels whose proportion of the data falls below this value are combined into “other”.
n_components (int, default = None) – Number of components for UMAP, that is, the number of dimensions to reduce the data to. Ideally, this is a value that preserves enough information to form meaningful clusters. If None, the logarithm of the total number of features is used.
imputer_strategy_numerical (str, default = "mean") – Imputation strategy for numerical variables. The values can be: “mean”, “median”, “most_frequent”
scaler_type_numerical (str, default = "standard") – Scaler strategy for numerical variables. The values can be: “robust” (RobustScaler), “standard” (StandardScaler)
transformation_type_numerical (str, default = "power") – Transformation strategy for numerical variables. The values can be: “power” (PowerTransformer), “quantile” (QuantileTransformer)
umap_combine_method (str, default = "intersection") – Method used to combine the embedding spaces. Options include: intersection, union, contrast, intersection_union_mapper. The last combines both the intersection and the union of the embeddings. See: https://umap-learn.readthedocs.io/en/latest/composing_models.html
n_neighbors_intersection_union (int, default = None) – Number of neighbors UMAP uses to combine the embeddings when umap_combine_method = “intersection_union_mapper”. If None, n_neighbors_intersection_union = n_neighbors.
verbose (bool, default = False) – Level of verbosity to print when fitting and predicting. If False, only warnings are shown.
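
As a rough illustration of what threshold_combine_rare_levels controls, the sketch below merges infrequent category levels in plain Python. It is not the QuickClus implementation: the helper name combine_rare_levels, the exact spelling of the "other" label, and the strict less-than comparison at the threshold are all assumptions made for the example.

```python
from collections import Counter


def combine_rare_levels(values, threshold, other_label="other"):
    """Replace levels whose relative frequency is below `threshold`.

    Hypothetical helper: the comparison (strictly below) and the
    replacement label are assumptions, not QuickClus internals.
    """
    counts = Counter(values)
    total = len(values)
    rare = {level for level, n in counts.items() if n / total < threshold}
    return [other_label if value in rare else value for value in values]


# 100 observations: two common levels and two levels that appear once each.
colors = ["red"] * 60 + ["blue"] * 38 + ["teal", "plum"]
combined = combine_rare_levels(colors, threshold=0.02)
print(Counter(combined))
```

With a threshold of 0.0 nothing falls below the cut-off, so every level is kept and one-hot encoding produces one column per original level.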

assing_results(data)[source]

Assigns hdb_model’s labels to the original data

Parameters: data (pandas.DataFrame) – Original pandas DataFrame
Returns: results – new pandas dataframe with the calculated clusters
Return type: pandas.DataFrame

cluster_summary(results_df, metric='mean', include_cat=False)[source]

Creates a per-cluster summary of the numerical and, optionally, categorical features

Parameters

results_df (pandas.DataFrame) – pandas dataframe with a cluster column
metric (str, default = "mean") – metric to use in the summary (mean/median/max/min)
include_cat (bool, default = False) – include the mode of the categorical variables

Returns

df_summary – New dataframe with the summary

Return type

pandas.DataFrame
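
To make the idea of a per-cluster summary concrete, here is a minimal plain-Python sketch that aggregates numerical columns by cluster label. It only mirrors the documented metric options (mean/median/max/min). The function name summarize_clusters, the "cluster" key, and the use of dictionaries instead of a DataFrame are assumptions for illustration, and the real method may differ in layout and in how it treats noise points.

```python
from collections import defaultdict
from statistics import mean, median

# Aggregations matching the documented `metric` options.
METRICS = {"mean": mean, "median": median, "max": max, "min": min}


def summarize_clusters(rows, numeric_columns, metric="mean", label_key="cluster"):
    """Aggregate each numeric column per cluster label (hypothetical helper)."""
    aggregate = METRICS[metric]
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[label_key]].append(row)
    return {
        label: {col: aggregate([r[col] for r in members]) for col in numeric_columns}
        for label, members in sorted(grouped.items())
    }


rows = [
    {"cluster": 0, "age": 30, "income": 40.0},
    {"cluster": 0, "age": 40, "income": 60.0},
    {"cluster": 1, "age": 60, "income": 90.0},
    {"cluster": -1, "age": 20, "income": 10.0},  # HDBSCAN labels noise as -1
]
summary = summarize_clusters(rows, ["age", "income"], metric="mean")
for label, stats in summary.items():
    print(label, stats)
```

In this sketch the noise label -1 is summarized like any other group; whether to keep or drop it before interpreting the summary is a choice left to the reader.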

describe_cluster(results_df, clusters=[0], columns_analyze_numerical=[], columns_analyze_categorical=[], metric='mean')[source]

Describes the selected clusters

Parameters

results_df (pandas.DataFrame) – pandas dataframe with a cluster column
clusters (list) – list with clusters to describe (int)
columns_analyze_numerical (list) – list of numerical columns to describe
columns_analyze_categorical (list) – list of categorical columns to describe
metric (str, default = "mean") – metric to use in the summary of numerical columns (mean/median)

Returns

None

Return type

None

fit(df: DataFrame) → None[source]

Fits the UMAP and HDBSCAN models

Parameters: df (pandas.DataFrame) – DataFrame with named categorical and numerical columns
Returns: Nothing is returned; the fitted UMAP and HDBSCAN models are kept on the instance
Return type: None
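
A typical workflow, sketched under the assumption that the package is installed and importable as quickclus (the module path shown in the class signature), is to construct the model, call fit, and then attach the labels with assing_results. The wrapper name fit_and_label is hypothetical and exists only to keep the example self-contained.

```python
def fit_and_label(df, **quickclus_params):
    """Fit QuickClus on `df` and return the model plus the labelled data.

    Hypothetical convenience wrapper; only the QuickClus calls inside it
    come from the documented API.
    """
    # Imported lazily so this helper can be defined without the package installed.
    from quickclus import QuickClus

    model = QuickClus(**quickclus_params)
    model.fit(df)  # documented return type is None
    results = model.assing_results(df)  # method name spelled as documented
    return model, results
```

For example, model, results = fit_and_label(df, random_state=42, min_cluster_size=30) would use only documented constructor parameters, and results could then be passed to model.cluster_summary(results).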

plot_2d_labels(plot_lib='matploblib', data=None)[source]

Plot the first two dimensions of the final embedding with the final clusters.

Parameters

plot_lib (str) – plot library to use (plotly, matplotlib)
data (pandas.DataFrame) – pandas DataFrame with the original data to show in the plot. Only used if plot_lib = “plotly”

Returns

fig – plotly or matplotlib figure, depending on plot_lib

Return type

figure

plot_3d_labels(data)[source]

Plot the first three dimensions of the final embedding with the final clusters.

Parameters: data (pd.Dataframe) – pandas dataframe with the original data to show in the plot.
Returns: fig – plotly fig
Return type: plotly.graph_objs._figure.Figure

plot_condensed_tree()[source]

Plots the condensed tree of the model

Parameters: self.hdb_model – fitted HDBSCAN model (instance attribute)
Return type: None

plot_embedding_labels()[source]

Plots a jointplot with the model’s labels

Parameters

self.hdb_model – fitted HDBSCAN model (instance attribute)
self.umap_embedding – UMAP embedding of the data (instance attribute)

Return type

None

tune_model(n_trials=100, min_cluster_start=0.01, min_cluster_end=0.15, min_samples_start=0.01, min_samples_end=0.15, max_epsilon=None)[source]

Tunes an HDBSCAN model by maximizing the DBCV score (https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf)

Parameters

n_trials (int, default = 100) – number of iterations
min_cluster_start (float, default = 0.01) – lowest value of min_cluster_size in the search space (as a proportion of the data)
min_cluster_end (float, default = 0.15) – highest value of min_cluster_size in the search space (as a proportion of the data)
min_samples_start (float, default = 0.01) – lowest value of min_samples of the search space (proportion of data)
min_samples_end (float, default = 0.15) – highest value of min_samples of the search space (proportion of data)
max_epsilon (float, default = None) – If a value is provided, an optimal epsilon is searched between 0 and max_epsilon

Returns

optimized hdbscan

Return type

None
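
Because the search-space bounds are given as proportions of the data, it can help to see what they mean in absolute point counts. The sketch below converts the default bounds for a given number of rows. The helper names, the rounding rule, and the lower floor of 2 points are assumptions for illustration; how tune_model itself converts the proportions is not stated here.

```python
def proportion_to_count(proportion, n_rows, floor=2):
    """Convert a proportion of the data into a point count.

    Hypothetical helper: rounding to the nearest integer and the
    floor of 2 are assumptions, not documented behaviour.
    """
    return max(floor, round(proportion * n_rows))


def search_bounds(n_rows, start=0.01, end=0.15):
    """Return (lowest, highest) point counts for one search dimension."""
    if not 0 < start <= end <= 1:
        raise ValueError("expected 0 < start <= end <= 1")
    return proportion_to_count(start, n_rows), proportion_to_count(end, n_rows)


# With the documented defaults, 2000 rows give candidate sizes from 20 to 300.
low, high = search_bounds(2000)
print(low, high)
```

On small datasets the lower bound collapses to the floor, which is one reason to raise min_cluster_start and min_samples_start when only a few hundred rows are available.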