QuickClus
- class quickclus.QuickClus(random_state: Optional[int] = None, n_neighbors: int = 15, min_cluster_size: int = 15, min_samples: Optional[int] = None, threshold_combine_rare_levels: float = 0.0, n_components: Optional[int] = None, scaler_type_numerical: str = 'standard', imputer_strategy_numerical: str = 'mean', transformation_type_numerical: str = 'power', umap_combine_method: str = 'intersection', n_neighbors_intersection_union: Optional[int] = None, verbose: bool = False)[source]
Creates UMAP embeddings and HDBSCAN clusters from a pandas DataFrame with mixed data.
- Parameters
random_state (int, default = None) – Random seed for both UMAP and numpy.random. If set to None, UMAP runs in Numba multicore mode, but results may vary between runs. Setting a fixed seed offsets the stochastic nature of UMAP and makes runs repeatable.
n_neighbors (int, default = 15) – Number of neighbors for UMAP. Higher values generate higher densities at a greater computational cost.
min_cluster_size (int, default = 15) – Minimum cluster size for HDBSCAN, i.e. the minimum number of points needed to form a cluster.
min_samples (int, default = None) – Samples used for HDBSCAN. The larger this value, the more points are declared noise and the more clusters are restricted to dense areas. If None, min_samples = min_cluster_size.
threshold_combine_rare_levels (float, default = 0.0) – To avoid an excessive increase in dimensionality when one-hot encoding categorical variables, rare levels can be combined. This value is the minimum proportion a category must reach to avoid being combined into “other”.
n_components (int, default = None) – Number of components for UMAP, i.e. the number of dimensions to reduce the data to. Ideally, this value preserves enough information to form meaningful clusters. If None, the logarithm of the total number of features is used.
imputer_strategy_numerical (str, default = "mean") – Imputation strategy for numerical variables. The values can be: “mean”, “median”, “most_frequent”
scaler_type_numerical (str, default = "standard") – Scaler strategy for numerical variables. The values can be: “robust” (RobustScaler), “standard” (StandardScaler)
transformation_type_numerical (str, default = "power") – Transformation strategy for numerical variables. The values can be: “power” (PowerTransformer), “quantile” (QuantileTransformer)
umap_combine_method (str, default = "intersection") – Method used to combine the embedding spaces. Options include: intersection, union, contrast, intersection_union_mapper. The latter combines both the intersection and the union of the embeddings. See: https://umap-learn.readthedocs.io/en/latest/composing_models.html
n_neighbors_intersection_union (int, default = None) – Number of neighbors UMAP uses to combine the UMAP embeddings when umap_combine_method = “intersection_union_mapper”. If None, n_neighbors_intersection_union = n_neighbors.
verbose (bool, default = False) – Level of verbosity to print when fitting and predicting. Setting to False will only show warnings.
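To make the effect of threshold_combine_rare_levels concrete, the following is a minimal standard-library sketch of combining rare categorical levels before one-hot encoding. It is not QuickClus's own implementation: the helper name combine_rare_levels, its other_label argument, and the choice of a ">=" comparison against the threshold are assumptions made for illustration.

```python
from collections import Counter


def combine_rare_levels(values, threshold=0.02, other_label="other"):
    """Replace levels whose relative frequency is below `threshold`.

    Hypothetical helper: levels with a proportion >= threshold are kept,
    all remaining levels are merged into `other_label`.
    """
    counts = Counter(values)
    total = len(values)
    keep = {level for level, n in counts.items() if n / total >= threshold}
    return [v if v in keep else other_label for v in values]


# 100 observations: two frequent levels and two levels seen only once.
colors = ["red"] * 60 + ["blue"] * 38 + ["teal"] + ["plum"]
combined = combine_rare_levels(colors, threshold=0.02)

# Four distinct levels shrink to three, so one-hot encoding
# produces three columns instead of four.
print(sorted(set(combined)))
```

With a threshold of 0.0, no level falls below the cutoff and the column is left unchanged.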
- assing_results(data)[source]
Assigns hdb_model’s labels to the original data
- Parameters
data (pandas.DataFrame) – Original pandas DataFrame
- Returns
results – new pandas dataframe with the calculated clusters
- Return type
pandas.DataFrame
- cluster_summary(results_df, metric='mean', include_cat=False)[source]
Creates a per-cluster summary of the numerical and/or categorical features
- Parameters
results_df (pandas.DataFrame) – pandas dataframe with a cluster column
metric (str, default = "mean") – metric to use in the summary (mean/median/max/min)
include_cat (bool, default = False) – include the mode of the categorical variables
- Returns
df_summary – New dataframe with the summary
- Return type
pandas.DataFrame
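The kind of per-cluster aggregation that cluster_summary describes can be sketched with the standard library alone. This is an illustrative assumption, not the library's code: the function summarize_clusters, the "cluster" key name, and the use of plain dictionaries instead of a DataFrame are all hypothetical. The label -1 follows the HDBSCAN convention for noise points.

```python
from collections import defaultdict
from statistics import mean, median

# Supported metrics, mirroring the documented options (mean/median/max/min).
METRICS = {"mean": mean, "median": median, "max": max, "min": min}


def summarize_clusters(rows, numeric_columns, metric="mean", cluster_key="cluster"):
    """Aggregate each numeric column per cluster label (hypothetical helper)."""
    agg = METRICS[metric]
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[cluster_key]].append(row)
    return {
        label: {col: agg([r[col] for r in members]) for col in numeric_columns}
        for label, members in sorted(grouped.items())
    }


rows = [
    {"cluster": 0, "age": 30, "income": 40.0},
    {"cluster": 0, "age": 40, "income": 60.0},
    {"cluster": 1, "age": 55, "income": 90.0},
    {"cluster": -1, "age": 20, "income": 10.0},  # -1 marks a noise point
]
summary = summarize_clusters(rows, ["age", "income"], metric="mean")
for label, stats in summary.items():
    print(label, stats)
```

Keeping the noise label as its own group makes it easy to compare unclustered points against the real clusters.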
- describe_cluster(results_df, clusters=[0], columns_analyze_numerical=[], columns_analyze_categorical=[], metric='mean')[source]
Describes the selected clusters
- Parameters
results_df (pandas.DataFrame) – pandas dataframe with a cluster column
clusters (list) – list with clusters to describe (int)
columns_analyze_numerical (list) – list of numerical columns to describe
columns_analyze_categorical (list) – list of categorical columns to describe
metric (str, default = "mean") – metric to use in the summary of numerical columns (mean/median)
- Returns
None
- Return type
None
- fit(df: DataFrame) → None[source]
Fits the UMAP and HDBSCAN models on the data
- Parameters
df (pandas.DataFrame) – DataFrame with named categorical and numerical columns
- Returns
Fitted – Fitted UMAPs and HDBSCAN
- Return type
None
- plot_2d_labels(plot_lib='matploblib', data=None)[source]
Plot the first two dimensions of the final embedding with the final clusters.
- Parameters
plot_lib (str) – plot library to use (plotly, matplotlib)
data (pd.DataFrame) – pandas DataFrame with the original data to show in the plot. Only used if plot_lib = “plotly”
- Returns
fig – plotly fig or matplotlib
- Return type
figure
- plot_3d_labels(data)[source]
Plot the first three dimensions of the final embedding with the final clusters.
- Parameters
data (pd.DataFrame) – pandas DataFrame with the original data to show in the plot.
- Returns
fig – plotly fig
- Return type
plotly.graph_objs._figure.Figure
- plot_condensed_tree()[source]
Plots the condensed tree of the model
- Parameters
self.hdb_model – hdbscan model
- Return type
None
- plot_embedding_labels()[source]
Plots a jointplot with the model’s labels
- Parameters
self.hdb_model – hdbscan model
self.umap_embedding – data’s umap embedding
- Return type
None
- tune_model(n_trials=100, min_cluster_start=0.01, min_cluster_end=0.15, min_samples_start=0.01, min_samples_end=0.15, max_epsilon=None)[source]
Tunes an HDBSCAN model by maximizing the DBCV score (https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf)
- Parameters
n_trials (int, default = 100) – number of iterations
min_cluster_start (float, default = 0.01) – lowest value of min_cluster of the search space (proportion of data)
min_cluster_end (float, default = 0.15) – highest value of min_cluster of the search space (proportion of data)
min_samples_start (float, default = 0.01) – lowest value of min_samples of the search space (proportion of data)
min_samples_end (float, default = 0.15) – highest value of min_samples of the search space (proportion of data)
max_epsilon (float, default = None) – If a value is provided, an optimal epsilon is searched between 0 and max_epsilon
- Returns
optimized hdbscan
- Return type
None
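Because the tune_model search bounds are expressed as proportions of the data, they have to be turned into absolute point counts before they can serve as min_cluster_size or min_samples. The sketch below shows one plausible conversion. The helper proportion_bounds_to_counts, the rounding rule, and the floor of 2 are assumptions for illustration and may differ from what the library does internally.

```python
def proportion_bounds_to_counts(n_rows, start=0.01, end=0.15, floor=2):
    """Convert proportion bounds into integer bounds (hypothetical helper).

    `floor` guards against values too small to be usable on tiny datasets;
    the value 2 is an assumption of this sketch.
    """
    low = max(floor, round(n_rows * start))
    high = max(low, round(n_rows * end))
    return low, high


# With the documented defaults, 2000 rows give a search range of 20 to 300 points.
low, high = proportion_bounds_to_counts(2000)
print(low, high)
```

On small datasets the lower bound is clamped to the floor, so the search space never collapses to unusable values.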