kmeans_smote module

K-Means SMOTE oversampling method for class-imbalanced data

class kmeans_smote.KMeansSMOTE(sampling_strategy='auto', random_state=None, kmeans_args=None, smote_args=None, imbalance_ratio_threshold=1.0, density_power=None, use_minibatch_kmeans=True, n_jobs=1, **kwargs)

Bases: imblearn.over_sampling.base.BaseOverSampler

Class to perform oversampling using K-Means SMOTE.

K-Means SMOTE works in three steps:

  1. Cluster the entire input space using k-means.

  2. Distribute the number of samples to generate across clusters:

    1. Select clusters which have a high number of minority class samples.
    2. Assign more synthetic samples to clusters where minority class samples are sparsely distributed.

  3. Oversample each filtered cluster using SMOTE.
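
The filtering and distribution in step 2 can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the helper names `imbalance_ratio` and `sampling_weights` are hypothetical, the cluster statistics are made up, and the density formula (minority count divided by the mean minority distance raised to `density_power`) is an assumption about how sparsity is measured.

```python
def imbalance_ratio(majority_count, minority_count):
    """Imbalance ratio of one cluster: (majority + 1) / (minority + 1)."""
    return (majority_count + 1) / (minority_count + 1)


def sampling_weights(clusters, threshold, density_power):
    """Return {cluster_id: share of synthetic samples} for the selected clusters.

    `clusters` maps a cluster id to a tuple
    (majority_count, minority_count, mean_minority_distance).
    The density formula used here is an assumption for illustration.
    """
    sparsity = {}
    for cluster_id, (n_maj, n_min, mean_dist) in clusters.items():
        # Step 2.1: keep only clusters whose imbalance ratio is below the threshold.
        if imbalance_ratio(n_maj, n_min) >= threshold:
            continue
        # Step 2.2: sparse clusters (low density) get a larger sparsity score.
        density = n_min / (mean_dist ** density_power)
        sparsity[cluster_id] = 1.0 / density
    total = sum(sparsity.values())
    if total == 0:
        return {}  # no cluster passed the filter
    # Normalize so that the weights of the selected clusters sum to 1.
    return {cid: s / total for cid, s in sparsity.items()}


# Made-up cluster statistics: (majority_count, minority_count, mean distance).
clusters = {
    "a": (40, 5, 1.0),  # ratio 41/6, above the threshold, filtered out
    "b": (3, 9, 1.0),   # ratio 0.4, kept, minority samples are dense
    "c": (2, 9, 2.0),   # ratio 0.3, kept, minority samples are sparser
}
weights = sampling_weights(clusters, threshold=1.0, density_power=2)
print(weights)
```

In this toy setup the sparser cluster `c` receives the larger share of the synthetic samples, which are then generated inside each selected cluster by SMOTE (step 3).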

The method implements SMOTE and random oversampling as limit cases. Therefore, the following configurations may be used to achieve the behavior of …

… SMOTE: imbalance_ratio_threshold=float('Inf'), kmeans_args={'n_clusters':1}

… random oversampling: imbalance_ratio_threshold=float('Inf'), kmeans_args={'n_clusters':1}, smote_args={'k_neighbors':0}
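
As a small sketch, the two limit cases can be written as keyword-argument dictionaries. The variable names are hypothetical; the values are copied from the configurations listed above. With a single cluster, an infinite threshold ensures that this cluster is never filtered out.

```python
# Keyword arguments for the SMOTE limit case: one cluster that is never filtered.
smote_limit_case = {
    "imbalance_ratio_threshold": float("inf"),
    "kmeans_args": {"n_clusters": 1},
}

# Random oversampling additionally sets k_neighbors to 0.
random_oversampling_limit_case = {
    **smote_limit_case,
    "smote_args": {"k_neighbors": 0},
}

# Usage (requires the kmeans_smote package to be installed):
#     from kmeans_smote import KMeansSMOTE
#     sampler = KMeansSMOTE(**smote_limit_case)
```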

Parameters:
  • sampling_strategy (str, dict, or callable, optional (default='auto')) –

    Ratio to use for resampling the data set.

    • If str, has to be one of: (i) 'minority': resample the minority class; (ii) 'majority': resample the majority class; (iii) 'not minority': resample all classes apart from the minority class; (iv) 'all': resample all classes; and (v) 'auto': corresponds to 'all' for oversampling methods and 'not minority' for undersampling methods. The classes targeted will be oversampled or undersampled to achieve an equal number of samples with the majority or minority class.
    • If dict, the keys correspond to the targeted classes. The values correspond to the desired number of samples.
    • If callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples.
  • random_state (int, RandomState instance or None, optional (default=None)) – If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random. Will be copied to kmeans_args and smote_args if not explicitly passed there.
  • kmeans_args (dict, optional (default={})) – Parameters to be passed to sklearn.cluster.KMeans or sklearn.cluster.MiniBatchKMeans (see use_minibatch_kmeans). If n_clusters is not explicitly set, scikit-learn’s default will apply.
  • smote_args (dict, optional (default={})) – Parameters to be passed to imblearn.over_sampling.SMOTE. Note that k_neighbors is automatically adapted without warning when a cluster is smaller than the number of neighbors specified. sampling_strategy will be overwritten according to sampling_strategy passed to this class. random_state will be passed from this class if none is specified.
  • imbalance_ratio_threshold (float or dict, optional (default=1.0)) – Specify a threshold for a cluster’s imbalance ratio ((majority_count + 1) / (minority_count + 1)). Only clusters with an imbalance ratio less than the threshold are oversampled. Use a dictionary to specify different thresholds for different minority classes.
  • density_power (float, optional (default=None)) – Used to compute the density of minority samples within each cluster. By default, the number of features will be used.
  • use_minibatch_kmeans (boolean, optional (default=True)) – If False, use sklearn.cluster.KMeans. If True, use sklearn.cluster.MiniBatchKMeans.
  • n_jobs (int, optional (default=1)) – The number of threads to open if possible. This parameter will be copied to kmeans_args and smote_args if not explicitly passed there. Note: MiniBatchKMeans does not accept n_jobs.
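
To illustrate the callable form of sampling_strategy described above, here is a minimal sketch of a function that takes y and returns a dict mapping each targeted class to its desired number of samples. The function name `match_majority` and the toy labels are hypothetical.

```python
from collections import Counter


def match_majority(y):
    """Request as many samples as the largest class for every smaller class."""
    counts = Counter(y)
    majority_size = max(counts.values())
    # Keys are the targeted classes, values the desired number of samples.
    return {
        label: majority_size
        for label, count in counts.items()
        if count < majority_size
    }


# Toy labels: class 0 has 6 samples, class 1 has 2, class 2 has 3.
y = [0] * 6 + [1] * 2 + [2] * 3
target = match_majority(y)
print(target)
```

Such a function would be passed as `sampling_strategy=match_majority` when constructing the sampler.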

Examples

>>> import numpy as np
>>> from imblearn.datasets import fetch_datasets
>>> from kmeans_smote import KMeansSMOTE
>>>
>>> datasets = fetch_datasets(filter_data=['oil'])
>>> X, y = datasets['oil']['data'], datasets['oil']['target']
>>>
>>> # Print the class distribution before oversampling.
>>> for label, count in zip(*np.unique(y, return_counts=True)):
...     print('Class {} has {} instances'.format(label, count))
>>>
>>> kmeans_smote = KMeansSMOTE(
...     kmeans_args={
...         'n_clusters': 100
...     },
...     smote_args={
...        'k_neighbors': 10
...     }
... )
>>> X_resampled, y_resampled = kmeans_smote.fit_resample(X, y)
>>>
>>> # Print the class distribution after oversampling.
>>> for label, count in zip(*np.unique(y_resampled, return_counts=True)):
...     print('Class {} has {} instances after oversampling'.format(label, count))

fit(X, y)

Check inputs and statistics of the sampler.

You should use fit_resample in all cases.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Data array.
  • y (array-like, shape (n_samples,)) – Target array.
Returns:

self – Return the instance itself.

Return type:

object

fit_resample(X, y)

Resample the dataset.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Matrix containing the data which have to be sampled.
  • y (array-like, shape (n_samples,)) – Corresponding label for each sample in X.
Returns:

  • X_resampled ({array-like, sparse matrix}, shape (n_samples_new, n_features)) – The array containing the resampled data.
  • y_resampled (array-like, shape (n_samples_new,)) – The corresponding label of X_resampled.

fit_sample(X, y)

Resample the dataset.

Parameters:
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Matrix containing the data which have to be sampled.
  • y (array-like, shape (n_samples,)) – Corresponding label for each sample in X.
Returns:

  • X_resampled ({array-like, sparse matrix}, shape (n_samples_new, n_features)) – The array containing the resampled data.
  • y_resampled (array-like, shape (n_samples_new,)) – The corresponding label of X_resampled.

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (boolean, optional) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: mapping of string to any

ratio_

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns: self