StratifiedKFold
- class mvpy.crossvalidation.StratifiedKFold(n_splits: int = 5, shuffle: bool = False, random_state: int | Generator | Generator | None = None)
Implements a stratified k-folds cross-validator.
Unlike scikit-learn's stratified k-fold, this also stratifies across multiple features when the targets have shape (n_samples[, …], n_features[, n_timepoints]).
- Parameters:
- n_splits : int, default=5
Number of splits to use.
- shuffle : bool, default=False
Whether to shuffle indices before splitting.
- random_state : Optional[Union[int, np.random._generator.Generator, torch._C.Generator]], default=None
Random state to use for shuffling (either an integer seed or a numpy/torch generator), if any.
- Attributes:
- n_splits : int, default=5
Number of splits to use.
- shuffle : bool, default=False
Whether to shuffle indices before splitting.
- random_state : Optional[Union[int, np.random._generator.Generator, torch._C.Generator]], default=None
Random state to use for shuffling (either an integer seed or a numpy/torch generator), if any.
- rng_ : Union[np.random._generator.Generator, torch._C.Generator]
Random generator derived from random_state.
Notes
For reproducibility when shuffling, set random_state to an integer.
When shuffling, also make sure to instantiate the cross-validator and convert it to the backend you want immediately (for example with to_torch(), as in the second example). Otherwise, each call to split will instantiate a new object with the same random seed, so repeated calls shuffle identically. See the examples for a demonstration.
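The effect described in this note can be illustrated without mvpy. The following is a minimal sketch using only Python's standard library; `shuffled_indices` is a hypothetical helper and `random.Random` merely stands in for the numpy/torch generators, so it shows the general seeding behaviour rather than mvpy's internals.

```python
import random


def shuffled_indices(n, rng):
    """Return a shuffled copy of range(n) using the given generator."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx


seed = 42  # arbitrary seed, chosen for illustration only

# Case 1: a fresh generator is built from the same seed for every call,
# so both calls start from the same state and yield the same permutation.
fresh_a = shuffled_indices(20, random.Random(seed))
fresh_b = shuffled_indices(20, random.Random(seed))

# Case 2: one generator is created once and reused, so its state advances
# between calls and the second permutation is drawn from a different state.
rng = random.Random(seed)
kept_a = shuffled_indices(20, rng)
kept_b = shuffled_indices(20, rng)

print(fresh_a == fresh_b)  # True: identical shuffles from identical seeds
print(kept_a == fresh_a)   # True: first draw from the same seed matches
```

Reusing a single generator keeps the first split reproducible from the seed while letting later splits differ, which appears to be the situation the note recommends setting up.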
Examples
First, let’s assume we have just one feature:
>>> import torch
>>> from mvpy.crossvalidation import StratifiedKFold
>>> X = torch.randn(75, 5)
>>> y = torch.tensor([0] * 40 + [1] * 25 + [2] * 10)
>>> kf = StratifiedKFold()
>>> for f_i, (train, test) in enumerate(kf.split(X, y)):
...     train_idx, train_cnt = torch.unique(y[train], return_counts = True)
...     _, test_cnt = torch.unique(y[test], return_counts = True)
...     print(f'Fold {f_i}: classes={train_idx} N(train)={train_cnt} N(test)={test_cnt}')
Fold 0: classes=tensor([0, 1, 2]) N(train)=tensor([32, 20, 8]) N(test)=tensor([8, 5, 2])
Fold 1: classes=tensor([0, 1, 2]) N(train)=tensor([32, 20, 8]) N(test)=tensor([8, 5, 2])
Fold 2: classes=tensor([0, 1, 2]) N(train)=tensor([32, 20, 8]) N(test)=tensor([8, 5, 2])
Fold 3: classes=tensor([0, 1, 2]) N(train)=tensor([32, 20, 8]) N(test)=tensor([8, 5, 2])
Fold 4: classes=tensor([0, 1, 2]) N(train)=tensor([32, 20, 8]) N(test)=tensor([8, 5, 2])
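To make the class counts above less opaque, here is a rough sketch of one way stratified folds can be built: the indices of each class are dealt round-robin across the folds. It uses only the standard library, `stratified_folds` is a hypothetical name, and nothing here is taken from mvpy's source, so treat it as an illustration of the idea rather than the library's algorithm.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits=5):
    """Yield (train, test) index lists with per-class balance.

    Illustrative only: indices of each class are dealt round-robin
    across folds. This is an assumption, not mvpy's implementation.
    """
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    # Deal each class's indices across the folds in turn.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % n_splits].append(i)

    # Each fold serves once as the test set; the rest form the training set.
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for f, fold in enumerate(folds) if f != k for i in fold)
        yield train, test


# Same class sizes as the example above: 40, 25 and 10 samples.
y = [0] * 40 + [1] * 25 + [2] * 10
for k, (train, test) in enumerate(stratified_folds(y)):
    n_train = Counter(y[i] for i in train)
    n_test = Counter(y[i] for i in test)
    print(f"Fold {k}: N(train)={dict(n_train)} N(test)={dict(n_test)}")
```

With class sizes divisible by the number of splits, every fold in this sketch holds 8, 5 and 2 test samples per class, which matches the counts printed in the example.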
Second, let’s assume we have multiple features and we want to shuffle indices. Note that this will also work if features have overlapping class names, but for clarity here we use different offsets:
>>> import torch
>>> from mvpy.crossvalidation import StratifiedKFold
>>> X = torch.randn(75, 5)
>>> y0 = torch.tensor([0] * 40 + [1] * 25 + [2] * 10)[:,None]
>>> y1 = torch.tensor([3] * 15 + [4] * 45 + [5] * 15)[:,None]
>>> y = torch.stack((y0, y1), dim = 1)
>>> kf = StratifiedKFold(shuffle = True).to_torch()
>>> for f_i, (train, test) in enumerate(kf.split(X, y)):
...     train_idx, train_cnt = torch.unique(y[train], return_counts = True)
...     _, test_cnt = torch.unique(y[test], return_counts = True)
...     print(f'Fold {f_i}: classes={train_idx} N(train)={train_cnt} N(test)={test_cnt}')
Fold 0: classes=tensor([0, 1, 2, 3, 4, 5]) N(train)=tensor([32, 20, 8, 12, 36, 12]) N(test)=tensor([8, 5, 2, 3, 9, 3])
Fold 1: classes=tensor([0, 1, 2, 3, 4, 5]) N(train)=tensor([32, 20, 8, 12, 36, 12]) N(test)=tensor([8, 5, 2, 3, 9, 3])
Fold 2: classes=tensor([0, 1, 2, 3, 4, 5]) N(train)=tensor([32, 20, 8, 12, 36, 12]) N(test)=tensor([8, 5, 2, 3, 9, 3])
Fold 3: classes=tensor([0, 1, 2, 3, 4, 5]) N(train)=tensor([32, 20, 8, 12, 36, 12]) N(test)=tensor([8, 5, 2, 3, 9, 3])
Fold 4: classes=tensor([0, 1, 2, 3, 4, 5]) N(train)=tensor([32, 20, 8, 12, 36, 12]) N(test)=tensor([8, 5, 2, 3, 9, 3])
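One possible way to balance several label columns at once is to stratify on the joint label of each sample, that is, the tuple of its labels across features. Whether mvpy does this internally is not stated here; the sketch below is an assumption for illustration, written with the standard library only, and `joint_stratified_test_folds` is a hypothetical name.

```python
from collections import Counter, defaultdict


def joint_stratified_test_folds(rows, n_splits=5):
    """Assign every sample to one test fold, balancing joint label tuples.

    Assumption for illustration: samples sharing the same tuple of labels
    are dealt round-robin across folds. Not taken from mvpy's source.
    """
    # Group sample indices by their joint label, e.g. (0, 3) or (1, 4).
    by_joint = defaultdict(list)
    for i, row in enumerate(rows):
        by_joint[tuple(row)].append(i)

    # Deal each joint class across the folds in turn.
    folds = [[] for _ in range(n_splits)]
    for indices in by_joint.values():
        for j, i in enumerate(indices):
            folds[j % n_splits].append(i)
    return folds


# Same label layout as the example above, without shuffling.
y0 = [0] * 40 + [1] * 25 + [2] * 10
y1 = [3] * 15 + [4] * 45 + [5] * 15
rows = list(zip(y0, y1))

for k, test in enumerate(joint_stratified_test_folds(rows)):
    c0 = Counter(y0[i] for i in test)
    c1 = Counter(y1[i] for i in test)
    print(f"Fold {k}: feature 0 {dict(c0)}, feature 1 {dict(c1)}")
```

For this label layout the joint classes have 15, 25, 20, 5 and 10 members, so each test fold receives 3, 5, 4, 1 and 2 of them. Summed per feature, that gives 8, 5, 2 and 3, 9, 3, the same test counts as in the example output.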
- split(X: ndarray | Tensor, y: ndarray | Tensor | None = None) → Generator[tuple[ndarray, ndarray], None, None] | Generator[tuple[Tensor, Tensor], None, None]
Split the dataset into an iterable of stratified (train, test) pairs.
- Parameters:
- X : Union[np.ndarray, torch.Tensor]
Input data of shape (n_samples, …)
- y : Optional[Union[np.ndarray, torch.Tensor]], default=None
Target data of shape (n_samples, …). The class labels in y determine the stratification, as in the examples above.
- Returns:
- kf : Union[collections.abc.Generator[tuple[np.ndarray, np.ndarray], None, None], collections.abc.Generator[tuple[torch.Tensor, torch.Tensor], None, None]]
Iterable generator of (train, test) pairs.