KFold#
- class mvpy.crossvalidation.KFold(n_splits: int = 5, shuffle: bool = False, random_state: int | numpy.random.Generator | torch.Generator | None = None)[source]#
Implements a k-fold cross-validator.
In principle, this class is redundant with sklearn.model_selection.KFold. However, it is useful with the torch backend because it automatically creates fold indices on the desired device.
- Parameters:
- n_splits : int, default=5
Number of splits to use.
- shuffle : bool, default=False
Whether to shuffle the indices before splitting.
- random_state : Optional[Union[int, numpy.random.Generator, torch.Generator]], default=None
Random state used for shuffling (an integer seed or a numpy/torch generator), if any.
- Attributes:
- n_splits : int, default=5
Number of splits to use.
- shuffle : bool, default=False
Whether to shuffle the indices before splitting.
- random_state : Optional[Union[int, numpy.random.Generator, torch.Generator]], default=None
Random state used for shuffling (an integer seed or a numpy/torch generator), if any.
- rng_ : Union[numpy.random.Generator, torch.Generator]
Random generator derived from random_state.
Notes
For reproducibility when shuffling, set random_state to an integer seed.
Note also that, when shuffling, you should convert the cross-validator to your desired backend immediately after instantiating it. Otherwise, each call to split will instantiate a new backend object seeded with the same random state, making successive runs identical. See the examples below for a demonstration.
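The splitting convention itself is simple. The following pure-NumPy sketch (an illustration of the usual contiguous k-fold scheme, not MVPy's actual implementation) reproduces the unshuffled folds shown in the first example below:

```python
import numpy as np

def kfold_indices(n_samples: int, n_splits: int = 5):
    """Yield (train, test) index pairs for contiguous k-fold splits."""
    indices = np.arange(n_samples)
    # np.array_split distributes any remainder over the first folds,
    # matching the usual k-fold convention.
    for test in np.array_split(indices, n_splits):
        train = np.setdiff1d(indices, test)
        yield train, test

for f_i, (train, test) in enumerate(kfold_indices(10, n_splits=5)):
    print(f'Fold{f_i}: train={train.tolist()} test={test.tolist()}')
```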
Examples
If we are not using shuffling, we can simply do:
>>> import torch
>>> from mvpy.crossvalidation import KFold
>>> X = torch.arange(10)
>>> kf = KFold()
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
Fold0: train=tensor([2, 3, 4, 5, 6, 7, 8, 9]) test=tensor([0, 1])
Fold1: train=tensor([0, 1, 4, 5, 6, 7, 8, 9]) test=tensor([2, 3])
Fold2: train=tensor([0, 1, 2, 3, 6, 7, 8, 9]) test=tensor([4, 5])
Fold3: train=tensor([0, 1, 2, 3, 4, 5, 8, 9]) test=tensor([6, 7])
Fold4: train=tensor([0, 1, 2, 3, 4, 5, 6, 7]) test=tensor([8, 9])
However, let’s assume we want to use shuffling. We might be inclined to do:
>>> import torch
>>> from mvpy.crossvalidation import KFold
>>> X = torch.arange(6)
>>> kf = KFold(n_splits = 2, shuffle = True, random_state = 42)
>>> print('Run 1:')
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
>>> print('Run 2:')
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
Run 1:
Fold0: train=tensor([4, 1, 5]) test=tensor([0, 3, 2])
Fold1: train=tensor([0, 3, 2]) test=tensor([4, 1, 5])
Run 2:
Fold0: train=tensor([4, 1, 5]) test=tensor([0, 3, 2])
Fold1: train=tensor([0, 3, 2]) test=tensor([4, 1, 5])
Note that here we pass random_state to make this reproducible on your end. As you can see, the randomisation is now static across runs. This occurs because, up until the call to split, MVPy cannot reliably infer the desired backend. The backend class is therefore instantiated only when split is called and the data types become explicit. This means, however, that each call to split re-instantiates the backend with the same random seed. We can easily work around this in two ways:
>>> import torch
>>> from mvpy.crossvalidation import KFold
>>> X = torch.arange(6)
>>> kf = KFold(n_splits = 2, shuffle = True, random_state = 42).to_torch()
>>> print('Run 1:')
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
>>> print('Run 2:')
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
Run 1:
Fold0: train=tensor([4, 1, 5]) test=tensor([0, 3, 2])
Fold1: train=tensor([0, 3, 2]) test=tensor([4, 1, 5])
Run 2:
Fold0: train=tensor([4, 0, 3]) test=tensor([5, 1, 2])
Fold1: train=tensor([5, 1, 2]) test=tensor([4, 0, 3])
Here, we explicitly instantiate a torch operator that is not re-instantiated across runs, which works as expected. Alternatively, we can pass an external generator to achieve the same result:
>>> import torch
>>> from mvpy.crossvalidation import KFold
>>> X = torch.arange(6)
>>> rng = torch.Generator()
>>> rng.manual_seed(42)
>>> kf = KFold(n_splits = 2, shuffle = True, random_state = rng)
>>> print('Run 1:')
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
>>> print('Run 2:')
>>> for f_i, (train, test) in enumerate(kf.split(X)):
...     print(f'Fold{f_i}: train={train} test={test}')
Run 1:
Fold0: train=tensor([4, 1, 5]) test=tensor([0, 3, 2])
Fold1: train=tensor([0, 3, 2]) test=tensor([4, 1, 5])
Run 2:
Fold0: train=tensor([4, 0, 3]) test=tensor([5, 1, 2])
Fold1: train=tensor([5, 1, 2]) test=tensor([4, 0, 3])
- split(X: ndarray | Tensor, y: ndarray | Tensor | None = None) → Generator[tuple[ndarray, ndarray], None, None] | Generator[tuple[Tensor, Tensor], None, None][source]#
Split the dataset into an iterable of (train, test) pairs.
- Parameters:
- X : Union[np.ndarray, torch.Tensor]
Input data of shape (n_samples, …).
- y : Optional[Union[np.ndarray, torch.Tensor]], default=None
Target data of shape (n_samples, …). Unused; included for API consistency.
- Returns:
- kf : Union[Generator[tuple[np.ndarray, np.ndarray], None, None], Generator[tuple[torch.Tensor, torch.Tensor], None, None]]
Generator yielding (train, test) index pairs.
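Whichever backend is used, the yielded folds partition the sample indices: each (train, test) pair is disjoint and exhaustive, and every sample appears in exactly one test fold. A quick NumPy check of this invariant (using the standard contiguous-fold convention as a stand-in, not MVPy itself):

```python
import numpy as np

n_samples, n_splits = 10, 3
indices = np.arange(n_samples)

# Contiguous test folds; np.array_split puts the remainder in the first folds.
test_folds = np.array_split(indices, n_splits)

for test in test_folds:
    train = np.setdiff1d(indices, test)
    assert np.intersect1d(train, test).size == 0   # train/test are disjoint
    assert len(train) + len(test) == n_samples     # together they cover all samples

# Every sample lands in exactly one test fold.
assert np.array_equal(np.sort(np.concatenate(test_folds)), indices)
print('partition checks passed')
```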