Create split
create_split#
def create_split(
data: Union[xr.Dataset, Dict[str, np.ndarray]], to_pred: Union[List[str], str] = None,
exclude_vars: List[str] = list(), feature_vars: List[str] = None, stack_axis: int = -1
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]
Description#
The create_split function generates train-test splits for machine learning models by using a predefined split variable within a dataset. It accepts input as either an xarray.Dataset or a dictionary of numpy arrays.
Feature and target variables can be specified flexibly, with optional exclusions and a configurable stacking axis (stack_axis) for the input dimensions.
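The behaviour described above can be pictured as boolean masking on the split variable. The following is a minimal sketch, assuming the split array has the same shape as each variable and that the selected variables are stacked along stack_axis. split_by_mask is a hypothetical helper written for illustration; it is not ml4xcube's implementation, and the real function may treat shapes and missing values differently.

```python
import numpy as np

def split_by_mask(data, to_pred, feature_vars, stack_axis=-1):
    """Hypothetical helper: split a dict of arrays using its 'split' entry."""
    # Accept a single target name or a list of names
    targets = [to_pred] if isinstance(to_pred, str) else list(to_pred)
    train = data['split'] == 1.0  # 1. marks training entries
    test = data['split'] == 0.0   # 0. marks testing entries

    def stack(names, mask):
        # Select the masked entries of each variable, then stack them
        return np.stack([data[name][mask] for name in names], axis=stack_axis)

    return (stack(feature_vars, train), stack(feature_vars, test),
            stack(targets, train), stack(targets, test))

rng = np.random.default_rng(0)
data = {
    'temperature': rng.random(100),
    'humidity': rng.random(100),
    'pressure': rng.random(100),
    'split': rng.choice([0.0, 1.0], size=100),
}
X_train, X_test, y_train, y_test = split_by_mask(
    data, to_pred='temperature', feature_vars=['humidity', 'pressure'])
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With one-dimensional variables and the default stack_axis of -1, each feature becomes a column, so the feature arrays in this sketch have one row per selected entry and one column per feature variable.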
Parameters#
- data (Union[xr.Dataset, Dict[str, np.ndarray]]): The data set from which to generate the split, provided either as an xarray dataset or a dictionary of variables.
- to_pred (Union[List[str], str]): Names of the variables to be used as targets. Can be a single string or a list of strings.
- exclude_vars (List[str]): Names of the variables to exclude from the feature set. Defaults to an empty list.
- feature_vars (List[str]): Explicit list of variable names to use as features. If None, the function automatically determines which variables to use based on exclusion criteria and target variables.
- stack_axis (int): Axis along which the variables are stacked. Defaults to -1.
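The interaction between to_pred, exclude_vars and feature_vars can be illustrated with a small stand-alone function. This is a sketch of one plausible selection rule, assuming that the split variable itself is never used as a feature; resolve_feature_vars is a hypothetical name and the library's actual rule may differ.

```python
def resolve_feature_vars(var_names, to_pred, exclude_vars=None, feature_vars=None):
    """Hypothetical helper: decide which variables serve as features."""
    # An explicit feature list takes precedence over automatic selection
    if feature_vars is not None:
        return list(feature_vars)
    targets = [to_pred] if isinstance(to_pred, str) else list(to_pred or [])
    # Assumption: targets, excluded variables and 'split' are all skipped
    skipped = set(targets) | set(exclude_vars or []) | {'split'}
    return [name for name in var_names if name not in skipped]

names = ['temperature', 'humidity', 'pressure', 'split']
auto = resolve_feature_vars(names, to_pred='temperature', exclude_vars=['pressure'])
explicit = resolve_feature_vars(names, to_pred='temperature', feature_vars=['pressure'])
print(auto)      # ['humidity']
print(explicit)  # ['pressure']
```

The sketch uses None as the default for exclude_vars rather than a list object, which avoids sharing one mutable default between calls.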
Returns#
Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]: A tuple containing numpy arrays for training features, testing features, training targets, and testing targets.
Example#
import numpy as np
from ml4xcube.splits import create_split
# Example data setup
data = {
    'temperature': np.random.rand(100, 10),
    'humidity': np.random.rand(100, 10),
    'split': np.random.choice([0., 1.], size=(100,))
}
# Specify the variables to predict and the feature set
to_predict = 'temperature'
features = ['humidity']
# Create train and test splits
X_train, X_test, y_train, y_test = create_split(data, to_pred=to_predict, feature_vars=features)
print('Training features shape:', X_train.shape)
print('Training labels shape:', y_train.shape)
This example uses create_split to prepare data arrays for training and testing a model, where temperature is the target variable and humidity serves as a feature variable.
The split variable within the data cube specifies which entries belong to the training set (1.) and which to the testing set (0.).
It can be assigned using the assign_block_split or the assign_rand_split method.
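To make the 1./0. convention concrete, the sketch below builds a random split array with plain numpy. It does not call assign_rand_split or assign_block_split, whose signatures are not shown here; make_random_split, train_fraction and seed are hypothetical names chosen for illustration.

```python
import numpy as np

def make_random_split(n_entries, train_fraction=0.8, seed=0):
    """Hypothetical helper: return an array of 1. (train) and 0. (test)."""
    rng = np.random.default_rng(seed)
    # Each entry is independently assigned to training with probability train_fraction
    return (rng.random(n_entries) < train_fraction).astype(float)

split = make_random_split(1000, train_fraction=0.8)
print('train entries:', int(split.sum()))
print('test entries:', int((split == 0.).sum()))
```

An array built this way can be stored under the 'split' key of a dictionary of variables, matching the layout used in the example above.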