maxent_grpo.core.data

Dataset loading utilities with support for mixtures.

This module wraps Hugging Face datasets.load_dataset to handle either a single dataset (dataset_name) or a declarative mixture with optional column selection, subsampling via weights, shuffling, and an optional train/test split. It returns a mapping compatible with downstream training/evaluation code (a datasets.DatasetDict when the library is installed, or a lightweight stub during tests).

The import of datasets is guarded so this module can be imported in environments where the library is unavailable; tests monkey-patch the missing symbols when needed.
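A guarded import of this kind is often written as shown below. This is a minimal sketch, not the module's actual code: the flag name HAVE_DATASETS and the body of the fallback class are assumptions made for illustration, and only the name DatasetDict comes from the real library.

```python
# Sketch of a guarded import (assumed pattern, not the module's source).
try:
    from datasets import DatasetDict  # real class when the library is installed
    HAVE_DATASETS = True  # hypothetical flag name
except ImportError:
    HAVE_DATASETS = False

    class DatasetDict(dict):
        """Hypothetical lightweight stand-in used when datasets is missing."""


# Either way, callers can treat the result as a plain mapping of split names.
print("datasets available:", HAVE_DATASETS)
```

Because both the real class and the stand-in are dict subclasses, downstream code that only indexes by split name does not need to know which one it received.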

License

Copyright 2025 Liv d’Aliberti

Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Functions

_dataset_load_retry_settings()

_is_saved_hf_dataset_dir(candidate)

Return True when candidate looks like datasets.save_to_disk output.

_load_dataset_with_retries(*args, **kwargs)

_should_retry_dataset_load(exc)

_to_dataset_dict(payload)

get_dataset(args)

Load a dataset or a weighted mixture and return a dictionary.

load_dataset_split(dataset_name[, ...])

Load a single split from a dataset independent of ScriptArguments.

maxent_grpo.core.data.get_dataset(args)[source]

Load a dataset or a weighted mixture and return a dictionary.

The function dispatches to datasets.load_dataset for simple cases or combines multiple datasets when args.dataset_mixture is provided. Each dataset in a mixture can specify a subset of columns and a fractional weight for subsampling with deterministic shuffling. An optional test split is then applied globally to the concatenated result.

Parameters:

args (maxent_grpo.config.ScriptArguments) – Parsed script arguments that describe either a single dataset (dataset_name / dataset_config) or a declarative mixture (dataset_mixture).

Returns:

Mapping with at least a train split, plus a test split if a test split size was requested.

Return type:

datasets.DatasetDict

Raises:

ValueError – If neither a dataset name nor a mixture is supplied, or when a mixture resolves to zero loaded datasets.
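The mixture behaviour described above (column selection, weighted subsampling after a deterministic shuffle, concatenation, and an optional global test split) can be illustrated on plain Python lists. The sketch below is not the implementation of get_dataset: the function name mix_rows_sketch, its parameters, and the exact order of operations are assumptions, and it deliberately avoids the datasets library so it runs anywhere.

```python
import random


def mix_rows_sketch(sources, seed=0, test_fraction=None):
    """Combine row lists into {"train": ...} or {"train": ..., "test": ...}.

    sources is a list of (rows, columns, weight) tuples (hypothetical shape):
    columns is a list of keys to keep or None, weight is a fraction or None.
    """
    if not sources:
        raise ValueError("mixture resolved to zero datasets")
    combined = []
    for rows, columns, weight in sources:
        # Optional column selection; always copy so inputs are not mutated.
        if columns is not None:
            rows = [{c: row[c] for c in columns} for row in rows]
        else:
            rows = list(rows)
        # Optional subsampling: seeded shuffle, then keep the first fraction.
        if weight is not None:
            random.Random(seed).shuffle(rows)
            rows = rows[: int(len(rows) * weight)]
        combined.extend(rows)
    # Seeded shuffle of the concatenated result keeps runs reproducible.
    random.Random(seed).shuffle(combined)
    if test_fraction:
        n_test = int(len(combined) * test_fraction)
        return {"train": combined[n_test:], "test": combined[:n_test]}
    return {"train": combined}


a = [{"q": i, "extra": i} for i in range(10)]
b = [{"q": 100 + i} for i in range(4)]
mixed = mix_rows_sketch([(a, ["q"], 0.5), (b, None, None)], seed=1, test_fraction=0.25)
print(len(mixed["train"]), len(mixed["test"]))
```

Fixing the seed means the same mixture specification always yields the same rows in the same order, which is the property the phrase "deterministic shuffling" refers to.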

maxent_grpo.core.data.load_dataset_split(dataset_name, dataset_config=None, split='validation')[source]

Load a single split from a dataset independent of ScriptArguments.

This helper is used by evaluation code that cannot rely on the full CLI argument object but still needs consistent column filtering and error handling.

Parameters:
  • dataset_name (str) – Dataset repository ID on the Hugging Face Hub.

  • dataset_config (str | None) – Optional dataset configuration name to disambiguate multiple configurations.

  • split (str) – Split to load (for example "train", "validation", or "test").

Returns:

The requested dataset split as returned by datasets.load_dataset.

Return type:

datasets.Dataset

Raises:

ValueError – If split is falsy (for example None or an empty string), as evaluation requires an explicit split to avoid loading entire datasets inadvertently.
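The explicit-split check can be sketched as follows. This is an illustration under stated assumptions, not the library's code: load_split_sketch and its loader parameter are hypothetical, with loader standing in for datasets.load_dataset so the example runs without the library or network access.

```python
def load_split_sketch(loader, dataset_name, dataset_config=None, split="validation"):
    """Validate split, then delegate to a load_dataset-style callable."""
    if not split:
        # Mirrors the documented ValueError for a falsy split.
        raise ValueError("an explicit split is required")
    # Same call shape as datasets.load_dataset(path, name, split=...).
    return loader(dataset_name, dataset_config, split=split)


def fake_loader(path, name=None, split=None):
    """Hypothetical stand-in that echoes its arguments instead of loading data."""
    return {"path": path, "name": name, "split": split}


print(load_split_sketch(fake_loader, "org/dataset"))
```

With the real library installed, passing datasets.load_dataset as loader would follow the same path; the actual helper may additionally apply column filtering and retries, which this sketch omits.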