maxent_grpo.core.data
Dataset loading utilities with support for mixtures.
This module wraps Hugging Face datasets.load_dataset to handle either a
single dataset (dataset_name) or a declarative mixture with optional column
selection, subsampling via weights, shuffling, and an optional train/test split.
It returns a mapping compatible with downstream training/evaluation code (a
datasets.DatasetDict when the library is installed, or a lightweight stub
during tests).
The import of datasets is guarded so this module can be imported in
environments where the library is unavailable; tests monkey-patch the missing
symbols when needed.
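The guarded import described above can be sketched as follows. This is a minimal illustration of the general pattern, not the module's actual code: the `require_datasets` helper and its error message are hypothetical, and the real module may handle the missing library differently.

```python
# Guarded import: the module stays importable when `datasets` is not installed.
try:
    import datasets  # Hugging Face library, if available
except ImportError:
    datasets = None  # placeholder; tests can monkey-patch this name


def require_datasets():
    """Hypothetical helper: return the library or fail with a clear message."""
    if datasets is None:
        raise ImportError(
            "The 'datasets' library is required to load data; "
            "install it or patch the 'datasets' name in tests."
        )
    return datasets
```

Deferring the failure to call time means that importing the module never fails; only code paths that actually need the library raise an error.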
License
Copyright 2025 Liv d’Aliberti
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Functions
get_dataset(args) – Load a dataset or a weighted mixture and return a dictionary.
load_dataset_split(dataset_name, dataset_config=None, split='validation') – Load a single split from a dataset independent of ScriptArguments.
- maxent_grpo.core.data.get_dataset(args)
Load a dataset or a weighted mixture and return a dictionary.
The function dispatches to datasets.load_dataset for simple cases, or combines multiple datasets when args.dataset_mixture is provided. Each dataset in a mixture can specify a subset of columns and a fractional weight used to subsample it with deterministic shuffling; an optional test split can then be applied globally to the concatenated result.
- Parameters:
args (maxent_grpo.config.ScriptArguments) – Parsed script arguments that describe either a single dataset (dataset_name / dataset_config) or a declarative mixture (dataset_mixture).
- Returns:
Mapping with at least a train split, and possibly a test split if a split size was requested.
- Return type:
datasets.DatasetDict
- Raises:
ValueError – If neither a dataset name nor mixture is supplied, or when a mixture resolves to zero loaded datasets.
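The mixture steps described for get_dataset (column selection, weighted subsampling with a deterministic shuffle, concatenation, and an optional global test split) can be sketched in plain Python. This is an illustrative sketch on lists of dictionaries, not the library's implementation: the `mix` and `subsample` functions, the `rows`/`columns`/`weight` keys, and the `seed` and `test_split_size` parameters are all hypothetical names chosen for the example.

```python
import random


def subsample(rows, weight, seed):
    """Shuffle deterministically, then keep the first `weight` fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled[: int(len(shuffled) * weight)]


def mix(sources, seed=0, test_split_size=None):
    """Combine several row lists into a {'train': ..., 'test': ...} mapping.

    Each source is a dict with hypothetical keys:
      rows    - list of dict records
      columns - optional list of column names to keep
      weight  - optional fraction in (0, 1] of rows to keep
    """
    if not sources:
        raise ValueError("The mixture resolved to zero datasets.")

    combined = []
    for src in sources:
        rows = src["rows"]
        columns = src.get("columns")
        if columns is not None:
            # Column selection: drop every field not listed.
            rows = [{c: row[c] for c in columns} for row in rows]
        weight = src.get("weight")
        if weight is not None:
            rows = subsample(rows, weight, seed)
        combined.extend(rows)

    # Shuffle the concatenated result with the same seed for reproducibility.
    random.Random(seed).shuffle(combined)

    if test_split_size is None:
        return {"train": combined}
    n_test = int(len(combined) * test_split_size)
    return {"train": combined[n_test:], "test": combined[:n_test]}
```

Because every shuffle is seeded, calling `mix` twice with the same inputs yields the same train and test rows, which is the property the deterministic shuffling is meant to provide.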
- maxent_grpo.core.data.load_dataset_split(dataset_name, dataset_config=None, split='validation')
Load a single split from a dataset independent of ScriptArguments.
This helper is used by evaluation code that cannot rely on the full CLI argument object but still needs consistent column filtering and error handling.
- Parameters:
- Returns:
The requested dataset split as returned by datasets.load_dataset.
- Return type:
datasets.Dataset
- Raises:
ValueError – If split is falsy; evaluation requires an explicit split to avoid inadvertently loading entire datasets.
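The explicit-split check described above might look like the following sketch. It is an assumption-laden illustration rather than the module's code: `load_split_sketch` and its injectable `loader` parameter are hypothetical, added so the validation can be exercised without network access. When no loader is supplied, the sketch falls back to the real `datasets.load_dataset(path, name, split=...)` call.

```python
def load_split_sketch(dataset_name, dataset_config=None, split="validation", loader=None):
    """Hypothetical stand-in for load_dataset_split with an injectable loader."""
    if not split:
        # Refuse None or "" so that a whole dataset is never loaded by accident.
        raise ValueError("An explicit split is required for evaluation.")
    if loader is None:
        # Imported lazily so this sketch stays importable without the library.
        from datasets import load_dataset as loader
    return loader(dataset_name, dataset_config, split=split)


# Usage with a fake loader (no download involved):
def fake_loader(name, config, split):
    return {"name": name, "config": config, "split": split}


result = load_split_sketch("my_dataset", split="test", loader=fake_loader)
```

Validating `split` before touching the loader keeps the failure cheap and local: the error is raised before any dataset is resolved or downloaded.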