DatasetInfo seems to be missing when I pull my dataset from HFHub

I have a number of datasets, which I create from a dictionary like so:

    info = DatasetInfo(
            description="my happy lil dataset",
            version="0.0.1",
            homepage="https://www.myhomepage.co.uk"
        )
    train_dataset = Dataset.from_dict(prepare_data(data["train"]), info=info)
    test_dataset = Dataset.from_dict(prepare_data(data["test"]), info=info)
    validation_dataset = Dataset.from_dict(prepare_data(data["validation"]),info=info)

I then combine these into a DatasetDict.

    # Create a DatasetDict
    dataset = DatasetDict(
        {"train": train_dataset, "test": test_dataset, "validation": validation_dataset}
    )

So far, so good. If I access dataset['train'].info.description, I see the expected result of "my happy lil dataset".

So I push to the hub, like so:

    dataset.push_to_hub(f"{organization}/{repo_name}", commit_message="Some commit message")

And this succeeds too.

However, when I pull the dataset back down from the hub and access its associated info, like so:

    pulled_data = load_dataset(f"{organization}/{repo_name}", use_auth_token=True)

    # I expect the following to print out "my happy lil dataset"
    print(pulled_data["train"].info.description)

    # However, it returns '' instead

Am I loading my data in from the hub incorrectly? Am I pushing only my dataset and not the info somehow?
I feel like I’m missing something obvious, but I’m really not sure. Any help would be appreciated.


Did you ever happen to figure this out? I’m seeing exactly the same issue.


It seems that ds.info.description is only populated when the dataset is built by a loading script. Since trust_remote_code (and script-based loading) was removed from the datasets library in version 4.0.0, an empty ds.info.description is now the expected behavior.
If you need metadata about a dataset, refer to the card information in the dataset repository or use the Dataset Viewer API.
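For example, the Dataset Viewer API has an /info endpoint that returns the per-config metadata for a public dataset. A minimal sketch, assuming the response keys metadata by config name (the repo id here is just an example):

```python
import requests

# Any public dataset repo works here; dolly is just an example
repo = "databricks/databricks-dolly-15k"
resp = requests.get(
    "https://datasets-server.huggingface.co/info",
    params={"dataset": repo},
    timeout=30,
)
resp.raise_for_status()
payload = resp.json()

# Metadata is keyed by config name (e.g. "default")
for config_name, cfg in payload["dataset_info"].items():
    print(config_name, "->", sorted(cfg.keys())[:5])
```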

# pip install -U "datasets<4.0.0" "huggingface_hub[hf_xet]"
from datasets import load_dataset_builder, load_dataset
import datasets
from huggingface_hub import RepoCard
import textwrap
print("datasets version:", datasets.__version__)

# 1) Scripted dataset: builder.info.description comes from its loading script
# NOTE: trust_remote_code is no longer supported in >= 4.0.0:
# https://github.com/huggingface/datasets/releases/tag/4.0.0
b = load_dataset_builder("livecodebench/code_generation_lite", trust_remote_code=True)
print("[scripted] description:", textwrap.shorten(b.info.description or "", 180))

# 2) File-based repo: read metadata from the dataset card; ds.info.description is usually empty
repo = "databricks/databricks-dolly-15k"
card = RepoCard.load(repo, repo_type="dataset")
print("[file-based] card.license:", card.data.to_dict().get("license"))
print("[file-based] card.desc:", textwrap.shorten(card.text or "", 180))
ds = load_dataset(repo, split="train")
print("[file-based] ds.info.description:", repr(ds.info.description))

#datasets version: 3.2.0
#[scripted] description: LiveCodeBench is a temporaly updating benchmark for code generation. Please check the homepage: https://livecodebench.github.io/.
#[file-based] card.license: cc-by-sa-3.0
#[file-based] card.desc: # Summary `databricks-dolly-15k` is an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral [...]
#[file-based] ds.info.description: ''