Need help finding a dataset for fine-tuning

Hello there!

I am trying to use :hugs: models to convert an extractive summary generated from a scientific paper into an abstractive one. The abstractive summary would preferably be around 6-7 lines.

Can someone please point me to such a dataset, here or elsewhere? It would be ideal if the dataset's inputs looked like extractive summaries and its targets were abstractive summaries.

Hi Aditi, yes, this is doable, and there are a couple of good dataset directions.

  1. Look for scientific paper summarization datasets where you can treat parts of the paper as "extractive-style input" and the abstract as the target. The most common are the arXiv and PubMed summarization datasets, which pair full papers with their abstracts.

  2. If you specifically need "extractive summary to abstractive summary" pairs, you usually create the extractive side yourself. For example, run a simple extractive method like TextRank, or take the top-k sentences of the paper, then train the model to map that input to the abstract.
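
To make step 2 concrete, here is a minimal sketch of building such a pair. It uses a crude frequency-based scorer as a stand-in for TextRank, and the example paper text and abstract are made up for illustration:

```python
import re
from collections import Counter

def top_k_sentences(text, k=3):
    """Crude extractive summary: score each sentence by the corpus-wide
    frequency of its words, keep the top-k in original document order.
    (A stand-in for a proper method like TextRank.)"""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # restore document order
    return " ".join(sentences[i] for i in keep)

# Toy stand-ins for a paper body and its author-written abstract:
paper_body = (
    "Transformers have changed NLP. "
    "We propose a new attention variant. "
    "The variant reduces memory use. "
    "Experiments show strong results on summarization. "
    "We release our code."
)
abstract_target = "We introduce a memory-efficient attention variant with strong summarization results."

extractive_input = top_k_sentences(paper_body, k=3)
pair = {"input": extractive_input, "target": abstract_target}
```

Each `pair` is then one training example: the model sees the extractive side and learns to produce the abstract.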

Model-wise, BART or T5 are strong baselines for abstractive summarization and work well with the Transformers library.
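
For the fine-tuning step, most seq2seq training scripts just want input/target pairs in a JSON-lines file. A quick sketch, where the `"text"`/`"summary"` field names and the file path are only a convention you can change:

```python
import json
import os
import tempfile

# Hypothetical extractive-input / abstract-target pairs:
pairs = [
    {"text": "Sentence one. Sentence two. Sentence three.",
     "summary": "A short abstractive summary."},
    {"text": "Method section sentences. Result sentences.",
     "summary": "Another target abstract."},
]

path = os.path.join(tempfile.gettempdir(), "summ_pairs.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for ex in pairs:
        f.write(json.dumps(ex) + "\n")

# Read it back to check the round trip; a file like this can then be
# loaded with e.g. datasets.load_dataset("json", data_files=path).
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

From there you tokenize `"text"` as the encoder input and `"summary"` as the labels for BART or T5.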

Quick question so people can recommend the right dataset and setup: do you want to summarize full papers, or only specific sections like introduction plus conclusion, and which domain, arXiv or biomedical?
