Need help finding a dataset for fine-tuning

Hello there!

I am trying to use :hugs: models to convert an extractive summary generated from a scientific paper into an abstractive one. The abstractive summary would preferably be around 6-7 lines.

Can someone please point me to such a dataset, here or elsewhere? It would be ideal if the dataset's inputs looked like extractive summaries and its targets were abstractive summaries.

Hi Aditi, yes, this is doable, and there are a couple of good dataset directions.

  1. Look for scientific paper summarization datasets where you can treat parts of the paper as "extractive-style input" and the abstract as the target. The most common are the arXiv and PubMed summarization datasets, which pair full papers with their abstracts.

  2. If you specifically need "extractive summary to abstractive summary" pairs, you usually create the extractive side yourself. For example, run a simple extractive method like TextRank, or take the top-k sentences of the paper, then train the model to map that input to the abstract.
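
To make step 2 concrete, here is a minimal sketch of building such a pair. It uses a crude frequency-based scorer as a stand-in for TextRank, and the example paper text and abstract are made up for illustration:

```python
import re
from collections import Counter

def top_k_sentences(text, k=3):
    """Crude extractive summary: score each sentence by the corpus-wide
    frequency of its words, keep the top-k in original document order.
    (A stand-in for a proper method like TextRank.)"""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # restore document order
    return " ".join(sentences[i] for i in keep)

# Toy stand-ins for a paper body and its author-written abstract:
paper_body = (
    "Transformers have changed NLP. "
    "We propose a new attention variant. "
    "The variant reduces memory use. "
    "Experiments show strong results on summarization. "
    "We release our code."
)
abstract_target = "We introduce a memory-efficient attention variant with strong summarization results."

extractive_input = top_k_sentences(paper_body, k=3)
pair = {"input": extractive_input, "target": abstract_target}
```

Each `pair` is then one training example: the model sees the extractive side and learns to produce the abstract.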

Model-wise, BART or T5 are strong baselines for abstractive summarization and work well with the Transformers library.
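
For the fine-tuning step, most seq2seq training scripts just want input/target pairs in a JSON-lines file. A quick sketch, where the `"text"`/`"summary"` field names and the file path are only a convention you can change:

```python
import json
import os
import tempfile

# Hypothetical extractive-input / abstract-target pairs:
pairs = [
    {"text": "Sentence one. Sentence two. Sentence three.",
     "summary": "A short abstractive summary."},
    {"text": "Method section sentences. Result sentences.",
     "summary": "Another target abstract."},
]

path = os.path.join(tempfile.gettempdir(), "summ_pairs.jsonl")
with open(path, "w", encoding="utf-8") as f:
    for ex in pairs:
        f.write(json.dumps(ex) + "\n")

# Read it back to check the round trip; a file like this can then be
# loaded with e.g. datasets.load_dataset("json", data_files=path).
with open(path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

From there you tokenize `"text"` as the encoder input and `"summary"` as the labels for BART or T5.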

Quick question so people can recommend the right dataset and setup: do you want to summarize full papers, or only specific sections like introduction plus conclusion, and which domain, arXiv or biomedical?
