Effective Pre-Training of Audio Transformers for Sound Event Detection
Abstract
A pre-training pipeline for audio spectrogram transformers improves frame-level sound event detection through balanced sampling, data augmentation, and knowledge distillation.
We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.
Get this paper in your agent:
hf papers read 2409.09546 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 14
Collections including this paper 0
No Collection including this paper