Spotify Podcast Dataset

Podcasts are a rapidly growing audio-only medium, and with this growth comes an opportunity to better understand the content within podcasts. To this end, we present the Spotify Podcast Dataset. 


This dataset consists of 100,000 episodes from different podcast shows on Spotify. The dataset is available for research purposes.

 

The dataset was initially created in the context of the TREC 2020 Podcasts Track shared tasks, in which participants were asked to complete two tasks focused on understanding podcast content and enhancing search functionality within podcasts.


We are releasing this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. The dataset contains about 50,000 hours of audio, and over 600 million words. The episodes span a variety of lengths, topics, styles, and qualities.

 

 

Request for Dataset




Use this Google form link to request the dataset.

 



Dataset Description

The podcast dataset contains about 100,000 episodes, filtered to include only those the creator tagged as being in English, with an additional language filter applied to the creator-provided title and description. We expect that a small amount of multilingual content may have slipped through these filters.


Episodes were sampled from both professional and amateur podcasts, including:

  • Episodes produced in a studio with dedicated equipment by trained professionals
  • Episodes self-published from a phone app — these vary in quality depending on the professionalism and equipment of the creator


The episodes represent a wide range of:


  • Audio quality: professionally produced podcasts can be expected to have high audio quality, but there is significant variability among the amateur podcasts. We have applied a basic popularity filter to remove most podcasts that are defective or noisy.
  • Topics: the episodes represent a wide range of topics, both coarse- and fine-grained. These include lifestyle and culture, storytelling, sports and recreation, news, health, documentary, and commentary.
  • Structural formats: podcasts are structured in a number of different ways, including scripted and unscripted monologues, interviews, conversations, and debates, and may include clips of other non-speech audio material.



Each of the 100,000 episodes in the dataset includes an audio file, a text transcript, and some associated metadata. 


The data are separated into three top-level directories: one for transcripts, one for RSS files, and one for audio data. Since the audio files are vastly larger than the metadata, and not all researchers will choose to work on the audio data, we make these available for separate download. The metadata can be found in a single TSV file in the top-level directory.

 

 

Audio directory:

 

OGG format, available for separate download

Median duration of an episode: ~31.6 minutes
Estimated size: ~2 TB for the entire audio set
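
Since the audio is distributed as OGG files, a minimal sketch of loading one episode in Python follows. This assumes the third-party librosa library and a hypothetical file path; librosa also needs an OGG-capable backend such as soundfile.

import librosa

# Load one episode at its native sampling rate; the path below is hypothetical.
audio, sr = librosa.load("audio/episode.ogg", sr=None)

print(f"{len(audio) / sr / 60:.1f} minutes at {sr} Hz")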

 

 

Metadata:

 

Extracted basic metadata file in TSV format with fields: show_uri, show_name, show_description, publisher, language, rss_link, episode_uri, episode_name, episode_description, duration
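
For example, the metadata can be loaded with pandas. This is a minimal sketch: the file name metadata.tsv is an assumption, so check the distribution for the actual name.

import pandas as pd

# Episode-level metadata, one row per episode; the file name is assumed.
metadata = pd.read_csv("metadata.tsv", sep="\t")

# The fields listed above appear as columns.
print(metadata[["show_name", "episode_name", "duration"]].head())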

 

Subdirectory for the episode RSS header files:

 

Each file is roughly 1,000 words, with additional fields of potential interest that are not necessarily present for every episode: channel, title, description, author, link, copyright, language, image
Estimated size: ~145 MB total for the entire RSS set when compressed.
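
One way to read these headers is with the third-party feedparser library. This is a sketch: the file path is hypothetical, and fields should be accessed defensively since not every field is present for every episode.

import feedparser

# Parse a single episode's RSS header file; the path below is hypothetical.
feed = feedparser.parse("rss/episode.xml")

# Channel-level fields such as title, author, and language, where present.
print(feed.feed.get("title"), feed.feed.get("author"), feed.feed.get("language"))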

 


Subdirectory for transcripts: 

 

JSON format
Average transcript length is just under 6,000 words, ranging from a small number of extremely short episodes up to 45,000 words. Two-thirds of the transcripts are between about 1,000 and about 10,000 words in length; about 1% (roughly 1,000 episodes) are very short trailers advertising other content.
Estimated size: ~12 GB for the entire transcript set
 

Example of Transcript:


The transcripts consist of a JSON structure. The listing below demonstrates the "results" structure, which begins with a list of transcriptions of 30-second chunks of speech, each chunk with a confidence score and with every word annotated with "startTime" and "endTime". The last item in the "results" structure is a list of all words for the entire episode, again with "startTime" and "endTime" and, in addition, an inferred "speakerTag" to distinguish episode participants. While the "results" structure is designed to accommodate several hypotheses through its "alternatives" list, these transcripts do not provide alternative transcription hypotheses.

 

 

  {"results":  

 [{"alternatives":  // always only one alternative in these transcripts

   [{"transcript": "Hello, y'all, ... <30 s worth of text> ... ",

     "confidence": 0.8640950322151184,

     "words":  // list of words

     [{"startTime": "3s", "endTime": "3.300s", "word": "Hello,"}, 

... 

]}]},

  {"alternatives": [

      {"transcript": "Aaron ... ",

       "confidence": 0.7733442187309265,

       "words": [

  {"startTime": "30s", "endTime": "30.200s", "word": "Aaron"}, ... ]}]},

  {"alternatives":  // last item in "results": a straight list of words with "speakerTag"

   [{"words":

     [{"startTime": "3s", "endTime": "3.300s", "word": "Hello,", "speakerTag": 1},

      ...

{"startTime": "30s", "endTime": "30.200s", "word": "Aaron", "speakerTag": 1},

      ...

      

      {"startTime": "39.900s", "endTime": "40.500s", "word": "salon.", "speakerTag": 2} ] }] }]

}
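
Reading a transcript requires only the standard json module. The sketch below (with a hypothetical file path) joins the 30-second chunks into the episode text and inspects the speaker-tagged word list at the end of "results".

import json

# Load one transcript; the path below is hypothetical.
with open("transcripts/episode.json") as f:
    results = json.load(f)["results"]

# All items except the last are 30-second chunks with a single alternative.
full_text = " ".join(
    chunk["alternatives"][0].get("transcript", "")
    for chunk in results[:-1]
)

# The last item holds the speaker-tagged word list for the whole episode.
words = results[-1]["alternatives"][0]["words"]
for w in words[:3]:
    print(w["startTime"], w["endTime"], w["word"], w.get("speakerTag"))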

 


Organizers




Ann Clifton, Spotify

Sravana Reddy, Spotify

 

Yongze Yu, Spotify

 

Md Iftekhar Tanveer, Spotify

 

Aasish Pappu, Spotify

 

Jussi Karlgren, Spotify

 

Ben Carterette, Spotify

 

Jen McFadden, Spotify

 

Gareth Jones, Dublin City University

 

Maria Eskevich, CLARIN ERIC

 

Rosie Jones, Spotify

 



FAQ




Is the dataset multilingual?


Episodes are limited to English as the primary language, but we hope to release successive multilingual versions of the dataset in the future.

All information included in this dataset is pulled from content that is already publicly available on Spotify’s service (i.e. metadata and content of published podcast episodes).


What were the TREC 2020 Podcasts Track Tasks?


We defined two tasks for participants in the TREC 2020 Podcasts Track.

 

 

Task 1: Ad-hoc Segment Retrieval (Search)


Given an arbitrary keyword query, retrieve the jump-in point for relevant segments of podcast episodes. The best result would be a segment with very relevant content, which is also a good jump-in point for the user to start listening.

 

Topics consist of a topic number, a keyword query, and a description of the user’s information need. For example:

 

 

<topic>

<num>1</num>

<query>Higgs boson</query>

<description>I’m looking for news and discussion about the discovery of the Higgs boson.  When was it discovered? How? Who was involved? What are the implications of the discovery for physics?</description>

</topic>
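
Topics in this format can be parsed with Python's standard library. The sketch below makes two assumptions: the topics file name is hypothetical, and since TREC topic files typically concatenate <topic> elements without a single root, the raw text is wrapped before parsing.

import xml.etree.ElementTree as ET

# Wrap the concatenated <topic> elements in a root element before parsing;
# the file name below is hypothetical.
with open("podcasts_2020_topics.xml", encoding="utf-8") as f:
    root = ET.fromstring("<topics>" + f.read() + "</topics>")

for topic in root.iter("topic"):
    print(topic.findtext("num"), topic.findtext("query"))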

 

 

Task 2: Summarization

 

 

Given a podcast episode with its audio and transcription, return a short text snippet capturing the most important information in the content. Returned summaries should be grammatical, standalone utterances of significantly shorter length than the input episode description.

 
The input is a podcast episode; participants may use the provided transcript or the raw audio. Information in the RSS header for the episode should not be used.

 

What if there are inaccuracies in the data?


All RSS headers and audio are supplied by creators, and Spotify does not claim responsibility for the content therein. All transcripts are generated using automatic speech recognition, and may contain errors; Spotify makes no claim that these are accurate reproductions of the audio content.

 

 


Who should be excited by this dataset?

 

Speech, NLP, and information retrieval researchers who want to develop novel models on previously inaccessible streams of data. Also, any researchers interested in podcasts!
 

Who ran the competition?

 

The competition was a collaboration between Spotify, NIST (the National Institute of Standards and Technology), and TREC (the Text REtrieval Conference). Spotify supplied the data, the annotation standards, and the evaluation metrics. TREC supplied the infrastructure for participants to join the competition, submit their entries, and publish their system descriptions, and organized a conference in November where participants shared their results. NIST supplied the expert human annotators who judged the participants’ entries according to Spotify’s annotation guidelines and metrics.


What are some helpful resources we can look at if we want to learn more?

 

The previous Spoken Document Retrieval task at TREC: https://pdfs.semanticscholar.org/57ee/3a15088f2db36e07e3972e5dd9598b5284af.pdf

 

Who can I reach out to if I have a question?

Contact the organizers: podcasts-challenge-organizers@spotify.com

 



Citing the dataset

When referring to the data, please cite the following paper:


“100,000 Podcasts: A Spoken English Document Corpus” by Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, and Rosie Jones, COLING 2020

https://www.aclweb.org/anthology/2020.coling-main.519/


Bibtex:


@inproceedings{clifton-etal-2020-100000,

    title = "100,000 Podcasts: A Spoken {E}nglish Document Corpus",

    author = "Clifton, Ann  and

      Reddy, Sravana  and

      Yu, Yongze  and

      Pappu, Aasish  and

      Rezapour, Rezvaneh  and

      Bonab, Hamed  and

      Eskevich, Maria  and

      Jones, Gareth  and

      Karlgren, Jussi  and

      Carterette, Ben  and

      Jones, Rosie",

    booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",

    month = dec,

    year = "2020",

    address = "Barcelona, Spain (Online)",

    publisher = "International Committee on Computational Linguistics",

    url = "https://www.aclweb.org/anthology/2020.coling-main.519",

    pages = "5903--5917",

    abstract = "Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus opens up new avenues for research.",

}

 



 
