Podcasts are a rapidly growing audio-only medium, and with this growth comes an opportunity to better understand the content within podcasts. To this end, we present the Spotify Podcast Dataset.
This dataset consists of 100,000 episodes from different podcast shows on Spotify. The dataset is available for research purposes.
The dataset was initially created in the context of the TREC 2020 Podcasts Track shared tasks, in which participants were asked to complete two tasks focused on understanding podcast content and enhancing search functionality within podcasts.
We are releasing this dataset more widely to facilitate research on podcasts through the lens of speech and audio technology, natural language processing, information retrieval, and linguistics. The dataset contains about 50,000 hours of audio, and over 600 million words. The episodes span a variety of lengths, topics, styles, and qualities.
The dataset contains about 100,000 episodes, filtered to include only those tagged by the creator as being in the English language, with an additional language filter applied to the creator-provided title and description. We expect that a small amount of multilingual content may have slipped through these filters.
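The exact implementation of this language filter is not described here; as an illustration only, the sketch below shows how such a check over the creator-provided title and description might look, using the langdetect package as an assumed stand-in rather than the filter actually applied.

# Illustrative sketch of a language filter over creator-provided metadata.
# NOTE: langdetect is an assumed stand-in; it is not necessarily the tool
# used to build the dataset.
from langdetect import detect

def looks_english(title: str, description: str) -> bool:
    """Return True if the concatenated title and description detect as English."""
    text = f"{title} {description}".strip()
    if not text:
        return False
    try:
        return detect(text) == "en"
    except Exception:  # detection can fail on very short or unusual text
        return False

# Hypothetical usage with one row of episode metadata:
# keep = looks_english(row["episode_name"], row["episode_description"])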
Episodes were sampled from both professional and amateur podcasts: some were produced in a studio with dedicated equipment by trained professionals, while others were self-published from a phone app, so quality varies with the professionalism and equipment of the creator.
The episodes represent a wide range of:
- Audio quality: we can expect professionally produced podcasts to have high audio quality, but there is significant variability in the amateur podcasts. We have included a basic popularity filter to remove most podcasts that are defective or noisy.
- Topics: the episodes represent a wide range of topics, both coarse- and fine-grained. These include lifestyle and culture, storytelling, sports and recreation, news, health, documentary, and commentary.
- Structural formats: podcasts are structured in a number of different ways, including scripted and unscripted monologues, interviews, conversations, debates, and interspersed clips of non-speech audio material.
Each of the 100,000 episodes in the dataset includes an audio file, a text transcript, and some associated metadata.
The data are separated into three top-level directories: one for transcripts, one for RSS files, and one for audio data.
- Since the audio files are vastly larger than the metadata, and not all researchers will choose to work with the audio data, the audio is available for separate download.
- The metadata can be found in a single TSV file at the top level.
Audio directory:
- OGG format, available for separate download
- Median episode duration: ~31.6 minutes
- Estimated size: ~2 TB for the entire audio set

Metadata:
- Basic extracted metadata file in TSV format with fields: show_uri, show_name, show_description, publisher, language, rss_link, episode_uri, episode_name, episode_description, duration

Subdirectory for the episode RSS header files:
- RSS headers of roughly 1,000 words with additional fields of potential interest, not necessarily present for every episode: channel, title, description, author, link, copyright, language, image
- Estimated size: ~145 MB for the entire RSS set when compressed

Subdirectory for transcripts:
- JSON format
- Average length is just under 6,000 words, ranging from a small number of extremely short episodes up to about 45,000 words. Two-thirds of the transcripts are between roughly 1,000 and 10,000 words; about 1% (roughly 1,000 episodes) are very short trailers advertising other content.
- Estimated size: ~12 GB for the entire transcript set
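As a quick, hedged illustration of working with the metadata file, the sketch below reads it with pandas (used here only as a convenience) and computes a few summary statistics; the filename metadata.tsv is an assumption and may differ in your copy of the dataset.

# Minimal sketch: load the episode metadata (TSV) and compute summary statistics.
# The filename "metadata.tsv" is an assumption.
import pandas as pd

metadata = pd.read_csv("metadata.tsv", sep="\t")

# Fields documented above: show_uri, show_name, show_description, publisher,
# language, rss_link, episode_uri, episode_name, episode_description, duration
print(len(metadata))                     # number of episodes
print(metadata["show_uri"].nunique())    # number of distinct shows
print(metadata["duration"].median())     # median duration (reported above as ~31.6 minutes)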
Example of Transcript:
The transcripts consist of a JSON structure. The example below demonstrates the "results" structure, which begins with a list of transcriptions of 30-second chunks of speech, each chunk carrying a confidence score and every word annotated with "startTime" and "endTime". The last item in the "results" structure is a list of all words for the entire episode, again with "startTime" and "endTime" and, in addition, an inferred "speakerTag" to distinguish episode participants. While the "results" structure is designed to accommodate several hypotheses through its "alternatives" list, these transcripts do not provide alternative transcription hypotheses.
{"results":
[{"alternatives": // always only one alternative in these transcripts
[{"transcript": "Hello, y'all, ... <30 s worth of text> ... ",
"confidence": 0.8640950322151184,
"words": // list of words
[{"startTime": "3s", "endTime": "3.300s", "word": "Hello,"},
...
]}]},
{"alternatives": [
{"transcript": "Aaron ... ",
"confidence": 0.7733442187309265,
"words": [
{"startTime": "30s", "endTime": "30.200s", "word": "Aaron"}, ... ]}]},
{"alternatives": // last item in "results": a straight list of words with "speakerTag"
[{"words":
[{"startTime": "3s", "endTime": "3.300s", "word": "Hello,", "speakerTag": 1},
...
{"startTime": "30s", "endTime": "30.200s", "word": "Aaron", "speakerTag": 1},
...
{"startTime": "39.900s", "endTime": "40.500s", "word": "salon.", "speakerTag": 2} ] }] }]
}
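As a sketch of how this structure might be consumed, the snippet below reconstructs the full episode text from the 30-second chunks and pulls out the speaker-tagged word list from the final item; the file path is hypothetical.

# Minimal sketch: parse one transcript JSON file into (a) the full episode text
# and (b) the speaker-tagged word list from the final "results" item.
# The file path below is hypothetical.
import json

with open("transcripts/show_x/episode_y.json") as f:
    data = json.load(f)

chunks = []          # 30-second chunk transcripts, in order
speaker_words = []   # flat word list with "speakerTag" from the last item
for result in data["results"]:
    alternative = result["alternatives"][0]  # only one alternative is provided
    if "transcript" in alternative:
        chunks.append(alternative["transcript"])
    else:
        speaker_words = alternative.get("words", [])

full_text = " ".join(chunks)
print(full_text[:200])
for w in speaker_words[:5]:
    print(w["startTime"], w["endTime"], w["word"], w.get("speakerTag"))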
Is the dataset multilingual?
Episodes are limited to English as the primary language, but we hope to release successive multilingual versions of the dataset in the future.
All information included in this dataset is pulled from content that is already publicly available on Spotify's service (i.e., the metadata and content of published podcast episodes).
What were the TREC 2020 Podcasts Track Tasks?
We defined two tasks for participants in the TREC 2020 Podcasts Track.
Task 1: Ad-hoc Segment Retrieval (Search)
Given an arbitrary keyword query, retrieve the jump-in point for relevant segments of podcast episodes. The best result would be a segment with very relevant content that is also a good jump-in point for the user to start listening.
Topics consist of a topic number, a keyword query, and a description of the user's information need. For example:
<topic>
<num>1</num>
<query>Higgs boson</query>
<description>I’m looking for news and discussion about the discovery of the Higgs boson. When was it discovered? How? Who was involved? What are the implications of the discovery for physics?</description>
</topic>
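To retrieve segments, participants typically need to cut each episode into passages with known time offsets. The sketch below shows one way to do this from the word-level timestamps in the transcripts; the 120-second window is an illustrative choice here, not the official segment definition from the track guidelines.

# Minimal sketch: group an episode's time-stamped words into fixed-length
# windows so each candidate segment has a jump-in point (its start offset).
# The 120-second window is illustrative, not the official segment definition.
from collections import defaultdict

def to_seconds(timestamp: str) -> float:
    """Convert a timestamp string like '30.200s' into seconds."""
    return float(timestamp.rstrip("s"))

def segment_words(words, window_seconds: float = 120.0):
    """Group transcript words into consecutive fixed-length segments.

    `words` is the flat word list from the last "results" item (each entry
    has "startTime", "endTime", "word"). Returns (start_offset, text) pairs.
    """
    buckets = defaultdict(list)
    for w in words:
        offset = int(to_seconds(w["startTime"]) // window_seconds) * window_seconds
        buckets[offset].append(w["word"])
    return [(offset, " ".join(tokens)) for offset, tokens in sorted(buckets.items())]

# Usage, with `speaker_words` from the transcript-parsing sketch above:
# for offset, text in segment_words(speaker_words)[:3]:
#     print(offset, text[:80])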
Task 2: Summarization
Given a podcast episode with its audio and transcription, return a short text snippet capturing the most important information in the content. Returned summaries should be grammatical standalone utterances of significantly shorter length than the input episode description.
The input is a podcast episode; participants may use the provided transcript or the raw audio. Information in the RSS header for the episode should not be used.
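As a trivially simple point of reference, the sketch below implements a naive lead baseline that returns the opening words of the transcript as the summary; this is illustrative only and not an official baseline for the track.

# Minimal sketch: a naive "lead" baseline that uses the first N words of the
# transcript as the episode summary. Illustrative only.

def lead_summary(full_text: str, max_words: int = 100) -> str:
    """Return the first `max_words` words of the episode transcript."""
    return " ".join(full_text.split()[:max_words])

# Usage, with `full_text` from the transcript-parsing sketch above:
# print(lead_summary(full_text))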
What if there are inaccuracies in the data?
All RSS headers and audio are supplied by creators, and Spotify does not claim responsibility for the content therein. All transcripts are generated using automatic speech recognition, and may contain errors; Spotify makes no claim that these are accurate reproductions of the audio content.
Who should be excited by this dataset?
Speech, NLP and Information Retrieval researchers who want to develop novel models on previously inaccessible streams of data. Also, any researchers interested in podcasts!
Who ran the competition?
The competition was a collaboration between Spotify, NIST (the National Institute of Standards and Technology), and TREC (the Text REtrieval Conference). Spotify supplied the data, the annotation standards, and the evaluation metrics. TREC supplied the infrastructure for participants to join the competition, submit their entries, and publish their system descriptions, and organized a conference in November where participants shared their results. NIST supplied the expert human annotators who judged the participants' entries according to Spotify's annotation guidelines and metrics.
What are some helpful resources we can look at if we want to learn more?
The previous Spoken Document Retrieval task at TREC: https://pdfs.semanticscholar.org/57ee/3a15088f2db36e07e3972e5dd9598b5284af.pdf
Who can I reach out to if I have a question?
Contact the organizers: podcasts-challenge-organizers@spotify.com
When referring to the data, please cite the following paper:
“100,000 Podcasts: A Spoken English Document Corpus” by Ann Clifton, Sravana Reddy, Yongze Yu, Aasish Pappu, Rezvaneh Rezapour, Hamed Bonab, Maria Eskevich, Gareth Jones, Jussi Karlgren, Ben Carterette, and Rosie Jones, COLING 2020
https://www.aclweb.org/anthology/2020.coling-main.519/
Bibtex:
@inproceedings{clifton-etal-2020-100000,
title = "100,000 Podcasts: A Spoken {E}nglish Document Corpus",
author = "Clifton, Ann and
Reddy, Sravana and
Yu, Yongze and
Pappu, Aasish and
Rezapour, Rezvaneh and
Bonab, Hamed and
Eskevich, Maria and
Jones, Gareth and
Karlgren, Jussi and
Carterette, Ben and
Jones, Rosie",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://www.aclweb.org/ anthology/2020.coling-main.519 ",
pages = "5903--5917",
abstract = "Podcasts are a large and growing repository of spoken audio. As an audio format, podcasts are more varied in style and production type than broadcast news, contain more genres than typically studied in video data, and are more varied in style and format than previous corpora of conversations. When transcribed with automatic speech recognition they represent a noisy but fascinating collection of documents which can be studied through the lens of natural language processing, information retrieval, and linguistics. Paired with the audio files, they are also a resource for speech processing and the study of paralinguistic, sociolinguistic, and acoustic aspects of the domain. We introduce the Spotify Podcast Dataset, a new corpus of 100,000 podcasts. We demonstrate the complexity of the domain with a case study of two tasks: (1) passage search and (2) summarization. This is orders of magnitude larger than previous speech corpora used for search and summarization. Our results show that the size and variability of this corpus opens up new avenues for research.",
}