Is the dataset multilingual?
Episodes are limited to English as the primary language, but we hope to release successive multilingual versions of the dataset in the future.
All information included in this dataset is pulled from content that is already publicly available on Spotify's service (i.e., the metadata and content of published podcast episodes).
What were the TREC 2020 Podcasts Track Tasks?
We defined two tasks for participants in the TREC 2020 Podcasts Track.
Task 1: Ad-hoc Segment Retrieval (Search)
Given an arbitrary keyword query, retrieve the jump-in point for relevant segments of podcast episodes. The best result would be a segment with highly relevant content that is also a good jump-in point for the user to start listening.
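One way to produce candidate segments for this task is to chunk each time-aligned transcript into fixed-length windows whose start times serve as jump-in points. The sketch below assumes a simple `(start_time_seconds, word)` input format and a 120-second window; both are illustrative assumptions, not the track's prescribed segmentation.

```python
def segment_transcript(timed_words, window_s=120.0):
    """Group (start_time_seconds, word) pairs into fixed-length windows.

    Each window is keyed by its start time, which doubles as the
    candidate jump-in point for retrieval.
    """
    segments = {}
    for start, word in timed_words:
        jump_in = int(start // window_s) * window_s
        segments.setdefault(jump_in, []).append(word)
    return [(t, " ".join(ws)) for t, ws in sorted(segments.items())]

# Toy example: four words spanning two windows.
words = [(0.0, "welcome"), (1.2, "back"), (121.0, "today"), (122.5, "we")]
print(segment_transcript(words))
# [(0.0, 'welcome back'), (120.0, 'today we')]
```

A real system would index these segments and score them against the query; the windowing above only defines the retrieval units.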
Topics consist of a topic number, a keyword query, and a description of the user's information need. For example:
<description>I’m looking for news and discussion about the discovery of the Higgs boson. When was it discovered? How? Who was involved? What are the implications of the discovery for physics?</description>
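Topics in this style can be read with a standard XML parser. In the sketch below, the `<topic>`, `<num>`, and `<query>` element names are assumptions for illustration; only `<description>` appears in the example above.

```python
import xml.etree.ElementTree as ET

# A sample topic in the style shown above; element names other than
# <description> are assumed for illustration.
TOPIC_XML = """
<topic>
  <num>1</num>
  <query>higgs boson discovery</query>
  <description>I'm looking for news and discussion about the
  discovery of the Higgs boson.</description>
</topic>
"""

def parse_topic(xml_text: str) -> dict:
    """Extract the fields of a single topic into a dict."""
    root = ET.fromstring(xml_text)
    return {
        "num": root.findtext("num", "").strip(),
        "query": root.findtext("query", "").strip(),
        # Collapse the internal line breaks of the description.
        "description": " ".join(root.findtext("description", "").split()),
    }

topic = parse_topic(TOPIC_XML)
print(topic["query"])  # higgs boson discovery
```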
Task 2: Summarization
Given a podcast episode with its audio and transcription, return a short text snippet capturing the most important information in the content. Returned summaries should be grammatical standalone utterances of significantly shorter length than the input episode description.
The input is a podcast episode; participants may use the provided transcript or the raw audio. Information in the episode's RSS header should not be used.
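A trivial reference point for this task is a lead-based summary that simply returns the opening of the transcript. The sketch below is a naive baseline of my own, not the track's reference system; the 100-word cutoff is an arbitrary choice.

```python
def lead_summary(transcript: str, max_words: int = 100) -> str:
    """Return the first max_words words of a transcript as a summary.

    A naive lead-based baseline: podcast openings often state the
    episode's topic, so the lead is a common point of comparison.
    """
    words = transcript.split()
    snippet = " ".join(words[:max_words])
    if len(words) > max_words:
        snippet += " ..."
    return snippet

# Toy example with a 5-word cutoff.
text = "Welcome back to the show where today we discuss the Higgs boson"
print(lead_summary(text, max_words=5))
# Welcome back to the show ...
```

Evaluated summaries would be judged on how well they capture the episode's key information, so any real system should improve on this lead baseline.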
What if there are inaccuracies in the data?
All RSS headers and audio are supplied by creators, and Spotify does not claim responsibility for the content therein. All transcripts are generated using automatic speech recognition, and may contain errors; Spotify makes no claim that these are accurate reproductions of the audio content.
Who should be excited by this dataset?
Speech, NLP and Information Retrieval researchers who want to develop novel models on previously inaccessible streams of data. Also, any researchers interested in podcasts!
Who ran the competition?
The competition was a collaboration between Spotify, NIST (the National Institute of Standards and Technology), and TREC (the Text Retrieval Conference). Spotify supplied the data, the annotation standards, and the evaluation metrics. TREC supplied the infrastructure for participants to join the competition, submit their entries, and publish their system descriptions, and organized a conference in November where participants shared their results. NIST supplied the expert human annotators who judged the participants' entries according to Spotify's annotation guidelines and metrics.
What are some helpful resources we can look at if we want to learn more?
The previous Spoken Document Retrieval task at TREC: https://pdfs.semanticscholar.org/57ee/3a15088f2db36e07e3972e5dd9598b5284af.pdf
Who can I reach out to if I have a question?
Contact the organizers: email@example.com