Through the Eyes of the Jackals
With all the great podcasting content produced these days, I often end up wishing for easier ways of finding past episodes. Memory does not always serve well enough, and clicking through episode descriptions can be tedious (and unreliable for podcasts that span many topics in a single show).
Clearly it would be great to have some way of searching or browsing podcast episodes, but it’s equally clear that it’s hard to do without accurate transcriptions.
As I’ve mentioned before (in my only other post so far), live listeners of shows on the 5by5 network can submit title suggestions via Showbot, which was created by Jeremy Mack. Suggestions are almost always quotes from the shows, and during popular shows the submissions are so frequent it seems almost like a live transcript. Of course it’s not—submissions are unsurprisingly biased toward humorous or silly quotes—but I wondered if they could be used in some way.
Basing a search engine on the title suggestions would be frustrating. You would have no way of knowing whether results didn’t appear because the titles simply never mentioned those topics, or because the topics truly weren’t part of a show. However, maybe they could be useful for browsing instead. You’d at least see how an episode was viewed through the eyes of the chatroom jackals.
I took a list of titles recorded by Showbot (the set I had on hand covered episodes from Showbot’s inception at the end of June 2011 through July 2012) and wrote a Perl script to identify the most relevant words for each episode.
After doing a bit of research, it looked like the tf-idf statistic would be a decent measure of the relevance of an individual word to an episode. Tf-idf is the product of two statistics: the term frequency (tf) and the inverse document frequency (idf). The term frequency is just the number of times a word appears in a document (in this case, how many times it appears among all the title submissions for a given episode). The inverse document frequency is calculated by dividing the number of documents (or episodes) by the number of documents the word appears in and then taking the logarithm of that ratio.
You end up with a statistic that is high if the word appears a lot in a document, but that is balanced out if the word is just common and appears in many documents. That way, you don’t end up with “the” and “a” being your top words. You can also avoid that result by manually removing so-called stop words from the text (which I also did).
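To make that concrete with made-up numbers: if “retina” were submitted 12 times for one episode, and appeared in the title suggestions of 5 out of 100 episodes overall, its tf-idf for that episode would be 12 × log(100/5) ≈ 12 × 3.0 ≈ 36. A word that appears in nearly every episode has an idf near log(1) = 0, so it scores close to zero no matter how often it was submitted.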
Getting the word frequency is pretty straightforward in Perl:
use Lingua::StopWords qw( getStopWords );

sub count_words_in_titles {
    my @titles = @_;
    local $_;

    # Strip punctuation, lowercase, and split each title into words.
    my @words = map { s/[.!?,*]/ /g; tokenize(lc($_)) } @titles;

    # Filter out common English words, plus a few contraction
    # fragments left behind by the tokenizer.
    my $stopwords = getStopWords('en');
    @{$stopwords}{qw( 's n't 'll 're 'd 've )} = (1) x 6;
    @words = grep { !$stopwords->{$_} } @words;

    my %word_counts;
    $word_counts{$_}++ for @words;
    return \%word_counts;
}
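For instance, calling it on a couple of hypothetical titles:

my $counts = count_words_in_titles(
    "The Rotation of Crops",
    "Crops, Rotated!",
);
# $counts: { rotation => 1, crops => 2, rotated => 1 }

(The stop words “the” and “of” are dropped, and since there’s no stemming, “rotation” and “rotated” are counted separately.)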
I used a tokenize() function taken from the Lingua::BrillTagger CPAN package and based on the Penn Treebank tokenizer sed script. From there you can calculate idf as well:
my %idf;
for my $show (keys %show_data) {
    for my $ep (@{ $show_data{$show} }) {
        $ep->{wordcounts}   = count_words_in_titles( @{ $ep->{titles} } );
        $ep->{maxwordcount} = 0;
        for (keys %{ $ep->{wordcounts} }) {
            $ep->{maxwordcount} = $ep->{wordcounts}{$_}
                if $ep->{wordcounts}{$_} > $ep->{maxwordcount};
            $idf{$_}++;    # for now, the number of episodes each word appears in
        }
    }
}

# Turn those document counts into inverse document frequencies.
for (keys %idf) {
    $idf{$_} = log( $total_episode_count / $idf{$_} );
}
From there, you continue in the same vein and multiply the two values together for each word in each episode.
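In sketch form, assuming the structures above and using the raw counts as the term frequency (the maxwordcount computed earlier could just as well be used to normalize them), the final step looks something like this (the tfidf key is my own naming):

for my $show (keys %show_data) {
    for my $ep (@{ $show_data{$show} }) {
        for my $word (keys %{ $ep->{wordcounts} }) {
            # tf-idf = term frequency × inverse document frequency
            $ep->{tfidf}{$word} = $ep->{wordcounts}{$word} * $idf{$word};
        }
    }
}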
To visualize the output, I used the Wordle engine to create word clouds in which the size of the word reflected its tf-idf score.
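If I remember Wordle’s advanced mode correctly, it accepts a weighted word list with one word:weight pair per line, so generating the input for an episode is a short loop along these lines:

# Emit word:weight lines for pasting into Wordle's advanced mode,
# assuming the tfidf scores sketched above.
printf "%s:%.2f\n", $_, $ep->{tfidf}{$_} for keys %{ $ep->{tfidf} };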
Overall, I’m pleased with the results. I’ve put together a set of word clouds for the episodes of the now-ended, much-missed Hypercritical podcast. Since I’ve listened to many of them, the clouds serve as a nice, quick visual reference that helps me find specific episodes. I also think some of them capture the character of the episode particularly well.