Through the Eyes of the Jackals

Example cloud generated from title submissions for The Incomparable episode 108.

With all the great podcasting content produced these days, I often end up wishing for easier ways of finding past episodes. Memory does not always serve well enough, and clicking through episode descriptions can be tedious (and unreliable for podcasts that span many topics in a single show).

Clearly it would be great to have some way of searching or browsing podcast episodes, but it’s equally clear that it’s hard to do without accurate transcriptions.

As I’ve mentioned before (in my only other post so far), live listeners of shows on the 5by5 network can submit title suggestions via Showbot, which was created by Jeremy Mack. Suggestions are almost always quotes from the shows, and during popular shows the submissions are so frequent it seems almost like a live transcript. Of course it’s not—submissions are unsurprisingly biased toward humorous or silly quotes—but I wondered if they could be used in some way.

Basing a search engine on the title suggestions would be frustrating. You would have no way of knowing if results didn't appear because the titles didn’t happen to mention those topics, or if they were truly not a part of a show. However, maybe they could be useful for browsing instead. You’d at least see how an episode was viewed through the eyes of the chatroom jackals.

I took a list of titles recorded by Showbot (the set I had on hand covered episodes from Showbot’s inception at the end of June 2011 through July 2012) and wrote a Perl script to identify the most relevant words for each episode.

After doing a bit of research, it looked like the tf-idf statistic would be a decent measure of the relevance of an individual word to an episode. Tf-idf is the product of two statistics: the term frequency (tf) and the inverse document frequency (idf). The term frequency is just the number of times a word appears in a document (in this case, how many times it appears among all the title submissions for a given episode). The inverse document frequency is calculated by dividing the number of documents (or episodes) by the number of documents the word appears in and then taking the logarithm of that ratio.

You end up with a statistic that is high if the word appears a lot in a document, but that is balanced out if the word is just common and appears in many documents. That way, you don’t end up with “the” and “a” being your top words. You can also avoid that result by manually removing so-called stop words from the text (which I also did).
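Written out, the definitions above look roughly like this. The symbols are my own shorthand rather than anything standard to the script: N is the total number of episodes, and n_w is the number of episodes whose title submissions contain the word w.

```latex
\[
\mathrm{tf}(w, e) = \text{number of times } w \text{ appears in the title submissions for episode } e
\]
\[
\mathrm{idf}(w) = \log \frac{N}{n_w},
\qquad
\text{tf-idf}(w, e) = \mathrm{tf}(w, e) \cdot \mathrm{idf}(w)
\]
```

A word that shows up in every episode has n_w = N, so its idf is log(1) = 0 and its tf-idf is zero no matter how often it appears.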

Getting the word frequency is pretty straightforward in Perl:

use Lingua::Stopwords qw( getStopWords );

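sub count_words_in_titles {
    my @titles = @_;
    local $_;

    # Replace basic punctuation with spaces, lowercase, and split into tokens.
    my @words = map { s/[.!?,*]/ /g; tokenize(lc($_)) } @titles;

    # Drop English stop words, plus the clitics the tokenizer splits off.
    my $stopwords = getStopWords('en');
    @{$stopwords}{qw( 's n't 'll 're 'd 've )} = (1) x 6;
    @words = grep { !$stopwords->{$_} } @words;

    # Tally how often each remaining word occurs.
    my %word_counts;
    $word_counts{$_}++ for @words;

    return \%word_counts;
}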

I used a tokenize() function taken from the Lingua::BrillTagger CPAN module, which is based on the Penn Treebank tokenizer sed script. From there you can calculate idf as well:

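my %idf;                        # holds document counts first, then idf values
my $total_episode_count = 0;    # declared here so the snippet stands alone

for my $show (keys %show_data) {
    for my $ep (@{ $show_data{$show} }) {
        $total_episode_count++;
        $ep->{wordcounts} = count_words_in_titles( @{ $ep->{titles} } );

        # Track the highest word count seen in this episode.
        $ep->{maxwordcount} = 0;

        # Dereference explicitly: keys on a bare hashref fails on newer Perls.
        for (keys %{ $ep->{wordcounts} }) {
            $ep->{maxwordcount} = $ep->{wordcounts}{$_} if $ep->{wordcounts}{$_} > $ep->{maxwordcount};

            # Each word counts once per episode it appears in.
            $idf{$_}++;
        }
    }
}

# idf = log( number of episodes / number of episodes containing the word )
for (keys %idf) {
    $idf{$_} = $total_episode_count / $idf{$_};
    $idf{$_} = log($idf{$_});
}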

From there, you continue in the same vein, multiplying the tf and idf values together for each word in each episode.
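That last step might look something like the sketch below. It is an illustration, not the original script: tfidf_for_episode is a made-up helper name, the toy counts are invented, and dividing the term frequency by maxwordcount is only an assumed normalization (the snippet above records that value, but nothing here says how it was used).

```perl
use strict;
use warnings;

# Hypothetical helper: combine one episode's word counts with the global idf
# values. $ep is a hashref with {wordcounts} and {maxwordcount}; $idf is a
# hashref of word => idf. Scaling tf by maxwordcount is an assumption.
sub tfidf_for_episode {
    my ($ep, $idf) = @_;
    my %score;
    for my $word (keys %{ $ep->{wordcounts} }) {
        my $tf = $ep->{wordcounts}{$word};
        $tf /= $ep->{maxwordcount} if $ep->{maxwordcount};
        $score{$word} = $tf * ($idf->{$word} // 0);
    }
    return \%score;
}

# Toy data for two episodes: "apple" appears in one, "show" in both.
my %idf = ( apple => log(2 / 1), show => log(2 / 2) );
my $ep  = { wordcounts => { apple => 4, show => 2 }, maxwordcount => 4 };
my $scores = tfidf_for_episode($ep, \%idf);

# Print words from most to least relevant.
for my $word (sort { $scores->{$b} <=> $scores->{$a} } keys %$scores) {
    printf "%-6s %.3f\n", $word, $scores->{$word};
}
```

In the toy data, "show" scores zero because it appears in both episodes, which is the balancing effect described earlier.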

To visualize the output, I used the Wordle engine to create word clouds in which the size of the word reflected its tf-idf score.

Overall, I’m pleased with the results. I’ve put together a set of word clouds for the episodes of the now-ended, much-missed Hypercritical podcast. Since I’ve listened to many of them, the clouds serve as a nice, quick visual reference that helps me find specific episodes. I also think some of them capture the character of the episode particularly well.

Episodes 67 to 69.

From episode 42: The Wrong Guy. "Textbook!" 

Show Titles

One of the fun parts of listening to 5by5 shows is helping pick show titles. Jeremy Mack created the Showbot, which lets members of the 5by5 chat room suggest titles as the shows run live (it also lets anyone vote for their favorite suggestions).

Of course, the hosts have the final say over what title gets used, and, as you can imagine, different hosts have different preferences for show titles. For example, John Siracusa, host of Hypercritical, will only consider things that are actually said on his show as possible titles. Inspired by Kieran Healy’s recent analysis of show durations, I decided to look at naming trends across several 5by5 shows and see how they have changed over time.

The main variable I looked at is the show title’s “originality.” That’s in scare quotes because I’m defining originality very narrowly -- here, it is just a reflection of the number of Google hits for the exact title phrase. Since the data span many orders of magnitude, I log-transform the number of hits (adding 0.5 to deal with zeros) and call it an “originality index.” Note that numerically smaller indices are more original, so I have reversed the y-axis on the graphs to put the indices that reflect more originality at the top.
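As a rough sketch, the index is a one-liner. The function name originality_index is made up for illustration, and the base of the logarithm is an assumption: base 10 would be consistent with the 0.81 quoted later for “The Bridges of Siracusa County” if that search returned 6 hits, but that is an inference rather than something stated here.

```perl
use strict;
use warnings;

# Hypothetical helper: log-transformed hit count, with 0.5 added so that a
# title with zero hits still has a defined logarithm. Base 10 is assumed.
sub originality_index {
    my ($hits) = @_;
    return log($hits + 0.5) / log(10);
}

# A few illustrative hit counts (not real search results).
printf "%d hits -> index %.2f\n", $_, originality_index($_) for (0, 6, 1000);
```

Smaller values mean fewer hits, which is why the graphs reverse the y-axis.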

Unfortunately, the number of results reported by Google when you perform a standard search is not very reliable, as described by Randall Munroe of xkcd. Therefore, I used Google’s Custom Search API instead and took the number returned in the totalResults field as a data point. It also allowed me to easily automate the collection of data. There are still issues with this approach: the Custom Search API documentation helpfully notes that

The totalResults property in the objects above identifies the estimated total number of results for the search, which may not be accurate.

Another problem with using Google results is that you’d want to exclude results that refer to the episodes themselves (which in principle should not affect the index). As a crude way of trying to do that, I appended -5by5 to all searches, which eliminates many, but not all, episode-related results.
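Here is a sketch of how a query might be put together and how the count could be read back out. It is not the collection script itself. The helper names are invented, wrapping the title in quotes to force an exact-phrase match is an assumption, YOUR_KEY and YOUR_CX are placeholders for an API key and a search engine ID, and the sample response is hand-written with a made-up number so the example runs without touching the network.

```perl
use strict;
use warnings;
use JSON::PP qw( decode_json );

# Hypothetical helper: exact-phrase query that excludes pages mentioning 5by5.
sub build_query {
    my ($title) = @_;
    return qq{"$title" -5by5};
}

# Minimal percent-encoding for a query-string value.
sub uri_encode {
    my ($text) = @_;
    utf8::encode($text);
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Build the request URL; $key and $cx are placeholders supplied by the caller.
sub search_url {
    my ($title, $key, $cx) = @_;
    return 'https://www.googleapis.com/customsearch/v1'
         . '?key=' . uri_encode($key)
         . '&cx='  . uri_encode($cx)
         . '&q='   . uri_encode(build_query($title));
}

# Pull the estimated hit count out of a JSON response body.
sub total_results {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return $data->{queries}{request}[0]{totalResults};
}

# Hand-written sample response with a made-up count, for illustration only.
my $sample = '{"queries":{"request":[{"totalResults":"1234"}]}}';
print search_url('The Wrong Guy', 'YOUR_KEY', 'YOUR_CX'), "\n";
print total_results($sample), "\n";
```

Fetching the URL is left out on purpose; any HTTP client would do, and the returned number is an estimate in any case.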

Clearly the methodology has some flaws. Nevertheless, the results are interesting.

You can see that many shows started out with more commonplace titles, but quickly settled into using rather unusual ones. Hypercritical and The Critical Path have used less original titles than other shows, although this has varied over time. (It seems that we are experiencing a recent drop with Hypercritical, though -- maybe the creative juices have all been channeled into a certain Mac OS X review.)

One factor that could affect the originality index is title length; you would probably expect longer titles to be more unusual than shorter ones, all else being equal. We do see that in the data, although the relationship between title length and originality index varies somewhat by show. For example, the indices for Back to Work titles are less dependent on the length than those of Hypercritical or Build and Analyze, but Build and Analyze titles tend to be more original than Hypercritical titles of the same length (and both more original than Critical Path titles).

The Incomparable tends to have longer titles than the other shows, most of which average around four words per title. The Incomparable titles might be longer because that podcast uses the Showbot less often, and the Showbot limits suggestions to 40 characters.

Now, I’ve focused on the originality index here, but that isn’t the only factor that makes a good or apt title. There are clear counterexamples, like Hypercritical #42. This episode uses the fairly common phrase “The Wrong Guy” (index = 5.57) as its title. But the episode is about Walter Isaacson’s biography of Steve Jobs, and “The Wrong Guy” nicely sums up John’s criticism of the book.

On the other hand, “The Bridges of Siracusa County” (index = 0.81) is tough to beat...