Updating the TEDtalk Corpus

Today’s task is to determine which talks have been added since the initial download of talks in May 2016 both in order to increase the size of the corpus as well as to make sure that our procedures for acquiring and cleaning data have been fully documented. To get started, I read my own documentation for 2016’s effort, which contained a link to the Google Sheet where the data is housed.

I downloaded the Sheet as a CSV file and compared it to the one from May 2016. A couple of issues emerged immediately: First, neither FileMerge, part of Xcode, nor DiffMerge handle files this size very well. Both are slowww. Second, the columns have moved and changed names:

May 2016: 18,id,/,Speaker,Name,Short Summary,Event,Duration,Publish date,
May 2018: Talk ID,public_url,speaker_name,headline,description,event,duration,language,published,tags

For the purpose of comparison, I only need ID and url, so I think I’ll make a copy of both files and trim them to those columns.… On closer inspection, the URLs have changed. In 2016, the URLs were by the ID number — http://www.ted.com/talks/view/id/53 — but now they are a blend of author and title: https://www.ted.com/talks/majora_carter_s_tale_of_urban_renewal. So I’ll go with ID and speaker for comparison but keep the URL so I can use that for the download list.

Kaleidoscope can handle files this size and its viewing options are pretty nice — and it’s affordably priced, if I need to keep it. (It tracks 76 changes, but I can see that some of those changes are simply some internal shifting that the applications doesn’t follow. I’ll take the automagic where I can get it, and work where I don’t.)

Slight change of plan: I duplicated the files and I am going to sort by ID number to see if that helps. Now Kaleidoscop is showing 412 changes. That doesn’t work. Some talks from the earlier CSV do not have IDs.

I considered a number of solutions to the problem, but I returned to the original diff, where Kaleidoscope tracked 76 changes and inspected the diff by hand. Most of the changes were lines shifted up or down, so I resolved those by hand. That left three large blocks of lines at the end that I copied and pasted into a new CSV as well as the following outliers listed by ID and speaker:

- 25, Tan
- 1676, Davis
- 1923, Cameron
+ 2386, n/a, year in ideas
+ 2451, Gerald
+ 2464, Torvalds
- n/a, Evans
+ 925, Shaw

Tomorrow I will see if the wget code still works.

Leave a Reply