Friday, January 25, 2013

Remembering Aaron Swartz - Raw Thought

Be curious. Read widely. Try new things. I think a lot of what people call intelligence just boils down to curiosity.
 - Aaron Swartz

Aaron Swartz (1986 - 2013)
Last week, I was saddened to hear the news of Aaron Swartz's death. Although I never met Aaron in person, I was a regular follower of his blog and work for many years, and both led me to deeply admire and respect him. His tireless work against the passing of the SOPA bill was of significant interest to me because of the bill's serious implications. As a programmer his talents were legendary, and as an activist his tireless efforts were admirable. His passing is indeed a great loss. The Economist has a nice obituary here that perfectly reflects this sentiment.

As an avid reader of his blog, I am deeply saddened to know that there will not be any more updates to the site. So I decided to collect all of his blog posts over the years and compile them into a PDF/ebook. This past week, I wrote a simple Python script that crawled Aaron's weblog, retrieved the posts one by one, and compiled them into a single file.
The crawler was written rather crudely and in haste, which makes it prone to bugs. As a matter of fact, the program crashed three times during the crawl because it encountered URLs that linked to posts that no longer existed. Each time, I noted the offending URL and slightly tweaked the program to resume from the next URL in the list. I also ran into some encoding problems along the way.
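A crawl like this does not have to stop at a dead link. The following is a minimal sketch of one way to skip missing posts and keep going. It is not the script I used for the crawl: the names `fetch_post` and `crawl` are made up for illustration, and it assumes Python 3 and pages that are mostly UTF-8.

```python
import urllib.request


def fetch_post(url, timeout=10):
    """Download one page and decode it to text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
    # Assumption: pages are mostly UTF-8. errors="replace" swaps any
    # undecodable bytes for a placeholder instead of raising.
    return raw.decode("utf-8", errors="replace")


def crawl(urls, fetch=fetch_post):
    """Fetch every URL in order, recording failures instead of crashing."""
    posts, failed = [], []
    for url in urls:
        try:
            posts.append((url, fetch(url)))
        except (OSError, ValueError) as exc:
            # urllib's URLError and HTTPError (e.g. a 404 for a deleted
            # post) are OSError subclasses; ValueError covers malformed URLs.
            failed.append((url, str(exc)))
    return posts, failed
```

With a loop like this, the URLs that no longer exist end up in `failed` for later inspection, and the crawl continues with the next URL without any manual restart.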

The crawler at work

The entire crawling and retrieving process took about 9 minutes to complete. The output was first written to text files. Because of the three crashes, four text files were generated, which I then manually combined into one text file and exported to PDF using TextMate. I know this entire process sounds cumbersome, but in the end it felt worth it.
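The manual merge could also be scripted. Here is a small sketch, assuming the partial outputs are plain UTF-8 text files. The function name `combine_text_files` and the file names in the usage note are hypothetical, not the ones from my crawl.

```python
from pathlib import Path


def combine_text_files(parts, destination, encoding="utf-8"):
    """Concatenate the files in `parts`, in order, into `destination`."""
    with open(destination, "w", encoding=encoding) as out:
        for part in parts:
            # errors="replace" keeps going past any badly encoded bytes.
            text = Path(part).read_text(encoding=encoding, errors="replace")
            out.write(text)
            # Make sure the next part starts on a fresh line.
            if not text.endswith("\n"):
                out.write("\n")
    return destination
```

For example, `combine_text_files(["part1.txt", "part2.txt", "part3.txt", "part4.txt"], "raw_thought.txt")` would produce one file ready to be exported to PDF.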

The crawler retrieved 449 posts, including the comments, and the compiled output stretches to more than 3000 pages in total. All the files, including the source code, the four text files and the single PDF, can be found here. Feel free to download it, share it, tweak it and improve it (I would appreciate it if you shared your improvements). For ebooks, I recommend using an online converter. I hope to refine it as time permits.

Personally, this was the first time that I wrote an entire functioning program all by myself that actually solves a real-world problem that I encountered. I couldn't have done it for a better cause. I am going to miss reading "Raw Thought" and the world is going to miss Aaron Swartz. This is my #pdftribute to Aaron.

Update: I meant to add a gist of my crawler code, but the dynamic views in the new Blogger do not support it yet. Instead, you can find the direct download links here:

Update 2: Since I changed the blog template, I am able to embed the code directly:


2 comments:

  1. Hey, there isn't a licence in your repository. It is advisable that you put one there - http://programmers.stackexchange.com/questions/181167/licence-all-code

  2. Hey, I forked and made some changes. Also, uploaded the result of my fork to the Internet Archive. Check it out here - https://github.com/elssar/aaronsw_RawThought
