Thursday, September 25, 2008

Donald Knuth And The Complexity Of Songs

Donald Knuth is a living legend among computer scientists. His monumental work-of-life "The Art Of Computer Programming" (I was never able to fully follow it) is a standard reference/textbook/work of art for computer science. Mr. Knuth took a new, immature geek-only hobby and transformed it into a solid and complete scientific discipline.

Among his contributions, the systematic analysis of the complexity of algorithms, really stands out. As a concept complexity is very familiar to developers. It refers to how a specific algorithm or algorithmic solution to a problem scale to the 'problem size'.

If his work is a reason to admire him, his humor is a reason to love him. Recently, I had the most splendid time reading his hilarious paper titled "The Complexity Of Songs" ! Here is the story.

A song has some lyrics which we have to remember in order to sing it. Humans are trying to learn lyrics of length (space complexity) to sing a song of length n. You would expect mathematically that s ~ n, but Knuth investigates all human 'inventions' to reduce the song space complexity!!

After a series of 'theorems', Knuth proves the existence of songs with s= a.n, a<1, s="O(sqrt(n))," s="O(logn)">and finally.....s=O(1)!! Yes, you read well. 

Knuth argues the first step was the invention of the refrain, the repeated part of a song. If the song has m verses of length and the refrain has length R, then the song length is roughly n = V.m+R.m and the complexity s = m.V+R. Hence we have a reduction of V/V+R in song complexity. So, a refrain is also a tool to save some memory space to remember the song!

Remember the old song "Old Mc'Donald had a farm, Ei-gh,Ei-gh,Oh!" ? Well this has a complexity s = O(sqrt(n)). It is much like the Greek song "Οταν θα πας κυρα μου στο παζαρι...". In this pattern, each verse includes all previous verses. For verses n = o(m^2) and s = O(m), so s = O(sqrt(n)). This means that it is much easier to remember! That's why they are most suitable for children! A same pattern can yield a log complexity.

Now to the best part, Theorem 2 is the real killer! The introductory text goes as follows:"However, the advent of modern drugs has led to demands for still less memory, and the ultimate improvement of Theorem 1 has consequently just been announced:

Theorem 2 (Donald Knuth)
There exist arbitrarily long song of complexity O(1)

PROOF. Define U = 'uh huh','uh huh' and the k-th verse Vk = 'That's the way', U, 'I like it', U . This is a constant V. Then the song V^k, completes our proof!! Oh dear!

This last one really cracked me up!!!   :)

Friday, September 5, 2008

A One-Line Code on How To Parse a URI

Uniform Resource Identifiers(URI) is perhaps synonymous to the Internet itself. As the name implies it is used to identify something on the net. Typically a URI is a different than a URL(Uniform Resource Location) but in most cases we use them interchangeably.

Mathematically URI=URL+URN, which means an entity can be identified either by its location(URL) or its name(URN). See the relevant Wikipedia page for details.

Every URI has the following syntax:
(scheme)://(authority)(/path)?(query)#(fragment)
with query and fragment being optional. So in a URI like http://www.example.com/comment?id=1#1123, we have:

scheme="http"
authority="www.example.com"
path="/comment"
query="id=1"
fragment="1123"

A very common task is when we have to parse a URI into its components. The most usual case is URL normalization, which is essential for search crawlers. During this process, a URI is trinsformed into an equivalent canonical form, such that 2 different canonical forms cannot refer to the same resource. For example the URL http://www.example.com and the URL http://www.example.com:80 , are different but refer to the same location on the net. This greatly helps spiders retrieving every time pages they have not visited before. You can find a good article about URL normalization here.

To parse a URI using a regular expression is really easy. In fact the method to do this is given by Tim Berner's Lee (photo) himself in the RFC for the URI standard. Using Perl, if $uri has the URI then by applying
$uri =~ /^(([^:\/\?#]+):)?(\/\/([^\/\?#]*))?([^\?#]*)(\?([^#]*))?(#(.*))?/;

Then:
scheme=$2
authority=$4
path=$5
query=$7
fragment=$9

That's it! All languages (C#, Java etc) has an almost identical syntax and it is straightforward to port it there given that $n refers to the n-th group.