Pragdave: April 2007

Thursday, April 19, 2007

Test-First Word Wrap in Erlang

I’m continuing to play with Erlang. This week, I needed to write some code that extracts information from a bunch of source files. Part of the output was supposed to be a list of strings, nicely word wrapped to fit on a terminal screen.

I decided to experiment with using a test-first approach to developing the word wrap function. I wrote unit tests that describe the functionality I want, and then wrote some code to make the tests run successfully.

To do that, I used the new EUnit-2.0 unit testing framework. It’s still under development, but it seems to work nicely (although I’d love for it to have an assert_equal method).

Installing EUnit

First, we’ll install EUnit from their Subversion repository:

  [dave/Erlang/common] svn co http://svn.process-one.net/contribs/trunk/eunit
  A    eunit/Emakefile
  A    eunit/sys.config
  A    eunit/include
   :     :     :
  [dave/Erlang/common] cd eunit
  [Erlang/common/eunit] make
   :     :     :

Now we need to add the EUnit library into your default Erlang path. Edit the file .erlang in your home directory (create it if it’s not there already) and add the line

  code:add_pathz("/Users/dave/Erlang/common/eunit/ebin").

(You’ll need to change the path to reflect the location where you downloaded EUnit. Remember to add the /ebin part at the end. And, yes, the “z” is really path of that line.)

EUnit In 60 Seconds

EUnit basically lets you run assertions against Erlang code. You can use it in a number of different modes. In our case we’ll include the tests in the same file that contains the code we’re testing. This is very convenient, but longer term it has a downside—anyone using your code will also need to have EUnit installed, even if they don’t want to run the tests. So, later we can split the tests into their own file.

The key to using EUnit is including the library in the source file containing the tests. That’s as simple as adding the line:

  -include_lib("eunit/include/eunit.hrl").

Any module that includes this line will automatically have a new function called test/0 defined. Thistest function will automatically run all the tests in the module.

So, how do you define tests? As with any xUnit framework, you can write tests by defining regular methods that follow a naming convention. With EUnit, the convention is that the function name should end _test. The EUnit convention is that any test function that returns is considered to have completed successfully; any function that throws an exception is a failure. It’s common to use pattern matching as a kind of assertion:

  length_test()  ->  3 = length("cat").
  reverse_test() ->  [3, 2, 1] = lists:reverse([1,2,3]).

However, for simple tests, it’s easier to write a test generator. This is a function whose name ends_test_ (with a final underscore). Test generators return a representation of the tests to be run. In our case, we’ll return a list of tests created using the EUnit _assert macro. We can rewrite our previous two test functions using a test generator called demo_test_:

  demo_test_() -> [
    ?_assert(3 == length("cat")),
    ?_assert([3, 2, 1] == lists:reverse([1,2,3]))
  ].

Here the generator returns an array containing two tests. The tests are written using the _assertmacro. This takes a boolean expression (rather than a pattern) which is why we’re now using two equals signs.

Word Wrap

My application needs a library function which takes an array of strings and joins them together to form an array of lines where no line is longer than a given length. We’ll write a function wrap/1 in a module called text. Let’s start with a basic test. If you wrap no words, you get a single empty line.

  -module(text).
  -export([wrap/1]).

  -include_lib("eunit/include/eunit.hrl").


  wrap_test_() -> [
    ?_assert(wrap([]) == [""])
  ].

This won’t compile: the wrap/1 function hasn’t been defined. Getting it to pass this single test isn’t hard:

  wrap([]) ->
      [ "" ].

We can run this in the Erlang shell:

  1> c(text).
  {ok,text}
  2> text:test().
    Test successful.
  ok

Let’s add the next test. Wrapping a single short word should result in a single line containing that word.

  wrap_test_() -> [
    ?_assert(wrap([]) == [""]),
    ?_assert(wrap(["cat"]) == ["cat"])
  ].

The tests now fail when we run them: our wrap function doesn’t yet know how to wrap strings:

  1> c(text).
  {ok,text}
  2> text:test().
  text:9:wrap_test_...*failed*
  ::{error,function_clause,
           [{text,wrap,[["cat"]]},{text,'-wrap_test_/0-fun-2-',0}]}

  =======================================================
    Failed: 1.  Aborted: 0.  Skipped: 0.  Succeeded: 1.  error

Let’s see if we can fix this. We’ll add a second version of the wrap/1 function that knows how to wrap a single string. Do do this, we’ll use a key feature of Erlang.

Pattern Matching and Function Definitions

In the same way that Erlang uses pattern matching when evaluating the = operator, it also uses pattern matching when invoking functions. This is often illustrated with an inefficient implementation of a function to calculate the nth term in the Fibonacci sequence (the sequence that goes 0, 1, 1, 2, 3, 5, …, where each term is the sum of the preceding two terms).

  -module(fib).
  -export([fib/1]).

  fib(0) -> 1;
  fib(1) -> 1;
  fib(N) -> fib(N-1) + fib(N-2).

Here we have three definitions of the same function (note how they’re separated by semicolons, and terminated with a period). The first two match when the function is called with a parameter or zero or one. The third is called with any other parameter.

We can use this to define two versions of our wrap/1 function, one for the case where it is called with no words in the list, the other when we have a single string.

  wrap([]) ->
    [ "" ];

  wrap([String]) ->
    [ String ].

The tests now pass.

  9> text:test().
    All 2 tests successful.

Add a test which wraps two strings, though, and our tests again fail:

  wrap_test_() -> [
    ?_assert(wrap([]) == [""]),
    ?_assert(wrap(["cat"]) == ["cat"]),
    ?_assert(wrap(["cat", "dog"]) == ["cat dog"])
  ].

  text:10:wrap_test_...*failed*
  ::{error,function_clause,
           [{text,wrap,[["cat","dog"]]},{text,'-wrap_test_/0-fun-4-',0}]}

We’re going to have to do some work. For a start, we’re going to have to rewrite our wrap methods to give them somewhere to store the result. Then they’re going to have to recurse over the input parameter list, adding each word to the output until the input is exhausted.

Building an Output List

We’re going to do two things here. The externally visible function, wrap/1, is simply going to invoke an internal function, wrap/2, passing it the word list and an addition parameter. This second parameter is the list that holds the results. Remember that in Erlang the number of arguments is part of a function’s signature, so wrap/1 and wrap/2 are separate functions.

Then we’re going to use our parameter matching trick to define two variants of wrap/2:

  % This is the exported function: it passes the initial
  % result set to the internal versions
  wrap(Words) ->
    wrap(Words, [""]).

  % When we run out of words, we're done    
  wrap([], Result) ->
    Result;

  % Otherwise append the next word to the result
  wrap([NextWord|Rest], Result) ->
    wrap(Rest, [ Result ++ NextWord]).

The second version of wrap/2 uses a neat feature of Erlang list notation. The pattern [NextWord|Rest]matches a list whose head is matched with NextWord and whose tail is matched with Rest. If we invoked this function with:

  wrap([ "cat", "dog", "elk" ], Result).

then NextWord would be set to "cat" and Rest would be set to ["dog", "elk"].

Unfortunately, this fails our test:

  1> c(text).    
  {ok,text}
  2> text:test().
  text:10:wrap_test_...*failed*
  ::{error,{assertion_failed,[{module,text},
                              {line,10},
                              {expression,
                                  "wrap ( [ \"cat\" , \"dog\" ] ) == [ \"cat dog\" ]"},
                              {expected,true},
                              {value,false}]},
           [{text,'-wrap_test_/0-fun-4-',0}]}

  =======================================================
    Failed: 1.  Aborted: 0.  Skipped: 0.  Succeeded: 2.

This is where I wish EUnit had an assert_equals function that could report the actual and expected values. However, I can still invoke my wrap method from the shell to see what it’s returning. (Stop press: EUnit does have the equivalent, and I didn’t notice it. It’s called assertMatch(Expected, Actual). Sorry.)

  3> text:wrap(["cat", "dog"]).
  ["catdog"]

Oops. We forgot the space between words. We need to add this when we add a word to an existing line, but only if that line is not empty. See how we use pattern matching to distinguish a list containing an empty line from one containing a non-empty one.

  % This is the exported function: it passes the initial
  % result set to the internal versions
  wrap(Words) ->
    wrap(Words, [""]).

  % When we run out of words, we're done    
  wrap([], Result) ->
    Result;

  % Adding a word to an empty line
  wrap([NextWord|Rest], [ "" ]) ->
    wrap(Rest, [ NextWord]);

  % or to a line that's already partially full  
  wrap([NextWord|Rest], [ CurrentLine ]) ->
    wrap(Rest, [ CurrentLine ++ " " ++ NextWord]).

Now our tests pass again:

  1> c(text).
  {ok,text}
  2> text:test().              
    All 3 tests successful.
  ok

When Clauses

For testing purposes, let’s assume that we wrap lines longer than 10 characters. That means that if we give our method the strings “cat”, “dog”, and “elk”, we’ll expect to see two lines in the output (because “cat<space>dog<space>elk” is 11 characters long). Time for another test.

  wrap_test_() -> [
    ?_assert(wrap([]) == [""]),
    ?_assert(wrap(["cat"]) == ["cat"]),
    ?_assert(wrap(["cat", "dog"]) == ["cat dog"]),
    ?_assert(wrap(["cat", "dog", "elk"]) == ["cat dog", "elk"])
  ].

We don’t even have to run this to know it will fail: our wrap function knows nothing about line lengths. Time for some code. We’re now going to have to treat out result as a list or strings, rather than a single string. We’re also going to have to deal with the case where there’s not enough room in the current output line for the next word. In that case we have to add a new empty line to the function’s result and put the word into it.

  wrap(Words) ->
    wrap(Words, [""]).

  % When we run out of words, we're done    
  wrap([], Result) ->
    Result;

  % Adding a word to an empty line
  wrap([NextWord|Rest], [ "" | PreviousLines ]) ->
    wrap(Rest, [ NextWord | PreviousLines ]);

  % or to a line that's already partially full  
  % there are two cases:
  % 1. The word fits
  wrap([NextWord|Rest], [ CurrentLine | PreviousLines ]) 
    when length(NextWord) + length(CurrentLine) < 10 ->
    wrap(Rest, [ CurrentLine ++ " " ++ NextWord | PreviousLines ]);

  % 2. The word doesn't fit, so we create a new line    
  wrap([NextWord|Rest], [ CurrentLine | PreviousLines ]) ->
    wrap(Rest, [ NextWord, CurrentLine | PreviousLines ]).


This introduces a new Erlang feature—you can further qualify the pattern
matching used to determine which function is selected using a when
clause. In our case, the parameter patterns for the last two definitions of the
wrap/2 method are the same. However, the first of them has a
whenclause

. This means that this function will only be selected if the length of the current output line plus the length of the next word is less that our maximum line length (hard coded to 10 in this case).

Unfortunately, our tests fail:

1> text:test().              
text:11:wrap_test_...*failed*
::{error,{assertion_failed,[{module,text},
                            {line,11},
                            {expression,
                                "wrap ( [ \"cat\" , \"dog\" , \"elk\" ] ) == [ \"cat dog\" , \"elk\" ]"},
                            {expected,true},
                            {value,false}]},
         [{text,'-wrap_test_/0-fun-6-',0}]}

=======================================================
  Failed: 1.  Aborted: 0.  Skipped: 0.  Succeeded: 3.
error

Running text:wrap manually shows why:

2> text:wrap(["cat", "dog", "elk"]).
["elk","cat dog"]

We’re building the result list backwards, adding each successive line to the front of it, rather than the back. We could fix that by changing the way we add lines to the list, but, like Lisp, Erlang favors list manipulations that work on the head of the list, not the last element. It turns out to be both idiomatic and more efficient to build the result the way we’re doing it, and then to result the resulting list before returning it to our caller. That’s a simple change to our exported wrap/1 function.

  wrap(Words) ->
    lists:reverse(wrap(Words, [""])).

Now all out tests pass.

So, let’s review our tests. We have covered an empty input set, a set that fits on one line, and a set that (just) forces the result to take more than one line. Are there any other cases to consider? It turns out that there’s (at least) one. What happens if a word is too long to fit on a line on its own? Let’s see:

  wrap_test_() -> [
    ?_assert(wrap([]) == [""]),
    ?_assert(wrap(["cat"]) == ["cat"]),
    ?_assert(wrap(["cat", "dog"]) == ["cat dog"]),
    ?_assert(wrap(["cat", "dog", "elk"]) == ["cat dog", "elk"]),
    ?_assert(wrap(["cat", "dog", "hummingbird", "ibix"]) == ["cat dog", "hummingbird", "ibix"])
  ].

Run the tests:

1> c(text).                         
{ok,text}
2> text:test().
  All 5 tests successful.
ok

Wrapping Up

Using simple tests like this are a great way of designing some code. They’re also a goo way to teach yourself the language. As I was writing this code, I used the tests to test my understanding of Erlang semantics.

Update: Kevin Smith has produced a wonderful screencast on EUnit as episode 5 of his Erlang by Example series.

The Final Program

-module(text).
-export([wrap/1]).

-include_lib("eunit/include/eunit.hrl").

wrap_test_() -> [
  ?_assert(wrap([]) == [""]),
  ?_assert(wrap(["cat"]) == ["cat"]),
  ?_assert(wrap(["cat", "dog"]) == ["cat dog"]),
  ?_assert(wrap(["cat", "dog", "elk"]) == ["cat dog", "elk"]),
  ?_assert(wrap(["cat", "dog", "hummingbird", "ibix"]) == ["cat dog", "hummingbird", "ibix"])
].

% This is the exported function: it passes the initial
% result set to the internal versions
wrap(Words) ->
    lists:reverse(wrap(Words, [""])).

% When we run out of words, we're done  
wrap([], Result) ->
    Result;

% Adding a word to an empty line
wrap([ NextWord | Rest ], [ "" | PreviousLines ]) ->
    wrap(Rest, [ NextWord | PreviousLines ]);

% Or to a line that's already partially full. There are two cases:
% 1. The word fits
wrap([ NextWord | Rest ], [ CurrentLine | PreviousLines ]) 
  when length(NextWord) + length(CurrentLine) < 10 ->
    wrap(Rest, [ CurrentLine ++ " " ++ NextWord | PreviousLines ]);

% 2. The word doesn't fit, so we create a new line  
wrap([ NextWord | Rest], [ CurrentLine | PreviousLines ]) ->
    wrap(Rest, [ NextWord, CurrentLine | PreviousLines ]).

Monday, April 16, 2007

Adding Concurrency to Our Erlang Program

In our last thrilling installment, we used Erlang to fetch a book’s title and sales rank from Amazon. Now let’s extend this to fetch the data for multiple books, first one-at-a-time, and then in parallel.

A word from our lawyers: read your Amazon Terms of Service before trying this code. You may have limitations on the number of requests you can send per second or somesuch. This code is just to illustrate some Erlang constructs.

Let’s start with the function that fetches sales ranks in series. We’ll pass it a list of ISBNs, and it will return a corresponding list of {title, rank} tuples. For convenience, let’s define a function that returns a list of ISBNS to check. (Later, we can change this function to read from a database or a file). We’re editing our file ranks.erl.

  isbns() ->
    [ "020161622X", "0974514004", "0974514012", "0974514020", "0974514039",
      "0974514055", "0974514063", "0974514071", "097669400X", "0974514047",
      "0976694018", "0976694026", "0976694042", "0976694085", "097451408X",
      "0976694077", "0977616606", "0976694093", "0977616665", "0976694069",
      "0976694050", "0977616649", "0977616657"
     ].

Users of our code call the fetch_in_series function. It uses the built-in lists:map function to convert our ISBN list into the result list.

  fetch_in_series() ->
    lists:map(fun fetch_title_and_rank/1, isbns()).

The first parameter to lists:map is the function to be applied to each element in the ISBN list. Here we’re telling Erlang to call the fetch_title_and_rank function that we defined in the first article. The/1 says to use the version of the function that takes a single argument.

We need to export this function from the module before we can call it.

  -export([fetch_title_and_rank/1, fetch_in_series/0]).

Let’s fire up the Erlang shell and try it.

  1> c(ranks).
  {ok,ranks}
  2> ranks:fetch_in_series().
  [{"The Pragmatic Programmer: From Journeyman to Master","4864"},
   {"Pragmatic Version Control Using CVS","288118"},
   {"Pragmatic Unit Testing in Java with JUnit","116011"},
   . . .

“But wait!” I hear you cry. “Isn’t Erlang supposed to be good at parallel programming. What’s with all this fetch-the-results-in-series” business?”

OK, let’s create a parallel version. We have lots of options here, but for now let’s do it the hard way. We’ll spawn a separate (Erlang) process to handle each request, and we’ll write all our own process management code to coordinate harvesting the results as these processes complete.

Erlang processes are lightweight abstractions; they are not the same as the processes your operating system provides. Unlike your operating system, Erlang is happy dealing with thousands of concurrent processes. What’s more, you can distribute these processes across servers, processors, and cores within a processor. All this is achieved with a handful of basic operations. To send a tuple to a process whose process ID is in the variable Fred, you can use:

  Fred  ! {status, ok}

To receive a message from another process, use a receive stanza:

  receive 
    { status, StatusCode } -> StatusCode
  end

There’s something cool about the receive code. See the line

    { status, StatusCode } -> StatusCode

The stuff on the left hand side of the arrow is a pattern match. It means that this receive code will only accept messages which are a two element tuple where the first element is the atom status. This particular receive stanza then strips out just the actual code part of this tuple and returns it. In general a receive stanza can contain multiple patterns, each with its own specialized processing. This is remarkably powerful.

There’s one last primitive we need. The function spawn takes a function and a set of parameters and invokes that function in a new Erlang process. It returns the process ID for that subprocess.

Let’s write a simple, and stupid, example. This module exports test/1. This function spawns a new process that simply doubles a value. Here’s the code—we’ll dig into it in a second.

  -module(parallel).
  -export([test/1]).

  test(Number) ->
    MyPID = self(),
    spawn(fun() -> double(MyPID, Number) end),
    receive
        { answer, Val } -> { "Child process said", Val }
    end.

  double(Parent, Number) ->
    Result = Number + Number,
    Parent ! { answer, Result }.

Because we’re handling all the details ourselves, we have to tell the process running the double function where to send its result. To do that, we pass it two parameters: the first is the process ID of the process to receive the result, and the second is the value to double. The second line of the double function uses the ! operator to send the result back to the original process.

The original process has to pass its own process ID to the double method. It uses the built-in self()function to get this PID. Then, on the second line of the function, it invokes spawn, passing it an anonymous function which in turn invokes double. There’s a wee catch here: it’s tempting to write:

  test(Number) ->
    spawn(fun() -> double(self(), Number) end),
    ...

This won’t work: because the anonymous function only gets executed once the new process is started, the value returned by self() will be that process’s ID, and not that of the parent.

Anyway, we can invoke this code using the Erlang shell:

  1> c(parallel).
  {ok,parallel}
  2> parallel:test(22).
  {"Child process said",44}

Back to Amazon. We want to fetch all the sales ranks in parallel. We’ll start with the top-level function. It starts all the fetcher processes running in parallel, then gathers up their results.

  fetch_in_parallel() ->
    inets:start(),
    set_queries_running(isbns()),
    gather_results(isbns()).

The first line of this method is a slight optimization. The HTTP client code that we use to fetch the results actually runs behind the scenes in its own process. By calling the inets:start method, we precreate this server process.

Here’s the code that creates the processes to fetch the sales data:

  set_queries_running(ISBNS) ->
    lists:foreach(fun background_fetch/1, ISBNS).

  background_fetch(ISBN) ->
    ParentPID = self(), 
    spawn(fun() ->
                ParentPID ! { ok, fetch_title_and_rank(ISBN) }
          end).

The lists:foreach function invokes its first argument on each element of its second argument. In this case, it invokes the background_fetch function that in term calls spawn to start the subprocess. Perhaps surprising is the fact that the anonymous function acts as a closure: the values of ParentID andISBN in its calling scope are available inside the fun’s body, even though it is running is a separate process. Cool stuff, eh?

There’s an implicit decision in the design of this code: I decided that I don’t care what order the results are listed in the returned list—I simply want a list that contains one title/rank tuple for each ISBN. I can make use of this fact in the function that gathers the results. It again uses lists:map, but bends its meaning somewhat. Normally the map function maps an value onto some other corresponding value. Here we’re simply using it to populate a list with the same number of entries as the original ISBN list. Each entry contains a result from Amazon, but it won’t be the result that corresponds to the ISBN that is in the corresponding position in the ISBN list. For my purposes, this is fine.

  gather_results(ISBNS) ->      
    lists:map(fun gather_a_result/1, ISBNS).

  gather_a_result(_) ->
    receive 
      { ok, Anything } -> Anything
    end.

Let’s run this:

  1> c(ranks).                              
  {ok,ranks}
  2> ranks:fetch_in_parallel().             
  [{"The Pragmatic Programmer: From Journeyman to Master","6359"},
   {"Pragmatic Version Control Using CVS","299260"},
   {"Pragmatic Unit Testing in Java with JUnit","131616"},
   . . .

What kind of speedup does this give us? We can use the built-in timer:tc function to call our two methods.

 timer:tc(ranks, fetch_in_parallel, []).
   {1163694, . . .
 timer:tc(ranks, fetch_in_series, []).  
   {3070261, . . .

The first element of the result in the wallclock time taken to run the function (in microseconds). The parallel version is roughly 3 times faster than the serial function.

So, do you have to go to all this trouble to make your Erlang code run in parallel? Not really. I just wanted to show some of the nuts and bolts. In reality, you’d probably use a library method, such as thepmap function that Joe wrote for the Programming Erlang book. Armed with this, you could turn the serial fetch program to a parallel fetch program by changing

  lists:map(fun fetch_title_and_rank/1, ?ISBNS).

  lib_misc:pmap(fun fetch_title_and_rank/1, ?ISBNS).

Now that’s an abstraction!

Anyway, here’s the current version of the ranks.erl program, containing both the serial and parallel fetch functions.

  -module(ranks).
  -export([fetch_title_and_rank/1, fetch_in_series/0, fetch_in_parallel/0]).
  -include_lib("xmerl/include/xmerl.hrl").

  -define(BASE_URL,
          "http://webservices.amazon.com/onca/xml?Service=AWSECommerceService" 
          "&SubscriptionId=<your ID goes here>" 
          "&Operation=ItemLookup" 
          "&ResponseGroup=SalesRank,Small" 
          "&ItemId=").

  isbns() ->
    [ "020161622X", "0974514004", "0974514012", "0974514020", "0974514039",
      "0974514055", "0974514063", "0974514071", "097669400X", "0974514047",
      "0976694018", "0976694026", "0976694042", "0976694085", "097451408X",
      "0976694077", "0977616606", "0976694093", "0977616665", "0976694069",
      "0976694050", "0977616649", "0977616657"
     ].

  fetch_title_and_rank(ISBN) ->
    URL = amazon_url_for(ISBN),
    { ok, {_Status, _Headers, Body }} = http:request(URL),
    { Xml, _Rest } = xmerl_scan:string(Body),
    [ #xmlText{value=Rank} ]  = xmerl_xpath:string("//SalesRank/text()", Xml),
    [ #xmlText{value=Title} ] = xmerl_xpath:string("//Title/text()", Xml),
    { Title, Rank }.

  amazon_url_for(ISBN) ->
    ?BASE_URL ++ ISBN.


  fetch_in_series() ->
    lists:map(fun fetch_title_and_rank/1, isbns()).


  fetch_in_parallel() ->
    inets:start(),
    set_queries_running(isbns()),
    gather_results(isbns()).

  set_queries_running(ISBNS) ->
    lists:foreach(fun background_fetch/1, ISBNS).

  background_fetch(ISBN) ->
    ParentPID = self(), 
    spawn(fun() ->
              ParentPID ! { ok, fetch_title_and_rank(ISBN) }
          end).

  gather_results(ISBNS) ->      
    lists:map(fun(_) ->
            receive 
                { ok, Anything } -> Anything
            end
          end, ISBNS).

Sunday, April 15, 2007

A First Erlang Program

One of the joys of playing at being a publisher is that I get to mess around with the technology in books as those books are getting written. Lately I’ve been having a blast with Joe Armstrong's new Erlang Book. At some point I’ll blog about the really neat way the Erlang database unifies the set-based SQL query language and list comprehensions (it’s obvious when you think about it, but it blew me away when I first read it). But I just wanted to start off with some simple stuff.

One of my standard Ruby examples is a program that fetches our book sales ranks from Amazon using their REST interface. I thought I’d try the exercise again in Erlang. In this first part we’ll write a simple Erlang function that uses the Amazon Web Services API to fetch the data on a book (identified by its ISBN). This data is returned in XML, so we’ll then use some simple XPath to extract the title and the sales rank.

My Erlang module is called ranks, and I store it in the fileranks.erl. Because I’m just playing for now, the interface to this module is simple: I’ll define a function calledfetch_title_and_rank. This returns a {title, rank} tuple. So, I started off with something like this:

  -module(ranks).
  -export([fetch_title_and_rank/1]).

  fetch_title_and_rank(ISBN) ->
    { "Agile Web Development with Rails", 1234 }.

The -module directive simply names the module. The next line tells Erlang that this module exports a function called fetch_title_and_rank. This method takes one parameter. (This idea of specifying the number of parameters seems strange until you realize that in Erlang the function signature is the function name and the parameter count. It is valid to export two functions with the same name if they take a different number of arguments—as far as Erlang is concerned they are different functions.)

Next we define the fetch_title_and_rank function. It takes a single parameter, the book’s ISBN. In Erlang, variable names (including parameters) start with uppercase letters. This takes a little getting used to. Just to make sure thinks are wired up correctly, let’s try running this code. In a command shell, I navigate to the directory containing ranks.erl and started the interactive Erlang shell. Once inside it, I compiled my module (using c(ranks).) and then invoke the fetch_title_and_rank method.

  dave[~/tmp 11:51:09] erl
  Erlang (BEAM) emulator version 5.5.4 [source] ...

  Eshell V5.5.4  (abort with ^G)
  1> c(ranks).
  ./ranks.erl:11: Warning: variable 'ISBN' is unused
  {ok,ranks}
  2> ranks:fetch_title_and_rank("0977616630").
  {"Agile Web Development with Rails",1234}

Let’s start wiring this up to Amazon.

To fetch information from Amazon using REST, you need to issue an ItemLookup operation, specifying what you want Amazon to return. In our case, we want the sales rank information, plus something called the small response group. The latter includes the book’s title. The URL you use to get this looks like this (except all on one line, and with no spaces):

http://webservices.amazon.com/onca/xml?
Service=AWSECommerceService
&SubscriptionId=<your ID goes here>
&Operation=ItemLookup
&ResponseGroup=SalesRank,Small
&ItemId=0977616630

The ItemID parameter is the ISBN of our book. Let’s write an Erlang function that returns the URL to fetch the information for an ISBN.

  -define(BASE_URL,
        "http://webservices.amazon.com/onca/xml?" ++
        "Service=AWSECommerceService" ++
        "&SubscriptionId=<" ++
            "&Operation=ItemLookup" ++
            "&ResponseGroup=SalesRank,Small" ++
            "&ItemId=").

  amazon_url_for(ISBN) ->
    ?BASE_URL ++ ISBN.

This code defines an Erlang macro called BASE_URL which holds the static part of the URL. The functionamazon_url_for builds the full URL by appending the ISBN to this. Note that both the macro and the function use ++, the operator that concatenates strings.

We now need to find a way of sending this request to Amazon and fetching back the XML response. Here we bump into one of Erlang’s current problems—it can be hard to browse the library documentation. My solution is to download the documentation onto my local box and search it there. I tend to use both the HTML and the man pages. Figuring I wanted something HTTP related, I tried the following:

  dave[Downloads/lib 12:04:07] ls **/*http*
  inets-4.7.11/doc/html/http.html        
  inets-4.7.11/doc/html/http_base_64.html 
  inets-4.7.11/doc/html/http_client.html  
  inets-4.7.11/doc/html/http_server.html 
  inets-4.7.11/doc/html/httpd.html
  inets-4.7.11/doc/html/httpd_conf.html
  inets-4.7.11/doc/html/httpd_socket.html
  inets-4.7.11/doc/html/httpd_util.html

Opening up http.html revealed the http:request method.

request(Url) -> {ok, Result} | {error, Reason}

This says that we give the request method a URL string. It returns a tuple. If the first element of the tuple is the atom ok, then the second element is the result for the request. If the first element is error, the second element is the reason it failed. This technique of returning both status and data as a tuple is idiomatic in Erlang, and the language makes it easy to handle the different cases.

Let’s look a little deeper into the successful case. The Result element is actually itself a tuple containing a status, the response headers, and the response body.

So let’s take all this and develop our fetch_title_and_rank method a little more.

  fetch_title_and_rank(ISBN) ->
    URL = amazon_url_for(ISBN),
    { ok, {Status, Headers, Body }} = http:request(URL),
    Body.

We call our amazon_url_for method to create the URL to fetch the data, then invoke the http:requestlibrary method to fetch the result. Let’s look at that second line in more detail.

The equals sign in Erlang looks like an assignment statement, but looks are deceptive. In Erlang, the equals sign is actually a pattern matching operation—an assertion that two things are equal. For example, it is perfectly valid in Erlang to say

  6> 1 = 1.
  1
  7> 9 * 7 = 60 + 3.
  63

So what happens when variables get involved? This is where it gets interesting. When I say

  A = 1.

I’m not assigning 1 to the variable A. Instead, I’m matching A against 1 (just like 9*7 = 60+3 asserts that the results of the two expressions are the same). So, does A match 1? That depends. If this is the first appearance of A on the left hand side of an equals sign, then A is said to be unbound. In this condition, Erlang says to itself “I can make this statement true if I give A the value 1,” which is does. From this point forward, A has the value 1 (in the current scope). If we say A = 1. again in the same scope in our Erlang program, A is no longer unbound, but this pattern match succeeds, because A = 1.is a simple assertion of truth. But what happens if we say A=2.?

  10> A = 1.
  1
  11> A = 1.
  1
  12> A = 2.

  =ERROR REPORT==== 16-Apr-2007::12:32:36 ===
  Error in process <0.55.0> with exit value: {{badmatch,2},[{erl_eval,expr,3}]}

  ** exited: {{badmatch,2},[{erl_eval,expr,3}]} **

We get an error: Erlang was unable to match the values on the left and right hand sides of the equals, so it raised a badmatch error. (Erlang error reporting is, at best, obscure.)

So, back to our HTTP request. I wrote

  { ok, {Status, Headers, Body }} = http:request(URL)

The left hand side matches a tuple. The first element of the tuple is the atom ok. The second element is another tuple. Thus the entire expression only succeeds if http:request returns a two element tuple with ok as the first element and a three element tuple as the second element. If the match succeeds, then the variables Status, Headers, and Body are set to the three elements of that second tuple.

(An aside for Ruby folk: you can do something similar using an obscure Ruby syntax. The statement

  rc, (status, headers, body) = [ 200, [ "status", "headers", "body" ] ]

will leave headers containing the string "headers".)

In our Erlang example, what happens if http:request returns an error? In this case, the first element of the returned tuple will be the atom error. This won’t match the left hand side of our expression (because the atom error is not the same as the atom ok) and we’ll get a badmatch error. We won’t bother to handle this here—it’s someone else’s problem.

Let’s compile our program.

  1> c(ranks).
  ./ranks.erl:13: Warning: variable 'Headers' is unused
  ./ranks.erl:13: Warning: variable 'Status' is unused
  {ok,ranks}

Erlang warns us that we have some unused variables. Do we care? To a point. I personally like my compilations to be clean, so let’s fix this. If you prefix a variable’s name with an underscore, it tells the Erlang compiler that you don’t intend to use that variable. As a degenerate case, you can use just a lone underscore, which means “some anonymous variable.” So, we can fix our warnings by writing either

  { ok, {_Status, _Headers, Body }} = http:request(URL),

  { ok, {_, _, Body }} = http:request(URL),

I mildly prefer the first form, as it helps me remember what those elements in the tuple are when I reread the program later.

Now we can recompile and run our code:

  1> c(ranks).
  {ok,ranks}
  2> ranks:fetch_title_and_rank("0977616630").
  "<?xml version=\"1.0\" encoding=\"UTF-8\"?><ItemLookupResponse
  xmlns=\"http://webservices.amazon.com/AWSECommerceService/2005-10-05\">
  <OperationRequest><HTTPHeaders><Header ...
     ... <SalesRank>1098</SalesRank>
     ... <Title>Agile Web Development with Rails (Pragmatic
          Programmers)</Title>...
     ...

Cool! We get a boatload of XML back. (If you’re running this at home, you’ll need to substitute your own AWS subscription ID into the request string to make this work.) Now we need to extract out the title and the sales rank. We could to this using string operations, but wouldn’t it be more fun to use XPath? Another quick filesystem scan reveals the xmerl library, which both generates and parses XML. It also supports XPath queries. To use this, we first scan the Amazon response. This constructs an Erlang data structure representing the original XML. We then call the xmerl_xpath library to find the elements in this data structure matching our query.

  -include_lib("xmerl/include/xmerl.hrl").

  fetch_title_and_rank(ISBN) ->
    URL = amazon_url_for(ISBN),
    { ok, {_Status, _Headers, Body }} = http:request(URL),
    { Xml, _Rest } = xmerl_scan:string(Body),
    [ #xmlText{value=Rank} ]  = xmerl_xpath:string("//SalesRank/text()", Xml),
    [ #xmlText{value=Title} ] = xmerl_xpath:string("//Title/text()", Xml),
    { Title, Rank }.

The -include line loads up the set of Erlang record definitions that xmerl uses to represent elements in the parsed XML. This isn’t strictly necessary, but doing so makes the XPath stuff a little cleaner.

The line

    { Xml, _Rest } = xmerl_scan:string(Body),

parses the XML response, storing the internal Erlang representation in the variable Xml.

Then we have the line

    [ #xmlText{value=Rank} ]  = xmerl_xpath:string("//SalesRank/text()", Xml),

The notation #xmlText identifies an Erlang record. A record is basically a tuple where each element is given a name. In this case the tuple (defined in that .hrl file we included) is called xmlText. It represents a text node in the XML. The xmlText record has a field called value. We use pattern matching to assign whatever is in that field to the variable Rank. But there’s one more twist. Did you notice that we wrote square brackets around the left hand side of the pattern match? Thats because XPath queries return an array of results. By writing the left hand side the way we did, we force Erlang to check that an array containing just one element was returned, and then to check that this element was a record of type xmlText. That’s pretty cool stuff—pattern matching extends down deep into the bowels of Erlang.

The last line of the method simply returns a tuple containing the title and the sales rank.

We can call it from the command line:

  1> c(ranks).
  {ok,ranks}
  2> ranks:fetch_title_and_rank("0977616630").
  {"Agile Web Development with Rails (Pragmatic Programmers)","1098"}

That’s it for now. Next we’ll look at fetching the data for multiple books at once. In the meantime, here’s the full listing of our ranks.erl file.

  -module(ranks).
  -export([fetch_title_and_rank/1]).
  -include_lib("xmerl/include/xmerl.hrl").

  -define(BASE_URL, 
        "http://webservices.amazon.com/onca/xml?" ++
        "Service=AWSECommerceService" ++
        "&SubscriptionId=<your id>" ++
        "&Operation=ItemLookup" ++
        "&ResponseGroup=SalesRank,Small" ++
        "&ItemId=").


  fetch_title_and_rank(ISBN) ->
    URL = amazon_url_for(ISBN),
    { ok, {_Status, _Headers, Body }} = http:request(URL),
    { Xml, _Rest } = xmerl_scan:string(Body),
    [ #xmlText{value=Rank} ]  = xmerl_xpath:string("//SalesRank/text()", Xml),
    [ #xmlText{value=Title} ] = xmerl_xpath:string("//Title/text()", Xml),
    { Title, Rank }.

  amazon_url_for(ISBN) ->
    ?BASE_URL ++ ISBN.