Tutorial: Word Counting

At its heart, triv.io relies on the MapReduce paradigm to process data. Ever since Google first published its white paper on MapReduce, it has become customary to demonstrate the technology by using it to count the words that occur in some set of documents (typically referred to as a corpus).

Well, we’re not ones to break with tradition. To help you understand how to use triv.io, we will teach you to count words. Rather than demoing it on one document, as is traditionally done, we’ll teach you how to do it on all the web pages on the internet. Best of all, it takes less than 15 lines of code!

That’s why you’ll want to use triv.io.

Creating a Project and Adding Code

To get started, log in to triv.io with your GitHub account.

If this is the first time you’ve logged into triv.io, you’ll have a default project set up that looks like this:

_images/empty_project.png

Now you need to add some code to your project that tells triv.io what to do.

First, click on the code icon labeled by callout #1 in the image below.

_images/adding_repos.png

You can either import a project shared by the community or create the code from scratch. Callout #2 in the above image points out the hello_world repository, which contains the same code that we’ll have you write in this guide. We assume you’ll be creating the code from scratch, but if you’re the lazy sort you can simply click import (callout #3) and skip to the Word Counting Explained section.

Still with us? Good. You’ll use these steps when making your own projects. Click on the “New Repository” button (callout #4).

Once you click New Repository, enter “word_counting” as the repository name, as highlighted in callout #1 below.

_images/new_repo_add_script.png

Hit enter, then add a new script: click the button highlighted in callout #2 and type “word_counting.pyj” when the name appears in the file list (see callout #3).

Hit enter, and you can now edit the contents of this blank document by clicking on it.

Enter the following code into the editor:

 1  word_counts = rule('s3://AKIAIOV23F6ZNL5YPRNA:8Gwz48zgzwoYIZv70V4uGDD6%2fdNtHdbFq4kLXGlR@aws-publicdatasets/common-crawl/crawl-002/', 'word_counts')
 2
 3  @word_counts.map
 4  def map(record, params):
 5    for word in record['payload'].split():
 6      yield word, 1
 7
 8  @word_counts.reduce
 9  def reduce(iter, params):
10    d = {}
11    for word, count in iter:
12      d[word] = d.get(word, 0) + count
13
14    for word, count in d.iteritems():
15      yield word, count

When you’re done, your window should look something like this:

_images/edit_script.png

Click save (callout #1) and triv.io will start counting words for you.

You can see this in action by navigating to your pipeline view, which shows that triv.io is using Common Crawl data as your input and outputting the results of your map/reduce job to a table inside triv.io named word_counts.

_images/pipeline.png

By the way, if you dislike editing through a web browser, you can do most of your work from the command line. Simply create your own git repo, host it on GitHub, and import it into your project.

Word Counting Explained

Now that we have our project set up and we’re happily counting words from the internet, let’s explain what’s going on.

Line 1 from the code above:

word_counts = rule('s3://AKIAIOV23F6ZNL5YPRNA:8Gwz48zgzwoYIZv70V4uGDD6%2fdNtHdbFq4kLXGlR@aws-publicdatasets/common-crawl/crawl-002/',
              'word_counts')

This tells triv.io to use the Common Crawl corpus (a free public dataset hosted by Amazon) and to save the results as a table named word_counts. We also assign the returned rule instance to the word_counts variable so that we can further customize the job with map and reduce functions.

The actual form to use is this:

word_counts = rule('s3://<AWS ID>:<AWS Secret>@aws-publicdatasets/common-crawl/crawl-002/',
               'word_counts')

During the alpha of triv.io, these docs provide an access key and Amazon secret that you can use for this example. In the future we may require you to use your own Amazon credentials.
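
A practical note if you do substitute your own credentials: the secret in the sample URL above appears to be URL-encoded (the %2f is an escaped /), so any slashes or other URL-special characters in your own secret need the same treatment. Here is a minimal plain-Python sketch of building the source URL; my_key and my_secret are placeholders, not real values:

try:
    from urllib.parse import quote   # Python 3
except ImportError:
    from urllib import quote         # Python 2

my_key = 'AKIAEXAMPLEKEY'        # placeholder AWS access key ID
my_secret = 'abc/def+ghiJKL'     # placeholder secret; note the '/'

# quote() with safe='' percent-encodes '/' and '+', so the secret can
# sit safely in the userinfo portion of the s3:// URL.
bucket_path = 'aws-publicdatasets/common-crawl/crawl-002/'
source_url = 's3://%s:%s@%s' % (my_key, quote(my_secret, safe=''), bucket_path)

print(source_url)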

Line 3, @word_counts.map, declares that the function following it should be used as the mapping function.

Mapping functions take two arguments. The first argument is the input record; the second argument, unused in this example, can be used to store state information during a job run. The map function is called once for every record in the current job segment, and whatever it yields is passed to the reducer. A record is simply a Python dictionary with some keys and values. The actual contents of the dictionary depend on the datastore. In this case we’re using a datastore that specifically adapts each web page in the Common Crawl corpus into a dictionary with a payload key. There are many other keys whose values may be of interest to you, including the URL of the page, the size, web server headers and more. See the Common Crawl Datastore for more information.
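
To make this concrete, here is a small, self-contained sketch of what the mapper sees, run outside of triv.io on a hand-built record. Only the payload key is taken from the tutorial; the other keys shown are purely illustrative and may not match the datastore’s real field names, so check the Common Crawl Datastore docs for those:

# A hand-built stand-in for one record from the datastore.
# Only 'payload' is taken from the tutorial; 'url' and 'size' are
# illustrative placeholders, not the datastore's real field names.
record = {
    'url': 'http://example.com/',
    'size': 27,
    'payload': 'the quick brown fox the end',
}

def map(record, params):
    # Same body as the tutorial's mapper: one (word, 1) pair per token.
    for word in record['payload'].split():
        yield word, 1

print(list(map(record, None)))
# [('the', 1), ('quick', 1), ('brown', 1), ('fox', 1), ('the', 1), ('end', 1)]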

Lines 4-6 implement our mapper. In this case it simply splits the entire payload on whitespace and yields each word it finds, along with the value 1, to the reducer. This is obviously very unsophisticated, but it’s a toy example. We’ll leave it as an exercise to you to create a more robust word tokenizer; one possible direction is sketched below.
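
For instance, a slightly more robust mapper might lowercase the payload and pull out alphabetic runs with a regular expression rather than splitting on whitespace. This is just one possible approach, not part of the tutorial code:

import re

# Runs of letters, optionally with an internal apostrophe (e.g. "don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def map(record, params):
    # Lowercase first so "The" and "the" are counted as the same word.
    for word in WORD_RE.findall(record['payload'].lower()):
        yield word, 1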

Line 8 declares that the function following it will be used as the reducer.

Lines 9-15 perform the summation. The reducer is called with an iterator of all the keys and values yielded by our mapper. The code simply sums all the values found in the iterator and yields the final results.
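
To see the same summation in isolation, here is the reducer run on a hand-built list of pairs in plain Python (using .items() instead of the Python 2-only .iteritems() so the sketch runs on either version):

def reduce(pairs, params):
    # Accumulate a running total per word, then emit each (word, total).
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    for word, total in totals.items():
        yield word, total

sample = [('the', 1), ('quick', 1), ('the', 1), ('fox', 1)]
print(sorted(reduce(sample, None)))
# [('fox', 1), ('quick', 1), ('the', 2)]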