Key Concepts

triv.io is your data warehouse in the cloud. Using triv.io infrastructure you can fetch massive amounts of data from anywhere on the internet, transform it into something useful, store it and even ship the results back to your production servers.

This is all done with simple instructions written in Python. There is no server setup or software to install and configure. In this guide we’ll introduce you to the processing API along with some of the key concepts you need to create triv.io projects.

For those eager to get to a real working project, have a look at the Tutorial: Word Counting. Make sure to come back here, as we assume you are familiar with the terminology introduced in this guide.

URLs

With triv.io you use URLs to specify where your data is and where you want it to be sent when you’re done processing it.

For example, you can use triv.io to fetch Apple’s latest stock quote from Yahoo finance using the following URL:

http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op

To tell triv.io to push the results to your quotes table in your MySQL server, you would use this URL:

mysql://<user>:<password>@mysql.example.com/quotes

Or you could POST data to a web endpoint with:

http://web.example.com/receive_quote

Or push your data to an Amazon S3 Bucket with:

s3://<ACCESS_ID>:<ACCESS_SECRET>@mybucket/prefix

Rules

You instruct triv.io to fetch and process data by declaring one or more rules in a script. For example, using the stock URL from above, this is how we would tell triv.io to fetch the URL and store the result in our database:

source_url = "http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op"
destination_url = "mysql://user:password@mysql.example.com/quotes"
quotes = rule(source_url, destination_url)

The above declaration, broken into three lines for readability, is all it takes to have triv.io fetch data from Yahoo Finance and store the results in your MySQL database.

Generally, you wouldn’t send the raw data from your source directly to your destination database. Most triv.io projects use triv.io as long-term storage and only push summarized data back to their production systems (think roll-up reporting or computed models).

This is so common that the typical import declaration simply contains the source URL without specifying the destination, like so:

quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

This instructs triv.io to fetch the quote and store it internally.

Transporting and storing data is all well and good, but the real power of triv.io is that we give you the computing resources to clean and conform your data to specific standards, run computations on it, merge it with other data or change it to a different format. You can accomplish this by writing simple MapReduce jobs and chaining the results together.

For instance, continuing from our quote example, let’s instruct triv.io to calculate the stock change as a percentage:

quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

@quotes.map
def map(record, params):
  record['change'] = (record['closing'] - record['opening']) / record['opening'] * 100
  yield record

Using the above @quotes.map decorator, we instructed triv.io to run the cleverly named map function once on each record in the input. In this case we calculated the change in stock price and yielded the record with the calculated value.

Whatever is yielded by the map function is what is actually stored by triv.io. Had we only wanted to store the stock symbol and the change in price for the day, we could have written our script like this:

quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

@quotes.map
def map(record, params):
  result = {}
  result['symbol'] = record['symbol']
  result['change'] = (record['closing'] - record['opening']) / record['opening'] * 100
  yield result

Note that above we yield a completely new dictionary which contains only keys for the stock symbol and the change. All the rest of the data is effectively discarded. You can yield more than one record per invocation of the map function, or no records at all to signify that the data should be filtered out, as sketched below.
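Here is a minimal sketch of both behaviours, reusing the quotes rule and record keys from the examples above. The 1% threshold and the extra 'direction' record are purely illustrative assumptions:

quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

@quotes.map
def map(record, params):
  change = (record['closing'] - record['opening']) / record['opening'] * 100

  # Yield no records at all: quiet days are filtered out of the output entirely.
  if abs(change) < 1.0:
    return

  # Yield more than one record for a single input record.
  yield {'symbol': record['symbol'], 'change': change}
  yield {'symbol': record['symbol'], 'direction': 'up' if change > 0 else 'down'}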

You have the full power of the MapReduce paradigm at your disposal. More information is available in our Understanding MapReduce guide, which starts with a basic introduction to MapReduce and continues with the information you need to understand how triv.io allocates computing resources to your projects.
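As a taste of the reduce side, here is a rough sketch. The @quotes.reduce decorator and the signature of the reduce function shown here are assumptions modelled on @quotes.map, so check the Understanding MapReduce guide for the actual API:

quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

@quotes.map
def map(record, params):
  yield {'symbol': record['symbol'],
         'change': (record['closing'] - record['opening']) / record['opening'] * 100}

# Assumed decorator, analogous to @quotes.map; the real name may differ.
@quotes.reduce
def reduce(records, params):
  # records is assumed to be an iterable of the dictionaries yielded by the map step.
  symbol = None
  total = 0.0
  count = 0
  for record in records:
    symbol = record['symbol']  # the example feed only contains one symbol
    total += record['change']
    count += 1
  if count:
    yield {'symbol': symbol, 'average_change': total / count}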

Datastores

We call the sources and destinations for your data datastores. We have created a number of connectors to these stores for you, a sample of which you saw in the section on URLs.

The list of current datastores is available from our trivio.datastores GitHub project. If you don’t see support for your database, file format or web service, please contribute it! You’ll find instructions for contributing on the GitHub page above.

Datastores present data to your pipeline as a series of records optionally grouped by time.

Records

You can process just about any kind of data through triv.io: log files, spreadsheets, images, database records, HTML documents and more. triv.io presents this data to your MapReduce functions as ordinary Python dictionaries, which we refer to simply as records. Which keys are available in a record depends on the source datastore, the MIME type of the content, and any alterations made to the record by other rules in the processing chain.

For example, records produced from databases or CSV files will have keys that represent columns, while image content may simply have a key labeled contents which holds the binary data for the image.
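As a purely illustrative example, a record from the quotes CSV above might look something like the first dictionary below, and a record for image content like the second. The exact keys and values depend on the connector:

# Illustrative only: the actual keys come from the datastore connector.
csv_record = {
  'name': 'Apple Inc.',
  'symbol': 'AAPL',
  'opening': 150.0,
  'closing': 151.5
}

image_record = {
  'contents': '\x89PNG\r\n...'  # raw binary data for the image
}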

When outputting records you should stick to emitting records whose keys and values are JSON serializable. We plan to introduce the ability to specify rules in other programming languages in the future, and following this practice will help keep your data interchangeable.
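For example, a datetime object is not JSON serializable, so convert such values to strings before yielding them. The fetched_at field below is an illustrative addition, and json.dumps from the standard library is used here only as a sanity check:

import json
from datetime import datetime

quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

@quotes.map
def map(record, params):
  result = {
    'symbol': record['symbol'],
    'change': (record['closing'] - record['opening']) / record['opening'] * 100,
    'fetched_at': datetime.utcnow().isoformat()  # ISO string, not a datetime object
  }
  json.dumps(result)  # raises TypeError if anything is not JSON serializable
  yield result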

Tables

A table is a collection of records partitioned by time. Each rule declaration results in the creation of at least one table. The partitions of the table are known as segments.

Segments

triv.io rules are processed on a schedule. By default each rule is executed once a day. The results of each execution, or job, are added to the table as exactly one segment. You can change the schedule to meet your needs; for instance, you can execute a rule every 5 minutes, once a week, or on the last day of the month.

Rules can be chained together, in which case the segment (all the records created in one execution) from the first job serves as the input to the next rule.
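The exact chaining syntax is covered in the Understanding MapReduce guide; as a rough, purely illustrative sketch, assuming a rule object can be used as the source of a later rule:

# Purely illustrative: assumes a rule can be passed as the source of another rule.
quotes = rule("http://download.finance.yahoo.com/d/quotes.csv?s=AAPL&f=nsl1op")

@quotes.map
def map(record, params):
  record['change'] = (record['closing'] - record['opening']) / record['opening'] * 100
  yield record

# The segment produced by each quotes job would serve as the input to this rule,
# which pushes the summarized records out to a production database.
summary = rule(quotes, "mysql://user:password@mysql.example.com/quotes")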

Repositories

Your code itself is stored in a collection of git repositories that you import into your project. Typically you’ll create one git repository specifically for your project, and then you may import one or more auxiliary repositories contributed by the community.

The vast majority of the work in a data project typically revolves around obtaining, cleaning and conforming your data. Please consider sharing your non-proprietary code on GitHub with other users. Who knows, you may save some researcher so much time that they use triv.io to find the cure for cancer.

Scripts

A repository contains a collection of scripts. Each script has one or more rules. Currently scripts must use the .pyj extension to signify that they are a Python job.

Projects

All your code and data is grouped into a project. You can invite your colleagues to work on your project along with other triv.io users. All security and preferences are managed through the project interface.