As discussed in a recent InfoQ post, the new Hortonworks Stinger Initiative relies heavily on Tez, a new Hadoop data processing framework.
The blog post Apache Tez: A New Chapter in Hadoop Data Processing explains:
“Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and then execute it with high performance. Apache Tez … represents an alternative to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale.”
Tez achieves this goal by modeling data processing not as a single job, but rather as a data flow graph:
… with vertices in the graph representing application logic and edges representing movement of data. A rich dataflow definition API allows users to express complex query logic in an intuitive manner and it is a natural fit for query plans produced by higher-level declarative applications like Hive and Pig... [The] dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.
Specific user logic is modeled in Tez vertices in the form of input, processor, and output modules: input and output modules define the input and output data (including its format, access method, and location), while processor modules define the data transformation logic, which can be expressed, for example, as a MapReduce-style mapper or reducer. Although Tez does not explicitly impose any data format restrictions, it requires that the input, processor, and output of a vertex be compatible with each other. Similarly, an input/output pair connected by an edge has to be format- and location-compatible.
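To make this modular structure more concrete, the sketch below configures a single vertex with an HDFS input, a processor, and an HDFS output. It is a minimal sketch only: the descriptor classes and the MRInput/MROutput helpers come from the Tez Java API as it later stabilized and may differ from the version described in the post, and com.example.TokenizerProcessor as well as the paths are hypothetical placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;
import org.apache.tez.mapreduce.input.MRInput;
import org.apache.tez.mapreduce.output.MROutput;

public class SingleVertexSketch {

  public static Vertex buildVertex(Configuration conf) {
    // Processor module: the user's transformation logic, referenced by class name.
    // com.example.TokenizerProcessor is a hypothetical processor used for illustration.
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create("com.example.TokenizerProcessor"));

    // Input module: defines the data the vertex reads, including its format,
    // access method and location (here, text files on HDFS).
    tokenizer.addDataSource("Input",
        MRInput.createConfigBuilder(conf, TextInputFormat.class, "/data/in").build());

    // Output module: defines where and in what format the result is written.
    tokenizer.addDataSink("Output",
        MROutput.createConfigBuilder(conf, TextOutputFormat.class, "/data/out").build());

    return tokenizer;
  }
}
```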
The Data Processing API in Apache Tez blog post describes a simple Java API used to express a DAG of data processing. The API has three components, illustrated in the sketch that follows the list:
- DAG. This defines the overall job. The user creates a DAG object for each data processing job.
- Vertex. This defines the user logic along with the resources and environment needed to execute it. The user creates a Vertex object for each step in the job and adds it to the DAG.
- Edge. This defines the connection between producer and consumer vertices. The user creates an Edge object and uses it to connect the producer and consumer vertices.
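The following minimal sketch wires two vertices into a DAG with a single edge, roughly following the classic two-stage word-count shape. The DAG, Vertex, Edge, and descriptor classes are from the Tez Java API (the static create factory methods appeared in later Tez releases, so exact signatures may differ from the pre-release API described in the post); the processor class names are hypothetical.

```java
import org.apache.tez.dag.api.DAG;
import org.apache.tez.dag.api.Edge;
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;
import org.apache.tez.dag.api.ProcessorDescriptor;
import org.apache.tez.dag.api.Vertex;

public class TwoStageDagSketch {

  public static DAG buildDag() {
    // One DAG object per data processing job.
    DAG dag = DAG.create("wordcount");

    // One Vertex per step: user logic (processor class) plus requested parallelism.
    Vertex tokenizer = Vertex.create("Tokenizer",
        ProcessorDescriptor.create("com.example.TokenizerProcessor"), 4);
    Vertex summation = Vertex.create("Summation",
        ProcessorDescriptor.create("com.example.SummationProcessor"), 2);

    // An Edge connects a producer vertex to a consumer vertex; its EdgeProperty
    // (see the edge properties below) tells Tez how to move data between them.
    Edge shuffle = Edge.create(tokenizer, summation,
        EdgeProperty.create(
            DataMovementType.SCATTER_GATHER,
            DataSourceType.PERSISTED,
            SchedulingType.SEQUENTIAL,
            OutputDescriptor.create(
                "org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput"),
            InputDescriptor.create(
                "org.apache.tez.runtime.library.input.OrderedGroupedKVInput")));

    return dag.addVertex(tokenizer).addVertex(summation).addEdge(shuffle);
  }
}
```

The resulting DAG object would then be submitted to the cluster through Tez's client API, which expands it into a physical graph of tasks and executes it.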
Edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and define how to route data between the tasks. Tez also allows the parallelism of each vertex execution to be defined based on user guidance, data size, and resources. The edge properties fall into three groups; the sketch after the list shows two typical combinations.
- Data movement. Defines routing of data between tasks
- One-To-One: Data from the ith producer task routes to the ith consumer task.
- Broadcast: Data from a producer task routes to all consumer tasks.
- Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
- Scheduling. Defines when a consumer task is scheduled
- Sequential: Consumer task may be scheduled after a producer task completes.
- Concurrent: Consumer task must be co-scheduled with a producer task.
- Data source. Defines the lifetime/reliability of a task output
- Persisted: Output will be available after the task exits. Output may be lost later on.
- Persisted-Reliable: Output is reliably stored and will always be available.
- Ephemeral: Output is available only while the producer task is running.
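For a sense of how these properties combine in practice, the sketch below builds two edge configurations with the Tez EdgeProperty API: a MapReduce-style shuffle edge and a broadcast edge of the kind that might ship a small lookup table to every consumer for a map-side join. This is an illustrative sketch; the runtime input/output class names are those of the later Tez runtime library and are used here only as plausible pairings.

```java
import org.apache.tez.dag.api.EdgeProperty;
import org.apache.tez.dag.api.EdgeProperty.DataMovementType;
import org.apache.tez.dag.api.EdgeProperty.DataSourceType;
import org.apache.tez.dag.api.EdgeProperty.SchedulingType;
import org.apache.tez.dag.api.InputDescriptor;
import org.apache.tez.dag.api.OutputDescriptor;

public class EdgePropertySketch {

  // Scatter-Gather + Persisted + Sequential: the classic MapReduce shuffle.
  // The ith shard from every producer goes to the ith consumer, output survives
  // task exit, and consumers start only after producers complete.
  public static EdgeProperty shuffleEdge() {
    return EdgeProperty.create(
        DataMovementType.SCATTER_GATHER,
        DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL,
        OutputDescriptor.create(
            "org.apache.tez.runtime.library.output.OrderedPartitionedKVOutput"),
        InputDescriptor.create(
            "org.apache.tez.runtime.library.input.OrderedGroupedKVInput"));
  }

  // Broadcast + Persisted + Sequential: every consumer task receives the full
  // output of each producer task, e.g. a small dimension table for a map-side join.
  public static EdgeProperty broadcastEdge() {
    return EdgeProperty.create(
        DataMovementType.BROADCAST,
        DataSourceType.PERSISTED,
        SchedulingType.SEQUENTIAL,
        OutputDescriptor.create(
            "org.apache.tez.runtime.library.output.UnorderedKVOutput"),
        InputDescriptor.create(
            "org.apache.tez.runtime.library.input.UnorderedKVInput"));
  }
}
```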
Additional details on the Tez architecture are presented in this Tez Design document.
The idea of representing data processing as a dataflow is not new. It is the foundation of Cascading, and many applications achieve it using Oozie. The advantage of Tez is that it brings it all together in a single framework optimized for resource management (based on Apache Hadoop YARN), data transfer, and execution. Additionally, the design of Tez includes support for pluggable vertex-management modules that collect relevant information from tasks and change the dataflow graph at runtime to optimize performance and resource usage.