Item: Apache Pig
Rating: 8
Author: Jordan Moore

Overall Satisfaction with Apache Pig

Use Cases and Deployment Scope

Data engineers use Pig as a stopgap until a Spark environment is set up. It offers more flexibility than HiveQL while moving away from raw MapReduce. It solves the problem of iteratively transforming and migrating data between supported Hadoop environments, with the ability to debug the process at each step.

Pros and Cons

Pros

Iterative development - aliases (variables) are not executed when they are defined. They are recorded in a DAG, which is evaluated only when an alias is dumped or stored.
Fast execution - works with the MapReduce, Tez, or Spark execution frameworks to provide fast run times at large scale.
Local and remote interoperability - a script can be tested against a small dataset locally with "pig -x local" before it is run against the full dataset.
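
To make the lazy-evaluation point concrete, here is a toy model in Python. It is only a sketch of the idea, not how Pig is implemented: the Alias class and dump function are names made up for this illustration. Defining an alias records a step in the plan, and work happens only when something is dumped.

```python
"""Toy model of lazy alias evaluation (an illustration, not Pig's code)."""


class Alias:
    """A named step in a plan: a function plus the aliases it reads from."""

    def __init__(self, name, fn, parents=()):
        self.name = name
        self.fn = fn
        self.parents = tuple(parents)

    def evaluate(self, log):
        # Walk the DAG depth-first: parents are computed before this step.
        inputs = [parent.evaluate(log) for parent in self.parents]
        log.append(self.name)
        return self.fn(*inputs)


def dump(alias):
    """Analogue of DUMP: the only place where work is triggered."""
    log = []
    rows = alias.evaluate(log)
    return rows, log


# Defining aliases only records the plan; nothing runs yet.
raw = Alias("raw", lambda: [("a", 1), ("b", 5), ("a", 3)])
big = Alias("big", lambda rows: [r for r in rows if r[1] > 2], [raw])
unused = Alias("unused", lambda rows: 1 / 0, [raw])  # never dumped, never run

rows, log = dump(big)
print(rows)  # [('b', 5), ('a', 3)]
print(log)   # ['raw', 'big']
```

The "unused" alias would fail if it ever ran, but since it is never dumped it never does, which is the behaviour that makes step-by-step development in Pig cheap.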

Cons

The syntax of FOREACH ... GENERATE is confusing for nested operations.
The docs are hard to navigate, but the reasonable examples make up for it.
A version number below 1.0 doesn't instill confidence in a product that has been around for over half a decade (as of writing).

Return on Investment

Iterate quickly on ETL pipelines.
Scale up parallel processing.
Easily templatable scripting language.
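
As a rough sketch of what "templatable" looks like in practice, the Python below builds the command line for one parameterized Pig script in local and cluster modes. It relies on Pig's -param substitution and the -x flag mentioned under Pros; the script contents, the file name totals.pig, the data paths, and the pig_command helper are all hypothetical and only for illustration.

```python
import re
import shlex

# Pig Latin with $input / $output placeholders. Pig fills these in at launch
# time from -param arguments. Assumed to be saved as totals.pig (hypothetical).
SCRIPT = """\
events = LOAD '$input' USING PigStorage(',') AS (user:chararray, bytes:long);
by_user = GROUP events BY user;
totals = FOREACH by_user GENERATE group AS user, SUM(events.bytes) AS total_bytes;
STORE totals INTO '$output';
"""


def pig_command(script_path, exec_type, params, script_text=SCRIPT):
    """Build the argv for running a parameterized Pig script."""
    # Refuse to build a command if the script uses a placeholder we did not supply.
    missing = set(re.findall(r"\$(\w+)", script_text)) - set(params)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    argv = ["pig", "-x", exec_type]
    for key, value in sorted(params.items()):
        argv += ["-param", f"{key}={value}"]
    argv.append(script_path)
    return argv


# Same script, two targets: a small local sample and the full cluster dataset.
local = pig_command("totals.pig", "local",
                    {"input": "sample.csv", "output": "out_local"})
cluster = pig_command("totals.pig", "mapreduce",
                      {"input": "/data/events", "output": "/data/totals"})

print(shlex.join(local))
# pig -x local -param input=sample.csv -param output=out_local totals.pig
```

Only the execution type and the parameter values change between the two runs; the script itself stays the same.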

Alternatives Considered

Apache Spark, Apache Flink and Apache Hive

Pig is more focused on scripting in its own Pig Latin language than on integrating into another language such as Java, Scala, Python, or SQL.

However, for batch ETL workloads, I find that I can write a Pig script more quickly than I can set up and deploy a Spark program, for example.

Support Rating

The documentation is adequate. I'm not sure how large an external community there is for support.

Key Insights

Do you think Apache Pig delivers good value for the price?

Yes

Are you happy with Apache Pig's feature set?

Yes

Did Apache Pig live up to sales and marketing promises?

I wasn't involved with the selection/purchase process

Did implementation of Apache Pig go as expected?

Yes

Would you buy Apache Pig again?

Yes

Likelihood to Recommend

If someone wants to process data, doesn't have access to platforms such as Spark or Flink, and wants a minimal, portable approach that simply requires learning a new scripting language, then Pig is great. It also supports running the same code against a cluster as on a single developer machine for testing.

Pig is better suited to batch ETL workloads than to ML or streaming big data use cases.


Useful ETL scripting tool