Useful ETL scripting tool
March 20, 2020

Jordan Moore | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Pig

Data engineers use Pig as a stopgap while moving away from MapReduce: it offers more flexibility than HiveQL without the effort of setting up a Spark environment. It solves the problem of iteratively transforming and migrating data between supported Hadoop environments while letting you debug the process at each step.
  • Iterative development - You can define aliases (variables) that are not executed immediately. They are stored in a DAG that is only evaluated when an alias is dumped or stored.
  • Fast execution - Works with MapReduce, Tez, or Spark execution frameworks to provide fast run times at large scales.
  • Local and remote interoperability - Scripts can be tested locally on a small dataset with "pig -x local" before being run against the full dataset.
  • The general syntax of the FOREACH ... GENERATE feature is confusing for nested operations.
  • The docs are hard to navigate, but the reasonable examples make up for it.
  • A version number below 1.0 doesn't instill confidence in a product that has been around for over half a decade (as of writing).
  • Iterate quickly on ETL pipelines.
  • Scale up parallel processing.
  • Easily templatable scripting language.
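
The deferred execution described under "Iterative development" above can be modeled with a small Python sketch. This is a toy illustration only, not Pig's implementation or API: the names Alias, load, filter, foreach, dump, and the executed log are all hypothetical. It only shows the idea that defining a step records it in a chain, and nothing runs until an alias is dumped.

```python
# Toy model of deferred evaluation (hypothetical names, not Pig's API).
executed = []  # records which steps actually ran, in order


class Alias:
    """A named step in a dataflow; defining it does not run it."""

    def __init__(self, name, fn, parent=None):
        self.name = name
        self.fn = fn          # transformation applied to the parent's rows
        self.parent = parent  # upstream alias, or None for a source

    def filter(self, name, pred):
        # Returns a new alias; no rows are touched yet.
        return Alias(name, lambda rows: [r for r in rows if pred(r)], self)

    def foreach(self, name, func):
        # Returns a new alias; no rows are touched yet.
        return Alias(name, lambda rows: [func(r) for r in rows], self)

    def dump(self):
        # Evaluation walks the chain from the source down to this alias.
        rows = self.parent.dump() if self.parent else None
        executed.append(self.name)
        return self.fn(rows)


def load(name, rows):
    """Source alias wrapping in-memory rows (stands in for reading a file)."""
    return Alias(name, lambda _: list(rows))


raw = load("raw", [("a", 10), ("b", 2000), ("c", 5000)])
big = raw.filter("big", lambda r: r[1] > 1024)
names = big.foreach("names", lambda r: r[0])

print(executed)      # [] : nothing has run yet
print(names.dump())  # ['b', 'c']
print(executed)      # ['raw', 'big', 'names']
```

Dumping an intermediate alias such as big in the same way is what makes it possible to inspect the pipeline at each step.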
Pig is more focused on scripting in its own Pig Latin language than on integrating into another language like Java/Scala/Python/SQL.
However, for batch ETL workloads, I find that I can write a Pig script quicker than setting up and deploying a Spark program, for example.
The documentation is adequate. I'm not sure how large an external community there is for support.

Did Apache Pig live up to sales and marketing promises?

I wasn't involved with the selection/purchase process

If someone wants to process data without access to platforms such as Spark or Flink, and wants to do so in a minimal, portable fashion that simply requires learning a new scripting language, then Pig is great. It also supports running the same code against a cluster and on a single developer machine for testing.

Pig is better suited to batch ETL workloads than to ML or streaming big data use cases.