Apache Pig - Is it the tool for the job? Maybe, but probably not.
January 18, 2018

Anonymous | TrustRadius Reviewer
Score 7 out of 10
Vetted Review
Verified User

Overall Satisfaction with Apache Pig

Apache Pig is one of the distributed processing technologies used across our engineering department. We currently use it mainly to generate aggregate statistics from logs, to run additional refinement and filtering on certain logs, and to generate reports for both internal use and customer deliveries.
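To give a sense of the kind of job described above, here is a minimal sketch in plain Python (not Pig Latin, and not our actual code) of the filter, group, and aggregate shape such a script tends to have. The log format, the field names, and the `summarize` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical log lines: "<timestamp> <service> <status> <latency_ms>"
SAMPLE_LOGS = [
    "2018-01-17T10:00:00 api 200 35",
    "2018-01-17T10:00:01 api 500 120",
    "2018-01-17T10:00:02 web 200 48",
    "2018-01-17T10:00:03 api 200 40",
    "malformed line",
]


def parse(line):
    """Turn one log line into a record, or return None if it is malformed."""
    parts = line.split()
    if len(parts) != 4:
        return None
    _timestamp, service, status, latency = parts
    try:
        return {"service": service, "status": int(status), "latency_ms": int(latency)}
    except ValueError:
        return None


def summarize(lines):
    """Keep successful requests, group by service, and aggregate per group."""
    totals = defaultdict(lambda: [0, 0])  # service -> [count, latency sum]
    for line in lines:
        record = parse(line)
        # Filter step: drop malformed lines and non-200 responses.
        if record is None or record["status"] != 200:
            continue
        # Group step: accumulate per service.
        totals[record["service"]][0] += 1
        totals[record["service"]][1] += record["latency_ms"]
    # Aggregate step: request count and mean latency per service.
    return {
        service: {"count": count, "avg_latency_ms": latency_sum / count}
        for service, (count, latency_sum) in totals.items()
    }


if __name__ == "__main__":
    for service, stats in sorted(summarize(SAMPLE_LOGS).items()):
        print(service, stats)
```

In Pig Latin the same steps would roughly correspond to LOAD, FILTER, GROUP, and FOREACH ... GENERATE statements, with the platform distributing the work across the cluster instead of looping over lines in a single process.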
  • Provides a decent abstraction over MapReduce jobs, allowing for a faster result than creating your own MR jobs
  • Good documentation and resources for learning Pig Latin (the Domain Specific Language of the Apache Pig platform)
  • Large community allows for easy learning, support, and feature improvements/updates
  • May not fit every need and a SQL-like abstraction may be more effective for some tasks (look at Spark-SQL, Hive, or even an actual DBMS)
  • All Pig jobs are written in a Domain Specific Language so not a lot of transferable knowledge
  • Writing your own User Defined Functions (UDFs) is a nice feature but can be painful to implement in practice
  • Steeper learning curve than other similar technologies, so on-boarding new engineers or changing ownership of Apache Pig code tends to be a bit of a headache
  • Once the language is learned and understood, simple Pig scripts are relatively straightforward to write, so development can go quickly with a skilled team
  • As distributed technologies grow and improve, overall Apache Pig feels left in the dust and is more legacy code to support than something to actively develop with.
Early on, Apache Pig was a great tool for easily writing distributed processing applications without needing to write a complete Java MapReduce job from scratch, but as time has moved on there are now better alternatives that get results faster for both ad-hoc analysis and production systems. Apache Pig was chosen because it was what was available early on in the industry and because it has since reached maturity, but at this point it feels a little long in the tooth.
Apache Pig is well suited as part of an ongoing data pipeline where a team of engineers already familiar with the technology is in place. Otherwise, I would consider it relatively deprecated at this point, since there are more suitable technologies with more robust and flexible APIs that are also easier to learn and apply. For ad-hoc needs, I would recommend Hive or Spark-SQL if a SQL-esque language makes sense; otherwise, use Spark with a notebook technology such as Apache Zeppelin. For production data pipelines, I would recommend Apache Spark over Apache Pig for its performance, ease of use, and libraries.