Item: Apache Pig
Rating: 7
Author: Verified User

Overall Satisfaction with Apache Pig

Use Cases and Deployment Scope

Apache Pig is one of the distributed processing technologies used across our engineering department. We currently use it mainly to generate aggregate statistics from logs, to run additional refinement and filtering on certain logs, and to generate reports for both internal use and customer deliveries.
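
To give a sense of the kind of job described above, here is a rough sketch of the filter, group, and count dataflow in plain single-machine Python. It is not Pig Latin and not our production code. The log format, the sample lines, and the function names are all made up for illustration.

```python
from collections import Counter

# Hypothetical log format (an assumption for this sketch):
# "<timestamp> <level> <service> <message>"
SAMPLE_LOGS = [
    "2016-03-01T10:00:00 ERROR auth login failed",
    "2016-03-01T10:00:05 INFO auth login ok",
    "2016-03-01T10:01:00 ERROR billing card declined",
    "2016-03-01T10:02:00 ERROR auth token expired",
    "malformed line",
]


def parse(line):
    """Split a line into (timestamp, level, service, message); None if malformed."""
    parts = line.split(" ", 3)
    if len(parts) < 4:
        return None
    return tuple(parts)


def error_counts_by_service(lines):
    """Filter to ERROR records, group by service, and count each group."""
    records = (parse(line) for line in lines)
    errors = (r for r in records if r is not None and r[1] == "ERROR")
    return Counter(r[2] for r in errors)


counts = error_counts_by_service(SAMPLE_LOGS)
```

A Pig script expresses the same three steps (filter, group, aggregate) declaratively and runs them across a cluster, which is the part you would otherwise have to hand-write as a MapReduce job.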

Pros and Cons

Pros

Provides a decent abstraction for Map-Reduce jobs, allowing for a faster result than creating your own MR jobs
Good documentation and resources for learning Pig Latin (the Domain Specific Language of the Apache Pig platform)
Large community allows for easy learning, support, and feature improvements/updates

Cons

May not fit every need and a SQL-like abstraction may be more effective for some tasks (look at Spark-SQL, Hive, or even an actual DBMS)
All Pig jobs are written in a Domain Specific Language, so little of the knowledge transfers to other tools
Writing your own User Defined Functions (UDFs) is a nice feature but can be painful to implement in practice

Return on Investment

Steeper learning curve than other similar technologies, so on-boarding new engineers or changing ownership of Apache Pig code tends to be a bit of a headache
Once the language is learned and understood, simple Pig scripts are fairly straightforward to write, so development can go quickly with a skilled team
As distributed technologies grow and improve, overall Apache Pig feels left in the dust and is more legacy code to support than something to actively develop with.

Alternatives Considered

Apache Hive and Apache Spark

Early on, Apache Pig was a great tool for easily writing distributed processing applications without needing to write a complete Java MapReduce job from scratch, but as time has moved on there are now better alternatives that get results faster for both ad-hoc analysis and production systems. We used Apache Pig because it was what was available early on in the industry and because it had reached maturity, but at this point it feels a little long in the tooth.

Other Software Used

Oracle Java SE, Eclipse, IntelliJ IDEA, HipChat, JIRA Software, Databricks, Hortonworks Data Platform

Likelihood to Recommend

Apache Pig is well suited as part of an ongoing data pipeline where there is already a team of engineers in place who are familiar with the technology. Beyond that, I would consider it relatively deprecated at this point, since there are more suitable technologies with more robust and flexible APIs that are also easier to learn and apply. For ad-hoc needs, I would recommend Hive or Spark-SQL if a SQL-esque language makes sense; otherwise, use Spark with a notebook technology such as Apache Zeppelin. For production data pipelines, I would recommend Apache Spark over Apache Pig for its performance, ease of use, and libraries.

Apache Pig - Is it the tool for the job? Maybe, but probably not.