Item: Apache Pig
Rating: 7
Author: Kartik Chavan

Overall Satisfaction with Apache Pig

Use Cases and Deployment Scope

We use Apache Pig within our Information Technology department because we need a distributed processing system. I use it mainly to generate reports with advanced statistical methods, for both internal and external purposes. Our Data Science and Data Engineering teams use it to build pipelines in our Big Data environment and to conduct further advanced analysis, including for machine learning.

Pros and Cons

Pros

Long logic in Java? Apache Pig is a good alternative.
Has a lot of great features, including table joins like those found in many systems such as a traditional DBMS, Hive, Spark SQL, etc.
Faster and easier development compared to regular MapReduce jobs.

Cons

Python UDF errors are hard to interpret. Developers can struggle for a very long time when they hit these errors.
Being at an early stage, it still has only a small community to turn to for help.
It still needs a lot of improvement. A datetime module for time series, which is a very basic requirement, was added only recently.

Return on Investment

The return on investment is significant considering what it can do with traditional analysis techniques. But with alternatives like Apache Spark and Hive being more efficient, it is hard to stick with Apache Pig.
It handles large datasets quite easily compared to SQL. But, again, the alternatives are more efficient.
When working on unstructured, decentralized datasets, Pig is highly beneficial: it is not a complete departure from SQL, yet it does not drag you into the complexity of MapReduce either.

Alternatives Considered

Apache Hive, Apache Spark, and Apache Spark MLlib

I use both Apache Pig and its alternatives, such as Apache Spark and Apache Hive. Apache Pig was one of the best options in the early days of Big Data, but the alternatives have since taken over the market and left Apache Pig behind. Still, it is a better option than writing raw MapReduce, and it is a good choice for working with unstructured datasets. Moreover, in certain cases, Apache Pig is much faster than Hive and Spark.

Other Software Used

Apache Hive, Apache Spark, Apache Spark MLlib

Likelihood to Recommend

It is a great option for building data pipelines, and it is highly effective for working with unstructured datasets. Because Apache Pig is a procedural language, unlike SQL, it is also easy to learn compared to the alternatives. Even so, my recommendation would be an alternative like Apache Spark, whose wide range of advanced libraries spares us the extra effort of writing everything from scratch.

My Apache Pig Review