My Apache Pig Review
June 22, 2018
Score 7 out of 10
Overall Satisfaction with Apache Pig
We use Apache Pig within our Information Technology department to meet our need for a distributed processing system. I use it mainly to generate reports with advanced statistical methods, for both internal and external purposes. Our Data Science and Data Engineering teams use it to build pipelines in a Big Data environment and to conduct further advanced analysis, including for machine learning.
- Long, complex logic in Java? Apache Pig is a good alternative.
- Has a lot of great features, including table joins similar to those in a DBMS, Hive, Spark SQL, etc.
- Faster and easier development compared to regular MapReduce jobs.
- Errors from Python UDFs are hard to interpret. A developer can struggle for a very long time after hitting one of these errors.
- Being at an early stage, it still has only a small community to turn to for help.
- It still needs a lot of improvement. Only recently was a datetime module added for time series, which is a very basic requirement.
- It handles large datasets fairly easily compared to SQL, though alternatives are more efficient.
- When working with unstructured, decentralized datasets, Pig is highly beneficial: it is not a complete departure from SQL, yet it does not drag you into the complexity of MapReduce either.
I use both Apache Pig and its alternatives, such as Apache Spark and Apache Hive. Apache Pig was one of the best options in the early days of Big Data, but the alternatives have since taken over the market and left Apache Pig behind. It is still a better option than raw MapReduce, and it is a good choice for working with unstructured datasets. Moreover, in certain cases Apache Pig is much faster than Hive and Spark.
It is a great option for building data pipelines, and it is highly effective for working with unstructured datasets. Because Apache Pig is a procedural language, unlike SQL, it is also easy to learn compared to the alternatives. Still, I would recommend alternatives like Apache Spark because of their wide range of advanced libraries, which saves the extra effort of writing things from scratch.