Apache Pig

Overview

What is Apache Pig?

Apache Pig is a high-level platform and scripting language (Pig Latin) for creating MapReduce programs that run on Apache Hadoop.


Product Details


Apache Pig Technical Details

Operating Systems: Unspecified
Mobile Application: No


Reviews and Ratings (22)

Community Insights

TrustRadius Insights are summaries of user sentiment data from TrustRadius reviews and, when necessary, 3rd-party data sources.

Apache Pig has proven to be an invaluable tool for data engineers working with large datasets in the Apache Hadoop ecosystem. Users have found it to be an excellent high-level scripting language that simplifies the process of working with big data. With Apache Pig, data engineers can easily build pipelines for advanced analysis and machine learning purposes, expressing data transformations that Pig compiles into optimized MapReduce jobs.
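To make that concrete, here is a minimal Pig Latin sketch of such a pipeline. The input path, delimiter, and the (user, url, bytes) schema are illustrative assumptions, not details taken from any reviewer.

    -- Load raw records, filter them, project the needed columns, and store the result.
    raw     = LOAD 'hdfs:///data/clicks' USING PigStorage('\t')
                  AS (user:chararray, url:chararray, bytes:long);
    big     = FILTER raw BY bytes > 1024;        -- keep only large records
    trimmed = FOREACH big GENERATE user, url;    -- project the columns of interest
    STORE trimmed INTO 'hdfs:///data/clicks_trimmed';
    -- Pig compiles this script into one or more MapReduce (or Tez/Spark) jobs.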

One of the key advantages of Apache Pig is its ability to express complex MapReduce or Spark jobs without requiring deep knowledge of Java, Python, or Groovy. This feature has been highly appreciated by users who value the efficiency and simplicity it brings to their work. Additionally, Apache Pig's query language, Pig Latin, provides users with a straightforward way to build data pipelines, eliminate redundant data, and plug in user-defined functions (UDFs).
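As a rough illustration of how UDFs and de-duplication fit into a script, consider the sketch below; the jar name, UDF class, input file, and schema are hypothetical placeholders.

    -- Register a (hypothetical) jar and give its UDF a short alias.
    REGISTER 'myudfs.jar';
    DEFINE Normalize com.example.pig.Normalize();

    events  = LOAD 'events.tsv' USING PigStorage('\t') AS (id:chararray, payload:chararray);
    cleaned = FOREACH events GENERATE id, Normalize(payload) AS payload;
    unique  = DISTINCT cleaned;    -- eliminate redundant rows
    DUMP unique;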

The software also gives users control over task execution, which is crucial in a distributed processing system. This control allows users to efficiently handle data-movement problems and manage large volumes of data, including streaming data in from multiple sources and performing joins. Users have utilized Apache Pig to explore and process large datasets in big data analytics projects, performing various operations within a single Java Virtual Machine.
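A sketch of what joining two sources while controlling reduce-side parallelism might look like; the file names, schemas, and parallelism values are assumptions.

    SET default_parallel 20;                                 -- default number of reducers
    users  = LOAD 'users.tsv'  USING PigStorage('\t') AS (uid:chararray, country:chararray);
    orders = LOAD 'orders.tsv' USING PigStorage('\t') AS (uid:chararray, amount:double);

    joined  = JOIN orders BY uid, users BY uid PARALLEL 40;  -- per-operator override
    by_ctry = GROUP joined BY users::country;
    totals  = FOREACH by_ctry GENERATE group AS country,
                                       SUM(joined.orders::amount) AS revenue;
    STORE totals INTO 'revenue_by_country';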

Another key use case for Apache Pig is generating aggregate statistics, running refinement and filtering on logs, and producing reports for both internal use and customer deliveries. Data science and data engineering teams also use Apache Pig to build big data workflow pipelines for ETL and analytics. The software simplifies the creation of these pipelines through its native Pig Latin language, which combines familiar concepts from systems such as Hive, traditional DBMSs, and Spark SQL.
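A log-report pipeline of the kind described above might look like the following sketch; the log location, delimiter, and fields are assumptions.

    logs   = LOAD 'logs/2022-04-07' USING PigStorage(' ')
                 AS (ip:chararray, ts:chararray, status:int, bytes:long);
    ok     = FILTER logs BY status == 200;          -- refine: keep successful requests
    by_ip  = GROUP ok BY ip;
    report = FOREACH by_ip GENERATE group AS ip,
                                    COUNT(ok)     AS requests,
                                    SUM(ok.bytes) AS total_bytes;
    STORE report INTO 'reports/daily_traffic' USING PigStorage(',');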

Overall, Apache Pig offers a versatile solution for handling big data tasks in a simple yet efficient manner. Its user-friendly query language and extensive capabilities make it a valuable tool for data engineers working in the Apache Hadoop ecosystem.

Users have provided several recommendations for using Pig as a tool for quickly writing big data applications.

One recommendation is that Pig is a good starting point for developing ad-hoc analytics applications, especially for those with basic programming experience in Java.

Another recommendation is to use Pig as a base pipeline for parallelizing and utilizing User-Defined Functions (UDFs) on large datasets. The lazy evaluation feature of Pig allows for efficient program optimization.
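That lazy evaluation is easy to see in a script like the sketch below (paths and schema are made up): defining the aliases only builds a plan, which Pig optimizes and executes when a result is stored or dumped.

    a = LOAD 'input.tsv' USING PigStorage('\t') AS (k:chararray, v:long);
    b = FILTER a BY v > 0;
    c = GROUP b BY k;
    d = FOREACH c GENERATE group, SUM(b.v) AS total;

    EXPLAIN d;              -- prints the logical/physical/MapReduce plans without running them
    STORE d INTO 'output';  -- only now is the whole pipeline executed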

Users also appreciate Pig's integration with Hadoop, which provides parallelization, fault-tolerance, and relational database features. This makes Pig suitable for applying statistics to datasets, and its functional programming paradigm aligns well with pipeline processes.

Additionally, users suggest considering Spark or Hive as alternative tools for developing pipelines. While Pig may be more suitable for developers with programming experience, it is free and has extensive online documentation available for learning purposes.


Reviews (1-1 of 1)
Jordan Moore | TrustRadius Reviewer
Score 8 out of 10
Vetted Review
Verified User
Incentivized
Pig is used by data engineers as a stopgap between setting up a Spark environment and having more declarative flexibility than HiveQL while moving away from MapReduce. It solves the problem of needing to iteratively transform and migrate data between supported Hadoop environments while being able to debug the process at each step.
  • Iterative Development - you can define aliases/variables that are not executed immediately; they are stored in a DAG, which is only evaluated when an alias is dumped or stored.
  • Fast execution - Works with MapReduce, Tez, or Spark execution frameworks to provide fast run times at large scales.
  • Local and remote interoperability - Scripts that need to be tested against a small dataset locally before running on the full one can simply be run with "pig -x local."
  • The general syntax of the FOREACH ... GENERATE feature is confusing for nested actions (see the sketch after this list).
  • The docs are hard to navigate, but that is made up for by reasonable examples.
  • A version number below 1.0 doesn't instill confidence in a product that has been around for over half a decade (as of writing).
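For readers unfamiliar with the construct mentioned above, here is an illustrative nested FOREACH ... GENERATE; the relation names and schema are made up, and the script can be tested locally with "pig -x local".

    visits  = LOAD 'visits.tsv' USING PigStorage('\t') AS (user:chararray, url:chararray);
    by_user = GROUP visits BY user;

    -- The nested block operates on the bag of rows belonging to each group.
    per_user = FOREACH by_user {
        urls = visits.url;
        uniq = DISTINCT urls;
        GENERATE group AS user, COUNT(uniq) AS distinct_urls;
    };
    DUMP per_user;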
If someone wants to process data, doesn't have access to platforms such as Spark or Flink, and wants to do so in a minimal, portable fashion that simply requires learning a new scripting language, then Pig is great. It also supports running the same code against a cluster as well as on a single developer machine for testing.

Pig is better suited to batch ETL workloads than to ML or streaming big data use cases.
  • Iterate quickly on ETL pipelines.
  • Scale up parallel processing.
  • Easily templatable scripting language (see the parameter-substitution sketch below).
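As an illustration of that templatability, Pig's parameter substitution lets one script serve many runs; the parameter names, paths, schema, and script name below are hypothetical.

    -- Invoked, for example, as:  pig -param DATE=2022-04-07 -param OUT=reports daily.pig
    logs    = LOAD '/logs/$DATE' USING PigStorage('\t') AS (user:chararray, bytes:long);
    by_user = GROUP logs BY user;
    stats   = FOREACH by_user GENERATE group AS user, SUM(logs.bytes) AS total_bytes;
    STORE stats INTO '$OUT/$DATE';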
Pig is more focused on scripting in its own Pig Latin language rather than integrating with another language like Java/Scala/Python/SQL.
However, for batch ETL workloads, I find that I can write a Pig script quicker than setting up and deploying a Spark program, for example.
The documentation is adequate. I'm not sure how large of an external community there is for support.