Pachyderm

Pachyderm is the leader in data versioning and pipelines for MLOps. We provide the data foundation that allows data science teams to automate and scale their machine learning lifecycle while guaranteeing reproducibility. With investment from Benchmark, Microsoft M12, and others, Pachyderm, Inc. offers a commercial Pachyderm Enterprise Edition and an open source Pachyderm Community Edition. Pachyderm helps customers get their ML and AI projects to market faster, lower data processing and storage costs, and meet strict data governance requirements.

Key Features

  • Automated Data Versioning — Pachyderm’s Data Versioning gives teams an automated and performant way to keep track of all data changes

    • Utilizes a Git-like structure that enables effective team collaboration through commits, branches and rollbacks

    • Powerful content-based deduplication reduces the cost of storing and accessing large data sets

    • File-based versioning provides a complete audit trail for all data and artifacts across pipeline stages including intermediate results

    • Data is stored as native objects (not metadata pointers), so versioning is automated and guaranteed

  • Data-Driven Pipelines — Pachyderm’s Containerized Pipelines speed data processing while lowering compute costs

    • Kubernetes-native approach supports any library or language

    • Autoscale with parallel processing of data without writing additional code

    • Automated pipelines execute whenever new data is committed

    • Incremental processing saves compute by only processing differences and automatically skipping duplicate data

    • Pipeline steps have JSON/YAML defined inputs and outputs that ease debugging

  • Immutable Data Lineage — Pachyderm’s Data Lineage provides an immutable record for all activities and assets in the ML lifecycle

    • Track every version of your code, models, and data

    • Maintain reproducibility of data and code for compliance

    • Manage relationships between historical data states

    • Pachyderm’s Global IDs make it easy for teams to track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results

  • Console — The Pachyderm Console provides an intuitive visualization of your DAG (directed acyclic graph) and aids in reproducibility

    • See the overall structure and flow of all your pipelines

    • Ease pipeline and workflow design

    • Facilitate collaboration across teams on shared DAGs

    • Drill into pipelines and job details for easy debugging

  • Notebooks — Pachyderm’s JupyterLab Mount Extension provides a point-and-click interface to Pachyderm versioned data

    • Accelerate experimentation with easy and intuitive access to versioned data

    • Mount any Pachyderm data repository locally for convenient access

    • Work with versioned data as if it were on your own file system, with no Pachyderm knowledge required

    • Explore data with a built-in file browser

    • Collaborate across teams with a single source of truth for your data

  • Enterprise Administration — Pachyderm provides robust tools for deploying and administering Pachyderm at scale across different teams in your organization

    • Helm 3 provides robust and standards-based deployment on any public or private cloud

    • Enterprise Server provides easy centralized licensing and administration of all Pachyderm clusters / workspaces

    • Use any identity provider with Pachyderm’s pluggable authentication

    • Role-Based Access Control (RBAC) allows fine-grained control over access to clusters and data
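
To make the "JSON/YAML defined inputs and outputs" of a pipeline step more concrete, here is a minimal Python sketch that builds a pipeline specification and prints it as JSON. The pipeline name, input repository, container image, and command are hypothetical placeholders, and the helper `build_pipeline_spec` is invented for this illustration. The field layout follows the general shape of a Pachyderm pipeline spec, but the exact fields can vary between versions, so treat it as a sketch rather than a reference.

```python
import json


def build_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Return a dict shaped like a Pachyderm pipeline specification.

    This helper is hypothetical; it only assembles the JSON structure.
    """
    return {
        "pipeline": {"name": name},
        # The glob pattern splits the input repo into units of work;
        # "/*" treats each top-level file or directory as its own unit.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The container image and the command to run inside it.
        "transform": {"image": image, "cmd": list(cmd)},
    }


spec = build_pipeline_spec(
    name="edges",                       # hypothetical pipeline name
    input_repo="images",                # hypothetical input repository
    image="example/edge-detector:1.0",  # hypothetical container image
    cmd=["python3", "/edges.py"],       # hypothetical entry point
)
print(json.dumps(spec, indent=2))
```

Saved to a file, a spec like this is typically submitted with `pachctl create pipeline -f <file>`. Once the pipeline exists, new commits to the input repository trigger it, which is the behavior the Data-Driven Pipelines bullets describe.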

Products