Replicating Enterprise Data at Scale: How PeerDB's CEO is Solving the Challenges of Migrating to Data Warehouses

Discover how PeerDB's CEO is solving the challenges of migrating to data warehouses at scale. Learn how PeerDB's peer-to-peer architecture and innovative engineering enable fast, reliable, and cost-effective data replication from Postgres to warehouses like Snowflake and BigQuery.

February 24, 2025


Moving enterprise data at scale can be a complex challenge, but PeerDB's CEO Sai Srirampur has developed a solution that makes it fast, simple, and cost-effective to replicate data from PostgreSQL to data warehouses, queues, and storage. PeerDB's laser-focused approach and commitment to quality over breadth set it apart, delivering reliable performance and native feature support that help enterprises streamline their data movement needs.

Replicating Data at Scale with PeerDB's CEO Sai Srirampur

At PeerDB, our focus is on building the world's best solution for replicating data from Postgres to data warehouses, queues, and storage. We use a peer-to-peer architecture, which allows us to deeply optimize the connector between Postgres and the target system.

Some key technical challenges we've solved include:

  1. Parallel Snapshotting: For initial data loads, we partition large Postgres tables based on internal identifiers and stream the data in parallel to the target. This allows us to move terabytes of data in hours instead of days.

  2. Optimized CDC: For incremental replication, we leverage Postgres logical replication slots to capture changes. We then apply optimizations like AO conversion and zstd compression to achieve sub-minute latencies, well below what existing tools offer.

  3. Native Data Type Support: We ensure that rich Postgres data types like geospatial are seamlessly replicated to the target in their native format, avoiding the need for costly transformations.
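As a rough illustration of the parallel snapshotting idea (a simplified sketch, not PeerDB's actual implementation), splitting a table into contiguous key ranges and streaming each range on its own worker might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def partition_ranges(min_id, max_id, num_partitions):
    """Split [min_id, max_id] into contiguous ranges, one per worker."""
    span = max_id - min_id + 1
    size = -(-span // num_partitions)  # ceiling division
    ranges = []
    start = min_id
    while start <= max_id:
        end = min(start + size - 1, max_id)
        ranges.append((start, end))
        start = end + 1
    return ranges

def snapshot_table(fetch_rows, write_rows, min_id, max_id, workers=8):
    """Stream each partition to the target in parallel.

    fetch_rows(lo, hi) and write_rows(rows) are hypothetical stand-ins
    for the source-side SELECT and the target-side bulk load.
    """
    def copy_partition(bounds):
        lo, hi = bounds
        write_rows(fetch_rows(lo, hi))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_partition, partition_ranges(min_id, max_id, workers)))
```

Because each range is independent, throughput scales with the number of workers until the source, network, or target becomes the bottleneck.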

Our open-source approach has been invaluable, providing visibility, validation, and trust with our users. While a portion of our users opt for the open-source version, the majority prefer our managed service, which offers enterprise-grade support and additional features.

Blogging has been a key part of our go-to-market strategy. We divide our content into four buckets: product updates, community learnings, engineering deep dives, and fun/creative pieces. The goal is to build awareness and thought leadership over time, rather than expecting immediate results.

As a founder, I'm learning constantly - from managing a diverse set of responsibilities to iterating on our product strategy based on customer feedback. Our current focus is on nailing the Postgres-to-warehouse use case, and we aim to become the go-to data movement tool for Postgres in the years to come.

Solving Challenges with Existing Data Movement Tools

At scale, customers faced several issues with existing data movement tools when replicating data from PostgreSQL to data warehouses:

  1. Performance and Reliability: Moving large volumes of data, whether terabytes in a single database or across a fleet of PostgreSQL databases, was slow and unreliable. Initial loads and ongoing synchronization would often take multiple days and break midway, requiring manual intervention.

  2. Feature Support: Existing tools did not natively support many of the rich data types and features available in PostgreSQL, such as geospatial data types, JSON columns, and partitioning. This forced customers to perform additional transformations, adding complexity and overhead.

  3. Cost: The pricing models of existing tools, often based on the volume of data transferred or the number of rows, resulted in high and unpredictable costs for customers running large-scale PostgreSQL workloads.

These challenges led customers to resort to building in-house solutions using open-source tools like Debezium, which, while functional at scale, required significant engineering effort and resources to implement and maintain.

To address these problems, the PeerDB team has developed a peer-to-peer architecture focused on providing a robust, high-performance, and feature-rich data movement solution specifically for PostgreSQL. Key technical innovations include:

  • Parallel Snapshotting: Partitioning large tables and streaming the data in parallel to enable terabytes of data to be moved in hours instead of days.
  • Optimized Incremental Replication: Leveraging PostgreSQL's logical replication slots, performing AO conversion and zstd compression to achieve sub-minute latencies.
  • Native Data Type Support: Preserving rich data types, such as geospatial data, by converting them to the appropriate formats for the target data warehouse.
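At its core, native data type support means maintaining a faithful mapping from each Postgres type to a native type on the target. A hypothetical, heavily simplified mapping for a Snowflake-like target (the real connector covers many more types and parameters) might look like:

```python
# Hypothetical Postgres -> warehouse type mapping; illustrative only.
PG_TO_WAREHOUSE = {
    "integer": "INTEGER",
    "bigint": "BIGINT",
    "numeric": "NUMBER",
    "text": "VARCHAR",
    "jsonb": "VARIANT",        # kept semi-structured, not flattened to a string
    "timestamptz": "TIMESTAMP_TZ",
    "geography": "GEOGRAPHY",  # geospatial data stays native, no lossy cast
}

def map_column(pg_type: str) -> str:
    """Resolve a Postgres column type to its native target type."""
    try:
        return PG_TO_WAREHOUSE[pg_type]
    except KeyError:
        raise ValueError(f"no native mapping for Postgres type {pg_type!r}")
```

Preserving types this way is what spares customers the downstream transformations described above.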

By addressing the core challenges faced by customers, PeerDB aims to provide the world's best experience for replicating data from PostgreSQL to data warehouses, queues, and storage.

Key Features and Technical Advantages of PeerDB

PeerDB is designed to provide a robust and high-performance solution for replicating data from PostgreSQL to data warehouses, queues, and storage. Some of the key features and technical advantages of PeerDB include:

  1. Parallel Snapshotting: PeerDB utilizes a unique parallel snapshotting technique to move terabytes of data from PostgreSQL to the target in a matter of hours, rather than days, as seen with other generalized ETL tools.

  2. Incremental Replication with Low Latency: PeerDB leverages PostgreSQL's logical replication slots to achieve incremental data replication with latencies of less than 1 minute, significantly faster than the 5-minute minimum latency of existing tools.

  3. Native Data Type Support: PeerDB ensures that rich data types in PostgreSQL, such as geospatial data, are preserved and replicated in their native format to the target, avoiding the need for costly transformations.

  4. Performance Optimizations: PeerDB employs several performance-enhancing techniques, including converting data to Append-Optimized (AO) format for Snowflake and utilizing zstd compression, which can provide up to a 30% performance improvement.

  5. Parallel Merges: When applying changes to the target, PeerDB performs parallel merges to ensure efficient and high-throughput data replication.

  6. Peer-to-Peer Architecture: Unlike hub-and-spoke models used by many generalized ETL tools, PeerDB's peer-to-peer architecture allows it to focus on building a robust and high-quality connector between PostgreSQL and specific targets, rather than supporting a broad range of connectors.

  7. Open-Source Approach: PeerDB is an open-source project, which provides transparency, builds trust with customers, and allows for community contributions and validation of the tool's capabilities.
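The interplay between low-latency CDC and parallel merges can be sketched as a loop that batches decoded changes, groups them by table, and applies each table's merge concurrently. This is an illustrative skeleton under assumed helper names (merge_table stands in for the target-side upsert/MERGE), not PeerDB's code:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def apply_batch(changes, merge_table, workers=4):
    """Group a batch of CDC changes by table and merge them in parallel.

    Each change is a (table, row) pair. Running the per-table merges
    concurrently keeps end-to-end latency below the sync interval even
    when a batch touches many tables.
    """
    by_table = defaultdict(list)
    for table, row in changes:
        by_table[table].append(row)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(merge_table, t, rows) for t, rows in by_table.items()]
        for f in futures:
            f.result()  # re-raise any merge error instead of dropping it
```

Merges across different tables are safe to parallelize because each one touches a disjoint target table; ordering only has to be preserved within a table.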

By addressing the performance, reliability, and feature limitations of existing data movement tools, PeerDB aims to deliver a superior experience for customers who need to replicate data from PostgreSQL to their target data stores, whether it's for real-time analytics, fraud detection, or other use cases.

PeerDB's Open Source Strategy and Go-to-Market Insights

Open source was a no-brainer for PeerDB, given the team's backgrounds and the fact that they are building a data movement tool for PostgreSQL, which is fully open source. The benefits they've seen from open sourcing PeerDB include:

  1. Validation: PeerDB has several large-scale production workloads using the open-source version, which validates that there is a real need for their product.

  2. Visibility: The open-source activity, stars, and community engagement help increase PeerDB's visibility.

  3. Trust: Offering an open-source version builds trust with customers, as they can inspect the code and see that PeerDB is not tied to proprietary software.

The ratio of open-source to paid customers varies based on the complexity of the tool. For PeerDB, around 2-3 out of 10 customers use the open-source version, while the rest prefer the managed service or enterprise offering with support.

Regarding PeerDB's content strategy, they divide their blog into four main buckets:

  1. Product: Updates on new features and releases.
  2. Community: Sharing learnings and insights that the community would find valuable.
  3. Engineering: Diving into the technical details of how PeerDB is built.
  4. Fun: Lighthearted and creative blog posts.

The goal of the blog is to raise awareness about PeerDB and showcase the benefits it provides. While the immediate impact may not be visible, the team has seen that the blog can lead to customers reaching out after 1-2 years of following the content.

As for the future of PeerDB, the team's vision is to make it the go-to data movement tool for PostgreSQL, providing the world's best experience for any data movement use case, whether it's getting data into or out of PostgreSQL. The immediate focus is on nailing the change data capture use case from PostgreSQL to data warehouses, queues, and storage.

Founder Lessons: Building a Team and Defining Product Focus

As a founder, Sai has learned several valuable lessons about running a team and determining product focus. He emphasizes that being a founder involves wearing many hats and learning diverse skills, from product to sales to marketing and investor relations. This diverse set of responsibilities is a significant change from his previous roles at Microsoft and Citus Data.

Sai relies on a network of mentors and champions to guide him through the challenges of founding a startup. He leans on the expertise of his investors, co-founder, and others he has worked with in the past. This support system helps him navigate the uncertainty of whether his current experiment will succeed.

The core strategy Sai and his team have adopted is to maintain a laser-like focus on their current experiment - providing the world's best solution for replicating data from Postgres to data warehouses, queues, and storage. They view this as a critical experiment that will determine the direction of the company over the next 6 months to a year.

Sai believes that execution is more important than the initial idea, as ideas can evolve and pivot over time. However, he acknowledges that having a strong starting point, rooted in the founder's market experience, can provide a helpful foundation. For Sai, his background working with customers on Postgres data movement challenges gave him a valuable edge in identifying the problem to solve.

Ultimately, Sai emphasizes the importance of being persistent, patient, and adaptable as a founder. He recognizes the ups and downs of the startup journey and the need to maintain a stoic mindset, not getting too excited by highs or too discouraged by lows. By focusing on execution across all aspects of the business, from engineering to marketing, Sai and his team aim to determine whether their current experiment will lead to product-market fit and scale.

PeerDB's Vision for 2024 and Beyond

In 2024, PeerDB aims to provide the world's best experience for replicating data from PostgreSQL to data warehouses, queues, and storage. The key focus areas are:

  1. Performance: PeerDB will be top-notch in performance, allowing customers to move terabytes of data quickly and reliably.

  2. Simplicity: PeerDB will be extremely simple to use, with a focus on providing a great user experience.

  3. Cost-Effectiveness: PeerDB will be cost-effective and offer transparent pricing, unlike existing tools that can be expensive and difficult to predict.

Beyond 2024, the long-term vision for PeerDB is to become the go-to data movement tool for PostgreSQL. The goal is to provide the best experience for any data movement use case, whether it's getting data into or out of PostgreSQL.

PeerDB aims to be to PostgreSQL what tools like Oracle's OCI Data Integration Suite and SQL Server Integration Services are to their respective databases - a dedicated, high-performance data movement solution. The immediate focus is on nailing the change data capture use case from PostgreSQL, but the team is also working towards expanding the supported connectors and use cases over time.

The key to achieving this vision is a relentless focus on execution. While the initial idea provided a starting point, the team believes that execution, persistence, and a deep understanding of customer needs are more important for startup success. By listening to customers, iterating on the product, and building a strong team and structure across marketing, engineering, and product, PeerDB aims to find the right product-market fit and scale its solution.

Conclusion

At a high level, PeerDB makes it fast and simple to replicate data from Postgres to data warehouses, queues, and storage. The key technical challenges that PeerDB has solved include:

  1. Parallel Snapshotting: PeerDB partitions large Postgres tables based on internal identifiers and streams the data in parallel to the target, enabling terabytes of data to be moved in hours instead of days.

  2. Optimized Incremental Replication: PeerDB leverages Postgres logical replication slots and performs optimizations like AO conversion and zstd compression to achieve sub-minute latency for change data capture.

  3. Native Data Type Support: PeerDB ensures that rich data types in Postgres, like geospatial data, are preserved and converted to the appropriate native formats in the target systems.

PeerDB has adopted an open-source strategy, which has provided benefits in terms of validation, visibility, and building trust with customers. The open-source approach has also helped the team stay focused on quality over breadth.

Looking ahead to 2024, the goal for PeerDB is to provide the world's best experience for replicating data from Postgres to data warehouses, queues, and storage. The team aims to make PeerDB the go-to data movement tool for Postgres, supporting a wide range of use cases beyond just change data capture.

The founder emphasizes that execution is more important than the initial idea, as the idea will evolve, and it's the team's persistence, patience, and structured approach across marketing, engineering, and product that will ultimately determine the startup's success.
