This post is co-written with Andries Engelbrecht and Scott Teal from Snowflake.
Businesses are constantly evolving, and data leaders are challenged daily to meet new requirements. For many enterprises and large organizations, it is not feasible to have one processing engine or tool to handle the various business requirements. They understand that a one-size-fits-all approach no longer works, and recognize the value in adopting scalable, flexible tools and open data formats to support interoperability in a modern data architecture and accelerate the delivery of new solutions.
Customers are using AWS and Snowflake to develop purpose-built data architectures that provide the performance required for modern analytics and artificial intelligence (AI) use cases. Implementing these solutions requires data sharing between purpose-built data stores. This is why Snowflake and AWS are delivering enhanced support for Apache Iceberg to enable and facilitate data interoperability between data services.
Apache Iceberg is an open-source table format that provides reliability, simplicity, and high performance for large datasets, with transactional integrity between various processing engines. In this post, we discuss the following:
- Advantages of Iceberg tables for data lakes
- Two architectural patterns for sharing Iceberg tables between AWS and Snowflake:
  - Manage your Iceberg tables with AWS Glue Data Catalog
  - Manage your Iceberg tables with Snowflake
- The process of converting existing data lake tables to Iceberg tables without copying the data
Now that you have a high-level understanding of the topics, let's dive into each of them in detail.
Advantages of Apache Iceberg
Apache Iceberg is a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes. Data engineers use Apache Iceberg because it's fast, efficient, and reliable at any scale, and because it keeps records of how datasets change over time. Apache Iceberg offers integrations with popular data processing frameworks such as Apache Spark, Apache Flink, Apache Hive, Presto, and more.
Iceberg tables maintain metadata to abstract large collections of files, providing data management features including time travel, rollback, data compaction, and full schema evolution, which reduces management overhead. Originally developed at Netflix before being open sourced to the Apache Software Foundation, Apache Iceberg was a blank-slate design to solve common data lake challenges like user experience, reliability, and performance. It is now supported by a robust community of developers focused on continually improving the project and adding new features, serving real user needs and providing them with optionality.
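To make features like time travel concrete, here is a minimal PySpark sketch (the catalog, database, and table names are placeholders, and an Iceberg-enabled Spark session is assumed) that lists a table's snapshots and queries the table as of an earlier snapshot:

```python
from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# named "demo_catalog" (a placeholder name).
spark = SparkSession.builder.getOrCreate()

# Every committed transaction produces a snapshot that Iceberg records in table metadata.
spark.sql(
    "SELECT snapshot_id, committed_at, operation "
    "FROM demo_catalog.db.orders.snapshots"
).show()

# Time travel: query the table as it existed at a specific snapshot ID
# (use a snapshot_id value returned by the query above).
spark.sql(
    "SELECT * FROM demo_catalog.db.orders VERSION AS OF 1234567890123456789"
).show()
```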
Transactional data lakes built on AWS and Snowflake
Snowflake provides various integrations for Iceberg tables with multiple storage options, including Amazon S3, and multiple catalog options, including AWS Glue Data Catalog and Snowflake. AWS provides integrations for various AWS services with Iceberg tables as well, including AWS Glue Data Catalog for tracking table metadata. Combining Snowflake and AWS gives you multiple options to build out a transactional data lake for analytical and other use cases such as data sharing and collaboration. By adding a metadata layer to data lakes, you get a better user experience, simplified management, and improved performance and reliability on very large datasets.
Manage your Iceberg tables with AWS Glue
You can use AWS Glue to ingest, catalog, transform, and manage the data on Amazon Simple Storage Service (Amazon S3). AWS Glue is a serverless data integration service that allows you to visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes in Iceberg format. With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. Snowflake integrates with AWS Glue Data Catalog to access the Iceberg table catalog and the files on Amazon S3 for analytical queries. This greatly improves performance and reduces compute cost compared to external tables on Snowflake, because the additional metadata improves pruning in query plans.
You can use this same integration to take advantage of the data sharing and collaboration capabilities in Snowflake. This can be very powerful if you have data in Amazon S3 and need to enable Snowflake data sharing with other business units, partners, suppliers, or customers.
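As an illustration of the write path in this pattern, the following PySpark sketch (bucket, database, and table names are hypothetical; it assumes the Iceberg runtime is available, for example through the --datalake-formats iceberg job parameter on AWS Glue 4.0) loads data into an Iceberg table whose metadata is tracked in AWS Glue Data Catalog:

```python
from pyspark.sql import SparkSession

# Configure an Iceberg catalog named "glue_catalog" that stores table metadata
# in AWS Glue Data Catalog and data files in Amazon S3.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://example-bucket/warehouse/")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .getOrCreate()
)

# Extract from a source (placeholder path), transform as needed, then load into
# an Iceberg table; Glue Data Catalog tracks the table's schema and snapshots.
df = spark.read.parquet("s3://example-bucket/raw/orders/")
df.writeTo("glue_catalog.analytics.orders").using("iceberg").createOrReplace()
```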
The following architecture diagram provides a high-level overview of this pattern.
The workflow includes the following steps:
- AWS Glue extracts data from applications, databases, and streaming sources. AWS Glue then transforms it and loads it into the data lake in Amazon S3 in Iceberg table format, while inserting and updating the metadata about the Iceberg table in AWS Glue Data Catalog.
- The AWS Glue crawler generates and updates Iceberg table metadata and stores it in AWS Glue Data Catalog for existing Iceberg tables on an S3 data lake.
- Snowflake integrates with AWS Glue Data Catalog to retrieve the snapshot location (the sketch after this list shows one way to set up this integration).
- When a query runs, Snowflake uses the snapshot location from AWS Glue Data Catalog to read Iceberg table data in Amazon S3.
- Snowflake can query across Iceberg and Snowflake table formats. You can share data for collaboration with multiple accounts in the same Snowflake region. You can also use data in Snowflake for visualization with Amazon QuickSight, or for machine learning (ML) and artificial intelligence (AI) purposes with Amazon SageMaker.
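On the Snowflake side, the integration in steps 3 and 4 could be set up along these lines. This is a hedged sketch using the snowflake-connector-python package: the role ARN, catalog ID, external volume, and table names are all placeholders, and it assumes an external volume for the S3 location has already been configured:

```python
import snowflake.connector

# Connection details are placeholders; substitute your account, user, and auth method.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="ADMIN_USER", password="<secret>",
    warehouse="ANALYTICS_WH", database="ANALYTICS_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Point Snowflake at the AWS Glue Data Catalog that tracks the Iceberg metadata.
cur.execute("""
    CREATE CATALOG INTEGRATION glue_catalog_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
      GLUE_CATALOG_ID = '123456789012'
      ENABLED = TRUE
""")

# Register the externally managed Iceberg table; Snowflake reads its metadata
# and data files directly from Amazon S3 through the external volume.
cur.execute("""
    CREATE ICEBERG TABLE orders
      EXTERNAL_VOLUME = 'iceberg_s3_volume'
      CATALOG = 'glue_catalog_int'
      CATALOG_TABLE_NAME = 'orders'
""")

# The table can now be queried like any other Snowflake table.
cur.execute("SELECT COUNT(*) FROM orders")
print(cur.fetchone())
```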
Manage your Iceberg tables with Snowflake
A second pattern also provides interoperability across AWS and Snowflake, but implements data engineering pipelines for ingestion and transformation to Snowflake. In this pattern, data is loaded to Iceberg tables by Snowflake through integrations with AWS services like AWS Glue or through other sources like Snowpipe. Snowflake then writes the data directly to Amazon S3 in Iceberg format for downstream access by Snowflake and various AWS services, and Snowflake manages the Iceberg catalog that tracks snapshot locations across tables for AWS services to access.
Similar to the previous pattern, you can use Snowflake-managed Iceberg tables with Snowflake data sharing, but you can also use S3 to share datasets in cases where one party doesn't have access to Snowflake.
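Creating a Snowflake-managed Iceberg table might look like the following sketch (again via snowflake-connector-python, with placeholder names and an assumed preconfigured external volume; the column types are illustrative):

```python
import snowflake.connector

# Connection details are placeholders, as in the previous sketch.
conn = snowflake.connector.connect(
    account="myorg-myaccount", user="ADMIN_USER", password="<secret>",
    warehouse="ANALYTICS_WH", database="ANALYTICS_DB", schema="PUBLIC",
)
cur = conn.cursor()

# With CATALOG = 'SNOWFLAKE', Snowflake acts as the Iceberg catalog and writes
# both data and metadata files to S3 through the external volume.
cur.execute("""
    CREATE ICEBERG TABLE customer_events (
      event_id STRING,
      event_ts TIMESTAMP_NTZ,
      source STRING
    )
      CATALOG = 'SNOWFLAKE'
      EXTERNAL_VOLUME = 'iceberg_s3_volume'
      BASE_LOCATION = 'customer_events/'
""")

# DML works as with any Snowflake table; each commit produces a new Iceberg
# snapshot in S3 that other engines can read.
cur.execute("""
    INSERT INTO customer_events VALUES ('evt-001', CURRENT_TIMESTAMP(), 'demo')
""")
```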
The following architecture diagram provides an overview of this pattern with Snowflake-managed Iceberg tables.
This workflow consists of the following steps:
- In addition to loading data via the COPY command, Snowpipe, and the native Snowflake connector for AWS Glue, you can integrate data via Snowflake Data Sharing.
- Snowflake writes Iceberg tables to Amazon S3 and updates the metadata automatically with every transaction.
- Iceberg tables in Amazon S3 are queried by Snowflake for analytical and ML workloads using services like QuickSight and SageMaker.
- Apache Spark services on AWS can access snapshot locations from Snowflake via the Snowflake Iceberg Catalog SDK and directly scan the Iceberg table files in Amazon S3, as shown in the sketch after this list.
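The Spark side of step 4 might be configured as follows. This is a sketch under stated assumptions: the account URL, credentials, and table names are placeholders, and the Iceberg Spark runtime, the iceberg-snowflake module, and the Snowflake JDBC driver are assumed to be on the Spark classpath:

```python
from pyspark.sql import SparkSession

# Register a read-only Iceberg catalog backed by the Snowflake Iceberg Catalog SDK.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.snowflake_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.snowflake_catalog.catalog-impl",
            "org.apache.iceberg.snowflake.SnowflakeCatalog")
    .config("spark.sql.catalog.snowflake_catalog.uri",
            "jdbc:snowflake://myorg-myaccount.snowflakecomputing.com")
    .config("spark.sql.catalog.snowflake_catalog.jdbc.user", "SPARK_READER")  # placeholder
    .config("spark.sql.catalog.snowflake_catalog.jdbc.password", "<secret>")  # placeholder
    .getOrCreate()
)

# Spark retrieves the table's current snapshot location from Snowflake, then
# scans the Iceberg data files directly in Amazon S3.
spark.table("snowflake_catalog.analytics_db.public.customer_events").show()
```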
Comparing solutions
These two patterns highlight the options available to data personas today to maximize data interoperability between Snowflake and AWS using Apache Iceberg. But which pattern is ideal for your use case? If you're already using AWS Glue Data Catalog and only require Snowflake for read queries, then the first pattern can integrate Snowflake with AWS Glue and Amazon S3 to query Iceberg tables. If you're not already using AWS Glue Data Catalog and require Snowflake to perform reads and writes, then the second pattern is likely a good solution that allows for storing and accessing data from AWS.
Considering that reads and writes will probably operate on a per-table basis rather than across the entire data architecture, it is advisable to use a combination of both patterns.
Migrate existing data lakes to a transactional data lake using Apache Iceberg
You can convert existing Parquet, ORC, and Avro-based data lake tables on Amazon S3 to Iceberg format to reap the benefits of transactional integrity while improving performance and user experience. There are several Iceberg table migration options (SNAPSHOT, MIGRATE, and ADD_FILES) for migrating existing data lake tables in place to Iceberg format, which is preferable to rewriting all the underlying data files, a costly and time-consuming effort with large datasets. In this section, we focus on ADD_FILES, because it's useful for custom migrations.
For the ADD_FILES option, you can use AWS Glue to generate Iceberg metadata and statistics for an existing data lake table and create new Iceberg tables in AWS Glue Data Catalog for future use, without needing to rewrite the underlying data. For instructions on generating Iceberg metadata and statistics using AWS Glue, refer to Migrate an existing data lake to a transactional data lake using Apache Iceberg or Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue.
The ADD_FILES option requires that you pause data pipelines while converting the files to Iceberg tables, which is a straightforward process in AWS Glue because the destination just needs to be changed to an Iceberg table.
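In Spark terms, an ADD_FILES-style migration might look like the following sketch (paths and table names are placeholders; it reuses the hypothetical "glue_catalog" Iceberg catalog configured in the earlier sketch):

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with an Iceberg catalog named "glue_catalog",
# configured as in the earlier AWS Glue sketch.
spark = SparkSession.builder.getOrCreate()

# Create an empty Iceberg table that matches the existing Parquet data's schema.
existing = spark.read.parquet("s3://example-bucket/raw/orders/")
existing.limit(0).writeTo("glue_catalog.analytics.orders_iceberg").using("iceberg").create()

# Register the existing Parquet files with the new table. Iceberg generates
# metadata and statistics but leaves the data files in place, unrewritten.
spark.sql("""
    CALL glue_catalog.system.add_files(
      table => 'analytics.orders_iceberg',
      source_table => '`parquet`.`s3://example-bucket/raw/orders/`'
    )
""")
```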
Conclusion
In this post, you saw two architecture patterns for implementing Apache Iceberg in a data lake for better interoperability across AWS and Snowflake. We also provided guidance on migrating existing data lake tables to Iceberg format.
Join us for AWS Dev Day on April 10 to get hands-on not only with Apache Iceberg, but also with streaming data pipelines using Amazon Data Firehose and Snowpipe Streaming, and generative AI applications with Streamlit in Snowflake and Amazon Bedrock.
About the Authors
Andries Engelbrecht is a Principal Partner Solutions Architect at Snowflake and works with strategic partners. He is actively engaged with strategic partners like AWS, supporting product and service integrations as well as the development of joint solutions with partners. Andries has over 20 years of experience in the field of data and analytics.
Deenbandhu Prasad is a Senior Analytics Specialist at AWS, specializing in big data services. He is passionate about helping customers build modern data architectures on the AWS Cloud. He has helped customers of all sizes implement data management, data warehouse, and data lake solutions.
Brian Dolan joined Amazon as a Military Relations Manager in 2012 after his first career as a Naval Aviator. In 2014, Brian joined Amazon Web Services, where he helped Canadian customers from startups to enterprises explore the AWS Cloud. Most recently, Brian was a member of the Non-Relational Business Development team as a Go-To-Market Specialist for Amazon DynamoDB and Amazon Keyspaces before joining the Analytics Worldwide Specialist Organization in 2022 as a Go-To-Market Specialist for AWS Glue.
Nidhi Gupta is a Sr. Partner Solution Architect at AWS. She spends her days working with customers and partners, solving architectural challenges. She is passionate about data integration and orchestration, serverless and big data processing, and machine learning. Nidhi has extensive experience leading architecture design and production launches and deployments for data workloads.
Scott Teal is a Product Marketing Lead at Snowflake and focuses on data lakes, storage, and governance.