In our previous post, Backtesting index rebalancing arbitrage with Amazon EMR and Apache Iceberg, we showed how to use Apache Iceberg in the context of strategy backtesting. In this post, we focus on data management implementation options such as accessing data directly in Amazon Simple Storage Service (Amazon S3), using popular data formats like Parquet, or using open table formats like Iceberg. Our experiments are based on real-world historical full order book data, provided by our partner CryptoStruct, and compare the trade-offs between these choices, focusing on performance, cost, and quant developer productivity.
Data management is the foundation of quantitative research. Quant researchers spend approximately 80% of their time on necessary but not impactful data management tasks such as data ingestion, validation, correction, and reformatting. Traditional data management choices include relational, SQL, NoSQL, and specialized time series databases. In recent years, advances in parallel computing in the cloud have made object stores like Amazon S3 and columnar file formats like Parquet a preferred choice.
This post explores how Iceberg can enhance quant research platforms by improving query performance, reducing costs, and increasing productivity, ultimately enabling faster and more efficient strategy development in quantitative finance. Our analysis shows that Iceberg can accelerate query performance by up to 52%, reduce operational costs, and significantly improve data management at scale.
Having chosen Amazon S3 as our storage layer, a key decision is whether to access Parquet files directly or use an open table format like Iceberg. Through its metadata layer, Iceberg offers distinct advantages over plain Parquet, such as improved data management, performance optimization, and integration with various query engines.
In this post, we use the term vanilla Parquet to refer to Parquet files stored directly in Amazon S3 and accessed through standard query engines like Apache Spark, without the additional features provided by table formats such as Iceberg.
Quant developer and researcher productivity
In this section, we focus on the productivity features offered by Iceberg and how it compares to directly reading files in Amazon S3. As mentioned earlier, 80% of quantitative research work is attributed to data management tasks. Business impact heavily relies on quality data (“garbage in, garbage out”). Quants and platform teams have to ingest data from multiple sources with different velocities and update frequencies, and then validate and correct the data. These activities translate into the ability to run append, insert, update, and delete operations. For simple append operations, both Parquet on Amazon S3 and Iceberg offer similar convenience and productivity. However, real-world data is never perfect and needs to be corrected. Gap filling (inserts), error corrections and restatements (updates), and removing duplicates (deletes) are the most obvious examples. When writing data in the Parquet format directly to Amazon S3 without using an open table format like Iceberg, you have to write code to identify the affected partition, correct errors, and rewrite the partition. Moreover, if the write job fails or a downstream read job runs during this write operation, downstream jobs can read inconsistent data. Iceberg, by contrast, has built-in insert, update, and delete features with ACID (Atomicity, Consistency, Isolation, Durability) properties, and the framework itself manages the Amazon S3 mechanics on your behalf.
Guarding against lookahead bias is an essential capability of any quant research platform: what backtests as a profitable trading strategy can prove useless and unprofitable in real time. Iceberg provides time travel and snapshotting capabilities out of the box to manage lookahead bias that could be embedded in the data (such as delayed data delivery).
Simplified data corrections and updates
Iceberg enhances data management for quants in capital markets through its robust insert, delete, and update capabilities. These features allow efficient data corrections, gap filling in time series, and historical data updates without disrupting ongoing analyses or compromising data integrity.
Unlike direct Amazon S3 access, Iceberg supports these operations on petabyte-scale data lakes without requiring complex custom code. This simplifies data modification processes, which is critical for ingesting and updating large volumes of market and trade data, quickly iterating on backtesting and reprocessing workflows, and maintaining detailed audit trails for risk and compliance requirements.
Iceberg’s table format separates data files from metadata files, enabling efficient data modifications without full dataset rewrites. This approach also reduces the expensive ListObjects API calls typically needed when directly accessing Parquet files in Amazon S3.
Additionally, Iceberg offers merge on read (MoR) and copy on write (CoW) approaches, providing flexibility for different quant research needs. MoR enables faster writes, suitable for frequently updated datasets, and CoW provides faster reads, beneficial for read-heavy workflows like backtesting.
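To make the MoR and CoW trade-off concrete, the following is a minimal pure-Python sketch. It is a toy model, not Iceberg code: the class names, the dictionary standing in for a data file, and the list of deltas standing in for delete and update files are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CopyOnWriteTable:
    """Toy CoW table: every update rewrites the affected data file."""
    rows: dict = field(default_factory=dict)  # stands in for one data file
    files_rewritten: int = 0

    def update(self, key, price):
        # Build a fresh copy of the "file" with the corrected row applied.
        self.rows = {**self.rows, key: price}
        self.files_rewritten += 1

    def read(self):
        # Reads are a plain scan; nothing has to be merged.
        return dict(self.rows)


@dataclass
class MergeOnReadTable:
    """Toy MoR table: updates append small deltas that are merged at read time."""
    base: dict = field(default_factory=dict)
    deltas: list = field(default_factory=list)  # (key, price) in commit order

    def update(self, key, price):
        # Cheap write: the base file is left untouched.
        self.deltas.append((key, price))

    def read(self):
        # Reads pay the cost of applying every delta on top of the base.
        merged = dict(self.base)
        for key, price in self.deltas:
            merged[key] = price
        return merged


# Hypothetical tick data keyed by (instrument, sequence number).
ticks = {("BTC-USD", 1): 100.0, ("BTC-USD", 2): 101.0}
cow = CopyOnWriteTable(rows=dict(ticks))
mor = MergeOnReadTable(base=dict(ticks))
for table in (cow, mor):
    table.update(("BTC-USD", 2), 101.5)  # restatement (update)
    table.update(("BTC-USD", 3), 102.0)  # gap fill (insert)
```

Both tables return the same rows on read; they differ in where the work happens. The CoW table rewrote its file twice, whereas the MoR table left the base untouched and deferred the merge to read time, which is why MoR tends to suit frequent corrections and CoW tends to suit repeated backtest scans.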
For example, when a new data source or attribute is added, quant researchers can seamlessly incorporate it into their Iceberg tables and then reprocess historical data, confident they’re using correct, time-appropriate information. This capability is particularly valuable in maintaining the integrity of backtests and the reliability of trading strategies.
In scenarios involving large-scale data corrections or updates, such as adjusting for stock splits or dividend payments across historical data, Iceberg’s efficient update mechanisms significantly reduce processing time and resource usage compared to traditional methods.
These features collectively improve productivity and data management efficiency in quant research environments, allowing researchers to focus more on strategy development and less on data handling complexities.
Historical data access for backtesting and validation
Iceberg’s time travel feature enables quant developers and researchers to access and analyze historical snapshots of their data. This capability is useful for tasks like backtesting, model validation, and understanding data lineage.
Iceberg simplifies time travel workflows on Amazon S3 by introducing a metadata layer that tracks the history of changes made to the table. You can use this metadata layer to build a mental model of how Iceberg’s time travel capability works.
Iceberg’s time travel capability is driven by a concept called snapshots, which are recorded in metadata files. These metadata files act as a central repository that stores table metadata, including the history of snapshots. Additionally, Iceberg uses manifest files to provide a representation of data files, their partitions, and any associated delete files. These manifest files are referenced in the metadata snapshots, allowing Iceberg to identify the relevant data for a specific point in time.
When a user requests a time travel query, the typical workflow involves querying a specific snapshot. Iceberg uses the snapshot identifier to locate the corresponding metadata snapshot in the metadata files. The time travel capability is invaluable to quants, enabling them to backtest and validate strategies against historical data, reproduce and debug issues, perform what-if analysis, comply with regulations by maintaining audit trails and reproducing past states, and roll back and recover from data corruption or errors. Quants can also gain deeper insights into current market trends and correlate them with historical patterns. The time travel feature also further mitigates the risk of lookahead bias. Researchers can access the exact data snapshots that were present in the past, and then run their models and strategies against this historical data, without the risk of inadvertently incorporating future information.
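The snapshot lookup described above can be sketched as a small mental model in plain Python. This is not Iceberg’s implementation: the Snapshot and ToyTable classes, the integer commit times, and the file names are hypothetical, and real snapshots reference manifest files rather than holding file sets directly.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    committed_at: int      # commit time, for example epoch seconds
    data_files: frozenset  # data files visible in this snapshot


class ToyTable:
    def __init__(self):
        self.snapshots = []  # kept in commit order

    def commit(self, committed_at, added=(), removed=()):
        # Each commit creates a new immutable snapshot; older ones stay readable.
        # Assumes commits arrive with increasing committed_at values.
        current = self.snapshots[-1].data_files if self.snapshots else frozenset()
        files = (current - frozenset(removed)) | frozenset(added)
        snapshot = Snapshot(len(self.snapshots) + 1, committed_at, files)
        self.snapshots.append(snapshot)
        return snapshot

    def as_of(self, timestamp):
        # Time travel: pick the latest snapshot committed at or before timestamp.
        times = [s.committed_at for s in self.snapshots]
        idx = bisect_right(times, timestamp)
        if idx == 0:
            raise LookupError("no snapshot at or before the requested time")
        return self.snapshots[idx - 1]


table = ToyTable()
table.commit(100, added={"day1.parquet"})
table.commit(200, added={"day2.parquet"})
# A late correction replaces day 1 after the fact.
table.commit(300, added={"day1_fixed.parquet"}, removed={"day1.parquet"})

backtest_view = table.as_of(250)  # what was knowable at time 250
```

A backtest pinned to time 250 sees the uncorrected day 1 file and never the later restatement, which is the property that guards against lookahead bias. In Spark SQL, Iceberg exposes the same idea through the TIMESTAMP AS OF and VERSION AS OF clauses.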
Seamless integration with familiar tools
Iceberg provides a variety of interfaces that enable seamless integration with the open source tools and AWS services that quant developers and researchers are familiar with.
Iceberg provides a comprehensive SQL interface that allows quant teams to interact with their data using familiar SQL syntax. This SQL interface is compatible with popular query engines and data processing frameworks, such as Spark, Trino, Amazon Athena, and Hive. Quant developers and researchers can use their existing SQL knowledge and tools to query, filter, aggregate, and analyze their data stored in Iceberg tables.
In addition to the primary interface of SQL, Iceberg also provides the DataFrame API, which allows quant teams to programmatically interact with their data using popular distributed data processing frameworks like Spark and Flink as well as thin clients like PyIceberg. Quants can use this API to build more programmatic approaches to accessing and manipulating data, allowing for the implementation of custom logic and the integration of Iceberg with other parts of the AWS ecosystem, such as Amazon EMR.
Although accessing data directly from Amazon S3 is a viable option, Iceberg provides several advantages, such as metadata management, performance optimization using partition pruning, data manipulation, and rich AWS ecosystem integration with services like Athena and Amazon EMR, for a more seamless and feature-rich data processing experience.
Undifferentiated heavy lifting
Data partitioning is one of the major contributing factors to optimizing aggregate throughput to and from Amazon S3, and therefore to the overall price-performance of a high performance computing (HPC) environment.
Quant researchers often face performance bottlenecks and complex data management challenges when dealing with large-scale datasets in Amazon S3. As discussed in Best practices design patterns: optimizing Amazon S3 performance, single prefix performance is limited to 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. Iceberg’s metadata layer and intelligent partitioning strategies automatically optimize data access patterns, reducing the likelihood of I/O throttling and minimizing the need for manual performance tuning. This automation allows quant teams to focus on developing and refining trading strategies rather than troubleshooting data access issues or optimizing storage layouts.
In this section, we discuss situations we discovered while running our experiments at scale and the solutions provided by Iceberg vs. vanilla Parquet when accessing data in Amazon S3.
As we mentioned in the introduction, the nature of quant research is “fail fast”: new ideas have to be quickly evaluated and then either prioritized for a deep dive or dismissed. This makes it impossible to come up with a universal partitioning scheme that works all the time and for all research styles.
When accessing data directly as Parquet files in Amazon S3, without using an open table format like Iceberg, partitioning and throttling issues can arise. Partitioning in this case is determined by the physical layout of files in Amazon S3, and a mismatch between the intended partitioning and the actual file layout can lead to I/O throttling exceptions. Additionally, listing directories in Amazon S3 can also result in throttling exceptions because of the high number of API calls required.
In contrast, Iceberg provides a metadata layer that abstracts away the physical file layout in Amazon S3. Partitioning is defined at the table level, and Iceberg handles the mapping between logical partitions and the underlying file structure. This abstraction helps mitigate partitioning issues and reduces the likelihood of I/O throttling exceptions. Additionally, Iceberg’s metadata caching mechanism minimizes the number of List API calls required, addressing the directory listing throttling issue.
Although both approaches involve direct access to Amazon S3, Iceberg is an open table format that introduces a metadata layer, providing better partitioning management and reducing the risk of throttling exceptions. It doesn’t act as a database itself, but rather as a table format and metadata layer on top of the underlying storage (in this case, Amazon S3).
One of the most effective techniques to address Amazon S3 API quota limits is salting (random hash prefixes), a technique that adds random partition IDs to Amazon S3 paths. This increases the probability of prefixes residing on different physical partitions, helping distribute API requests more evenly. Iceberg supports this functionality out of the box for both data ingestion and reading.
Implementing salting directly in Amazon S3 requires complex custom code to create and use partitioning schemes with random keys in the naming hierarchy. This approach necessitates a custom data catalog and metadata system to map physical paths to logical paths, allowing direct partition access without relying on Amazon S3 List API calls. Without such a system, applications risk exceeding Amazon S3 API quotas when accessing specific partitions.
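As a rough illustration of what such custom salting code involves, the following sketch derives a short hash prefix for each logical path and keeps a catalog that maps logical paths to physical keys. The function name, the bucket count of 16, and the path layout are assumptions. The sketch uses a deterministic hash rather than a truly random ID so that the physical key can be recomputed from the logical path; it does not describe Iceberg’s internal scheme.

```python
import hashlib


def salted_key(logical_path: str, num_buckets: int = 16) -> str:
    """Prepend a hash-derived bucket to spread keys across S3 prefixes."""
    digest = hashlib.sha256(logical_path.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02x}/{logical_path}"


# Hypothetical logical layout: one file per instrument on a single exchange.
paths = [
    f"exchange_code=X1/instrument=I{i}/part-0000.parquet" for i in range(200)
]

# The catalog a custom solution would have to maintain: logical -> physical.
catalog = {path: salted_key(path) for path in paths}

# Distinct top-level prefixes that requests are now spread across.
prefixes = {key.split("/", 1)[0] for key in catalog.values()}
```

Because the catalog (or the hash function itself) resolves a logical path to its physical key, a reader can fetch a partition directly instead of listing the bucket. Given the per-prefix limits quoted earlier, spreading reads over 16 prefixes raises the theoretical ceiling to 16 × 5,500 GET/HEAD requests per second, assuming the load is evenly distributed.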
At petabyte scale, Iceberg’s advantages become clear. It efficiently manages data through the following features:
- Directory caching
- Configurable partitioning strategies (range, bucket)
- Data management functionality (compaction)
- Catalog, metadata, and statistics use for optimal execution plans
These built-in features eliminate the need for custom solutions to manage Amazon S3 API quotas and data organization at scale, reducing development time and maintenance costs while improving query performance and reliability.
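The last item in the list, using metadata and statistics to plan a scan, can be sketched as follows. This is a simplified model under stated assumptions: the FileEntry class, the integer timestamps, and the file names are hypothetical, and real Iceberg manifests carry richer per-column statistics than a single min/max pair.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str
    partition: tuple  # (exchange_code, instrument)
    ts_min: int       # lower bound of the timestamp column in the file
    ts_max: int       # upper bound of the timestamp column in the file


def plan_scan(manifest, exchange_code, instrument, ts_from, ts_to):
    """Return the paths that may contain rows matching the filter."""
    selected = []
    for entry in manifest:
        if entry.partition != (exchange_code, instrument):
            continue  # partition pruning: wrong exchange or instrument
        if entry.ts_max < ts_from or entry.ts_min > ts_to:
            continue  # min/max pruning: time range cannot overlap
        selected.append(entry.path)
    return selected


manifest = [
    FileEntry("f1.parquet", ("X1", "BTC-USD"), 0, 99),
    FileEntry("f2.parquet", ("X1", "BTC-USD"), 100, 199),
    FileEntry("f3.parquet", ("X1", "ETH-USD"), 0, 199),
    FileEntry("f4.parquet", ("X2", "BTC-USD"), 0, 199),
]

to_read = plan_scan(manifest, "X1", "BTC-USD", 120, 150)
# Only f2.parquet survives both pruning steps.
```

The planner decides which files to open by consulting metadata alone, without listing or reading the skipped objects. This is also why sorting on the time column helps range filters: sorted files have narrow, non-overlapping min/max ranges, so more of them can be skipped.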
Performance
We highlighted a lot of the Iceberg functionality that eliminates undifferentiated heavy lifting and improves developer and quant productivity. What about performance?
This section evaluates whether Iceberg’s metadata layer introduces overhead or delivers optimization for quantitative research use cases, comparing it with vanilla Parquet access on Amazon S3. We examine how these approaches impact common quant research queries and workflows.
The key question is whether Iceberg’s metadata layer, designed to optimize vanilla Parquet access on Amazon S3, introduces overhead or delivers the intended optimization for quantitative research use cases. We then discuss overlapping optimization techniques, such as data distribution and sorting. We also discuss why, in the context of quant research, there is no magic partitioning and sorting scheme where one size fits all. Our benchmarks show that Iceberg performs comparably to direct Amazon S3 access, with additional optimizations from its use of metadata and statistics, similar to database indexing.
Vanilla Parquet vs. Iceberg: Amazon S3 read performance
We created four different datasets: two using Iceberg and two with direct Amazon S3 Parquet access, each with both sorted and unsorted write distributions. The goal of this exercise was to compare the performance of direct Amazon S3 Parquet access vs. the Iceberg open table format, taking into account the impact of write distribution patterns when running various queries commonly used in quantitative trading research.
Query 1
We first run a simple count query to get the total number of records in the table. This query helps establish the baseline performance for a straightforward operation. For example, if the table contains tick-level market data for various financial instruments, the count gives an idea of the total number of data points available for analysis.
The following is the code for vanilla Parquet:
Query 2
Our second query is a grouping and counting query to find the number of records for each combination of exchange_code and instrument. This query is commonly used in quantitative trading research to analyze market liquidity and trading activity across different instruments and exchanges.
The following is the code for vanilla Parquet:
The following is the code for Iceberg:
Query 3
Next, we run a distinct query to retrieve the distinct combinations of year, month, and day from the adapterTimestamp_ts_utc column. In quantitative trading research, this query can be helpful for understanding the time range covered by the dataset. Researchers can use this information to identify periods of interest for their analysis, such as specific market events, economic cycles, or seasonal patterns.
The following is the code for vanilla Parquet:
The following is the code for Iceberg:
Query 4
Finally, we run a grouping and counting query with a date range filter on the adapterTimestamp_ts_utc column. This query is similar to Query 2 but focuses on a specific time period. You could use this query to analyze market activity or liquidity during specific time periods, such as periods of high volatility, market crashes, or economic events. Researchers can use this information to identify potential trading opportunities or investigate the impact of these events on market dynamics.
The following is the code for vanilla Parquet:
The following is the code for Iceberg. Because Iceberg has a metadata layer, the row count can be fetched from metadata:
Test results
To evaluate the performance and cost benefits of using Iceberg for our quant research data lake, we created four different datasets: two with Iceberg tables and two with direct Amazon S3 Parquet access, each using both sorted and unsorted write distributions. We first ran AWS Glue write jobs to create the Iceberg tables and then mirrored the same write processes for the Amazon S3 Parquet datasets. For the unsorted datasets, we partitioned the data by exchange and instrument, and for the sorted datasets, we added a sort key on the time column.
Next, we ran a series of queries commonly used in quantitative trading research, including simple count queries, grouping and counting, distinct value queries, and queries with date range filters. Our benchmarking process involved reading data from Amazon S3, performing various transformations and joins, and writing the processed data back to Amazon S3 as Parquet files.
By comparing runtimes and costs across different data formats and write distributions, we quantified the benefits of Iceberg’s optimized data organization, metadata management, and efficient Amazon S3 data handling. The results showed that Iceberg not only enhanced query performance without introducing significant overhead, but also reduced the likelihood of task failures, reruns, and throttling issues, leading to more stable and predictable job execution, particularly with large datasets stored in Amazon S3.
AWS Glue write jobs
In the following table, we compare the performance and the cost implications of using Iceberg vs. vanilla Parquet access on Amazon S3, taking into account the following use cases:
- Iceberg table (unsorted) – We created an Iceberg table partitioned by exchange_code and instrument. This means that the data was physically partitioned in Amazon S3 based on the unique combinations of exchange_code and instrument values. Partitioning the data in this way can improve query performance, because Iceberg can prune out partitions that aren’t relevant to a particular query, reducing the amount of data that needs to be scanned. The data was not sorted on any column in this case, which is the default behavior.
- Vanilla Parquet (unsorted) – For this use case, we wrote the data directly as Parquet files to Amazon S3, without using Iceberg. We repartitioned the data by the exchange_code and instrument columns using standard hash partitioning before writing it out. Repartitioning was necessary to avoid potential throttling issues when reading the data later, because accessing data directly from Amazon S3 without intelligent partitioning can lead to too many requests hitting the same S3 prefix. As with the Iceberg table, the data was not sorted on any column in this case. To make the comparison fair, we used the exact repartition count that Iceberg uses.
- Iceberg table (sorted) – We created another Iceberg table, this time partitioned by exchange_code and instrument. Additionally, we sorted the data in this table on the adapterTimestamp_ts_utc column. Sorting the data can improve query performance for certain types of queries, such as those that involve range filters or ordered outputs. Iceberg handles the sorting and partitioning of the data transparently to the user.
- Vanilla Parquet (sorted) – For this use case, we again wrote the data directly as Parquet files to Amazon S3, without using Iceberg. We repartitioned the data by range on the exchange_code, instrument, and adapterTimestamp_ts_utc columns before writing it out, using standard range partitioning with a partition count of 1996, because this was what Iceberg was using according to the Spark UI. Repartitioning on the time column (adapterTimestamp_ts_utc) was necessary to achieve a sorted write distribution, because Parquet files are sorted within each partition. This sorted write distribution can improve query performance for certain types of queries, similar to the sorted Iceberg table.
Write Distribution Pattern | Iceberg Table (Unsorted) | Vanilla Parquet (Unsorted) | Iceberg Table (Sorted) | Vanilla Parquet (Sorted) |
DPU Hours | 899.46639 | 915.70222 | 1402 | 1365 |
Number of S3 Objects | 7444 | 7288 | 9283 | 9283 |
Size of S3 Parquet Objects | 567.7 GB | 629.8 GB | 525.6 GB | 627.1 GB |
Runtime | 1h 51m 40s | 1h 53m 29s | 2h 52m 7s | 2h 47m 36s |
AWS Glue read jobs
For the AWS Glue read jobs, we ran a series of queries commonly used in quantitative trading research, such as simple counts, grouping and counting, distinct value queries, and queries with date range filters. We compared the performance of these queries between the Iceberg tables and the vanilla Parquet files read from Amazon S3. In the following table, you can see two AWS Glue jobs that show the performance and cost implications of the access patterns described earlier.
Read Queries / Runtime in Seconds | Iceberg Table | Vanilla Parquet |
COUNT(1) on unsorted | 35.76s | 74.62s |
GROUP BY and ORDER BY on unsorted | 34.29s | 67.99s |
DISTINCT and SELECT on unsorted | 51.40s | 82.95s |
FILTER and GROUP BY and ORDER BY on unsorted | 25.84s | 49.05s |
COUNT(1) on sorted | 15.29s | 24.25s |
GROUP BY and ORDER BY on sorted | 15.88s | 28.73s |
DISTINCT and SELECT on sorted | 30.85s | 42.06s |
FILTER and GROUP BY and ORDER BY on sorted | 15.51s | 31.51s |
AWS Glue DPU hours | 45.98 | 67.97 |
Test results insights
These test results offered the following insights:
- Accelerated query performance – Iceberg improved read operations by up to 52% for unsorted data and 51% for sorted data. This speed boost enables quant researchers to analyze larger datasets and test trading strategies more rapidly. In quantitative finance, where speed is crucial, this performance gain allows teams to uncover market insights faster, potentially gaining a competitive edge.
- Reduced operational costs – For read-intensive workloads, Iceberg reduced DPU hours by 32.4% and achieved a 10–16% reduction in Amazon S3 storage. These efficiency gains translate to cost savings in data-intensive quant operations. With Iceberg, firms can run more comprehensive analyses within the same budget or reallocate resources to other high-value activities, optimizing their research capabilities.
- Enhanced data management and scalability – Iceberg showed comparable write performance for unsorted data (899.47 DPU hours vs. 915.70 for vanilla Parquet) and maintained consistent object counts across unsorted and sorted scenarios (7,444 and 9,283, respectively). This consistency leads to more reliable and predictable job execution. For quant teams dealing with large-scale datasets, this reduces time spent on troubleshooting data infrastructure issues and increases focus on developing trading strategies.
- Improved productivity – Iceberg outperformed vanilla Parquet access across various query types. For unsorted data, simple counts were 52.1% faster, grouping and ordering operations improved by 49.6%, and filtered queries were 47.3% faster. This performance enhancement boosts productivity in quant research workflows. It reduces query completion times, allowing quant developers and researchers to spend more time on model development and market analysis, leading to faster iteration on trading strategies.
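The percentages quoted in these insights can be reproduced from the two results tables. The following sketch shows the arithmetic; the helper name pct_reduction is ours, and all inputs are copied from the tables above.

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` versus `baseline`, in percent."""
    return round((baseline - improved) / baseline * 100, 1)


# (Iceberg, vanilla Parquet) runtimes in seconds, from the read jobs table.
read_runtimes = {
    "COUNT(1) on unsorted": (35.76, 74.62),
    "GROUP BY and ORDER BY on unsorted": (34.29, 67.99),
    "FILTER and GROUP BY and ORDER BY on unsorted": (25.84, 49.05),
    "FILTER and GROUP BY and ORDER BY on sorted": (15.51, 31.51),
}

speedups = {
    query: pct_reduction(parquet, iceberg)
    for query, (iceberg, parquet) in read_runtimes.items()
}
# speedups: 52.1, 49.6, 47.3, and 50.8 percent, in the order listed above

dpu_saving = pct_reduction(67.97, 45.98)         # read jobs, DPU hours: 32.4
storage_unsorted = pct_reduction(629.8, 567.7)   # GB, unsorted write: 9.9
storage_sorted = pct_reduction(627.1, 525.6)     # GB, sorted write: 16.2
```

The headline figures of up to 52% (unsorted) and 51% (sorted) correspond to the COUNT(1) query on unsorted data and the filtered query on sorted data, and the 10–16% storage range spans the unsorted and sorted write results.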
Conclusion
Quant research platforms often avoid adopting new data management solutions like Iceberg, fearing performance penalties and increased costs. Our analysis counters these concerns, demonstrating that Iceberg not only matches or improves on the performance of direct Amazon S3 access, but also provides substantial additional benefits.
Our tests show that Iceberg significantly accelerates query performance, with improvements of up to 52% for unsorted data and 51% for sorted data. This speed boost enables quant researchers to analyze larger datasets and test trading strategies more rapidly, potentially uncovering valuable market insights faster.
Iceberg streamlines data management tasks, allowing researchers to focus on strategy development. Its robust insert, update, and delete capabilities, combined with time travel features, enable straightforward management of complex datasets, improving backtest accuracy and facilitating rapid strategy iteration.
Iceberg’s intelligent handling of partitioning and Amazon S3 API quota issues eliminates undifferentiated heavy lifting, freeing quant teams from low-level data engineering tasks. This automation redirects effort to high-value activities such as model development and market analysis. Moreover, our tests show that for read-intensive workloads, Iceberg reduced DPU hours by 32.4% and achieved a 10–16% reduction in Amazon S3 storage, leading to significant cost savings.
Flexibility is a key advantage of Iceberg. Its various interfaces, including SQL, DataFrames, and programmatic APIs, integrate seamlessly with existing quant research workflows, accommodating diverse analysis needs and coding preferences.
By adopting Iceberg, quant research teams gain both performance improvements and powerful data management tools. This combination creates an environment where researchers can push analytical boundaries, maintain high data integrity standards, and focus on generating valuable insights. The improved productivity and reduced operational costs enable quant teams to allocate resources more effectively, ultimately leading to a more competitive edge in quantitative finance.
About the Authors
Guy Bachar is a Senior Solutions Architect at AWS based in New York. He specializes in assisting capital markets customers with their cloud transformation journeys. His expertise encompasses identity management, security, and unified communication.
Sercan Karaoglu is a Senior Solutions Architect, specialized in capital markets. He is a former data engineer and is passionate about quantitative investment research.
Boris Litvin is a Principal Solutions Architect at AWS. His job is in financial services industry innovation. Boris joined AWS from the industry, most recently Goldman Sachs, where he held a variety of quantitative roles across equity, FX, and interest rates, and was CEO and Founder of a quantitative trading FinTech startup.
Salim Tutuncu is a Senior Partner Solutions Architect Specialist on Data & AI, based in Dubai with a focus on EMEA. With a background in the technology sector that spans roles as a data engineer, data scientist, and machine learning engineer, Salim has built formidable expertise in navigating the complex landscape of data and artificial intelligence. His current role involves working closely with partners to develop long-term, profitable businesses using the AWS platform, particularly in data and AI use cases.
Alex Tarasov is a Senior Solutions Architect working with Fintech startup customers, helping them to design and run their data workloads on AWS. He is a former data engineer and is passionate about all things data and machine learning.
Jiwan Panjiker is a Solutions Architect at Amazon Web Services, based in the Greater New York City area. He works with AWS enterprise customers, helping them in their cloud journey to solve complex business problems by making effective use of AWS services. Outside of work, he likes spending time with his friends and family, going for long drives, and exploring local cuisine.