How Open Universities Australia modernized their data platform and significantly reduced their ETL costs with AWS Cloud Development Kit and AWS Step Functions


This is a guest post co-authored by Michael Davies from Open Universities Australia.

At Open Universities Australia (OUA), we empower students to explore a vast array of degrees from renowned Australian universities, all delivered through online learning. We offer students alternative pathways to achieve their educational aspirations, providing them with the flexibility and accessibility to reach their academic goals. Since our founding in 1993, we have supported over 500,000 students to achieve their goals by providing pathways to over 2,600 subjects at 25 universities across Australia.

As a not-for-profit organization, cost is a crucial consideration for OUA. While reviewing our contract for the third-party tool we had been using for our extract, transform, and load (ETL) pipelines, we realized that we could replicate much of the same functionality using Amazon Web Services (AWS) services such as AWS Glue, Amazon AppFlow, and AWS Step Functions. We also recognized that we could consolidate our source code (much of which was stored in the ETL tool itself) into a code repository that could be deployed using the AWS Cloud Development Kit (AWS CDK). By doing so, we had an opportunity not only to reduce costs but also to enhance the visibility and maintainability of our data pipelines.

In this post, we show you how we used AWS services to replace our existing third-party ETL tool, improving the team's productivity and producing a significant reduction in our ETL operational costs.

Our approach

The migration initiative consisted of two main parts: building the new architecture and migrating data pipelines from the existing tool to the new architecture. Often, we would work on both in parallel, testing one component of the architecture while developing another at the same time.

From early in our migration journey, we began to define a few guiding principles that we would apply throughout the development process. These were:

  • Simple and modular – Use simple, reusable design patterns with as few moving parts as possible. Structure the code base to prioritize ease of use for developers.
  • Cost-effective – Use resources in an efficient, cost-effective manner. Aim to minimize situations where resources are running idly while waiting for other processes to complete.
  • Business continuity – As much as possible, make use of existing code rather than reinventing the wheel. Roll out updates in stages to minimize potential disruption to existing business processes.

Architecture overview

The following diagram (Diagram 1) shows the high-level architecture for the solution.

Diagram 1: Overall architecture of the solution, using AWS Step Functions, Amazon Redshift, and Amazon S3

The following AWS services were used to shape our new ETL architecture:

  • Amazon Redshift – A fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift served as our central data repository, where we would store data, apply transformations, and make data available for use in analytics and business intelligence (BI). Note: The provisioned cluster itself was deployed separately from the ETL architecture and remained unchanged throughout the migration process.
  • AWS Cloud Development Kit (AWS CDK) – The AWS CDK is an open-source software development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. Our infrastructure was defined as code using the AWS CDK. As a result, we simplified the way we defined the resources we wanted to deploy while using our preferred coding language for development.
  • AWS Step Functions – With AWS Step Functions, you can create workflows, also called state machines, to build distributed applications, automate processes, orchestrate microservices, and create data and machine learning pipelines. AWS Step Functions can call over 200 AWS services, including AWS Glue, AWS Lambda, and Amazon Redshift. We used AWS Step Functions state machines to define, orchestrate, and execute our data pipelines.
  • Amazon EventBridge – We used Amazon EventBridge, the serverless event bus service, to define the event-based rules and schedules that would trigger our AWS Step Functions state machines.
  • AWS Glue – A data integration service, AWS Glue consolidates major data integration capabilities into a single service. These include data discovery, modern ETL, cleansing, transforming, and centralized cataloging. It's also serverless, which means there's no infrastructure to manage, and it includes the ability to run Python scripts. We used it for executing long-running scripts, such as for ingesting data from an external API.
  • AWS Lambda – AWS Lambda is a highly scalable, serverless compute service. We used it for executing simple scripts, such as for parsing a single text file.
  • Amazon AppFlow – Amazon AppFlow enables simple integration with software as a service (SaaS) applications. We used it to define flows that would periodically load data from selected operational systems into our data warehouse.
  • Amazon Simple Storage Service (Amazon S3) – An object storage service offering industry-leading scalability, data availability, security, and performance. Amazon S3 served as our staging area, where we would store raw data prior to loading it into other services such as Amazon Redshift. We also used it as a repository for storing code that could be retrieved and used by other services.

Where practical, we made use of the file structure of our code base for defining resources. We set up our AWS CDK code to refer to the contents of a specific directory and define a resource (for example, an AWS Step Functions state machine or an AWS Glue job) for each file it found in that directory. We also made use of configuration files so we could customize the attributes of specific resources as required.
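The following is a minimal sketch, in Python, of how this directory-driven approach can look with the AWS CDK. The directory layout, file naming convention, and configuration file structure shown here are illustrative assumptions rather than our exact implementation.

```python
# Simplified sketch: define one Step Functions state machine per definition
# file found in a directory. Directory layout, file naming convention, and
# the config.json structure are illustrative assumptions.
import json
from pathlib import Path

from aws_cdk import Stack
from aws_cdk import aws_stepfunctions as sfn
from constructs import Construct


class PipelineStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        definitions_dir = Path(__file__).parent / "state_machines"
        config = json.loads((Path(__file__).parent / "config.json").read_text())

        for definition_file in sorted(definitions_dir.glob("*.asl.json")):
            pipeline_name = definition_file.stem.replace(".asl", "")
            overrides = config.get(pipeline_name, {})

            # One state machine per definition file; per-pipeline attributes
            # (such as a custom name) come from the shared config file.
            sfn.StateMachine(
                self,
                f"{pipeline_name}StateMachine",
                state_machine_name=overrides.get("name", pipeline_name),
                definition_body=sfn.DefinitionBody.from_file(str(definition_file)),
            )
```

With a structure along these lines, adding a new pipeline becomes a matter of dropping a new definition file into the directory and, if needed, an entry in the configuration file.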

Details on specific patterns

In the architecture shown in Diagram 1, we showed several flows by which data could be ingested into or unloaded from our Amazon Redshift data warehouse. In this section, we highlight four specific patterns in more detail that were used in the final solution.

Pattern 1: Data transformation, load, and unload

Several of our data pipelines included significant data transformation steps, which were primarily performed through SQL statements executed by Amazon Redshift. Others required ingestion or unloading of data from the data warehouse, which could be performed efficiently using COPY or UNLOAD statements executed by Amazon Redshift.

In line with our intention of using resources efficiently, we sought to avoid running these statements from within the context of an AWS Glue job or AWS Lambda function, because those processes would remain idle while waiting for the SQL statement to complete. Instead, we opted for an approach where SQL execution tasks would be orchestrated by an AWS Step Functions state machine, which would send the statements to Amazon Redshift and periodically check their progress before marking them as either successful or failed. Diagram 2 shows this workflow, and a sketch of the polling approach follows it.


Diagram 2: Data transformation, load, and unload pattern using AWS Lambda and Amazon Redshift within an AWS Step Functions state machine
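The following is a simplified sketch of the two AWS Lambda functions that such a state machine can call: one submits the SQL statement through the Amazon Redshift Data API and returns immediately, and the other reports the statement's status so that a Wait/Choice loop in the state machine can poll until it finishes. The cluster, database, and secret identifiers are passed in by the state machine and are shown here as illustrative assumptions.

```python
# Simplified sketch of the two Lambda handlers the state machine calls:
# one submits a SQL statement through the Redshift Data API, the other
# checks its status so a Wait/Choice loop can poll until completion.
import boto3

redshift_data = boto3.client("redshift-data")


def submit_statement(event, context):
    """Send the SQL statement to Amazon Redshift and return immediately."""
    response = redshift_data.execute_statement(
        ClusterIdentifier=event["cluster_id"],
        Database=event["database"],
        SecretArn=event["secret_arn"],
        Sql=event["sql"],
    )
    return {"statement_id": response["Id"]}


def check_statement(event, context):
    """Report whether the previously submitted statement has finished."""
    response = redshift_data.describe_statement(Id=event["statement_id"])
    status = response["Status"]  # SUBMITTED, PICKED, STARTED, FINISHED, FAILED, ABORTED
    if status == "FAILED":
        raise RuntimeError(f"Statement failed: {response.get('Error')}")
    return {"statement_id": event["statement_id"], "status": status}
```

Because the state machine only waits and polls between these short Lambda invocations, no compute sits idle while Amazon Redshift runs the statement.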

Pattern 2: Data replication using AWS Glue

In cases where we needed to replicate data from a third-party source, we used AWS Glue to run a script that would query the relevant API, parse the response, and store the relevant data in Amazon S3. From here, we used Amazon Redshift to ingest the data using a COPY statement. Diagram 3 shows this workflow, and a sketch of the Glue script follows the note below.


Diagram 3: Copying from an external API to Amazon Redshift with AWS Glue

Note: Another option for this step would be to use Amazon Redshift auto-copy, but this wasn't available at the time of development.
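The following is a simplified sketch of the kind of script such an AWS Glue Python shell job can run. The API endpoint, pagination scheme, bucket, and key names are hypothetical placeholders.

```python
# Simplified sketch of a Glue Python shell job that pulls records from an
# external API and stages them in Amazon S3 as newline-delimited JSON.
# The endpoint, bucket, and key names are illustrative assumptions.
import json

import boto3
import requests

s3 = boto3.client("s3")

API_URL = "https://api.example.com/v1/enrolments"  # hypothetical endpoint
STAGING_BUCKET = "example-etl-staging"             # hypothetical bucket
STAGING_KEY = "external_api/enrolments.json"


def main():
    records = []
    url = API_URL
    while url:  # follow pagination until the API reports no further pages
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        payload = response.json()
        records.extend(payload["results"])
        url = payload.get("next")

    body = "\n".join(json.dumps(record) for record in records)
    s3.put_object(Bucket=STAGING_BUCKET, Key=STAGING_KEY, Body=body.encode("utf-8"))


if __name__ == "__main__":
    main()
```

Once the file is staged, a COPY statement along the lines of `COPY staging.enrolments FROM 's3://example-etl-staging/external_api/enrolments.json' IAM_ROLE '<role-arn>' FORMAT AS JSON 'auto';` loads it into Amazon Redshift, executed through the same Step Functions pattern described in Pattern 1.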

Pattern 3: Data replication using Amazon AppFlow

For certain applications, we were able to use Amazon AppFlow flows in place of AWS Glue jobs. As a result, we could abstract some of the complexity of querying external APIs directly. We configured our Amazon AppFlow flows to store the output data in Amazon S3, then used an EventBridge rule based on an End Flow Run Report event (an event that is published when a flow run is complete) to trigger a load into Amazon Redshift using a COPY statement. Diagram 4 shows this workflow, and a sketch of the rule follows it.

By using Amazon S3 as an intermediate data store, we gave ourselves greater control over how the data was processed when it was loaded into Amazon Redshift, compared with loading the data directly into the data warehouse using Amazon AppFlow.


Diagram 4: Using Amazon AppFlow to integrate external data into Amazon S3 and copy it to Amazon Redshift
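A sketch of how such a rule can be defined with the AWS CDK follows. Amazon AppFlow publishes the event to EventBridge with the source `aws.appflow` and the detail type `AppFlow End Flow Run Report`; the flow name, state machine name, and detail filter shown here are illustrative assumptions.

```python
# Simplified sketch of the EventBridge rule that reacts to an AppFlow
# "End Flow Run Report" event and starts the state machine that runs the
# COPY into Amazon Redshift. Flow and state machine names are illustrative.
from aws_cdk import Stack
from aws_cdk import aws_events as events
from aws_cdk import aws_events_targets as targets
from aws_cdk import aws_stepfunctions as sfn
from constructs import Construct


class AppFlowTriggerStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Hypothetical state machine that runs the COPY statement.
        load_state_machine = sfn.StateMachine.from_state_machine_name(
            self, "CopyLoad", "copy-appflow-output-to-redshift"
        )

        rule = events.Rule(
            self,
            "AppFlowRunComplete",
            event_pattern=events.EventPattern(
                source=["aws.appflow"],
                detail_type=["AppFlow End Flow Run Report"],
                detail={"flow-name": ["salesforce-accounts-flow"]},  # hypothetical flow
            ),
        )
        rule.add_target(targets.SfnStateMachine(load_state_machine))
```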

Pattern 4: Reverse ETL

Although most of our workflows involve data being brought into the data warehouse from external sources, in some cases we needed the data to be exported to external systems instead. This way, we could run SQL queries with complex logic drawing on multiple data sources and use this logic to support operational requirements, such as identifying which groups of students should receive specific communications.

In this flow, shown in Diagram 5, we start by running an UNLOAD statement in Amazon Redshift to unload the relevant data to files in Amazon S3. From here, each file is processed by an AWS Lambda function, which performs any necessary transformations and sends the data to the external application through one or more API calls. A sketch of such a function follows the diagram.


Diagram 5: Reverse ETL workflow, sending data back out to external data sources
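The following is a simplified sketch of such a Lambda function, assuming the data was unloaded as JSON lines and that the bucket and key arrive in the function's input; the external endpoint and payload fields are hypothetical.

```python
# Simplified sketch of the Lambda function that processes one unloaded
# file and forwards its rows to an external system. The bucket and key
# come from the function input; the API endpoint is an illustrative assumption.
import json

import boto3
import urllib3

s3 = boto3.client("s3")
http = urllib3.PoolManager()

EXTERNAL_API_URL = "https://api.example-crm.com/v1/contacts"  # hypothetical


def handler(event, context):
    """Read one UNLOADed JSON-lines file from S3 and push each record out."""
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    lines = obj["Body"].read().decode("utf-8").splitlines()

    for line in lines:
        record = json.loads(line)
        # Apply any lightweight transformation before sending.
        payload = {"email": record["email"], "segment": record["segment"]}
        response = http.request(
            "POST",
            EXTERNAL_API_URL,
            body=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        if response.status >= 400:
            raise RuntimeError(f"External API returned {response.status}")

    return {"records_sent": len(lines)}
```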

Results

The re-architecture and migration process took 5 months to complete, from the initial concept to the successful decommissioning of the previous third-party tool. Most of the architectural effort was completed by a single full-time employee, with others on the team primarily assisting with the migration of pipelines to the new architecture.

We achieved significant cost reductions, with final expenses on AWS native services representing only a small percentage of projected costs compared to continuing with the third-party ETL tool. Moving to a code-based approach also gave us greater visibility of our pipelines and made the process of maintaining them quicker and easier. Overall, the transition was seamless for our end users, who were able to view the same data and dashboards both during and after the migration, with minimal disruption along the way.

Conclusion

By using the scalability and cost-effectiveness of AWS services, we were able to optimize our data pipelines, reduce our operational costs, and improve our agility.

Pete Allen, an analytics engineer from Open Universities Australia, says, "Modernizing our data architecture with AWS has been transformative. Transitioning from an external platform to an in-house, code-based analytics stack has vastly improved our scalability, flexibility, and performance. With AWS, we can now process and analyze data with much faster turnaround, lower costs, and higher availability, enabling rapid development and deployment of data solutions, leading to deeper insights and better business decisions."

About the Authors

Michael Davies is a Data Engineer at OUA. He has extensive experience within the education industry, with a particular focus on building robust and efficient data architecture and pipelines.

Emma Arrigo is a Solutions Architect at AWS, specializing in education customers across Australia. She focuses on leveraging cloud technology and machine learning to address complex business challenges in the education sector. Emma's passion for data extends beyond her professional life, as evidenced by her dog named Data.
