Batch knowledge ingestion into Amazon OpenSearch Service utilizing AWS Glue

Organizations consistently work to course of and analyze huge volumes of knowledge to derive actionable insights. Efficient knowledge ingestion and search capabilities have turn into important to be used circumstances like log analytics, utility search, and enterprise search. These use circumstances demand a sturdy pipeline that may deal with excessive knowledge volumes and allow environment friendly knowledge exploration.

Apache Spark, an open supply powerhouse for large-scale knowledge processing, is well known for its velocity, scalability, and ease of use. Its potential to course of and remodel huge datasets has made it an indispensable instrument in trendy knowledge engineering. Amazon OpenSearch Service—a community-driven search and analytics answer—empowers organizations to go looking, mixture, visualize, and analyze knowledge seamlessly. Collectively, Spark and OpenSearch Service provide a compelling answer for constructing highly effective knowledge pipelines. Nonetheless, ingesting knowledge from Spark into OpenSearch Service can current challenges, particularly with various knowledge sources.

This publish showcases how one can use Spark on AWS Glue to seamlessly ingest knowledge into OpenSearch Service. We cowl batch ingestion strategies, share sensible examples, and talk about finest practices that will help you construct optimized and scalable knowledge pipelines on AWS.

Overview of answer

AWS Glue is a serverless knowledge integration service that simplifies knowledge preparation and integration duties for analytics, machine studying, and utility improvement. On this publish, we deal with batch knowledge ingestion into OpenSearch Service utilizing Spark on AWS Glue.

AWS Glue gives a number of integration choices with OpenSearch Service utilizing numerous open supply and AWS managed libraries, together with:

Within the following sections, we discover every integration technique intimately, guiding you thru the setup and implementation. As we progress, we incrementally construct the structure diagram proven within the following determine, offering a transparent path for creating strong knowledge pipelines on AWS. Every implementation is unbiased of the others. We selected to showcase them individually, as a result of in a real-world situation, solely one of many three integration strategies is probably going for use.

Image showing the high level architecture diagram

You could find the code base within the accompanying GitHub repo. Within the following sections, we stroll by means of the steps to implement the answer.

Stipulations

Earlier than you deploy this answer, be sure the next conditions are in place:

Clone the repository to your native machine

Clone the repository to your native machine and set the BLOG_DIR atmosphere variable. All of the relative paths assume BLOG_DIR is about to the repository location in your machine. If BLOG_DIR isn’t getting used, regulate the trail accordingly.

git clone git@github.com:aws-samples/opensearch-glue-integration-patterns.git
cd opensearch-glue-integration-patterns
export BLOG_DIR=$(pwd)

Deploy the AWS CloudFormation template to create the required infrastructure

The primary focus of this publish is to show how one can use the talked about libraries in Spark on AWS Glue to ingest knowledge into OpenSearch Service. Although we middle on this core subject, a number of key AWS elements will should be pre-provisioned for the mixing examples, resembling a Amazon Digital Personal Cloud (Amazon VPC), a number of Subnets, an AWS Key Administration Service (AWS KMS) key, an Amazon Easy Storage Service (Amazon S3) bucket, an AWS Glue function, and an OpenSearch Service cluster with domains for OpenSearch Service and Elasticsearch. To simplify the setup, we’ve automated the provisioning of this core infrastructure utilizing the cloudformation/opensearch-glue-infrastructure.yaml AWS CloudFormation template.

Run the next instructions

The CloudFormation template will deploy the required networking elements (resembling VPC and subnets), Amazon CloudWatch logging, AWS Glue function, and OpenSearch Service and Elasticsearch domains required to implement the proposed structure. Use a robust password (8–128 characters, three of that are lowercase, uppercase, numbers, or particular characters, and no /, “, or areas) and cling to your group’s safety requirements for ESMasterUserPassword and OSMasterUserPassword within the following command:

cd ${BLOG_DIR}/cloudformation/
aws cloudformation deploy 
--template-file ${BLOG_DIR}/cloudformation/opensearch-glue-infrastructure.yaml 
--stack-name GlueOpenSearchStack 
--capabilities CAPABILITY_NAMED_IAM 
--region <AWS_REGION> 
--parameter-overrides 
ESMasterUserPassword=<ES_MASTER_USER_PASSWORD> 
OSMasterUserPassword=<OS_MASTER_USER_PASSWORD>

You must see successful message resembling "Efficiently created/up to date stack – GlueOpenSearchStack" after the assets have been provisioned efficiently. Provisioning this CloudFormation stack sometimes takes roughly half-hour to finish.

On the AWS CloudFormation console, find the GlueOpenSearchStack stack, and ensure that its standing is CREATE_COMPLETE.

Image showing the "CREATE_COMPLETE" status of cloudformation template

You possibly can evaluation the deployed assets on the Sources tab, as proven within the following screenshot.The screenshot doesn’t show all of the created assets.

Image showing the "Resources" tab of cloudformation template

Extra setup steps

On this part, we accumulate important data, together with the S3 bucket identify and the OpenSearch Service and Elasticsearch area endpoints. These particulars are required for executing the code in subsequent sections.

Seize the main points of the provisioned assets

Use the next AWS CLI command to extract and save the output values from the CloudFormation stack to a file named GlueOpenSearchStack_outputs.txt. We check with the values on this file in upcoming steps.

aws cloudformation describe-stacks 
--stack-name GlueOpenSearchStack 
--query 'sort_by(Stacks[0].Outputs[], &OutputKey)[].{Key:OutputKey,Worth:OutputValue}' 
--output desk 
--no-cli-pager 
--region <AWS_REGION> > ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Obtain NY Inexperienced Taxi December 2022 dataset and replica to S3 bucket

The aim of this publish is to show the technical implementation of ingesting knowledge into OpenSearch Service utilizing AWS Glue. Understanding the dataset itself isn’t important, except for its knowledge format, which we talk about in AWS Glue notebooks in later sections. To be taught extra in regards to the dataset, you’ll find further data on the NYC Taxi and Limousine Fee web site.

We particularly request that you simply obtain the December 2022 dataset, as a result of we have now examined the answer utilizing this explicit dataset:

S3_BUCKET_NAME=$(awk -F '|' '$2 ~ /S3Bucket/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt)
mkdir -p ${BLOG_DIR}/datasets && cd ${BLOG_DIR}/datasets
curl -O https://d37ci6vzurychx.cloudfront.web/trip-data/green_tripdata_2022-12.parquet
aws s3 cp green_tripdata_2022-12.parquet s3://${S3_BUCKET_NAME}/datasets/green_tripdata_2022-12.parquet

Obtain the required JARs from the Maven repository and replica to S3 bucket

We’ve specified a selected JAR file model to make sure steady deployment expertise. Nonetheless, we advocate adhering to your group’s safety finest practices and reviewing any identified vulnerabilities within the model of the JAR information earlier than deployment. AWS doesn’t assure the safety of any open-source code used right here. Moreover, please confirm the downloaded JAR file’s checksum in opposition to the printed worth to substantiate its integrity and authenticity.

mkdir -p ${BLOG_DIR}/jars && cd ${BLOG_DIR}/jars
# OpenSearch Service jar
curl -O https://repo1.maven.org/maven2/org/opensearch/shopper/opensearch-spark-30_2.12/1.0.1/opensearch-spark-30_2.12-1.0.1.jar
aws s3 cp opensearch-spark-30_2.12-1.0.1.jar s3://${S3_BUCKET_NAME}/jars/opensearch-spark-30_2.12-1.0.1.jar
# Elasticsearch jar
curl -O https://repo1.maven.org/maven2/org/elasticsearch/elasticsearch-spark-30_2.12/7.17.23/elasticsearch-spark-30_2.12-7.17.23.jar
aws s3 cp elasticsearch-spark-30_2.12-7.17.23.jar s3://${S3_BUCKET_NAME}/jars/elasticsearch-spark-30_2.12-7.17.23.jar

Within the following sections, we implement the person knowledge ingestion strategies as outlined within the structure diagram.

Ingest knowledge into OpenSearch Service utilizing the OpenSearch Spark library

On this part, we load an OpenSearch Service index utilizing Spark and the OpenSearch Spark library. We show this implementation by utilizing AWS Glue notebooks, using primary authentication utilizing consumer identify and password.

To show the ingestion mechanisms, we have now offered the Spark-and-OpenSearch-Code-Steps.ipynb pocket book with detailed directions. Observe the steps on this part along with the directions within the pocket book.

Arrange the AWS Glue Studio pocket book

Full the next steps:

On the AWS Glue console, select ETL jobs within the navigation pane.
Underneath Create job, select Pocket book.

Image showing AWS console page for AWS Glue to open notebook

Add the pocket book file positioned at ${BLOG_DIR}/glue_jobs/Spark-and-OpenSearch-Code-Steps.ipynb.
For IAM function, select the AWS Glue job IAM function that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a reputation for the pocket book (for instance, Spark-and-OpenSearch-Code-Steps) and select Save.

Image showing AWS Glue OpenSearch Notebook

Change the placeholder values within the pocket book

Full the next steps to replace the placeholders within the pocket book:

In Step 1 within the pocket book, change the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection identify. You may get the identify of the interactive session by executing the next command:

cd ${BLOG_DIR}
awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 within the pocket book, change the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket identify. You may get the identify of the S3 bucket by executing the next command:

awk -F '|' '$2 ~ /S3Bucket/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 within the pocket book, change <OPEN-SEARCH-DOMAIN-WITHOUT-HTTPS> with the OpenSearch Service area identify. You may get the area identify by executing the next command:

awk -F '|' '$2 ~ /OpenSearchDomainEndpoint/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the pocket book

Run every cell of the pocket book to load knowledge into the OpenSearch Service area and skim it again to confirm the profitable load. Consult with the detailed directions inside the pocket book for execution-specific steering.

Spark write modes (append vs. overwrite)

It’s endorsed to write down knowledge incrementally into OpenSearch Service indexes utilizing the append mode, as demonstrated in Step 8 within the pocket book. Nonetheless, in sure circumstances, you could must refresh the whole dataset within the OpenSearch Service index. In these situations, you should utilize the overwrite mode, although it’s not suggested for big indexes. When utilizing overwrite mode, the Spark library deletes rows from the OpenSearch Service index one after the other after which rewrites the information, which will be inefficient for big datasets. To keep away from this, you possibly can implement a preprocessing step in Spark to determine insertions and updates, after which write the information into OpenSearch Service utilizing append mode.

Ingest knowledge into Elasticsearch utilizing the Elasticsearch Hadoop library

On this part, we load an Elasticsearch index utilizing Spark and the Elasticsearch Hadoop Library. We show this implementation by utilizing AWS Glue because the engine for Spark.

Arrange the AWS Glue Studio pocket book

Full the next steps to arrange the pocket book:

On the AWS Glue console, select ETL jobs within the navigation pane.
Underneath Create job, select Pocket book.

Image showing AWS console page for AWS Glue to open notebook

Add the pocket book file positioned at ${BLOG_DIR}/glue_jobs/Spark-and-Elasticsearch-Code-Steps.ipynb.
For IAM function, select the AWS Glue job IAM function that begins with GlueOpenSearchStack-GlueRole-*.

Image showing AWS console page for AWS Glue to open notebook

Enter a reputation for the pocket book (for instance, Spark-and-ElasticSearch-Code-Steps) and select Save.

Image showing AWS Glue Elasticsearch Notebook

Change the placeholder values within the pocket book

Full the next steps:

In Step 1 within the pocket book, change the placeholder <GLUE-INTERACTIVE-SESSION-CONNECTION-NAME> with the AWS Glue interactive session connection identify. You may get the identify of the interactive session by executing the next command:

awk -F '|' '$2 ~ /GlueInteractiveSessionConnectionName/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 1 within the pocket book, change the placeholder <S3-BUCKET-NAME> and populate the variable s3_bucket with the bucket identify. You may get the identify of the S3 bucket by executing the next command:

awk -F '|' '$2 ~ /S3Bucket/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

In Step 4 within the pocket book, change <ELASTIC-SEARCH-DOMAIN-WITHOUT-HTTPS> with the Elasticsearch area identify. You may get the area identify by executing the next command:

awk -F '|' '$2 ~ /ElasticsearchDomainEndpoint/ [ t]+$/, "", $3); print $3' ${BLOG_DIR}/GlueOpenSearchStack_outputs.txt

Run the pocket book

Run every cell within the pocket book to load knowledge to the Elasticsearch area and skim it again to confirm the profitable load. Consult with the detailed directions inside the pocket book for execution-specific steering.

Ingest knowledge into OpenSearch Service utilizing the AWS Glue OpenSearch Service connection

On this part, we load an OpenSearch Service index utilizing Spark and the AWS Glue OpenSearch Service connection.

Create the AWS Glue job

Full the next steps to create an AWS Glue Visible ETL job:

On the AWS Glue console, select ETL jobs within the navigation pane.
Underneath Create job, select Visible ETL

This may open the AWS Glue job visible editor. Image showing AWS console page for AWS Glue to open Visual ETL