Amazon DocumentDB zero-ETL integration with Amazon OpenSearch Service is now available


Today, we're announcing the general availability of Amazon DocumentDB (with MongoDB compatibility) zero-ETL integration with Amazon OpenSearch Service.

Amazon DocumentDB provides native text search and vector search capabilities. With Amazon OpenSearch Service, you can perform advanced search analytics, such as fuzzy search, synonym search, cross-collection search, and multilingual search, on Amazon DocumentDB data.

Zero-ETL integration simplifies your architecture for advanced search analytics. It frees you from performing undifferentiated heavy lifting and from the costs associated with building and managing data pipeline architecture and data synchronization between the two services.

In this post, we show you how to configure zero-ETL integration of Amazon DocumentDB with OpenSearch Service using Amazon OpenSearch Ingestion. It involves performing a full load of Amazon DocumentDB data and continuously streaming the latest data to Amazon OpenSearch Service using change streams. For other ingestion methods, see the documentation.

Solution overview

At a high level, this solution involves the following steps:

  1. Enable change streams on the Amazon DocumentDB collections.
  2. Create the OpenSearch Ingestion pipeline.
  3. Load sample data on the Amazon DocumentDB cluster.
  4. Verify the data in OpenSearch Service.

Prerequisites

To implement this solution, you need the following prerequisites:

Zero-ETL will perform an initial full load of your collection by doing a collection scan on the primary instance of your Amazon DocumentDB cluster, which may take several minutes to complete depending on the size of the data, and you may notice elevated resource consumption on your cluster.

Enable change streams on the Amazon DocumentDB collections

Amazon DocumentDB change stream events comprise a time-ordered sequence of data changes resulting from inserts, updates, and deletes on your data. We use these change stream events to transmit data changes from the Amazon DocumentDB cluster to the OpenSearch Service domain.
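For reference, a change stream event for an insert looks roughly like the following (the field values here are illustrative, not taken from a real cluster):

```json
{
  "_id": { "_data": "0163f2..." },
  "operationType": "insert",
  "ns": { "db": "inventory", "coll": "product" },
  "documentKey": { "_id": { "$oid": "64a1f2c8e1b2c3d4e5f60718" } },
  "fullDocument": {
    "_id": { "$oid": "64a1f2c8e1b2c3d4e5f60718" },
    "Item": "Ultra GelPen"
  }
}
```

OpenSearch Ingestion uses the `operationType` and `documentKey` fields to decide how to apply each event to the target index.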

Change streams are disabled by default; you can enable them at the individual collection level, database level, or cluster level. To enable change streams on your collections, complete the following steps:

  1. Connect to Amazon DocumentDB using the mongo shell.
  2. Enable change streams on your collection with the following code. For this post, we use the Amazon DocumentDB database inventory and collection product:
    db.adminCommand({modifyChangeStreams: 1,
        database: "inventory",
        collection: "product",
        enable: true});

If you have more than one collection for which you want to stream data into OpenSearch Service, enable change streams for each collection. If you want to enable it at the database or cluster level, see Enabling Change Streams.
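As a sketch, following the same adminCommand pattern, enabling change streams for every collection in a database uses an empty collection name (consult Enabling Change Streams for the authoritative syntax):

```
// Enable change streams for all collections in the "inventory" database
db.adminCommand({modifyChangeStreams: 1,
    database: "inventory",
    collection: "",
    enable: true});
```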

It's recommended to enable change streams for only the required collections.

Create an OpenSearch Ingestion pipeline

OpenSearch Ingestion is a fully managed data collector that delivers real-time log and trace data to OpenSearch Service domains. OpenSearch Ingestion is powered by the open source data collector Data Prepper. Data Prepper is part of the open source OpenSearch project.

With OpenSearch Ingestion, you can filter, enrich, transform, and deliver your data for downstream analysis and visualization. OpenSearch Ingestion is serverless, so you don't need to worry about scaling your infrastructure, operating your ingestion fleet, or patching and updating the software.

For a comprehensive overview of OpenSearch Ingestion, visit Amazon OpenSearch Ingestion, and for more information about the Data Prepper open source project, visit Data Prepper.

To create an OpenSearch Ingestion pipeline, complete the following steps:

  1. On the OpenSearch Service console, choose Pipelines in the navigation pane.
  2. Choose Create pipeline.
  3. For Pipeline name, enter a name (for example, zeroetl-docdb-to-opensearch).
  4. Set up pipeline capacity for compute resources to automatically scale your pipeline based on the current ingestion workload.
  5. Enter the minimum and maximum Ingestion OpenSearch Compute Units (OCUs). In this example, we use the default pipeline capacity settings of a minimum of 1 Ingestion OCU and a maximum of 4 Ingestion OCUs.

Each OCU is a combination of approximately 8 GB of memory and 2 vCPUs that can handle an estimated 8 GiB per hour. OpenSearch Ingestion supports up to 96 OCUs, and it automatically scales up and down based on your ingest workload demand.
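As a rough back-of-the-envelope sketch (assuming the ~8 GiB per hour per-OCU estimate above; actual sizing depends on your workload and should be validated against the service documentation), you can estimate the OCUs needed for a given peak ingest rate:

```python
import math

OCU_THROUGHPUT_GIB_PER_HOUR = 8   # estimated capacity of one Ingestion OCU
MAX_OCUS = 96                     # current OpenSearch Ingestion upper limit

def estimate_ocus(peak_gib_per_hour: float) -> int:
    """Return a rough minimum OCU count for a peak ingest rate, clamped to service limits."""
    ocus = math.ceil(peak_gib_per_hour / OCU_THROUGHPUT_GIB_PER_HOUR)
    return min(max(ocus, 1), MAX_OCUS)

print(estimate_ocus(30))   # a 30 GiB/hour peak needs about 4 OCUs
```

In practice you would set the pipeline minimum lower and let OpenSearch Ingestion scale within the range you configure.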

  6. Choose the configuration blueprint, and under Use case in the navigation pane, choose ZeroETL.
  7. Select Zero-ETL with DocumentDB to build the pipeline configuration.

This pipeline is a combination of a source part from the Amazon DocumentDB settings and a sink part for OpenSearch Service.

You must set several AWS Identity and Access Management (IAM) roles (sts_role_arn) with the required permissions to read data from the Amazon DocumentDB database and collection and write to an OpenSearch Service domain. This role is then assumed by OpenSearch Ingestion pipelines to make sure the right security posture is always maintained when moving the data from source to destination. To learn more, see Setting up roles and users in Amazon OpenSearch Ingestion.
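The pipeline role's trust policy must allow OpenSearch Ingestion to assume it. A minimal example looks like the following (the role itself additionally needs permission policies for DocumentDB, Amazon S3, and the OpenSearch Service domain):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "osis-pipelines.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```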

You need one OpenSearch Ingestion pipeline per Amazon DocumentDB collection.

version: "2"
documentdb-pipeline:
  source:
    documentdb:
      acknowledgments: true
      host: "<<docdb-2024-01-03-20-31-17.cluster-abcdef.us-east-1.docdb.amazonaws.com>>"
      port: 27017
      authentication:
        username: ${{aws_secrets:secret:username}}
        password: ${{aws_secrets:secret:password}}
      aws:
        sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"

      s3_bucket: "<<bucket-name>>"
      s3_region: "<<bucket-region>>"
      # optional s3_prefix for OpenSearch Ingestion to write the records
      # s3_prefix: "<<path_prefix>>"
      collections:
        # collection format: <databaseName>.<collectionName>
        - collection: "<<databaseName.collectionName>>"
          export: true
          stream: true
  sink:
    - opensearch:
        # REQUIRED: Provide an AWS OpenSearch endpoint
        hosts: [ "<<https://search-mydomain-1a2a3a4a5a6a7a8a9a0a9a8a7a.us-east-1.es.amazonaws.com>>" ]
        index: "<<index_name>>"
        index_type: custom
        document_id: "${getMetadata("primary_key")}"
        action: "${getMetadata("opensearch_action")}"
        # DocumentDB record creation or event timestamp
        document_version: "${getMetadata("document_version")}"
        document_version_type: "external"
        aws:
          # REQUIRED: Provide a Role ARN with access to the domain. This role should have a trust relationship with osis-pipelines.amazonaws.com
          sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"
          # Provide the region of the domain.
          region: "<<us-east-1>>"
          # Enable the 'serverless' flag if the sink is an Amazon OpenSearch Serverless collection
          # serverless: true
          # serverless_options:
            # Specify a name here to create or update network policy for the serverless collection
            # network_policy_name: "network-policy-name"

extension:
  aws:
    secrets:
      secret:
        # Secret name or secret ARN
        secret_id: "<<my-docdb-secret>>"
        region: "<<us-east-1>>"
        sts_role_arn: "<<arn:aws:iam::123456789012:role/Example-Role>>"
        refresh_interval: PT1H

Provide the following parameters from the blueprint:

  • Amazon DocumentDB endpoint – Provide your Amazon DocumentDB cluster endpoint.
  • Amazon DocumentDB collection – Provide your Amazon DocumentDB database name and collection name in the format dbname.collection within the collections section. For example, inventory.product.
  • s3_bucket – Provide your S3 bucket name along with the AWS Region and S3 prefix. This will be used temporarily to hold the data from Amazon DocumentDB for data synchronization.
  • OpenSearch hosts – Provide the OpenSearch Service domain endpoint for the host, and provide the preferred index name to store the data.
  • secret_id – Provide the ARN of the secret for the Amazon DocumentDB cluster along with its Region.
  • sts_role_arn – Provide the ARN of the IAM role that has permissions for the Amazon DocumentDB cluster, S3 bucket, and OpenSearch Service domain.

To learn more, see Creating Amazon OpenSearch Ingestion pipelines.
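Before validating the pipeline on the console, it can help to catch unreplaced <<placeholder>> tokens and malformed collection names locally. The following is a small, hypothetical helper (not part of the blueprint or any AWS tool) sketching such a check:

```python
import re

def check_pipeline_config(config_text: str, collection_names: list[str]) -> list[str]:
    """Return a list of problems found in a pipeline configuration draft."""
    problems = []
    # Blueprint placeholders look like <<value>> and must all be replaced
    for placeholder in re.findall(r"<<[^>]*>>", config_text):
        problems.append(f"unreplaced placeholder: {placeholder}")
    # Collections must use the <databaseName>.<collectionName> format
    for name in collection_names:
        if not re.fullmatch(r"[^.]+\.[^.]+", name):
            problems.append(f"bad collection format: {name}")
    return problems

print(check_pipeline_config('s3_bucket: "<<bucket-name>>"', ["inventory.product", "product"]))
# ['unreplaced placeholder: <<bucket-name>>', 'bad collection format: product']
```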

  8. After entering all the required values, validate the pipeline configuration for any errors.
  9. When designing a production workload, deploy your pipeline within a VPC. Choose your VPC, subnets, and security groups. Also select Attach to VPC and choose the corresponding VPC CIDR range.

The security group inbound rule should have access to the Amazon DocumentDB port. For more information, refer to Securing Amazon OpenSearch Ingestion pipelines within a VPC.

Load sample data on the Amazon DocumentDB cluster

Complete the following steps to load the sample data:

  1. Connect to your Amazon DocumentDB cluster.
  2. Insert some documents into the collection product in the inventory database by running the following commands. For creating and updating documents on Amazon DocumentDB, refer to Working with Documents.
    use inventory;

     db.product.insertMany([
       {
          "Item":"Ultra GelPen",
          "Colors":[
             "Violet"
          ],
          "Inventory":{
             "OnHand":100,
             "MinOnHand":35
          },
          "UnitPrice":0.99
       },
       {
          "Item":"Poster Paint",
          "Colors":[
             "Red",
             "Green",
             "Blue",
             "Black",
             "White"
          ],
          "Inventory":{
             "OnHand":47,
             "MinOnHand":50
          }
       },
       {
          "Item":"Spray Paint",
          "Colors":[
             "Black",
             "Red",
             "Green",
             "Blue"
          ],
          "Inventory":{
             "OnHand":47,
             "MinOnHand":50,
             "OrderQnty":36
          }
       }
    ])

Verify the data in OpenSearch Service

You can use the OpenSearch Dashboards dev console to search for the synchronized items within a few seconds. For more information, see Creating and searching for documents in Amazon OpenSearch Service.
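For example, assuming the index name configured in your pipeline is product-index (substitute your own index name), a query like the following in the dev console should return the inserted documents:

```
GET product-index/_search
{
  "query": {
    "match": {
      "Item": "Ultra GelPen"
    }
  }
}
```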

To verify the change data capture (CDC), run the following command to update the OnHand and MinOnHand fields for the existing document item Ultra GelPen in the product collection:

db.product.updateOne({
   "Item":"Ultra GelPen"
},
{
   "$set":{
      "Inventory":{
         "OnHand":300,
         "MinOnHand":100
      }
   }
});

Verify the CDC for the update to the document for the item Ultra GelPen on the OpenSearch Service index.

Monitor the CDC pipeline

You can monitor the state of the pipelines by checking the status of the pipeline on the OpenSearch Service console. Additionally, you can use Amazon CloudWatch to provide real-time metrics and logs, which lets you set up alerts in case of a breach of user-defined thresholds.

Clean up

Make sure you clean up unwanted AWS resources created during this post in order to prevent additional billing for these resources. Follow these steps to clean up your AWS account:

  1. On the OpenSearch Service console, choose Domains under Managed clusters in the navigation pane.
  2. Select the domain you want to delete and choose Delete.
  3. Choose Pipelines under Ingestion in the navigation pane.
  4. Select the pipeline you want to delete and on the Actions menu, choose Delete.
  5. On the Amazon S3 console, select the S3 bucket and choose Delete.

Conclusion

In this post, you learned how to enable zero-ETL integration between Amazon DocumentDB change data streams and OpenSearch Service. To learn more about zero-ETL integrations available with other data sources, see Working with Amazon OpenSearch Ingestion pipeline integrations.


About the Authors

Praveen Kadipikonda is a Senior Analytics Specialist Solutions Architect at AWS based out of Dallas. He helps customers build efficient, performant, and scalable analytic solutions. He has worked with building databases and data warehouse solutions for over 15 years.

Kaarthiik Thota is a Senior Amazon DocumentDB Specialist Solutions Architect at AWS based out of London. He is passionate about database technologies and enjoys helping customers solve problems and modernize applications using NoSQL databases. Before joining AWS, he worked extensively with relational databases, NoSQL databases, and business intelligence technologies for over 15 years.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.
