Amazon Kendra Deep Dive

 

Amazon Kendra

  • An easy-to-use, ML-powered enterprise search service.
  • Allows developers to add search capabilities to their applications, enabling faster data discovery across the vast amounts of data spread throughout a company.
  • Datasets may include manuals, reports, FAQs, and human resources or customer service guides.
  • The data can be stored in many places: S3, SharePoint, Salesforce, ServiceNow, RDS, Microsoft OneDrive, etc.
  • When you type a question, Kendra uses ML algorithms to understand the context and return the most relevant results.

Steps to enable Kendra at a high level:

  1. Create an Amazon Kendra index.
  2. After the index is created, explore the available data source connectors based on where our data resides.
  3. Ingest the data through the selected connector. Based on the ingested data, create document metadata that helps with faceting and filtering the documents.
  4. Generate and ingest a list of FAQs.
  5. Run search queries and audit the answers we get.




Creating an Amazon Kendra Index:

Steps:
1. Access the Amazon Kendra console.
2. Create an index.
3. Enter an index name and select an IAM role (and a role name, if creating a new role).
4. For configuring user access control, keep the default (No).
5. Select one of the two available provisioning editions (Developer, Enterprise).
6. Click Create, and wait until the process completes (a programmatic equivalent is sketched below the note).

NOTE:
- Kendra automatically publishes error and alert logs to Amazon CloudWatch.
- A CloudWatch log group and corresponding log stream will be created for us.
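
The same index can be created with boto3. A minimal sketch, assuming credentials are configured; the index name, region, and role ARN are placeholders to replace with your own:

    import boto3

    kendra = boto3.client("kendra", region_name="us-east-1")  # assumed region

    # Create a Developer Edition index. The role (placeholder ARN) must allow
    # Kendra to publish logs and metrics to CloudWatch.
    response = kendra.create_index(
        Name="my-docs-index",                # hypothetical index name
        Edition="DEVELOPER_EDITION",         # or "ENTERPRISE_EDITION"
        RoleArn="arn:aws:iam::123456789012:role/KendraIndexRole",
    )
    index_id = response["Id"]

    # Index creation is asynchronous; check the status until it is ACTIVE.
    print(kendra.describe_index(Id=index_id)["Status"])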


Ingesting Documents:

We can ingest documents into Kendra using the following mechanisms (a BatchPutDocument sketch follows the list):
  1. Data sources: locations (such as SharePoint, Salesforce, or S3) where we store the documents for indexing. You can automatically synchronize data sources with the Kendra index so that new, updated, or deleted documents in the data source are also added, updated, or deleted in the index.
  2. FAQ documents: files that contain questions and answers, which can be uploaded through the console or with the CreateFaq API.
  3. The BatchPutDocument API: accepts inline blobs and S3 locations for documents.
  4. A custom data source, if needed, using the same BatchPutDocument API.
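
A hedged boto3 sketch of option 3; the index ID, role ARN, bucket, and key are placeholders:

    import boto3

    kendra = boto3.client("kendra")
    INDEX_ID = "REPLACE-WITH-YOUR-INDEX-ID"  # placeholder

    # Ingest one inline text blob and one document already sitting in S3.
    # The RoleArn (placeholder) is needed so Kendra can read the S3 object.
    kendra.batch_put_document(
        IndexId=INDEX_ID,
        RoleArn="arn:aws:iam::123456789012:role/KendraBatchPutRole",
        Documents=[
            {
                "Id": "doc-inline-1",
                "Blob": b"Our VPN policy requires MFA for all remote logins.",
                "ContentType": "PLAIN_TEXT",
            },
            {
                "Id": "doc-s3-1",
                "S3Path": {"Bucket": "my-docs-bucket", "Key": "guides/vpn.pdf"},
                "ContentType": "PDF",
            },
        ],
    )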

Unstructured text that can be ingested via connectors or the BatchPutDocument interface:
  • HTML files
  • Microsoft PowerPoint presentations
  • Microsoft Word documents
  • Plain text documents
  • PDFs


Amazon Kendra S3 Connector

Kendra offers an S3 connector that allows document ingestion (a scripted equivalent of the steps is sketched after the list).
The advantage of using the provided connector is that it can also ingest the metadata attributes associated with the original documents.

Steps:
  1. Create an S3 bucket to store your documents.
  2. Upload the required documents to the S3 bucket.
  3. Go to Data Management -> Data sources, select Amazon S3, and click Add connector.
  4. For the S3 data source, configure the sync settings: enter the data source location, an optional metadata files prefix, and an optional ACL configuration file.
  5. Under the additional configuration, you can define inclusion and exclusion patterns; add the S3 folder and click Add.
  6. For the sync run schedule, select Run on demand and click Next.
  7. On the set field mappings page, keep the default configuration and click Next.
  8. On the review and create page, click Add data source to complete the process of adding S3 as a data source.
  9. After the creation process completes, click Sync now.
  10. Time to test a query: go to Data Management -> Search indexed content.
  11. Type a query in the search bar to search for specific content.
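
A minimal boto3 sketch of the same flow; the index ID, data source name, role ARN, and bucket are placeholders:

    import boto3

    kendra = boto3.client("kendra")
    INDEX_ID = "REPLACE-WITH-YOUR-INDEX-ID"  # placeholder

    # Create the S3 data source; the role (placeholder) must let Kendra read the bucket.
    ds = kendra.create_data_source(
        IndexId=INDEX_ID,
        Name="s3-docs",                      # hypothetical data source name
        Type="S3",
        RoleArn="arn:aws:iam::123456789012:role/KendraS3Role",
        Configuration={
            "S3Configuration": {
                "BucketName": "my-docs-bucket",  # placeholder
                "DocumentsMetadataConfiguration": {"S3Prefix": "metadata/"},
            }
        },
    )

    # Equivalent of clicking "Sync now" in the console.
    kendra.start_data_source_sync_job(Id=ds["Id"], IndexId=INDEX_ID)

    # Once the sync finishes, search the indexed content.
    result = kendra.query(IndexId=INDEX_ID, QueryText="What is our VPN policy?")
    for item in result["ResultItems"]:
        print(item["Type"], "-", item.get("DocumentTitle", {}).get("Text"))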


Filtering search results (metadata documents)

  1. We can filter search results based on the Category field, using a category value.
  2. For example, select "Security" as the category to filter the results.
  3. Search results can be improved by creating a separate metadata document for each source document (see the sketch below).
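
A sketch of what such a metadata document can look like for the S3 connector; the bucket, key, and title are hypothetical, and the exact schema should be checked against the Kendra documentation. For a document at guides/vpn.pdf with a metadata prefix of metadata/, the connector looks for metadata/guides/vpn.pdf.metadata.json:

    import json
    import boto3

    s3 = boto3.client("s3")

    metadata = {
        "Title": "VPN Security Guide",     # hypothetical title
        "ContentType": "PDF",
        "Attributes": {
            "_category": "Security",       # reserved attribute backing the Category facet
        },
    }
    s3.put_object(
        Bucket="my-docs-bucket",           # placeholder
        Key="metadata/guides/vpn.pdf.metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
    )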

Adding fields to your Kendra index

  1. Click on Facet definition.
  2. Click on Add field.
  3. Enter the field name (use the same name as it appears in the metadata document), select the data type, and click Add.
  4. Save the added fields (a programmatic sketch follows).
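
A boto3 sketch using update_index; the index ID and field name are placeholders:

    import boto3

    kendra = boto3.client("kendra")

    # Add a custom string field matching the name used in the metadata document.
    kendra.update_index(
        Id="REPLACE-WITH-YOUR-INDEX-ID",   # placeholder
        DocumentMetadataConfigurationUpdates=[
            {
                "Name": "Department",      # hypothetical field name
                "Type": "STRING_VALUE",
                "Search": {
                    "Facetable": True,
                    "Searchable": True,
                    "Displayable": True,
                    "Sortable": False,
                },
            }
        ],
    )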

Updating the S3 connector

- As we change fields or add metadata files, we can update the index with the new files by running the Sync now job again.



Filtering Queries in Amazon Kendra

Under Data Management, select Facet definition.
For every column, we have the option of selecting any of the 4 options: Facetable, Searchable, Displayable, Sortable.
Click on Search indexed content and perform a search.


Using facets in a query

  • Once Facetable is selected on the columns, facets can be requested in a query (see the sketch below).
  • This adds a new key in the response called "FacetResults" that contains the facet values for the documents in the response.
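
A minimal boto3 sketch of a faceted query; the index ID is a placeholder and the attribute key assumes the reserved _category field:

    import boto3

    kendra = boto3.client("kendra")

    result = kendra.query(
        IndexId="REPLACE-WITH-YOUR-INDEX-ID",            # placeholder
        QueryText="security best practices",
        Facets=[{"DocumentAttributeKey": "_category"}],  # request facet counts
    )

    # Each facet value comes back with a document count.
    for facet in result.get("FacetResults", []):
        for pair in facet["DocumentAttributeValueCountPairs"]:
            print(pair["DocumentAttributeValue"]["StringValue"], pair["Count"])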

Making an index field sortable

  • Back in the index facet definition section, unmark Sortable on all the fields.
  • Run a query; you will notice that the only option for sorting is "Relevance".
  • Back on the facet definition, mark the fields as Sortable.
  • Run the query again, using the newly added field as the sorting parameter.
  • For example, the sketch below runs a query and sorts the results by the new attribute in ascending order.
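
A hedged boto3 sketch; the attribute name is hypothetical and must be marked Sortable in your index:

    import boto3

    kendra = boto3.client("kendra")

    # Sort results by a custom sortable attribute in ascending order.
    result = kendra.query(
        IndexId="REPLACE-WITH-YOUR-INDEX-ID",      # placeholder
        QueryText="security best practices",
        SortingConfiguration={
            "DocumentAttributeKey": "Department",  # hypothetical sortable field
            "SortOrder": "ASC",                    # or "DESC"
        },
    )
    for item in result["ResultItems"]:
        print(item.get("DocumentTitle", {}).get("Text"))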

Relevance Tuning

  • Allows you to boost a result in the response when the query includes terms that match the attribute.
  • To allow an attribute to be used to boost a document, you need to mark it as Searchable (a tuning sketch follows).
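
A sketch of tuning relevance through update_index, assuming the hypothetical Department field from earlier; Importance ranges from 1 (low) to 10 (high):

    import boto3

    kendra = boto3.client("kendra")

    # Boost documents whose "Department" attribute matches query terms.
    kendra.update_index(
        Id="REPLACE-WITH-YOUR-INDEX-ID",   # placeholder
        DocumentMetadataConfigurationUpdates=[
            {
                "Name": "Department",      # hypothetical field
                "Type": "STRING_VALUE",
                "Search": {
                    "Facetable": True,
                    "Searchable": True,    # required for relevance boosting
                    "Displayable": True,
                    "Sortable": False,
                },
                "Relevance": {"Importance": 8},
            }
        ],
    )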
