AWS Machine Learning Blog
Securely search unstructured data on Windows file systems with the Amazon Kendra connector for Amazon FSx for Windows File Server
Critical information can be scattered across multiple data sources in your organization, including sources such as Windows file systems stored on Amazon FSx for Windows File Server. You can now use the Amazon Kendra connector for FSx for Windows File Server to index documents (HTML, PDF, MS Word, MS PowerPoint, and plain text) stored in your Windows file system on FSx for Windows File Server and search for information across this content using intelligent search in Amazon Kendra.
Organizations store unstructured data in files on shared Windows file systems and secure it by using Windows Access Control Lists (ACLs) to ensure that users can read, write, and create files as per their access permissions configured in the enterprise Active Directory (AD) domain. Finding specific information from this data not only requires searching through the files, but also ensuring that the user is authorized to access it. The Amazon Kendra connector for FSx for Windows File Server indexes the files stored on FSx for Windows File Server and ingests the ACLs in the Amazon Kendra index, so that the response of a search query made by a user includes results only from those documents that the user is authorized to read.
This post takes the example of a set of documents stored securely on a file system using ACLs on FSx for Windows File Server. These documents are ingested in an Amazon Kendra index by configuring and synchronizing this file system as a data source of the index using the connector for FSx for Windows File Server. Then we demonstrate that when a user makes a search query, the Amazon Kendra index uses the ACLs based on the user name and groups the user belongs to, and returns only those documents the user is authorized to access. We also include details of the configuration and screenshots at every stage so you can use this as a reference when configuring the Amazon Kendra connector for FSx for Windows File Server in your setup.
Prerequisites
To try out the Amazon Kendra connector for FSx for Windows File Server, you need the following:
- An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
- Basic knowledge of AWS and working knowledge of Windows ACLs and Microsoft AD domain administration.
- Admin access to a file system on FSx for Windows File Server, with admin access to the AD domain to which it belongs. Alternatively, you can deploy this using the Quick Start for FSx for Windows File Server.
- The AWS_Whitepapers.zip, which we use to try out the functionality. For updated versions, refer to AWS Whitepapers & Guides. Alternatively, you can use your own documents.
Solution architecture
The following diagram illustrates the solution architecture:
The documents in this example are stored on a file system (3 in the diagram) on FSx for Windows File Server (4). The files are set up with ACLs based on the user and group configurations in the AD domain created using AWS Directory Service (1), to which FSx for Windows File Server belongs. This file system on FSx for Windows File Server is configured as a data source for Amazon Kendra (5). AWS Single Sign-On (AWS SSO) is enabled with the AD as the identity source, and the Amazon Kendra index is set up to use AWS SSO (2) for user name and group lookup for the user context of the search queries from the customer search solution deployments (6). The FSx for Windows File Server file system, the AWS Managed Microsoft AD server, and the Amazon Virtual Private Cloud (Amazon VPC) and subnets configured in this example are created using the Quick Start for FSx for Windows File Server.
FSx for Windows File Server configuration
The following screenshot shows the file system on FSx for Windows File Server configured as a part of an AWS Managed Microsoft AD domain that is used in our example, as seen on the Amazon FSx console.
AWS Managed Microsoft AD configuration
The AD to which FSx for Windows File Server belongs is configured as an AWS Managed Microsoft AD, as seen in the following screenshot of the Directory Service console.
Users, groups and ACL configuration for sample dataset
For this post, we used a dataset consisting of a few publicly available AWS whitepapers and stored them in directories based on their categories (Best_Practices, Databases, General, Machine_Learning, Security, and Well_Architected) on a file system on FSx for Windows File Server. The following screenshot shows the folders as seen from a Windows bastion host that is part of the AD domain to which the file system belongs.
Users and groups are configured in the AD domain as follows:
- kadmin – group_kadmin
- patricia – group_sa, group_kauthenticated
- james – group_db_sa, group_kauthenticated
- john – group_ml_sa, group_kauthenticated
- mary, julie, tom – group_kauthenticated
The following screenshot shows users and groups configured in the AWS Managed Microsoft AD domain as seen from the Windows bastion host.
The ACLs for the files in each directory are set up based on the user and group configurations in the AD domain to which FSx for Windows File Server belongs:
- All authenticated users (group_kauthenticated) – Can access the documents in the Best_Practices and General directories
- Solutions Architects (group_sa) – Can access the documents in the Best_Practices, General, Security, and Well_Architected directories
- Database subject matter expert Solutions Architects (group_db_sa) – Can access the documents in the Best_Practices, General, Security, Well_Architected, and Databases directories
- Machine learning subject matter expert Solutions Architects (group_ml_sa) – Can access the documents in the Best_Practices, General, Security, Well_Architected, and Machine_Learning directories
- Admin (group_kadmin) – Can access the documents in any of the six directories
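The user, group, and directory permissions listed above can be modeled as a small lookup, which is handy for predicting what each user should see in search results later. This is only an illustrative sketch: the real enforcement is done by the Windows ACLs and by the Amazon Kendra index, and the names `GROUP_DIRECTORIES`, `USER_GROUPS`, and `readable_directories` are hypothetical helpers, not part of any AWS API.

```python
# Illustrative model of the sample ACLs described above.
# Actual access control is enforced by Windows ACLs and Amazon Kendra.

COMMON = {"Best_Practices", "General"}
SA = COMMON | {"Security", "Well_Architected"}

# Directories each AD group is allowed to read.
GROUP_DIRECTORIES = {
    "group_kauthenticated": COMMON,
    "group_sa": SA,
    "group_db_sa": SA | {"Databases"},
    "group_ml_sa": SA | {"Machine_Learning"},
    "group_kadmin": SA | {"Databases", "Machine_Learning"},
}

# Group memberships of the sample users.
USER_GROUPS = {
    "kadmin": ["group_kadmin"],
    "patricia": ["group_sa", "group_kauthenticated"],
    "james": ["group_db_sa", "group_kauthenticated"],
    "john": ["group_ml_sa", "group_kauthenticated"],
    "mary": ["group_kauthenticated"],
    "julie": ["group_kauthenticated"],
    "tom": ["group_kauthenticated"],
}


def readable_directories(user):
    """Return the sorted list of directories a user can read.

    A user can read a directory if any of their groups can read it.
    Unknown users belong to no group and therefore get an empty list.
    """
    directories = set()
    for group in USER_GROUPS.get(user, []):
        directories |= GROUP_DIRECTORIES.get(group, set())
    return sorted(directories)


if __name__ == "__main__":
    for name in USER_GROUPS:
        print(name, readable_directories(name))
```

A user who is not in the AD domain belongs to no group in this model, so the function returns an empty list for them.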
The following screenshot shows the ACL configurations for each of the directories of our sample data, as seen from the Windows bastion host.
AWS Single Sign-On configuration
AWS SSO is configured with the AD domain as the identity source. The following screenshot shows the settings on the AWS SSO console.
The groups are synchronized in AWS SSO from the AD, as shown in the following screenshot.
The following screenshot shows the members of the group_kauthenticated group synchronized from the AD.
Data source configuration using Amazon Kendra connector for FSx for Windows File Server
We configure a data source using the Amazon Kendra connector for FSx for Windows File Server in an Amazon Kendra index on the Amazon Kendra console. You can create a new Amazon Kendra index or use an existing one and add a new data source.
When you add a data source for an Amazon Kendra index, choose the FSx for Windows File Server connector by choosing Add connector under Amazon FSx.
The steps to add a data source name and resource tags are similar to adding any other data source, as shown in the following screenshot.
The specific file system on Amazon FSx and the type of the file system (FSx for Windows File Server in this case) are configured in the Source section. The authentication credentials of a user with admin privileges to the file system are configured using an AWS Secrets Manager secret.
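If you create the secret programmatically rather than on the console, the secret value is a JSON document holding the AD user's credentials. The sketch below only builds that JSON string. It assumes the `username` and `password` key names that the connector documentation describes, so verify them against the current documentation. The helper name and the example user are hypothetical.

```python
import json


def build_fsx_secret_string(username, password):
    """Build the JSON secret value for the file system credentials.

    Assumption: the connector reads the keys "username" (the AD user name
    including the domain) and "password" from the secret.
    """
    return json.dumps({"username": username, "password": password})


if __name__ == "__main__":
    # Hypothetical credentials for illustration only; never hard-code real ones.
    secret_string = build_fsx_secret_string("fsx-admin@kendra-01.com", "example-password")
    print(secret_string)
    # The string could then be stored with AWS Secrets Manager, for example:
    #   boto3.client("secretsmanager").create_secret(
    #       Name="<your-secret-name>", SecretString=secret_string)
```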
The VPC and security group settings of the data source configuration include the details of the VPC, subnets, and security group of Amazon FSx and the AD server. In the following screenshot, we also create a new IAM role for the data source.
The next step in data source configuration involves mapping the Amazon FSx connector fields to the Amazon Kendra facets or field names. In the following screenshot, we leave the configuration unchanged. The step after this involves reviewing the configuration and confirming that the data source should be created.
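The console steps above correspond roughly to a single `CreateDataSource` request. The following sketch assembles such a request as a plain dictionary so that the pieces (file system, secret, VPC settings, IAM role) are visible in one place. The function name and all IDs and ARNs are hypothetical placeholders, and field mappings are omitted to mirror the unchanged default configuration. Check the request shape against the current Amazon Kendra API reference before relying on it.

```python
def build_fsx_data_source_request(index_id, name, role_arn, file_system_id,
                                  secret_arn, subnet_ids, security_group_ids):
    """Assemble keyword arguments for an Amazon Kendra CreateDataSource call."""
    return {
        "IndexId": index_id,
        "Name": name,
        "Type": "FSX",  # data source type for the Amazon FSx connector
        "RoleArn": role_arn,  # IAM role the data source assumes
        "Configuration": {
            "FsxConfiguration": {
                "FileSystemId": file_system_id,
                "FileSystemType": "WINDOWS",
                "SecretArn": secret_arn,  # Secrets Manager secret with credentials
                "VpcConfiguration": {
                    "SubnetIds": list(subnet_ids),
                    "SecurityGroupIds": list(security_group_ids),
                },
            }
        },
    }


if __name__ == "__main__":
    # All values below are placeholders, not real resources.
    request = build_fsx_data_source_request(
        index_id="<index-id>",
        name="fsx-windows-whitepapers",
        role_arn="<data-source-role-arn>",
        file_system_id="<file-system-id>",
        secret_arn="<secret-arn>",
        subnet_ids=["<subnet-id>"],
        security_group_ids=["<security-group-id>"],
    )
    print(request)
    # With AWS credentials configured, the request could be sent using:
    #   boto3.client("kendra").create_data_source(**request)
```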
After you configure the file system on FSx for Windows File Server, where the example data is stored, as a data source, you configure Custom Document Enrichment (CDE) basic operations for this data source so that the Amazon Kendra index field _category is set based on the directory in which a document is stored. The data source sync is started after the CDE configuration, so that the _category attribute of each document is set during the ingestion workflow.
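The CDE basic operations described above amount to one rule per directory: if the document's source URI contains the directory name, set `_category` accordingly. A minimal sketch of that configuration as a Python dictionary follows. The function name is hypothetical, and mapping `Best_Practices` to the display value `Best Practices` is an assumption about how the category values were chosen in this example.

```python
DIRECTORIES = [
    "Best_Practices",
    "Databases",
    "General",
    "Machine_Learning",
    "Security",
    "Well_Architected",
]


def build_cde_configuration(directories):
    """Build inline CDE rules that derive _category from the source URI."""
    inline_configurations = []
    for directory in directories:
        inline_configurations.append({
            # Apply the rule when the source URI contains the directory name.
            "Condition": {
                "ConditionDocumentAttributeKey": "_source_uri",
                "Operator": "Contains",
                "ConditionOnValue": {"StringValue": directory},
            },
            # Set _category; underscores become spaces (an assumption).
            "Target": {
                "TargetDocumentAttributeKey": "_category",
                "TargetDocumentAttributeValueDeletion": False,
                "TargetDocumentAttributeValue": {
                    "StringValue": directory.replace("_", " ")
                },
            },
        })
    return {"InlineConfigurations": inline_configurations}


if __name__ == "__main__":
    cde = build_cde_configuration(DIRECTORIES)
    print(len(cde["InlineConfigurations"]), "rules")
    # The dictionary could be supplied as CustomDocumentEnrichmentConfiguration
    # when creating or updating the data source through the Amazon Kendra API.
```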
As shown in the following screenshot, the Amazon Kendra index user access control settings are configured for user and group lookup through AWS SSO integration. JSON token-based user access control is enabled to search based on user and group names from the Amazon Kendra Search console.
In the facet definition for the Amazon Kendra index, make sure that the facetable and displayable boxes are checked for _category. This allows you to use the _category values set by the CDE basic operations as facets while searching.
Search with Amazon Kendra
After the data source sync is complete, we can start searching from the Amazon Kendra Search console, by choosing Search indexed content in the navigation pane on the Amazon Kendra console. Because we’re using AWS whitepapers as the dataset to ingest in the Amazon Kendra index, we use “What’s DynamoDB?” as the search query. Only authenticated users are authorized access to the files on the file system on FSx for Windows File Server; therefore, when we use this search query without setting any user name or group, we don’t get any results.
Now let’s set the user name to mary@kendra-01.com. The user mary belongs to group_kauthenticated, and therefore is authorized to access the documents in the Best_Practices and General directories. In the following screenshot, the search response includes documents with the facet _category set to Best Practices and General. The CDE basic operations set the facet _category depending on the directory names contained in the _source_uri. This confirms that the ACLs ingested in Amazon Kendra by the connector for FSx for Windows File Server are being enforced in the search results based on the user name.
Now we change the user name to patricia@kendra-01.com. The user patricia belongs to group_sa, with access to the Security and Well_Architected directories, in addition to the Best_Practices and General directories. The search response includes results from these additional directories.
Now we can observe how the results from the search response change as we change the user name to james@kendra-01.com, john@kendra-01.com, and kadmin@kendra-01.com in the following screenshots.
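Outside the console, the same user-scoped searches can be issued through the Amazon Kendra Query API by passing the user name in the request's user context. The sketch below only builds the request dictionaries. The function name and the index ID are hypothetical placeholders, and it assumes group lookup is handled by the AWS SSO integration described earlier, so only the user ID is passed.

```python
def build_query_request(index_id, query_text, user_id=None, groups=None):
    """Assemble keyword arguments for an Amazon Kendra Query call.

    When neither user_id nor groups is given, no UserContext is attached,
    which mirrors searching without setting a user name.
    """
    request = {"IndexId": index_id, "QueryText": query_text}
    user_context = {}
    if user_id:
        user_context["UserId"] = user_id
    if groups:
        user_context["Groups"] = list(groups)
    if user_context:
        request["UserContext"] = user_context
    return request


if __name__ == "__main__":
    users = [
        "mary@kendra-01.com",
        "patricia@kendra-01.com",
        "james@kendra-01.com",
        "john@kendra-01.com",
        "kadmin@kendra-01.com",
    ]
    for user in users:
        request = build_query_request("<index-id>", "What's DynamoDB?", user_id=user)
        print(request)
        # With AWS credentials configured, each request could be sent using:
        #   boto3.client("kendra").query(**request)
```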
Clean up
If you deployed any AWS infrastructure to experiment with the Amazon Kendra connector for FSx for Windows File Server, clean up the infrastructure as follows:
- If you used the Quick Start for FSx for Windows File Server, delete the AWS CloudFormation stack you created so that it deletes all the resources it created.
- If you created a new Amazon Kendra index, delete it.
- If you only added a new data source using the connector, delete that data source.
- Delete the AWS SSO configuration.
Conclusion
The Amazon Kendra connector for FSx for Windows File Server enables secure and intelligent search of information scattered in unstructured content. The data is securely stored on file systems on FSx for Windows File Server with ACLs and shared with users based on their Microsoft AD domain credentials.
For more information on the Amazon Kendra connector for FSx for Windows File Server, refer to Getting started with an Amazon FSx data source (console) and Using an Amazon FSx data source.
For information on Custom Document Enrichment, refer to Customizing document metadata during the ingestion process and Enrich your content and metadata to enhance your search experience with custom document enrichment in Amazon Kendra.
About the Author
Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS Partners to help them in their cloud journey.