AWS Machine Learning Blog

Use AWS DeepLens to give Amazon Alexa the power to detect objects via Alexa skills

People are using Alexa for all types of activities in their homes, such as checking their bank balances, ordering pizza, or simply listening to their music from their favorite artists. For the most part, the primary interaction with the Echo has been your voice. In this blog post, we’ll show you how to build a new Alexa skill that will integrate with AWS DeepLens so when you ask “Alexa, what do you see?” Alexa returns objects detected by the AWS DeepLens device.

Object detection is an important topic in the AI deep learning world. For example, in autonomous driving, the camera on the vehicle needs to be able to detect objects (people, cars, signs, etc.) on the road first before making any decisions to turn, slow down, or stop.

AWS DeepLens was developed to put deep learning in the hands of developers. It ships with a fully programmable video camera, tutorials, code, and pre-trained models. It was designed so that you can have your first Deep Learning model running on the device within about 10 minutes after opening the box. For this blog post we’ll use one of the built-in object detection models included with AWS DeepLens. This enables AWS DeepLens to perform real-time object detection using the built-in camera. After the device detects objects, it sends information about the objects detected to the AWS IoT platform.

We’ll also show you how to push this data into an Amazon Kinesis data stream, and use Amazon Kinesis Data Analytics to aggregate duplicate objects detected in the stream and push them into another Kinesis data stream. Finally, you’ll create a custom Alexa skill with AWS Lambda to retrieve the detected objects from the Kinesis stream and have Alexa verbalize the result back to the user.

Solution overview

The following diagram depicts a high-level overview of this solution.

Amazon Kinesis Data Streams

You can use Amazon Kinesis Data Streams to build your own streaming application. This application can process and analyze real-time, streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.

Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.

AWS Lambda

AWS Lambda lets you run code without provisioning or managing servers. With AWS Lambda, you can run code for virtually any type of application or backend service – all with zero administration. Just upload your code and AWS Lambda takes care of everything required to run and scale your code with high availability. You can set up your code to automatically trigger from other AWS services or call it directly from any web app, mobile app, or, in this case, an Alexa skill.

Amazon Alexa

Alexa is the Amazon cloud-based voice service available on tens of millions of devices from Amazon and third-party device manufacturers. With Alexa, you can build natural voice experiences that offer customers a more intuitive way to interact with the technology they use every day. Our collection of tools, APIs, reference solutions, and documentation make it easy for anyone to build with Alexa.

Solution summary

The following is a quick walkthrough of the solution that’s illustrated in the diagram:

  1. First set up the AWS DeepLens device and download the object detection model onto the device. Then it will load the model, perform local inference, and send detected objects as MQTT messages to the AWS IoT platform.
  2. The MQTT messages are then sent to a Kinesis data stream by configuring an IoT rule.
  3. By using Kinesis Data Analytics on the Kinesis data stream, detected objects are aggregated and put into another Kinesis data stream for the Alexa custom skill to query.
  4. Upon the user’s request, the custom Alexa skill will invoke an AWS Lambda function, which will query the final Kinesis data stream and return list of objects detected by the AWS DeepLens device.

Implementation steps

The following sections walk through the implementation steps in detail.

Setting up DeepLens and deploying the built-in object detection model

  1. Open the AWS DeepLens console.
  2. Register your AWS DeepLens device if it’s not registered. You can follow this link for a step by step guide to register the device.
  3. Choose Create new project on the Projects page. On the Choose project type page, choose the Use a project template option, and select Object detection in the Project templates section.
  4. Choose Next to move to the Review
  5. In the Project detail page, give the project a name and then choose Create.
  6. Back on the Projects page, select the project that you created earlier, and click Deploy to device.
  7. Make sure the AWS DeepLens device is online. Then select the device to deploy, and choose Review. Review the deployment summary and then choose Deploy.
  8. The page will redirect to device detail page and at the top the page, deployment status is displayed. This process will take a few minutes, wait until the deployment is successful.
  9. After the deployment is complete, on the same page, scroll to the device details section and copy the MQTT topic from AWS IoT that the device is registered to. As mentioned, any object detected by the AWS DeepLens device will send the information to this topic.
  10. In the same section, choose the AWS IoT console link.
  11. On the AWS IoT MQTT Client test page, paste the MQTT topic copied earlier and choose Subscribe to topic.

 

  1. You should now see detected object messages flowing on the screen.

Setting up Kinesis Streams and Kinesis Analytics

  1. Open the Amazon Kinesis Data Streams console.
  2. Create a new Kinesis Data Stream. Give it a name that indicates it’s for raw incoming stream data—for example, RawStreamData. For Number of shards, type 1.
  3. Go back to the AWS IoT console and in the left navigation pane choose Act. Then choose Create to set up a rule to push MQTT messages from the AWS DeepLens device to the newly created Kinesis data stream.
  4. On the create rule page, give it a name. In the Message source section, for Attribute enter *. For Topic filter, enter the DeepLens device MQTT topic. Choose Add action.
  5. Choose Sends message to an Amazon Kinesis Stream, then click Configuration action. Select the Kinesis data stream created earlier, and for Partition key type ${newuuid()} in the text box. Then choose Create a new role or Update role and choose Add action. Choose Create rule to finish the setup.
  6. Now that the rule is set up, messages will be loaded into the Kinesis data stream. Now we can use Kinesis Data Analytics to aggregate the data and load the result to the final Kinesis data stream.
  7. Open the Amazon Kinesis Data Streams console.
  8. Create another new Kinesis Data Stream (follow instruction steps 1 and 2). Give it a name that indicates that it’s for aggregated incoming stream data—for example, AggregatedDataOutput. For Number of shards, type 1.
  9. In the Amazon Kinesis Data Analytics console, create a new application.
  10. Give it a name and choose Create application. In the source section, choose Connect stream data.
  11. Select Kinesis stream as Source and select the source stream created in step 2. Choose Discover schema to let Kinesis Data Streams to auto-discover the data schema.
  12. The Object Detection model deployment can detect up to 20 objects. However, AWS DeepLens might only detect a few of them depending on what’s in front of the camera. For example, in the screenshot below, the device only detects a chair and a person. Therefore, Kinesis auto discovery only detects the two objects as columns. You can add the other 18 objects manually to the schema by choosing the Edit schema button.
  13. On the schema page, add the rest of the objects then choose Save schema and update stream. Wait for the update to complete then click Exit (done).
  14. Scroll down to the bottom of the page then choose Save and continue.
  15. Back on the application configuration page, choose Go to SQL editor.
  16. Copy and paste the following SQL statement to the Analytics SQL window, and then choose Save and run SQL. After SQL finishes saving and running, choose Close at the bottom of the page. The SQL script aggregates each object detected in a 10 second tumbling window and stores them in the destination stream.
    CREATE OR REPLACE STREAM "TEMP_STREAM" ("objectName" varchar (40), "objectCount" integer);
    -- Creates an output stream and defines a schema
    CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" ("objectName" varchar (40), "objectCount" integer);
     
    CREATE OR REPLACE PUMP "STREAM_PUMP_1" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'person', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "person", STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "STREAM_PUMP_2" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'tvmonitor', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "tvmonitor", STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "STREAM_PUMP_4" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'sofa', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "sofa", STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "STREAM_PUMP_7" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'dog', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "dog",  STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "STREAM_PUMP_10" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'chair', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "chair",  STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "STREAM_PUMP_11" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'cat', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "cat",  STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "STREAM_PUMP_14" AS INSERT INTO "TEMP_STREAM"
    SELECT STREAM 'bottle', COUNT(*) AS "objectCount" FROM "SOURCE_SQL_STREAM_001"
    -- Uses a 10-second tumbling time window
    GROUP BY "bottle", STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
    
    CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "objectName", count(*) as "objectCount" FROM "TEMP_STREAM"
    GROUP BY "objectName", STEP("TEMP_STREAM".ROWTIME BY INTERVAL '10' SECOND)
    having count(*) >1;
  17. Back on application configuration page, choose Connect to a destination. In the Analytics destination section, make sure the Kinesis data stream (for example, AggregatedDataOutput) created in step 8 is selected, and enter DESTINATION_SQL_STREAM as the In-application stream name and select JSON as the output format of. Note that you can also easily have Kinesis Data Analytics send data directly to a Lambda function. That Lambda function would write to other data stores, such as Amazon DynamoDB. Then have the Alexa read from DynamoDB upon user request.
  18. Messages are now aggregated and loaded into the final Kinesis data stream (for example, AggregatedDataOutput) that was created in step 8. Next, you will create an Alexa custom skill with AWS Lambda.

Creating Alexa custom skill with AWS Lambda

  1. Open the AWS Lambda console and create a new function.
  2. The easiest way to create an Alexa skill is to create the function from the existing blue print provided by AWS Lambda and then overwrite the code with your own.
  3. It is a security best practice to enable Alexa Skill ID when using a Lambda function. If you have not created a skill for this yet, you can disable it for now and then re-enable it later by adding another Alexa trigger to the Lambda function.
  4. Copy the following Python code and replace the sample code provided by the blueprint. This code reads data from the Kinesis data stream and returns the result back to Alexa. Note: Change the default Northern Virginia AWS Region (us-east-1) in the code if you are following along in other Regions. Look for the following code to change region, kinesis = boto3.client(‘kinesis’, region_name=’us-east-1′)
    from __future__ import print_function
    import boto3
    import time
    import json
    import datetime   
    from datetime import timedelta
    
    def build_speechlet_response(title, output, reprompt_text, should_end_session):
        return {
            'outputSpeech': {
                'type': 'PlainText',
                'text': output
            },
            'card': {
                'type': 'Simple',
                'title': "SessionSpeechlet - " + title,
                'content': "SessionSpeechlet - " + output
            },
            'reprompt': {
                'outputSpeech': {
                    'type': 'PlainText',
                    'text': reprompt_text
                }
            },
            'shouldEndSession': should_end_session
        }
    def build_response(session_attributes, speechlet_response):
        return {
            'version': '1.0',
            'sessionAttributes': session_attributes,
            'response': speechlet_response
        }
    def get_welcome_response():
        """ If we wanted to initialize the session to have some attributes we could
        add those here
        """
    
        session_attributes = {}
        card_title = "Welcome"
        speech_output = "Hello, I can see now. You can ask me what I see around you right now."
        reprompt_text = "You can ask me what I see around you right now."
        should_end_session = False
        return build_response(session_attributes, build_speechlet_response(
            card_title, speech_output, reprompt_text, should_end_session))
    def handle_session_end_request():
        card_title = "Session Ended"
        speech_output = "Ok, I will close my eyes now. Have a nice day! "
        should_end_session = True
        return build_response({}, build_speechlet_response(
            card_title, speech_output, None, should_end_session))
    def create_object_names_attributes(objectNames, sequenceNumber):
        return {"objectNames": objectNames, "sequenceNumber":sequenceNumber}
    
    def get_objects(intent, session):
        session_attributes = {}
        reprompt_text = None
        speech_output=""
        should_end_session=False
        
        kinesis = boto3.client('kinesis', region_name='us-east-1')
        shard_id = 'shardId-000000000000'
        objectNames=[]
        sequenceNumber=""
    
        if session.get('attributes', {}) and "objectNames" in session.get('attributes', {}):
            objectNames = session['attributes']['objectNames']
        
        if session.get('attributes', {}) and "sequenceNumber" in session.get('attributes', {}):
                sequenceNumber = session['attributes']['sequenceNumber']
                
        if len(objectNames) ==0:
            if len(sequenceNumber)==0:   
                print("NO SEQUENCE NUMBER, GETTING OBJECTS DETECTED IN LAST FEW MINUTES")
                delta = timedelta(minutes=40)
                timestamp=datetime.datetime.now() - delta   
                shard_it = kinesis.get_shard_iterator(
                    StreamName='AggregatedDataOutput', 
                    ShardId=shard_id, 
                    ShardIteratorType='AT_TIMESTAMP',
                    Timestamp=timestamp)['ShardIterator']
            else:
                print("GOT SEQUENCE NUMBER, GETTING OBJECTS DETECTED - " + sequenceNumber)
                shard_it = kinesis.get_shard_iterator(
                    StreamName='AggregatedDataOutput', 
                    ShardId=shard_id, 
                    ShardIteratorType='AFTER_SEQUENCE_NUMBER',
                    StartingSequenceNumber=sequenceNumber)['ShardIterator']
                    
            while len(objectNames)<5:
                out = kinesis.get_records(ShardIterator=shard_it, Limit=2)
                shard_it = out["NextShardIterator"]
                if (len(out["Records"])) > 0:
                    theObject = json.loads(out["Records"][0]["Data"])
                    sequenceNumber = out["Records"][0]["SequenceNumber"]
                    #print ('sequenceNumber is ' + sequenceNumber)
                    objectName = theObject["objectName"]
                    objectCount = theObject["objectCount"]
                    if not (objectName in objectNames):
                        objectNames.append(objectName)
                else:
                    speech_output = "I currently do not see anything, is my deeplens on?"
                    break
                time.sleep(0.2)
                
        if len(objectNames) >0:
            objectName = objectNames[0]
            objectNames.remove(objectName)
            session_attributes = create_object_names_attributes(objectNames, sequenceNumber)
            speech_output = "I see " + objectName + "."
        print(speech_output)
    
        # Setting reprompt_text to None signifies that we do not want to reprompt
        # the user. If the user does not respond or says something that is not
        # understood, the session will end.
        return build_response(session_attributes, build_speechlet_response(
            intent['name'], speech_output, reprompt_text, should_end_session))
    
    # --------------- Events ------------------
    
    def on_session_started(session_started_request, session):
        """ Called when the session starts """
    
        print("on_session_started requestId=" + session_started_request['requestId']
              + ", sessionId=" + session['sessionId'])
    
    def on_launch(launch_request, session):
        """ Called when the user launches the skill without specifying what they
        want
        """
    
        print("on_launch requestId=" + launch_request['requestId'] +
              ", sessionId=" + session['sessionId'])
        # Dispatch to your skill's launch
        return get_welcome_response()
    
    def on_intent(intent_request, session):
        """ Called when the user specifies an intent for this skill """
    
        print("on_intent requestId=" + intent_request['requestId'] +
              ", sessionId=" + session['sessionId'])
    
        intent = intent_request['intent']
        intent_name = intent_request['intent']['name']
    
        # Dispatch to your skill's intent handlers
        if intent_name == "WhatDoYouSee":
            return get_objects(intent, session)
        elif intent_name == "AMAZON.HelpIntent":
            return get_welcome_response()
        elif intent_name == "AMAZON.CancelIntent" or intent_name == "AMAZON.StopIntent":
            return handle_session_end_request()
        else:
            raise ValueError("Invalid intent")
    
    def on_session_ended(session_ended_request, session):
        """ Called when the user ends the session.
    
        Is not called when the skill returns should_end_session=true
        """
        print("on_session_ended requestId=" + session_ended_request['requestId'] +
              ", sessionId=" + session['sessionId'])
        # add cleanup logic here
    
    # --------------- Main handler ------------------
    
    def lambda_handler(event, context):
        """ Route the incoming request based on type (LaunchRequest, IntentRequest,
        etc.) The JSON body of the request is provided in the event parameter.
        """
        print("event.session.application.applicationId=" +
              event['session']['application']['applicationId'])
    
        """
        Uncomment this if statement and populate with your skill's application ID to
        prevent someone else from configuring a skill that sends requests to this
        function.
        """
        # if (event['session']['application']['applicationId'] !=
        #         "amzn1.echo-sdk-ams.app.[unique-value-here]"):
        #     raise ValueError("Invalid Application ID")
    
        if event['session']['new']:
            on_session_started({'requestId': event['request']['requestId']},
                               event['session'])
    
        if event['request']['type'] == "LaunchRequest":
            return on_launch(event['request'], event['session'])
        elif event['request']['type'] == "IntentRequest":
            return on_intent(event['request'], event['session'])
        elif event['request']['type'] == "SessionEndedRequest":
            return on_session_ended(event['request'], event['session'])
    
    
    
    
    

Setting up an Alexa custom skill with AWS Lambda

  1. Open the Amazon Alexa Developer Portal, and choose Create a new custom skill.
  2. You can upload the JSON document in the Alexa Skill JSON Editor to automatically configure intents and sample utterances for the custom skill. Be sure to click Save Model to apply the changes.
    {
        "interactionModel": {
            "languageModel": {
                "invocationName": "deep lens demo",
                "intents": [
                    {
                        "name": "WhatDoYouSee",
                        "slots": [],
                        "samples": [
                            "anything else",
                            "do you see anything else",
                            "what else do you see",
                            "what else",
                            "how about now",
                            "what do you see around me",
                            "deep lens",
                            "what do you see",
                            "tell me what you see",
                            "what are you seeing",
                            "yes",
                            "yup",
                            "sure",
                            "yes please"
                        ]
                    },
                    {
                        "name": "AMAZON.HelpIntent",
                        "samples": []
                    },
                    {
                        "name": "AMAZON.StopIntent",
                        "samples": []
                    }
                ],
                "types": []
            }
        }
    }
    

  3. Finally, in the endpoint section, enter the AWS Lambda function’s Amazon Resource Name (ARN) that’s created. Your Alexa Skill is set and ready to tell you the objects detected by the AWS DeepLens device. You can wake Alexa up by saying “Alexa, open Deep Lens demo” and then ask Alexa questions such as “What do you see?” Alexa should return an answer such as “I see a person” or “I see a chair,” etc.

Conclusion

In this blog post, you learned how to set up the AWS DeepLens device and deploy the built-in object detection model from the DeepLens console. Then, you could use AWS DeepLens to perform real-time object detection. Objects detected on the AWS DeepLens device were sent to the AWS IoT platform which then forwarded them to an Amazon Kinesis data stream. You also learned how to use Amazon Kinesis Data Analytics to aggregate duplicate objects detected in the stream and push them into another Kinesis data stream. Finally, you created a custom Alexa skill with AWS Lambda to retrieve the detected objects in the Kinesis data stream and return the result back to the users from Alexa.


About the authors

Tristan Li is a Solutions Architect with Amazon Web Services. He works with enterprise customers in the US, helping them adopt cloud technology to build scalable and secure solutions on AWS.

 

 

 

 

Grant McCarthy is an Enterprise Solutions Architect with Amazon Web Services, based in Charlotte, NC. His primary role is assisting his customers move workloads to the cloud securely and ensuring that the workloads are architected in way that aligns with AWS best practices.