Writing a Voice Activated SharePoint Todo List - IoT App on RPi

Ever wanted to write a voice activated system on an IoT device to keep track of your “todo list”, hear your commands being played back, and have the system send you a text message with your todo list when it’s time to walk out the door?  Well, I did. In this blog post, I will provide a high level overview of the technologies I used, why I used them, a few things I learned along the way, and partial code to assist with your learning curve if you decide to jump on this.  I also had the pleasure of demonstrating this prototype at Microsoft’s Community Connections in Atlanta in front of my colleagues.

How It Works

I wanted to build a system using 2 Raspberry Pis (one running Windows 10 IoT Core, and another running Raspbian) that achieved the following objectives:

  • Have two RPis communicate through the Azure Service Bus
    This was an objective of mine, not necessarily a requirement; the intent was to have two RPis running different operating systems communicate asynchronously without sharing the same network.
  • Learn about the Microsoft Speech Recognition SDK
    I didn't want to send data to the cloud for speech recognition, so I needed an SDK on the RPi to perform this function; I chose the Microsoft Speech Recognition SDK for this purpose.
  • Communicate with multiple cloud services without any SDK so that I could program the same way on Windows and Raspbian (Twilio, Azure Service Bus, Azure Table, SharePoint Online)
    I also wanted to minimize the learning curve of finding which SDKs could run on both Windows 10 IoT Core and Raspbian (Linux), so I used Enzo Unified to abstract the APIs and instead send simple HTTPS commands, giving me an SDK-less development environment (except for the Speech Recognition SDK). Seriously… go find an SDK for SharePoint Online that runs on Raspbian and UWP (Windows 10 IoT Core).
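To make the SDK-less approach concrete, here is a minimal sketch in Python of what such a call looks like: a plain HTTPS request carrying a few headers. The endpoint path and header names match the SharePoint example shown later in this post; the URI and token values are placeholders you would replace with your own.

```python
# Placeholders: point these at your own Enzo Unified instance and auth token
ENZO_URI = "https://myenzo.example.com:9550"
ENZO_AUTH_GUID = "00000000-0000-0000-0000-000000000000"

def build_enzo_request(list_name, title):
    """Build the URL and headers for an Enzo Unified 'add SharePoint list item' call.
    All parameters travel as HTTP headers; no client-side SDK is involved."""
    headers = {
        "authToken": ENZO_AUTH_GUID,
        "name": list_name,
        "data": "<root><Title>%s</Title></root>" % title,
    }
    return ENZO_URI + "/bsc/sharepoint/addlistitemraw", headers

# Sending it is then a single HTTPS call from any platform, e.g. with the requests package:
# url, headers = build_enzo_request("Tasks", "Buy milk")
# requests.post(url, headers=headers, verify=False)  # verify=False if using a self-signed cert
```

Because the request is plain HTTPS, the exact same call works from C# on Windows 10 IoT Core and from Python on Raspbian.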

The overall solution looks like this:



In order to achieve the above objectives, I used the following bill of materials:

Technology | Comment | Link
2x Raspberry Pi 2 Model B | Note that one RPi runs Windows 10 IoT Core, and the other runs Raspbian | http://amzn.to/2qnM6w7
Microphone | I tried a few, but the best one I found for this project was the Mini AKIRO USB Microphone | http://amzn.to/2pGbBtP
Speaker | I also tried a few, and while there is a problem with this speaker on RPi and Windows, the Logitech Z50 was the better one | http://amzn.to/2qrNkop
USB Keyboard | I needed a simple keyboard and mouse while traveling, so I picked up the iPazzPort Mini Keyboard; awesome… | http://amzn.to/2rm0FOh
Monitor | You can use an existing monitor, but I used the portable ATian 7 inch display. A bit small, but it does the job. | http://amzn.to/2pQ5She
IoT Dashboard | Utility that allows you to manage your RPis running Windows; make absolutely sure you run the latest build; it should upgrade automatically, but mine didn't. | http://bit.ly/2rmCWOU
Windows 10 IoT Core | The Microsoft O/S used on one of the RPis; use the latest build (mine was 15063); if you are looking for instructions on how to install Windows from a command prompt, the link provided proved useful. | http://bit.ly/2pG9gik
Raspbian | Your RPi may be delivered with an SD card preloaded with the necessary utilities to install Raspbian; connecting to a wired network makes the installation a breeze. | http://bit.ly/2rbnp7u
Visual Studio 2015 | I used VS2015 and C# to build the prototype for the Windows 10 IoT Core RPi. | http://bit.ly/2e6ZGj5
Python 3 | On the Raspbian RPi, I coded in Python 3. | http://bit.ly/1L2Ubdb
Enzo Unified | I installed and configured an Enzo Unified instance (version 1.7) in the Azure cloud; for Enzo to talk to SharePoint Online, Twilio, Azure Service Bus and Azure Storage, I also needed accounts with these providers. You can try Enzo Unified free for 30 days. | http://bit.ly/2rm4ymt


Things to Know

Creating a prototype involving the above technologies will inevitably lead you to collect a few nuggets along the way. Here are a few.

Disable Windows 10 IoT Core Updates

While disabling updates is generally not recommended, IoT projects usually require a predictable environment that does not reboot in the middle of a presentation. To disable Windows Update on this O/S, I used information published by Mike Branstein on his blog: http://bit.ly/2rcOXt9

Try different hardware, and keep your receipts…

I had to try a few different components to find the right ones. The normally recommended S-150 USB Logitech speakers did not work for me; I lost all my USB ports and network connectivity as soon as I plugged them in. Neither did the JLab USB Laptop speakers. I also tried the 7.1 Channel USB External Sound Card but was unable to make it work (others were successful). For audio input, I also tried the VAlinks Mini Flexible USB microphone; while it worked well, it picked up too much noise compared to the AKIRO and became almost unusable in a room with 20 people and background noise.

Hotel WiFi Connections

This was one of the most frustrating parts of this whole experience on Windows 10 IoT Core. You should know that this operating system does not currently come equipped with a browser. This means that you cannot easily connect to a hotel network, since doing so usually requires starting a browser to enter a user id and password provided by the hotel. Furthermore, since there is also no way to "forget" a previously registered network, you can find yourself in a serious bind… I first purchased the Skyroam Mobile Hotspot, hoping it would provide the answer; unfortunately, the only time I tried it, in Tampa, Florida, the device could not obtain a connection. So I ended up adding a browser object to my UWP application and forcing it to refresh a specific image every time the app starts; this forces the hotel login page to show up when needed. I am still looking for a good solution to this problem.
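As an aside, the trick the embedded browser relies on can be reproduced by hand: fetch a URL whose content is known, and if something else comes back, a captive portal has intercepted the request. The sketch below is not part of the prototype; it uses the connectivity-check URL that Windows itself probes, and the network call is only shown in comments:

```python
import urllib.request

# Windows' own connectivity probe; it normally returns a small plain-text marker
PROBE_URL = "http://www.msftconnecttest.com/connecttest.txt"
EXPECTED = "Microsoft Connect Test"

def looks_captive(body, expected=EXPECTED):
    """A captive portal intercepts the probe and returns its login page
    instead of the expected marker text."""
    return expected not in body

# On a live network you would check it like this:
# body = urllib.request.urlopen(PROBE_URL, timeout=5).read().decode("utf-8", "replace")
# if looks_captive(body):
#     print("Captive portal detected - open the hotel login page")
```

This is essentially what the refreshed image in my UWP app accomplishes: forcing an HTTP request so the portal has a chance to hijack it.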

Speech Privacy Policy on Windows

Because parts of the code I am running leverage the underlying APIs of Cortana, you must accept the Cortana privacy policy. This is required only the first time you run the application, but it is obviously a major nightmare for applications you may want to ship; I am not aware of any programmatic workaround at this time. This stackoverflow post provides information about this policy and how to accept it.

What It Looks Like

A picture is worth a thousand words… so here is the complete setup:


C# Code

Since this is an ongoing prototype I will not share the complete code at this time; however I will share a few key components/techniques I used to make this work.

Speech Recognition

I used both continuous dictation speech recognition, and grammar-based recognition from the Microsoft Speech Recognition API. The difference is that the first one gives you the ability to listen to “anything” being said, and the other will only give you a set of results that match the expected grammar. Both methods give you a degree of confidence so you can decide if the command/text input was sufficiently clear. The following class provides a mechanism for detecting input either through continuous dictation or using a grammar file. The timeout ensures that you do not wait forever. This code also returns the confidence level of the capture.


using Enzo.UWP;
using System;
using System.Collections.Generic;

using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using Windows.ApplicationModel;
using Windows.Devices.Gpio;
using Windows.Media.SpeechRecognition;
using Windows.Media.SpeechSynthesis;
using Windows.Storage;

namespace ClientIoT
{
    public class VoiceResponse
    {
        public string Response = null;
        public double RawConfidence = 0;
    }

    public class VoiceInput
    {
        private const int SPEECH_TIMEOUT = 3;
        private string lastInput = "";
        private double lastRawConfidence = 0;
        private bool completed = false;
        private bool success = false;

        public async Task<VoiceResponse> WaitForText(string grammarFile)
        {
            return await WaitForText(SPEECH_TIMEOUT, grammarFile);
        }

        public async Task<VoiceResponse> WaitForText(int timeout = SPEECH_TIMEOUT, string grammarFile = null)
        {
            var resp = new VoiceResponse();
            try
            {
                success = false;
                completed = false;
                lastInput = "";
                lastRawConfidence = 0;

                SpeechRecognizer recognizerInput = new SpeechRecognizer();
                recognizerInput.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_InputResultGenerated;
                recognizerInput.StateChanged += InputRecognizerStateChanged;
                recognizerInput.Timeouts.BabbleTimeout = TimeSpan.FromSeconds(timeout);
                recognizerInput.ContinuousRecognitionSession.Completed += ContinuousRecognitionSession_Completed;
                recognizerInput.ContinuousRecognitionSession.AutoStopSilenceTimeout = TimeSpan.FromSeconds(timeout);

                if (grammarFile != null)
                {
                    // Constrain recognition to the phrases declared in the grammar file
                    StorageFile grammarContentFile = await Package.Current.InstalledLocation.GetFileAsync(grammarFile);
                    SpeechRecognitionGrammarFileConstraint grammarConstraint = new SpeechRecognitionGrammarFileConstraint(grammarContentFile);
                    recognizerInput.Constraints.Add(grammarConstraint);
                }

                var compilationResult = await recognizerInput.CompileConstraintsAsync();
                if (compilationResult.Status != SpeechRecognitionResultStatus.Success)
                {
                    Debug.WriteLine(" ** VOICEINPUT - VoiceCompilationError - Status: " + compilationResult.Status);
                }

                recognizerInput.RecognitionQualityDegrading += RecognizerInput_RecognitionQualityDegrading;
                await recognizerInput.ContinuousRecognitionSession.StartAsync();

                // Block until the session reports completion, or give up shortly after the timeout
                System.Threading.SpinWait.SpinUntil(() => completed, TimeSpan.FromSeconds(timeout + 2));

                resp = new VoiceResponse() { Response = lastInput, RawConfidence = lastRawConfidence };

                try
                {
                    recognizerInput.Dispose();
                    recognizerInput = null;
                }
                catch (Exception ex)
                {
                    Debug.WriteLine("** WaitForText (1) - Dispose ** " + ex.Message);
                }
            }
            catch (Exception ex2)
            {
                Debug.WriteLine("** WaitForText ** " + ex2.Message);
            }
            return resp;
        }

        private void RecognizerInput_RecognitionQualityDegrading(SpeechRecognizer sender, SpeechRecognitionQualityDegradingEventArgs args)
        {
            Debug.WriteLine("VOICE INPUT - QUALITY ISSUE: " + args.Problem.ToString());
        }

        private void ContinuousRecognitionSession_Completed(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionCompletedEventArgs args)
        {
            if (args.Status == SpeechRecognitionResultStatus.Success
                || args.Status == SpeechRecognitionResultStatus.TimeoutExceeded)
            {
                success = true;
            }
            completed = true;
        }

        private void ContinuousRecognitionSession_InputResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
        {
            try
            {
                lastInput = "";
                if ((args.Result.Text ?? "").Length > 0)
                {
                    lastInput = args.Result.Text;
                    lastRawConfidence = args.Result.RawConfidence;
                    Debug.WriteLine(" " + lastInput);
                }
            }
            catch (Exception ex)
            {
                Debug.WriteLine("** ContinuousRecognitionSession_InputResultGenerated ** " + ex.Message);
            }
        }

        private void InputRecognizerStateChanged(SpeechRecognizer sender, SpeechRecognizerStateChangedEventArgs args)
        {
            Debug.WriteLine("  Input Speech recognizer state: " + args.State.ToString());
        }
    }
}
For example, if you want to wait for a “yes/no” confirmation, with a 3 second timeout, you would call the above code as such:

var yesNoResponse = await (new VoiceInput()).WaitForText(3, YESNO_FILE);

And the yes/no grammar file looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<grammar version="1.0" xml:lang="en-US" root="root"
         xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="root">
      <ruleref uri="#enzoCommands"/>
  </rule>

  <rule id="enzoCommands">
    <one-of>
      <item> yes </item>
      <item> yep </item>
      <item> yeah </item>
      <item> no </item>
      <item> nope </item>
      <item> nah </item>
    </one-of>
  </rule>

</grammar>

Calling Enzo Unified using HTTPS to Add a SharePoint Item

Another important part of the code is its ability to interact with other services through Enzo Unified, so that no SDK is needed on the UWP application. For an overview on how to access SharePoint Online through Enzo Unified, see this previous blog post.

The following code shows how to easily add an item to a SharePoint list through Enzo Unified. Posting this request to Enzo requires two parameters (added as headers) called “name” and “data” (data is an XML string containing the column names and values to be added as a list item).

public static async Task SharePointAddItem(string listName, string item)
{
    string enzoCommand = "/bsc/sharepoint/addlistitemraw";
    List<KeyValuePair<string, string>> headers = new List<KeyValuePair<string, string>>();

    string data = string.Format("<root><Title>{0}</Title></root>", item);

    headers.Add(new KeyValuePair<string, string>("name", listName));
    headers.Add(new KeyValuePair<string, string>("data", data));

    await SendRequestAsync(HttpMethod.Post, enzoCommand, headers);
}

And the SendRequestAsync method below shows you how to call Enzo Unified. Note that I added two cache control filters to avoid HTTP caching, and additional flags for calling Enzo Unified on an HTTPS port where a self-signed certificate is installed.

private static async Task<string> SendRequestAsync(HttpMethod method, string enzoCommand, List<KeyValuePair<string, string>> headers)
{
    string output = "";
    var request = EnzoUnifiedRESTLogin.BuildHttpWebRequest(method, enzoCommand, headers);

    // Disable HTTP caching so every call actually reaches Enzo Unified
    var filter = new Windows.Web.Http.Filters.HttpBaseProtocolFilter();
    filter.CacheControl.ReadBehavior = Windows.Web.Http.Filters.HttpCacheReadBehavior.MostRecent;
    filter.CacheControl.WriteBehavior = Windows.Web.Http.Filters.HttpCacheWriteBehavior.NoCache;

    // Accept the self-signed certificate installed on the Enzo HTTPS port
    filter.IgnorableServerCertificateErrors.Add(Windows.Security.Cryptography.Certificates.ChainValidationResult.Untrusted);
    filter.IgnorableServerCertificateErrors.Add(Windows.Security.Cryptography.Certificates.ChainValidationResult.InvalidName);

    Windows.Web.Http.HttpClient httpClient = new Windows.Web.Http.HttpClient(filter);

    try
    {
        using (var response = await httpClient.SendRequestAsync(request))
        {
            output = await response.Content.ReadAsStringAsync();
        }
    }
    catch (Exception ex)
    {
        System.Diagnostics.Debug.WriteLine(" ** Send Http request error: " + ex.Message);
    }
    return output;
}

Last but not least, the BuildHttpWebRequest method looks like this; it ensures that the proper authentication headers are added, along with the authentication identifier for Enzo:

public static Windows.Web.Http.HttpRequestMessage BuildHttpWebRequest(Windows.Web.Http.HttpMethod httpmethod, string uri, List<KeyValuePair<string, string>> headers)
{
    bool hasClientAuth = false;

    Windows.Web.Http.HttpRequestMessage request = new Windows.Web.Http.HttpRequestMessage();
    request.Method = httpmethod;
    request.RequestUri = new Uri(ENZO_URI + uri);

    if (headers != null && headers.Count > 0)
    {
        foreach (KeyValuePair<string, string> hdr in headers)
        {
            request.Headers[hdr.Key] = hdr.Value;
        }
    }

    // Add the Enzo authentication token unless client certificate authentication is used
    if (!hasClientAuth)
    {
        request.Headers["authToken"] = ENZO_AUTH_GUID;
    }

    return request;
}

Text to Speech

There is also the Text to Speech aspect, where the system speaks back what it heard before confirming and acting on the command. Playing back is actually a bit strange in that it requires a UI thread. In addition, Windows 10 IoT Core and the Raspberry Pi don't seem to play nice together: every time a playback occurs, a loud tick can be heard before and after it. Using USB speakers appears to be a solution, but none worked for me. The code below simply plays back a specific text and waits a little while to give the playback enough time to finish (the playback call is non-blocking, so the SpinWait pauses the caller until the estimated completion of the playback).

private async Task Say(string text)
{
    SpeechSynthesisStream ssstream = null;

    try
    {
        SpeechSynthesizer ss = new SpeechSynthesizer();
        ssstream = await ss.SynthesizeTextToStreamAsync(text);
    }
    catch (Exception exSay)
    {
        Debug.WriteLine(" ** SPEECH ERROR (1) ** - " + exSay.Message);
    }

    // Playback must occur on the UI thread
    var task1 = this.Dispatcher.RunAsync(Windows.UI.Core.CoreDispatcherPriority.Normal, async () =>
    {
        try
        {
            await media.PlayStreamAsync(ssstream);
        }
        catch (Exception exSay)
        {
            Debug.WriteLine(" ** SPEECH ERROR (2) ** - " + exSay.Message);
        }
    });

    // Wait a little for the speech to complete (rough estimate: 150 ms per character)
    System.Threading.SpinWait.SpinUntil(() => 1 == 0, text.Length * 150);
}

Calling the above code is trivial:

await Say("I am listening");



Python Code

The Python code was trivial to build; this RPi was responsible for monitoring events in the Azure Service Bus and turning the attached LED on and off. The following pseudo code shows the polling loop that calls Enzo Unified from Python without using any SDK (the actual endpoint and headers are omitted here):

import time
import requests

while True:
    # Poll Enzo Unified over plain HTTPS (endpoint and headers omitted)
    resp = requests.get(...).json()
    if len(resp['data']['Table1']) > 0:
        # extract response here…
        pass
    time.sleep(1)


Conclusion

This prototype demonstrated that, while there were a few technical challenges along the way, it was relatively simple to build a voice-activated system that understands spoken commands using Windows 10 IoT Core, .NET, and the Microsoft Speech Recognition SDK.

Furthermore, the intent of this project was also to demonstrate that Enzo Unified makes it possible to code against multiple services without the need for an SDK on the client side, regardless of the platform and the development language. Abstracting SDKs behind simple HTTP calls makes it possible to access Twilio, SharePoint Online, Azure services, and much more without any additional libraries on the client system.

About Herve Roggero

Herve Roggero, Microsoft Azure MVP, @hroggero, is the founder of Enzo Unified (http://www.enzounified.com/). Herve's experience includes software development, architecture, database administration and senior management with both global corporations and startup companies. Herve holds multiple certifications, including an MCDBA, MCSE, MCSD. He also holds a Master's degree in Business Administration from Indiana University. Herve is the co-author of "PRO SQL Azure" and “PRO SQL Server 2012 Practices” from Apress, a PluralSight author, and runs the Azure Florida Association.