Virtual Barista

In the summer of 2019, AWS introduced me to Globe, a major telco in the Philippines, which had been looking for someone with experience in Sumerian. Having developed several prototypes, a simple (but difficult) game, and an e-training solution using Sumerian, I was well placed to meet their needs.

The premise was simple: use a 3D virtual host in Sumerian to guide customers through a series of steps to order a beverage of their choice with an optional add-on.

AWS Sumerian

Sumerian is AWS’s AR/VR service. It’s essentially a JavaScript 3D engine with an editor in the AWS console. You can import your 3D models, apply lighting, animate and, of course, integrate any number of other services using JavaScript. The stars of the product are the virtual hosts: 3D characters created by AWS, 4 female and 4 male, which you can combine in any number of ways with Polly voices. One of the common gripes is that they are not particularly customizable. Each host has 4 variations. One cannot be changed at all; the others, wearing a hoodie, polo or T-shirt, can have the texture of the top customized. At best, this lets you pick the colour and add a logo. What’s great is that the hosts integrate seamlessly with services such as Polly and Lex to deliver lip-synced speech and natural conversation, including several animated gestures that further accentuate the message they are delivering.


The purpose of this virtual host was to wow businesses visiting the telco’s 5G launch by showing them what the future of retail could look like when powered by fast wireless connectivity. To that end, I suggested adding an IoT bulb (LIFX, to be precise), a WiFi-connected RGB lamp that can be controlled via an API. I have several at home, and one of my earlier Sumerian prototypes was a 3D map of my home through which I could control the power and colour of the lights in each room.

One of the challenges of the project was the 5-week deadline, starting from scratch, and for one of those weeks I was going to be overseas delivering training in Cambodia. On the other hand, this was going to be a one-time event with about 250 people, which meant we could simplify certain things. For example, a CMS was not needed; the menu items could be hard-coded, which would save considerable time.

The final solution was going to run on a large touch-screen TV in portrait mode to give the impression that customers were standing in front of a real person. Other requests from Globe were face tracking to make the host look at the customer, gender recognition for personalized content, and speech-to-text so customers could talk to the host to place their order. They wanted to personalize the responses with each customer’s name. I suggested that we have multiple spoken lines for each step so that we could randomly select one, reducing repetition and making the experience feel more spontaneous and human. They needed a simple UI for the barista to see incoming orders and update their status. They also wanted to trigger an SMS to the customer when their beverage was ready, so we would need to collect the customer’s mobile number too.
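
The random line selection with name personalization can be sketched roughly as follows. This is a minimal illustration, not the code used at the event: the example lines, the `{name}` placeholder and the `pickLine` helper are all hypothetical.

```typescript
// Hypothetical line variants for one step of the order flow.
const greetingLines: string[] = [
  "Hi {name}, what can I get you today?",
  "Welcome, {name}! What would you like to drink?",
  "Good to see you, {name}. What are you having?",
];

// Pick one variant at random and substitute the customer's name.
// The random source is injectable so the behaviour can be tested.
function pickLine(
  lines: string[],
  customerName: string,
  rng: () => number = Math.random
): string {
  const index = Math.min(lines.length - 1, Math.floor(rng() * lines.length));
  const template = lines[index];
  if (template === undefined) {
    throw new Error("at least one line variant is required");
  }
  // split/join avoids special "$" handling in String.replace replacements.
  return template.split("{name}").join(customerName);
}
```

The returned string would then be handed to the host's speech component, so each customer hears a slightly different greeting.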

My WiFi light was added at this stage too: we would assign a random colour to each customer’s order and light up the bulb with that colour when their beverage was ready.

Lastly, one more addition that I proposed was a remote-control UI which would be helpful for testing and manually intervening or supporting should any issues occur (easier than dealing with them on the TV directly).

So, in summary:

  • Sumerian virtual host with a branded polo shirt
  • Delivering text-to-speech using Polly
  • A custom, Globe-branded cafe backdrop and 3D representations of the order items, designed and textured by Vithereal
  • We needed to collect user name and phone number, and assess their gender
  • The order options consisted of 3 beverages and 3 addons (+ “none”)
  • The solution would be running in a browser on a touch screen TV in portrait mode
  • Speech-to-text to allow users to talk to the host
  • SMS to notify the user that their order was ready
  • LIFX WiFi LED integration to notify the user that their order was ready
  • GUI for the orders
  • GUI for the remote control
  • Websockets for communicating events between the 3 interfaces


Considering this was a rapid prototype and I would be doing the bulk of the work myself, I drew up a simple high-level architecture just to help organize my thoughts:

[Diagram: virtual host architecture]

Key components

Facial Recognition.
For this I used jslib/etc, which was based on an earlier project by AWS. Their project only showed the head and shoulders of the host, whereas for Globe we would be showing the full body, which meant quite a bit of tweaking to get it to work right. I also added a feature to track the size of the face, which would automatically start the experience if a face was detected within a certain range. I still think there are a lot more improvements that could be made to the face tracking, but we were pressed for time. As we were already running the webcam feed through a canvas for the tracking, it was relatively simple to grab a frame from there and pass it to AWS Rekognition for face analysis, which gave us the customer’s gender.
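
As a rough sketch of the face-size trigger, assuming the tracker reports a bounding box for each video frame: the `FaceBox` shape, the ratio thresholds and the consecutive-frame debounce below are all assumptions made for illustration, not the values or logic used in the project.

```typescript
// Hypothetical bounding box reported by the face tracker, in pixels.
interface FaceBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumed thresholds: face height as a fraction of the frame height.
const MIN_FACE_RATIO = 0.15; // smaller than this: too far away
const MAX_FACE_RATIO = 0.6; // larger than this: too close to the camera

// Use the apparent face size as a proxy for distance from the screen.
function isFaceInRange(
  face: FaceBox,
  frameHeight: number,
  minRatio: number = MIN_FACE_RATIO,
  maxRatio: number = MAX_FACE_RATIO
): boolean {
  if (frameHeight <= 0) {
    return false;
  }
  const ratio = face.height / frameHeight;
  return ratio >= minRatio && ratio <= maxRatio;
}

// Assumed debounce: require several consecutive in-range frames so that
// someone walking past the booth does not start the experience.
class AutoStartTrigger {
  private streak = 0;
  private framesRequired: number;

  constructor(framesRequired: number = 10) {
    this.framesRequired = framesRequired;
  }

  // Call once per tracked frame; returns true once the experience should start.
  update(face: FaceBox | null, frameHeight: number): boolean {
    if (face !== null && isFaceInRange(face, frameHeight)) {
      this.streak += 1;
    } else {
      this.streak = 0;
    }
    return this.streak >= this.framesRequired;
  }
}
```

Measuring the face against the frame height rather than in absolute pixels keeps the thresholds independent of the webcam resolution.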

For speech recognition, AWS Lex might be the obvious first choice, but it is still only available in the USA, which means the latency can be very disruptive to the experience. It was also a bit overkill for the very confined scope of this project. I decided instead to use the Google Chrome speech-to-text API. As it is not yet an official standard, it’s only available in Chrome, but as this would be running in a controlled environment that could run Chrome, it wasn’t an issue. I asked Slash for help to develop a class for this. It needed to recognize:

  • Names: these proved very difficult with local names. The API tended to gravitate towards common English names and dictionary words. It could potentially be made more accurate by adding a database of local names.
  • Telephone numbers: the API has built-in support for these. You need to remember to set the right locale to get the correct format and number length. It still proved inaccurate depending on the pauses people left between digits when saying their number. Most frustratingly, the API forces the phone number type when it recognizes numbers, which meant we could not add our own processing logic to it like we did with lists. Eventually we decided to remove speech recognition for this step and give the user a simple on-screen numeric keypad so they could enter the number by touch.
  • Lists: this was our very light version of Lex. The class could be configured to listen and then intelligently compare the speech-to-text result with a list of words and aliases to find the best match. This worked really well, and it was quite easy to make it more accurate by looking at the speech-to-text results for different accents and adding those results as aliases to the relevant option.
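
The list matching idea can be illustrated with a small sketch. The actual comparison logic of the class is not described here, so this version swaps in a simple edit-distance similarity over sliding word windows. The menu options, aliases, threshold and function names are all hypothetical.

```typescript
// Hypothetical menu option with the aliases collected from real transcripts.
interface MenuOption {
  id: string;
  aliases: string[];
}

// Lower-case and strip punctuation (ASCII only, which is an assumption).
function normalize(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9\s]/g, " ").replace(/\s+/g, " ").trim();
}

// Classic Levenshtein edit distance using a rolling row.
function editDistance(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) {
    prev.push(j);
  }
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1], where 1 means identical strings.
function similarity(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : 1 - editDistance(a, b) / longest;
}

// Slide each alias over the transcript word by word and keep the best score.
// Returns null when nothing reaches the (assumed) threshold.
function bestMatch(
  transcript: string,
  options: MenuOption[],
  threshold: number = 0.75
): MenuOption | null {
  const words = normalize(transcript).split(" ").filter((w) => w.length > 0);
  let best: MenuOption | null = null;
  let bestScore = 0;
  for (const option of options) {
    for (const alias of option.aliases) {
      const target = normalize(alias);
      const size = target.split(" ").length;
      for (let start = 0; start + size <= words.length; start++) {
        const candidate = words.slice(start, start + size).join(" ");
        const score = similarity(candidate, target);
        if (score > bestScore) {
          bestScore = score;
          best = option;
        }
      }
    }
  }
  return bestScore >= threshold ? best : null;
}

// Hypothetical menu; "late" and "ice t" stand in for misheard transcripts.
const menuOptions: MenuOption[] = [
  { id: "latte", aliases: ["latte", "lah tay", "late"] },
  { id: "iced-tea", aliases: ["iced tea", "ice tea", "ice t"] },
  { id: "hot-chocolate", aliases: ["hot chocolate", "hot choco"] },
];
```

With this shape, improving accuracy for a new accent is just a matter of appending whatever the speech-to-text returned to the `aliases` array of the right option.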

The different interfaces
I used API Gateway to set up WebSockets, with Lambda as the handler for the WebSocket events and for tracking the sessions. I still need to turn the code into a nicely documented class, but it lets me register events to listen for in any interface and attach functions to execute when they are triggered, and send events from any interface or from the backend Lambdas to all interfaces, to specific groups or to specific interfaces.
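
A client-side event registry along these lines might look like the sketch below. It is a guess at the general shape rather than the project's code: the envelope format, the `Target` variants and the `EventBus` name are assumptions, and the routing that the Lambda performs on the backend is not shown.

```typescript
// Assumed addressing scheme: everyone, a named group, or one interface.
type Target =
  | { kind: "all" }
  | { kind: "group"; group: string }
  | { kind: "interface"; id: string };

// Assumed wire format for messages passed over the WebSocket.
interface Envelope {
  event: string;
  payload: unknown;
  target: Target;
}

type Handler = (payload: unknown) => void;

class EventBus {
  // Prototype-free object so event names like "constructor" are safe keys.
  private handlers: Record<string, Handler[]> = Object.create(null);
  private send: (raw: string) => void;

  // The transport is injected, e.g. (raw) => socket.send(raw) in a browser.
  constructor(send: (raw: string) => void) {
    this.send = send;
  }

  // Register a function to run whenever the named event arrives.
  on(event: string, handler: Handler): void {
    const list = this.handlers[event] || [];
    list.push(handler);
    this.handlers[event] = list;
  }

  // Send an event; the backend is assumed to route it based on `target`.
  emit(event: string, payload: unknown, target: Target = { kind: "all" }): void {
    const envelope: Envelope = { event, payload, target };
    this.send(JSON.stringify(envelope));
  }

  // Feed raw incoming messages here, e.g. from the socket's message event.
  receive(raw: string): void {
    const envelope = JSON.parse(raw) as Envelope;
    for (const handler of this.handlers[envelope.event] || []) {
      handler(envelope.payload);
    }
  }
}
```

Keeping the transport outside the class means the same registry could be used in the TV, barista and remote interfaces, each wiring `send` and `receive` to its own WebSocket connection.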

The remote interface was a lot of fun. I could push any text to the avatar and trigger gestures, as well as manually progress through any of the order stages as needed. The ability to push text let me add impromptu sentences to the experience for different customers, commenting on specific clothing or other features, which completely blew their minds in most cases. I especially had some fun with one of the executives who cheekily tried to order an alcoholic beverage.

I used AWS SNS for the SMS notification. Sadly, the Philippines did not seem to support custom sender IDs. The WiFi LED has its own RESTful API offered by LIFX, so that was quite easy to trigger from the Lambda backend. Lastly, I used DynamoDB for the orders and sessions. This ensured the entire solution was serverless.
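
To give an idea of the light integration, here is a sketch that assigns a random colour to an order and builds the request for the LIFX HTTP API's "set state" endpoint. The helper names, the `all` selector and the token handling are assumptions. The sketch only constructs the request and leaves the actual HTTP call to the caller.

```typescript
// Plain description of an HTTP request, so it can be built without sending it.
interface HttpRequest {
  url: string;
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Generate a random "#rrggbb" colour to attach to a customer's order.
// The random source is injectable so the behaviour can be tested.
function randomOrderColour(rng: () => number = Math.random): string {
  let colour = "#";
  for (let channel = 0; channel < 3; channel++) {
    const value = Math.min(255, Math.floor(rng() * 256));
    colour += ("0" + value.toString(16)).slice(-2);
  }
  return colour;
}

// Build a request that switches the bulb on with the order's colour.
// `selector` picks the bulb(s); "all" is an assumption for a single-bulb setup.
function buildLifxStateRequest(
  apiToken: string,
  colour: string,
  selector: string = "all"
): HttpRequest {
  return {
    url: "https://api.lifx.com/v1/lights/" + encodeURIComponent(selector) + "/state",
    method: "PUT",
    headers: {
      Authorization: "Bearer " + apiToken,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ power: "on", color: colour }),
  };
}
```

In a Lambda handler the resulting object could be passed to whichever HTTP client is available, for example `fetch(request.url, { method: request.method, headers: request.headers, body: request.body })`, once the order status changes to ready.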


The event went very well. At times the event music was a bit loud and caused issues with the speech-to-text; luckily we had the remote, so we could help it along manually when needed without anyone noticing. We had great feedback and, I think, the longest queue of any booth there. It was an absolute pleasure to work with the people at Globe on this, and I look forward to many more projects of this nature with them in the future.