Behind The Scenes of the Personal Voice feature in iOS 17

Deconstructing the Personal Voice Feature in iOS 17

iOS 17 is packed with long-awaited, cutting-edge features. Among them is Personal Voice, a feature that can speak typed text in your own voice. It is first and foremost an accessibility feature, aimed especially at people who are at risk of losing their ability to speak.

If you have already installed the public beta of iOS 17, you may have had a chance to play around with this feature. If you haven't, here is how setting it up works:

You are prompted to read a series of phrases and sentences aloud (about 15 minutes of audio in total), speaking naturally in your own accent. The Text-To-Speech (TTS) system captures the patterns of your voice and applies them to whatever text is entered. What makes this feel like magic is the AI working behind the scenes. Renowned YouTubers such as Marques Brownlee (MKBHD) have shared their impressions of the feature and come away genuinely impressed.

At its core, Personal Voice is driven by an AI TTS system pre-trained on an extremely large dataset of text paired with its spoken form. Personal Voice is English-only at launch, and the reason is data: Apple needs enough examples of text and corresponding speech to train such a model. Since English is one of the most widely spoken languages, it is the natural first choice for building a dataset and training an AI model.

As for the "personal" side of Personal Voice, what elevates an already well-trained and impressively accurate TTS model is the customization offered to the user. Essentially, the user adds a touch of their identity to the AI model, making it more personal than ever before. The user is asked to read aloud phrases and sentences prompted by the system, which feeds an approach called "fine-tuning". If you are familiar with AI, you may have heard the term before. Fine-tuning is a technique in transfer learning in which a pre-trained model is adapted to a dataset different from the one it was originally trained on, capturing and learning the patterns of the new data. For example, a 1000-class image classification model can be fine-tuned to recognize the 10 classes of another dataset, as the sketch below illustrates.
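To make that concrete, here is a minimal sketch using Apple's Create ML framework on macOS. Create ML's image classifier does not train a network from scratch: it fine-tunes a small classification head on top of a feature extractor that Apple pre-trained on a huge image dataset. The folder paths and class layout below are hypothetical.

```swift
import CreateML
import Foundation

// Hypothetical: a folder with ten subfolders, one per class
// (e.g. "Cat/", "Dog/", ...), each containing example images.
let trainingDir = URL(fileURLWithPath: "/path/to/TrainingImages")
let dataSource = MLImageClassifier.DataSource.labeledDirectories(at: trainingDir)

// Create ML reuses a feature extractor pre-trained by Apple and fits
// only a small classifier for our ten classes on top. That is the
// same transfer-learning idea described above.
let classifier = try MLImageClassifier(trainingData: dataSource)

// Export the fine-tuned model for use in an app.
try classifier.write(to: URL(fileURLWithPath: "/path/to/TenClassClassifier.mlmodel"))
```

The heavy lifting was done once on Apple's side; our ten classes only nudge the final layers.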

Similarly, Personal Voice records you speaking aloud a handful of phrases and sentences and then uses them to further tweak a pre-trained AI model to make it yours. The final result is a TTS system that sounds almost exactly like you.
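Once a Personal Voice has been created in Settings, third-party apps can use it through the standard AVSpeechSynthesizer API in iOS 17, gated behind an explicit permission prompt. A minimal sketch:

```swift
import AVFoundation

let synthesizer = AVSpeechSynthesizer()

// Personal Voice is gated behind an explicit user-permission prompt.
AVSpeechSynthesizer.requestPersonalVoiceAuthorization { status in
    guard status == .authorized else { return }

    // Among all installed voices, pick one flagged as a Personal Voice.
    let personalVoice = AVSpeechSynthesisVoice.speechVoices()
        .first { $0.voiceTraits.contains(.isPersonalVoice) }

    let utterance = AVSpeechUtterance(string: "Hello, this is my Personal Voice.")
    utterance.voice = personalVoice   // falls back to the default voice if nil
    synthesizer.speak(utterance)
}
```

If no Personal Voice exists or permission is denied, the utterance's voice stays nil and the system default voice is used instead.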

A natural concern here is privacy. If many of us record ourselves speaking and those recordings are sent to Apple's cloud servers, we are giving Apple access to our voices and handing over our data for free. However, this is not the case.

The fine-tuning of the AI model happens on-device. This means that all the phrases and sentences we record remain on our iPhones. Every iPhone's SoC includes a Neural Engine, a dedicated coprocessor built specifically to accelerate AI workloads, such as applying the bokeh effect in Portrait mode and suppressing noise during calls, to name a few. The same silicon powers Personal Voice, although the older the iPhone model, the longer the setup takes.
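Apple has not published the internals of Personal Voice, but Core ML's on-device personalization API illustrates the general pattern: a model with updatable layers ships with the app, gets tweaked locally on the user's data, and Core ML is free to schedule the work on the Neural Engine. The model name, file paths, and training-batch helper below are all hypothetical placeholders.

```swift
import CoreML

// Placeholder: a real app would wrap MLFeatureProvider instances that
// pair the user's audio features with their target transcripts.
func makeTrainingBatch() -> MLBatchProvider {
    MLArrayBatchProvider(array: [])
}

// Hypothetical: a compiled Core ML model with updatable layers, bundled with the app.
let modelURL = Bundle.main.url(forResource: "VoiceModel", withExtension: "mlmodelc")!

// Let Core ML schedule training on the CPU, GPU, or Neural Engine as available.
let config = MLModelConfiguration()
config.computeUnits = .all

let updateTask = try MLUpdateTask(
    forModelAt: modelURL,
    trainingData: makeTrainingBatch(),
    configuration: config,
    completionHandler: { context in
        // The adapted weights never leave the device: we simply
        // overwrite the local copy of the model.
        let updatedURL = FileManager.default.temporaryDirectory
            .appendingPathComponent("VoiceModel.mlmodelc")
        try? context.model.write(to: updatedURL)
    }
)
updateTask.resume()
```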

In conclusion, iOS 17's Personal Voice feature is powered by sophisticated, state-of-the-art AI, and it achieves this while keeping your privacy in check.
