Approaching the end of May (after studying for finals and AP exams), I began researching speech recognition software that met my standards, having completed the first step of coding the OLED display. For the transcription portion of the project, I created a list of requirements and potential add-ons. First, the software needed to be quick and efficient in its transcription. Second, it had to be lightweight enough to run on a Raspberry Pi 3B+, a cheaper model with less processing power than the more recent Raspberry Pi 4 and 5.
According to reliable sources, Google’s Speech Recognition was a good place to start, especially for beginners. I discovered a website describing a Python project built on Google Speech Recognition, complete with an explanation of its speech-to-text code, installation steps, and examples. It was quite the break for me, especially since I was able to follow along through the entire video and article tutorial. I plugged in my USB shotgun microphone and tested my new code, which converted live audio into text displayed as output in the terminal. From there, I simply rerouted the output so that the text variable was displayed on the OLED instead of in the terminal. It worked!

However, I ran into another issue: formatting the OLED. It turned out that the OLED display library required an argument in its display function to specify which of the four lines the input text would be printed on. To work around this, I discovered the wonder of substrings and the code behind splitting a string of text into parts by spaces, punctuation, or a fixed length (measured with Python’s len() function, which returns the number of characters in a string). This was especially helpful for printing a long sentence across multiple lines, instead of only fitting the start of the sentence onto the first line of the OLED. Rough sketches of both pieces follow.
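The tutorial’s exact code isn’t reproduced here, but a minimal sketch of that first version might look like the following, assuming the SpeechRecognition package (imported as speech_recognition) and a working default microphone:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Capture one phrase from the default microphone (the USB shotgun mic, in my setup)
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)  # calibrate for background noise
    audio = recognizer.listen(source)

try:
    # Send the captured audio to Google's free web recognizer and print the result
    text = recognizer.recognize_google(audio)
    print(text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as error:
    print(f"Request to Google Speech Recognition failed: {error}")
```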
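Rerouting the output then just meant replacing print() with the OLED call. For the formatting fix, here is a minimal sketch of the kind of line-wrapping helper I described, assuming a four-line display roughly 20 characters wide; the display.show_text(row, line) call named in the comments is a hypothetical stand-in for whatever line-indexed function your OLED library provides:

```python
def wrap_text(text, width=20, max_lines=4):
    """Split text into at most max_lines rows of at most width characters,
    breaking on spaces so words stay whole."""
    words = text.split()
    lines, current = [], ""
    for word in words:
        # The +1 accounts for the space between the current row and the next word
        if current and len(current) + 1 + len(word) > width:
            lines.append(current)  # row is full; start a new one
            current = word
        else:
            current = f"{current} {word}".strip()
    if current:
        lines.append(current)
    return lines[:max_lines]  # the OLED only has four rows

# Each row gets its own line number on the display; swap print() for the
# OLED call, e.g. display.show_text(row, line).
for row, line in enumerate(wrap_text("a long transcribed sentence to wrap onto the screen")):
    print(row, line)
```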
By June, I realized an important factor I had glossed over while developing my project with Google Speech Recognition: Internet connection. One of the inevitable drawbacks of using Google Speech Recognition was its reliance on Wi-Fi, something a hearing-impaired user might not always have access to. I figured it was time to switch software, specifically to an offline transcription service, something like OpenAI’s Whisper. The artificial intelligence portion of this package would be most fitting to the progressing technologies of modern society and the innovative spirit of the i2 Program; however, I quickly discovered that it wouldn’t be possible (despite the abundance of YouTube explanations of how to implement Whisper’s speech-to-text software on a Raspberry Pi) due to the Model 3B+’s unfortunate lack of computing power. I researched other offline speech recognition software, including PocketSphinx and TensorFlow-based packages. Unfortunately, neither option had enough accessible background information for a beginner coder to implement. I continued to explore different options and quickly developed confidence in my coding abilities, especially after an internship with Inspirit AI, where I learned to apply my Python skills to building my own AI models to predict patterns and discover new exoplanets using existing NASA data.
After such an amazing experience as an intern, I applied my newfound knowledge to my i2 project. In July, I discovered the speech recognition software I would officially use to transcribe audio from the shotgun microphone into text displayed on the OLED module: the Vosk speech recognition API. It was a perfect fit for my project: it can run offline, it transcribes efficiently, and it is recommended by professionals for implementation on a Raspberry Pi 3B+ or 4. In fact, my new code could transcribe audio into text faster than the Google Speech Recognition version, and I added a “reset” voice command to restart the code and a “quit” command to end it.
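Here is a minimal sketch of that final Vosk loop, assuming PyAudio for the microphone stream and one of Vosk’s downloadable small English models; the model folder name and 16 kHz sample rate come from Vosk’s own examples, and the command handling is a simplified stand-in for my actual logic:

```python
import json

import pyaudio
from vosk import Model, KaldiRecognizer

# Path to a downloaded Vosk model folder, e.g. the small US English model
model = Model("vosk-model-small-en-us-0.15")
recognizer = KaldiRecognizer(model, 16000)

mic = pyaudio.PyAudio()
stream = mic.open(format=pyaudio.paInt16, channels=1, rate=16000,
                  input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):  # True once a full utterance is finalized
        text = json.loads(recognizer.Result()).get("text", "")
        if text == "quit":    # voice command: end the code
            break
        if text == "reset":   # voice command: restart with a fresh recognizer
            recognizer = KaldiRecognizer(model, 16000)
            continue
        print(text)  # in the real project this went to the OLED, not the terminal

stream.stop_stream()
stream.close()
mic.terminate()
```

Everything seemed to be perfect, until the final step of running all this code headless arrived…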