Many bilinguals code-switch unintentionally, that is, they replace a word mid-sentence with its equivalent in another language. Imagine giving a presentation fluently in one language, then catching yourself explaining a specific concept in a different one.
There are many hypotheses as to why sudden, unintentional code-switching occurs. One that has received a lot of interest lately revolves around the concept of a symbolic-semantic graph. Psychology and computer science describe this as the semantic network model of knowledge representation, where concepts are modeled as nodes in a graph and their semantic relationships with each other as the edges.
As the model describes, the relationship between two nodes can be so strong that the nodes become practically and symbolically interchangeable. Words in different languages that represent the same concept might map to each other much like synonyms do. The English word "cat" might map to the Spanish word "gato" and the Filipino word "pusa" in the same way that the synonyms "rough" and "coarse" map to each other.
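The semantic network model above can be sketched as a weighted graph. This is a minimal illustration, not a psycholinguistic implementation; the words and edge weights are invented for the example.

```python
# A toy semantic network: concepts as nodes, association strength as
# edge weights. Cross-language equivalents ("cat"/"gato"/"pusa") get
# strong edges, just like near-synonyms ("rough"/"coarse").
# All weights are illustrative, not measured data.
semantic_network = {
    "cat":    {"gato": 0.9, "pusa": 0.9, "animal": 0.6},
    "gato":   {"cat": 0.9, "pusa": 0.9},
    "pusa":   {"cat": 0.9, "gato": 0.9},
    "rough":  {"coarse": 0.8},
    "coarse": {"rough": 0.8},
}

def strongest_association(word):
    """Return the neighbor with the highest edge weight, or None."""
    neighbors = semantic_network.get(word, {})
    return max(neighbors, key=neighbors.get) if neighbors else None

print(strongest_association("rough"))  # -> "coarse"
```

Under this sketch, an unintentional code-switch would correspond to the retrieval process following one of the near-maximal edges ("gato" or "pusa") instead of the intended node.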
DeepMind recently published a paper, SCAN: Learning Abstract Hierarchical Compositional Visual Concepts (Symbol-Concept Association Network), that worked with a similar graphical concept. They trained a neural network to associate images with their respective semantic concepts: the AI was first shown images of apples, and was later reinforced with the string "apple" in conjunction with the image of an apple. The method is reminiscent of how we, as babies, were taught representations of objects in the real world.
The DeepMind paper's objective was to model combinatorial learning, where fundamental concepts are combined to form new ideas. Multiplication as an aggregate of addition operations is one example of combinatorial learning, where a more complex idea/operation is discovered and pursued using basic learned concepts.
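The multiplication example can be made concrete in a few lines: the "new" operation is built entirely out of the previously learned one.

```python
def multiply(a, b):
    """Multiplication composed purely from the more basic
    learned operation of repeated addition."""
    total = 0
    for _ in range(b):
        total += a  # the only primitive used is addition
    return total

print(multiply(4, 3))  # -> 12
```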
While this post won't tackle the technical aspects of SCAN, it uses this validated model as precedent for a proposed method of implementing speech-to-text for the Filipino language.
Classical approaches to speech-to-text for Filipino involve constructing a dictionary of characters and their respective sounds. The construction of a language model following this architecture is discussed in the paper FiliText: A Filipino Hands-free Text Messaging Application by Chua, Chua, et al.
In this respect, the Sphinx-based model follows a very rule-based approach, given that the mappings are discrete and preset. As discussed in their paper, tolerance for background noise and pronunciation variance was very low, and classification worked best in a controlled environment.
What this post proposes is the development of a new language model that uses a recursive mapping of sounds to characters following the semantic network representation.
The proposed model seeks to develop the following learned mappings in order to achieve word-level speech-to-text first, and sentence-level speech-to-text later:
- Character Sound to Character Symbol
- Character Sound Sequence to Word Sound
- Word Sound to Character Symbol
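The three mappings above can be sketched as composable stages. The lookup table and function names below are hypothetical placeholders standing in for the learned neural mappings; the example uses the Filipino word "aso" ("dog").

```python
# Toy stand-ins for the three proposed learned mappings.

# 1. Character sound -> character symbol (a learned mapping in the
#    real model; a fixed lookup table here)
PHONEME_TO_CHAR = {"/a/": "a", "/s/": "s", "/o/": "o"}

# 2. Character sound sequence -> word sound
def char_sounds_to_word_sound(phonemes):
    # placeholder for a learned aggregation of phonemes into a word sound
    return "-".join(phonemes)

# 3. Word sound -> character symbols (the transcribed word)
def word_sound_to_symbols(word_sound):
    return "".join(PHONEME_TO_CHAR[p] for p in word_sound.split("-"))

word_sound = char_sounds_to_word_sound(["/a/", "/s/", "/o/"])
print(word_sound_to_symbols(word_sound))  # -> "aso"
```

The point of the staging is that each mapping can be learned and validated separately before being composed into word-level, and eventually sentence-level, transcription.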
In this proposed model, text input is fed into the model along with its audio counterpart: the character "a", for example, would be paired with its vowel phoneme. Text and audio preprocessing would first remove extraneous background sounds, white noise, and errant characters. Both symbols, the sound and the character, would then be fed into a neural network that learns the mappings between sounds and characters, given a sufficiently large training data set.
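A minimal sketch of that training setup, under heavy simplifying assumptions: the "audio" is synthetic feature vectors rather than real spectrogram features, the character inventory is three vowels, and the network is a single softmax layer trained with gradient descent rather than a deep model.

```python
import numpy as np

rng = np.random.default_rng(0)
chars = ["a", "e", "i"]                    # toy character inventory
centers = rng.normal(size=(3, 8))          # one synthetic "phoneme" cluster per char

# Paired training examples: noisy audio features labeled with a character,
# standing in for preprocessed (audio, text) input pairs
X = np.vstack([centers[i] + 0.1 * rng.normal(size=(50, 8)) for i in range(3)])
y = np.repeat(np.arange(3), 50)

# One-layer softmax classifier learning the sound-to-symbol mapping
W = np.zeros((8, 3))
for _ in range(200):
    logits = X @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(len(y)), y] -= 1       # gradient of cross-entropy loss
    W -= 0.01 * X.T @ probs / len(y)

pred = (X @ W).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```

A real implementation would replace the synthetic features with denoised spectrograms and the linear layer with a deeper network, but the data flow, paired sound/character examples driving a learned mapping, is the same.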
If successful, a speech-to-text model based on a symbolic-semantic model should provide greater flexibility with input source quality and consequently encourage usability across multiple domains. The base learned mapping of audio to text could also be extended in further research with additional mappings to other forms of input such as image, video, temperature, gyroscope, or current readings.