From Chaos to Order: The Evolutionary Logic of Gujarati Sentiment Analysis
Source: Scientific Publication
Primary Authors: Shah, Baser

Have you ever wondered why nature prefers a messy, iterative process over a clean slate? It seems inefficient. Yet, biology builds upon existing structures—junk DNA, vestigial organs—to forge something robust. A recent study applies a strikingly similar logic to computational linguistics, specifically tackling the scarcity of data in Indian languages.
In the digital ecosystem, Gujarati is what linguists term a 'low-resource' language. Data is scarce. Annotated datasets are rare. Previous attempts to teach machines how to read emotions in Gujarati text relied on rigid rules. These methods were like early invertebrates trying to survive on land: they worked, but they could not scale.
A Hybrid Approach to Gujarati Sentiment Analysis
The researchers proposed a hybrid framework that mimics natural selection. They did not simply write code; they engineered a pressure system for data survival. First, they built a custom lexicon—a rule-based ancestor containing synonyms and antonyms. This system annotated over 21,000 news headlines, achieving a baseline accuracy of 72.75%.
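The paper does not reproduce its lexicon, so the sketch below only illustrates the shape of that rule-based step: the word sets and the score_headline helper are hypothetical stand-ins for the authors' much larger synonym/antonym lexicon, not their actual resource.

```python
# Illustrative-only lexicon entries; the study's custom Gujarati lexicon is far
# larger and includes synonym and antonym expansions.
POSITIVE = {"સારું", "વિજય", "સફળ"}    # roughly: good, victory, successful
NEGATIVE = {"ખરાબ", "હાર", "નિષ્ફળ"}   # roughly: bad, defeat, failed


def score_headline(headline: str) -> str:
    """Assign a coarse sentiment label by counting lexicon hits."""
    tokens = headline.split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# A headline containing a positive lexicon word is labelled "positive".
print(score_headline("વિજય"))
```

Applied across the full headline corpus, a rule of this shape is what produced the roughly 21,000 labels and the 72.75% baseline reported in the study.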
Then came the philosophical detour. Consider the genome. It is not a static library but a dynamic engine that amplifies what works and discards what does not. The team introduced a semi-supervised pipeline to mirror this. They manually checked 11,625 headlines to train a baseline model. They then set this model loose on a massive pool of 93,000 unlabelled headlines.
Here is where the 'evolutionary' filtering occurred. The system used pseudo-labelling, but with a strict condition: only labels with a confidence score of 90% or higher were allowed to survive. This is akin to a gene pool stabilising beneficial mutations while purging the deleterious ones. The result was a robust, hybrid dataset of approximately 105,000 headlines.
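A minimal sketch of that survival filter, written in scikit-learn terms: the TF-IDF plus Logistic Regression baseline, the function name, and the variable names are assumptions for illustration, since the paper is not quoted here on those details; only the 90% confidence cut-off is taken from the study.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_hybrid_dataset(labelled_texts, labelled_y, unlabelled_texts, threshold=0.90):
    """Train a baseline on hand-checked headlines, pseudo-label the unlabelled
    pool, and keep only predictions at or above the confidence threshold."""
    # Assumed baseline: TF-IDF features feeding a Logistic Regression classifier.
    baseline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    baseline.fit(labelled_texts, labelled_y)

    # Pseudo-label the pool; the 90% filter decides which labels "survive".
    probs = baseline.predict_proba(unlabelled_texts)
    keep = probs.max(axis=1) >= threshold
    pseudo_y = baseline.classes_[probs.argmax(axis=1)]

    # Hybrid dataset = hand-checked seed set + surviving pseudo-labels.
    hybrid_texts = list(labelled_texts) + [t for t, k in zip(unlabelled_texts, keep) if k]
    hybrid_y = np.concatenate([np.asarray(labelled_y), pseudo_y[keep]])
    return hybrid_texts, hybrid_y
```

Fed the study's 11,625 hand-checked headlines and the 93,000-headline pool, a filter of this form is what yields the hybrid set of roughly 105,000 labelled headlines.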
When the team ran the final experiments, the results were decisive. They tested various machine learning models, including Logistic Regression and Naive Bayes. However, the Random Forest model, utilising TF-IDF features, emerged as the dominant species. It achieved an accuracy of 88.54%. Furthermore, cross-checking against human-labelled samples suggested the machine's internal 'guesses' were accurate more than 90% of the time.
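As a sketch of that final comparison, again in scikit-learn terms and continuing from the hybrid dataset built above: the split ratio, tree count, and random seed below are assumptions, not values reported in the paper, and the 88.54% figure will not be reproduced without the actual corpus.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# hybrid_texts, hybrid_y come from build_hybrid_dataset() in the sketch above.
X_train, X_test, y_train, y_test = train_test_split(
    hybrid_texts, hybrid_y, test_size=0.2, random_state=42, stratify=hybrid_y
)

# TF-IDF features feeding a Random Forest: the best-performing combination
# reported in the study.
model = make_pipeline(TfidfVectorizer(), RandomForestClassifier(n_estimators=200))
model.fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```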
This work suggests a replicable blueprint for other languages starving for data. While the authors plan to explore deep learning architectures like mBERT in the future, the current victory lies in the organisation of the data itself. It proves that you do not need the biggest brain to survive; you just need the most adaptable system.