Generative AI models don’t process text the way humans do. To understand why these models have some of their odd limitations, it’s essential to look at how they represent text internally, through units called “tokens.”

Most generative AI models, including popular ones like Google’s Gemma and OpenAI’s GPT-4o, are built on an architecture called the transformer. Because of how transformers work, these models can’t take in or output raw text directly, at least not without an enormous amount of compute.

As a result, current transformer models rely on a preprocessing step called tokenization, in which text is broken into smaller units known as tokens. A token can be a whole word like “fantastic,” syllable-like chunks such as “fan,” “tas,” and “tic,” or even an individual character.
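To make the idea concrete, here is a minimal sketch using OpenAI’s open-source tiktoken library. The cl100k_base encoding used below is just one example tokenizer (an assumption for illustration); Gemma, GPT-4o, and other models ship their own vocabularies, so the exact splits will differ.

```python
# Minimal tokenization sketch with tiktoken (illustrative only).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization is fantastic."
token_ids = enc.encode(text)

# Show which piece of text each token id maps back to.
for tid in token_ids:
    print(tid, enc.decode_single_token_bytes(tid))

# Round-tripping the ids recovers the original string.
assert enc.decode(token_ids) == text
```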

While tokenization lets transformers pack more information into a fixed context window, it also introduces biases and quirks. For example, tokenizers may encode the same phrase differently depending on whether it is preceded by a space, producing different results for what a human would consider identical input. Tokenizers also treat capitalization inconsistently, so “hello” and “HELLO” can map to very different token sequences, which hurts model performance on case-sensitive tasks.
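The snippet below illustrates both quirks, again using tiktoken’s cl100k_base encoding as a stand-in for whatever tokenizer a given model actually uses. The point is not the specific ids but that a leading space or a change of case produces a different token sequence.

```python
# How spacing and capitalization change the token sequence (illustrative).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

pairs = [
    ("once upon a time", " once upon a time"),  # same words, leading space added
    ("hello", "HELLO"),                          # same word, different case
]

for a, b in pairs:
    print(repr(a), "->", enc.encode(a))
    print(repr(b), "->", enc.encode(b))
    print()
```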

The limitations of tokenization extend beyond English to languages that handle word boundaries differently, such as Chinese, Japanese, Korean, Thai, and Khmer, many of which do not use spaces to separate words. These languages tend to be split into far more tokens per unit of meaning, which translates into slower processing and higher bills from AI vendors that charge per token.
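A rough comparison makes the cost gap visible. The example below counts tokens for an English phrase and its Chinese equivalent using tiktoken’s cl100k_base encoding; the exact numbers depend on the tokenizer (an assumption here), but non-space scripts typically come out needing more tokens per character.

```python
# Token-count comparison across scripts (illustrative, tokenizer-dependent).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English": "Hello, world",
    "Chinese": "你好，世界",  # "Hello, world"
}

for label, text in samples.items():
    ids = enc.encode(text)
    print(f"{label}: {len(text)} characters -> {len(ids)} tokens")
```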

Moreover, tokenization poses challenges for mathematical tasks because digits are not tokenized consistently; one number might be a single token while a neighboring number is split into two. This inconsistency makes it harder for models to learn numerical patterns, and similar character-level quirks affect performance on tasks like solving anagrams or reversing words.
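The sketch below peeks at how a few numbers split under tiktoken’s cl100k_base encoding (again an assumption, since every model’s tokenizer behaves differently). Neighboring integers can come out in different shapes, which is one reason digit-level arithmetic is awkward for token-based models.

```python
# How numbers split into tokens (illustrative, tokenizer-dependent).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ["380", "381", "1000", "1001", "123456789"]:
    ids = enc.encode(number)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in ids]
    print(f"{number!r} -> {pieces}")
```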

In response to these tokenization issues, researchers are exploring alternative approaches like byte-level state space models such as MambaByte. By eliminating tokenization and working directly with raw bytes, models like MambaByte show promise at handling noisy input while remaining competitive with transformer models on language tasks.
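A toy sketch of the byte-level idea: instead of a learned vocabulary, the model’s input units are simply the raw UTF-8 bytes of the text, values from 0 to 255. This sidesteps tokenizer quirks around spacing, case, and scripts, at the cost of much longer sequences (which is where architectures like state space models come in).

```python
# Byte-level representation: no tokenizer, no vocabulary file.
text = "fantastic 你好"

byte_ids = list(text.encode("utf-8"))
print(byte_ids)                                   # integers in 0..255
print(len(text), "characters ->", len(byte_ids), "bytes")

# Decoding is lossless and requires no vocabulary.
assert bytes(byte_ids).decode("utf-8") == text
```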

While solutions to tokenization’s challenges are still at an early research stage, new model architectures may offer a path forward. If future generative AI models can shed the limitations tokenization imposes, they stand to become more robust and more capable.