Debunking the Hype Around Open Source Chatbots

While the hype surrounding open-source chatbots is undeniable, businesses should approach them with a discerning eye. The future of AI chatbots is promising, but it will likely be shaped by a combination of proprietary models like ChatGPT and open-source initiatives. As the old adage goes, “All that glitters is not gold.” In the world of AI chatbots, it’s the substance, not the sparkle, that truly counts.

According to reports on the chatbot market, the global chatbot industry is expected to reach $2.5 billion by 2025. As machine learning algorithms and natural language processing continue to improve, chatbots are expected to become more sophisticated and powerful. With the rise of models like StableLM and the continuous evolution of AI, the chatbot landscape is set for exciting developments. The key is to remain informed and discerning in the face of rapid technological advancements.

The Open Source Chatbot Landscape

In the rapidly evolving world of artificial intelligence, chatbots have become a focal point of interest. The race to develop a chatbot that can match or surpass the capabilities of OpenAI’s ChatGPT is on, with numerous contenders emerging from both corporate giants and the open-source community. However, a closer examination reveals that these claims of superiority often fall short when subjected to rigorous testing and scrutiny.

The open-source community and tech companies are continually introducing new chatbots, many of which claim to rival or even surpass the performance of OpenAI’s ChatGPT. A common practice among developers is to train these chatbots using data generated by ChatGPT, a shortcut that has raised questions about the validity of their performance claims.

OpenChat, an open-source chatbot presented as a decentralized alternative, recently claimed to outperform ChatGPT by scoring 105.7% on the Vicuna GPT-4 Benchmark. However, this claim warrants a closer look. OpenChat is built on top of LLaMA-13B, a model developed by Meta for research purposes and not intended for commercial use. The dataset used for fine-tuning this model consists of a subset of conversations available on ShareGPT, a hub for sharing outputs generated by ChatGPT and GPT-4.

The Issues with Benchmarking

The Vicuna GPT-4 Benchmark, used to evaluate these models, rewards the style of a response rather than the accuracy of the information in it. Because GPT-4 acts as the judge, it tends to rate models trained on ChatGPT or GPT-4 output more highly, which calls the reliability of the benchmark into question.
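A score above 100% on a judge-based benchmark is easier to interpret once the arithmetic is spelled out. The sketch below assumes the headline percentage is the candidate's summed judge ratings divided by the reference model's summed ratings; the function name and the ratings are made up for illustration and are not OpenChat's actual results.

```python
def relative_score(candidate_scores, reference_scores):
    """Return the candidate's total judge rating as a percentage of the reference's total."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    reference_total = sum(reference_scores)
    if reference_total == 0:
        raise ValueError("reference total must be non-zero")
    return 100.0 * sum(candidate_scores) / reference_total


# Hypothetical 1-10 ratings from a judge model on five prompts (invented numbers).
candidate = [9, 8, 9, 8, 9]  # total 43
reference = [8, 8, 8, 8, 8]  # total 40

print(f"{relative_score(candidate, reference):.1f}%")  # prints 107.5%
```

A result over 100% only says the judge preferred the candidate's answers on these prompts. It says nothing about whether those answers were factually correct.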

Hugging Face has identified similar issues with other open-source models. They found discrepancies between the evaluation benchmarks reported in the models’ papers and the results when these models are evaluated on Hugging Face’s benchmarks. It is notable that many recent models claiming to outperform LLaMA or GPT-4 are absent from the Open LLM Leaderboard.

The Fallacy of Superior Performance

Any model trained on ChatGPT or GPT-4 data will naturally score higher when GPT is the judge, which makes the benchmarking process unreliable. This is akin to a student copying the teacher’s model answers and then having the same teacher grade the paper against those answers. The score will inevitably be high, but it does not accurately reflect the student’s understanding or ability.

In a way, these models are merely imitating the style of ChatGPT, which may make them sound better on individual tasks but does not necessarily translate to superior performance across a range of general tasks.

In response to the criticism, researchers have begun using MT-bench to test OpenChat’s performance. On that benchmark, the model performed significantly worse than GPT-3.5-based ChatGPT, highlighting the discrepancy between evaluation benchmarks.

Despite the controversies surrounding benchmarks and performance claims, one thing is clear: high-quality data can significantly enhance the performance of LLM-based chatbots. In this regard, ChatGPT deserves credit, as many models today are trained on synthetic data generated by this chatbot. While the trend of new open source models claiming to outperform all others continues, these models often fail to deliver when evaluated on the same metrics.

Recent Developments

In a recent development, Stability AI introduced a new family of open source AI language models named StableLM. This move is seen as a significant step towards creating an open source alternative to OpenAI’s ChatGPT. Stability AI, positioning itself as an open source competitor to OpenAI, aims to democratize access to AI technology, making it more transparent and accessible.

StableLM, currently in its alpha phase, is available on GitHub in 3 billion and 7 billion parameter model sizes. The company plans to release 15 billion and 65 billion parameter models in the future. StableLM operates similarly to GPT-4, the large language model (LLM) behind the most capable version of ChatGPT. It generates text by predicting the next token (word fragment) in a sequence, which starts with information provided by a human in the form of a “prompt.” As a result, StableLM can compose human-like text and write programs.
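Next-token prediction can be illustrated with a toy model. The sketch below is not how StableLM or GPT-4 works internally: it swaps the neural network for a simple table of word-pair counts, and the corpus and function names are invented for illustration. What it shares with real LLMs is the loop: look at the sequence so far, pick a next token, append it, and repeat.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token is followed by each other token."""
    counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        counts[current][following] += 1
    return counts


def generate(counts, prompt, max_new_tokens):
    """Extend the prompt one token at a time, always taking the most frequent follower."""
    if not prompt:
        raise ValueError("prompt must contain at least one token")
    output = list(prompt)
    for _ in range(max_new_tokens):
        candidates = counts.get(output[-1])
        if not candidates:
            break  # the last token was never followed by anything in training
        # Greedy choice; ties are broken alphabetically so the result is deterministic.
        next_token = min(candidates, key=lambda tok: (-candidates[tok], tok))
        output.append(next_token)
    return output


corpus = "the bot answers the user and the bot learns".split()
model = train_bigram(corpus)
print(" ".join(generate(model, ["the"], 4)))  # prints: the bot answers the bot
```

Real models replace the count table with billions of learned parameters and usually sample from a probability distribution instead of always taking the top choice.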

Stability AI claims that StableLM achieves performance similar to OpenAI’s GPT-3 model while using far fewer parameters: 7 billion for StableLM versus 175 billion for GPT-3. Parameters are the numerical values a language model adjusts as it learns from training data. Having fewer parameters makes a language model smaller and more efficient, which can make it easier to run on local devices like smartphones and laptops.
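A rough back-of-the-envelope calculation shows why parameter count matters for local devices. The sketch below assumes each parameter is stored as a 16-bit (2-byte) number and counts only the weights, ignoring activations and other runtime overhead; the function name is illustrative.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate memory needed to hold the model weights, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


for name, params in [("StableLM 7B", 7e9), ("GPT-3 175B", 175e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB at 16-bit precision")
# prints:
# StableLM 7B: ~14 GB at 16-bit precision
# GPT-3 175B: ~350 GB at 16-bit precision
```

Under these assumptions the smaller model fits on a single high-end consumer GPU, while the larger one needs a multi-accelerator server.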

According to Stability AI, StableLM has been trained on a new experimental dataset that builds on the open-source dataset The Pile but is three times larger. The company attributes the model’s “surprisingly high performance” in conversational and coding tasks, despite its small parameter count, to the “richness” of this dataset.

Stability AI is best known for its open-source image generator Stable Diffusion, and its venture into language models with StableLM could have a similarly broad impact. Users can test the 7 billion-parameter StableLM base model on Hugging Face and the fine-tuned model on Replicate. Hugging Face also hosts a dialog-tuned version of StableLM that uses a conversation format similar to ChatGPT’s.

Conclusion and Future Outlook

The world of AI chatbots is a dynamic and rapidly evolving landscape. While open-source chatbots have made significant strides, they still lag behind the performance of OpenAI’s ChatGPT. The allure of replicating ChatGPT’s capabilities has led many to take shortcuts, such as training their models on ChatGPT’s data. However, this approach has proven to be less effective than anticipated and has even led to legal complications.

For businesses looking to leverage AI chatbots, it’s crucial to understand these dynamics. Rather than being swayed by bold claims of superior performance, businesses should conduct thorough evaluations of these chatbots in real-world scenarios. This includes testing the chatbots across a range of tasks and not just relying on benchmark scores. Moreover, businesses should be cautious about using models trained on ChatGPT’s data, given the ongoing legal issues. Instead, they might want to explore other training methods or consider partnering with reputable AI companies that have a proven track record in developing effective AI models.
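One way to test a chatbot across a range of tasks is a small in-house harness that reports a pass rate per task category instead of a single headline number. The sketch below is a minimal illustration: `toy_chatbot` is a stand-in for whatever model is under evaluation, and the prompts and checks are invented examples that a business would replace with cases from its own use.

```python
from collections import defaultdict


def evaluate(model_fn, test_cases):
    """Return the fraction of passed checks per category.

    Each test case is a (category, prompt, check) tuple, where check is a
    function that takes the model's reply and returns True or False.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for category, prompt, check in test_cases:
        total[category] += 1
        if check(model_fn(prompt)):
            passed[category] += 1
    return {category: passed[category] / total[category] for category in total}


def toy_chatbot(prompt):
    """Hypothetical stand-in for a real chatbot call."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I am not sure.")


test_cases = [
    ("arithmetic", "What is 2 + 2?", lambda reply: "4" in reply),
    ("arithmetic", "What is 7 * 6?", lambda reply: "42" in reply),
    ("knowledge", "Capital of France?", lambda reply: "paris" in reply.lower()),
]

print(evaluate(toy_chatbot, test_cases))  # prints {'arithmetic': 0.5, 'knowledge': 1.0}
```

Checks that test for correct content rather than tone avoid the style bias that affects judge-based benchmarks.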

Looking ahead, the race to replicate or surpass ChatGPT’s capabilities is likely to continue. However, the ‘secret sauce’ that makes ChatGPT so effective remains closely guarded by OpenAI. As such, it’s unlikely that open-source chatbots will match ChatGPT’s performance in the near future. But this doesn’t mean that open-source chatbots have no role to play. They can still provide valuable insights and advancements in the field of AI. For instance, Stability AI’s StableLM, despite its limitations, represents a significant step towards creating an open-source alternative to ChatGPT.
