Our study explores how the different serving techniques used by various providers influence the overall quality of the LLMs they host. To this end, we conduct empirical tests with a predefined set of prompts, allowing us to systematically compare outcomes across service providers. This approach helps us identify meaningful differences in how well each provider's deployment of a model processes and responds to the given inputs, shedding light on the relationship between provider selection and model performance. This blog post summarizes that study. We perform all our experiments using app.premai.io.
The need for a playground
When building LLM-powered applications, we face numerous choices among providers offering their LLMs. Before building, however, it is crucial to understand which providers best fit the use case.
To properly evaluate models from various providers, we will conduct empirical testing. We will use a predefined set of prompts and assess the models based on the following parameters (a measurement sketch follows the list):
Verbosity
Correctness
Tokens per second (tok/s), used as a tie-breaker
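For concreteness, here is a minimal sketch of how such a measurement could be taken. It is our own illustration, not PremAI's internal tooling: the `generate` callable is a hypothetical stand-in for a provider API call, and the whitespace-based token count is a rough proxy for the provider-reported figure; correctness still requires human judgment.

```python
import time
from typing import Callable


def evaluate_response(generate: Callable[[str], str], prompt: str) -> dict:
    """Time one completion and derive the metrics we rank providers on."""
    start = time.perf_counter()
    completion = generate(prompt)
    elapsed = time.perf_counter() - start

    # Rough token estimate via whitespace splitting; a real harness would
    # read the provider's reported token counts instead.
    tokens = len(completion.split())

    return {
        "verbosity_words": tokens,  # proxy for verbosity
        "tokens_per_second": tokens / elapsed if elapsed > 0 else 0.0,
        "completion": completion,   # correctness is judged manually
    }


# Usage with a placeholder provider call (replace with a real API client):
if __name__ == "__main__":
    fake_generate = lambda p: "This statement is false."
    print(evaluate_response(fake_generate, "Write a paradox"))
```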
Prompts
As shown in the figure below, we will evaluate our LLMs using these prompts. For each prompt, we will compare different providers using the PremAI playground.
[Figure: the set of evaluation prompts]
Prompt 1: Argumentation Capabilities
Write a paradox
Llama 2-7b Chat Comparisons: Anyscale vs Replicate vs FireworksAI vs Together
Winner: Anyscale. Its Llama 2-7b Chat produced a verbose and interesting response, the "barber's paradox", and also gave the rationale behind the paradox's statement.
Llama 2-70b Chat Comparisons: Groq vs OctoAI vs Anyscale vs Perplexity
Result: Tie. Almost all providers (except Perplexity) produce the Barber/Liar paradoxes and attach the rationale. Perplexity may have inference bugs.
CodeLlama-7b Chat Comparisons: Cloudflare vs OctoAI vs TogetherAI
Winner: None. Even though CodeLlama came up with interesting interpretations of the word 'paradox', the correctness of all of these responses is debatable. Note also that CodeLlama is intended for generating code rather than natural text.
Prompt 2: General Knowledge
Who was Aldous Huxley? What was he known for?
Llama 2-7b Chat Comparisons: Anyscale vs Replicate vs FireworksAI vs Together
Winner: TogetherAI, for its higher tok/s compared to Anyscale, with both being similarly verbose. FireworksAI produced a sort of 'monologue' before it delved into answering the prompt.
Llama 2-70b Chat Comparisons: Groq vs OctoAI vs Anyscale vs Perplexity
Winner: Groq, for the highest tok/s (no competition on that front) and the overall correctness and verbosity of its response.
CodeLlama-7b Chat Comparisons: Cloudflare vs OctoAI vs TogetherAI
Winner: Cloudflare, since it stuck to the topic and gave the most information on Aldous Huxley. The other providers, OctoAI and Together, either deviated from the important details or generated verbatim URLs memorized from the training data, which could signify training-data leakage if the URLs are valid (a quick validity check is sketched below).
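To make that leakage hypothesis testable, one could extract the URLs from a completion and probe whether they actually resolve. The sketch below is our own illustration (the function name and heuristics are assumptions, and it relies on the third-party `requests` library):

```python
import re

import requests  # third-party: pip install requests

URL_RE = re.compile(r"https?://\S+")


def check_generated_urls(completion: str, timeout: float = 5.0) -> dict:
    """Probe every URL in a completion and record whether it resolves.

    A URL that actually resolves suggests the model reproduced it verbatim
    from its training data rather than hallucinating it.
    """
    results = {}
    for url in URL_RE.findall(completion):
        url = url.rstrip(".,;)\"'")  # drop trailing punctuation from the match
        try:
            resp = requests.head(url, timeout=timeout, allow_redirects=True)
            results[url] = resp.status_code < 400
        except requests.RequestException:
            results[url] = False
    return results
```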
Prompt 3: Logic and Consistency
Write a brief story (at most 3 sentences) without the words hell and mad, and with one negation sentence.
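Unlike the earlier prompts, this one has constraints a script can check mechanically before any qualitative comparison. Below is a minimal sketch under our own assumptions: the sentence splitter and the negation cues are deliberately crude heuristics, not the method used in the playground.

```python
import re

FORBIDDEN = {"hell", "mad"}
# Crude negation cues; a real evaluation would use a proper parser.
NEGATION_RE = re.compile(r"\b(not|never|no|nothing|nobody)\b|n't\b", re.IGNORECASE)


def check_story(story: str) -> dict:
    """Check a generated story against the prompt's three constraints."""
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", story.strip()) if s]
    words = {w.lower().strip(".,!?\"'") for w in story.split()}
    negated = [s for s in sentences if NEGATION_RE.search(s)]
    return {
        "at_most_3_sentences": len(sentences) <= 3,
        "avoids_forbidden_words": FORBIDDEN.isdisjoint(words),
        "exactly_one_negation_sentence": len(negated) == 1,
    }
```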
Llama 2-7b Chat Comparisons: Anyscale vs Replicate vs FireworksAI vs Together
Winner: Anyscale, because it had the most coherent story. FireworksAI and Replicate made it conspicuous that a 'negation' was at play (Fireworks especially, by using the word directly). Together did not adhere to the story's language restrictions, resulting in a lower ranking in our empirical evaluation. Moreover, Replicate's story was incoherent, with some segments of the plot feeling superficial.
Llama 2-70b Chat Comparisons: Groq vs OctoAI vs Anyscale vs Perplexity
Winner: Anyscale, thanks to its comparatively stronger story generation. While Groq demonstrated impressive speed, its output lacked consistency, reading as three disconnected sentences. OctoAI exhibited negation issues right from the first sentence. Finally, Perplexity appears to be experiencing persistent backend problems that degrade its output.
CodeLlama-7b Chat Comparisons: Cloudflare vs OctoAI vs TogetherAI
Winner: OctoAI, which excelled with a highly coherent narrative. In contrast, Cloudflare's story faltered in the third sentence, leaving the plot incomplete, and while Together produced a more verbose narrative, it deviated from the expected storywriting style. Although all three models struggled to fully capture the concept of negation in their third sentences, OctoAI delivered the most satisfying story overall.
Prompt 4: Reasoning
Write a termination letter. Reason: staff cutbacks.
Llama 2-7b Chat Comparisons: Anyscale vs Replicate vs FireworksAI vs Together
Winner: FireworksAI, by elimination. Anyscale's sentences are repetitive, Together followed an incorrect letter format, and Replicate's output is cut short (the scroll bar's length gives an empirical hint).
Llama 2-70b Chat Comparisons: Groq vs OctoAI vs Anyscale vs Openrouter
Winner: OctoAI, which exaggerated the least and included all the required details in the termination letter. Anyscale and Openrouter exaggerated heavily in their opening paragraphs. Lastly, Groq, despite its higher tok/s, did not accurately lay out the next steps for the employee.
CodeLlama-7b Chat Comparisons: Cloudflare vs OctoAI vs TogetherAI
Winner: Cloudflare. Together turned the letter into a first-person account of events, and OctoAI left out the reason for termination, while Cloudflare's letter was clear and concise.
Prompt 5: Logic and Simple Math
42 birds are sitting on a tree branch. A hunter passes, shoots one dead, and misses two. How many birds are left on the branch?
Llama 2-7b Chat Comparisons: Anyscale vs Replicate vs FireworksAI vs Together
Winner: Anyscale, since it attended to the prompt and answered it correctly while giving the rationale behind its response. The other providers either refused to answer or gave incorrect responses (such as 43, even though there were at most 42 birds).
Llama 2-70b Chat Comparisons: Groq vs OctoAI vs Anyscale vs Perplexity
Winner: Groq, since, unlike Anyscale, it did not skip the rationale; both provided the correct response as expected. OctoAI answered 42, which is factually incorrect, and Perplexity generated a redundant response.
CodeLlama-7b Chat Comparisons: Cloudflare vs OctoAI vs TogetherAI
Winner: None. All three providers, despite varying verbosity levels, gave the wrong answer.
Concluding Observations
Anyscale shines in logic, consistency, and simple math prompts.
Groq stands out in general knowledge and simple math for its speed and accuracy.
Cloudflare is noted for its detailed responses in general knowledge and reasoning.
Thus, each provider displays varied strengths, underlining that the "best" choice depends heavily on the task's specific requirements. While Anyscale and Groq frequently emerge as winners in their respective categories, provider selection should weigh the application's particular needs, such as the relative importance of verbosity, correctness, and speed.