AI & Data Driven Enterprise
A collection of hands-on, demonstration-heavy posts about the practical intersection of AI, Data, and Knowledge

DALL-E Generated

GenAI Inference Experiment comparing services from Groq, DeepSeek, Alibaba, OpenAI, Microsoft, Google, and Cerebras

Created on 2025-01-29 16:38

Published on 2025-01-29 21:30

In a recent exploration of AI reasoning & inference, as facilitated via Large Language Models (LLMs), I conducted an experiment comparing inference capabilities across services from Groq, DeepSeek, Alibaba, Google, HuggingFace, OpenAI, Microsoft, and Cerebras.
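For anyone who wants to reproduce this kind of comparison, several of the services above expose OpenAI-compatible chat-completions endpoints, so the same prompt can be fanned out with a single request shape. The sketch below only constructs the request payloads; the base URLs and model IDs are my assumptions and should be verified against each provider's current documentation before use.

```python
# Build identical chat-completions payloads for several providers.
# ASSUMPTION: the base URLs and model IDs below are illustrative and
# must be checked against each provider's documentation.
ENDPOINTS = {
    "Groq": ("https://api.groq.com/openai/v1", "deepseek-r1-distill-llama-70b"),
    "DeepSeek": ("https://api.deepseek.com/v1", "deepseek-reasoner"),
    "Cerebras": ("https://api.cerebras.ai/v1", "deepseek-r1-distill-llama-70b"),
}

def build_request(service, prompt):
    """Return the URL and JSON body for one provider; nothing is sent here."""
    base_url, model = ENDPOINTS[service]
    return {
        "url": f"{base_url}/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,  # keep answers as deterministic as possible
        },
    }

for service in ENDPOINTS:
    print(build_request(service, "How many siblings does David have?")["url"])
```

The same payload can then be POSTed to each URL (with the provider's API key) to collect side-by-side answers.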

The Experiment

Basic Test: Familial Relationships & Email Matching

πŸ‘‰ Prompt: Given the following information:

Question: How many siblings does Jimmy have?

πŸ“Œ This test checks whether models rely on pattern matching (mistaking "Jimmy" for "James") or on logical inference.
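To make the distinction concrete, here is a minimal sketch of relationship-based inference in Python. The family facts are hypothetical stand-ins (the experiment's actual prompt data appears only in the screenshots), but the principle is the same: siblings are derived from stated parent relations, and "Jimmy" is never conflated with "James".

```python
# A sketch of relationship-based sibling inference. The facts below are
# hypothetical stand-ins for the experiment's actual prompt data.
PARENTS = {
    "James": {"Mary", "John"},
    "Sarah": {"Mary", "John"},
    "Jimmy": {"Alice", "Robert"},  # a distinct individual, despite the similar name
}

def count_siblings(person):
    """Count siblings strictly from stated parent facts.

    Returns None for an unknown person: a reasoning model should say
    "unknown" rather than guess from a similar-looking name.
    """
    if person not in PARENTS:
        return None
    return sum(
        1
        for other, parents in PARENTS.items()
        if other != person and parents & PARENTS[person]
    )

print(count_siblings("James"))  # 1 (Sarah)
print(count_siblings("Jimmy"))  # 0, not James's answer
```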

Trickier Test: Introducing a New Name

πŸ‘‰ Prompt: Given the following information:

Question: How many siblings does David have?

πŸ“Œ Here, pattern-matching models fail, while models applying relationship-based reasoning succeed.
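The failure mode is easy to simulate. The sketch below (again with hypothetical stand-in facts) contrasts a fuzzy name-matching shortcut with inference over stated parent relations: the shortcut conflates "Jimmy" with "James" and has nothing to latch onto for "David", while the relational path answers correctly for any name defined by the facts.

```python
import difflib

# Hypothetical stand-in facts; "David" appears only via parent relations.
PARENTS = {
    "James": {"Mary", "John"},
    "Sarah": {"Mary", "John"},
    "David": {"Mary", "John"},
}
# Suppose the prompt also states an explicit sibling count for James only.
STATED_COUNTS = {"James": 2}

def pattern_match_answer(person):
    """The shortcut: resolve the query by surface name similarity."""
    hit = difflib.get_close_matches(person, STATED_COUNTS, n=1, cutoff=0.4)
    return STATED_COUNTS[hit[0]] if hit else None

def relational_answer(person):
    """The inference path: derive siblings from shared parents."""
    if person not in PARENTS:
        return None
    return sum(
        1
        for other, parents in PARENTS.items()
        if other != person and parents & PARENTS[person]
    )

print(pattern_match_answer("Jimmy"))  # 2: wrongly treats "Jimmy" as "James"
print(pattern_match_answer("David"))  # None: no similar name to latch onto
print(relational_answer("David"))     # 2: James and Sarah, via shared parents
```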

Results & Model Performance

πŸ“ Models that struggled:

❌ Groq (DeepSeek-R1-Distill-Llama-70b) – Passed the first test but failed the second, relying on prior knowledge rather than semantic inference.

Test 1 -- Passed
Test 2 -- Failed
Test 3 -- Passed with additional information provided

❌ DeepSeek AI – Relied on name similarity rather than relationship logic.

Test 1 -- Failed
Test 2 -- Passed with additional information provided

βœ… Models that passed:

βœ” Alibaba Qwen-Plus – Applied relationship-based reasoning correctly.

Qwen-Plus Passing Test

βœ” Google Gemini 2.0 Flash Thinking Experimental – Combined speed and accuracy in understanding relationships.

Google Gemini, via the OpenLink AI Layer (OPAL), Passing Test

βœ” HuggingFace WebGPU (DeepSeek-R1-Distill-Qwen-1.5B) – Correctly reasoned without pattern-matching biases.

HuggingFace WebGPU Passing Test

βœ” OpenAI (o1-preview) – Correctly reasoned without pattern-matching biases.

OpenAI's o1-preview Passing Test

βœ” Microsoft (phi-4) – Correctly reasoned without pattern-matching biases.

Microsoft's Phi-4 Passing Test

βœ” Cerebras (hosting DeepSeek-R1-Distill-Llama-70b) – Correctly reasoned without pattern-matching biases.

Key Takeaways

Several models still default to name-based pattern matching, conflating "Jimmy" with "James" or stumbling on a name they have not seen before ("David").

Models that reason over the stated relationships answered correctly, regardless of whether the queried name was familiar.

When a model failed, supplying additional clarifying information was usually enough to get it to the correct answer.

Related Resources

πŸ“Œ OPAL Session: Testing Google Gemini

πŸ“Œ Perplexity AI: DeepSeek Testing

πŸ“Œ HuggingFace WebGPU (DeepSeek-R1-Distill-Qwen-1.5B)

πŸ“Œ OPAL Session: Testing OpenAI's o1-preview

πŸ“Œ Microsoft Phi-4

πŸš€ What do you think about these AI reasoning capabilities? Are you seeing similar patterns in your experiments? Let’s discuss! ⬇