Published on 2025-01-29 21:30
In a recent exploration of AI reasoning and inference with Large Language Models (LLMs), I ran an experiment comparing inference capabilities across the following models:
Groq (hosting DeepSeek-R1-Distill-Llama-70b)
DeepSeek AI
Alibaba Qwen-Plus
Google Gemini Flash 2.0 Thinking Experimental Edition
HuggingFace's WebGPU (DeepSeek-R1-Distill-Qwen-1.5B)
OpenAI (o1-preview)
Microsoft (Phi-4)
Cerebras (hosting DeepSeek-R1-Distill-Llama-70b)
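Several of these providers expose OpenAI-compatible chat endpoints (Groq, DeepSeek, and Cerebras do; Alibaba offers a compatibility mode via DashScope), so a single harness can query them uniformly. Here is a minimal sketch under that assumption; the base URLs, environment-variable names, and model IDs are illustrative and should be verified against each provider's documentation:

```python
import os
from openai import OpenAI  # pip install openai

# (base_url, API-key env var, model id) per provider -- illustrative values,
# to be checked against each provider's docs.
PROVIDERS = {
    "groq":     ("https://api.groq.com/openai/v1", "GROQ_API_KEY",
                 "deepseek-r1-distill-llama-70b"),
    "deepseek": ("https://api.deepseek.com", "DEEPSEEK_API_KEY",
                 "deepseek-reasoner"),
    "qwen":     ("https://dashscope.aliyuncs.com/compatible-mode/v1",
                 "DASHSCOPE_API_KEY", "qwen-plus"),
    "cerebras": ("https://api.cerebras.ai/v1", "CEREBRAS_API_KEY",
                 "deepseek-r1-distill-llama-70b"),
}

def ask(provider: str, prompt: str) -> str:
    """Send one prompt to one provider and return the raw text reply."""
    base_url, key_env, model = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Each of the two test prompts below can then be looped over every provider with the same `ask` call; the remaining models (Gemini, the in-browser WebGPU demo, o1-preview, Phi-4) need their own clients or UIs.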
🧪 Test 1 Prompt: Given the following information:
John is a sibling of Jane.
Jane is a sibling of James.
James has the email address xyz@example.com.
Jimmy has the email address xyz@example.com.
Question: How many siblings does Jimmy have?
💡 This test checks whether models rely on pattern matching (mistaking "Jimmy" for "James") or on logical inference.
🧪 Test 2 Prompt: Given the following information:
John is a sibling of Jane.
Jane is a sibling of James.
James has the email address xyz@example.com.
David has the email address xyz@example.com.
Question: How many siblings does David have?
💡 Here, pattern-matching models fail, while models applying relationship-based reasoning succeed.
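To make the intended inference explicit, here is a small self-contained sketch of the relationship-based reasoning both tests probe (the function and data-structure names are mine, purely for illustration): first resolve identity through the shared email address, then count siblings through the symmetric, transitive sibling relation.

```python
# Facts from Test 1; for Test 2, replace "Jimmy" with "David".
sibling_pairs = [("John", "Jane"), ("Jane", "James")]
emails = {"James": "xyz@example.com", "Jimmy": "xyz@example.com"}

def sibling_count(person: str) -> int:
    # Step 1: identity resolution -- two names sharing an email address are
    # taken to denote the same individual (the key inference step).
    aliases = {person} | {
        other for other, addr in emails.items()
        if addr == emails.get(person)
    }
    # Step 2: close the sibling relation under symmetry and transitivity by
    # growing the group through the stated sibling pairs.
    group = set(aliases)
    changed = True
    while changed:
        changed = False
        for a, b in sibling_pairs:
            if (a in group) != (b in group):
                group |= {a, b}
                changed = True
    # Siblings are the other individuals in the group; aliases collapse into
    # one person, so they are excluded from the count.
    return len(group - aliases)

print(sibling_count("Jimmy"))  # 2 -- Jimmy resolves to James, so John and Jane
```

The same code answers Test 2 unchanged once "Jimmy" is replaced by "David": nothing depends on name similarity, only on the email-based identity and the sibling graph, which is exactly what separates reasoning from pattern matching here.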
❌ Groq (DeepSeek-R1-Distill-Llama-70b): passed the first test but failed the second, relying on prior knowledge rather than semantic inference.
❌ DeepSeek AI: relied on name similarity rather than relationship logic.
✅ Alibaba Qwen-Plus: applied relationship-based reasoning correctly.
✅ Google Gemini Flash 2.0 Thinking Experimental Edition: combined speed and accuracy in understanding the relationships.
✅ HuggingFace WebGPU (DeepSeek-R1-Distill-Qwen-1.5B): reasoned correctly without pattern-matching biases.
✅ OpenAI (o1-preview): reasoned correctly without pattern-matching biases.
✅ Microsoft (Phi-4): reasoned correctly without pattern-matching biases.
✅ Cerebras (hosting DeepSeek-R1-Distill-Llama-70b): reasoned correctly without pattern-matching biases.
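For scoring, each model's final numeric answer has to be pulled out of an often verbose reply. Below is a minimal grading sketch, assuming the expected answer to both tests is 2 (Jimmy or David resolves to James, whose siblings are John and Jane); the last-number regex convention and the stripping of R1-style `<think>` traces are my own illustration, not part of any provider's API:

```python
import re

def grade(response: str, expected: int = 2) -> bool:
    """Return True if the last number in the visible answer matches."""
    # Drop any chain-of-thought block that R1-style models emit first.
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    numbers = re.findall(r"\b\d+\b", visible)
    return bool(numbers) and int(numbers[-1]) == expected

print(grade("Jimmy is James, so he has 2 siblings."))                # True
print(grade("Jimmy likely has 3 siblings: John, Jane, and James."))  # False
```

Taking the last number is a crude convention (models usually restate the answer at the end); a production harness would want stricter answer formatting.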
Pattern matching ≠ reasoning. Some models relied too much on surface-level data rather than relationship logic.
Not all LLMs reason equally. Some adapted well to new information, while others failed due to training biases.
AI's ability to infer relationships matters. The best-performing models incorporated semantic understanding rather than just data retrieval.
🔗 OPAL Session: Testing Google Gemini
🔗 Perplexity AI: DeepSeek Testing
🔗 HuggingFace WebGPU (DeepSeek-R1-Distill-Qwen-1.5B)
🔗 OPAL Session: Testing OpenAI's o1-preview
🔗 Microsoft Phi-4
💬 What do you think about these AI reasoning capabilities? Are you seeing similar patterns in your experiments? Let's discuss! ⬇️