Published on 2025-01-29 21:30
In a recent exploration of AI reasoning and inference with Large Language Models (LLMs), I ran an experiment comparing inference capabilities across the following models:
Groq (hosting DeepSeek-R1-Distill-Llama-70b)
DeepSeek AI
Alibaba Qwen-Plus
Google Gemini Flash 2.0 Thinking Experimental Edition
HuggingFace's WebGPU (DeepSeek-R1-Distill-Qwen-1.5B)
OpenAI (o1-preview)
Microsoft (Phi-4)
Cerebras (hosting DeepSeek-R1-Distill-Llama-70b)
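Several of these providers expose OpenAI-compatible chat endpoints (Groq, DeepSeek, and Cerebras do; Alibaba offers a compatibility mode via DashScope), so a single harness can query them uniformly. Here is a minimal sketch under that assumption; the base URLs, environment-variable names, and model IDs are illustrative and should be verified against each provider's documentation:

```python
import os
from openai import OpenAI  # pip install openai

# (base_url, API-key env var, model id) per provider -- illustrative values,
# to be checked against each provider's docs.
PROVIDERS = {
    "groq":     ("https://api.groq.com/openai/v1", "GROQ_API_KEY",
                 "deepseek-r1-distill-llama-70b"),
    "deepseek": ("https://api.deepseek.com", "DEEPSEEK_API_KEY",
                 "deepseek-reasoner"),
    "qwen":     ("https://dashscope.aliyuncs.com/compatible-mode/v1",
                 "DASHSCOPE_API_KEY", "qwen-plus"),
    "cerebras": ("https://api.cerebras.ai/v1", "CEREBRAS_API_KEY",
                 "deepseek-r1-distill-llama-70b"),
}

def ask(provider: str, prompt: str) -> str:
    """Send one prompt to one provider and return the raw text reply."""
    base_url, key_env, model = PROVIDERS[provider]
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

Each of the two test prompts below can then be looped over every provider with the same `ask` call; the remaining models (Gemini, the in-browser WebGPU demo, o1-preview, Phi-4) need their own clients or UIs.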
🧪 Test 1 Prompt: Given the following information:
John is a sibling of Jane.
Jane is a sibling of James.
James has the email address xyz@example.com.
Jimmy has the email address xyz@example.com.
Question: How many siblings does Jimmy have?
💡 This test checks whether models rely on pattern matching (mistaking "Jimmy" for "James") or on logical inference.
🧪 Test 2 Prompt: Given the following information:
John is a sibling of Jane.
Jane is a sibling of James.
James has the email address xyz@example.com.
David has the email address xyz@example.com.
Question: How many siblings does David have?
💡 Here, pattern-matching models fail, while models applying relationship-based reasoning succeed.
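To make the intended inference explicit, here is a small self-contained sketch of the relationship-based reasoning both tests probe (the function and data-structure names are mine, purely for illustration): first resolve identity through the shared email address, then count siblings through the symmetric, transitive sibling relation.

```python
# Facts from Test 1; for Test 2, replace "Jimmy" with "David".
sibling_pairs = [("John", "Jane"), ("Jane", "James")]
emails = {"James": "xyz@example.com", "Jimmy": "xyz@example.com"}

def sibling_count(person: str) -> int:
    # Step 1: identity resolution -- two names sharing an email address are
    # taken to denote the same individual (the key inference step).
    aliases = {person} | {
        other for other, addr in emails.items()
        if addr == emails.get(person)
    }
    # Step 2: close the sibling relation under symmetry and transitivity by
    # growing the group through the stated sibling pairs.
    group = set(aliases)
    changed = True
    while changed:
        changed = False
        for a, b in sibling_pairs:
            if (a in group) != (b in group):
                group |= {a, b}
                changed = True
    # Siblings are the other individuals in the group; aliases collapse into
    # one person, so they are excluded from the count.
    return len(group - aliases)

print(sibling_count("Jimmy"))  # 2 -- Jimmy resolves to James, so John and Jane
```

The same code answers Test 2 unchanged once "Jimmy" is replaced by "David": nothing depends on name similarity, only on the email-based identity and the sibling graph, which is exactly what separates reasoning from pattern matching here.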
❌ Groq (DeepSeek-R1-Distill-Llama-70b): passed the first test but failed the second, relying on prior knowledge rather than semantic inference.
❌ DeepSeek AI: relied on name similarity rather than relationship logic.
✅ Alibaba Qwen-Plus: applied relationship-based reasoning correctly.
✅ Google Gemini Flash 2.0 Thinking Experimental Edition: combined speed and accuracy in understanding the relationships.
✅ HuggingFace WebGPU (DeepSeek-R1-Distill-Qwen-1.5B): reasoned correctly without pattern-matching biases.
✅ OpenAI (o1-preview): reasoned correctly without pattern-matching biases.
✅ Microsoft (Phi-4): reasoned correctly without pattern-matching biases.
✅ Cerebras (hosting DeepSeek-R1-Distill-Llama-70b): reasoned correctly without pattern-matching biases.
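For scoring, each model's final numeric answer has to be pulled out of an often verbose reply. Below is a minimal grading sketch, assuming the expected answer to both tests is 2 (Jimmy or David resolves to James, whose siblings are John and Jane); the last-number regex convention and the stripping of R1-style `<think>` traces are my own illustration, not part of any provider's API:

```python
import re

def grade(response: str, expected: int = 2) -> bool:
    """Return True if the last number in the visible answer matches."""
    # Drop any chain-of-thought block that R1-style models emit first.
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    numbers = re.findall(r"\b\d+\b", visible)
    return bool(numbers) and int(numbers[-1]) == expected

print(grade("Jimmy is James, so he has 2 siblings."))                # True
print(grade("Jimmy likely has 3 siblings: John, Jane, and James."))  # False
```

Taking the last number is a crude convention (models usually restate the answer at the end); a production harness would want stricter answer formatting.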
Pattern matching ≠ reasoning. Some models relied too much on surface-level data rather than relationship logic.
Not all LLMs reason equally. Some adapted well to new information, while others failed due to training biases.
AI's ability to infer relationships matters. The best-performing models incorporated semantic understanding rather than just data retrieval.
🔗 OPAL Session: Testing Google Gemini
🔗 Perplexity AI: DeepSeek Testing
🔗 HuggingFace WebGPU (DeepSeek-R1-Distill-Qwen-1.5B)
🔗 OPAL Session: Testing OpenAI's o1-preview
🔗 Microsoft Phi-4
💬 What do you think about these AI reasoning capabilities? Are you seeing similar patterns in your experiments? Let's discuss! ⬇️