What I Did & Models I Compared
I ran a structured evaluation of responses generated by multiple AI models, opening separate browser tabs for each to ensure a fair, side-by-side comparison. The models I tested:
- ChatGPT o1 Pro Mode
- ChatGPT o1
- GPT-4.5
- ChatGPT o3-mini
- ChatGPT o3-mini-high
- Claude 3.7 Sonnet (Extended Thinking Mode)
This framework can be used with any models of your choice to compare responses based on specific evaluation criteria.
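If you would rather script the collection step than juggle browser tabs, the sketch below shows one way to gather responses programmatically. It is a minimal sketch, assuming the OpenAI and Anthropic Python SDKs with API keys set in the environment; the model IDs, labels, and the `PROMPT` placeholder are illustrative, so swap in whichever models you actually want to compare.

```python
# Sketch only: gather responses to one prompt from several models via API.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set; model IDs are placeholders.
from openai import OpenAI
import anthropic

PROMPT = "<Insert the exact user query or instructions here>"

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()


def ask_openai(model: str, prompt: str) -> str:
    """Send the prompt to an OpenAI chat model and return its text reply."""
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_anthropic(model: str, prompt: str) -> str:
    """Send the prompt to an Anthropic model and return its text reply."""
    msg = anthropic_client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


# Placeholder model IDs -- substitute whatever your accounts actually expose.
responses = {
    "LLM A": ask_openai("o1", PROMPT),
    "LLM B": ask_openai("o3-mini", PROMPT),
    "LLM C": ask_anthropic("claude-3-7-sonnet-20250219", PROMPT),
}

for label, text in responses.items():
    print(f"--- {label} ---\n{text}\n")
```

Collecting everything into one labeled dictionary makes the later step of pasting responses into the evaluator prompt mechanical.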
Role/Context Setup
You are an impartial and highly specialized evaluator of large language model outputs. Your goal is to provide a clear, data-driven comparison of multiple responses to the same initial prompt or question.
Input Details
- You have an original prompt (the user’s initial question or task).
- You have N responses (e.g., from LLM A, LLM B, LLM C, etc.).
- Each response addresses the same initial prompt and needs to be evaluated across objective criteria such as:
  - Accuracy & Relevance: Does the response precisely address the prompt’s requirements and content?
  - Depth & Comprehensiveness: Does it cover the key points thoroughly, with strong supporting details or explanations?
  - Clarity & Readability: Is it well-structured, coherent, and easy to follow?
  - Practicality & Actionable Insights: Does it offer usable steps, code snippets, or clear recommendations?
Task
- Critically Analyze each of the N responses in detail, focusing on the criteria above. For each response, explain what it does well and where it may be lacking.
- Compare & Contrast the responses:
  - Highlight similarities, differences, and unique strengths.
  - Provide specific examples (e.g., one response supplies a working script while another only outlines conceptual steps).
- Rank the responses from best to worst, or in a clear order of performance. Justify your ranking with a concise rationale linked directly to the criteria (accuracy, depth, clarity, practicality).
- Summarize your findings:
  - Why did the top-ranked model outperform the others?
  - What improvements could each model make?
  - What final recommendation would you give to someone trying to select the most useful response?
Style & Constraints
- Remain strictly neutral and evidence-based.
- Avoid personal bias or brand preference.
- Organize your final analysis under clear headings, so it’s easy to read and understand.
- If helpful, use bullet points, tables, or itemized lists to compare the responses.
- Finish with a concise conclusion and actionable next steps.
How to Use This Meta-Prompt
- Insert Your Initial Prompt: Replace references to “the user’s initial question or task” with the actual text of your original prompt.
- Provide the LLM Responses: Insert the full text of each LLM response under clear labels (e.g., “Response A,” “Response B,” etc.).
- Ask the Model: Provide these instructions to your chosen evaluator model (it can even be the same LLM or a different one) and request a structured comparison; a minimal scripting sketch follows this list.
- Review & Iterate: If you want more detail on specific aspects of the responses, include sub-questions (e.g., “Which code snippet is more detailed?” or “Which approach is more aligned with real-world best practices?”).
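For those who want to automate steps 1–3, here is a minimal sketch that fills a simplified version of the template above with the original prompt and the labeled responses, then sends it to a single evaluator model. The evaluator model ID is a placeholder, and `responses` is assumed to be the labeled dictionary built in the earlier collection sketch.

```python
# Sketch only: assemble the evaluator prompt and send it to one judge model.
# EVALUATOR_MODEL is a placeholder; `responses` is the labeled dict from the
# collection sketch above.
from openai import OpenAI

EVALUATOR_MODEL = "o1"  # placeholder -- any capable model can act as the judge

EVALUATION_TEMPLATE = """You are an impartial and highly specialized evaluator of large language model outputs.

Original Prompt:
{original_prompt}

Responses:
{labeled_responses}

Task:
1. Critically analyze each response for accuracy, depth, clarity, and practicality.
2. Compare and contrast the responses, with specific examples.
3. Rank them from best to worst, with explicit justification.
4. Summarize why the top model is superior and how each model can improve.

Remain strictly neutral and evidence-based, and organize the analysis under clear headings."""


def build_evaluator_prompt(original_prompt: str, responses: dict[str, str]) -> str:
    """Fill the template with the original prompt and the labeled responses."""
    labeled = "\n\n".join(f"### {label}\n{text}" for label, text in responses.items())
    return EVALUATION_TEMPLATE.format(
        original_prompt=original_prompt, labeled_responses=labeled
    )


def run_evaluation(original_prompt: str, responses: dict[str, str]) -> str:
    """Ask the evaluator model for the structured comparison."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    result = client.chat.completions.create(
        model=EVALUATOR_MODEL,
        messages=[
            {"role": "user", "content": build_evaluator_prompt(original_prompt, responses)}
        ],
    )
    return result.choices[0].message.content


# Example usage, reusing PROMPT and responses from the collection sketch:
# print(run_evaluation(PROMPT, responses))
```

Keeping the template in one string makes it easy to iterate on the sub-questions from step 4 without touching the rest of the pipeline.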
Sample Usage
Evaluator Prompt
- Original Prompt: “<Insert the exact user query or instructions here>”
- Responses:
  - LLM A: “<Complete text of A’s response>”
  - LLM B: “<Complete text of B’s response>”
  - LLM C: “<Complete text of C’s response>”
  - LLM D: “<Complete text of D’s response>”
  - LLM E: “<Complete text of E’s response>”
Evaluation Task
- Critically analyze each response based on accuracy, depth, clarity, and practical usefulness.
- Compare the responses, highlighting any specific strengths or weaknesses.
- Rank them from best to worst, with explicit justification.
- Summarize why the top model is superior, and how each model can improve.
Please produce a structured, unbiased, and data-driven final answer.
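If you additionally ask the evaluator to emit per-criterion scores (say, 1–10 for accuracy, depth, clarity, and practicality), you can turn its output into a ranking mechanically. The weights and scores in the sketch below are made-up placeholders purely to illustrate the aggregation.

```python
# Sketch only: turn per-criterion scores from the evaluator into a weighted ranking.
# The weights and scores below are made-up placeholders for illustration.
WEIGHTS = {"accuracy": 0.35, "depth": 0.25, "clarity": 0.20, "practicality": 0.20}

scores = {
    "LLM A": {"accuracy": 9, "depth": 8, "clarity": 7, "practicality": 8},
    "LLM B": {"accuracy": 7, "depth": 9, "clarity": 8, "practicality": 6},
    "LLM C": {"accuracy": 8, "depth": 7, "clarity": 9, "practicality": 9},
}


def weighted_score(per_criterion: dict[str, float]) -> float:
    """Weighted sum of the four criteria; higher is better."""
    return sum(WEIGHTS[c] * per_criterion[c] for c in WEIGHTS)


ranking = sorted(scores, key=lambda model: weighted_score(scores[model]), reverse=True)
for rank, model in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {weighted_score(scores[model]):.2f}")
```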
Happy Prompting! Let me know if you find this useful!