What I Did & Models I Compared
I ran a structured evaluation of responses generated by multiple AI models, opening separate browser tabs for each to ensure a fair, side-by-side comparison. The models I tested:
- ChatGPT o1 Pro Mode
- ChatGPT o1
- GPT-4.5
- ChatGPT o3-mini
- ChatGPT o3-mini-high
- Claude 3.7 Sonnet (Extended Thinking Mode)
This framework can be used with any models of your choice to compare responses based on specific evaluation criteria.
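If you would rather script the collection step than juggle browser tabs, the sketch below shows one way to gather responses programmatically. It is a minimal sketch, assuming the OpenAI and Anthropic Python SDKs with API keys set in the environment; the model IDs, labels, and the `PROMPT` placeholder are illustrative, so swap in whichever models you actually want to compare.

```python
# Sketch only: gather responses to one prompt from several models via API.
# Assumes OPENAI_API_KEY and ANTHROPIC_API_KEY are set; model IDs are placeholders.
from openai import OpenAI
import anthropic

PROMPT = "<Insert the exact user query or instructions here>"

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()


def ask_openai(model: str, prompt: str) -> str:
    """Send the prompt to an OpenAI chat model and return its text reply."""
    resp = openai_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def ask_anthropic(model: str, prompt: str) -> str:
    """Send the prompt to an Anthropic model and return its text reply."""
    msg = anthropic_client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


# Placeholder model IDs -- substitute whatever your accounts actually expose.
responses = {
    "LLM A": ask_openai("o1", PROMPT),
    "LLM B": ask_openai("o3-mini", PROMPT),
    "LLM C": ask_anthropic("claude-3-7-sonnet-20250219", PROMPT),
}

for label, text in responses.items():
    print(f"--- {label} ---\n{text}\n")
```

Collecting everything into one labeled dictionary makes the later step of pasting responses into the evaluator prompt mechanical.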
Role/Context Setup
You are an impartial and highly specialized evaluator of large language model outputs. Your goal is to provide a clear, data-driven comparison of multiple responses to the same initial prompt or question.
Input Details
- You have an original prompt (the user’s initial question or task).
- You have N responses (e.g., from LLM A, LLM B, LLM C, etc.).
- Each response addresses the same initial prompt and needs to be evaluated across objective criteria such as:
  - Accuracy & Relevance: Does the response precisely address the prompt’s requirements and content?
  - Depth & Comprehensiveness: Does it cover the key points thoroughly, with strong supporting details or explanations?
  - Clarity & Readability: Is it well-structured, coherent, and easy to follow?
  - Practicality & Actionable Insights: Does it offer usable steps, code snippets, or clear recommendations?
Task
- Critically Analyze each of the N responses in detail, focusing on the criteria above. For each response, explain what it does well and where it may be lacking.
- Compare & Contrast the responses:
  - Highlight similarities, differences, and unique strengths.
  - Provide specific examples (e.g., one response supplies a working script while another only outlines conceptual steps).
- Rank the responses from best to worst, or in a clear order of performance. Justify your ranking with a concise rationale linked directly to the criteria (accuracy, depth, clarity, practicality).
- Summarize your findings:
  - Why did the top-ranked model outperform the others?
  - What improvements could each model make?
  - What final recommendation would you give to someone trying to select the most useful response?
Style & Constraints
- Remain strictly neutral and evidence-based.
- Avoid personal bias or brand preference.
- Organize your final analysis under clear headings, so it’s easy to read and understand.
- If helpful, use bullet points, tables, or itemized lists to compare the responses.
- Finish with a concise conclusion and actionable next steps.
How to Use This Meta-Prompt
- Insert Your Initial Prompt: Replace references to “the user’s initial question or task” with the actual text of your original prompt.
- Provide the LLM Responses: Insert the full text of each LLM response under clear labels (e.g., “Response A,” “Response B,” etc.).
- Ask the Model: Provide these instructions to your chosen evaluator model (it can even be the same LLM or a different one) and request a structured comparison; a minimal scripting sketch follows this list.
- Review & Iterate: If you want more detail on specific aspects of the responses, include sub-questions (e.g., “Which code snippet is more detailed?” or “Which approach is more aligned with real-world best practices?”).
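For those who want to automate steps 1–3, here is a minimal sketch that fills a simplified version of the template above with the original prompt and the labeled responses, then sends it to a single evaluator model. The evaluator model ID is a placeholder, and `responses` is assumed to be the labeled dictionary built in the earlier collection sketch.

```python
# Sketch only: assemble the evaluator prompt and send it to one judge model.
# EVALUATOR_MODEL is a placeholder; `responses` is the labeled dict from the
# collection sketch above.
from openai import OpenAI

EVALUATOR_MODEL = "o1"  # placeholder -- any capable model can act as the judge

EVALUATION_TEMPLATE = """You are an impartial and highly specialized evaluator of large language model outputs.

Original Prompt:
{original_prompt}

Responses:
{labeled_responses}

Task:
1. Critically analyze each response for accuracy, depth, clarity, and practicality.
2. Compare and contrast the responses, with specific examples.
3. Rank them from best to worst, with explicit justification.
4. Summarize why the top model is superior and how each model can improve.

Remain strictly neutral and evidence-based, and organize the analysis under clear headings."""


def build_evaluator_prompt(original_prompt: str, responses: dict[str, str]) -> str:
    """Fill the template with the original prompt and the labeled responses."""
    labeled = "\n\n".join(f"### {label}\n{text}" for label, text in responses.items())
    return EVALUATION_TEMPLATE.format(
        original_prompt=original_prompt, labeled_responses=labeled
    )


def run_evaluation(original_prompt: str, responses: dict[str, str]) -> str:
    """Ask the evaluator model for the structured comparison."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set
    result = client.chat.completions.create(
        model=EVALUATOR_MODEL,
        messages=[
            {"role": "user", "content": build_evaluator_prompt(original_prompt, responses)}
        ],
    )
    return result.choices[0].message.content


# Example usage, reusing PROMPT and responses from the collection sketch:
# print(run_evaluation(PROMPT, responses))
```

Keeping the template in one string makes it easy to iterate on the sub-questions from step 4 without touching the rest of the pipeline.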
Sample Usage
Evaluator Prompt
- Original Prompt: “<Insert the exact user query or instructions here>”
- Responses:
  - LLM A: “<Complete text of A’s response>”
  - LLM B: “<Complete text of B’s response>”
  - LLM C: “<Complete text of C’s response>”
  - LLM D: “<Complete text of D’s response>”
  - LLM E: “<Complete text of E’s response>”
Evaluation Task
- Critically analyze each response based on accuracy, depth, clarity, and practical usefulness.
- Compare the responses, highlighting any specific strengths or weaknesses.
- Rank them from best to worst, with explicit justification.
- Summarize why the top model is superior, and how each model can improve.
Please produce a structured, unbiased, and data-driven final answer.
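If you additionally ask the evaluator to emit per-criterion scores (say, 1–10 for accuracy, depth, clarity, and practicality), you can turn its output into a ranking mechanically. The weights and scores in the sketch below are made-up placeholders purely to illustrate the aggregation.

```python
# Sketch only: turn per-criterion scores from the evaluator into a weighted ranking.
# The weights and scores below are made-up placeholders for illustration.
WEIGHTS = {"accuracy": 0.35, "depth": 0.25, "clarity": 0.20, "practicality": 0.20}

scores = {
    "LLM A": {"accuracy": 9, "depth": 8, "clarity": 7, "practicality": 8},
    "LLM B": {"accuracy": 7, "depth": 9, "clarity": 8, "practicality": 6},
    "LLM C": {"accuracy": 8, "depth": 7, "clarity": 9, "practicality": 9},
}


def weighted_score(per_criterion: dict[str, float]) -> float:
    """Weighted sum of the four criteria; higher is better."""
    return sum(WEIGHTS[c] * per_criterion[c] for c in WEIGHTS)


ranking = sorted(scores, key=lambda model: weighted_score(scores[model]), reverse=True)
for rank, model in enumerate(ranking, start=1):
    print(f"{rank}. {model}: {weighted_score(scores[model]):.2f}")
```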
Happy Prompting! Let me know if you find this useful!