
AI Model Evaluator (LLM & Agent Systems)

Work from home Full-time role Hiring

Job Title: AI Model Evaluator (LLM & Agent Systems)

Job Type: Contract (Minimum 2 weeks, with potential extension)

Location: Remote

Job Summary:

Join our customer's team as an AI Model Evaluator (LLM & Agent Systems) and play a pivotal role in shaping the future of generative AI and autonomous agents. You'll help benchmark, analyze, and assess cutting-edge AI systems in real-world scenarios, providing structured insights that drive improvements. This position is ideal for analytical professionals passionate about AI quality and real-world impact.

Key Responsibilities:

  • Evaluate outputs from large language models (LLMs) and autonomous agent systems against defined guidelines and rubrics
  • Review multi-step agent actions, including screenshots and reasoning traces, to determine accuracy and quality
  • Consistently apply evaluation standards, flagging edge cases and identifying recurring patterns or failure modes
  • Provide detailed, structured feedback to inform benchmarking, product evolution, and model refinement
  • Participate in calibration and alignment sessions to ensure consistent application of evaluation criteria
  • Work collaboratively to adapt to evolving scenarios and ambiguous evaluation situations
  • Document findings and communicate insights clearly both in writing and verbally to relevant stakeholders

Required Skills and Qualifications:

  • Demonstrated experience with LLM evaluation, AI output analysis, QA/testing, UX research, or similar analytical roles
  • Strong background in AI model evaluation, benchmarking, and applying rubric-based scoring frameworks
  • Exceptional attention to detail and sound judgement in ambiguous or edge-case scenarios
  • Proficiency in English (B2+ or equivalent) with excellent written and verbal communication skills
  • Ability to adapt quickly to evolving guidelines and work independently
  • Comfort with remote work and a commitment of at least 20 hours per week for the initial term
  • Analytical mindset with a focus on actionable, qualitative feedback

Preferred Qualifications:

  • Experience with RLHF, annotation workflows, or AI benchmarking frameworks
  • Familiarity with autonomous agent systems or workflow automation tools
  • Background in mobile apps or digital product evaluation processes

Required Skills

  • LLMs
  • Generative AI
  • AI Model Evaluation
  • AI Benchmarking
  • AI Quality Assessment
  • Model Performance Evaluation
  • Prompt Response Evaluation
  • AI Output Analysis
  • Rubric-Based Scoring
