#ai-evaluation

2 results found

Conkurrence

Conkurrence measures whether multiple AI models produce consistent outputs on your evaluation tasks. It reports which items the models agree on and which need human review, using Fleiss' κ, Kendall's W, and bootstrap confidence intervals, the same psychometric methods used in clinical research.
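
As a rough illustration of the agreement statistic named above, here is a minimal Python sketch of Fleiss' κ. It is not Conkurrence's implementation. The function names `item_agreement` and `fleiss_kappa` and the count-matrix input layout are assumptions made for this example, and it assumes every item is labeled by the same number of models.

```python
import math
from typing import List, Sequence


def item_agreement(counts: Sequence[Sequence[int]]) -> List[float]:
    """Per-item agreement P_i for a count matrix.

    counts[i][j] is the number of raters (here: models) that assigned
    item i to category j. Every row must sum to the same rater count.
    """
    if not counts:
        raise ValueError("need at least one item")
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    # P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    return [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa: chance-corrected agreement across all items."""
    p_items = item_agreement(counts)
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])

    # Mean observed agreement across items.
    p_bar = sum(p_items) / n_items

    # Expected agreement by chance, from overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        # All ratings fall in one category; kappa is undefined.
        return math.nan
    return (p_bar - p_e) / (1.0 - p_e)


# Three models, two categories, two items with unanimous labels.
unanimous = [[3, 0], [0, 3]]
# Same setup, but each item has one dissenting model.
split = [[2, 1], [1, 2]]
```

Items whose per-item agreement from `item_agreement` falls below 1.0 are the ones where the models disagree, which is one simple way to pick candidates for human review.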

Judgmentlabs MCP Server

A Model Context Protocol (MCP) server that integrates with the Judgment API for AI evaluation workflows. It lets you manage datasets, run evaluations, and track traces directly from your MCP-compatible environment.