LLM-as-Judge

Using models to evaluate model outputs, and the failure modes that come with it: positional bias, self-preference, and calibration failures.
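
Positional bias is the easiest of these failure modes to probe directly: show the judge the same pair of answers in both orders and check whether its verdict follows the answer or the slot. The sketch below is a minimal illustration under stated assumptions, not a reference implementation. The `Judge` signature, the `"first"`/`"second"` return convention, and the two toy judges are all hypothetical stand-ins for a real model call.

```python
from typing import Callable, List, Tuple

# Hypothetical judge interface (an assumption, not a real library API):
# takes (prompt, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of items where the judge picks the same underlying answer
    regardless of which slot that answer is shown in."""
    if not items:
        raise ValueError("items must be non-empty")
    consistent = 0
    for prompt, a, b in items:
        verdict_ab = judge(prompt, a, b)  # a shown first
        verdict_ba = judge(prompt, b, a)  # b shown first
        winner_ab = a if verdict_ab == "first" else b
        winner_ba = b if verdict_ba == "first" else a
        if winner_ab == winner_ba:
            consistent += 1
    return consistent / len(items)


# Toy judges for illustration only.
def always_first(prompt: str, first: str, second: str) -> str:
    # Maximally position-biased: ignores content entirely.
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    # Content-based (if crude): verdict depends on the answers, not the slot.
    return "first" if len(first) >= len(second) else "second"


items = [
    ("Q1", "short", "a longer answer"),
    ("Q2", "another long answer", "brief"),
]

print(position_consistency(always_first, items))    # 0.0
print(position_consistency(prefers_longer, items))  # 1.0
```

A consistency score well below 1.0 suggests the verdict depends on presentation order. Note that a high score only rules out positional bias; the `prefers_longer` judge is perfectly consistent yet still biased toward length, so this check does not address self-preference or calibration.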