tentative evals on o3-mini for @ellipsis_dev code review:
reasoning_effort="low" is meh
reasoning_effort="medium" is quite good, allows us to simplify our pipeline
reasoning_effort="high" is so good it found a bunch of new bugs in our eval PRs that we hadn't noticed before