Coming soon
What good LLM evaluation actually looks like
Lessons from building evaluation pipelines in a regulated healthcare setting — what vibes-based testing misses and why it matters.
AI Evaluation · LLM
§5 — Writing
On AI evaluation, causal inference, and doing rigorous data science in settings where it actually matters.
A bridge for researchers moving into data science: what transfers, what doesn't, and where the real differences lie.
Why standard accuracy metrics are the wrong frame for high-stakes classification, and how to think about it instead.