LLM


Large Language Model

Asking 60+ LLMs a set of 20 questions - Benchmarks like HellaSwag are too abstract for me to get a sense of how well models perform in real-world workflows. I had the idea of writing a script that runs prompts testing basic reasoning, instruction following, and creativity against around 60 models I could get my hands on through inference APIs. The script stored all the answers in a SQLite database; those raw results are at https://benchmarks.llmonitor.com/
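
A minimal sketch of what such a script might look like (not the original code): it assumes every model is reachable through an OpenAI-compatible chat-completions endpoint with an `OPENAI_API_KEY` environment variable, whereas the real run spanned multiple providers; the model list and prompts below are placeholders, not the benchmark's actual questions.

```python
import os
import sqlite3

import requests

# Assumed: a single OpenAI-compatible endpoint for all models.
API_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]

MODELS = ["gpt-3.5-turbo", "gpt-4"]  # illustrative; the real list had ~60 models
PROMPTS = [
    "If I have 3 apples and eat 2, how many are left?",  # basic reasoning
    "Reply with exactly the word 'yes'.",                # instruction following
    "Write a two-line poem about SQLite.",               # creativity
]

def ask(model: str, prompt: str) -> str:
    """Send one prompt to one model and return the text of its answer."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def main() -> None:
    conn = sqlite3.connect("answers.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS answers "
        "(model TEXT, prompt TEXT, answer TEXT, PRIMARY KEY (model, prompt))"
    )
    for model in MODELS:
        for prompt in PROMPTS:
            answer = ask(model, prompt)
            conn.execute(
                "INSERT OR REPLACE INTO answers VALUES (?, ?, ?)",
                (model, prompt, answer),
            )
            conn.commit()  # commit per answer so a crash loses little work
    conn.close()

if __name__ == "__main__":
    main()
```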

AI