Ian Magnusson

Hi, I'm Ian! I do AI research at the University of Washington and the Allen Institute for AI with Noah Smith and Pang Wei Koh. I'm interested in the science of language modeling, especially advancing evaluation to better understand scaling behavior and robustness across textual domains.

Previously, I was a PYI at AI2. I got my MS in computer science from Northeastern University, and interned at AWS AI Labs and SIFT. I also hold a BA in cultural anthropology from Bard College.

Selected Publications

DataDecide: How to Predict Best Pretraining Data with Small Experiments
_{Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge}
_{ICML 2025 // [paper] [data] [evals] [code] [models] [press]}

Paloma: A Benchmark for Evaluating Language Model Fit
_{Ian Magnusson, Akshita Bhagia, Valentin Hofmann, Luca Soldaini, Ananya Harsh Jha, Oyvind Tafjord, Dustin Schwenk, Evan Pete Walsh, Yanai Elazar, Kyle Lo, Dirk Groeneveld, Iz Beltagy, Hannaneh Hajishirzi, Noah A. Smith, Kyle Richardson, Jesse Dodge}
_{NeurIPS 2024 // [paper] [data] [code] [models]}

Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
_{Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi}
_{EMNLP 2024 // [paper]}

OLMo: Accelerating the Science of Language Models
_{Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, et al.}
_{ACL 2024 // [paper] [model] [code] [blog] [press]}

Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
_{Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, et al.}
_{ACL 2024 // [paper] [data] [code] [blog] [press]}

What's In My Big Data?
_{Yanai Elazar, Akshita Bhagia, Ian Magnusson, Abhilasha Ravichander, Dustin Schwenk, Alane Suhr, Pete Walsh, Dirk Groeneveld, Luca Soldaini, Sameer Singh, Hanna Hajishirzi, Noah A. Smith, Jesse Dodge}
_{ICLR 2024 // [paper] [code] [demo] [press]}

Reproducibility in NLP: What Have We Learned from the Checklist?
_{Ian Magnusson, Noah A. Smith, Jesse Dodge}
_{Findings of ACL 2023 // [paper]}

Extracting Fine-Grained Knowledge Graphs of Scientific Claims: Dataset and Transformer-Based Results
_{Ian Magnusson, Scott Friedman}
_{EMNLP 2021 // [paper] [data]}