Xanda Schofield, Harvey Mudd College – Making Sense of Text with Topic Models

On Harvey Mudd College Week: It takes time to read a lot of information. But what if you need it fast?

Xanda Schofield, assistant professor of computer science, looks beyond AI.

Xanda Schofield is an Assistant Professor of Computer Science at Harvey Mudd College. She completed her B.S. in Computer Science and Mathematics at Harvey Mudd in 2013, and her Ph.D. in Computer Science at Cornell University in 2019. Her work focuses on practical applications of small-scale natural language processing (NLP) to research in the humanities and social sciences. She is also passionate about improving ethical and inclusive practice in her field.

Making Sense of Text with Topic Models


When confronted with a pile of documents too long to read, you might ask: can I find out what’s in here without looking at all of it?

Decades before ChatGPT and large language models, linguistics and statistics came together to answer this question with topic models. Topic models take the idea of context clues to a massive scale: they use counts of what words show up together in the same documents, whether those are news articles, customer complaints, or kids’ book chapters. Those statistics help to split the text into groups of words that show up together, which we call topics.

These topics often correspond to themes that someone who knows the text can recognize. If we look at kids’ books and see a topic emphasizing the words “store,” “money,” and “mall,” we might guess that wherever this topic shows up, people are talking about shopping.
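The co-occurrence idea behind topic models can be sketched in a few lines. This is a toy illustration, not a real topic model: the tiny corpus and the simple pair-counting rule are assumptions made for the example, standing in for the statistical machinery an actual model uses.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each "document" is a short list of words.
docs = [
    ["store", "money", "mall", "buy"],
    ["money", "store", "sale", "mall"],
    ["dragon", "castle", "knight", "sword"],
    ["knight", "dragon", "quest", "castle"],
]

# Count how often each pair of words appears in the same document.
pair_counts = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        pair_counts[pair] += 1

# Pairs that co-occur in more than one document hint at a shared theme:
# here, a "shopping" group and a "fantasy" group emerge from the counts.
frequent = [pair for pair, n in pair_counts.items() if n >= 2]
print(frequent)
```

A real topic model such as latent Dirichlet allocation does something far more sophisticated with these co-occurrence statistics, but the intuition is the same: words that keep showing up in the same documents get grouped together.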

Topic models can help with text research: they can measure common trends and flag unusual text for closer exploration. So why aren’t we seeing topic models everywhere? A few years ago, I recruited a team of undergraduate research students to help me figure out why.

We interviewed fifteen text experts who had used topic models, to find out what slowed their projects down. We found that for topic models to be useful, expert human intervention was necessary. Decisions like whether two words should be treated as the same, or whether a model was good enough for an application, required knowing a lot about the underlying text. Applying that knowledge was sometimes slow, but without that work, the model wouldn’t be useful.

This changed my thinking about my job as someone who works in AI. I want to help people make decisions by using AI as a support tool for human expertise and decision-making, not a substitute.

Read More:

A blog post about topic models and kid lit, led by Quinn Dombrowski

A shorter paper using topic models to make sense of companies’ Twitter behavior around corporate social responsibility

A paper diving into the details of what makes topic modeling work so complicated
