In the last few weeks (as you have seen from previous blog posts), I have been working on a topic modeling project that draws on ongoing, cutting-edge work being done here at the University of Maryland in its Computer Science department. (In the early part of the internship, as readers will recall, we worked on interface design and Scala programming around topic modeling with the Mallet toolkit, which takes a slightly different approach and was developed at the University of Massachusetts.)
The question of “scale” has been on my mind over the past couple of weeks. We are processing truly vast amounts of text data — topic modeling is the kind of approach whose power of discovery is predicated on the assumption that vast amounts of data will be available for it to run on. It makes me pause and reflect that the expectation that these approaches will keep becoming more prominent and visible in the coming years rests on further assumptions, both technological and social. For one thing, their continued success depends on Moore’s Law continuing to hold (i.e. more and more processing power becoming available more and more cheaply), and also on the willingness (and legal ability) of the libraries and institutions that own such vast repositories of texts to make them available in computer-readable formats. I realize that it is studying information science at an info-school (I am an SI student at Michigan) that makes me think about these additional dimensions. If I had remained just a computer-science person, I probably wouldn’t have thought about how much socio-technical infrastructure is needed to put so much text online; and if I had remained a humanities person (which I have also been in the past), it might not have occurred to me to think about the underlying breakthroughs in electronics that are making such continued scaling-up possible for approaches like topic modeling (and will hopefully continue to do so in the future). I appreciate how being a student of information science attunes me to think about the entire ecology within which a particular approach is being developed.
While the availability of vast and increasing volumes of data makes one think of issues of quantitative scale, over the last couple of weeks I have also come to appreciate what one might call the qualitative scale of the challenge posed by this approach, especially when one tries to improve the sophistication of the underlying algorithm by bringing, for example, domain knowledge to bear on the problem. An example from what we have been doing: earlier, we were working with the “unsupervised” topic modeling approach, in which no knowledge of the content of the text is really needed — the algorithm simply cranks away at whatever text corpus it is given, and discovers topics in it. For the last week or so, though, we have focused on the brand-new, cutting-edge “supervised” topic modeling approach being developed by the computer science folks here at the University of Maryland. The idea in “supervised” topic modeling is to “train” the algorithm by making use of domain knowledge. For example, for the archive of Civil War-era newspaper articles that we are working with, we are using related pieces of knowledge from sources outside the corpus, such as the casualty rate for each week, and the Consumer Price Index for each month, of the period during which these articles were published. The idea behind this approach is that the algorithm will discover more “meaningful” topics if it has a way to use feedback on how well the topics it discovers are associated with a parameter of interest. Thus, if we are trying to bias the algorithm toward discovering topics that pertain more directly to the Civil War and its effects, it makes sense to align the corpus with these other kinds of data — in our case, casualty figures and economic figures — whose provenance lies outside the text corpus. This is where the “qualitative” scale becomes important, I think.
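To make the “feedback” idea a bit more concrete, here is a minimal sketch — my own toy illustration, not the Maryland group’s actual code, and with invented numbers — of the signal a supervised model gets to exploit: given per-document topic proportions (as an unsupervised model might produce them) and an external response such as a weekly casualty figure, one can measure how well the topics explain the response with a simple linear regression. Supervised topic modeling, roughly speaking, folds a regression of this kind into the model itself, so that topic discovery is nudged toward topics that predict the response well.

```python
import numpy as np

# Hypothetical example: 5 documents, 3 topics.
# Each row is a document's topic-proportion vector (rows sum to 1),
# as an unsupervised topic model might estimate them.
topic_props = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.6, 0.3, 0.1],
    [0.1, 0.7, 0.2],
])

# Invented external response aligned with each document,
# e.g. the casualty figure for the week the article appeared.
response = np.array([120.0, 30.0, 55.0, 110.0, 35.0])

# Fit response ~ topic_props @ weights by least squares.
# A good fit means the discovered topics carry information
# about the external parameter of interest.
weights, _, _, _ = np.linalg.lstsq(topic_props, response, rcond=None)

predicted = topic_props @ weights
print("topic weights:", np.round(weights, 1))
print("predicted responses:", np.round(predicted, 1))
```

In this toy data, topic 0 co-occurs with high casualty weeks, so the regression assigns it a large weight; a supervised model would use exactly this kind of association as its training signal.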
The person who uses this kind of approach successfully, in other words, will have to have at least some grasp of a wide variety of other fields, and will have to know which information sources to consult for additional kinds of data and how to bring them to bear fruitfully on the problem. The sheer number of areas with which the successful practitioner of this kind of work must have at least a passing acquaintance will therefore “scale” up, the more intelligently we try to leverage the power of these approaches. It also made me realize, once again, that people trained in information science — a truly interdisciplinary field — are well positioned to do this. Over the last week, for example, I read several papers on the economic history of the Civil War (to which we were pointed by Robert K. Nelson, a historian at the University of Richmond who has worked on topic modeling and history) — who would have thought that one would have to read something like that in the course of a summer internship in Information Science? I aligned the economic data with the text corpus and, based on what the data seemed to be telling us, came up with a design for some experiments to test out some hypotheses, which we will carry out over the next few days.
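The alignment step itself is conceptually simple, and a small sketch may show the flavor of it. This is a toy version of my own devising, not our actual pipeline: the articles, the second casualty figure, and the keying scheme are all invented for illustration (the 22,717 figure is the commonly cited casualty total for the week of Antietam).

```python
from datetime import date

def iso_week(d):
    """Key a date by its ISO (year, week) pair."""
    year, week, _ = d.isocalendar()
    return (year, week)

# Hypothetical toy corpus: a few dated Civil War-era articles.
articles = [
    {"date": date(1862, 9, 17), "text": "Terrible battle near Sharpsburg..."},
    {"date": date(1862, 9, 19), "text": "Lists of the fallen begin to arrive..."},
    {"date": date(1862, 10, 1), "text": "Prices of flour and coal rise again..."},
]

# Hypothetical weekly casualty figures, keyed by ISO week.
weekly_casualties = {
    iso_week(date(1862, 9, 17)): 22717,
    iso_week(date(1862, 10, 1)): 350,
}

# Align each article with the casualty figure for its week,
# yielding (text, response) pairs a supervised model could train on.
aligned = [
    (a["text"], weekly_casualties.get(iso_week(a["date"]), 0))
    for a in articles
]

for text, casualties in aligned:
    print(f"{casualties:6d} -> {text[:35]}")
```

The interesting work, of course, is not the join itself but knowing that weekly casualty rates and a monthly price index are the right external series to join in the first place — which is precisely the “qualitative scale” point above.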
Also, in a piece of exciting news, the paper proposal that we (Travis, Clay and I) submitted to the “Making Meaning” conference for graduate students, organized by the Program in Rhetoric at the English Department of the University of Michigan, has been accepted. In preparing this presentation, too — which is going to be a reflection on how one might situate approaches like topic modeling in the context of literary theory and philosophy — I think we will find that our interdisciplinary training as “information-science” people really helps us to see, and think in terms of, the “big picture” — to scale up to the big picture, as it were.
P.S. Given that this post was a reflection on the question of scale, it just occurred to me that it is also appropriate that the programming language I learned during the earlier part of the internship was — Scala!