Topic modeling and digital humanities

I spent the first couple of weeks familiarizing myself with “topic modeling” in literary corpora (the area I will be working on), reading previous work (journal and conference papers), and going through the existing code base for this work here at the Maryland Institute for Technology in the Humanities (MITH).

The work so far was preliminary in nature, having to do with familiarization.

At this point in time, some reflection is in order as to what the necessary skills may be for success in this internship. This is an internship in the digital humanities, which means that, by its very definition, it straddles two areas that, in the academy, do not usually interact very much: computing and the humanities. So, to work successfully in this area, a practitioner needs more than passing familiarity with both the world of the humanities and the world of software design and development. There is a particular reason why these two areas are so disjoint and disparate in the academy. Scholars in the humanities have a (well-founded) fear of, and skepticism towards, reductionism. They tend to favor a holistic approach and resist methodologies that treat the object of their study as in any way separate from its various contexts. Science, however, by its very nature, tends to be analytical, and, as a consequence, reductive. In the field of digital humanities, therefore, these two aspects/approaches are always in tension.

The area of “topic modeling” is particularly interesting in this regard because this new approach, although statistical in nature, is one of the rare computational approaches that are, arguably, *not* reductive. Topic modeling aspires to discover global properties and qualities of the text while connecting those global, macro-level qualities to micro-level detail, and is therefore likely to appeal to humanities scholars in a way that reductive approaches do not. It is therefore more than a tool for answering pre-existing research questions; it is also generative of research questions themselves. It also means, however, that someone working in this area cannot simply work in an instrumental way: he or she will need familiarity with both areas, that is, with the software development process and with how research is carried out in the humanities. The people who work here embody these skills; most have been trained academically in both disciplines.

Some reflection is also in order on the existing skill set I bring to the table. My recent academic training (before starting the master’s program in the School of Information) was in the humanities, and I have a good sense of the questions, issues and mindsets that characterize the field. My software skills, however, are a little rusty, since change is very fast in the world of software and I have not been developing software full-time, year-round in the last few years. So I am finding that I have some catching up to do in order to get up to speed on the project, including learning some new tools and methods. In particular, I need to focus on web programming, since I had not worked intensively in that area before, and mastery of its techniques is an important prerequisite to building applications that are versatile and user-friendly enough to be attractive to humanists. I learn best by reading books (in the old-fashioned, i.e. dead-tree, way), and so I got myself books on jQuery, Git and Scala (technologies that will be needed in the project) to read.

Posted in MITH, Summer 2011, University of Michigan

Week 4

This week, I spent much of my time modifying the logic in the watracking project to match the new database structure. Information about, say, object type classification (audio, video, text, image) is now linked in an entirely different way, which requires some rewriting. Restructuring the CodeIgniter project is the ultimate goal, and I have had some uncertainties about the best way to approach the problem. I could spend some time in the design phase now and see whether that informs the code additions and subtractions I’m making to get the application working. Or I could wait, get the application working with logic that reflects the new, generalized database design, and then spend some time in the design phase before implementing the restructuring.

Searching forums online, I haven’t found much (nothing, really) about others’ approaches to changing an existing CodeIgniter structure, or about challenges that seem to be common across these attempts. However, I have found sample PHP instantiations of MVC, and I think I have a fairly good grasp of what goes in your average model relative to the controller and the view.

The model should contain all the data logic (easily said, not as easily understood). One example that was particularly helpful in making sense of it for me asked readers to imagine a video-on-demand site. Just to keep it simple, you have two models – user.php and movie.php. In user.php, you create a user class with methods such as “getUserName()” and “logIn().” In movie.php, you have a movie class with methods such as “getTitle()” and “getPrice().” After reading this simplified example, I thought, “How does this translate to our tracking app?” Well, I’m still working that out…
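
A minimal sketch of that video-on-demand example might look like the following. It is written in Python rather than the PHP of the original example, purely for illustration; the password check, the price stored in cents, and the `rent_movie` function are assumptions added to make the sketch runnable, not part of the example as described.

```python
from dataclasses import dataclass


@dataclass
class User:
    """Model for a site user: holds user data and the rules about it."""
    user_name: str
    password: str  # stored in plain text only to keep the sketch short
    logged_in: bool = False

    def get_user_name(self) -> str:
        return self.user_name

    def log_in(self, password: str) -> bool:
        # The model, not the controller, decides whether a login succeeds.
        self.logged_in = (password == self.password)
        return self.logged_in


@dataclass
class Movie:
    """Model for a movie: holds movie data and knows how to present its price."""
    title: str
    price_cents: int  # hypothetical field; cents avoid floating-point money

    def get_title(self) -> str:
        return self.title

    def get_price(self) -> str:
        return f"${self.price_cents / 100:.2f}"


def rent_movie(user: User, movie: Movie) -> str:
    """Controller-style function: asks the models, then builds text for a view."""
    if not user.logged_in:
        return "Please log in first."
    return f"{user.get_user_name()} rented {movie.get_title()} for {movie.get_price()}"
```

The point of the split is that `rent_movie` contains no data rules of its own; it only coordinates the two models, which is roughly what “the model should contain all the data logic” asks for.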

At first, I thought that for each major application directory (controller, view, model) you would have subdirectories themed by UI function: add, edit, delete, search. However, after talking to Keith, it seems that might be alright for the controllers, but coupling the models to the controllers so closely does not seem like the best approach.

I understand why there is currently a model corresponding to each table in the database – each represents a kind of data, and it takes certain methods to use that data. I have some concerns about code repetition, but perhaps you can’t avoid it altogether. I’m continuing to work on it.

Whitman Camp is coming up here at CDRH, and Matt Cohen, from UT, is expected to be in town. I am looking forward to seeing how CDRH organizes these kinds of events – what gets done, what kind of plans get made, etc. And I am also eager to speak with the digital humanists from home to see what kinds of thoughts they have about the future of the field at UT – a center, or an embedded kind of approach where each humanities department develops its own dh working group. I’m sure there are lots of contributing factors in determining what sort of configuration works best.

Next weekend is the Fourth of July, and I’ve heard from more than a few people that it is a pretty big deal here. The celebration has begun a little early in my neighborhood – several loud fireworks went off over the weekend. The fireflies are equally enthusiastic – I’m seeing them in larger numbers every evening around 8:30.

Posted in CDRH, Summer 2011

Week 3

Week 3 was all about database redesign. After talking to Keith earlier in the week, I had an idea about how to proceed. Before speaking with Keith, I was torn between two project directions. One was taking his code and getting it to work for Cody, an approach that would produce a usable application for the project team in a shorter amount of time. The other was doing something totally different and more generic, including a database redesign and some code rewrites, an approach that would produce something more modular that could easily be implemented as a tracking system for most types of CDRH projects but would be more time-intensive. We determined the latter was the best use of my time here, and I set out on the redesign.

I think one tricky part of extending what someone has already done is that it is easy to become married to that idea/structure/organization in the time it takes you to understand how it produces the outcome it is producing.

…And you don’t do database redesign in a vacuum. For every change, you have to have some idea of how it might play out – either in a query or somewhere in the application logic. Another potential trap is trying to design and conceptualize your way out of trouble without ever getting your hands dirty. It’s impossible. You are bound to get stuck in a purgatory of MySQL Workbench EER modeling, or of covering your desk in notebook-paper prototypes (which is a likely outcome even if you do get your hands dirty).

I’m currently in the midst of trying out my latest design. My approach is to:
1) change whatever needs to be changed in the project directory to get the application working with my new database design.
2) make changes as needed
3) reorganize the CodeIgniter project so that the bulk of the logic is contained in the models, and the add and jsrequest controllers are broken up into more task-specific chunks that can act as controllers themselves.

We’ll see how it goes… But in the space between design, testing my design’s feasibility, and producing informative output, I’m learning a ton.

Posted in CDRH, Summer 2011

Week 2

After my first week’s self-imposed rush to get comfortable with a range of new models, desktop work environments, and the CDRH itself, this week I felt more at ease with all three. I dove head first into Keith’s code. When you look at the source code for someone’s application, you are looking at their problem-solving process – the bulk setup (the design), the quick fixes (constraints on their time during the process), the places they intended to go back and revisit. Code is a fairly revealing form of authorship.

You reveal in it your desire for elegance, for modularity.

However, reading code that is so thoroughly abstracted can be difficult. The MVC model that is instantiated by the CodeIgniter framework is straightforward in the abstract. However, in practice and as you might imagine, it takes on the idiosyncrasies of the programmer and the specific project needs. Therefore, the business logic sometimes shows up in unexpected places.

We met with Laura, CDRH’s metadata encoding specialist, who has worked the most with the Cody EMS. She walked us through the parts of the EMS that are particularly problematic. Keith’s tracking DB for the Whitman already solves most of the problems that Laura pointed out, which means that modifying the Whitman tracking DB is probably the best plan. Also, CDRH faculty and staff are already familiar with the user interactions and the location of specific functionality such as editing and deleting an entry.

There are some differences between the Whitman and Cody tracking systems that we are already aware of: Cody will include video and image object types; the site will require a changelog so that users can see the last person who modified an object entry; the naming scheme will have to be revised; there will be additional object type-specific, dynamically generated forms; and there are a few more issues that I have documented elsewhere. The Whitman system currently produces additional forms based on the kind of object that the user is adding (Text, etc.). Text is currently the only object type that has additional forms (these provide places for the user to add info about whether the object they are adding is ‘recto’ or ‘verso’, etc.). However, if the Cody project members decide on elements of video they would like to track, the current AJAX-enabled logic that produces the additional forms without reloading the page will need to be copied and modified accordingly.

Also, any additional object types correspond to changes in the table design. One current DB issue with Whitman is the way that object type and object ID are linked. The Object Type table consists of an ObjectType PK, the ObjectID, the ObjectType (text, audio, ephemera, etc.), and something else. Why not simply create foreign keys for all object types (AudioObjectID FK, VideoObjectTypeID FK, etc.)? Well, if you did that, each object type would need to be a column, and you would have 3 out of 4 cells in a given row with NULL values. In its current configuration, as Keith explained, “It sacrifices referential integrity in order to reinforce the idea that each object can only have one type.” Tasked with rebuilding the DB for Cody, we are given the opportunity to rethink the current scheme.
Another current Whitman DB issue is that ImageObjects (which would include photographs, posters, and other image content – not scans of text content) were never used. So, we get the opportunity to explore what kinds of elements need to be tracked there as well. Originally Keith had planned for ImageCreationMethods and one other element. After speaking with Laura, it seems that tracking elements for video and images in particular haven’t yet been determined. This presents an obvious but common problem in DB design, which is planning for likely additions and trying to make the schema as flexible as possible. I’ve built an EER model in MySQL Workbench to play around with alternatives.
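
The object-type trade-off described above can be sketched with two toy layouts. This is only an illustration under stated assumptions: it uses Python’s built-in sqlite3 module rather than MySQL, and every table and column name below (`object_type_link`, `object_link`, `text_object`, and so on) is hypothetical, not taken from the Whitman database. Layout A stores the type as a value next to an ID that the database cannot check; Layout B uses one nullable foreign-key column per type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
    -- Hypothetical per-type tables (names are assumptions for this sketch).
    CREATE TABLE text_object  (id INTEGER PRIMARY KEY);
    CREATE TABLE audio_object (id INTEGER PRIMARY KEY);
    CREATE TABLE video_object (id INTEGER PRIMARY KEY);
    CREATE TABLE image_object (id INTEGER PRIMARY KEY);

    -- Layout A: the type is a value, so each row has exactly one type,
    -- but object_id points into a different table depending on that value
    -- and therefore cannot be declared as a foreign key.
    CREATE TABLE object_type_link (
        object_type_link_id INTEGER PRIMARY KEY,
        object_id           INTEGER NOT NULL,
        object_type         TEXT NOT NULL
            CHECK (object_type IN ('text', 'audio', 'video', 'image'))
    );

    -- Layout B: one nullable foreign-key column per object type.
    CREATE TABLE object_link (
        object_link_id  INTEGER PRIMARY KEY,
        text_object_id  INTEGER REFERENCES text_object(id),
        audio_object_id INTEGER REFERENCES audio_object(id),
        video_object_id INTEGER REFERENCES video_object(id),
        image_object_id INTEGER REFERENCES image_object(id)
    );
""")

# Register the same text object under both layouts.
conn.execute("INSERT INTO text_object (id) VALUES (1)")
conn.execute(
    "INSERT INTO object_type_link (object_id, object_type) VALUES (1, 'text')"
)
conn.execute("INSERT INTO object_link (text_object_id) VALUES (1)")

# In Layout B, every type column except the one in use is NULL.
row = conn.execute(
    "SELECT text_object_id, audio_object_id, video_object_id, image_object_id "
    "FROM object_link"
).fetchone()
null_count = sum(value is None for value in row)
```

With four object types, `null_count` shows the “3 out of 4 cells” cost of Layout B, while Layout A keeps rows compact at the price of an `object_id` the database cannot validate.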

I’ve been writing notes all week about specific files that will or won’t have to be changed, which lines and so on…attempting to make my job a little easier as I begin the new CodeIgniter project for Cody, which I’ll be doing now….

Posted in CDRH, Summer 2011

Week 1

My summer project consists of understanding the structure and logic of the Walt Whitman object tracking database; performing a needs assessment with users of the current tracking system used by the William F. Cody project; determining if, how, and what to abstract from the Whitman tracking tool to improve on the Cody system, or determining other alternatives; and finally coming up with a proof of concept of how that might work.

This week was a whirlwind introduction to: the convenience of working on projects in development environments like NetBeans and Eclipse; PHP frameworks, CodeIgniter in particular; using Boolean notation in PHP; AJAX and jQuery integrated into a PHP application; object-oriented principles such as classes, inheritance and abstraction; pass by reference; and some other things that I’ve documented but that are not immediately coming to mind.

Professor Walter introduced me to many of the CDRH and Library faculty and staff members, including the designer and creator of the Whitman tracking database, Keith Nickum. In terms of CDRH staff, I’ll probably be working with Keith most often for the duration of the internship. He is a wealth of knowledge, having created the application himself and having worked on all aspects of the project for the past year and a half.

My first day, I found myself at Keith’s desk more often than I would have liked but by the second day, I had a laundry list of items that I could research and learn on my own. Although I can’t say that I feel comfortable with all these new concepts already, my anxiety level has gone from information overload to manageable thanks to a systematic approach to Keith’s code. I make notes on everything I don’t understand, look those items up, and run through practice exercises whenever possible. In the process more questions arise and I repeat those steps.

I am grateful for the opportunity to be here, at CDRH, working with an amazing group of people who are looking for new and exciting methods to apply to working with humanities materials and humanities-oriented users.

Posted in CDRH, Summer 2011

Thanks Nebraska!

Well, my time at the CDRH has come to an end, and I have had the opportunity to dip my toes into a number of digital humanities-related fields.

The main focus of my time in Nebraska was working with Dr. Peter Bleed on the creation of an archaeological data web tool that would enable researchers to add, access and edit archaeological data and historical data. The idea was to create a resource that would allow for the integration of historical documentation (photographs and maps) with modern views (Lidar data, ortho-rectified maps) displaying archaeological data (GPS data, site maps). The hope was to create a tool that would aid in the interpretation of archaeological findings and assist in further archaeological exploration. To some extent this was successful. I was able to use ArcGIS to georeference old Sanborn maps to the excavation site so that the historical evolution of the site could be seen. Using this to predict where other structures might be was a bit more problematic. Sanborns are accurate but they are not precise. This means that while you have a general idea of the location of a structure, concrete prediction is still iffy. Alongside this visualization I created a collection of digitized primary sources from the site. All in all, this project was more of a proof of concept, an exploration of what sort of tools could be created and how best to go about implementing them.

As a supplement to the creation of the visualization tool, I also wrote up a set of recommendations for the creation of a permanent repository for archaeological materials. It should come as no surprise that archaeological data is increasingly “born digital”. Unfortunately, stewardship of this data is undertaken in a piecemeal fashion. The purpose of this archive would be to create a resource for preservation that encourages use and discovery.

In addition to all of that, I had the opportunity to do some TEI work on a project about the Mountain Meadows Massacre. The scholar had located a number of contemporary accounts written about the massacre using Google Books. I worked on encoding OCR’d XML of the texts to help make them searchable.

I was also able to work with the archive at UNL adding photos to collections using the Archon records software. Again, very interesting work. Archon is a very robust software package and it was interesting to see how the process of resource creation functioned, and just how versatile of a software package it is.

I also had the opportunity to explore Lincoln and Nebraska in general. The folks there are generally lovely and the summer was wonderfully mild compared to my native Austin. If you are ever in Lincoln, I recommend taking a couple of hours to see the Capitol, an amazing art deco structure, and then getting some pizza at Yai Yai’s.

-Stephen

 

Posted by: Stephen Pipkin

Posted in CDRH, Summer 2010

Beginning of August

So, it’s the first week of August. I met with Leigh last week, and Dave was also there. We showed her the project and what I had already done. She said that what I had, functionality-wise, was pretty much what she wanted. I suggested some additional features and she liked them, so we’ll be implementing those soon. I’ll be implementing them soon, that is. I already did one of them this morning. Elisabeth, the designer here, sat in for part of the meeting, and she’ll be making the design. That’s kind of nice, because she’s really good, and I’m not a design expert. I can definitely finish my part by the end of the summer. I am looking forward to having the whole project finished so I can put it in my portfolio and present it at SI.

Posted by: Isabela Carvalho

Posted in MITH, Summer 2010

Modular Emulation and Modular Description

For work this week I have focused on adding content to the catalog and moving the local instance live. On the latter, it’s almost ready! There are a couple of hitches presently but everything has migrated correctly. For the former, I have been finishing up details on the Apple IIe, and adding an Osborne 1 to the site.
For the Apple IIe, I’ve scanned some documents Matt has kept through the years: packing lists, warranties, business reply forms, manual errata, etc. These add a good deal of use context to the machine. For instance, the Apple IIe came with a wrench and nut plate for adding and swapping expansion cards. The computer was really meant to be modified and expanded upon by the user. It is really a very open device. Not only does one not need screws to access the motherboard and cards, one doesn’t even need to turn the machine on its side or upside down. It opens in its regular orientation, sitting on the desk. Besides this, a printer registration card from Star Micronics provides a list of popular computing magazines from which the purchaser can indicate which he or she reads. These range from Apple Orchard and 80 Microcomputing to Dr. Dobb’s Journal.

The Osborne 1 is clearly a less openable device, but it’s providing a good test of how flexible the modeling we have used so far really is. Like the Apple IIe, it is a fully functioning system, but unlike that machine, there is no physical computer case to base components around. The Osborne’s form factor prevents this sort of distinction since it’s a single containing unit. Still, the system has a motherboard (which hosts all the connections and software), and it does have component pieces, such as a 300 baud modem, the 5″ CRT display (dwarfed between two Fujitsu floppy disk drives), the microprocessor, etc.

A video game preservation paper has been making the rounds of late. Dave tweeted about its discussion on Slashdot, then it showed up at Ars. It’s a good paper, and I was particularly interested in one of its citations, a 2005 paper from the National Library of the Netherlands that proposes modular emulation as a new tack on the emulation front. The authors, Jeffrey van der Hoeven and Hilde van Wijngaarden, describe some common emulation woes such as stack emulation (the rabbit hole of emulators emulating emulators and so on to persist the particular emulator of interest through future platforms), emulator migration (rewriting the emulator over and over to persist the emulator through future platforms), and the present limitations of Lorie’s UVC for behaviorally complex data with intense I/O requirements (like software). Modular emulation proposes breaking down emulation into component parts in the interest of reusing those component parts in different and new configurations:

Emulation of a hardware environment by emulating the components of the hardware architecture as individual emulators and interconnecting them in order to create a full emulation process. In this, each distinct module is a small emulator that reproduces the functional behaviour of its related hardware component, forming part of the total emulation process.

This makes a lot of sense to me, and it maps perfectly to the modeling we are investigating here. For example, instead of concerning ourselves with writing an emulator for the Apple IIe (which as an actual and specific machine is always going to vary in expansion cards and internal peripherals, etc.), we instead focus on a solid emulator for the MOS 6502 8-bit microprocessor that handles the system’s computations. That processor appears in many, many machines, so having that emulation software is much more useful than the Apple IIe as an unbreakable whole. It needs to be combined with other emulator-components, of course.
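
To make the idea concrete, here is a toy sketch of two component emulators wired together. It is an illustration only, not the design from the paper: the class names, the `build_machine` helper, and the load address are assumptions, and the processor component understands just three 6502 opcodes (LDA immediate, STA absolute, and BRK, which is treated here simply as “stop”), with no flags, interrupts, or timing.

```python
class Memory:
    """Component emulator for RAM: 64 KiB of byte-addressable storage."""

    def __init__(self, size: int = 0x10000) -> None:
        self.cells = bytearray(size)

    def read(self, address: int) -> int:
        return self.cells[address]

    def write(self, address: int, value: int) -> None:
        self.cells[address] = value & 0xFF


class ToyCpu:
    """Component emulator for a processor; it only talks to a bus object."""

    LDA_IMMEDIATE = 0xA9  # load the next byte into the accumulator
    STA_ABSOLUTE = 0x8D   # store the accumulator at a 16-bit address
    BRK = 0x00            # real BRK raises an interrupt; here it just halts

    def __init__(self, bus) -> None:
        self.bus = bus  # any component offering read(address) / write(address, value)
        self.accumulator = 0
        self.program_counter = 0

    def _fetch(self) -> int:
        value = self.bus.read(self.program_counter)
        self.program_counter += 1
        return value

    def run(self, start: int) -> None:
        self.program_counter = start
        while True:
            opcode = self._fetch()
            if opcode == self.LDA_IMMEDIATE:
                self.accumulator = self._fetch()
            elif opcode == self.STA_ABSOLUTE:
                low = self._fetch()   # addresses are stored low byte first
                high = self._fetch()
                self.bus.write(low | (high << 8), self.accumulator)
            elif opcode == self.BRK:
                return
            else:
                raise ValueError(f"unsupported opcode {opcode:#04x}")


def build_machine(program: bytes, load_at: int = 0x0600):
    """Interconnect the components: load a program into memory, attach a CPU."""
    memory = Memory()
    for offset, byte in enumerate(program):
        memory.write(load_at + offset, byte)
    return ToyCpu(memory), memory


# LDA #$2A ; STA $0200 ; BRK
program = bytes([0xA9, 0x2A, 0x8D, 0x00, 0x02, 0x00])
cpu, memory = build_machine(program)
cpu.run(0x0600)
```

Because `ToyCpu` depends only on the `read`/`write` interface, the same processor component could be paired with a different memory or peripheral component, which is the kind of reuse the modular approach is after.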

The benefit of modeling and describing systems by components is that, if done consistently by a large number of persons, one begins to generate a collective database of parts and pieces. This can facilitate recognition of similarities across platforms (which could be useful for platform studies endeavors), easier groupings of system properties, etc., and ideally, more expedient and cheaper emulation. It also strikes me that persisting these independent emulation pieces would be infinitely easier than managing a more monolithic systemwide emulation piece. And finally, this incremental approach to emulation is simply closer to the true internals of the machines, and that better accuracy of description is educational.

Posted by: Walker Sampson

Posted in MITH, Summer 2010

Multi-tasking

As my time at the CDRH nears its end (my last day is next Friday, August 6th), I have begun to reflect on the work I’ve done this summer.  I have had the opportunity to exercise a variety of intellectual “muscles.”  My tasks have included:

 

  • for French 17 and the Mountain Meadows Massacre project, TEI encoding, which means learning a strict set of rules and understanding how to apply them (a discerning eye is helpful);
  • for Louis XIV, updating a website’s design and adding content, which requires creative, technical and organizational skills;
  • and for both French-language projects, translating texts from French to English.

If one part of my brain ever became “fatigued,” there was always something different waiting in the wings.  All of these jobs were challenging at times, and also quite rewarding.

Posted by: Rosie Hanneke

Posted in Uncategorized

Drupal projects

Writing a blog entry is a nice way to start my week. Much has happened since my last entry. I can’t believe it’s already the end of July! I’m really happy about how things are going. I’ve learned so much! Somehow I learned more by working in this internship than I did all of my first year of grad school. Really it’s true what they said, life experience is the true Teacher. I feel like there are so many career opportunities and paths I can take not only next year but also the year after when I graduate. I’m excited about that, because not having anything to tie me down  to Michigan I can go anywhere, and I’ll go to the west coast.

I’ve been working with Drupal since the beginning. I get the impression that when you know Drupal, you’re something of a rare gem in a web development organization or setting. I’ve talked to several clients and done some consulting; everyone wants my opinion about whether you can do this with Drupal, whether you can do that. My career advisor said that now that I know Drupal I will always have job offers. I definitely see what she means now… Anyway, I finished the original project a long time ago. Well actually, I finished all of the base functionality. Dave and I haven’t been able to meet with the professor who is doing this project yet, so that’s been put on hold. She may want additional things in it. So they gave me another project, a beast of sorts: migration. That’s probably the most dreadful thing one can do in a CMS, but I’m trying to make the best of it. I’m migrating a project they’ve been working on here, “TheatreFinder”, onto Drupal. I feel like I’m almost done, but that doesn’t at all mean I’m almost done. I have to break the process down into several substeps, and it’s hard to estimate how long some of the future substeps will take.

I’ve realized now that the term “migration” is not very appropriate. What it really is, is remaking the site with some additional content plugged in. So in a sense it’s far more work than just making a site from scratch, because not only do you have to make the “scratch” but you have to plug in tons of existing content. All you migrate is the database, and that’s the easy part. But pretty much everything else needs to be created from scratch in Drupal. This has taken me longer than the original project. The nice thing is, I’ll have 2 projects to add to my portfolio when I’m done at the end of the summer. By the way, I am remaking my portfolio now (in Drupal, of course), and when it’s ready I’ll post it here.

I’m learning a lot, not only about Drupal specifically, but about problem solving and troubleshooting in general. I’m also getting more experience with database handling and using tools like phpMyAdmin. Amazing how much you learn from real work experience! =) And I thought I knew Drupal before! Haha.

A very important final update. I ordered Drupal stickers last week and they came in today. I distributed them around the office and made everyone happy. If you want some, let me know before they run out.

Posted by: Isabela Carvalho

Posted in Uncategorized