Developing Issues in Licensing: Text Mining, MOOCs, and More

This report covers a program co-sponsored by the Collection Development and Electronic Resources Management Interest Groups of the Association of College and Research Libraries New England Chapter (ACRL/NEC), an independent chapter of ACRL. The workshop, titled “Developing Issues in Licensing: Text Mining, MOOCs, and More,” took place on April 25

Open education has been expressed in different ways over the years through many educational initiatives, Kumar explained. Just over ten years ago, open courseware was launched when MIT made the content of all its courses available online for free for educational purposes. With this initiative, the MIT community began having discussions about MIT's unique value proposition.
Faculty articulated that what defined the value of an MIT education was intensity: a high level of interaction between high quality students and high quality faculty as evidenced in part through project-based learning and hands-on experiences. The question became how to maintain and extend this value proposition when offering online distance education to a broader set of learners who are more diverse in their levels of preparation.
While there are all sorts of open education initiatives at all levels, for example online courses and online tutoring, the biggest event so far in open education has been the launch of MOOCs.
Kumar believes that one of the characteristics of MOOCs that makes them significantly different is that the community of self-learners enrolled in the MOOC plays a major role. MOOCs do not offer a one-to-many relationship so much as a peer-to-peer learning opportunity. For example, MIT has found that students in MOOCs have created tools and software to help each other learn; there is a whole ecosystem of production going on. Another characteristic of MOOCs that distinguishes them from other online courses is that much of the process, assessment for example, is automated.
Kumar mentioned that there are exciting new developments in online learning that are beginning to show up in MOOCs and have the potential to be dramatically transformative. These include tools that bring the practice of research to the process of learning, for example protein visualization, materials modeling, hydrology visualizations, and parallel programming opportunities. Through MOOCs students can be exposed to the discovery aspect of research and to the processes of doing research using interactive technology. The point is that MOOCs are not just about access to content like video clips and assignments posted online. MOOCs can enable end-to-end educational experiences, including hands-on experiences, at scale.
Kumar presented some of what MIT has learned through its experience offering MOOCs. One lesson is the value of the real-time feedback and correction that result when students in the MOOC help one another. Along these lines, new tools and technologies are being developed that allow, for example, programs created for an assignment in a computer science course to be chopped into chunks and sent to many reviewers for grading, allowing for faster feedback.
Another lesson learned from MOOCs is the value of online learning in enhancing face-to-face education on campus. Shifting information transfer (lectures) and assessments (tests) online allows for the "flipped classroom" in which scheduled class time is used for field experiences, labs, and other interactive experiences.
Kumar ended with a discussion of concept-based learning and modularity. The goal of concept-based learning is to present students with a coherent sense of how the content and skills they are learning relate to specific concepts. These concepts can be linked to educational assets, for example labs and lectures, so that students seeking mastery of a given concept can chart their own path through the material required to master that concept. This in turn enables modularity, the ability to experience education in smaller chunks, which can make it easier to create opportunities for students, like internships or study abroad experiences, that do not interrupt the flow of education. Rethinking the entire curriculum based on concepts could play a large role in changing the ecology and economics of education. MOOCs and related technologies can offer an abundance of courses, content, and interaction opportunities. Access to courses can be blended with hands-on vocational opportunities that allow for a more customized and accessible education. The challenge will be in determining in this new environment what to discard and what to keep.

Why Humanists Need Data: New Uses for Electronic Archives
Speaking next was Ryan Cordell, assistant professor of English at Northeastern University and a core faculty member at the NULab for Texts, Maps, and Networks, Northeastern University's new center for Digital Humanities and Computational Social Science. He presented his research on nineteenth-century U.S. newspapers. Cordell explained that he is interested in historical newspapers because he is interested in viral media. In the nineteenth century, before modern copyright had taken shape, newspapers in the United States were similar to today's blogs or aggregators. Newspaper editors combed through other newspapers to find material their readers might like and published it, sometimes with attribution, sometimes without. What Cordell is studying is how these shared texts moved around the country, changing as they did, and how they informed society at the time.
Nineteenth-century newspapers included a wide variety of content, such as poems, short stories, and travel accounts. For example, one poem, "The Inquiry," was reprinted in newspaper after newspaper throughout the country, changing over time. The version that became the most popular, that "went viral," was one of the edited versions, not the original. The poem ultimately became so popular that it was parodied. Because nearly everyone in the country was experiencing texts such as this one, they can tell us a lot about the period.
Cordell's primary source for his research is the Library of Congress's site Chronicling America: Historic American Newspapers (http://chroniclingamerica.loc.gov/). The site contains the full text and page images of many American newspapers published between 1836 and 1922. Cordell explained that if he had been conducting his research two decades ago, he would have had to painstakingly read every newspaper he could, hoping to randomly encounter shared texts. Even conducting this research with simple full-text searching capabilities would be difficult, since searching relies on inputting known text. Cordell explained that what he needs for his research is the data itself, that is, the full text of the digitized newspapers generated using optical character recognition (OCR). Cordell also used network analysis to create a diagram on which circles represented individual nineteenth-century newspapers and lines between the circles represented shared text. This type of data visualization shows which newspapers were the most influential, printing items that other newspapers chose to reprint. It also reveals which newspapers regularly shared stories, which was often the result of a shared religious or political affiliation or, in one case, a family relationship between the editors. Cordell pointed out that the diagram illustrates the prominence of some newspapers that we might not have suspected. For example, the most prominent newspaper during the time period studied was the Nashville Union and American because at that time Nashville was the geographic center of the country.
Cordell startled the audience by revealing that his study includes no historical newspapers from Massachusetts. The Library of Congress's Chronicling America includes no Massachusetts newspapers because at this point in time only a commercial vendor, Readex, has digitized them, and its data is not available for text mining. Cordell stated that if he had access to Readex's America's Historical Newspapers and ProQuest's American Periodicals Series Online, he would have much more data to analyze. He and his colleague David Smith have begun conversations with Readex and ProQuest about using their data, but the process has not been easy. They are trying to convince these vendors that if they were to allow the use of their text, Cordell and Smith could help them in return by providing corrections to the OCR identified through their research as well as by increasing the visibility of these databases. Researchers who do text mining need the help of librarians to include data mining rights in license agreements.
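The network diagram described above can be approximated in a few lines of Python. This is only an illustrative sketch, not the method Cordell used: the input format (a mapping from a matched-text identifier to the newspapers that printed it), the function names, and the newspaper labels are all assumptions made for the example. It draws one line between every pair of newspapers that printed the same text and then ranks newspapers by how many such lines they have. Note that this simple count measures how connected a newspaper is, not the direction in which a text traveled.

```python
from collections import Counter
from itertools import combinations


def build_network(clusters):
    """Count shared texts for every pair of newspapers.

    `clusters` is a hypothetical mapping: text id -> list of newspapers
    that printed that text. Returns a Counter keyed by (paper_a, paper_b).
    """
    edges = Counter()
    for papers in clusters.values():
        # One undirected edge per pair of distinct newspapers sharing the text.
        for a, b in combinations(sorted(set(papers)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the edge weights touching each newspaper."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Invented sample data: four newspapers (A-D) and three matched texts.
sample_clusters = {
    1: ["A", "B", "C"],
    2: ["A", "B"],
    3: ["A", "D"],
}
sample_edges = build_network(sample_clusters)
sample_degree = weighted_degree(sample_edges)
```

Calling `sample_degree.most_common()` lists the newspapers from most to least connected. A fuller analysis would likely also use publication dates to infer which newspaper printed a text first.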

Coursera Partnership
Jolee West, director of academic computing and digital library projects at Wesleyan University, spoke next about her experiences doing copyright research for Wesleyan's MOOCs.West explained that she is neither a lawyer nor a librarian; she has a PhD in anthropology and works as a technologist.
In the fall of 2012, Wesleyan partnered with Coursera, a for-profit educational technology company that works with universities to host MOOCs. Wesleyan's first MOOCs were offered through Coursera in February 2013. West described MOOCs as distance education combined with crowd-sourced learning. MOOCs have very large enrollments of students from around the world, many of whom are not native speakers of English. (At Wesleyan, the average Coursera enrollment is 30,000.) MOOCs are known for very high attrition rates (about 90%), but, as West explained, the students who remain are highly engaged. Students enrolled in MOOCs self-organize to help each other. Within hours of a MOOC opening, enrolled students form local study groups. Discussions do not need to be seeded.
West noted that faculty in the face-to-face classroom use a wide variety of copyrighted works, but this practice does not translate to MOOCs, which take place online and are open to anybody.
Because the application of the distance education safe harbor in copyright law is questionable in the case of MOOCs, Wesleyan relies on fair use when including third-party copyrighted material, and a great deal of discussion and debate takes place around every copyrighted item used.
One issue that arose at Wesleyan around relying on fair use for MOOC content was that Coursera is a for-profit company, and fair use law favors "nonprofit educational purposes." For this reason, some other schools also offering MOOCs through Coursera will not depend on fair use when incorporating third-party copyrighted content. However, West mentioned, there are a number of legal cases in which fair use has been upheld for commercial entities, and she has spent a good deal of time scouring the Web for information on these cases "to get her head in the right place." In the end, West believes that Wesleyan's fair use with regard to Coursera MOOCs is not that different from fair use claims made by institutions using the not-for-profit EdX system.
Relying on fair use is often necessary because Wesleyan's subscriptions to licensed electronic resources do not cover external students enrolled in MOOCs, and licensing rights for an additional 80,000-100,000 students would not be feasible. When Wesleyan first contacted the Copyright Clearance Center (CCC) about licensing an article for a MOOC, the CCC had no idea what a MOOC was and stated that the licensing fee for the article would be $3.00 per student, the same as for on-campus students. As a result, readings from journal articles and other copyrighted sources are off-limits for Wesleyan MOOCs unless they are available open access or unless the instructor wants to leave obtaining access up to the students themselves. When she is undecided about whether using a particular item would qualify for fair use, West confers with the university's counsel. The fact that students need to register for MOOCs (that is, the courses are not totally open) mitigates some of the risk involved in invoking fair use, despite the fact that anyone can register for a MOOC and that participants have access to content for as long as the instructor leaves the course open.
West provided a number of examples of decisions about using third-party copyrighted content in MOOCs at Wesleyan. A professor for a Wesleyan Coursera MOOC on "The Language of Hollywood" addressed copyright concerns by only using materials openly available on the Web, such as movie posters, still images, and publicity shots. Nothing was taken from print publications, and no movie clips were shown. Instead, the instructor posted a list of movies and suggested that students enrolled in the MOOC obtain the films from their library, Netflix, or a video store. He linked to the IMDB.com page for each movie to reduce the amount of searching required of students in the course. For another class, West found that even clips from silent movies directed by Buster Keaton posed a problem, because although the motion picture content is in the public domain, modern releases of these early silent films have used musical soundtracks that might be protected by copyright.
In another example of working with instructors regarding MOOC content, West received a request to use an excerpt from a recording of a speech by Martin Luther King, Jr. Because West was aware that the family foundation that owns the rights to this content is very aggressive about copyright, the faculty member was not permitted to use the excerpt. This example illustrates the principle of avoiding the use of famous or aggressively monitored content in MOOCs if possible, a strategy that was echoed later by Kyle Courtney of Harvard.
Many of the decisions West helps faculty members make about including content in MOOCs concern images. At Wesleyan they had long discussions about whether images of book covers would be allowed and decided to use them on the basis that the advertising benefit to the rights holder from including the image in a MOOC would outweigh any possible market harm. In one case, a faculty member wanted to use an image of the label on a vinyl LP record. Because they were not comfortable with a fair use rationale for using the label, they did not use it. For an Associated Press image used in a MOOC, Wesleyan decided not to rely on fair use but to license the right to use the image for five years at the cost of $150. When using fine art images, Wesleyan has found the terms of the Metropolitan Museum of Art to be fairly generous. The Met allows for the non-commercial and educational use of images as long as the images belong to the museum and are not subject to additional copyrights. The terms of use of images from the Museum of Modern Art, on the other hand, are more restrictive, so West avoids using them.
Wesleyan also makes a good deal of use of Creative Commons licensed images from Wikimedia Commons; students do the work of obtaining the images and recording the required attributions.
West provided an example of how one Wesleyan professor obtained content for his MOOC without having to conduct a fair use analysis, rely on open content, or pay a licensing fee. The professor of the "Social Psychology" MOOC, in which over 90,000 students had enrolled, wanted to use the same textbook that he uses for his on-campus version of the course (a book that was dedicated to him). He approached McGraw-Hill, the publisher, which offered to make a cheaper version of the book available for $100. The professor believed this was still too expensive for many students, so he convinced the publisher to allow him to use only three chapters of the text at no cost. He also wanted to use a photo of the Blue Man Group and to turn the photo green for a demonstration, and so he approached the organization and received permission. In the end, he acquired materials from many rights holders by asking them directly.
In return, the main page for his MOOC (https://www.coursera.org/course/socialpsychology) thanks them and displays their corporate logos.
Cordell and his colleague David Smith of Northeastern University's College of Computer and Information Science have "scraped" the full text of all newspapers in Chronicling America published before 1860 in order to analyze it for matching passages. Smith's area of research is duplicate detection. They have created an algorithm that breaks the unstructured text into strings of five words (or n-grams) and then searches through the entire body of text for matching n-grams. If enough five-word sequences match between two or more pages, the algorithm identifies a possible matched text. Because the program is only looking for five-word matches, it is not affected by the frequently poor quality of the OCR. Each potential matched text is assigned an identification number and delivered to Cordell in a spreadsheet. So far, Cordell and Smith's research has identified 50,000 viral texts, though Cordell has focused thus far only on the top 5,000. The vast majority of these texts are items that literary scholars have never written about. Most are by anonymous authors or are minor pieces by major authors that we now realize were more influential than previously recognized.
Cordell displayed a number of visualizations of his data, some involving mashups with open data from other sources such as Railroads and the Making of Modern America from the University of Nebraska, Lincoln, the David Rumsey Map Collection, and the Atlas of Historical County Boundaries from the Newberry Library. By taking a historical map, overlaying data about the railroad network at that time, and then adding data on reprinted newspaper texts, Cordell illustrated that the reprinting of newspaper content lined up neatly with the railroad network. Cordell also combined data on historical county boundaries with data about the founding of newspapers to show that, as the population expanded westward, newspapers appeared first, and then shortly after the establishment of a newspaper, political boundaries were established. He has also mashed up his data with historical census data to map the characteristics of populations near where certain types of viral stories appeared, for example what the population looked like in places where religious stories were reprinted.
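The five-word matching idea described above can be illustrated with a short Python sketch. It is not Cordell and Smith's actual program: the function names, the `min_shared` threshold, the tokenization rule, and the sample pages are assumptions invented for this example. The sketch splits each page into overlapping five-word sequences, indexes which pages contain each sequence, and reports pairs of pages that share at least a chosen number of sequences.

```python
import re
from collections import defaultdict
from itertools import combinations


def word_ngrams(text, n=5):
    """Return the set of overlapping n-word sequences in `text`."""
    # Lowercase and keep only runs of letters (an assumed tokenization).
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def candidate_matches(pages, n=5, min_shared=3):
    """Find pairs of pages that share at least `min_shared` n-grams.

    `pages` maps a page id to its (possibly noisy) OCR text.
    Returns a dict: (page_a, page_b) -> number of shared n-grams.
    """
    # Inverted index: n-gram -> ids of the pages that contain it.
    index = defaultdict(set)
    for page_id, text in pages.items():
        for gram in word_ngrams(text, n):
            index[gram].add(page_id)

    # Count how many n-grams each pair of pages has in common.
    shared = defaultdict(int)
    for page_ids in index.values():
        for a, b in combinations(sorted(page_ids), 2):
            shared[(a, b)] += 1

    return {pair: count for pair, count in shared.items() if count >= min_shared}


# Invented sample pages: two contain the same ten-word passage.
sample_pages = {
    "paper_a": "the old mill by the river stood silent all winter and other news",
    "paper_b": "local notes the old mill by the river stood silent all winter",
    "paper_c": "prices for corn and wheat rose sharply at market this week",
}
sample_matches = candidate_matches(sample_pages)
```

Requiring several shared sequences rather than one exact full-passage match is what lets this kind of approach tolerate OCR errors: a misread word breaks only the few five-word windows that contain it, while the rest of the passage still matches.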