Digital Text

Disclaimer: The information on this page is provided as an overview. The course outline, readings, and assignments may be subject to change in the final syllabus as determined by the lecturer and/or preceptors.

This course is an introduction to developing and working with texts electronically, particularly literary and historical language texts. It lays a strong conceptual foundation and provides hands-on experience working with texts. Many introductions to text analysis or corpus building focus on texts such as newspapers, product reviews, or tweets. This course instead focuses on texts of particular relevance to Signum students (for example, a Norse saga or a 19th-century novel) and on the fundamentals of preparing and analysing specific literary texts.

The course will provide both a conceptual and practical introduction to topics such as text preparation, copyright, metadata, character encoding, typography, markup, annotation, concordancing, and some basic quantitative analysis. No background in statistics or programming is required, although we will touch on those two fields in the context of text analysis. Students will have an opportunity to work on projects relating to a text of their choice (an out-of-copyright literary text in English or, if the student has an appropriate language background, a text in a historical language). The course will also provide a useful foundation for further study in digital philology, computational literary studies, natural language processing, and quantitative methods in the humanities.

Prerequisite: This course has no prerequisite; however, a student who wants to focus on a historical language text for their project (e.g. an Old Norse saga or Old English poem) should already have some familiarity with that language (either through a Signum course or a comparable course at another institution).

Note on M.A. Degree Requirements: This course does count toward a Master’s student’s language requirement. This course does not automatically count toward any concentration. However, a student may petition to have it count toward a concentration depending on the topic of their project(s) (e.g. a student who uses a historical language text for their projects might count it toward a Germanic Philology concentration). Any such petition is subject to approval by the course faculty and the Dean of Language and Literature. The student should contact their advisor to facilitate the petition for approval.

Weekly Schedule

This live course will include two 1-hour lectures and two 1-hour discussion sessions per week as assigned (4 hours total weekly). Remember to indicate your availability on the Goldberry registration system.

Course Schedule

Week 1

  • Introduction & History

Week 2

  • Text Sources
  • Copyright and Metadata

Week 3

  • Text Structure and Citation Systems
  • Typography and Typefaces

Week 4

  • Character Encoding and Unicode
  • Markup and Markup Languages

Week 5

  • Basic SGML / XML
  • HTML and CSS

Week 6

  • Version Control and GitHub
  • The Text Encoding Initiative (TEI) Part 1

Week 7

  • The Text Encoding Initiative (TEI) Part 2
  • Search and Information Retrieval

Week 8

  • Wildcards, Globs, and Regular Expressions
  • Tokens, Types, and Word Frequency

Week 9

  • Stemming, Lemmatization, and Part of Speech Tagging
  • Concordances and Corpus Linguistics

Week 10

  • Probability and Descriptive Statistics
  • Inferential Statistics

Week 11

  • Introduction to Python
  • Text Processing with Python

Week 12

  • Natural Language Processing
  • Visualization
  • Next Steps

Note on the Summer 2022 Schedule: The Summer 2022 semester will include a one-week break during the week of June 20, 2022.

Required Texts

There are no required texts to purchase for this course. Required readings will be supplied by the instructor in the final syllabus.

Course History

Semester | Preceptor(s)
Summer 2022 | James Tauber

START: May 2, 2022

DURATION: 12 Weeks

ID: LNGA 5303

CREDIT: 3