Chas Emerick

PDFDATA.io

Programming data for display, the PDF Story

As lovers of research and academic papers, we are in intimate contact with PDF documents on a daily basis. PDF is also used extensively in other publishing contexts, and in many industries and roles it is the primary means of data interchange between organizations. Despite this ubiquity and the attendant importance of the file format and its specification, the history of PDF is not widely known, and the engineering challenges and design choices that it and its predecessors faced are rarely contemplated.

In this talk, Chas will provide a narrative history of PDF and its immediate predecessors, PostScript and Interpress, and explore the problem space of page description languages in general. Since these technologies were largely developed and promulgated from within commercial organizations, there are no papers per se to love, but Chas' narrative will be grounded in the internally published whitepapers that motivated the work, as well as decades-old public newsgroup postings from people involved at the time. The session will conclude with a brief tour through the internals of a couple of sample PDF documents, tying their concrete structure to the engineering challenges and design choices of the PDF specification discussed earlier.

References

PostScript and Interpress: a comparison by Reid, Brian (Newsgroup posting, March 1985)
The Camelot Project by Warnock, John (1991)

Biography

Chas Emerick is the founder of PDFDATA.io, the API for structured data extraction, where he continues a now 15-year history of building tools to usefully recover data from PDF documents. His other technical interests include distributed systems and programming language theory suited for liberatory and intergalactic computing.