Parsing and Understanding in a Messy World
Every system interacts with the world, which means that nearly every system includes a parser. Parsers are the system’s immune system: they divide the (untrusted) environment from the (safe) interior. In fact, a typical system may include thousands of parsers for different types of data exchange. It’s unfortunate, therefore, that parsers are often terrible: unsafe, insecure, and ambiguous. Parser flaws are the root cause of many of the security vulnerabilities that plague modern software.
Why are parsers so problematic, and what can we do about it? I think there are two main issues, and we’ve built two tools to try to remedy them. First, it’s often unclear how data should be understood. Real data formats are messy, with many dialects produced by different, conflicting implementations. To solve this problem, we need a way to play detective with a large dataset of examples: this is our Format Analysis Workbench (the FAW). Second, it’s hard to build correct parsers, even once we understand what they should do. To solve this problem, we need a safe, simple description language that can deal with the complexities of modern data formats: this is our language, Daedalus.
In this talk, I’ll explain how parsing goes wrong, and how the FAW and Daedalus might be able to help. I’ll illustrate all of this with war stories from a deeply weird format that has nonetheless become critical to all our lives: the Portable Document Format, PDF.