Document understanding has come a long way in recent years, with advances in systems that can automatically process complex business documents. These systems have the potential to streamline business workflows by reducing errors and manual work. Researchers have been developing models based on the powerful Transformer architecture, such as PaLM 2, which have shown impressive accuracy improvements. However, there's a catch: the datasets used in the academic literature don't capture the challenges found in real-world applications. So while these models perform well on academic benchmarks, they struggle when applied to complex real-world scenarios.
That's where the Visually Rich Document Understanding (VRDU) dataset comes in. Presented at KDD 2023, this dataset aims to bridge the gap between academic benchmarks and real-world use cases by meeting five requirements for a good document understanding benchmark. First, it is built from real-world use cases, so accuracy on the benchmark reflects how models would actually perform in practice. Second, it includes complex and diverse document layouts, including tables, key-value pairs, and multi-column formats. This matters because real-world documents rarely follow simple sentence-and-paragraph structure.
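To make the layout point concrete, here is a hypothetical target extraction for an ad-buy-style form (the field names are illustrative inventions, not the dataset's actual schema): flat key-value pairs alone can't capture the repeated line items that a table encodes.

```python
# Hypothetical target extraction for one ad-buy form. Field names are
# illustrative only; they are not VRDU's actual schema.
extraction = {
    "advertiser": "Committee to Elect Jane Doe",
    "agency": "Acme Media Buying",
    "line_items": [  # a repeated field, backed by a table in the document
        {"program": "Evening News", "spots": 4, "rate": 250.00},
        {"program": "Morning Show", "spots": 2, "rate": 175.00},
    ],
}

# Downstream consumers depend on the table structure being preserved,
# e.g. to compute the total cost of the buy.
total = sum(item["spots"] * item["rate"] for item in extraction["line_items"])
print(total)  # 1350.0
```

A model that only emits one value per key would lose the second line item entirely, which is why layout-rich documents need a richer extraction target.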
Third, the VRDU dataset features diverse templates to challenge models' ability to generalize to new layouts. Fourth, the dataset provides high-quality OCR results, so evaluation focuses on the document understanding task itself rather than on variations in OCR quality. And last but not least, the dataset includes token-level annotations, which tie each extracted value to specific tokens in the text and keep the training data free of noisy, ambiguous examples.
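As a sketch of why token-level annotations give clean training data, suppose each document is an ordered list of OCR tokens and each annotation is a span of token indices (the field names and helper below are illustrative assumptions, not VRDU's actual format). Spans can then be converted deterministically into BIO tags for a sequence-labeling model, with no ambiguity about which occurrence of a value was labeled:

```python
from dataclasses import dataclass

@dataclass
class EntitySpan:
    label: str        # e.g. "registrant_name" (illustrative field name)
    token_start: int  # inclusive index into the OCR token list
    token_end: int    # exclusive index

def to_bio_tags(num_tokens, spans):
    """Convert token-level entity spans into BIO tags for sequence labeling."""
    tags = ["O"] * num_tokens
    for span in spans:
        tags[span.token_start] = f"B-{span.label}"
        for i in range(span.token_start + 1, span.token_end):
            tags[i] = f"I-{span.label}"
    return tags

tokens = ["Registrant:", "Acme", "Media", "Group", "Date:", "2023-01-05"]
spans = [EntitySpan("registrant_name", 1, 4), EntitySpan("date", 5, 6)]
print(to_bio_tags(len(tokens), spans))
# ['O', 'B-registrant_name', 'I-registrant_name', 'I-registrant_name', 'O', 'B-date']
```

With value-only annotations, by contrast, a string like "Acme" appearing twice in the document would leave the training pipeline guessing which tokens to tag.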
The VRDU dataset combines two corpora built from publicly available documents: Registration Forms and Ad-Buy Forms. Together they provide representative examples of real-world use cases, from the details of political advertisements to information about foreign agents registering with the US government. We gathered a random sample of documents, used Google Cloud's OCR to convert the images to text, and had an experienced team of annotators label the documents to ensure accurate annotations.
To evaluate models on the VRDU dataset, we defined three tasks: Single Template Learning, Mixed Template Learning, and Unseen Template Learning. These tasks measure how well models handle a single fixed template, a shared set of templates, and templates unseen during training, respectively. The results show that VRDU is challenging and leaves room for improvement even for state-of-the-art models. We also found that few-shot performance, even for the best models, is surprisingly low.
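The three settings can be sketched as different ways of splitting documents by template. The helper below is a hypothetical illustration of the idea, not the benchmark's actual tooling:

```python
import random

def split_by_template(docs, task, seed=0):
    """Partition (doc_id, template_id) pairs into train/test sets,
    mirroring the three evaluation regimes (illustrative sketch only)."""
    rng = random.Random(seed)
    templates = sorted({t for _, t in docs})
    if task == "single":
        # Single Template Learning: train and test on one fixed template.
        chosen = templates[0]
        pool = [d for d in docs if d[1] == chosen]
        rng.shuffle(pool)
        cut = len(pool) // 2
        return pool[:cut], pool[cut:]
    if task == "mixed":
        # Mixed Template Learning: train and test draw different documents
        # from the same set of templates.
        pool = list(docs)
        rng.shuffle(pool)
        cut = len(pool) // 2
        return pool[:cut], pool[cut:]
    if task == "unseen":
        # Unseen Template Learning: test templates never appear in training.
        cut = len(templates) // 2
        train_templates = set(templates[:cut])
        train = [d for d in docs if d[1] in train_templates]
        test = [d for d in docs if d[1] not in train_templates]
        return train, test
    raise ValueError(f"unknown task: {task}")

docs = [(f"doc{i}", f"template{i % 4}") for i in range(20)]
train, test = split_by_template(docs, "unseen")
print(sorted({t for _, t in train}), sorted({t for _, t in test}))
# ['template0', 'template1'] ['template2', 'template3']
```

The unseen-template split is the hardest regime because the model must generalize to layouts it has never observed, which is where the paper reports the largest gaps.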
In conclusion, the release of the VRDU dataset marks an important step toward better document understanding models. It provides a benchmark that captures the complexity of real-world applications and lets researchers track progress more effectively. By meeting all five requirements, VRDU sets a new standard for document understanding benchmarks. So get ready, because the future of document understanding is about to get a whole lot better!