As everything goes digital, the demand for machine-based document digitization is higher than ever. Every organization wants its documents digitized because digital documents are easy to search, while maintaining hard copies is both expensive and tedious. Hard-copy documents also become unreadable after a few years as the paper ages, and they can be destroyed by criminals, natural disasters, and so on. All of this makes document digitization inevitable. But digitizing documents manually is both expensive and time-consuming.
So how can we digitize documents fast and cheaply?
That’s where deep learning systems come into play. Document pages can be captured with cameras, and those images can be fed into a deep learning system that recognizes the text and extracts information from them. The extracted information can then be used to fill a predefined template for a particular document type and stored on a local system or in the cloud.
Our first attempt at the problem was very basic. We used an OCR system to perform text recognition on the document images. Once the text was extracted, we applied hand-crafted rules based on regular expressions to pull the information out of the documents. Text in documents follows certain patterns: dates appear in formats such as dd-mm-yyyy or yyyy-mm-dd, addresses tend to be strings separated by commas, and so on.
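As a rough illustration (the patterns and field names below are invented for the example, not our production rules), such a rule-based extractor might look like this:

```python
import re

# Illustrative rule-based extractor: each field is matched with a
# hand-crafted regular expression over the raw OCR output.
PATTERNS = {
    # dd-mm-yyyy or yyyy-mm-dd (also accepts "/" as a separator)
    "date": re.compile(r"\b(?:\d{2}[-/]\d{2}[-/]\d{4}|\d{4}[-/]\d{2}[-/]\d{2})\b"),
    # key-value pairs such as "Name: Jane Doe"
    "name": re.compile(r"Name\s*[:\-]\s*([A-Za-z .]+)", re.IGNORECASE),
    # naive address heuristic: a long line containing at least two commas
    "address": re.compile(r"^(?=(?:[^,]*,){2,})(.{20,})$", re.MULTILINE),
}

def extract_fields(ocr_text: str) -> dict:
    """Apply every pattern to the OCR text and keep the first match."""
    fields = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[field] = match.group(match.lastindex or 0).strip()
    return fields

if __name__ == "__main__":
    sample = "ACME Corp\nName: Jane Doe\nDOB: 12-04-1990\n21, Baker Street, London, UK"
    print(extract_fields(sample))
```

Rules like these work as long as the documents stay close to the patterns they encode, which is exactly where the approach starts to break down.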
Having realized the limitations of the rule-based approach, we decided to research a better one. We found that the problem of information extraction can be represented in the form of graphs. The intuition behind using graphs comes from the way we humans identify important information in documents. If you are given an identity card and asked to find the person's name, DOB, and address, you might look for key-value pairs on the card, or infer the information from the location of the text in the image. For example, the topmost text on the card is likely the organization that issued it.
Exploring the computer vision literature, we found that the Graph Convolutional Network (GCN) is a type of network that combines visual and textual information to build a graph over the document, and then classifies the graph nodes to identify the category of each piece of text, thereby extracting the information from the document.
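As a rough illustration of the idea (not our actual model), a single graph-convolution layer propagates each node's features to its neighbours through a normalized adjacency matrix, and a classifier then scores every node against the label set:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Add self-loops and symmetrically normalize the adjacency matrix.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        deg_inv_sqrt = adj.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
        # Aggregate neighbour features, then transform them.
        return torch.relu(self.linear(norm_adj @ node_feats))

# Toy usage: 4 text nodes with 16-d features, classified into 3 labels.
feats = torch.randn(4, 16)
adj = torch.tensor([[0., 1, 1, 0], [1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
gcn = GCNLayer(16, 32)
classifier = nn.Linear(32, 3)         # e.g. {name, organization, other}
logits = classifier(gcn(feats, adj))  # shape: (4 nodes, 3 labels)
print(logits.shape)
```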
The above diagram gives a rough idea of our system.
The whole pipeline can be broken down into the following steps:
First, we perform OCR on the document image to extract the texts and their corresponding bounding boxes.
The textual information is passed to a transformer module, which converts it into feature vectors. The bounding boxes obtained in the previous step are used to crop the image regions containing those texts, and the cropped regions are passed through a convolutional neural network (CNN) to extract visual features.
We now have textual features from the transformer and visual features from the CNN. Both are passed to the Graph Neural Network (GNN), which models the texts as nodes. The relationships between these nodes are established with the help of the visual features obtained by the CNN. Once the nodes are built by the GNN, they can be processed further and classified into labels.
A BiLSTM layer and a CRF layer follow the GNN; they take the graph features and classify the nodes into their labels, e.g. Name, Company, etc. A simplified sketch of how these pieces fit together is shown after these steps.
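The sketch below wires the steps together in PyTorch, assuming the OCR output (character IDs and an image crop per bounding box, plus an adjacency matrix over the text nodes) is already available. The module sizes, the single graph-convolution step, and the plain linear classifier standing in for a full CRF are simplifications for illustration, not our deployed architecture:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Transformer encoder turning each OCR'd text segment into one feature vector."""

    def __init__(self, vocab_size=256, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)  # byte-level characters
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, char_ids):                                # (nodes, chars)
        return self.encoder(self.embed(char_ids)).mean(dim=1)   # (nodes, dim)

class VisualEncoder(nn.Module):
    """Small CNN over the image crop behind each bounding box."""

    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, crops):                                   # (nodes, 3, H, W)
        return self.proj(self.conv(crops).flatten(1))           # (nodes, dim)

class InfoExtractor(nn.Module):
    """Text + visual features -> graph convolution over nodes -> BiLSTM -> labels."""

    def __init__(self, num_labels, dim=128):
        super().__init__()
        self.text_enc = TextEncoder(dim=dim)
        self.vis_enc = VisualEncoder(dim=dim)
        self.gcn_lin = nn.Linear(2 * dim, dim)
        self.bilstm = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(dim, num_labels)  # a CRF would sit on top here

    def forward(self, char_ids, crops, adj):
        node_feats = torch.cat([self.text_enc(char_ids), self.vis_enc(crops)], dim=-1)
        # One graph-convolution step: add self-loops, normalize, aggregate, transform.
        adj = adj + torch.eye(adj.size(0), device=adj.device)
        deg = adj.sum(dim=1).pow(-0.5)
        norm_adj = deg.unsqueeze(1) * adj * deg.unsqueeze(0)
        graph_feats = torch.relu(self.gcn_lin(norm_adj @ node_feats))
        seq, _ = self.bilstm(graph_feats.unsqueeze(0))   # treat the nodes as one sequence
        return self.classifier(seq.squeeze(0))           # (nodes, num_labels)

# Toy forward pass: 5 text nodes, 12 characters each, 32x96 crops, 4 labels.
model = InfoExtractor(num_labels=4)
char_ids = torch.randint(0, 256, (5, 12))
crops = torch.randn(5, 3, 32, 96)
adj = (torch.rand(5, 5) > 0.5).float()
print(model(char_ids, crops, adj).shape)  # torch.Size([5, 4])
```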
Say we want to extract the organization name from an ID card; we would then train the model to classify the text nodes containing the organization name under that label and everything else as "other".
Like any other project, this one came with its own challenges.
One of the biggest problems was getting a good amount of data to experiment with. Despite an intensive search we couldn't find any good dataset; in fact, even finding a decent number of ID card images wasn't possible. So we decided to create our own dataset with a mix of natural and synthetic images, and built tools to generate synthetic card images.
We also created a tool to annotate these images automatically. Within a few days we had enough data for experimentation.
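Our generation and annotation tools are internal, but the core idea can be sketched with PIL: render field values at known positions on a blank card, so the bounding boxes and labels, i.e. the annotations, come for free (the layout, fields, and values below are made up for the example; a real tool would also vary fonts, positions, backgrounds, and noise):

```python
import json
from PIL import Image, ImageDraw, ImageFont

# Hypothetical field layout on a blank card template: (x, y) text positions.
FIELDS = {
    "organization": (20, 15),
    "name": (20, 60),
    "dob": (20, 100),
    "address": (20, 140),
}

def make_card(values: dict, out_img="card.png", out_ann="card.json",
              size=(480, 300)):
    """Render one synthetic card and write its annotation file."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()
    annotations = []
    for field, (x, y) in FIELDS.items():
        text = values[field]
        draw.text((x, y), text, fill="black", font=font)
        # The drawn region doubles as the ground-truth bounding box.
        x0, y0, x1, y1 = draw.textbbox((x, y), text, font=font)
        annotations.append({"label": field, "text": text,
                            "box": [x0, y0, x1, y1]})
    img.save(out_img)
    with open(out_ann, "w") as f:
        json.dump(annotations, f, indent=2)

make_card({
    "organization": "ACME Corp",
    "name": "Name: Jane Doe",
    "dob": "DOB: 12-04-1990",
    "address": "21, Baker Street, London, UK",
})
```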
The model we experimented with was based on the work of Wenwen Yu et al. It was too heavy to deploy, so we needed to modify the architecture of the networks involved to make it smaller and less computationally demanding. Working from the intuition that the textual features matter less than the visual features for node classification, we modified the transformer and CNN blocks, and we also modified the GNN. After several experiments we arrived at an architecture that is both small and accurate.
We tested our model on around 100 card images. The metric we used for evaluation was MEF (Mean Entity F1 score). The MEF of our model on the test data was approximately 99.17, which seems pretty good.
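Roughly speaking, an entity-level F1 treats an extracted field as correct only when it matches the ground truth, computes F1 per field, and averages across fields. The exact protocol from the original paper may differ, but a simplified version of the computation looks like this:

```python
from collections import defaultdict

def mean_entity_f1(predictions, ground_truth):
    """Each element of predictions / ground_truth maps a field name to its string value.

    A predicted entity counts as a true positive only if it exactly matches the
    ground-truth value of that field on that document.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for pred, gold in zip(predictions, ground_truth):
        for field, value in gold.items():
            if pred.get(field) == value:
                tp[field] += 1
            else:
                fn[field] += 1      # the true entity was missed
                if field in pred:
                    fp[field] += 1  # and something wrong was predicted instead
        for field in pred:
            if field not in gold:
                fp[field] += 1      # spurious prediction
    f1s = []
    for field in set(tp) | set(fp) | set(fn):
        p = tp[field] / (tp[field] + fp[field]) if tp[field] + fp[field] else 0.0
        r = tp[field] / (tp[field] + fn[field]) if tp[field] + fn[field] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0

# Example: two documents, two fields each.
preds = [{"name": "Jane Doe", "dob": "12-04-1990"}, {"name": "John Roe"}]
golds = [{"name": "Jane Doe", "dob": "12-04-1990"},
         {"name": "John Roe", "dob": "01-01-1980"}]
print(mean_entity_f1(preds, golds))
```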
In this article, we saw how Graph Convolutional Networks can be used to extract information from document images and help in the digitization of documents. A proper implementation of this approach can yield a robust and accurate system that saves an organization a great deal of time and money in digitizing its documents.