Donut 🍩: OCR-free Document Understanding Transformer (Research Paper Walkthrough)
Hey, hi everyone, welcome back to a new video. Today we'll be talking about a paper titled "OCR-free Document Understanding Transformer", also called Donut. I found this paper being circulated a lot on my LinkedIn these days, so I thought of reading it out for you. The paper is from NAVER, Upstage, Tmax, Google, and LBox. Before we move forward: if you're new to this channel, make sure to hit the like button and subscribe. My name is Prakhar Mishra, and on this channel I mainly talk about research papers and general concepts in machine learning, with a particular focus on NLP, so check out the other videos as well and see if you enjoy the content.

Cool, so let's start with the abstract. Understanding document images, such as invoices, is a core but challenging task, since it requires complex functions such as reading the text and a holistic understanding of the document. Current visual document understanding (VDU) methods outsource the reading of text to off-the-shelf OCR engines and focus on the understanding task with the OCR output. So the use case we're talking about is, let's say, parsing invoices. Depending on the vendor, invoices can vary a lot in layout, but the central elements, such as the items, the prices, the total price, and the company branding, are likely to remain constant; only their positioning will differ. The task is: can we understand these invoices and extract, say, each item and its related price? To date, what people have been doing is: you take the image of your invoice and pass it to some OCR engine, let's say Tesseract. That will give you all the text written in the image, but what it lacks is the structure: the spacings, as in the new lines, the paragraph breaks, the tabular structure.
Those structural cues are not preserved; you get a more or less flattened output with some nuances here and there, which you can use to write a post-processing function and eventually produce an output that caters to your business use case, or whatever you're interested in. So this is the typical pipeline that people have been using.
Now, what this paper says is that the OCR step you apply in between has its limitations. One limitation is the high computational cost. These OCR engines are also not flexible across document types and languages: whether you pass in an invoice, a research paper, or a resume, none of those are really understood by the OCR engine; it just knows how to extract text. Whatever you write as part of the post-processing function becomes the heart of your code, because that is what knows what to extract and how to parse the OCR output. OCR can also introduce errors: it may fail to recognize the characters in a scanned image properly, which could happen for many reasons, and that error then propagates to the subsequent process, which in our case is the post-processing function.

To address all of these issues, they propose an OCR-free VDU model, which they call Donut, standing for Document Understanding Transformer, with a certain pre-training objective that makes the entire system learn more about the document structure and the inherent properties that every document comes with. They found it to achieve state-of-the-art performance on various VDU tasks in terms of both speed and accuracy. Not only this, they also propose a technique for generating synthetic data, which they use for pre-training the model. So let's move forward and see the exact method they propose. This figure shows the typical way people have been parsing documents: you start with the document image, apply OCR, and get the bounding boxes of all the text segments in that image, which will essentially be in the form of a
JSON. So you'll have all the words, each with its bounding box and the actual text, then the next text segment and its bounding box, and so on. Now, to get to the understanding part, you'll have to write a post-processing function that takes this in and recovers more meaningful sequences. For example, look at "3002 Kyoto": it is split into three words here, with three bounding boxes, but in the output we want them to come out as a single line, and not only that, but also joined with the associated price and unit, which appear somewhere else. That is the post-processing function you'll have to write, and that's what I was talking about.

This slide again shows the same thing: you do OCR followed by some post-processing, which could mean training a LayoutLM model, or feeding the OCR output to a BERT model for some kind of classification. Essentially, the pipeline looks like: you have an image, you apply OCR, you train a downstream model, and you get the output. Donut, in contrast, is an end-to-end model where you take an image and directly get the output. These are some of the benchmarks they tested on: on some of the datasets you get an accuracy of around 87 percent for the typical pipeline versus around 94 percent for the end-to-end system they propose, and it also processes each image faster with less memory. Okay, so this is how the entire pipeline for the Donut model works: you have an input image and a prompt.
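To make the post-processing step concrete, here is a minimal sketch of one such function: grouping a flat list of OCR word boxes back into reading lines, in the spirit of the "3002 Kyoto" example. The field names and the vertical tolerance are my own assumptions for illustration, not the format any particular OCR engine or the paper actually uses.

```python
def group_into_lines(words, y_tol=5):
    """Group OCR words whose vertical centers are within y_tol pixels,
    then sort each group left-to-right to recover the reading order."""
    lines = []
    for w in sorted(words, key=lambda w: (w["top"], w["left"])):
        center = w["top"] + w["height"] / 2
        for line in lines:
            if abs(line["center"] - center) <= y_tol:
                line["words"].append(w)
                break
        else:
            lines.append({"center": center, "words": [w]})
    # join each line's words left-to-right into a single string
    return [" ".join(w["text"] for w in sorted(l["words"], key=lambda w: w["left"]))
            for l in lines]

# Toy OCR output: three words on one line, one word on the next
words = [
    {"text": "3002",  "left": 10,  "top": 100, "height": 12},
    {"text": "Kyoto", "left": 60,  "top": 101, "height": 12},
    {"text": "1,000", "left": 200, "top": 99,  "height": 12},
    {"text": "Total", "left": 10,  "top": 130, "height": 12},
]
print(group_into_lines(words))  # ['3002 Kyoto 1,000', 'Total']
```

Real post-processing is usually much messier than this (skewed scans, multi-column layouts), which is exactly the brittleness the paper is arguing against.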
The use of the prompt here is to make the model learn multiple tasks, for example: classifying the image, doing visual question answering (say, asking what the price of a certain item in that invoice is), or simply parsing the document for some downstream task. So you start by taking the input image and passing it to a Transformer encoder; they use a Swin Transformer for this, which learns a representation of the image at the patch level.
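As a toy illustration of that first patch-level step, the sketch below cuts a tiny 4x4 "image" into non-overlapping 2x2 patches and flattens each into a vector. A real Swin Transformer additionally projects the patches linearly and applies shifted-window attention; this only shows the patch partitioning itself.

```python
def to_patches(image, patch=2):
    """Split a 2-D grid of pixel values into flattened patch vectors,
    iterating over patches in row-major order."""
    h = len(image)
    patches = []
    for i in range(0, h, patch):
        for j in range(0, len(image[0]), patch):
            patches.append([image[i + di][j + dj]
                            for di in range(patch) for dj in range(patch)])
    return patches

# A 4x4 image whose pixel value equals its flat index
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(to_patches(image))  # 4 patches of 4 values each
```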
Those patch representations are then eventually aggregated. Once you get that embedding representation for the image, which now captures the inherent structure, the text, and the style in which everything is laid out in the image, you pass it to a Transformer decoder; they use a pre-trained BART model. The first sequence given to the decoder is the prompt, which hints to the decoder what kind of thing to decode from then on. For example, if you give this image, you get, let's say, a 768-dimensional representation. For intuition, think of a simple LSTM decoder: this would be the thought vector, which is nothing but the image representation, and the first thing you feed in is the prompt. This could be parsing, it could be VQA followed by the question, or it could be classification.

Once that is done, the model starts generating whatever output suits the current prompt, and the output sequence looks like an XML-style structure: for classification, the class tag says, for example, receipt; for VQA, you generate the answer, and since you already had an opening answer tag, you just close it and close the task tag as well; and for parsing, since the parse tag was opened, you generate the entire parse and then close it to say the task is done. Basically, the opening token of the prompt that defines the task has to be closed, and that closing tag is treated as the end-of-sequence token. Since we're dealing with a Transformer encoder and decoder, training can be done in an end-to-end fashion with a cross-entropy loss, because it's a generative task at the end of the day. Cool, so this is the entire model they propose, which they title Donut, and this is exactly what they have written here. Now, about the other part, the synthetic document generator: they use this to generate 0.
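The XML-style output sequence above is ultimately converted into structured JSON. Here is a minimal sketch of such a conversion; the tag names (`s_class`, `s_item`, ...) are illustrative, not the exact special tokens from the paper, and this toy parser does not handle repeated sibling keys (lists), which a real converter would need.

```python
import re

TOKEN = re.compile(r"<(/?)s_(\w+)>")  # matches <s_key> and </s_key>

def tokens_to_dict(seq):
    """Parse a flat '<s_key>value</s_key>' sequence into a nested dict."""
    stack = [{}]   # stack of open dicts; the top is where values land
    keys = []      # names of the currently open tags
    pos = 0
    for m in TOKEN.finditer(seq):
        text = seq[pos:m.start()].strip()
        if text:
            stack[-1]["#text"] = text
        if m.group(1):                  # closing tag: fold child into parent
            child, key = stack.pop(), keys.pop()
            # collapse pure-text nodes to a plain string
            value = child["#text"] if list(child) == ["#text"] else child
            stack[-1][key] = value
        else:                           # opening tag: start a new child dict
            stack.append({})
            keys.append(m.group(2))
        pos = m.end()
    return stack[0]

seq = ("<s_class>receipt</s_class>"
       "<s_item><s_name>Kyoto</s_name><s_price>1,000</s_price></s_item>")
print(tokens_to_dict(seq))
# {'class': 'receipt', 'item': {'name': 'Kyoto', 'price': '1,000'}}
```

Because the output is plain token generation, adding a new field to extract only means adding new tag tokens, with no bounding-box alignment step.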
5 million samples per language, for Chinese, Japanese, Korean, and English. Apart from this, they also use the IIT-CDIP dataset, which has 11 million scanned English document images; this, along with whatever data the synthetic document generator produces, is used for pre-training. The way the synthetic document generator works is that they define certain layouts. Say this is a page: you'll have a background image, this region will be a text sequence, this one an image, and this one a floating-point number. Like this, you mimic the kinds of variation that invoices and other formatted documents would have, and then a function fetches data from various sources to fill the required blocks: for the background, they sample from ImageNet, and the text sequences and numbers are randomly scraped from Wikipedia. So if you look at a resulting invoice as a whole, not all of them would make sense, but the idea is for the model to learn the structure and to extract what's written; it is not the model's task to judge whether what is written makes sense. It just has to extract and output it; downstream, you can write other systems to check whether what was extracted has the correct format or makes sense.

The pre-training objective is to minimize the cross-entropy of next-token prediction: if you take an image, run OCR on it, and order the words in the top-left to bottom-right pattern, which is the typical way of reading a document, that becomes your output sequence, and you train the model against it. So that's the pre-training; after this, they fine-tune it using the prompts. Cool, I think we're done with the paper now. If you're new to this channel, make sure to subscribe and like this video.
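The top-left to bottom-right serialization of the pre-training target can be sketched like this. The paper states only the reading-order principle; the row-bucketing heuristic (quantizing the top coordinate into line bands) is my assumption for illustration.

```python
def reading_order(words, line_height=20):
    """Serialize word boxes into one string, top-left to bottom-right:
    bucket words into rows by their top coordinate, then sort left-to-right."""
    ordered = sorted(words, key=lambda w: (w["top"] // line_height, w["left"]))
    return " ".join(w["text"] for w in ordered)

# Word boxes deliberately out of reading order
words = [
    {"text": "Total",   "left": 10,  "top": 45},
    {"text": "Invoice", "left": 10,  "top": 5},
    {"text": "#42",     "left": 90,  "top": 6},
    {"text": "1,000",   "left": 120, "top": 44},
]
print(reading_order(words))  # "Invoice #42 Total 1,000"
```

This serialized text is what the decoder learns to generate from the image alone during pre-training, which is why the model ends up "reading" without an OCR engine at inference time.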
If you enjoyed it, also share it with your friends, or whoever is interested in such content.
I'll meet you in the next one. Bye bye and take care.