Process of Benchmarking
The first step in building a good benchmark is to collect enough samples to be representative of the types of documents your system will be processing. A good rule of thumb is to have 100+ samples for each type of document you will be processing.
Answer Key (Golden Set)
Once you have your samples collected, the next step is to process these documents as you would during a production run. This means that the samples you process need to have (1) all the documents classified correctly and (2) the extracted data validated correctly. This gives the system a baseline for how each document should be classified and what the data should be for each document.
Creating a Golden Set
To create a golden set, you must process each document through the system end to end until all of the information is correct. For example, in Kofax, you would put your samples into a batch and scan them in. The system would separate, classify, and extract as best it could. The user would then correct any classification and extraction errors so that each document is 100% correct and is exactly what would be exported to the backend system.
Now that you have the sample images and have processed them to create your golden set, or answer key, you can run your test runs. This process runs the sample images through the system automatically, tests the results against the golden set, and then reports the benchmark results. In most benchmarking tools, you can save any one of the test runs as a baseline so that every test run after it can be compared against the baseline, showing whether the results are better or worse than the baseline benchmark.
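The comparison step can be sketched in a few lines of code. This is a minimal illustration, not the implementation of any particular benchmarking tool: it assumes each document's results are a simple mapping of field names to extracted values, and the function name `compare_to_golden` is hypothetical.

```python
def compare_to_golden(golden: dict, test_run: dict) -> float:
    """Return the fraction of fields whose test-run value matches the golden value."""
    matches = sum(1 for field, value in golden.items()
                  if test_run.get(field) == value)
    return matches / len(golden)

# Golden set for one document, corrected by a user to be 100% accurate.
golden = {"invoice_number": "INV-1001", "total": "250.00", "date": "2024-01-15"}

# An automatic test run: the date field was extracted incorrectly.
run_a = {"invoice_number": "INV-1001", "total": "250.00", "date": "2024-01-16"}
baseline = compare_to_golden(golden, run_a)  # save run A as the baseline

# A later run after tuning: all fields now match the golden set.
run_b = {"invoice_number": "INV-1001", "total": "250.00", "date": "2024-01-15"}
improvement = compare_to_golden(golden, run_b) - baseline  # positive = better
```

A real tool would aggregate these scores across all sample documents and break them out by field and by document type, but the core idea is the same field-by-field comparison against the answer key.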
Analyzing the Benchmark
There are four possible outcomes for each document that is processed in a benchmark:
- Correct and confident
- Correct but not confident
- Incorrect and not confident
- Incorrect but confident (False Positives)
When a document is processed, each aspect of the process is assigned a confidence level. The system assigns this level as it evaluates the data: how closely does the candidate data match the search criteria? For example, suppose you are searching for the invoice number on an invoice document and have told the system to look for a series of numbers near a heading of “Invoice Number.” The system pulls every series of numbers and evaluates the nearby headings against “Invoice Number.” Depending on how well a heading matches, it assigns a confidence level to that series of numbers. If an OCR issue caused the system to read “Inv0ice Numbor,” the series of numbers next to it would get a confidence level of about 85%, since the heading matched only 12 out of 14 characters. This evaluation is done for each alternative found on the document. You can adjust the level at which the system considers a match confident. For example, you might set the system to mark any extraction with a confidence level of 80% or above as confident and anything below that as not confident.
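The heading-matching idea above can be sketched with a standard string-similarity measure. This is an assumption-laden illustration, not how any particular capture product scores confidence internally; it uses Python's `difflib.SequenceMatcher` as a stand-in for the system's matching logic, and the names `header_confidence` and `CONFIDENCE_THRESHOLD` are made up for the example.

```python
from difflib import SequenceMatcher

CONFIDENCE_THRESHOLD = 0.80  # extractions at or above this level are "confident"

def header_confidence(expected: str, found: str) -> float:
    """Score how closely an OCR'd heading matches the expected heading (0.0 to 1.0)."""
    return SequenceMatcher(None, expected.lower(), found.lower()).ratio()

# OCR misread "Invoice Number" as "Inv0ice Numbor": 12 of 14 characters match,
# so the similarity lands near the ~85% figure discussed above.
score = header_confidence("Invoice Number", "Inv0ice Numbor")
confident = score >= CONFIDENCE_THRESHOLD
```

Raising `CONFIDENCE_THRESHOLD` trades fewer false positives for more fields routed to manual review, which is exactly the trade-off the four outcomes below describe.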
Correct and Confident
Document data is correct and confident when the benchmark run has indicated the data is the same as the data in the golden set document, and the system is confident that it has processed the document correctly. This means that the particular piece of information captured matches what the golden set document says should be there and the confidence level for that data is above the level set in the system to be confident.
Correct but Not Confident
Document data is correct but not confident when the benchmark run has indicated the data is the same as the data in the golden set document, but the system is not confident enough. This is typically a situation where the data is correct, but the system may have other data that it thinks could be correct as well. As a result, it is not confident that the data it selected is the correct data. In other words, the data is a match to what the user says it should be, but the confidence level is below the threshold set in the system for confident data or multiple pieces of data have the same confidence levels. This could happen if you are looking for “Invoice Number” with a number next to it and this piece of data is located in multiple spots on the form.
Incorrect and Not Confident
Document data is incorrect and not confident when the benchmark run has indicated that the data does not match the data in the golden set document and the confidence level falls below the threshold set in the system. Again using our example of looking for a series of numbers next to “Invoice Number,” this time the system cannot find “Invoice Number” because an OCR error left the heading looking like “$#GW$.” As a result, the system could not extract the data, and the benchmark reports a confidence level of 0% because no part of “Invoice Number” was found.
Incorrect but Confident (False Positives)
Document data is incorrect but confident when the benchmark run has indicated that the data does not match the golden set data, but the system is confident that it got the right data. In this instance, a user would see that the data is wrong, but the system reports it as correct: the extracted data scored above the confidence threshold set in the system, yet it does not match what the user entered into the golden set document.
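The four outcomes described above can be expressed as a small classification function. This is a sketch under the assumptions used throughout this section (an extracted value, a golden-set value, and a confidence score with a threshold); the function name `classify_outcome` is illustrative.

```python
def classify_outcome(extracted, golden, confidence, threshold=0.80):
    """Assign one of the four benchmark outcomes to a single extracted field."""
    correct = extracted == golden
    confident = confidence >= threshold
    if correct and confident:
        return "correct_confident"
    if correct:
        return "correct_not_confident"
    if confident:
        return "incorrect_confident"   # a false positive
    return "incorrect_not_confident"
```

Tallying these outcomes over every field in every document is what produces the counts used for precision and recall in the next section.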
Precision and Recall
Once the benchmarking test run is completed, you can now analyze how well your project is performing.
The first measurement is the precision of the run. This is a check against the false positives that were returned. To calculate it, take the number of confidently extracted fields that were correct and divide by the total number of confidently extracted fields. For example, suppose a test run of 100 documents with 10 fields each produced, in each document, 5 fields that were confident and correct and 2 fields that were confident but actually incorrect (the remaining 3 fields were not confident). The precision of this run is 5 / (5 + 2) = 0.714, or about 71%. This means that when the system was confident about a field, it was correct 71% of the time.
The second measurement is the recall of the run. This is a check against the false negatives, that is, the fields the system extracted correctly but was not confident about. It is calculated by taking the number of fields that were confident and correct and dividing by the number of fields that were correct, whether confident or not. Using the previous example of 100 documents with 10 fields each, each document has 5 confident and correct fields, 1 correct but not confident field, 2 fields that were neither confident nor correct, and 2 fields that were confident but actually incorrect. The recall of this run is 5 / (5 + 1) = 0.833, or about 83%. This means that of all the fields the system extracted correctly, it was also confident about them 83% of the time.