Kofax has powerful tools embedded in its Transformation Modules that allow users to run benchmarks against their Transformation projects.
Extraction benchmarking tests the extraction of data from a sample set of documents against the same set after it has been processed and saved as a Golden Set. The process runs extraction on each document, pulling every field the project is configured to extract. It then compares each extracted value to the value defined in the corresponding Golden Set document and marks the result as one of four classifications.
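The comparison step can be pictured with a small Python sketch. This is an illustration only, not how Kofax implements the benchmark: the field names, the `compare_fields` helper, and the simple match/mismatch outcome are all assumptions made for the example, and the sketch does not reproduce the benchmark's own result categories.

```python
def compare_fields(extracted, golden):
    """Compare extracted field values with golden values, field by field.

    Both arguments are plain dicts of field name -> value. They are
    hypothetical stand-ins for the data stored with each document.
    """
    results = {}
    for name, expected in golden.items():
        # A field the extraction did not return is treated as empty.
        actual = extracted.get(name, "")
        # Assumed rule: exact match after trimming surrounding whitespace.
        if actual.strip() == expected.strip():
            results[name] = "match"
        else:
            results[name] = "mismatch"
    return results


# Hypothetical sample data for one document.
golden = {"InvoiceNumber": "INV-1001", "Total": "250.00"}
extracted = {"InvoiceNumber": "INV-1001", "Total": "260.00"}
print(compare_fields(extracted, golden))
```

Running the comparison per field rather than per document makes it easy to see which fields are dragging down the overall result.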
The first step in extraction benchmarking is creating “Golden Files.” There are two ways to obtain them. The first is to copy the image files and xdoc files (see definition below) from the images folder once a batch has completed the KTM Validation queue. This does not work in a KTA environment and is more of a back-door approach. The second is to add code to the “Document_AfterProcess” script that exports each document to a folder location after it is processed in Validation.
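For the first option, the manual copy can be automated outside of KTM. The following Python sketch shows one possible approach under stated assumptions: the helper name `collect_golden_files` is hypothetical, and the file extensions (TIFF images plus `.xdc` for the xdoc files) are assumptions that should be checked against your own batch folder.

```python
import shutil
from pathlib import Path

# Assumed extensions: TIFF images plus ".xdc" for the xdoc files.
GOLDEN_EXTENSIONS = {".tif", ".tiff", ".xdc"}


def collect_golden_files(source_dir, golden_dir, extensions=GOLDEN_EXTENSIONS):
    """Copy image and xdoc files from source_dir into golden_dir.

    Returns the sorted names of the files that were copied.
    """
    source = Path(source_dir)
    target = Path(golden_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        # Skip subfolders and any file type that is not part of the set.
        if path.is_file() and path.suffix.lower() in extensions:
            shutil.copy2(path, target / path.name)
            copied.append(path.name)
    return copied
```

A call such as `collect_golden_files("path/to/batch/images", "path/to/golden_set")` (both paths are placeholders) would leave the original batch folder untouched and build the Golden Set folder alongside it.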
Once all the “Golden Files” are collected in a folder location, open that folder in Project Builder as a “Test Set.” You can then right-click it and change it to a “Benchmark Set.”
From there, click “Process” in the menu bar, open the “Extraction” drop-down, and choose whether to run the benchmark against the currently selected class, the current class and its children, or all classes.
Classification benchmarking tests the classification result of each document in a sample set against the same set of documents after it has been processed and saved as a Golden Set.
The first step in classification benchmarking is creating “Golden Files.” These can be the same files you used for extraction. For this process, the files only need to record the class each document should receive before the classification process is run.
The above graph shows the results of the Classification benchmark. The right pane displays the results as numbers, and the center view displays them graphically.
The count is the number of documents processed through the benchmark.
The correct count is the number of processed documents that were assigned the class specified by the Golden Set, with a confidence level high enough to select that class.
The incorrect count is the number of processed documents that were not assigned the class specified by the Golden Set.
The unclassified count is the number of processed documents that were not assigned to any class. For these documents, no available class reached the confidence threshold configured in the system.
The exceptions count is the number of processed documents affected by exceptions. Exceptions let you mark a document as valid in the benchmark when its resulting classification should be accepted. For example, suppose you have a Document1 class with a child class called Document1a. If your Golden Set says the classification is Document1, the benchmark will mark the documents invalid because they are classified as Document1a. Adding an exception tells the benchmark that documents classified as Document1a are valid for Document1. You can add exceptions by clicking the exceptions button in the menu bar.
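The relationship between these counts can be sketched in Python. This is a simplified model rather than the benchmark's actual logic: the `tally_classification` helper, the `threshold` value, and the choice to count exception hits separately from correct hits are all assumptions made for illustration.

```python
def tally_classification(results, exceptions=None, threshold=0.5):
    """Tally benchmark-style counts for a list of classification results.

    results: list of (golden_class, predicted_class, confidence) tuples.
    exceptions: dict mapping a golden class to the set of other classes
        that should also be accepted for it.
    threshold: hypothetical minimum confidence needed to assign a class.
    """
    exceptions = exceptions or {}
    counts = {"count": 0, "correct": 0, "incorrect": 0,
              "unclassified": 0, "exceptions": 0}
    for golden, predicted, confidence in results:
        counts["count"] += 1
        if predicted is None or confidence < threshold:
            # No class reached the required confidence.
            counts["unclassified"] += 1
        elif predicted == golden:
            counts["correct"] += 1
        elif predicted in exceptions.get(golden, set()):
            # Accepted only because an exception allows this class.
            counts["exceptions"] += 1
        else:
            counts["incorrect"] += 1
    return counts


# Hypothetical sample: Document1a is an accepted stand-in for Document1.
sample = [
    ("Document1", "Document1a", 0.9),
    ("Document1", "Document1", 0.8),
    ("Document2", "Document1", 0.7),
    ("Document2", None, 0.0),
]
print(tally_classification(sample, {"Document1": {"Document1a"}}))
```

Without the exception entry, the first sample document would fall into the incorrect count instead, which is the situation the Document1 and Document1a example describes.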
Separation benchmarking becomes available when you turn on document separation in the project settings.
Once this is turned on and the project is trained to separate documents based on data on the pages, you can test your separation in the separation benchmark. Do this by taking a sample set of documents (these must be .tiff documents) and putting them into a folder.
The next step is to open the folder in Project Builder and recognize these documents. Then change the “Test Set” to a “Benchmark Set.”
Next, select the “Tree View.”
Assign each document to the correct class. At this point, you are ready to run your separation benchmark. Click the “Process” menu item and then the “Separation” item in the benchmark.
From here, the separation benchmark combines all the individual files into a single multi-page document and then attempts to separate it. Below is the output of this test run.
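Separately from the tool's own output, the merge-then-split idea can be sketched in Python. This is not the benchmark's report format: the helper names and the "correct", "missed", and "extra" labels are hypothetical, and the sketch assumes that a separation result can be described as the list of page indexes where a new document starts.

```python
def expected_boundaries(page_counts):
    """Return the page indexes where each document starts once all
    documents are merged into one page stream (first page is index 0)."""
    starts = []
    position = 0
    for pages in page_counts:
        starts.append(position)
        position += pages
    return starts


def compare_separation(page_counts, predicted_starts):
    """Compare predicted document starts with the known original starts."""
    expected = set(expected_boundaries(page_counts))
    predicted = set(predicted_starts)
    return {
        "correct": sorted(expected & predicted),  # boundaries found
        "missed": sorted(expected - predicted),   # boundaries not found
        "extra": sorted(predicted - expected),    # splits that should not exist
    }


# Hypothetical sample: three documents of 2, 3, and 1 pages.
print(compare_separation([2, 3, 1], [0, 2, 4]))
```

Because the original files are known before they are merged, the expected boundaries come for free, which is what makes a sample folder of individual documents usable as a benchmark.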