Conversion:
KS performs conversions in two ways: through the SPS/SPKS paper and wav conversion systems, and through crawler file-type conversion plug-ins for documents, databases and email. The sections below address both areas, with emphasis on OCR and recorded-voice conversion.
Databases and XML
CrawlScape contains a plug-in and filter for crawling databases in a multi-step process. The process first segments tables into individual records. An XML file (polytuplet) is then created to link records across related tables, which allows PS technology to recombine the individual pieces into full inter-related records. Real-time search on a dynamic database is possible if the database administrator writes XML records for all database transactions: CrawlScape monitors for changes and applies indexing in real time to mirror them. Users never view or manipulate the database applications themselves, which suits privacy and security requirements while still providing a research environment on the data for the pattern-profiling searcher.
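The recombination step can be illustrated with a minimal Python sketch. It assumes rows arrive as plain dictionaries and that related tables share a key column; the element names (record, related) and the function build_polytuplet are hypothetical and are not taken from CrawlScape.

```python
import xml.etree.ElementTree as ET

def build_polytuplet(parent, children, key):
    """Recombine one parent row with its related child rows into one XML record.

    parent   -- dict for the main table row (hypothetical input shape)
    children -- dict mapping a related table name to a list of row dicts
    key      -- name of the column shared by the related tables
    """
    record = ET.Element("record", {key: str(parent[key])})
    for name, value in parent.items():
        if name != key:
            ET.SubElement(record, name).text = str(value)
    related = ET.SubElement(record, "related")
    for table, rows in children.items():
        for row in rows:
            # Only rows that share the parent's key belong to this record.
            if row.get(key) == parent[key]:
                item = ET.SubElement(related, table)
                for name, value in row.items():
                    if name != key:
                        ET.SubElement(item, name).text = str(value)
    return record

patient = {"patient_id": 7, "name": "B Smith"}
tables = {
    "visit": [
        {"patient_id": 7, "reason": "sore throat"},
        {"patient_id": 9, "reason": "checkup"},
    ]
}
xml_text = ET.tostring(build_polytuplet(patient, tables, "patient_id"), encoding="unicode")
```

Writing one such XML record per database transaction is one way an administrator could give the crawler something to index without exposing the database itself.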
Paper (OCR) and Voice conversion (Speak-to-Text)
SPS is an automation infrastructure that requires a network scanner and the ScapeShape system with the SPS option. The user feeds paper into the scanner and selects the send button; the rest is automated. Scans are sent to a conversion repository, where documents are converted and saved in the index of the search system. All scanned documents are created in a triplet structure (polytuplet), so that during search a user may open the original scanned TIFF document or a converted PDF version with page-based segmented PDFs. The system is designed to be an easy-to-use document storage and conversion system. Scanned documents do not need to be intercepted for cleaning and editing, as the SPS application converts, stores and splits them automatically. Refer to the sections below for details.
A scanner and OCR client interface can be used for creative formatting of the output PDF documents and for zone-based scanning; otherwise SPS is an automated scan-and-conversion system that eliminates the need for time-wasting document manipulation.
OCR is performed by a Microsoft Windows application, while collation, mapping and sorting occur on UNIX systems. The following performance specifications are normal:
- 5-20 documents per second conversion performance per server
- 70-99.5% accuracy, depending on print and voice quality
- error correction eliminating 80-100% of errors during search
- collation and sorting using the PS vector mapping content analyzer
- 1-bit depth TIFF conversion, with 24-bit depth for images and maps
Automated OCR conversion (post scan)
SPS monitors a multiple-user folder system for incoming scanned documents and triggers processes that transform new documents into searchable assets. PatternScape pattern recognition technology collates and sorts documents based on content, thereby providing an automatic folder subsystem. This is done transparently. The system can operate across WAN/LAN and Internet networks for remote sources such as home office scanners, remote location scanners and fax feeds.
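The folder-watching behaviour described above might look like the following polling sketch in Python. It is an illustration only: the function name find_new_documents, the per-user folder layout and the file name scan0001.tif are assumptions, and how SPS actually detects new files is not stated here.

```python
import os
import tempfile

def find_new_documents(root, seen):
    """Return paths under root that have not been seen before.

    seen is a set of already-processed paths; it is updated in place so
    that each incoming scan is reported exactly once.
    """
    new_paths = []
    for folder, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            if path not in seen:
                seen.add(path)
                new_paths.append(path)
    return new_paths

# Demonstration with a temporary multi-user folder tree (hypothetical layout).
with tempfile.TemporaryDirectory() as root:
    user_dir = os.path.join(root, "user_a")
    os.makedirs(user_dir)
    seen = set()
    first = find_new_documents(root, seen)    # no scans have arrived yet
    open(os.path.join(user_dir, "scan0001.tif"), "wb").close()
    second = find_new_documents(root, seen)   # the new scan is reported
    third = find_new_documents(root, seen)    # already reported, so skipped
```

In a long-running service the call would sit in a loop with a short sleep, and each returned path would be handed to the conversion step.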
Error correction
Error correction is performed in a synonym mapping process in which error terms (misrecognized words) are mapped to correctly spelled words. Consequently, a document is found during search even when a search term is stored incorrectly, because the stored term is "synonym-mapped" to its correct equivalent.
File types
SPS generates multiple segmented TIFFs (1-bit depth, with 24-bit depth optional), multiple segmented PDFs, and text documents, as well as an XML tuplet file that associates them during search. When a user searches for a document, all formats are presented for use and review, eliminating the need to track file locations and relationships. Additionally, the original scanned TIFF is available as a persistent original record.
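As a rough illustration of an XML tuplet file that ties the generated formats together, the Python sketch below writes one manifest per scanned document. The element and attribute names (tuplet, file, role) and the file names are hypothetical; the actual SPS schema is not described here.

```python
import xml.etree.ElementTree as ET

def build_tuplet(document_id, files):
    """Return an XML manifest that associates every format of one document.

    files maps a role such as "original" or "pdf" to a file path.
    """
    root = ET.Element("tuplet", {"id": document_id})
    for role, path in files.items():
        ET.SubElement(root, "file", {"role": role}).text = path
    return ET.tostring(root, encoding="unicode")

manifest = build_tuplet(
    "scan0001",
    {
        "original": "scan0001.tif",   # persistent original record
        "pdf": "scan0001.pdf",        # converted PDF
        "text": "scan0001.txt",       # OCR text used for indexing
    },
)
```

A search front end could read such a manifest to present every format of a hit without tracking file locations separately.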
Index process control
SPS passes the results of each document conversion to collation processing, which produces a folder storage structure. The user can locate scanned documents either through folder navigation or through a pattern-based search. All files converted in SPS are renamed from the meaningless scanner naming convention to content-based vector naming. For example:
Dr. Chan folder
Document vector name: Sore-throat-Acute-Pharyngitis.pdf
Folder location: ../DrChan/Assessment/Patients/B Smith/Physical Exams/
Automated Speak-to-Text conversion
SPKS operates in the same way as SPS, except that it processes wav files, either in real time or in batch (uploaded). The voice conversion technology uses a Microsoft API to convert speech to text. Error correction is provided through a synonym mapping search table, as noted above.
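The synonym mapping search table used for error correction in both SPS and SPKS can be pictured with a small Python sketch. The table contents, the function names and the choice to apply the mapping both when indexing and when searching are assumptions made for illustration; the source does not say at which stage the mapping is applied.

```python
# Hypothetical table mapping misrecognized terms to correctly spelled words.
SYNONYM_MAP = {
    "pharyngltis": "pharyngitis",
    "thr0at": "throat",
}

def normalize(term):
    """Map an error term to its correct equivalent; leave other terms as-is."""
    term = term.lower()
    return SYNONYM_MAP.get(term, term)

def build_index(documents):
    """Build a word -> document-id index using normalized terms."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.split():
            index.setdefault(normalize(word), set()).add(doc_id)
    return index

def search(index, term):
    """Return the sorted ids of documents containing the (normalized) term."""
    return sorted(index.get(normalize(term), set()))

docs = {
    "scan1": "acute pharyngltis sore thr0at",  # converted text with errors
    "scan2": "billing notes",
}
index = build_index(docs)
hits = search(index, "pharyngitis")
```

Here the document whose text was stored with the misrecognized word is still returned for the correctly spelled query.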
Large wav file splitter
Wav files are split into snippets (forming a polytuplet in search) to speed up searching and listening. The polytuplet presentation includes the original wav file, its snippet, and a txt file. This way a user does not need to listen to long recordings to find results; one may instead listen to snippets for confirmation before using the large converted recordings. A transcriptionist also saves time by starting from the converted document in cases where reports must be formatted.
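Splitting a recording into fixed-length snippets can be sketched with Python's standard wave module. This is only an illustration: the function names and the fixed snippet length are assumptions, and the splitter in SPKS may choose snippet boundaries differently.

```python
import io
import wave

def split_wav(source, snippet_seconds):
    """Split a WAV file (path or file object) into snippets.

    Returns a list of complete WAV files held in memory as bytes.
    """
    snippets = []
    with wave.open(source, "rb") as reader:
        params = reader.getparams()
        frames_per_snippet = int(params.framerate * snippet_seconds)
        while True:
            frames = reader.readframes(frames_per_snippet)
            if not frames:
                break  # end of the recording
            buffer = io.BytesIO()
            with wave.open(buffer, "wb") as writer:
                writer.setnchannels(params.nchannels)
                writer.setsampwidth(params.sampwidth)
                writer.setframerate(params.framerate)
                writer.writeframes(frames)
            snippets.append(buffer.getvalue())
    return snippets

def make_silence(seconds, rate=8000):
    """Create a silent mono 16-bit WAV in memory, used only for the demo."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as writer:
        writer.setnchannels(1)
        writer.setsampwidth(2)
        writer.setframerate(rate)
        writer.writeframes(b"\x00\x00" * int(rate * seconds))
    return buffer.getvalue()

recording = make_silence(2.5)
snippets = split_wav(io.BytesIO(recording), 1.0)
```

Each snippet is itself a valid wav file, so it can be stored alongside the original recording and its txt file and played back on its own.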
Automated sort and collating folders
All wav documents are saved in folders based on the content of the document. This is done using the PS DocMap process, which maps phrases and creates a folder system based on the mapped array. The administrator only needs to create a reference table for the system to sort documents correctly. For example, in a clinic application, doctor names would be root folders and patient folders would be sub-branches; assessment, billing, treatment, Rx, notes, and instruction folders would propagate below the patient folder. This is all done automatically by comparing document content to the mapping reference table. This provides automated scan-and-convert archiving with collation and folder structuring, all wrapped in an extensive search infrastructure. Regardless of the size of the archival project, the system can convert and process large repositories and search vast numbers of documents, thereby providing an alternative to scan-and-save-to-DVD (CD) solutions.
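The clinic example can be sketched in Python as a lookup against a reference table. Everything here is hypothetical, including the table layout, the phrases and the function folder_for; simple substring matching stands in for the PS DocMap phrase mapping, whose internals are not described.

```python
import posixpath

# Hypothetical reference table created by the administrator. Each level maps
# phrases that may occur in a document to the folder used at that level.
REFERENCE_TABLE = [
    ("doctor", {"dr. chan": "DrChan"}),
    ("patient", {"b smith": "B Smith"}),
    ("category", {"physical exam": "Physical Exams", "invoice": "Billings"}),
]

def folder_for(text, table=REFERENCE_TABLE, unsorted="Unsorted"):
    """Derive a folder path from document text, one table level at a time.

    Descent stops at the first level with no matching phrase, so a document
    is filed as deep as its content allows.
    """
    lowered = text.lower()
    parts = []
    for _level, phrases in table:
        match = next(
            (folder for phrase, folder in phrases.items() if phrase in lowered),
            None,
        )
        if match is None:
            break
        parts.append(match)
    return posixpath.join(*parts) if parts else unsorted

transcript = "Dr. Chan examined B Smith. Physical exam: sore throat."
location = folder_for(transcript)
```

With this table the transcript above is filed under DrChan/B Smith/Physical Exams, while text that matches no doctor name falls back to the Unsorted folder.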