Crawling:

CrawlScape
CrawlScape (CS) is a graphical user interface with a comprehensive feature set, built on Java servlets, for crawling, indexing, and pattern clustering search. The framework includes options for the following advanced features:
- Server Administration
- User Administration
- Crawl Engine Configuration
- URL Crawl Target Management
- Crawl and Index Session Control and Configuration
- Search Deployment (copying and deploying multiple instances)
- Information Resource Management (merging, moving, and managing information assets)
- Scheduling of crawling, indexing, and search deployment sessions
- Views and Reports
CrawlScape features are useful for managing intricate scheduled network crawls, indexing, and search deployments. The system can be used to crawl and index documents, databases, and email systems. The crawler is controlled by an application that uses MySQL to manage data for server settings, users, crawl control, and scheduling. When multiple records are related in search, such as multiple tables in a database or email system, CS combines the related records into what are referred to as polytuplet groupings.
CrawlScape simplifies the deployment and management of search technologies in the enterprise. It is especially useful in complex, hard-to-manage search environments with demanding crawl and index requirements. Screen shots are given in the products section of the site, while the following table summarizes the options and configurations for the framework.
| CrawlScape    | Server Install | Multiple Sessions Manager | Distributed Host Servers | Multi-Search Deployment | Scheduler | User Mgmt | URL Crawl | Database Crawl | Email Crawl | Index Merge | Tuplets |
|---------------|----------------|---------------------------|--------------------------|-------------------------|-----------|-----------|-----------|----------------|-------------|-------------|---------|
| CS-1          | Single         |                           |                          |                         |           |           | x         |                |             |             |         |
| CS-2          | Multiple       |                           |                          |                         |           | x         | x         |                |             |             |         |
| CS-3          | Multiple       |                           |                          |                         | x         | x         | x         |                |             | x           |         |
| CS-4          | Multiple       | x                         | x                        | x                       | x         | x         | x         |                |             | x           |         |
| CS-Enterprise | Multiple       | x                         | x                        | x                       | x         | x         | x         | x              | x           | x           | x       |
Crawl technology
The crawl and index portion of CrawlScape is based on a Java API from an excellent open source project. CrawlScape transforms crawling and indexing management by combining this API with PS Java code to provide a commercial-grade crawling, indexing, and search deployment framework. With CrawlScape, the API and search deployment are controlled by a graphical user interface rather than command-line controls, which eliminates the need for Java programming or Unix expertise. The framework is excellent for managing search infrastructures and ideal for day-to-day administration.
Moving and combining indexes are additional functions that enhance search deployment flexibility. Distributed crawling, indexing, and distributed search are further powerful features of the system for very large-scale search projects. The application is distributed in a load-sharing scheme, with server clusters configured and managed by CrawlScape technology.
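The load-sharing scheme itself is not detailed here, but one common way to partition a distributed crawl is to hash each target's host name so that all pages from one site go to the same crawl server. The Java sketch below illustrates that idea; the class and server names are hypothetical, not CrawlScape's own.

    import java.net.URI;
    import java.util.List;

    // Hypothetical sketch: distributes crawl targets across a cluster of
    // crawl servers by hashing the target's host name, so that all URLs
    // from one site land on the same server. An illustration of a
    // load-sharing scheme, not CrawlScape's actual implementation.
    public class CrawlPartitioner {
        private final List<String> crawlServers;

        public CrawlPartitioner(List<String> crawlServers) {
            this.crawlServers = crawlServers;
        }

        /** Returns the crawl server responsible for the given URL. */
        public String assign(String url) {
            String host = URI.create(url).getHost();
            int bucket = Math.floorMod(host.hashCode(), crawlServers.size());
            return crawlServers.get(bucket);
        }

        public static void main(String[] args) {
            CrawlPartitioner p = new CrawlPartitioner(
                List.of("crawl-host-1", "crawl-host-2", "crawl-host-3"));
            System.out.println(p.assign("http://intranet.example.com/docs/index.html"));
            System.out.println(p.assign("ftp://files.example.com/reports/q1.pdf"));
        }
    }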
For data storage, CrawlScape employs MySQL to manage information about users, servers, configurations, crawls, schedules, and search deployments.
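As a minimal sketch of how an application layer can pull scheduling data from MySQL, the JDBC snippet below reads crawl sessions that are due to run. The crawl_schedule table, its columns, and the credentials are assumptions for illustration; CrawlScape's actual schema is not documented here.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Minimal sketch: reads crawl sessions that are due to run from MySQL.
    // The table, columns, and credentials are hypothetical.
    public class ScheduleReader {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/crawlscape";
            try (Connection con = DriverManager.getConnection(url, "cs_user", "secret");
                 PreparedStatement ps = con.prepareStatement(
                     "SELECT session_id, crawl_name, next_run " +
                     "FROM crawl_schedule WHERE next_run <= NOW() AND enabled = 1");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("Session %d (%s) is due at %s%n",
                        rs.getLong("session_id"),
                        rs.getString("crawl_name"),
                        rs.getTimestamp("next_run"));
                }
            }
        }
    }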
Network crawling (Intranet, Internet, FTP, HTTP, Shares)
CrawlScape can be configured to crawl any type of network. The parameters that control the system are set through clickable (UI) setup options for each type of designated crawl.
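As a rough sketch of the kind of per-crawl setup parameters such a UI typically exposes, a crawl definition might carry the fields below. The field names are illustrative assumptions, not CrawlScape's actual parameter set.

    import java.util.List;

    // Illustrative sketch of per-crawl setup parameters of the kind a UI
    // would expose; field names are hypothetical, not CrawlScape's own.
    public record CrawlConfig(
            String name,            // label for the crawl session
            List<String> seedUrls,  // starting URL targets
            List<String> protocols, // e.g. "http", "ftp", "smb" (shares)
            int maxDepth,           // how many links deep to follow
            boolean stayOnHost,     // restrict the crawl to the seed hosts
            List<String> excludePatterns) { // URL patterns to skip

        public static CrawlConfig intranetDefault() {
            return new CrawlConfig(
                "intranet-nightly",
                List.of("http://intranet.example.com/"),
                List.of("http", "ftp"),
                5,
                true,
                List.of(".*\\.tmp$", ".*/archive/.*"));
        }
    }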
File types (Documents, Text, PDF, HTML, XML, JPEG, image files, databases, email)
CrawlScape can crawl any readable document or file type for which a plug-in exists. The default configuration contains plug-ins for most document and file types. Users who need a plug-in altered for a particular file type may commission a KS consultant for development assistance. KS consultants will develop the plug-in and test it across an intranet for reliability and performance. A plug-in normally takes 24 to 72 hours to build, depending on complexity.
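The plug-in API is not published here, but as a hedged sketch, a file-type plug-in contract of this kind usually reduces each file to indexable text plus metadata. The interface and method names below are hypothetical.

    import java.io.InputStream;
    import java.util.Map;

    // Hypothetical sketch of a file-type plug-in contract: each plug-in
    // declares the types it handles and converts a raw stream into
    // indexable text plus metadata. Names are illustrative only.
    public interface FilePlugin {
        /** MIME types or extensions this plug-in can parse, e.g. "application/pdf". */
        boolean supports(String contentType);

        /** Extracts the indexable text and metadata from the raw file stream. */
        ParsedDocument parse(InputStream in) throws Exception;

        record ParsedDocument(String text, Map<String, String> metadata) {}
    }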
There are special case plug-ins, such as the KS JPEG plug-in. It indexes the text content in the upper 1,000 bytes of the file, where the JPEG format allows embedded definitions. In this way, JPEG files can be indexed and made searchable. TIFF, PSD, and camera EXIF file types can also be indexed. The use is obvious in the case of pairing a TIFF image of a crime scene gun exhibit with a text description explaining the evidence. CS can also pair an evidence document with its same-named image in what we refer to as a polytuplet association. Joining a TIFF image (gun exhibit) with its case record is another example of presenting associations, which are ultimately used in pattern profiling search and co-existence. Triplets, quadruplets, and so on form tuplets in CS crawling and indexing processes.
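As a simplified illustration of the byte-scanning technique described for the KS JPEG plug-in, the sketch below reads the first 1,000 bytes of an image file and keeps runs of printable ASCII, where embedded descriptions tend to appear. It is an approximation of the idea, not the plug-in's actual code.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Simplified sketch of the KS JPEG plug-in idea: scan the first
    // 1,000 bytes of an image file and keep runs of printable ASCII,
    // which is where embedded descriptions and comments tend to sit.
    public class JpegTextSniffer {
        public static String extractLeadingText(Path file) throws IOException {
            byte[] head = new byte[1000];
            int n;
            try (InputStream in = Files.newInputStream(file)) {
                n = in.readNBytes(head, 0, head.length);
            }
            StringBuilder text = new StringBuilder();
            StringBuilder run = new StringBuilder();
            for (int i = 0; i <= n; i++) {
                char c = i < n ? (char) (head[i] & 0xFF) : 0; // 0 flushes the last run
                if (c >= 0x20 && c < 0x7F) {
                    run.append(c);
                } else {
                    if (run.length() >= 4) { // keep runs long enough to be words
                        if (text.length() > 0) text.append(' ');
                        text.append(run);
                    }
                    run.setLength(0);
                }
            }
            return text.toString();
        }

        public static void main(String[] args) throws IOException {
            System.out.println(extractLeadingText(Path.of(args[0])));
        }
    }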
Polytuplets explained further
Polytuplets are also used to combine records from different tables based on the table associations described in the schema. These relationships allow databases to be properly searched by presenting compound records constructed from distributed tables. For example, a person's record might be constructed from three different demographic tables. CS reconstructs the record, thereby providing advanced PS functionality over a complex database repository.
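A minimal sketch of this record reconstruction, assuming three hypothetical demographic tables keyed by person_id (the table and column names are illustrative, not CrawlScape's schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch of polytuplet construction: join related rows from three
    // hypothetical demographic tables into one compound, indexable record.
    // Table and column names are illustrative only.
    public class PolytupletBuilder {
        public static String buildPersonRecord(Connection con, long personId) throws Exception {
            String sql =
                "SELECT p.full_name, a.city, a.country, e.employer " +
                "FROM person p " +
                "JOIN address a ON a.person_id = p.person_id " +
                "JOIN employment e ON e.person_id = p.person_id " +
                "WHERE p.person_id = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, personId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;
                    // The compound record is what gets indexed, so one
                    // search hit reflects all three source tables at once.
                    return String.join(" | ",
                        rs.getString("full_name"),
                        rs.getString("city") + ", " + rs.getString("country"),
                        rs.getString("employer"));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/demo", "user", "pass")) {
                System.out.println(buildPersonRecord(con, 42L));
            }
        }
    }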
URL List management (target crawl list)
CrawlScape can be used to manage many crawl and indexing processes across multiple networks and independent servers. It provides a facility for URL list management, where the user can upload or copy and paste URL targets, which are ultimately assigned to crawl sessions. The ability to control URL definitions and assignments makes CrawlScape an enterprise- and service bureau-class (ASP, ISP) tool. Multiple URL lists for users, departments, companies, and groups can be managed independently within the framework.
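As a sketch of how per-owner URL lists can be kept independent and assigned to crawl sessions, the snippet below models the facility in plain Java; the class and method names are assumptions for illustration.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of per-owner URL list management: each user, department, or
    // company keeps independent named lists of crawl targets, which are
    // then assigned to crawl sessions. Names are illustrative only.
    public class UrlListManager {
        // owner -> (list name -> URL targets)
        private final Map<String, Map<String, List<String>>> lists = new HashMap<>();

        public void addTargets(String owner, String listName, List<String> urls) {
            lists.computeIfAbsent(owner, o -> new HashMap<>())
                 .computeIfAbsent(listName, n -> new ArrayList<>())
                 .addAll(urls);
        }

        /** Hands a named list to a crawl session as its target set. */
        public List<String> assignToSession(String owner, String listName, String sessionId) {
            List<String> targets = lists.getOrDefault(owner, Map.of())
                                        .getOrDefault(listName, List.of());
            System.out.printf("Session %s assigned %d targets from %s/%s%n",
                sessionId, targets.size(), owner, listName);
            return targets;
        }

        public static void main(String[] args) {
            UrlListManager mgr = new UrlListManager();
            mgr.addTargets("legal-dept", "case-files",
                List.of("http://intranet.example.com/cases/",
                        "ftp://files.example.com/exhibits/"));
            mgr.assignToSession("legal-dept", "case-files", "crawl-2024-01");
        }
    }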