Crawling:

CrawlScape
CrawlScape (CS) is a graphical user interface with a comprehensive feature set, built on Java servlets, for crawling, indexing, and pattern clustering search. The framework includes options for the following advanced features:
- Server Administration
- User Administration
- Crawl Engine Configuration
- URL Crawl Target Management
- Crawl and Index Session Control and Configuration
- Search Deployment (copying and deploying multiple instances)
- Information Resource Management (merging, moving, and managing information assets)
- Scheduling of crawling, indexing, and search deployment sessions
- Views and Reports
CrawlScape features are useful for managing intricate scheduled network crawls, indexing, and search deployments. The system can be used to crawl and index documents, databases, and email systems. The crawler is controlled by an application that uses MySQL to manage data for server settings, users, crawl control, and scheduling. When multiple records are related in search, such as multiple tables in a database or email system, CS combines the related records into what are referred to as polytuplet groupings.
CrawlScape simplifies the deployment and management of search technologies in the enterprise. It is especially useful in complex, hard-to-manage search environments with demanding crawl and index requirements. Screen shots are given in the products section of the site, while the following table summarizes the options and configurations for the framework.
| CrawlScape    | Server Install | Multiple Sessions Manager | Distributed Host Servers | Multi-Search Deployment | Scheduler | User Mgmt | URL Crawl | Database Crawl | Email Crawl | Index Merge | Tuplets |
|---------------|----------------|---------------------------|--------------------------|-------------------------|-----------|-----------|-----------|----------------|-------------|-------------|---------|
| CS-1          | Single         |                           |                          |                         |           |           | x         |                |             |             |         |
| CS-2          | Multiple       |                           |                          |                         |           | x         | x         |                |             |             |         |
| CS-3          | Multiple       |                           |                          |                         | x         | x         | x         |                |             | x           |         |
| CS-4          | Multiple       | x                         | x                        | x                       | x         | x         | x         |                |             | x           |         |
| CS-Enterprise | Multiple       | x                         | x                        | x                       | x         | x         | x         | x              | x           | x           | x       |
Crawl technology
The crawl and index portion of CrawlScape is based on a Java API from an excellent open source project. CrawlScape transforms crawling and indexing management by combining this API with PS Java code to provide a commercial-grade crawling, indexing, and search deployment framework. With CrawlScape, the API and search deployment are controlled by a graphical user interface rather than command-line controls, which eliminates the need for Java programming or Unix expertise. The framework is excellent for managing search infrastructures and ideal for day-to-day administration.
Moving and combining indexes are additional functions that enhance search deployment flexibility. Distributed crawling, indexing, and distributed search are further powerful features of the system for very large-scale search projects. The application is distributed in a load-sharing scheme, with server clusters configured and managed by CrawlScape technology.
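The load-sharing scheme itself is not detailed here, but one common way to partition a distributed crawl is to hash each target's host name so that all pages from one site go to the same crawl server. The Java sketch below illustrates that idea; the class and server names are hypothetical, not CrawlScape's own.

    import java.net.URI;
    import java.util.List;

    // Hypothetical sketch: distributes crawl targets across a cluster of
    // crawl servers by hashing the target's host name, so that all URLs
    // from one site land on the same server. An illustration of a
    // load-sharing scheme, not CrawlScape's actual implementation.
    public class CrawlPartitioner {
        private final List<String> crawlServers;

        public CrawlPartitioner(List<String> crawlServers) {
            this.crawlServers = crawlServers;
        }

        /** Returns the crawl server responsible for the given URL. */
        public String assign(String url) {
            String host = URI.create(url).getHost();
            int bucket = Math.floorMod(host.hashCode(), crawlServers.size());
            return crawlServers.get(bucket);
        }

        public static void main(String[] args) {
            CrawlPartitioner p = new CrawlPartitioner(
                List.of("crawl-host-1", "crawl-host-2", "crawl-host-3"));
            System.out.println(p.assign("http://intranet.example.com/docs/index.html"));
            System.out.println(p.assign("ftp://files.example.com/reports/q1.pdf"));
        }
    }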
For data storage, CrawlScape employs MySQL to manage information about users, servers, configurations, crawls, schedules, and search deployments.
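As a minimal sketch of how an application layer can pull scheduling data from MySQL, the JDBC snippet below reads crawl sessions that are due to run. The crawl_schedule table, its columns, and the credentials are assumptions for illustration; CrawlScape's actual schema is not documented here.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Minimal sketch: reads crawl sessions that are due to run from MySQL.
    // The table, columns, and credentials are hypothetical.
    public class ScheduleReader {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:mysql://localhost:3306/crawlscape";
            try (Connection con = DriverManager.getConnection(url, "cs_user", "secret");
                 PreparedStatement ps = con.prepareStatement(
                     "SELECT session_id, crawl_name, next_run " +
                     "FROM crawl_schedule WHERE next_run <= NOW() AND enabled = 1");
                 ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("Session %d (%s) is due at %s%n",
                        rs.getLong("session_id"),
                        rs.getString("crawl_name"),
                        rs.getTimestamp("next_run"));
                }
            }
        }
    }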
Network crawling (Intranet, Internet, FTP, HTTP, Shares)
CrawlScape can be configured to crawl any type of network. The parameters that control the system are set through clickable (UI) setup options for each type of designated crawl.
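As a rough sketch of the kind of per-crawl setup parameters such a UI typically exposes, a crawl definition might carry the fields below. The field names are illustrative assumptions, not CrawlScape's actual parameter set.

    import java.util.List;

    // Illustrative sketch of per-crawl setup parameters of the kind a UI
    // would expose; field names are hypothetical, not CrawlScape's own.
    public record CrawlConfig(
            String name,            // label for the crawl session
            List<String> seedUrls,  // starting URL targets
            List<String> protocols, // e.g. "http", "ftp", "smb" (shares)
            int maxDepth,           // how many links deep to follow
            boolean stayOnHost,     // restrict the crawl to the seed hosts
            List<String> excludePatterns) { // URL patterns to skip

        public static CrawlConfig intranetDefault() {
            return new CrawlConfig(
                "intranet-nightly",
                List.of("http://intranet.example.com/"),
                List.of("http", "ftp"),
                5,
                true,
                List.of(".*\\.tmp$", ".*/archive/.*"));
        }
    }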
File types (Documents, Text, PDF, HTML, XML, JPEG, image files, databases, email)
CrawlScape can crawl any readable document or file type for which a plug-in exists. The default configuration contains plug-ins for most document and file types. Users who need a plug-in altered for a particular file type may commission a KS consultant for development assistance. KS consultants will develop the plug-in and test it across an intranet for reliability and performance. A plug-in normally takes 24 to 72 hours to build, depending on complexity.
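The plug-in API is not published here, but as a hedged sketch, a file-type plug-in contract of this kind usually reduces each file to indexable text plus metadata. The interface and method names below are hypothetical.

    import java.io.InputStream;
    import java.util.Map;

    // Hypothetical sketch of a file-type plug-in contract: each plug-in
    // declares the types it handles and converts a raw stream into
    // indexable text plus metadata. Names are illustrative only.
    public interface FilePlugin {
        /** MIME types or extensions this plug-in can parse, e.g. "application/pdf". */
        boolean supports(String contentType);

        /** Extracts the indexable text and metadata from the raw file stream. */
        ParsedDocument parse(InputStream in) throws Exception;

        record ParsedDocument(String text, Map<String, String> metadata) {}
    }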
There are special case plug-ins, such as the KS JPEG plug-in. It indexes the text content in the upper 1,000 bytes of the file, where the JPEG format allows embedded definitions. In this way, JPEG files can be indexed and made searchable. TIFF, PSD, and camera EXIF file types can also be indexed. The use is obvious in the case of pairing a TIFF image of a crime scene gun exhibit with a text description explaining the evidence. CS can also pair an evidence document with its same-named image in what we refer to as a polytuplet association. Joining a TIFF image (gun exhibit) with its case record is another example of presenting associations, which are ultimately used in pattern profiling search and co-existence. Triplets, quadruplets, and so on form tuplets in CS crawling and indexing processes.
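As a simplified illustration of the byte-scanning technique described for the KS JPEG plug-in, the sketch below reads the first 1,000 bytes of an image file and keeps runs of printable ASCII, where embedded descriptions tend to appear. It is an approximation of the idea, not the plug-in's actual code.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Simplified sketch of the KS JPEG plug-in idea: scan the first
    // 1,000 bytes of an image file and keep runs of printable ASCII,
    // which is where embedded descriptions and comments tend to sit.
    public class JpegTextSniffer {
        public static String extractLeadingText(Path file) throws IOException {
            byte[] head = new byte[1000];
            int n;
            try (InputStream in = Files.newInputStream(file)) {
                n = in.readNBytes(head, 0, head.length);
            }
            StringBuilder text = new StringBuilder();
            StringBuilder run = new StringBuilder();
            for (int i = 0; i <= n; i++) {
                char c = i < n ? (char) (head[i] & 0xFF) : 0; // 0 flushes the last run
                if (c >= 0x20 && c < 0x7F) {
                    run.append(c);
                } else {
                    if (run.length() >= 4) { // keep runs long enough to be words
                        if (text.length() > 0) text.append(' ');
                        text.append(run);
                    }
                    run.setLength(0);
                }
            }
            return text.toString();
        }

        public static void main(String[] args) throws IOException {
            System.out.println(extractLeadingText(Path.of(args[0])));
        }
    }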
Polytuplets explained further
Polytuplets are also used to combine records from different tables based on the table associations described in the schema. These relationships allow databases to be properly searched by presenting compound records constructed from distributed tables. For example, a person's record might be constructed from three different demographic tables. CS reconstructs the record, thereby providing advanced PS functionality over a complex database repository.
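A minimal sketch of this record reconstruction, assuming three hypothetical demographic tables keyed by person_id (the table and column names are illustrative, not CrawlScape's schema):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch of polytuplet construction: join related rows from three
    // hypothetical demographic tables into one compound, indexable record.
    // Table and column names are illustrative only.
    public class PolytupletBuilder {
        public static String buildPersonRecord(Connection con, long personId) throws Exception {
            String sql =
                "SELECT p.full_name, a.city, a.country, e.employer " +
                "FROM person p " +
                "JOIN address a ON a.person_id = p.person_id " +
                "JOIN employment e ON e.person_id = p.person_id " +
                "WHERE p.person_id = ?";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                ps.setLong(1, personId);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return null;
                    // The compound record is what gets indexed, so one
                    // search hit reflects all three source tables at once.
                    return String.join(" | ",
                        rs.getString("full_name"),
                        rs.getString("city") + ", " + rs.getString("country"),
                        rs.getString("employer"));
                }
            }
        }

        public static void main(String[] args) throws Exception {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:mysql://localhost:3306/demo", "user", "pass")) {
                System.out.println(buildPersonRecord(con, 42L));
            }
        }
    }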
URL List management (target crawl list)
CrawlScape can be used to manage many crawl and indexing processes across multiple networks and independent servers. It provides a facility for URL list management, where the user can upload or copy and paste URL targets, which are ultimately assigned to crawl sessions. The ability to control URL definitions and assignments makes CrawlScape an enterprise- and service bureau-class (ASP, ISP) tool. Multiple URL lists for users, departments, companies, and groups can be managed independently within the framework.
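As a sketch of how per-owner URL lists can be kept independent and assigned to crawl sessions, the snippet below models the facility in plain Java; the class and method names are assumptions for illustration.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch of per-owner URL list management: each user, department, or
    // company keeps independent named lists of crawl targets, which are
    // then assigned to crawl sessions. Names are illustrative only.
    public class UrlListManager {
        // owner -> (list name -> URL targets)
        private final Map<String, Map<String, List<String>>> lists = new HashMap<>();

        public void addTargets(String owner, String listName, List<String> urls) {
            lists.computeIfAbsent(owner, o -> new HashMap<>())
                 .computeIfAbsent(listName, n -> new ArrayList<>())
                 .addAll(urls);
        }

        /** Hands a named list to a crawl session as its target set. */
        public List<String> assignToSession(String owner, String listName, String sessionId) {
            List<String> targets = lists.getOrDefault(owner, Map.of())
                                        .getOrDefault(listName, List.of());
            System.out.printf("Session %s assigned %d targets from %s/%s%n",
                sessionId, targets.size(), owner, listName);
            return targets;
        }

        public static void main(String[] args) {
            UrlListManager mgr = new UrlListManager();
            mgr.addTargets("legal-dept", "case-files",
                List.of("http://intranet.example.com/cases/",
                        "ftp://files.example.com/exhibits/"));
            mgr.assignToSession("legal-dept", "case-files", "crawl-2024-01");
        }
    }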