Search-related scripts & software
IT Search is a powerful, customizable site indexing and search engine, designed for both typical and large web sites, from 3-15 files up to more than 100,000 files and more than 1 GB of content. No database is needed to store the index or the documents. IT Search indexes and searches entire files, not just the first NN KB of each file. The index files are very small, so searching takes less disk time and space, which leaves more room for pages and reduces hosting cost. The indexes are organized so that every search involves only 2 files, each less than 1 MB.
Blackwidow spider engine will index your entire web server and create an ASCII output file of URLs. It explores subdirectory structures. The URL database is 100% compatible with exploit submission wizard and WAHOO! The script runs and creates a flat-file database of every URL on your server.
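The idea of walking a server's directory tree and writing every URL to a flat ASCII file can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Blackwidow's own code: the function names list_urls and write_flatfile, the document root, and the base URL are all hypothetical, and the sketch assumes each file under the document root maps directly to one URL.

```python
import os
from urllib.parse import quote


def list_urls(doc_root, base_url):
    """Return a sorted list of URLs, one per file found under doc_root."""
    urls = []
    # os.walk descends into every subdirectory of doc_root.
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            # Convert the OS path to a URL path and percent-encode it,
            # so the result is always plain ASCII.
            url_path = quote(rel.replace(os.sep, "/"))
            urls.append(base_url.rstrip("/") + "/" + url_path)
    return sorted(urls)


def write_flatfile(urls, out_path):
    """Write one URL per line as plain ASCII text."""
    with open(out_path, "w", encoding="ascii") as fh:
        for url in urls:
            fh.write(url + "\n")
```

A call such as write_flatfile(list_urls("/var/www/html", "http://example.com"), "urls.txt") would produce the flat file; both paths here are placeholders.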
Douglas Thrift's Search Engine is an indexing search engine for use on small websites such as personal or small business sites. It is designed to be very similar to Google for end users and its output is customizable. For indexing, it supports both the Robots Exclusion Protocol and the Robots META Tag as specified at http://www.robotstxt.org/wc/exclusion.html.
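To show what honoring the Robots Exclusion Protocol looks like for an indexer, here is a minimal Python sketch using the standard library's urllib.robotparser. It is not taken from Douglas Thrift's Search Engine; the helper name allowed_urls, the user agent string "ExampleIndexer", and the sample robots.txt are assumptions made for illustration, and the Robots META Tag is not covered.

```python
from urllib.robotparser import RobotFileParser


def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt text permits for user_agent."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical robots.txt that hides one directory from all crawlers.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

candidates = [
    "http://example.com/index.html",
    "http://example.com/private/notes.html",
]
print(allowed_urls(SAMPLE_ROBOTS, "ExampleIndexer", candidates))
# prints ['http://example.com/index.html']
```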
Fastget is a system for organizing a large set of information with care and accuracy without expending enormous time and resources. It lets you browse a large amount of information in a systematic and direct manner.
Wordindex is a full-text indexing suite that uses Perl for the indexing backend and PHP for the web-based search utility. Any language can be used to search, as long as it has access to MySQL databases. Wordindex is capable of indexing huge amounts of data: one production system has indexed over 14 GB of textual, PDF, and compressed text files, and searches on that system still take less than a second on a modest server. Wordindex is also clusterable: the indexing process, which can take a very long time on a huge dataset (roughly 10 GB or more), can be run across a couple of nodes to spread the load.
Blackwidow spider engine is an internal web spidering utility that will index your entire web server and create an ASCII output file of URLs. It explores subdirectory structures. The URL database is 100% compatible with exploit submission wizard and WAHOO.
Web Secretary is web page monitoring software that goes beyond the usual functionality of such tools. Not only does it detect changes based on content analysis (instead of a date/time stamp or a simple textual comparison), it will also email the changed page to you with the new contents highlighted. Web Secretary is written in Perl and should run on any Unix system with the Perl interpreter (and the LWP module) installed.
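Highlighting what is new in a changed page can be approximated with a word-level diff. The Python sketch below is not Web Secretary's actual content analysis, which is not described here; it simply swaps in the standard library's difflib.SequenceMatcher, and the function name highlight_new_text and the <b> markers are assumptions chosen for illustration.

```python
import difflib


def highlight_new_text(old_text, new_text, start="<b>", end="</b>"):
    """Return new_text with words that changed since old_text wrapped in markers."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    out = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        segment = " ".join(new_words[j1:j2])
        if not segment:
            continue  # pure deletion: nothing to show in the new page
        if tag == "equal":
            out.append(segment)
        else:
            # "replace" or "insert": text that is new in this version
            out.append(f"{start}{segment}{end}")
    return " ".join(out)


print(highlight_new_text("Office hours are 9 to 5", "Office hours are 10 to 5"))
# prints Office hours are <b>10</b> to 5
```

Comparing word sequences rather than raw bytes means that whitespace-only differences between the two versions do not count as changes.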
Harvest-NG is a collection of Perl modules and scripts which provide a powerful web crawling and summarizing agent. The code is aimed at providing an open source, standards-compliant tool for fetching content from a wide variety of information sources, summarizing it into a set of resource descriptions, and storing these in an easily accessible database from which search services can be built and statistical information compiled.
I-Spy is a Perl script which identifies new files on various remote FTP and Web sites. It grabs and compares contents of FTP directories and web pages. It will then compile a report and either send it via e-mail or save it as a web page. You may also request both deliveries of the report.
For e-mail reports, you may request plain text or HTML. I-Spy logs its activity as it chugs along. You may specify the log directory, or I-Spy will try to find one automatically. For web page reports, I-Spy will attempt to store the log in a place where it can be referenced by the report and served by the web server.
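The core step of comparing a stored listing with a fresh one to find new files can be sketched as follows. This is an illustrative Python sketch rather than I-Spy's Perl code; the function names new_entries and format_report, the report layout, and the site name in the usage note are assumptions, and fetching the listings over FTP or HTTP is left out.

```python
def new_entries(previous, current):
    """Return entries present in current but not in previous, in listing order."""
    seen = set(previous)
    return [name for name in current if name not in seen]


def format_report(site, added):
    """Build a plain-text report section for one monitored site."""
    if not added:
        return f"{site}: no new files"
    lines = [f"{site}: {len(added)} new file(s)"]
    lines.extend(f"  {name}" for name in added)
    return "\n".join(lines)
```

For example, format_report("ftp.example.org", new_entries(old_listing, new_listing)) would yield one section of a plain-text report, where old_listing and new_listing are lists of file names saved from two successive runs.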
This is a proof-of-concept of a tool to automate web browsing and data collection. It works like AWK, except that instead of operating on files and lines it operates on HTML pages and hyperlinks. It is meant to be run as a command-line script and includes base_url (the URL the script was initially invoked on), base_path (the root of the saved data tree), url (the current URL being processed), linked_from (the parent of the current URL), and content (the actual data corresponding to the current URL).
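The AWK analogy can be made concrete with a small Python sketch in which a user-supplied rule is called once per page, much as an AWK action runs once per line. This is not the tool's own code: the names LinkExtractor, crawl, fetch, and rule are hypothetical, the five context keys are borrowed from the description above, and pages come from an in-memory dictionary instead of the network so the sketch stays self-contained.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(base_url, fetch, rule, base_path="."):
    """Visit pages reachable from base_url, calling rule(context) for each one."""
    queue = [(base_url, None)]
    visited = set()
    while queue:
        url, linked_from = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        content = fetch(url)
        if content is None:
            continue  # page not available; skip it
        rule({
            "base_url": base_url,
            "base_path": base_path,  # carried along only; nothing is saved here
            "url": url,
            "linked_from": linked_from,
            "content": content,
        })
        extractor = LinkExtractor()
        extractor.feed(content)
        for href in extractor.links:
            # Resolve relative links against the page they appear on.
            queue.append((urljoin(url, href), url))


# Hypothetical in-memory site standing in for real HTTP fetches.
PAGES = {
    "http://example.com/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a>',
    "http://example.com/b.html": "no links here",
}

records = []
crawl("http://example.com/", PAGES.get,
      lambda ctx: records.append((ctx["url"], ctx["linked_from"])))
```

After the call, records holds one (url, linked_from) pair per visited page, with None as the parent of the starting URL.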