File Coverage

blib/lib/Gungho.pm
Criterion Covered Total %
statement 12 12 100.0
branch n/a
condition n/a
subroutine 5 5 100.0
pod 1 1 100.0
total 18 18 100.0


line stmt bran cond sub pod time code
1             # $Id: /mirror/gungho/lib/Gungho.pm 67350 2008-07-28T10:37:01.975672Z lestrrat $
2             #
3             # Copyright (c) 2007 Daisuke Maki <daisuke@endeworks.jp>
4             # All rights reserved.
5              
6             package Gungho;
7 2     2   46595 use strict;
  2         2  
  2         43  
8 2     2   6 use warnings;
  2         1  
  2         38  
9 2     2   28 use 5.008;
  2         4  
10 2     2   6 use base qw(Class::C3::Componentised);
  2         1  
  2         451  
11             our $VERSION = '0.09008';
12              
13             __PACKAGE__->load_components('Setup');
14              
15 2     2 1 23 sub component_base_class { "Gungho::Component" }
16              
17             1;
18              
19             __END__
20              
21             =head1 NAME
22              
23             Gungho - Yet Another High Performance Web Crawler Framework
24              
25             =head1 SYNOPSIS
26              
27                 use Gungho;
28                 Gungho->run($config);
29              
30             =head1 DESCRIPTION
31              
32             Gungho provides a complete out-of-the-box web crawler framework with
33             high performance and great flexibility.
34              
35             Please note that Gungho is in beta. It has been stable for some time, but
36             its internals may still change, including the API.
37              
38             Gungho comes with many features that solve recurring problems when building
39             a spider:
40              
41             =over 4
42              
43             =item Event-Based, Asynchronous Engine
44              
45             Gungho uses event-based dispatch via POE, Danga::Socket, or IO::Async.
46             Choose the best engine that fits your needs.
47              
48             =item Asynchronous DNS lookups
49              
50             HTTP connections are handled asynchronously, why not DNS lookups?
51             Gungho doesn't block while hostnames are being resolved, so other jobs can
52             continue.
53              
54             =item Automatic robots.txt Handling
55              
56             Every crawler needs to respect robots.txt. Gungho offers automatic handling
57             of robots.txt. If you use it in conjunction with memcached, you can even
58             do this in a distributed environment, where farms of Gungho crawler hosts
59             are all fetching pages.
60              
61             =item Robots META Directives
62              
63             Robots META directives embedded in HTML text can also be parsed automatically.
64             You can then access this resulting structure to decide if you can process
65             the fetched URL.
66              
67             =item Throttling
68              
69             You don't want your crawl targets to go under just because you let loose a
70             crawler against them and did a million fetches per hour. With Gungho's
71             throttling component, you can throttle the number of requests that are sent
72             against a domain.
73              
74             =item Private IP Blocking
75              
76             Malicious sites may embed hostnames that resolve to internal IP address ranges
77             such as 192.168.11.*, which may lead to a DoS attack on your private servers.
78             Gungho has an option to automatically block such IP addresses via the
79             BlockPrivateIP component.
80              
81             =item Caching
82              
83             Whatever you want to cache, Gungho offers a generic cache interface a la
84             Catalyst via Gungho::Component::Cache.
85              
86             =item Web::Scraper Integration
87              
88             (Note: This is not quite production ready) Gungho has Web::Scraper integration
89             that allows you to easily call Web::Scraper scripts defined in your config files.
90              
91             =item Request Logging
92              
93             Requests can be automatically logged to a file, a database, or the screen via
94             Gungho::Plugin::RequestLog, which gives you the full power of Log::Dispatch
95             for your logging needs.
96              
97             =back
98              
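To make the Private IP Blocking item above concrete, here is a minimal sketch of the kind of check involved. It is not the BlockPrivateIP component's code. The helper name is_private_ipv4 is hypothetical, and the sketch covers only the RFC 1918 ranges and loopback for dotted-quad IPv4 addresses.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Gungho): returns 1 when a dotted-quad
# IPv4 address is in an RFC 1918 private range or the loopback range,
# and 0 otherwise (including for strings that are not IPv4 addresses).
sub is_private_ipv4 {
    my ($ip) = @_;
    my @octet = $ip =~ /\A(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\z/
        or return 0;
    return 0 if grep { $_ > 255 } @octet;

    return 1 if $octet[0] == 10;                                        # 10.0.0.0/8
    return 1 if $octet[0] == 127;                                       # 127.0.0.0/8
    return 1 if $octet[0] == 172 && $octet[1] >= 16 && $octet[1] <= 31; # 172.16.0.0/12
    return 1 if $octet[0] == 192 && $octet[1] == 168;                   # 192.168.0.0/16
    return 0;
}

# Usage: decide whether a resolved address should be fetched at all.
for my $ip (qw(192.168.11.5 8.8.8.8)) {
    printf "%s => %s\n", $ip, is_private_ipv4($ip) ? 'blocked' : 'allowed';
}
```

A crawler would run a check like this on the address a hostname resolves to, after DNS resolution and before connecting, since the hostname itself gives no hint of where it points.
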
99             =head1 HISTORY
100              
101             First there were a bunch of scripts that used to scrape a bunch of RSS feeds.
102             Then I got tired of writing scripts, so I decided a framework is the way to
103             go, and Xango was born.
104              
105             Xango was my first attempt at harnessing the full power of an event-based
106             framework. It was fast, but it wasn't fun to extend, and it had a
107             nightmarish way of dealing with robots.txt.
108              
109             A couple more attempts later, with more inspiration and lessons learned from
110             Catalyst, Plagger, and DBIx::Class, Gungho was born.
111              
112             Since its inception, Gungho has been used successfully in crawlers that
113             fetch anywhere from hundreds of thousands to a few million URLs per day.
114              
115             =head1 PLEASE READ BEFORE USE
116              
117             Gungho is designed so that it can handle massive amounts of traffic.
118             If you're careful enough with your Provider and Handler implementation, you
119             can in fact hit millions of URLs with this crawler.
120              
121             So PLEASE DO NOT LET IT LOOSE. DO NOT OVERLOAD your crawl targets.
122             You are STRONGLY advised to use Gungho::Component::Throttle to throttle your
123             fetches.
124              
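As an illustration of what throttling means in practice, the following is a minimal sketch of a per-host sliding-window limiter. It is not Gungho::Component::Throttle. The package Demo::Throttle, its max_items and interval parameters, and the try_push method are all hypothetical names chosen for this example.

```perl
use strict;
use warnings;

package Demo::Throttle;   # hypothetical class, for illustration only

# max_items: how many requests a host may receive per window
# interval:  window length in seconds
sub new {
    my ($class, %args) = @_;
    return bless {
        max_items => $args{max_items},
        interval  => $args{interval},
        seen      => {},              # host => [ timestamps of recent requests ]
    }, $class;
}

# Returns 1 and records the request if $host is under its limit,
# or 0 if the request should be deferred. $now defaults to time().
sub try_push {
    my ($self, $host, $now) = @_;
    $now = time unless defined $now;
    my $times = $self->{seen}{$host} ||= [];

    # Forget requests that have fallen out of the window.
    shift @$times while @$times && $times->[0] <= $now - $self->{interval};

    return 0 if @$times >= $self->{max_items};
    push @$times, $now;
    return 1;
}

package main;

my $throttle = Demo::Throttle->new(max_items => 2, interval => 60);
for my $host (qw(example.com example.com example.com example.org)) {
    print "$host: ", ($throttle->try_push($host) ? 'send' : 'defer'), "\n";
}
```

Because the counters here live in process memory, this sketch only limits a single crawler process. Sharing the limit between several crawler hosts would require keeping the counters in shared storage.
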
125             Also PLEASE CHANGE THE USER AGENT NAME OF YOUR CRAWLER. If you hit your targets
126             hard with the default name (Gungho/VERSION X.XXXX), it will look as though a
127             service called Gungho is hitting their site, which really isn't the case.
128             Whatever it is, please specify at least a simple user agent in your config.
129              
130             =head1 STRUCTURE
131              
132             Gungho comprises three parts: a Provider, which provides Gungho with
133             requests to process; a Handler, which handles the fetched page; and an
134             Engine, which controls the entire process.
135              
136             There are also "hooks". These hooks can be registered from anywhere by
137             invoking the register_hook() method. They are run at particular points,
138             which are specified when you call register_hook().
139              
140             All components (engine, provider, handler) are overridable and switchable.
141             However, if you plan on customizing them, you should be aware
142             that Gungho uses Class::C3 extensively, and hence you may see warnings about
143             the code you use.
144              
145             =head1 HOW *NOT* TO USE Gungho
146              
147             One note about Gungho: don't use it if you are planning on accessing
148             a single URL. It's usually not worth it, so you might as well use
149             LWP::UserAgent or an equivalent module.
150              
151             Gungho's event-driven engine works best when you are accessing hundreds,
152             if not thousands, of URLs. It may in fact be slower than using LWP::UserAgent
153             if you are accessing just a single URL.
154              
155             Of course, you may wish to utilize features other than speed that Gungho
156             provides, so at that point, it's simply up to you.
157              
158             =head1 RUNNING IN DISTRIBUTED ENVIRONMENT
159              
160             Gungho has experimental support for running in distributed environments.
161              
162             Strictly speaking, each crawler needs to have its own strategy to enable
163             itself to run in a distributed environment. What Gungho offers is a
164             "good enough" solution that I<may> work for you. If what Gungho offers
165             isn't enough, at least what comes with it might help to show you what
166             needs to be tweaked for your particular environment.
167              
168             Roughly speaking, there are three components you need to worry about in order
169             to make a well-behaved, distributed crawler. Check the list below
170             and the documentation for each component.
171              
172             =over 4
173              
174             =item Distributed Throttling
175              
176             As of version 0.08010, Throttle::Domain and Throttle::Simple can be configured
177             to use any Data::Throttler-based throttling object as their engine.
178              
179             Download Data::Throttler::Memcached, and specify it as the engine behind
180             your throttling for Gungho. Using Data::Throttler::Memcached will make
181             Gungho store throttling information in a shared Memcached server, which will
182             allow separate Gungho instances to share that information.
183              
184             =item Distributed robots.txt Handling
185              
186             As of version 0.08013, RobotRules can be configured to use a cache in the
187             backend. You can specify your choice of distributed cache (e.g. Memcached)
188             and use that as the storage for robots.txt data.
189              
190             Of course, this means that robots.txt data isn't persistent, but you should be
191             expiring robots.txt once in a while to reflect new data anyway.
192              
193             =item Distributed Provider
194              
195             This is actually the simplest aspect, as it's usually done by hooking the
196             provider with a database. However, if you prefer, you may use some sort of
197             Message Queue as your backend.
198              
199             =back
200              
201             =head1 GLOBAL CONFIGURATION OPTIONS
202              
203             =over 4
204              
205             =item debug
206              
207                 ---
208                 debug: 1
209              
210             Setting debug to a non-zero value will trigger debug messages to be displayed.
211              
212             =back
213              
214             =head1 COMPONENTS
215              
216             Components add new functionality to Gungho. Components are loaded at
217             startup time from the config file / hash given to the Gungho constructor.
218              
219                 Gungho->run({
220                     components => [
221                         'Throttle::Simple'
222                     ],
223                     throttle => {
224                         max_interval => ...,
225                     }
226                 });
227              
228             Components modify Gungho's inheritance structure at run time to add
229             extra functionality to Gungho, and therefore should only be loaded
230             before starting the engine.
231              
232             Please refer to each component's documentation for details.
233              
234             =over 4
235              
236             =item Gungho::Component::Authentication::Basic
237              
238             =item Gungho::Component::BlockPrivateIP
239              
240             =item Gungho::Component::Cache
241              
242             =item Gungho::Component::RobotRules
243              
244             =item Gungho::Component::RobotsMETA
245              
246             =item Gungho::Component::Scraper
247              
248             =item Gungho::Component::Throttle::Domain
249              
250             =item Gungho::Component::Throttle::Simple
251              
252             =back
253              
254             =head1 INLINE
255              
256             If you're looking to write simple crawlers, you may want to look at Gungho::Inline:
257              
258                 Gungho::Inline->run({
259                     provider => sub { ... },
260                     handler => sub { ... }
261                 });
262              
263             See the manual for Gungho::Inline for details.
264              
265             =head1 PLUGINS
266              
267             Plugins are different from components in that, whereas components require the
268             developer to explicitly call the methods, plugins are loaded and are not
269             touched afterwards.
270              
271             Please refer to the documentation of each plugin for details.
272              
273             =over 4
274              
275             =item RequestLog
276              
277             =item Statistics
278              
279             =back
280              
281             =head1 HOOKS
282              
283             Currently available hooks are:
284              
285             =head2 engine.send_request
286              
287             =head2 engine.handle_response
288              
289             =head1 METHODS
290              
291             =head2 component_base_class
292              
293             Returns "Gungho::Component", the base class name used by Class::C3::Componentised.
294              
295             =head1 CODE
296              
297             You can obtain the current code base from
298              
299             http://gungho-crawler.googlecode.com/svn/trunk
300              
301             =head1 AUTHOR
302              
303             Copyright (c) 2007 Daisuke Maki E<lt>daisuke@endeworks.jpE<gt>
304              
305             =head1 CONTRIBUTORS
306              
307             =over 4
308              
309             =item Jeff Kim
310              
311             =item Kazuho Oku
312              
313             =item Keiichi Okabe
314              
315             =back
316              
317             =head1 LICENSE
318              
319             This program is free software; you can redistribute it and/or modify it
320             under the same terms as Perl itself.
321              
322             See http://www.perl.com/perl/misc/Artistic.html
323              
324             =cut