// PHP Version => 5.2.0-8+etch7 (ihrisko.org)
ini_set('default_socket_timeout',1); // How long to wait for a web server (seconds)
set_time_limit(0); // Maximum run time in seconds (0 = no limit)
//$url='http://ad.doubleclick.net/click';
$url='http://w.moreover.com/';
file_get_contents($url, false, null, 0, $max_size);
//preg_match_all('(http:\/\/[_a-zA-Z0-9\.\-]+\.[a-zA-Z]{2,4}\/{1}[-_~&=\ ?\.a-z0-9\/]*)',htmlspecialchars_decode(@file_get_contents($url, false, null, 0, $max_size)), $new_urls);
$new_urls = $new_urls[0];
print_r($new_urls); echo($i);
$stdin = fopen('php://stdin','r');
while(!feof($stdin)) {
preg_match_all("/[-a-z0-9\._]+@[-a-z0-9\._]+\.[a-z]{2,4}/", @file_get_contents($url), $emails); $emails = $emails[0];
foreach($emails as $email) {
I am trying to write a web crawler/spider (as a school project, and, of course, I am trying to get wealthier than Google ;o)
So... I have a big/small problem:
I am using file_get_contents() (I've tried fopen() too...).
The crawler works great, but sometimes it freezes. I tried to trace which function freezes, and I found it: it's file_get_contents()...
So I googled and found the default_socket_timeout setting. I set it to 1, but sometimes it still freezes and never recovers.
I've made this example so you can see that it freezes after a few iterations. I have supplied a URL that causes my crawler to freeze (I'm not sure why...):
ini_set('default_socket_timeout',1);
//$url='http://ad.doubleclick.net/click';
$url='http://w.moreover.com/';
@file_get_contents($url, false, null, 0, 10000);
Of course, if somebody wants to be better than Google, he has to have a very good crawler. So I need very solid code that can run and crawl for days without crashing (like this one does). Yeah, it's true that this worked for 1 or 2 hours before it crashed or I stopped it, but file_get_contents() doesn't work the way I need.
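One possible workaround, as a minimal sketch only: as far as I understand, default_socket_timeout limits each individual socket operation, not the whole download, so a server that keeps sending small pieces may never trigger it. The sketch below reads the response in chunks and gives up after a total deadline. The names read_with_deadline() and fetch_url() and the 5-second deadline are made up for illustration, and I have not confirmed that this cures the freeze on PHP 5.2.0. Note that fopen() itself (connect and headers) is still bounded only by default_socket_timeout.

```php
<?php
// Hypothetical helper: read at most $maxBytes from an open stream,
// giving up once $deadline seconds have passed in total.
function read_with_deadline($stream, $maxBytes, $deadline)
{
    $start = microtime(true);
    $data = '';
    while (strlen($data) < $maxBytes && !feof($stream)) {
        if (microtime(true) - $start > $deadline) {
            break; // overall deadline exceeded, return what we have so far
        }
        $chunk = fread($stream, min(8192, $maxBytes - strlen($data)));
        $meta = stream_get_meta_data($stream);
        if ($chunk === false || !empty($meta['timed_out'])) {
            break; // read error or per-read timeout
        }
        $data .= $chunk;
    }
    return $data;
}

// Hypothetical wrapper: returns the (possibly partial) body, or false if
// the URL could not be opened. The default values are assumptions.
function fetch_url($url, $maxBytes = 10000, $deadline = 5.0)
{
    $stream = @fopen($url, 'r');
    if ($stream === false) {
        return false;
    }
    stream_set_timeout($stream, 1); // per-read timeout in seconds
    $data = read_with_deadline($stream, $maxBytes, $deadline);
    fclose($stream);
    return $data;
}
```

In the crawler, the call @file_get_contents($url, false, null, 0, 10000) could then be replaced by fetch_url($url, 10000), skipping the URL whenever the result is false.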
If you are interested in crawling, you can write to me on YouSeekMe: 283//782//978 ;D
And here are a few statistics from my last session:
+5431 URLs; 19292 Downloaded; UpTime: 21.5 mins; Buffered: 30 URLs; History: 1000 URLs; Speed: 4.22 URLs/s, 14.98 Downloads/s
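As a quick sanity check on these figures: the speeds are simply the counts divided by the uptime in seconds. A small sketch (rate_per_second() is a made-up name); it gives values slightly below the reported ones, presumably because 21.5 mins is a rounded uptime.

```php
<?php
// Hypothetical helper: average items per second over an uptime in minutes.
function rate_per_second($count, $uptimeMinutes)
{
    return $count / ($uptimeMinutes * 60);
}

// Counts and uptime taken from the session statistics above.
printf(
    "%.2f URLs/s, %.2f Downloads/s\n",
    rate_per_second(5431, 21.5),
    rate_per_second(19292, 21.5)
);
// prints "4.21 URLs/s, 14.96 Downloads/s"
```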