php/crawler/old/emails.php
#!/usr/bin/php
<?php
// PHP Version => 5.2.0-8+etch7 (ihrisko.org)

//phpinfo(); die;

ini_set('default_socket_timeout', 1); // how long to wait for a webserver (seconds)
set_time_limit(0);                    // how long the script may run (0 = no limit)

//$url = 'http://ad.doubleclick.net/click';
$url = 'http://w.moreover.com/';
$max_size = 10000; // read at most this many bytes per page
$i = 0;
while (1) {
    // Fetch the page (at most $max_size bytes) and collect every http:// URL found in it.
    preg_match_all('(http:\/\/[_a-zA-Z0-9\.\-]+\.[a-zA-Z]{2,4}\/{1}[-_~&=\ ?\.a-z0-9\/]*)',
        htmlspecialchars_decode(@file_get_contents($url, false, null, 0, $max_size)),
        $new_urls);
    echo "#"; // progress marker, one per fetch
    $new_urls = $new_urls[0];
    $i++;
    print_r($new_urls);
    echo $i;
}
die;

// Unreachable while the loop above runs forever: read URLs from stdin,
// one per line (e.g. `echo http://example.com/ | ./emails.php`),
// and print every e-mail address found on each page.
$stdin = fopen('php://stdin', 'r');
while (!feof($stdin)) {
    $url = trim(fgets($stdin)); // strip the trailing newline, or the fetch fails
    // case-insensitive so addresses with capital letters match too
    preg_match_all("/[-a-z0-9\._]+@[-a-z0-9\._]+\.[a-z]{2,4}/i",
        @file_get_contents($url), $emails);
    $emails = $emails[0];
    foreach ($emails as $email) {
        echo $email . "\n";
    }
}

die;

?>
Hi everybody!
I am trying to write a web crawler/spider (as a school project, and -of course- I am trying to become wealthier than Google ;o)

So... I have a big/small problem:
I am using file_get_contents() (I've tried fopen() too...).
The crawler works 100% great, but sometimes it freezes. I tried to trace which function was freezing, and I found it: it's file_get_contents()...

So I googled and found the default_socket_timeout setting. I set it to 1, but sometimes the script still freezes and never comes back.

I've put together this example so you can see that it freezes after a few iterations. I have supplied a URL that causes my crawler to freeze (I'm not sure why...):
#!/usr/bin/php
<?php

ini_set('default_socket_timeout', 1);
set_time_limit(0);
//$url = 'http://ad.doubleclick.net/click';
$url = 'http://w.moreover.com/';
while (1) {
    @file_get_contents($url, false, null, 0, 10000);
    echo "#";
}

?>

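One workaround I have come across (just a sketch, untested against this exact server): pass a stream context with an explicit 'timeout' option instead of relying only on default_socket_timeout. Note that this HTTP context option needs PHP 5.2.1 or newer, so it may not help on the exact 5.2.0 build above; the URL and the 5-second limit are placeholders.

<?php
// Sketch: per-request read timeout via a stream context.
// Assumes PHP >= 5.2.1 (the 'timeout' HTTP context option);
// the URL and the 5-second limit are placeholder values.
$ctx = stream_context_create(array(
    'http' => array(
        'timeout' => 5, // read timeout in seconds, for this request only
    ),
));
$page = @file_get_contents('http://w.moreover.com/', false, $ctx, 0, 10000);
if ($page === false) {
    echo "fetch failed or timed out\n";
}
?>

The catch: this timeout applies to each socket read, so a server that trickles one byte at a time can still keep the request alive much longer than 5 seconds in total.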
Of course, if somebody wants to be better than Google, he has to have a very good crawler. So I need very solid code that can run and crawl for days without crashing (like this one). True, it ran for an hour or two before it crashed or I stopped it, but file_get_contents() just doesn't work the way I need.
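If file_get_contents() can't be tamed, here is a sketch of an alternative using the cURL extension (assuming it is installed; the option values are placeholders, and fetch_with_timeout() is just an illustrative helper name). CURLOPT_TIMEOUT caps the whole transfer, not just each socket read, so a stuck request cannot hang forever:

<?php
// Sketch: fetch with cURL so the *entire* transfer is bounded.
// Assumes the curl extension is loaded; limits are placeholders.
function fetch_with_timeout($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);    // max seconds to establish the connection
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // max seconds for the whole request
    $body = curl_exec($ch);                         // false on failure or timeout
    curl_close($ch);
    return $body;
}

$page = fetch_with_timeout('http://w.moreover.com/');
echo ($page === false) ? "timed out or failed\n" : "#";
?>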

If you are interested in crawling, you can write to me on YouSeekMe: 283//782//978 ;D

And here are a few statistics from my last session:
+5431 URLs; 19292 Downloaded; UpTime: 21.5 mins; Buffered: 30 URLs; History: 1000 URLs; Speed: 4.22 URLs/s, 14.98 Downloads/s

THX4AnyHelp ;o)