#!/usr/bin/php
<?php
// PHP Version => 5.2.0-8+etch7 (ihrisko.org)



//phpinfo(); die;

ini_set('default_socket_timeout', 1); // How long to wait for the webserver (seconds)
set_time_limit(0);                    // Maximum run time in seconds (0 = no limit)
//$url='http://ad.doubleclick.net/click';
$url = 'http://w.moreover.com/';
$max_size = 10000;
$i = 0;
while(1) {
	$contents = @file_get_contents($url, false, null, 0, $max_size);
	echo "#";
	// Extract absolute URLs from the fetched page (disabled while isolating the freeze):
	//preg_match_all('(http:\/\/[_a-zA-Z0-9\.\-]+\.[a-zA-Z]{2,4}\/{1}[-_~&=\ ?\.a-z0-9\/]*)', htmlspecialchars_decode($contents), $new_urls);
	//$new_urls = $new_urls[0];
	//print_r($new_urls);
	$i++;
	echo $i;
}
die;

$stdin = fopen('php://stdin', 'r');
while(!feof($stdin)) {
	$url = trim(fgets($stdin)); // strip the trailing newline so the URL is usable
	// Extract e-mail addresses from the fetched page
	preg_match_all("/[-a-z0-9\._]+@[-a-z0-9\._]+\.[a-z]{2,4}/", @file_get_contents($url), $emails);
	$emails = $emails[0];
	foreach($emails as $email) {
		echo $email . "\n";
	}
}


die;

?>
Hi everybody!
I am trying to write a web crawler/spider (as a school project, and - of course - I am trying to become wealthier than Google ;o)

So... I have a big/small problem:
I am using file_get_contents() (I've tried fopen() too...).
The crawler works 100% great, but sometimes it freezes. I tried to trace which function freezes, and I found it: it's file_get_contents()...

So I googled and found the default_socket_timeout setting. I set it to 1, but sometimes the call still freezes and never comes back.
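
Maybe a per-request timeout passed through a stream context would behave differently from the global setting? This is just a rough sketch I haven't tested, and as far as I know the http 'timeout' context option needs PHP 5.2.1 or newer, so it may not even exist on my box:

#!/usr/bin/php
<?php
// Untested sketch: per-request timeout through a stream context
// (the 'timeout' option of the http wrapper; PHP >= 5.2.1 as far as I know).
$url = 'http://w.moreover.com/';
$context = stream_context_create(array(
	'http' => array(
		'timeout' => 5, // give up on this request after 5 seconds
	),
));
while(1) {
	@file_get_contents($url, false, $context, 0, 10000);
	echo "#";
}
?>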

I've made the example below so you can see that it freezes after a few iterations. I have supplied a URL that causes my crawler to freeze (I'm not sure why...):

#!/usr/bin/php
<?php

ini_set('default_socket_timeout',1);
set_time_limit(0);
//$url='http://ad.doubleclick.net/click';
$url='http://w.moreover.com/';
while(1) {
	@file_get_contents($url, false, null, 0, 10000);
	echo "#";
}

?>

Of course, if somebody wants to be better than Google, he has to have a very good crawler. So I need very solid code that can run and crawl for days without crashing (like this one does). Yes, it's true that it ran for 1 or 2 hours before it crashed or I stopped it, but file_get_contents() doesn't work the way I need it to.
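
Maybe switching the fetch to cURL would help, since CURLOPT_TIMEOUT puts a hard cap on the total time of a request, not just on waiting for the socket? Another untested sketch, assuming the curl extension is installed:

#!/usr/bin/php
<?php
// Untested sketch: fetch a page with cURL and hard per-request limits
// (assumes the curl extension is available).
function fetch($url) {
	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);    // max seconds to establish the connection
	curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // max seconds for the whole request
	$contents = curl_exec($ch);
	curl_close($ch);
	return $contents; // false on error or timeout
}

$url = 'http://w.moreover.com/';
while(1) {
	fetch($url);
	echo "#";
}
?>

The idea is that even if a server accepts the connection and then stalls, the request can't hang for more than CURLOPT_TIMEOUT seconds. But I don't know if that is really the problem here.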

If you are interested in crawling, you can write to me on YouSeekMe: 283//782//978 ;D

And here are a few statistics from my last session:
+5431 URLs; 19292 Downloaded; UpTime: 21.5 mins; Buffered: 30 URLs; History: 1000 URLs; Speed: 4.22 URLs/s, 14.98 Downloads/s

THX4AnyHelp ;o)