[mirrors/Programs.git] / php / crawler / old / emails.php
#!/usr/bin/php
<?php
// PHP Version => 5.2.0-8+etch7 (ihrisko.org)

//phpinfo(); die;

ini_set('default_socket_timeout', 1); // How long to wait for a webserver (seconds)
set_time_limit(0); // How long the script may run (0 = no limit)
//$url = 'http://ad.doubleclick.net/click';
$url = 'http://w.moreover.com/';
$max_size = 10000; // Read at most this many bytes per page
$i = 0;
// Test loop: fetch the same URL over and over and pull http:// links out of
// the body. This is the loop that eventually freezes inside file_get_contents().
while(1) {
	preg_match_all('(http:\/\/[_a-zA-Z0-9\.\-]+\.[a-zA-Z]{2,4}\/{1}[-_~&=\ ?\.a-z0-9\/]*)', htmlspecialchars_decode(@file_get_contents($url, false, null, 0, $max_size)), $new_urls);
	echo "#";
	$new_urls = $new_urls[0]; // Keep only the full-pattern matches
	$i++;
	print_r($new_urls); echo($i);
}
die; // Unreachable: everything below is dead code kept for reference.
// Email extractor: read URLs from stdin (one per line), download each page
// and print every e-mail address found in it.
$stdin = fopen('php://stdin', 'r');
while(!feof($stdin)) {
	$url = trim(fgets($stdin)); // Strip the trailing newline, or the URL won't fetch
	preg_match_all("/[-a-z0-9\._]+@[-a-z0-9\._]+\.[a-z]{2,4}/i", @file_get_contents($url), $emails);
	$emails = $emails[0];
	foreach($emails as $email) {
		echo($email."\n");
	}
}
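// Usage sketch (an assumption, not part of the original file): the loop above
// expects one URL per line on stdin, so something like
//   echo 'http://example.com/' | ./emails.php
//   cat urls.txt | ./emails.php > found-emails.txt
// (urls.txt and found-emails.txt are placeholder names).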


die;

?>
Hi everybody!
I am trying to write a WebCrawler/Spider (as a school project, and -of course- I am trying to become wealthier than Google ;o)

So... I have a big/small problem:
I am using file_get_contents() (I've tried fopen() too...).
The crawler works 100% great, but sometimes it freezes. I tried to trace which function freezes, and I found it: it's file_get_contents()...

So I googled and found the default_socket_timeout setting. I set it to 1 second, but sometimes the call still freezes and never comes back.
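
(A side note added here as a sketch, not from the original post: instead of the global ini setting, file_get_contents() accepts a stream context, and the http wrapper takes a per-request 'timeout' option. It bounds socket waits much like default_socket_timeout does, so it may not cure this particular hang; it also needs PHP 5.2.1 or newer, which is an assumption given the 5.2.0 version note above.)

<?php
// Sketch: per-request timeout via a stream context.
$ctx = stream_context_create(array('http' => array('timeout' => 5)));
$page = @file_get_contents('http://w.moreover.com/', false, $ctx, 0, 10000);
?>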

I've made this example so you can see that it freezes after a few iterations. I have supplied a URL that causes my crawler to freeze (I'm not sure why...):

#!/usr/bin/php
<?php

ini_set('default_socket_timeout',1);
set_time_limit(0);
//$url='http://ad.doubleclick.net/click';
$url='http://w.moreover.com/';
while(1) {
	@file_get_contents($url, false, null, 0, 10000);
	echo "#";
}

?>
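
(A workaround sketch appended here, not something the original post contains: default_socket_timeout only bounds how long PHP waits on the socket itself, so a server that keeps trickling bytes can hold file_get_contents() open far longer than expected. cURL can cap the total transfer time with CURLOPT_TIMEOUT, so a loop like the one above can no longer hang forever. fetch_with_timeout() is an illustrative helper name, not from the original code.)

#!/usr/bin/php
<?php
// Sketch: fetch with hard caps on connect time and total transfer time.
function fetch_with_timeout($url, $timeout = 5, $max_size = 10000) {
	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
	curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 1);    // give up if connecting takes over 1 s
	curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);    // give up if the whole transfer takes over $timeout s
	$data = curl_exec($ch);
	curl_close($ch);
	if ($data === false) return false;
	return substr($data, 0, $max_size); // enforce the size cap ourselves
}

while(1) {
	fetch_with_timeout('http://w.moreover.com/');
	echo "#"; // each iteration is now bounded by the timeouts, never endless
}
?>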

Of course, anybody who wants to be better than Google has to have a very good crawler. So I need very solid code that can run and crawl for days without crashing (like this one). Yeah, it's true that this one worked for 1 or 2 hours before it crashed or I stopped it, but file_get_contents() just doesn't work the way I need.

If you are interested in crawling, you can write to me on YouSeekMe: 283//782//978 ;D

And here are a few statistics from my last session:
+5431 URLs; 19292 downloaded; uptime: 21.5 mins; buffered: 30 URLs; history: 1000 URLs; speed: 4.22 URLs/s, 14.98 downloads/s

THX4AnyHelp ;o)