Please help extract emails from websites

petelius
Posts: 3
Joined: Wed Apr 03, 2019 5:01 pm

Post by petelius » Wed Sep 25, 2019 10:25 pm

Hi,
I need to extract email addresses from web pages. The results should be saved to a CSV file.
How can I do this with Human Emulator?

support
Site Admin
Posts: 210
Joined: Fri Feb 22, 2019 3:42 pm

Post by support » Thu Sep 26, 2019 12:04 am

It depends on the programming language you want to use in your script and on the HTML code of the web pages you want to extract email addresses from.

Post by petelius » Thu Sep 26, 2019 12:16 am

I am using PHP.

Post by support » Thu Sep 26, 2019 6:45 pm

Here is an example script, Scraping Email Address. The logic of the script:

1. Get the keywords from a file.
2. Enter each keyword into the Google search engine.
3. Grab the websites from the Google search results.
4. Go to each website and look for its Contacts or About us page.
5. Extract the emails to a TXT file.

Code:

<?php
 
$xhe_host ="127.0.0.1:7010";
 
// The following code is required to properly run XWeb Human Emulator
require("../../Templates/xweb_human_emulator.php");
 
// //////////////////////// settings /////////////////////////
// data file for the script
$keys = file("data/keys.txt");
// the results file
$file_res="res/email.txt";
 
// depth of passage in search results
$cnt_pages = 10;
// current page
$crnt_page =1; 
 
// the script runs in debug mode
$dbg = true;
 
// //////////////////////// additional functions///////////////
 
require_once("functions.php");
 
// /////////////////////// script ///////////////////////////////////////////
 
debug_mess(date("\[ d.m.y H:i:s\] ")." start script");
 
// loop over all search keywords
for($ii=0;$ii<count($keys);$ii++)
{
	// get search query
	$key = trim($keys[$ii]);
 
   // navigate to google
   $browser->navigate("google.com");
 
   // set the word to search
   $input->set_value_by_name("q",$key);
   $input->click_by_name("q");
   // press the space bar to disable the google tooltip
   $keyboard->send_key(32,true);
 
   // press enter
   $keyboard->send_key(13,true);
   
	// wait 
	sleep(3);
 
	// reset the page counter before the next keyword
	$crnt_page=1;
	
   while(true)
   {
		 // get all links to sites enclosed in tags <cite>
		 $sites=$webpage->get_body_inter_prefix_all("<cite>","</cite>");
		 $sites=explode("<br>",$sites);
	        // let's go through all the links received
		 for($i=0;$i<count($sites);$i++)
		 {        
			// go to the website
			$site=str_replace("<b>","",trim($sites[$i]));
			$site=str_replace("</b>","",$site);
			if($site=="")
			  continue;
			// output to debug panel
			debug_mess("link : ".$site); 
	      
			// open and make a new browser active
			$browser->set_count(2);
			$browser->set_active_browser(1,true);
         
         // go to the website
         $browser->navigate($site);
         sleep(1);
         // go to contacts
         $anchor->click_by_inner_text("contacts");
         $anchor->click_by_inner_text("Contacts");
         $anchor->click_by_inner_text("About us");
         $anchor->click_by_inner_text("about us");
			sleep(2);
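			// look for all emails on the page; the hyphen is placed last in each
			// character class so it is literal (".-_" would be a range that also matches / : < >)
			preg_match_all('#[\w.-]+@([\w-]+\.)+[a-zA-Z]{2,6}#i', $webpage->get_source(), $matches);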
       
			// go through the results (the index is not needed, so $key keeps the search keyword)
			foreach ($matches[0] as $value)
         {
				
            // remove the excess
            $str_mail=str_replace(">","",$value);
            $str_mail=str_replace("<","",$str_mail);  
            $str_mail=str_replace("mailto:","",$str_mail);   
            $str_mail=str_replace("/","",$str_mail); 
            $str_mail=str_replace("mail:","",$str_mail);  
       
            // write to file
            $textfile->add_string_to_file($file_res,trim($str_mail)."\n",60) ;
         }
		
         // close and go back
			$browser->set_active_browser(0,true);
			$browser->close_all_tabs();
	      
         // remove duplicates from file
         dedupe($file_res);
		 }
 
		// stop if we could not go to the next page of search results
		if(!next_page($crnt_page))
		  break;
  }
 
}
debug_mess(date("\[ d.m.y H:i:s\] ")."the script is finished<br>");
 
// Quit
$app->quit();
?>
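
A note on the email pattern: it can be tried on its own in plain PHP, outside Human Emulator. One thing to watch is where the hyphen sits inside a character class. Written as `.-_` it is a range from `.` to `_`, which also matches characters such as `/`, `:`, `<` and `>`; placed last, it is a literal hyphen. The sketch below is only an illustration: the function name `extract_emails()` and the sample HTML are made up and are not part of Human Emulator.

```php
<?php
// Hypothetical helper (not part of Human Emulator): returns the unique
// email addresses found in an HTML string.
function extract_emails(string $html): array
{
    // The hyphen is last in each character class, so it is a literal character.
    $pattern = '#[\w.-]+@(?:[\w-]+\.)+[a-zA-Z]{2,6}#';
    preg_match_all($pattern, $html, $matches);

    // Lower-case first so "Info@Example.com" and "info@example.com" collapse.
    $emails = array_map('strtolower', $matches[0]);

    // Drop duplicates and reindex from 0.
    return array_values(array_unique($emails));
}

// Made-up sample markup, for illustration only.
$html = '<a href="mailto:info@example.com">Info@Example.com</a> <p>sales@shop.example.org</p>';
print_r(extract_emails($html));
```

With a pattern like this the `mailto:` prefix and the surrounding angle brackets are never captured, so little cleanup should be needed afterwards. The `{2,6}` limit on the last label is kept from the script; longer top-level domains would be cut short by it.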

Download the script (in Russian): http://www.x-scripts.com/scripts/downlo ... ?script=23
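
Since the question asked for a CSV file and the script writes a TXT file with one address per line, the result can be converted afterwards. Below is a minimal sketch in plain PHP, under a few assumptions: the function name `emails_txt_to_csv()` and the single `email` column are hypothetical choices, and the input is assumed to look like `res/email.txt` from the script.

```php
<?php
// Hypothetical helper: converts a TXT file with one address per line
// into a one-column CSV file. Returns the number of addresses written.
function emails_txt_to_csv(string $txt_path, string $csv_path): int
{
    if (!is_readable($txt_path)) {
        return 0;
    }

    // Read all lines without trailing newlines, skipping empty lines.
    $lines = file($txt_path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return 0;
    }

    // Trim whitespace and drop duplicate addresses.
    $emails = array_unique(array_map('trim', $lines));

    $out = fopen($csv_path, 'w');
    if ($out === false) {
        return 0;
    }

    // Header row; the column name is an assumption.
    fputcsv($out, ['email'], ',', '"', '\\');

    $count = 0;
    foreach ($emails as $email) {
        // Skip anything that does not look like an address at all.
        if ($email === '' || strpos($email, '@') === false) {
            continue;
        }
        fputcsv($out, [$email], ',', '"', '\\');
        $count++;
    }

    fclose($out);
    return $count;
}
```

It could be called once after the main loop, for example as emails_txt_to_csv("res/email.txt", "res/email.csv"). The CSV path here is only an example.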

Post by petelius » Thu Sep 26, 2019 9:02 pm

Excellent, thanks.
