Saturday 15 March 2014

Extract images and links from website using php


Extract website data using php Web Crawling. The process by which we extract images, hyperlinks, metadata and several other things from a website is called web crawling. So in this tutorial you will learn about how to crawl images and hyperlinks from a website using php.And also I have used here jquery ajax for displaying the results without page refreshing, and its not mandatory.


So now take a look at the coding part which I did, the code is little bit longer, just for better understanding purpose. You can short this code as you need.

index.php
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Web crawling using php</title>
<link rel="stylesheet" type="text/css" href="style.css" />
<script type="text/javascript" src="jquery.min.js"></script>
<script type="text/javascript">
$(document).ready(function()
{
  $("#submit").click(function()
  {
    var url = $("#url").val();
 if(url.length > 0)
 {
   //A loading or waiting gif image will display in the demo_output div until the extract data will appearing
   $("#demo_output").html('&nbsp;&nbsp;<img src="loading.gif">');
   $.ajax
   ({
      type: "POST",
   url: "get_content.php",
   data: "url="+url,
   success: function(option)
   {
     $("#demo_output").html(option);
   }
   });
 }
  });
});
</script>
</head>
<body>
 <div align="center"><b>Extract image links and hyperlinks from website using php</b>
<br />
<br />
<div class="demo_wrapper">
  <div id="demo_input">
    &nbsp;Enter Url :&nbsp;&nbsp;&nbsp;<input type="text" name="url" id="url" value="" />&nbsp;<input type="submit" id="submit" value="Go" />
  </div>
  
  <div id="demo_output">
  </div>
</div>
</div>
</body>
</html>
Here we just enter the url address and send it to the get_content.php and display the extract data, that’s all.
get_content.php
<?php
set_time_limit(120);

class Crawler
{

   protected $markup = "";
   public function __construct($url)
    {
      $this->markup = $this->getMarkup($url);
    }

   public function getMarkup($url)
    {
      return file_get_contents($url);
    }

   public function get($type)
    {
      $method = "_get_{$type}";
      if (method_exists($this, $method))
   {
  return call_user_func(array($this, $method));
      }
    }

   protected function _get_images() 
    {
      if (!empty($this->markup))
   {
  preg_match_all('/<img [^>]*src="?([^ ">]+)"?/i', $this->markup, $images);
        return !empty($images[1]) ? $images[1] : FALSE;
      }
    }

   protected function _get_links() 
    {
      if (!empty($this->markup))
   {
  preg_match_all('/<a [^>]*href="?([^ ">]+)"?/i', $this->markup, $links);
        return !empty($links[1]) ? $links[1] : FALSE;
      }
    }
}  // End of Crawler class


if(isset($_POST['url']) && $_POST['url'] != '')
{
$url = $_POST['url'];
//We must enter http:// or https:// before the url, if it does not, then we check here 
//and write http if needed.
if(substr($url, 0, 4) != 'http') $url = 'http://'.$url;
//Create an object of class Crawler.
$crawl = new Crawler($url);
//Call the function get() with argument "images"
$images = $crawl->get('images');
//Call the function get() with argument "links"
$links  = $crawl->get('links');
$i = 0;
echo "<table cellpadding='5'>";
echo "<tr><td id='title'>IMAGE LINKS</td></tr>";
//Here we chech if array $images is empty or not. If it empty then we just pass the control to 
//the else condition.
if(!empty($images))
{
    //Here we print the image links
 foreach($images as $img)
 {
   if($i%2 == 0) $style = "style='background-color:#cccccc;'";
   else $style="style='background-color:#eeeeee;'";
   if($img[0] == "'") $img = substr($img,1,-1);
   echo "<tr><td ".$style.">".$img."</td></tr>";
   $i++;
 }
}
else echo "<tr><td>No Image!</td></tr>";
echo "</table>";


$j = 0;
echo "<table cellpadding='5'>";
echo "<tr><td id='title'>HYPERLINKS</td></tr>";
//Here we chech if array $links is empty or not. If it empty then we just pass the control to 
//the else condition.
if(!empty($links))
{
    //Here we print the hyperlinks
 foreach($links as $link)
 {
   if($j%2 == 0) $style = "style='background-color:#cccccc;'";
   else $style="style='background-color:#eeeeee;'";
   if($link[0] == "'") $link = substr($link,1,-1);
   if($link[0] == "/") $link = $_POST['url'].$link;
   echo "<tr><td ".$style.">".$link."</td></tr>";
   $j++;
 }
}
else echo "<tr><td>No Hyperlink!</td></tr>";
echo "</table>";
}
?>

Don’t be afraid to seeing the above code, it just contains some extra codes for properly display in the demo_output div. You just need the crawler class and need to create a object of that class and call the functions with this object. file_get_contents() reads entire file into a string and it returns the read data or FALSE on failure. Function _get_images() extract all image links from the webpage using Regular Expression. preg_match_all() performs a global regular expression match. Search the webpage for all matches to the regular expression given in pattern and put them in $images array. Function _get_links() does exactly same as _get_images(). These two function returns arrays with extract image links and hyperlinks in the $images and $links arrays respectively. Now we just need to print them.

No comments:

Post a Comment