Welcome to collectivesolver - Programming & Software Q&A with code examples. A website with trusted programming answers. All programs are tested and work.

Contact: aviboots(AT)netvision.net.il


How to parse robots.txt and check if crawling a specific web page is allowed in PHP

1 Answer

0 votes
function get_robots_txt($url)
{
    $parsed_url = parse_url($url);

    // Fetch robots.txt from the site root, using the same scheme as the URL.
    // Returns an array of lines, or false on failure (@ suppresses warnings)
    $robotstxt = @file("{$parsed_url['scheme']}://{$parsed_url['host']}/robots.txt");

    return $robotstxt;
}

function robots_allowed_crawl($url, $robotstxt)
{
    $parsed_url = parse_url($url);
    $path = isset($parsed_url['path']) ? $parsed_url['path'] : '/';

    // No robots.txt (or the fetch failed): crawling is allowed
    if (empty($robotstxt)) return true;

    // Collect every Disallow rule (note: User-agent sections are ignored,
    // so rules for all agents are applied)
    $rules = array();
    foreach ($robotstxt as $line)
    {
        if (!$line = trim($line)) continue; // skip blank lines

        if (preg_match('/^\s*Disallow:(.*)/i', $line, $regs))
        {
            $rule = trim($regs[1]);
            if ($rule === '') continue; // an empty Disallow allows everything
            $rules[] = preg_quote($rule, '/');
        }
    }

    // The URL is disallowed if its path starts with any Disallow rule
    foreach ($rules as $rule)
        if (preg_match("/^$rule/", $path)) return false;

    return true;
}
  
$url = "https://www.website.com/order.php";
$robotstxt = get_robots_txt($url);

// Fetch the page only if robots.txt allows it
if (robots_allowed_crawl($url, $robotstxt))
    $html = file_get_contents($url);
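To see the Disallow matching in action without a network request, here is a minimal self-contained sketch of the same prefix-matching idea, checked against an in-memory robots.txt. The helper name, sample rules, and URLs are hypothetical, for illustration only.

```php
<?php
// Sketch of the same rule check as above, but taking robots.txt
// as an array of lines instead of fetching it over HTTP
function robots_allowed_crawl_lines($url, array $lines)
{
    $path = parse_url($url, PHP_URL_PATH);
    if ($path === null || $path === false) $path = '/';

    foreach ($lines as $line) {
        if (preg_match('/^\s*Disallow:\s*(\S+)/i', $line, $m)) {
            $rule = preg_quote($m[1], '/');
            // Disallowed if the URL path starts with the rule
            if (preg_match("/^$rule/", $path)) return false;
        }
    }
    return true;
}

$robots = array("User-agent: *", "Disallow: /private/", "Disallow: /order.php");

var_dump(robots_allowed_crawl_lines("https://www.website.com/index.html", $robots)); // bool(true)
var_dump(robots_allowed_crawl_lines("https://www.website.com/order.php", $robots));  // bool(false)
```

Like the answer above, this treats every Disallow rule as a plain path prefix and does not interpret User-agent sections or wildcard patterns.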



answered Jul 4, 2016 by avibootz
