How to Scrape Data from indeed.com with PHP

Indeed is the number one job posting site in the world, with millions of unique visitors every month. This giant job portal carries jobs posted by recruiters from all around the world. When you are building a job portal of your own, fetching jobs from indeed.com is an easy way to get your initial listings. Today I am going to show you how to scrape data from indeed.com with PHP. Recently I coded a XenForo job listings addon. Basically it was a mini job portal where people could post jobs in specific categories and regions. To fill up the database I had to scrape data from indeed.com with PHP. I am sharing with you the complete script that did the work.

Scrape data from indeed.com with php complete code

Here is the complete PHP script that I coded to scrape data from indeed.com. It has been tested and is in use on several forums without any issues. You are free to use it and adapt it to your needs. Web scraping with PHP is very easy, but you have to understand the document structure and know which data you need to fetch. I recommend reading the code explanation that follows the script.

class IndeedFetch {

    public static function fetchJobs($brands, $locations) {

        //Results are grouped by brand id, then by location id
        $jobArray = array();

        //Loop through each brand
        foreach ($brands as $brand) {

            //prepare the brand parameters
            $brandTitle = $brand['brand_title'];
            $brandId = $brand['brand_id'];

            //we need to use this brand title format for first page at indeed.com
            $brandTitle1 = str_replace(' ', '-', $brandTitle);

            //we need to use this brand title format for other pages
            $brandTitle = str_replace(' ', '+', $brandTitle);

            //loop through each location
            foreach ($locations as $location) {

                //prepare the location parameters to save in the database
                $locationTitle = $location['name'];
                $locationId = $location['location_id'];
                $locationTitle1 = str_replace(' ', '-', $locationTitle);
                $locationTitle = str_replace(' ', '+', $locationTitle);

                //Get jobs for the first page
                $url = "https://www.indeed.com.pk/q-" . $brandTitle1 . "-l-" . $locationTitle1 . "-jobs.html";
                $jobLinks = self::getJobLinks($url);

                //Get jobs for the other pages (start=10,20,...,50)
                for ($i = 10; $i <= 50; $i = $i + 10) {
                    $url = "https://www.indeed.com.pk/jobs?q=" . $brandTitle . "&l=" . $locationTitle . "&start=" . $i;
                    if (is_array($jobLinks)) {
                        $jobLinks = array_merge($jobLinks, self::getJobLinks($url));
                    }
                } //end for loop

                //Fetch the details of every job link we collected
                $jobs = array();
                foreach ($jobLinks as $jobLink) {
                    $job = self::getJobDetails($jobLink);
                    if ($job) {
                        //append the job; array_merge() here would nest the list inside itself
                        $jobs[] = $job;
                    } //end this if condition
                } //end this foreach loop for job links

                //$location is an array and cannot be used as a key, so key by the ids
                $jobArray[$brandId][$locationId] = $jobs;
            } //end locations loop
        } //End brands loop

        return $jobArray;
    } //end function

    public static function getJobLinks($url) {

        $html = file_get_contents($url);
        $jobLinks = array();
        $job_doc = new DOMDocument();

        libxml_use_internal_errors(TRUE); //disable libxml errors

        if (!empty($html)) { //if any html is actually returned
            $job_doc->loadHTML($html);
            libxml_clear_errors(); //remove errors for yucky html

            $job_xpath = new DOMXPath($job_doc);

            //get the anchor inside every job title heading
            $anchors = $job_xpath->query('//h2[@class="jobtitle"]/a');

            if ($anchors->length > 0) {

                //loop through all the job title anchors
                foreach ($anchors as $a) {
                    //no trailing line break here: the link is parsed again in getJobDetails
                    $jobLinks[] = "https://www.indeed.com" . $a->getAttribute("href");
                }
            }
        }

        return $jobLinks;
    }

    public static function getJobDetails($url) {

        //disable the redirection
        $context = stream_context_create(
                array(
                    'http' => array(
                        'follow_location' => false
                    )
                )
        );

        //Get the required parameter from url to reform it
        $parts = parse_url(trim($url));
        $query = array();
        if (isset($parts['query'])) {
            parse_str($parts['query'], $query);
        }
        $last = isset($query['jk']) ? $query['jk'] : 0;

        //form the url correctly because original url is giving errors
        if ($last) {
            $url = "https://www.indeed.com/viewjob?jk=" . $last;
        } else {
            return null;
        }

        $html = file_get_contents($url, false, $context);

        if (empty($html)) { //nothing was returned
            return null;
        }

        $job_doc = new DOMDocument();
        libxml_use_internal_errors(TRUE); //disable libxml errors
        $job_doc->loadHTML($html);
        libxml_clear_errors(); //remove errors for yucky html

        $job_xpath = new DOMXPath($job_doc);

        $titleNode = $job_xpath->query('//b[@class="jobtitle"]')->item(0);
        $descriptionNode = $job_xpath->query('//span[@id="job_summary"]')->item(0);
        $dateNode = $job_xpath->query('//span[@class="date"]')->item(0);
        $link = $job_xpath->query('//div[@class="job-footer-button-row"]/a');

        //skip the job if the page does not have the expected structure
        if (!$titleNode || !$descriptionNode || !$dateNode || $link->length == 0) {
            return null;
        }

        //Only keep jobs that were posted within the last day
        $postDate = strtotime($dateNode->nodeValue);
        $lastDay = strtotime('-1 day');

        if (!$postDate || $postDate < $lastDay) {
            return null;
        }

        $applyLink = "https://www.indeed.com" . $link->item(0)->getAttribute("href");

        return array(
            'title' => $titleNode->nodeValue,
            'description' => $descriptionNode->nodeValue,
            'link' => $applyLink
        );
    }

} //end class
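
One step in getJobDetails that is easy to miss is how the job link is rebuilt from its jk parameter before the page is requested. Here is a minimal sketch of that step on its own. The helper name canonicalJobUrl and the sample links are made up for illustration and are not part of the class above.

```php
<?php
// Hypothetical helper mirroring the URL clean-up step in getJobDetails().
function canonicalJobUrl(string $url): ?string
{
    // parse_url() with PHP_URL_QUERY returns null when there is no query string
    $query = parse_url(trim($url), PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // Split the query string into an array of parameters
    parse_str($query, $params);
    if (empty($params['jk']) || !is_string($params['jk'])) {
        return null; // no job key, nothing to fetch
    }

    return "https://www.indeed.com/viewjob?jk=" . $params['jk'];
}

// Both links below are made-up samples.
var_dump(canonicalJobUrl('https://www.indeed.com/rc/clk?jk=abc123&fccid=xyz'));
var_dump(canonicalJobUrl('https://www.indeed.com/company/acme'));
```

The first call should give the viewjob URL for abc123 and the second should give NULL, which is the case where the scraper skips the link.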

Scrape data from indeed.com with PHP: code explanation

I know I wrote above that it is very easy to scrape data from indeed.com, and after looking at the code you probably want to kill me :). But believe me, it is very easy. Take my hand and let's walk through the code.

Overview of the php code:

Basically we have coded a function fetchJobs inside the IndeedFetch class. That function takes two arguments: the brands (job titles) and the locations. Here I am also using a location id and a brand id; you can remove those if you are not going to save the data in a database. Once we have a location and a job title, we go through the result pages that list jobs for that location and brand. For each page we collect the individual jobs, with title and description, so that we can save them in our own database. The code does this automatically for every combination of the locations and brands provided.

Step by step explanation:

public static function fetchJobs($brands, $locations)
{
    //Loop through each brand
    foreach ($brands as $brand)
    {
        //prepare the brand parameters
        $brandTitle = $brand['brand_title'];
        $brandId = $brand['brand_id'];

        //we need to use this brand title format for first page at indeed.com
        $brandTitle1 = str_replace(' ', '-', $brandTitle);

        //we need to use this brand title format for other pages
        $brandTitle = str_replace(' ', '+', $brandTitle);

OK, so here we are receiving the brands and locations. PLEASE note that I am passing the brands and locations as arrays that include brand and location ids. You can remove those if you do not want to save the data back to the database.
The brand title has to be prepared for the URL. In short, we convert "Software Engineer" to "Software-Engineer" because that is how the Indeed URL is formed. $brandTitle1 is used for the first-page URL, and $brandTitle (with "+" instead of spaces) is used for all other pages.
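
To see the two formats side by side, here is a small sketch. The helper names firstPageSlug and queryTerm and the sample brand are made up for illustration; the scraper itself does the same thing inline with str_replace().

```php
<?php
// Hypothetical helpers (not part of IndeedFetch) showing the two title
// formats that the scraper builds with str_replace().

function firstPageSlug(string $title): string
{
    // Spaces become dashes: used in the /q-...-l-...-jobs.html URL
    return str_replace(' ', '-', $title);
}

function queryTerm(string $title): string
{
    // Spaces become plus signs: used in the /jobs?q=...&l=... URL
    return str_replace(' ', '+', $title);
}

// Sample brand in the shape fetchJobs() expects; the values are made up.
$brand = array('brand_id' => 1, 'brand_title' => 'Software Engineer');

echo firstPageSlug($brand['brand_title']), "\n"; // Software-Engineer
echo queryTerm($brand['brand_title']), "\n";     // Software+Engineer
```

Note that str_replace() only swaps the spaces. It does not change the letter case and it does not escape other special characters, so a title containing "&" or "/" would need extra handling.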

foreach ($locations as $location)
{
    //prepare the location parameters to save in the database
    $locationTitle = $location['name'];
    $locationId = $location['location_id'];

    $locationTitle1 = str_replace(' ', '-', $locationTitle);
    $locationTitle = str_replace(' ', '+', $locationTitle);

    //Get jobs for the first page
    $url = "https://www.indeed.com.pk/q-" . $brandTitle1 . "-l-" . $locationTitle1 . "-jobs.html";

    $jobLinks = self::getJobLinks($url);

OK, here we are running a second loop that pairs every location with the current brand and prepares the URL in the exact format: for example Software Engineer in New York, Software Engineer in Indiana, Software Engineer in New Jersey. After this loop finishes, we take the next brand and run the loop again. Again, $locationTitle1 and $locationTitle are two different parameters, for the first page and the other pages respectively. The last line calls another function that returns an array of job links. Those links are used to get the job title and description.
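
The pairing of brands and locations can be tried without any network request. This is only a sketch: the helper name firstPageUrls and the sample ids and names are made up, and the real fetchJobs goes on to download each URL.

```php
<?php
// Sample input in the shape fetchJobs() expects; ids and names are made up.
$brands = array(
    array('brand_id' => 1, 'brand_title' => 'Software Engineer'),
    array('brand_id' => 2, 'brand_title' => 'Web Developer'),
);
$locations = array(
    array('location_id' => 10, 'name' => 'New York'),
    array('location_id' => 11, 'name' => 'New Jersey'),
);

// Hypothetical helper: list the first-page URL for every brand/location pair.
function firstPageUrls(array $brands, array $locations): array
{
    $urls = array();
    foreach ($brands as $brand) {
        $brandSlug = str_replace(' ', '-', $brand['brand_title']);
        foreach ($locations as $location) {
            $locationSlug = str_replace(' ', '-', $location['name']);
            $urls[] = "https://www.indeed.com.pk/q-" . $brandSlug
                . "-l-" . $locationSlug . "-jobs.html";
        }
    }
    return $urls;
}

// Two brands times two locations gives four URLs.
foreach (firstPageUrls($brands, $locations) as $url) {
    echo $url, "\n";
}
```

Every extra brand or location multiplies the number of requests, so it is worth keeping both lists short while testing.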

public static function getJobLinks($url)

This function contains some typical XPath code that you can easily understand if you read up on XPath and how it works. In an Indeed results page, each job title is contained in an H2 tag with class "jobtitle", so we extract the URL from the anchor inside it. Once the URLs are collected, we return them to the calling function. That is the whole process. You can copy this function into your application as it is, and it should work as long as Indeed keeps this markup.
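
If you want to try the XPath query without hitting indeed.com, you can run it against a small HTML string. The markup below is made up to mimic the structure described above, and the helper name extractJobLinks is hypothetical; the live page may look different.

```php
<?php
// Made-up HTML that mimics the structure described above.
$html = '<html><body>'
    . '<h2 class="jobtitle"><a href="/rc/clk?jk=abc123">PHP Developer</a></h2>'
    . '<h2 class="jobtitle"><a href="/rc/clk?jk=def456">Web Developer</a></h2>'
    . '<h2 class="other"><a href="/ignored">Not a job</a></h2>'
    . '</body></html>';

// Hypothetical helper: same query as getJobLinks(), but on a string.
function extractJobLinks(string $html): array
{
    $links = array();
    if ($html === '') {
        return $links; // nothing to parse
    }

    libxml_use_internal_errors(true); // keep warnings about messy HTML quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Select the <a> directly inside every <h2 class="jobtitle">
    foreach ($xpath->query('//h2[@class="jobtitle"]/a') as $a) {
        $links[] = "https://www.indeed.com" . $a->getAttribute('href');
    }
    return $links;
}

print_r(extractJobLinks($html));
```

Only the two anchors inside the "jobtitle" headings are returned. Keep in mind that @class="jobtitle" matches the attribute exactly, so a heading with class "jobtitle featured" would not be selected.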

for ($i = 10; $i <= 50; $i = $i + 10)
{
    $url = "https://www.indeed.com.pk/jobs?q=" . $brandTitle . "&l=" . $locationTitle . "&start=" . $i;

    if (is_array($jobLinks))
    {
        $jobLinks = array_merge($jobLinks, self::getJobLinks($url));
    }
} //end for loop

$jobs = array();

foreach ($jobLinks as $jobLink)
{
    $job = self::getJobDetails($jobLink);

    if ($job)
    {
        //append the job to the list
        $jobs[] = $job;
    } //end this if condition
} //end this foreach loop for job links

//key the results by brand id and location id
$jobArray[$brandId][$locationId] = $jobs;

} //end locations loop

}//End brands loop

return $jobArray;

} //end function

OK, notice the for loop: it runs for start values 10, 20, 30, 40 and 50, which are the five result pages after the first one, because we only want the most recent jobs for a location and brand. You can change this as per your requirements. After that we call another function, getJobDetails, for each link. That function is easy and mostly self-explanatory: it fetches the detail of each job, i.e. description, title and post date, and skips jobs that were posted more than a day ago. If you have some experience with PHP programming, you can easily grasp the logic.
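
Both ideas, the paging offsets and the post date cutoff, can be tried in isolation. This is a sketch under a few assumptions: the helper names pageUrls and isRecentPost are made up, and it assumes the date text on the job page is something strtotime() understands, such as "2 days ago".

```php
<?php
// Hypothetical helper: build the URLs of the result pages after the first.
// $pages = 5 reproduces the offsets 10, 20, 30, 40, 50 used in fetchJobs().
function pageUrls(string $brandTerm, string $locationTerm, int $pages = 5): array
{
    $urls = array();
    for ($i = 1; $i <= $pages; $i++) {
        $urls[] = "https://www.indeed.com.pk/jobs?q=" . $brandTerm
            . "&l=" . $locationTerm . "&start=" . ($i * 10);
    }
    return $urls;
}

// Hypothetical helper: the same one-day cutoff that getJobDetails() applies.
// $now is a parameter only so the check can be tried with a fixed clock.
function isRecentPost(string $dateText, ?int $now = null): bool
{
    $now = $now === null ? time() : $now;

    $postDate = strtotime($dateText, $now);
    if ($postDate === false) {
        return false; // date text could not be parsed, so the job is skipped
    }

    return $postDate >= strtotime('-1 day', $now);
}

print_r(pageUrls('Software+Engineer', 'New+York', 2));
var_dump(isRecentPost('1 hour ago'));  // posted within the last day
var_dump(isRecentPost('2 days ago'));  // older than the cutoff
```

To fetch more or fewer pages, change the $pages argument. To keep older jobs, replace '-1 day' with a wider window such as '-7 days'.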

Closing words:

I personally think that is enough to get you started with your next Indeed job scraping project. The code only uses simple PHP concepts, so it should not take much time to follow if you have been coding in PHP for a year or two. If you have any questions, feel free to comment here or contact me. I would love to help you.

I am a freelance PHP programmer and web developer. If you are looking for someone to scrape data from indeed.com with PHP for your web project, you can hire me. I am also a freelance XenForo developer with expertise in custom addon development.
