How to Scrape data from indeed.com with php -- Complete code

Indeed is the number one job posting site in the world, with millions of unique visitors every month. This giant job portal features listings posted by recruiters from all around the globe. When creating a job portal, it’s quite easy to populate your site with initial listings by scraping jobs from Indeed.com.

Today, I’m going to show you how to scrape data from Indeed.com using PHP.

Recently, I developed a XenForo Job Listings addon, essentially a mini job portal where users can post jobs by category and region. To populate the database, I scraped job listings from Indeed.com using PHP. I’m now sharing the complete script that successfully did the job.

Scrape data from indeed.com with php code

Here is the complete PHP script I coded to scrape data from Indeed.com. It has been tested and is being used on several forums without any issues. You’re free to implement and use it for your own needs.

Web scraping with PHP is quite easy, but it’s important to understand the structure of the webpage and identify the key data you need to extract. I recommend reading the code explanation provided below to better understand how it works.


<?php

class IndeedFetch {

    public static function fetchJobs($brands, $locations) {

        $jobArray = array();

        //Loop through each brand
        foreach ($brands as $brand) {

            //prepare the brand parameters
            $brandTitle = $brand['brand_title'];
            $brandId = $brand['brand_id'];

            //we need to use this brand title format for the first page at indeed.com
            $brandTitle1 = str_replace(' ', '-', $brandTitle);

            //we need to use this brand title format for the other pages
            $brandTitle = str_replace(' ', '+', $brandTitle);

            //loop through each location
            foreach ($locations as $location) {

                //prepare the location parameters to save in the database
                $locationTitle = $location['name'];
                $locationId = $location['location_id'];
                $locationTitle1 = str_replace(' ', '-', $locationTitle);
                $locationTitle = str_replace(' ', '+', $locationTitle);

                //Get jobs for the first page
                $url = "https://www.indeed.com.pk/q-" . $brandTitle1 . "-l-" . $locationTitle1 . "-jobs.html";
                $jobLinks = self::getJobLinks($url);

                //Urls for the other pages (start = 10, 20, ... 50)
                for ($i = 10; $i <= 50; $i = $i + 10) {
                    $url = "https://www.indeed.com.pk/jobs?q=" . $brandTitle . "&l=" . $locationTitle . "&start=" . $i;
                    if (is_array($jobLinks)) {
                        $jobLinks = array_merge($jobLinks, self::getJobLinks($url));
                    }
                } //end for loop

                $jobs = array();
                foreach ($jobLinks as $jobLink) {
                    $job = self::getJobDetails($jobLink);
                    if ($job) {
                        //append the job; array_merge here would fold the earlier jobs into each new entry
                        $jobs[] = $job;
                    } //end this if condition
                } //end this foreach loop for job links

                //key by brand title and location name; $location itself is an array and cannot be a key
                $jobArray[$brand['brand_title']][$location['name']] = $jobs;
            } //end locations loop
        } //End brands loop

        return $jobArray;
    } //end function

    public static function getJobLinks($url) {

        $html = file_get_contents($url);
        $jobLinks = array();
        $job_doc = new DOMDocument();

        libxml_use_internal_errors(TRUE); //disable libxml errors

        if (!empty($html)) { //if any html is actually returned
            $job_doc->loadHTML($html);
            libxml_clear_errors(); //remove errors for yucky html

            $job_xpath = new DOMXPath($job_doc);

            //get the anchor inside every h2 with the class "jobtitle"
            $anchors = $job_xpath->query('//h2[@class="jobtitle"]/a');

            if ($anchors->length > 0) {

                //loop through all the anchors
                foreach ($anchors as $a) {
                    $jobLinks[] = "https://www.indeed.com" . $a->getAttribute("href");
                }
            }
        }

        return $jobLinks;
    }

    public static function getJobDetails($url) {

        //disable the redirection
        $context = stream_context_create(
            array(
                'http' => array(
                    'follow_location' => false
                )
            )
        );

        //Get the required parameter from url to reform it
        $parts = parse_url($url);
        $query = array();
        if (isset($parts['query'])) {
            parse_str($parts['query'], $query);
        }
        $last = isset($query['jk']) ? $query['jk'] : 0;

        //form the url correctly because original url is giving errors
        if ($last) {
            $url = "https://www.indeed.com/viewjob?jk=" . $last;
        } else {
            return null;
        }

        $html = file_get_contents($url, false, $context);

        if (empty($html)) { //no html was returned
            return null;
        }

        $job_doc = new DOMDocument();
        libxml_use_internal_errors(TRUE); //disable libxml errors
        $job_doc->loadHTML($html);
        libxml_clear_errors(); //remove errors for yucky html

        $job_xpath = new DOMXPath($job_doc);

        $titleNode = $job_xpath->query('//b[@class="jobtitle"]')->item(0);
        $descriptionNode = $job_xpath->query('//span[@id="job_summary"]')->item(0);
        $dateNode = $job_xpath->query('//span[@class="date"]')->item(0);
        $link = $job_xpath->query('//div[@class="job-footer-button-row"]/a');

        //skip the job if any element we need is missing from the page
        if (!$titleNode || !$descriptionNode || !$dateNode || $link->length == 0) {
            return null;
        }

        //fetch jobs for this day only: skip anything older than one day
        $postDate = strtotime($dateNode->nodeValue);
        $lastDay = strtotime('-1 day');

        if (!$postDate || $postDate < $lastDay) {
            return null;
        }

        $applyLink = "https://www.indeed.com" . $link->item(0)->getAttribute("href");

        return array(
            'title' => $titleNode->nodeValue,
            'description' => $descriptionNode->nodeValue,
            'link' => $applyLink
        );
    }
}

Scrape data from indeed.com with php code explanation

I know I said earlier that scraping data from Indeed.com is very easy—but after looking at the code, you might want to kill me! 😅
But seriously, believe me—it is easy once you understand it. So take my hand, and let’s walk through the code together.

Overview of the php code:

We’ve created a function called fetchJobs inside the IndeedFetch class. This function takes two arguments: an array of brands (job titles) and an array of locations. In my implementation, each entry also carries a brand ID or location ID, which you can remove if you’re not planning to store that data in a database.

Once we have the location and job title, we need to loop through all the pages that contain job listings related to the given location and brand. For each page, the script extracts individual job titles and descriptions, which can then be fetched and saved into your own database.

The script automatically scrapes data from Indeed.com using PHP for every page based on the provided locations and job titles.
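As a rough sketch of the input the function expects, the arrays below use the keys the script reads (brand_title, brand_id, name, location_id). The values themselves are made up for illustration. The call is guarded because it only works once the IndeedFetch class from this article has been included, and because it makes live HTTP requests.

```php
<?php
// Made-up sample data; only the array keys come from the script.
$brands = array(
    array('brand_id' => 1, 'brand_title' => 'Software Engineer'),
    array('brand_id' => 2, 'brand_title' => 'Web Designer'),
);
$locations = array(
    array('location_id' => 10, 'name' => 'New York'),
    array('location_id' => 11, 'name' => 'New Jersey'),
);

// The script scrapes once for every brand/location combination.
$pairs = count($brands) * count($locations);
echo "Combinations to scrape: " . $pairs . "\n";

// Only call the scraper when the class from this article is available.
if (class_exists('IndeedFetch')) {
    $jobArray = IndeedFetch::fetchJobs($brands, $locations);
    print_r($jobArray);
}
```

Every extra brand or location multiplies the number of requests, so it is worth starting with a small list.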

Step by step explanation:

public static function fetchJobs($brands, $locations)
{
    $jobArray = array();

    //Loop through each brand
    foreach ($brands as $brand)
    {
        //prepare the brand parameters
        $brandTitle = $brand['brand_title'];
        $brandId = $brand['brand_id'];

        //we need to use this brand title format for the first page at indeed.com
        $brandTitle1 = str_replace(' ', '-', $brandTitle);

        //we need to use this brand title format for the other pages
        $brandTitle = str_replace(' ', '+', $brandTitle);

Here we receive the brands and locations. Please note that I am passing the brands and locations as arrays that include the brand and location IDs. You can remove the IDs if you do not want to save the data back to the database.
The brand title has to be prepared for the URL. In short, we convert “Software Engineer” to “Software-Engineer” for the first page, because that is how the Indeed URL is formed, and to “Software+Engineer” for all other pages. $brandTitle1 holds the first-page format and $brandTitle holds the format for the other pages.
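The two URL formats described above can be sketched as a pair of small helpers. The function names buildFirstPageUrl and buildPagedUrl are hypothetical and not part of the original script. The URL patterns are copied from the script as written, and Indeed may have changed them since.

```php
<?php
// Hypothetical helpers (not in the original script) that mirror its two URL formats.

// First page: spaces become hyphens, e.g. "Software Engineer" -> "Software-Engineer".
function buildFirstPageUrl($brandTitle, $locationTitle) {
    $brand = str_replace(' ', '-', $brandTitle);
    $location = str_replace(' ', '-', $locationTitle);
    return "https://www.indeed.com.pk/q-" . $brand . "-l-" . $location . "-jobs.html";
}

// Other pages: spaces become plus signs and $start is the result offset.
function buildPagedUrl($brandTitle, $locationTitle, $start) {
    $brand = str_replace(' ', '+', $brandTitle);
    $location = str_replace(' ', '+', $locationTitle);
    return "https://www.indeed.com.pk/jobs?q=" . $brand . "&l=" . $location . "&start=" . $start;
}

echo buildFirstPageUrl('Software Engineer', 'New York'), "\n";
echo buildPagedUrl('Software Engineer', 'New York', 10), "\n";
```

If your job titles can contain characters such as & or #, PHP's built-in urlencode() is a safer choice for the query-string format, since it also turns spaces into plus signs.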

foreach ($locations as $location)
{
    //prepare the location parameters to save in the database
    $locationTitle = $location['name'];
    $locationId = $location['location_id'];
    $locationTitle1 = str_replace(' ', '-', $locationTitle);
    $locationTitle = str_replace(' ', '+', $locationTitle);

    //Get jobs for the first page
    $url = "https://www.indeed.com.pk/q-" . $brandTitle1 . "-l-" . $locationTitle1 . "-jobs.html";

    $jobLinks = self::getJobLinks($url);

Here, we’re executing another loop that combines each location with a brand to prepare the URL in the correct format. For example: Software Engineer in New York, Software Engineer in Indiana, Software Engineer in New Jersey, and so on.

Once this loop finishes processing all locations for the current brand, we move on to the next brand and repeat the process.

Note that locationTitle1 and locationTitle are two different parameters—used for the first page and the subsequent pages, respectively.

The last line in this block calls another function that returns an array of job links. These links will then be used to fetch the job title and description for each listing.

public static function getJobLinks($url)

This function includes some typical XPath code, which is easy to understand if you take a moment to look up how XPath works. The structure of an Indeed job listing is fairly straightforward: the job title is contained within an h2 tag with the class “jobtitle”, so we extract the job URL from the anchor inside it. Once the URLs are obtained, they are returned to the parent function. That’s the exact process. You can copy and paste this function directly into your application; I’m confident it will work as expected.
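To see the XPath query in isolation, here is a minimal sketch that runs it against a hard-coded HTML string instead of a live page. The sample markup and the helper name extractJobLinks are invented for illustration; only the query itself and the URL prefix come from the script.

```php
<?php
// Sample markup invented for illustration; it only imitates the h2.jobtitle structure.
$html = '<html><body>'
    . '<h2 class="jobtitle"><a href="/viewjob?jk=abc123">PHP Developer</a></h2>'
    . '<h2 class="jobtitle"><a href="/viewjob?jk=def456">Web Programmer</a></h2>'
    . '<h2 class="other"><a href="/ignored">Not a job</a></h2>'
    . '</body></html>';

// Hypothetical helper: same XPath query as getJobLinks, but it takes HTML instead of a URL.
function extractJobLinks($html) {
    $links = array();
    if (empty($html)) {
        return $links;
    }
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Select the anchor inside every h2 whose class attribute is exactly "jobtitle".
    foreach ($xpath->query('//h2[@class="jobtitle"]/a') as $a) {
        $links[] = "https://www.indeed.com" . $a->getAttribute("href");
    }
    return $links;
}

print_r(extractJobLinks($html));
```

Testing against a saved HTML string like this makes it easier to notice when the page structure changes and the query stops matching.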

for ($i = 10; $i <= 50; $i = $i + 10)
{
    $url = "https://www.indeed.com.pk/jobs?q=" . $brandTitle . "&l=" . $locationTitle . "&start=" . $i;

    if (is_array($jobLinks))
    {
        $jobLinks = array_merge($jobLinks, self::getJobLinks($url));
    }
} //end for loop

$jobs = array();

foreach ($jobLinks as $jobLink)
{
    $job = self::getJobDetails($jobLink);

    if ($job) {
        //append the job; array_merge here would fold the earlier jobs into each new entry
        $jobs[] = $job;
    } //end this if condition
} //end this foreach loop for job links

//key by brand title and location name; $location itself is an array and cannot be a key
$jobArray[$brand['brand_title']][$location['name']] = $jobs;

} //end locations loop

} //End brands loop

return $jobArray;

} //end function

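Notice the for loop: it runs with the start offsets 10, 20, 30, 40, and 50, so on top of the first page we fetch five more result pages for each location and brand. Assuming Indeed lists ten jobs per page, as the step of 10 suggests, that is up to 60 of the latest jobs. You can adjust this range based on your needs.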

To scrape data from Indeed.com with PHP, simply modify the loop or parameters to match your specific requirements.
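If you want the page range to be configurable rather than hard-coded, one possible sketch is a helper that returns the start offsets to visit. The name pageOffsets and the default of ten results per page are assumptions; the script only implies the page size through its step of 10.

```php
<?php
// Hypothetical helper: list the "start" offsets for a given number of extra pages.
// $perPage = 10 is an assumption taken from the step used in the original loop.
function pageOffsets($extraPages, $perPage = 10) {
    $offsets = array();
    for ($i = 1; $i <= $extraPages; $i++) {
        $offsets[] = $i * $perPage;
    }
    return $offsets;
}

// Five extra pages gives the same offsets as the original loop.
print_r(pageOffsets(5));
```

Inside fetchJobs you could then loop over pageOffsets(5) with foreach instead of the fixed for loop.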

After that, we call another function, getJobDetails. It’s a simple and self-explanatory function that fetches the details of each job: the title, the description, and the apply link. It also reads the post date and skips any job that was posted more than a day ago.

If you have some experience with PHP programming, I’m confident you’ll easily understand the logic behind it.
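Two steps inside getJobDetails are worth seeing on their own: rebuilding the job URL from its jk parameter, and the one-day date filter. The sketch below pulls both into hypothetical helpers (canonicalJobUrl and isRecent are not names from the original script), and the sample URLs are invented.

```php
<?php
// Hypothetical helper: rebuild the job URL from the "jk" query parameter,
// as getJobDetails does. Returns null when the link has no jk value.
function canonicalJobUrl($url) {
    $query = array();
    $queryString = parse_url($url, PHP_URL_QUERY); // null when there is no query string
    if ($queryString) {
        parse_str($queryString, $query);
    }
    if (empty($query['jk'])) {
        return null;
    }
    return "https://www.indeed.com/viewjob?jk=" . $query['jk'];
}

// Hypothetical helper: true when the date text parses and falls within the last day.
function isRecent($dateText) {
    $postDate = strtotime($dateText);
    return $postDate !== false && $postDate >= strtotime('-1 day');
}

// Invented sample URLs and date strings.
var_dump(canonicalJobUrl('https://www.indeed.com/rc/clk?jk=abc123&from=web'));
var_dump(canonicalJobUrl('https://www.indeed.com/company/acme'));
var_dump(isRecent('2 hours ago'));
var_dump(isRecent('3 days ago'));
```

Relative phrases such as "2 hours ago" are understood by strtotime(), which is why the script can compare the scraped date text directly. Text that strtotime() cannot parse returns false, and the job is skipped.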

Closing words:

I personally think this should be enough to help you get started with your next Indeed job scraping project. The code uses simple PHP concepts that shouldn’t take much time to understand, especially if you’ve been coding in PHP for a year or two.

If you have any questions, feel free to leave a comment or contact me—I’d be happy to help!

I am a freelance software consultant and web programmer. Need help scraping data from Indeed.com with PHP for your project? I’d be happy to assist, so feel free to reach out!
