Time to read: 10 minutes

Indeed Job Scraper Using Python

So for a fun little project I decided to build a job scraper for the popular job listings website Indeed. I've used the UK website (Indeed.co.uk), but it should work for any country variant (just change the .co.uk part of the source URL to your local domain).

I've also built in a simple command line input for ease of use.

TIP

For this script to work you will need the BeautifulSoup library (for scraping) and pandas (only needed if you want the output written to a CSV/Excel file).

If you'd like to find out more about them (including how to install them, e.g. with pip install beautifulsoup4 pandas), please visit the official documentation pages:

BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
pandas: https://pandas.pydata.org/docs/


Start!

So before we begin, let's examine the way Indeed.co.uk encodes its search terms into the URL. For example, if we type 'data scientist' into the search bar and 'london' as the location, we get the following URL in the address bar: https://www.indeed.co.uk/jobs?q=data+scientist&l=london

Notice that the url has 'q=' (for query) followed by the job title, with any spaces replaced by a plus symbol '+', and '&l=' (for location) followed by the location.

We can therefore use this information to input our desired job title and location into the url, and create a simple input response in Python for this.

In addition, if we click through to the next page of results, the following is added to the url: &start=10. This tells Indeed to skip the first 10 results and show the next page (results are paged 10 at a time). With this we can create a for loop which will step through multiple pages for us automatically.
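As a side note, the same URL can also be built with Python's built-in urllib.parse.urlencode, which replaces spaces with '+' for you. This is just an optional sketch using the parameters we observed in the address bar; the walkthrough below builds the URL by hand instead.

from urllib.parse import urlencode

# Build the same search URL from a dictionary of parameters
params = {'q': 'data scientist', 'l': 'london', 'start': 10}
url = 'https://www.indeed.co.uk/jobs?' + urlencode(params)
print(url)  # https://www.indeed.co.uk/jobs?q=data+scientist&l=london&start=10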

So, with that in mind - let's begin!

  1. First the required modules need to be loaded:
# Import required modules
from bs4 import BeautifulSoup # for parsing the html data
import urllib.request # for requesting the webpage
import time # to provide a brief delay in requesting data from Indeed
import pandas as pd # for writing the results to csv file (omit if not required)
  2. Next we create an empty job_postings list to append to:
job_postings = []
  3. Then we create the input prompts for the scraper, including the search term that we will append to the url. We also need to sanitise the input by replacing any spaces with '+' before adding it to the end of the url.
# Search term
search_term = input('Please enter a job title to search and press enter: ')
# Replaces spaces with + so url works correctly
search_term_url = search_term.replace(" ", "+")
# Location term
location_term = input('Please enter a location to search and press enter: ')
# Replaces spaces with + so url works correctly (again)
location_term_url = location_term.replace(" ", "+")
# Request the maximum number of results to fetch
max_results = input(
    'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
    'and press enter: ')

# Checks max_results input to ensure it is an integer
try:
    max_results_int = int(max_results)
except ValueError:
    print('Please enter a number and try again.')
    exit(1) # Exits program so user can restart
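As an optional variation (not used in the rest of the walkthrough), you could re-prompt until a valid number is entered instead of exiting:

# Optional variation: keep asking until a valid number is entered
while True:
    max_results = input(
        'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
        'and press enter: ')
    try:
        max_results_int = int(max_results)
        break
    except ValueError:
        print('Please enter a number and try again.')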
  4. Next we need to create a for loop which loops through the range of values from 0 to the max_results_int variable we set above. Indeed pages its results in steps of 10, so the loop increments by 10. The for loop will begin:
for start in range(0, max_results_int, 10):
  5. Now we will specify the source url (from Indeed) and add the search term and location term variables that the user has input:
    source = 'https://www.indeed.co.uk/jobs?q=' + \
        str(search_term_url) + '&l=' + \
        str(location_term_url) + "&start=" + str(start)
  6. Next we will open the source url with urllib and parse the data with BeautifulSoup, using the 'html.parser' parser.

Note: I have added a time.sleep(1) here to avoid requesting from the server too frequently (no more than once a second).

    soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
    time.sleep(1)
  7. Now we will define the results variable, which we will loop over in the next step.
    results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})

So far your code should resemble the following:

# Import required modules
from bs4 import BeautifulSoup
import urllib.request
import time
import pandas as pd

job_postings = []  # Create empty job_postings list to append to

# Search term
search_term = input('Please enter a job title to search and press enter: ')
# Replaces spaces with + so url works correctly
search_term_url = search_term.replace(" ", "+")
# Location term
location_term = input('Please enter a location to search and press enter: ')
# As above
location_term_url = location_term.replace(" ", "+")
# Request the maximum number of results to fetch
max_results = input(
    'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
    'and press enter: ')

# Checks to ensure max_results input is an integer
try:
    max_results_int = int(max_results)
except ValueError:
    print('Please enter a number and try again.')
    exit(1)

for start in range(0, max_results_int, 10):

    source = 'https://www.indeed.co.uk/jobs?q=' + \
        str(search_term_url) + '&l=' + \
        str(location_term_url) + "&start=" + str(start)

    soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
    time.sleep(1)

    results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})

If we were to add print(results) at the end of the for start loop, we should see a mass of html text as output, which would confirm that BeautifulSoup is working correctly!
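If you'd rather not scroll through a wall of HTML, a lighter sanity check (just an optional extra) is to print how many job cards were found on each page inside the same loop:

    # Optional sanity check: count the job cards found on this page
    print(f'Page starting at result {start}: {len(results)} postings found')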

  8. So now let's add the types of data we'd like to pull from the webpage (date posted, company name, job title, salary and summary). To do this we declare a variable (e.g. job) and use the .find() function, passing in the html element we wish to select (e.g. 'div') and the attributes of that element, e.g. {'data-tn-element': 'jobTitle'}.

TIP

You can inspect elements on a webpage by opening the browser's built-in dev tools and using the 'Inspector' feature or equivalent (F12 key in Chrome and Firefox).

Next we need to place the elements we want to select within another for loop, this time for x in results:. So the start of your inner scraping loop (just below the results = line) should look like this:

    for x in results:

        job = x.find('a', attrs={'data-tn-element': "jobTitle"})
        # The attributes (or attrs) can be obtained
        # by looking through the elements on a webpage in the browser dev tools.

To pull the data out in a readable format we use an if statement together with the .text attribute and .strip() method. The result can then be printed to display the output, for example:

        if job:
            job_str = job.text.strip() # strip the text and assign to a variable
            print('Job Title:', job_str) # display the job title to screen

We can keep doing this for a variety of different html elements, including span and div, e.g.:

        company = x.find('span', attrs={'class': 'company'})
        if company:
            company_str = company.text.strip()
            print('Company:', company_str)

Make sure to account for there being no data within a field (e.g. the salary):

        salary = x.find('span', attrs={'class': "salaryText"})
        if salary:
            salary_str = salary.text.strip()
            print('Salary:', salary_str)

        else: # For postings that have no salaryText data
            print('Salary: Not listed')
            salary_str = 'Not listed'

I also included the summary field in mine which has a brief job description from the posting:

        summary = x.find('div', attrs={'class': 'summary'})
        if summary:
            summary_str = summary.text.strip()
            print('Summary:', summary_str)

At the end of this inner loop, make sure to append the collected fields to the job_postings list if you'd like to write the results to a file rather than just printing them on screen, e.g.:

        append_postings = date_str, company_str, job_str, salary_str, summary_str
        job_postings.append(append_postings)

So far, you should have a scraping for loop which resembles the following (feel free to add/remove elements as you see fit):

for start in range(0, max_results_int, 10):

    source = 'https://www.indeed.co.uk/jobs?q=' + \
        str(search_term_url) + '&l=' + \
        str(location_term_url) + "&start=" + str(start)

    soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
    time.sleep(1)

    results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})

    for x in results:

        # Default values in case an element is missing from a posting
        date_str = company_str = job_str = salary_str = summary_str = 'Not listed'

        date = x.find('span', attrs={'class': 'date'})
        if date:
            date_str = date.text.strip()
            print('Posted:', date_str)

        company = x.find('span', attrs={'class': 'company'})
        if company:
            company_str = company.text.strip()
            print('Company:', company_str)

        job = x.find('a', attrs={'data-tn-element': "jobTitle"})
        if job:
            job_str = job.text.strip()
            print('Job Title:', job_str)

        salary = x.find('span', attrs={'class': "salaryText"})
        if salary:
            salary_str = salary.text.strip()
            print('Salary:', salary_str)

        else:
            print('Salary: Not listed')
            salary_str = 'Not listed'

        summary = x.find('div', attrs={'class': 'summary'})
        if summary:
            summary_str = summary.text.strip()
            print('Summary:', summary_str)

        # Append the collected fields at the end of each posting
        append_postings = date_str, company_str, job_str, salary_str, summary_str
        job_postings.append(append_postings)

        print('----------') # to separate the elements on screen
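If the repeated if/else blocks feel verbose, one optional refactor (my suggestion, not part of the original script) is a small helper that returns the stripped text or a default value:

def get_text(element, default='Not listed'):
    # Return the stripped text of a BeautifulSoup element, or a default if it's missing
    return element.text.strip() if element else default

# Example usage inside the for x in results: loop
# company_str = get_text(x.find('span', attrs={'class': 'company'}))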
  9. We're nearly there - well done for making it this far! If you'd like the extra functionality of writing to a CSV file, I created a simple yes/no response function which writes the data to a job_scraper.csv file. Please see the code below:
def to_csv_file(response):
    yes = {'yes', 'y', 'ye', ''}
    no = {'no', 'n'}

    if response in yes:
        columns = ['Posted', 'Company', 'JobTitle',
                   'Salary', 'Summary']  # Specify dataframe columns
        df = pd.DataFrame(job_postings, columns=columns) # Write to pandas dataframe
        df.to_csv('job_scraper.csv', encoding='utf-8-sig', index=False) # Write to csv file
        print(
            f'Done, wrote {len(job_postings)} job postings for {search_term} in {location_term} to job_scraper.csv.')
        exit() # Print finished statement and exit
    elif response in no:
        input('\nPress Enter to exit ')
        exit()
    else:
        print("Please respond with 'yes' or 'no' and try your search again.")
        exit()

Don't forget to call this function!

# Ask user if they'd like results written to file.
choice = input('\nWould you like the results written to a file? ').lower()
# Call the to_csv_file() function with the user's response
to_csv_file(choice)
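Once the file has been written, you can quickly verify the output by loading it back into pandas in a separate Python session (just a quick check, not part of the scraper itself):

# Quick verification of the generated CSV (run separately after the scraper finishes)
import pandas as pd

df = pd.read_csv('job_scraper.csv')
print(df.head())                 # first few postings
print(len(df), 'postings in file')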
  10. That's it! All that's left to do is run the program via an IDE or a terminal that accepts input (e.g. bash or cmd.exe).

Many thanks for reading!

This code can definitely be improved (this is the first scraper I have built!), so if you have any suggestions please get in touch via my contact page.

Full code below:


# Import required modules
from bs4 import BeautifulSoup
import urllib.request
import time
import pandas as pd

job_postings = []  # Create empty job_postings list to append to

# Search term
search_term = input('Please enter a job title to search and press enter: ')
# Replaces spaces with + so url works correctly
search_term_url = search_term.replace(" ", "+")
# Location term
location_term = input('Please enter a location to search and press enter: ')
# As above
location_term_url = location_term.replace(" ", "+")
# Request the maximum number of results to fetch
max_results = input(
    'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
    'and press enter: ')

# Checks to ensure max_results input is an integer
try:
    max_results_int = int(max_results)
except ValueError:
    print('Please enter a number and try again.')
    exit(1)

for start in range(0, max_results_int, 10):

    source = 'https://www.indeed.co.uk/jobs?q=' + \
        str(search_term_url) + '&l=' + \
        str(location_term_url) + "&start=" + str(start)

    soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
    time.sleep(1)

    results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})

    for x in results:

        # Default values in case an element is missing from a posting
        date_str = company_str = job_str = salary_str = summary_str = 'Not listed'

        date = x.find('span', attrs={'class': 'date'})
        if date:
            date_str = date.text.strip()
            print('Posted:', date_str)

        company = x.find('span', attrs={'class': 'company'})
        if company:
            company_str = company.text.strip()
            print('Company:', company_str)

        job = x.find('a', attrs={'data-tn-element': "jobTitle"})
        if job:
            job_str = job.text.strip()
            print('Job Title:', job_str)

        salary = x.find('span', attrs={'class': "salaryText"})
        if salary:
            salary_str = salary.text.strip()
            print('Salary:', salary_str)

        else:
            print('Salary: Not listed')
            salary_str = 'Not listed'

        summary = x.find('div', attrs={'class': 'summary'})
        if summary:
            summary_str = summary.text.strip()
            print('Summary:', summary_str)

        # Append the collected fields at the end of each posting
        append_postings = date_str, company_str, job_str, salary_str, summary_str
        job_postings.append(append_postings)

        print('----------')


def to_csv_file(response):
    yes = {'yes', 'y', 'ye', ''}
    no = {'no', 'n'}

    if response in yes:
        # Write to dataframe
        columns = ['Posted', 'Company', 'JobTitle',
                   'Salary', 'Summary']  # Specify dataframe columns
        df = pd.DataFrame(job_postings, columns=columns)
        # Write to csv file
        df.to_csv('job_scraper.csv', encoding='utf-8-sig', index=False)
        # Print finished statement - will build in errors in future
        print(
            f'Done, wrote {len(job_postings)} job postings for {search_term} in {location_term} to job_scraper.csv.')
        exit()
    elif response in no:
        input('\nPress Enter to exit ')
        exit()
    else:
        print("Please respond with 'yes' or 'no' and try your search again.")
        exit()


# Ask user if they'd like results written to file.
choice = input('\nWould you like the results written to a file? ').lower()
# Call the to_csv_file() function with the user's response
to_csv_file(choice)