Time to read: 10 minutes
So for a fun little project I decided to build a job scraper for the popular job listings website Indeed. I've used the UK site (Indeed.co.uk), but it should work for any country variant (just swap the .co.uk in the source url for your country's domain).
I've also built in a simple command line input for ease of use.
TIP
For this script to work you will need the libraries BeautifulSoup (for parsing the scraped html) and pandas (only needed if you want an output csv/excel file) installed.
If you'd like to find out more about them (including how to install them) please visit their documentation pages.
So before we begin, let's examine the way Indeed.co.uk loads its search terms into the url. For example, if we type 'data scientist' into the search bar and 'london' into the location field, we get the following url in the address bar:
https://www.indeed.co.uk/jobs?q=data+scientist&l=london
Notice that the url has 'q=' (for query) followed by the job title, with any spaces replaced by a plus symbol '+', and '&l=' (for location) followed by the location.
We can therefore use this information to build our desired job title and location into the url, and create simple input prompts in Python for them.
In addition, if we click to the next page of results, we get the following added to the url: &start=10. This tells Indeed to start the results at the 10th posting (i.e. page two, since each page shows 10 results). With this we can create a for loop which will step through multiple pages for us automatically.
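As a quick aside, the standard library can build this query string for us; here's a minimal sketch (separate from the script below) using urllib.parse.urlencode, which replaces the spaces and escapes any special characters automatically:
# A small illustration (not part of the final script): urlencode builds the
# same query string as above and swaps spaces for '+' for us.
from urllib.parse import urlencode

params = {'q': 'data scientist', 'l': 'london', 'start': 10}
print('https://www.indeed.co.uk/jobs?' + urlencode(params))
# https://www.indeed.co.uk/jobs?q=data+scientist&l=london&start=10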
So, with that in mind - let's begin!
# Import required modules
from bs4 import BeautifulSoup # for parsing the html data
import urllib.request # for requesting the webpage
import time # to provide a brief delay in requesting data from Indeed
import pandas as pd # for writing the results to csv file (omit if not required)
job_postings = []
# Search term
search_term = input('Please enter a job title to search and press enter: ')
# Replaces spaces with + so url works correctly
search_term_url = search_term.replace(" ", "+")
# Location term
location_term = input('Please enter a location to search and press enter: ')
# Replaces spaces with + so url works correctly (again)
location_term_url = location_term.replace(" ", "+")
# Requests number of inputs
max_results = input(
'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
'and press enter: ')
# Checks max_results input to ensure it is an integer
try:
max_results_int = int(max_results)
except ValueError:
print('Please enter a number and try again.')
exit(1) # Exits program so user can restart
for start in range(0, max_results_int, 10):
source = 'https://www.indeed.co.uk/jobs?q=' + \
str(search_term_url) + '&l=' + \
str(location_term_url) + "&start=" + str(start)
We then use urllib.request to open the source url, and BeautifulSoup to parse the returned data using the built-in 'html.parser'. Note: I have added a time.sleep(1) here to avoid requesting data from the server too frequently (no more than once a second).
soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
time.sleep(1)
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})
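If Indeed rejects requests from the default urllib user agent (you would see an HTTP error raised by urlopen), one option is to send a browser-like User-Agent header instead; a minimal sketch, where the header value is just an example:
# Optional: send a browser-like User-Agent header if the default urllib one
# is rejected (the header value below is only an example).
request = urllib.request.Request(source, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urllib.request.urlopen(request).read(), 'html.parser')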
So far your code should resemble the following:
# Import required modules
from bs4 import BeautifulSoup
import urllib.request
import time
import pandas as pd
job_postings = [] # Create empty job_postings list to append to
# Search term
search_term = input('Please enter a job title to search and press enter: ')
# Replaces spaces with + so url works correctly
search_term_url = search_term.replace(" ", "+")
# Location term
location_term = input('Please enter a location to search and press enter: ')
# As above
location_term_url = location_term.replace(" ", "+")
# Requests number of inputs
max_results = input(
'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
'and press enter: ')
# Checks to ensure max_results input is an integer
try:
max_results_int = int(max_results)
except ValueError:
print('Please enter a number and try again.')
exit(1)
for start in range(0, max_results_int, 10):
source = 'https://www.indeed.co.uk/jobs?q=' + \
str(search_term_url) + '&l=' + \
str(location_term_url) + "&start=" + str(start)
soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
time.sleep(1)
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})
If we were to add print(results) to the end of the for start loop, we should see a mass of html text as output, which would confirm that BeautifulSoup is working correctly!
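A slightly tidier check is to print how many job cards were found on each page instead of the raw html; placed just after the results = line inside the loop, something like the following (if it prints 0, Indeed may have changed its markup or blocked the request):
# Quick sanity check: how many 'organicJob' cards did this page return?
print(f'Results starting at {start}: found {len(results)} job postings')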
To pull out the details we want from each posting, we use BeautifulSoup's .find() function, passing in the html element we wish to select (e.g. 'div') and the attributes of that element, e.g. {'data-tn-element': 'jobTitle'}.
TIP
You can inspect elements on a webpage by opening up the browser's built-in dev tools and using the 'Inspector' feature or equivalent (F12 key in Chrome and Firefox).
Next, we need to place the elements we want to select within another for loop, this time for x in results:. So the start of your inner scraping loop (just below the results = line) should look like this:
for x in results:
job = x.find('a', attrs={'data-tn-element': "jobTitle"})
# The attributes (or attrs) can be obtained
# by looking through the elements on a webpage in the browser dev tools.
To strip the data into a readable format we use an if statement and the .text.strip() method in Python. The result can then be passed to print to display the output, for example:
if job:
job_str = job.text.strip() # strip the text and assign to a variable
print('Job Title:', job_str) # display the job title to screen
We can keep doing this for a variety of different html elements, including span and div, e.g.:
company = x.find('span', attrs={'class': 'company'})
if company:
company_str = company.text.strip()
print('Company:', company_str)
Make sure to account for there being no data within a field (e.g. salary):
salary = x.find('span', attrs={'class': "salaryText"})
if salary:
salary_str = salary.text.strip()
print('Salary:', salary_str)
else: # For postings that have no salaryText data
print('Salary: Not listed')
salary_str = 'Not listed'
I also included the summary field in mine, which contains a brief job description from the posting:
summary = x.find('div', attrs={'class': 'summary'})
if summary:
summary_str = summary.text.strip()
print('Summary:', summary_str)
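One thing to watch out for: only the salary check has an else branch, so if a posting is missing one of the other fields, the matching *_str variable would keep its value from the previous posting (or not exist at all for the first one). A simple guard, assuming the same variable names as above, is to reset them at the top of each iteration:
for x in results:
    # Reset every field so a missing element can't reuse the previous posting's value
    date_str = company_str = job_str = salary_str = summary_str = 'Not listed'
    # ...then the same .find() checks as above...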
At the end of this loop, once all the elements have been searched for, collect the fields into an append_postings tuple and add it to job_postings if you'd like to write the results to a file rather than just print them on screen, e.g.:
append_postings = date_str, company_str, job_str, salary_str, summary_str
job_postings.append(append_postings)
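As a small design note, you could equally append a dict here (assuming the same *_str variables as above), so each value is labelled at the point it is collected and pandas can pick up the column names automatically later:
# Alternative: append a dict so the keys become the dataframe columns,
# e.g. pd.DataFrame(job_postings) with no explicit columns list.
job_postings.append({
    'Posted': date_str,
    'Company': company_str,
    'JobTitle': job_str,
    'Salary': salary_str,
    'Summary': summary_str,
})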
So far, you should have a scraping for loop which resembles the following (feel free to add/remove elements as you see fit):
for start in range(0, max_results_int, 10):
source = 'https://www.indeed.co.uk/jobs?q=' + \
str(search_term_url) + '&l=' + \
str(location_term_url) + "&start=" + str(start)
soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
time.sleep(1)
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})
for x in results:
date = x.find('span', attrs={'class': 'date'})
if date:
date_str = date.text.strip()
print('Posted:', date_str)
company = x.find('span', attrs={'class': 'company'})
if company:
company_str = company.text.strip()
print('Company:', company_str)
job = x.find('a', attrs={'data-tn-element': "jobTitle"})
if job:
job_str = job.text.strip()
print('Job Title:', job_str)
salary = x.find('span', attrs={'class': "salaryText"})
if salary:
salary_str = salary.text.strip()
print('Salary:', salary_str)
else:
print('Salary: Not listed')
salary_str = 'Not listed'
summary = x.find('div', attrs={'class': 'summary'})
if summary:
summary_str = summary.text.strip()
print('Summary:', summary_str)
append_postings = date_str, company_str, job_str, salary_str, summary_str
job_postings.append(append_postings)
print('----------') # to separate the elements on screen
Finally, we define a function to handle the user's choice of whether to write the results to a csv file:
def to_csv_file(response):
yes = {'yes', 'y', 'ye', ''}
no = {'no', 'n'}
if response in yes:
columns = ['Posted', 'Company', 'JobTitle',
'Salary', 'Summary'] # Specify dataframe columns
df = pd.DataFrame(job_postings, columns=columns) # Write to pandas dataframe
df.to_csv('job_scraper.csv', encoding='utf-8-sig', index=False) # Write to csv file
print(
f'Done, wrote {max_results_int} job postings for {search_term} in {location_term} to job_scraper.csv.')
exit() # Print finished statement and exit
elif response in no:
input('\nPlease press any key to exit ')
exit()
else:
print("Please respond with 'yes' or 'no' and try your search again.")
exit()
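A quick note on the arguments to to_csv(): encoding='utf-8-sig' writes a byte-order mark so that Excel opens the file with the correct character encoding, and index=False stops pandas from writing its row index as an extra column.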
Don't forget to call this function!
# Ask user if they'd like results written to file.
choice = input('\nWould you like the results written to a file? ').lower()
# Call the to_csv_file() function with the user's choice
to_csv_file(choice)
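If you chose to write the file, a quick way to check the output (a minimal sketch, assuming the job_scraper.csv name used above) is to read it straight back with pandas:
# Optional check: read the csv back in and show the first few rows.
import pandas as pd

df = pd.read_csv('job_scraper.csv', encoding='utf-8-sig')
print(df.head())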
This code can definitely be improved (this is the first scraper I have built!), so if you have any suggestions please get in touch via my contact page.
Full code below:
# Import required modules
from bs4 import BeautifulSoup
import urllib.request
import time
import pandas as pd
job_postings = [] # Create empty job_postings list to append to
# Search term
search_term = input('Please enter a job title to search and press enter: ')
# Replaces spaces with + so url works correctly
search_term_url = search_term.replace(" ", "+")
# Location term
location_term = input('Please enter a location to search and press enter: ')
# As above
location_term_url = location_term.replace(" ", "+")
# Requests number of inputs
max_results = input(
'Please enter the max number of results to search (10, 20, 30, etc...) ' + \
'and press enter: ')
# Checks to ensure max_results input is an integer
try:
max_results_int = int(max_results)
except ValueError:
print('Please enter a number and try again.')
exit(1)
for start in range(0, max_results_int, 10):
source = 'https://www.indeed.co.uk/jobs?q=' + \
str(search_term_url) + '&l=' + \
str(location_term_url) + "&start=" + str(start)
soup = BeautifulSoup(urllib.request.urlopen(source).read(), 'html.parser')
time.sleep(1)
results = soup.find_all('div', attrs={'data-tn-component': 'organicJob'})
for x in results:
date = x.find('span', attrs={'class': 'date'})
if date:
date_str = date.text.strip()
print('Posted:', date_str)
company = x.find('span', attrs={'class': 'company'})
if company:
company_str = company.text.strip()
print('Company:', company_str)
job = x.find('a', attrs={'data-tn-element': "jobTitle"})
if job:
job_str = job.text.strip()
print('Job Title:', job_str)
salary = x.find('span', attrs={'class': "salaryText"})
if salary:
salary_str = salary.text.strip()
print('Salary:', salary_str)
else:
print('Salary: Not listed')
salary_str = 'Not listed'
summary = x.find('div', attrs={'class': 'summary'})
if summary:
summary_str = summary.text.strip()
print('Summary:', summary_str)
append_postings = date_str, company_str, job_str, salary_str, summary_str
job_postings.append(append_postings)
print('----------')
def to_csv_file(response):
yes = {'yes', 'y', 'ye', ''}
no = {'no', 'n'}
if response in yes:
# Write to dataframe
columns = ['Posted', 'Company', 'JobTitle',
'Salary', 'Summary'] # Specify dataframe columns
df = pd.DataFrame(job_postings, columns=columns)
# Write to csv file
df.to_csv('job_scraper.csv', encoding='utf-8-sig', index=False)
# Print finished statement - error handling to be added in a future version
print(
f'Done, wrote {max_results_int} job postings for {search_term} in {location_term} to job_scraper.csv.')
exit()
elif response in no:
input('\nPlease press any key to exit ')
exit()
else:
print("Please respond with 'yes' or 'no' and try your search again.")
exit()
# Ask user if they'd like results written to file.
choice = input('\nWould you like the results written to a file? ').lower()
# Call the to_csv_file() function with the user's choice
to_csv_file(choice)