Learn how to build a resume parser using Python
A resume parser is like a smart assistant that reads through resumes and picks out key details for you. It scans resumes to identify and pull out important information like names, contact details, education, work experience, skills, and more. It does all the heavy lifting of sorting and organizing this information so you can easily review it.
In simple terms, a resume parser reads resumes and converts the unstructured text into structured data. This structured data can be easily stored in a database and used for further analysis or processing. The main purpose of a resume parser is to save time and effort for recruiters by automating the data extraction process. Instead of manually reading through hundreds of resumes, a parser can quickly and accurately extract the necessary details.
Resume parsers are incredibly important in HR and recruitment. They help streamline the hiring process by automating data extraction, reducing manual errors, and matching candidates to job requirements faster.
Before we build our own resume parser, let's watch how our resume scraper extracts all the details from a sample resume. Here's the video:
In this article, we will discuss how to build a resume parser using Python. We will cover everything from the basics of resume parsing to the detailed steps of coding a parser. The post will include practical examples and a complete Python resume parser tutorial. You will learn how to extract resume data, automate resume parsing, and use libraries and tools to create an efficient resume parser project. By the end of the tutorial, you'll be able to implement a resume parser script and apply it to real-world scenarios in HR tech.
We will cover the benefits of building a resume parser, the prerequisites, a step-by-step build of a simple parser, advanced enhancements, and testing and evaluation.
Building a resume parser offers many benefits for both HR professionals and recruitment processes. Here are some of the key advantages:
A resume parser significantly speeds up the hiring process. By automating resume parsing with Python, you can quickly scan through hundreds or thousands of resumes and identify the most qualified candidates without spending countless hours on manual review. This tutorial walks you through building such a tool yourself.
Entering data by hand can cause mistakes, like typos or missing details. A resume parser extracts the data accurately and consistently, and this reliability supports better hiring decisions. A machine-learning-based resume parser can apply more advanced methods to further improve extraction quality.
By building a resume parser, you can improve the candidate matching process. The parser can extract information such as skills, experience, and education, which can be used to match candidates with job requirements more accurately. This leads to better hiring outcomes and a more efficient recruitment process. The use of NLP resume parser techniques ensures that the extracted information is relevant and useful.
Integrating a resume parser into your HR technology stack enhances overall efficiency. It allows for smooth data transfer between different HR systems and databases. A resume parsing software can integrate with applicant tracking systems (ATS) and other HR tools to provide a comprehensive solution. Building a resume parser with Python enables customization and scalability, ensuring that the tool meets your specific HR needs.
To build a resume parser using Python, you’ll need some basic knowledge and tools. Here’s what you should know:
Required Skills and Tools
You should have a basic understanding of Python. This includes knowing how to write Python code, use functions, and work with libraries. If you’re new to Python, you can start with a beginner’s tutorial to get up to speed. Knowing Python is essential because you’ll be writing scripts and using Python libraries to build the resume parser.
pandas: pandas is a library for working with structured data. In a resume parser, it is handy for storing the extracted fields in a table that you can review or export.

import pandas as pd

# Create a DataFrame to store resume data
data = {
    'Name': ['Alice Johnson', 'Bob Smith'],
    'Email': ['alice@example.com', 'bob@example.com'],
    'Phone': ['123-456-7890', '987-654-3210']
}
df = pd.DataFrame(data)

# Display the DataFrame
print(df)
re: Python's built-in re module handles regular expressions, which are used for finding patterns in text such as email addresses and phone numbers.

import re
# Sample text from a resume
text = "Contact me at alice@example.com or call 123-456-7890."
# Find email addresses
email = re.search(r'\S+@\S+', text)
print(f"Email: {email.group()}")
# Find phone numbers
phone = re.search(r'\d{3}-\d{3}-\d{4}', text)
print(f"Phone: {phone.group()}")
nltk: The Natural Language Toolkit (NLTK) provides tools for processing text, such as splitting it into words (tokenizing) and tagging parts of speech.

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text from a resume
text = "Alice Johnson has 5 years of experience in software engineering."
# Tokenize the text
tokens = word_tokenize(text)
print(f"Tokens: {tokens}")
# Tag parts of speech
tagged = pos_tag(tokens)
print(f"Tagged: {tagged}")
spacy: spaCy is another NLP library known for its speed and efficiency. It can be used for advanced text processing tasks, such as extracting entities, understanding sentence structure, and more. spaCy complements NLTK and is particularly useful for building a robust resume parser.
import spacy
# Load the English model
nlp = spacy.load('en_core_web_sm')
# Sample text from a resume
text = "Alice Johnson worked at TechCorp from 2018 to 2022."
# Process the text
doc = nlp(text)
# Extract named entities
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
By combining these libraries, you can build a powerful and efficient resume parser. For example, you can use pandas for data handling, re for extracting specific patterns, and nltk and spacy for advanced text processing.
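As a quick illustration of how these pieces fit together, here is a small sketch (the sample sentence and field names are made up for demonstration) that uses re for the email, spaCy for the name, and pandas to present the result:

import re
import spacy
import pandas as pd

nlp = spacy.load('en_core_web_sm')

text = "Alice Johnson is a data analyst. Reach her at alice@example.com."

# re: pull out the email address with a simple pattern
email = re.search(r'\S+@\S+', text)

# spaCy: pick out the first PERSON entity as the candidate name
doc = nlp(text)
names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

# pandas: collect the extracted fields into a one-row table
df = pd.DataFrame([{
    'Name': names[0] if names else None,
    'Email': email.group() if email else None
}])
print(df)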
A virtual environment is a tool that helps you manage dependencies and keep your projects isolated from each other. It ensures that each project has its own set of libraries and versions, avoiding conflicts between them.
Here’s how you can set up a virtual environment:
Open your command line or terminal and navigate to your project directory. Run the following command to create a virtual environment:
python -m venv myenv
Here, myenv is the name of your virtual environment. You can choose any name you like.
Once the virtual environment is created, you need to activate it.

On Windows:

myenv\Scripts\activate

On macOS/Linux:

source myenv/bin/activate
With the virtual environment active, install the required libraries using pip:
pip install pandas nltk spacy
This ensures that these libraries are installed only in your virtual environment and not globally on your system.
By setting up a virtual environment, you keep your project’s dependencies separate and organized. This makes it easier to manage and avoid issues with different projects.
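The examples that follow also load spaCy's small English model, so download it inside the activated environment as well (this is spaCy's standard model download command):

python -m spacy download en_core_web_sm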
Here’s a step-by-step guide to building a simple resume parser using Python. This example will cover extracting basic information like names, contact details, and skills from a resume. We will use libraries such as pandas, re (regular expressions), and nltk for natural language processing.
We’ve already set up the environment for our resume parser tool, so the next step is to import the required libraries.
Create a new file called resume_parser.py and import the libraries:
- pandas helps with organizing and managing data.
- re is used for finding patterns in text, like extracting email addresses.
- nltk and spacy are for processing and understanding the text from resumes.
- punkt (NLTK data) is used for breaking text into words and sentences.
- wordnet (NLTK data) is used for working with words and their meanings.

Here's what it looks like in code:
import pandas as pd
import re
import nltk
import spacy
# Download NLTK data
nltk.download('punkt')
nltk.download('wordnet')
# Load SpaCy model
nlp = spacy.load('en_core_web_sm')
To extract useful information from resumes, you’ll create several functions to pull out different types of data like names, contact details, and skills. Here’s how you can set this up:
This function finds names in the text using SpaCy’s named entity recognition (NER) tool:
def extract_name(text):
    # Process the text with SpaCy
    doc = nlp(text)
    # Look for entities labeled as 'PERSON', which are typically names
    names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    # Return the first name found, or None if no names were found
    return names[0] if names else None
The function returns the first PERSON entity found, or None if no names are detected.

Extract Contact Details
This function finds phone numbers and email addresses using regular expressions:
def extract_contact_details(text):
    # Patterns to find phone numbers and emails
    phone_pattern = re.compile(r'\+?\d[\d -]{8,12}\d')
    email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
    # Find matches in the text
    phone_numbers = phone_pattern.findall(text)
    emails = email_pattern.findall(text)
    return {
        'phone_numbers': phone_numbers,
        'emails': emails
    }
This function identifies specific skills mentioned in the text:
def extract_skills(text):
    # List of common skills to look for
    skills = ['Python', 'Java', 'SQL', 'Machine Learning', 'Data Analysis']
    skills_found = [skill for skill in skills if skill.lower() in text.lower()]
    return skills_found
These functions help you automatically extract important information from resumes, such as names, contact details, and skills, making the resume processing more efficient and accurate.
Create a function to parse the entire resume and extract the information using the above functions. This function processes the entire text of a resume to extract key information like the name, contact details, and skills.
def parse_resume(resume_text):
    # Extract the name from the resume text
    name = extract_name(resume_text)
    # Extract contact details (phone numbers and emails) from the resume text
    contact_details = extract_contact_details(resume_text)
    # Extract skills mentioned in the resume text
    skills = extract_skills(resume_text)
    # Return the extracted information in a structured format
    return {
        'Name': name,
        'Contact Details': contact_details,
        'Skills': skills
    }
- def parse_resume(resume_text): defines a function named parse_resume that takes resume_text as an argument. This text is the content of the resume you want to analyze.
- name = extract_name(resume_text) calls the extract_name function with the resume text. It looks for names using SpaCy and returns the detected name.
- contact_details = extract_contact_details(resume_text) calls the extract_contact_details function. It uses regular expressions to find phone numbers and email addresses and returns them in a dictionary.
- skills = extract_skills(resume_text) calls the extract_skills function. It checks whether any of the predefined skills are mentioned in the text and returns a list of those skills.
- return {'Name': name, 'Contact Details': contact_details, 'Skills': skills} returns a dictionary containing all the extracted information.

The code snippet below tests your resume parser function to make sure it extracts information as expected.
if __name__ == "__main__":
    # Sample resume text to test the parser
    sample_resume = """
    John Doe
    Phone: +123 456 7890
    Email: john.doe@example.com
    Skills: Python, Java, Data Analysis
    """
    # Parse the sample resume text
    parsed_data = parse_resume(sample_resume)
    # Print the parsed data as a DataFrame
    print(pd.DataFrame([parsed_data]))
Check if the Script is Run Directly:
if __name__ == "__main__": is a standard Python construct. It checks whether the script is being run directly (not imported as a module in another script). If it is, the code inside this block executes.

Sample Resume Text:
sample_resume = """ ... """ holds a multi-line string representing a sample resume, including a name, phone number, email address, and skills.

Parse the Sample Resume:
parsed_data = parse_resume(sample_resume) calls the parse_resume function with the sample_resume text. The function processes the resume, extracts the name, contact details, and skills, and stores the result in the parsed_data variable.

Print the Results:
print(pd.DataFrame([parsed_data])) converts parsed_data (a dictionary) into a pandas DataFrame and prints it, making the extracted information easy to read and analyze in a structured format.

Here is the complete source code for your resume parser:
import pandas as pd
import re
import nltk
import spacy

# Download NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Load SpaCy model
nlp = spacy.load('en_core_web_sm')

def extract_name(text):
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return names[0] if names else None

def extract_contact_details(text):
    phone_pattern = re.compile(r'\+?\d[\d -]{8,12}\d')
    email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
    phone_numbers = phone_pattern.findall(text)
    emails = email_pattern.findall(text)
    return {
        'phone_numbers': phone_numbers,
        'emails': emails
    }

def extract_skills(text):
    skills = ['Python', 'Java', 'SQL', 'Machine Learning', 'Data Analysis']
    skills_found = [skill for skill in skills if skill.lower() in text.lower()]
    return skills_found

def parse_resume(resume_text):
    name = extract_name(resume_text)
    contact_details = extract_contact_details(resume_text)
    skills = extract_skills(resume_text)
    return {
        'Name': name,
        'Contact Details': contact_details,
        'Skills': skills
    }

if __name__ == "__main__":
    sample_resume = """
    John Doe
    Phone: +123 456 7890
    Email: john.doe@example.com
    Skills: Python, Java, Data Analysis
    """
    parsed_data = parse_resume(sample_resume)
    print(pd.DataFrame([parsed_data]))
To enhance the resume parser and make it more sophisticated, you can integrate advanced NLP techniques, handle various resume formats, and improve accuracy. Here’s how you can expand on the basic resume parser:
Let’s go through the code for training a custom Named Entity Recognition (NER) model with SpaCy. This code enables you to build a model specifically designed to identify particular entities in resumes, such as names and job titles.
This code trains a customized Named Entity Recognition (NER) model to identify specific entities, such as names and job titles, in resumes. It begins with a pre-existing base model, introduces custom entity labels, prepares the training data, conducts the training process, and then saves the trained model to a file. This specialized model will more effectively detect relevant entities in resumes compared to a generic model.
# Import necessary libraries from SpaCy
import spacy
from spacy.training import Example

# Load a base model
nlp = spacy.load('en_core_web_sm')

# Define your training data (this is a simple example; ideally use a larger dataset)
TRAIN_DATA = [
    ("John Doe is a Data Scientist", {"entities": [(0, 8, "PERSON"), (14, 28, "JOB_TITLE")]}),
    ("Jane Smith worked as a Software Engineer", {"entities": [(0, 10, "PERSON"), (23, 40, "JOB_TITLE")]})
]

# Get the NER component from the pipeline
ner = nlp.get_pipe('ner')

# Add the new label to the NER component
ner.add_label("JOB_TITLE")

# Prepare training data as Example objects
training_data = []
for text, annot in TRAIN_DATA:
    doc = nlp.make_doc(text)
    example = Example.from_dict(doc, annot)
    training_data.append(example)

# Resume training on the existing pipeline (keeps the pre-trained weights)
optimizer = nlp.resume_training()
for epoch in range(10):
    losses = {}
    nlp.update(training_data, drop=0.5, sgd=optimizer, losses=losses)
    print(f"Epoch {epoch}, Losses: {losses}")

# Save the model
nlp.to_disk('custom_ner_model')
- import spacy imports SpaCy, a library for natural language processing; from spacy.training import Example provides the class used to build training examples.
- nlp = spacy.load('en_core_web_sm') loads a pre-trained SpaCy model (en_core_web_sm) as a starting point for training. This model includes basic NLP capabilities.
- TRAIN_DATA is a list of tuples, where each tuple contains a text and its annotations in the form {"entities": [(start, end, label)]}; start and end are the character positions of the entity and label is the entity type (e.g., "PERSON" or "JOB_TITLE").
- ner = nlp.get_pipe('ner') retrieves the Named Entity Recognition (NER) component from the pipeline of the base model.
- ner.add_label("JOB_TITLE") adds a new label ("JOB_TITLE") to the NER component so it can learn to recognize job titles in the text.
- for text, annot in TRAIN_DATA: iterates over the training data; doc = nlp.make_doc(text) converts the text into a SpaCy document, example = Example.from_dict(doc, annot) creates a training example from the document and annotations, and training_data.append(example) adds it to the list of training examples.
- optimizer = nlp.resume_training() prepares the existing pipeline for further training without discarding its pre-trained weights.
- for epoch in range(10): runs the training process for 10 epochs (iterations); losses = {} initializes a dictionary to track losses; nlp.update(training_data, drop=0.5, sgd=optimizer, losses=losses) updates the model with the training data, where drop=0.5 randomly drops some activations during training to reduce overfitting; print(f"Epoch {epoch}, Losses: {losses}") prints the loss values after each epoch to monitor progress.
- nlp.to_disk('custom_ner_model') saves the trained model to the custom_ner_model directory, so you can reuse it later without retraining.

The next enhancement uses the transformers library to apply a pre-trained BERT model for Named Entity Recognition. By loading a specialized model fine-tuned for NER, you can identify and classify entities such as names, job titles, and other key information in text. The pipeline function simplifies applying the model, and the extract_entities function formats the results for easy use.
pip install transformers
This command installs the transformers library from Hugging Face, which provides pre-trained models and tools for natural language processing tasks.
from transformers import pipeline
Import: The pipeline function from the transformers library makes it easy to use pre-trained models for different tasks, such as Named Entity Recognition (NER).
# Load a pre-trained model for NER
nlp = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
Load Model: This line creates an NER pipeline using a pre-trained BERT model fine-tuned on the CoNLL-03 dataset, which is designed for entity recognition tasks.
"dbmdz/bert-large-cased-finetuned-conll03-english" is a specific BERT variant fine-tuned for English NER. It’s trained to understand context and identify entities in text more effectively.def extract_entities(text):
entities = nlp(text)
return [(ent['word'], ent['entity']) for ent in entities]
Function: extract_entities takes a string of text and processes it to identify named entities.
- nlp(text) applies the NER pipeline to the provided text, returning a list of detected entities.
- [(ent['word'], ent['entity']) for ent in entities] extracts and formats the entities into a list of tuples, where ent['word'] is the word or phrase identified as an entity and ent['entity'] is the entity type (e.g., PERSON, ORG).

# Example usage
text = "John Doe, an experienced Data Scientist with skills in Python and Machine Learning."
entities = extract_entities(text)
print(entities)
"John Doe, an experienced Data Scientist with skills in Python and Machine Learning." is a sample string containing names, job titles, and skills.extract_entities(text): Calls the function to extract entities from the sample text.print(entities): Displays the extracted entities and their types.When dealing with resumes, you might encounter them in various formats such as PDF, DOCX (Microsoft Word), and plain text. The provided code shows how to use Python libraries to extract text from these formats.
pip install python-docx pdfplumber
This command installs the python-docx and pdfplumber libraries:
- python-docx handles DOCX files (Microsoft Word).
- pdfplumber handles PDF files and allows detailed extraction of text from PDF pages.

import pdfplumber
from docx import Document

- pdfplumber is the library for extracting text and information from PDF files.
- Document from docx is the class from python-docx used to open DOCX files.

def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''.join(page.extract_text() for page in pdf.pages)
    return text
extract_text_from_pdf takes the path to a PDF file and extracts its text:
- pdfplumber.open(pdf_path) opens the PDF file.
- pdf.pages retrieves all pages from the PDF.
- page.extract_text() extracts text from each page.
- ''.join(...) combines text from all pages into a single string.
- return text returns the extracted text.

def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
    return text
extract_text_from_docx takes the path to a DOCX file and extracts its text:
- Document(docx_path) opens the DOCX file.
- doc.paragraphs retrieves all paragraphs from the document.
- paragraph.text gets the text of each paragraph.
- '\n'.join(...) joins all paragraph texts with newline characters to maintain formatting.
- return text returns the combined text.

def extract_text_from_file(file_path):
    if file_path.endswith('.pdf'):
        return extract_text_from_pdf(file_path)
    elif file_path.endswith('.docx'):
        return extract_text_from_docx(file_path)
    else:
        with open(file_path, 'r') as file:
            return file.read()
Function: extract_text_from_file handles different file formats based on the file extension.
- file_path.endswith('.pdf') checks if the file is a PDF and, if so, uses the PDF extraction function.
- file_path.endswith('.docx') checks if the file is a DOCX and, if so, uses the DOCX extraction function.
- Otherwise the file is assumed to be plain text: it is opened in read mode and its contents are returned with file.read().

Resumes often contain several distinct sections. To effectively process these sections, we can use regular expressions to locate and extract the text of each one. Here's how the code achieves this:
def extract_sections(text):
    sections = {}
    section_titles = ['Education', 'Experience', 'Skills', 'Certifications']
    for title in section_titles:
        pattern = re.compile(rf'{title}\n(.*?)(?=\n[A-Z])', re.DOTALL)
        match = pattern.search(text)
        if match:
            sections[title] = match.group(1).strip()
        else:
            sections[title] = 'Not found'
    return sections
extract_sections takes the entire resume text and extracts the content of specific sections:
- sections = {} initializes an empty dictionary to store the extracted sections.
- section_titles is a list of the section titles we want to extract from the resume (e.g., 'Education', 'Experience').

for title in section_titles:
    pattern = re.compile(rf'{title}\n(.*?)(?=\n[A-Z])', re.DOTALL)
    match = pattern.search(text)
    if match:
        sections[title] = match.group(1).strip()
    else:
        sections[title] = 'Not found'
Loop: Goes through each section title to find and get the related text.
- pattern = re.compile(rf'{title}\n(.*?)(?=\n[A-Z])', re.DOTALL) defines a pattern that matches the section title followed by its content:
  - {title} inserts the section title into the pattern.
  - \n(.*?) captures the text after the title, up to the next section.
  - (?=\n[A-Z]) ensures the capture stops before the next section title that starts with a capital letter.
  - re.DOTALL makes the . in the pattern match newline characters, so content spanning multiple lines is captured.
- match = pattern.search(text) searches the resume text for the section matching the pattern.
- if match checks whether a match was found; sections[title] = match.group(1).strip() stores the section content (with extra whitespace removed) in the dictionary.
- else: if no match is found, sections[title] = 'Not found' is stored for that section title.

text = """
Education
Bachelor of Science in Computer Science
Experience
Data Scientist at XYZ Corp
Skills
Python, SQL, Machine Learning
Certifications
Certified Data Scientist
"""
sections = extract_sections(text)
print(sections)
- text is the sample resume text with different sections.
- sections = extract_sections(text) calls the extract_sections function to process the sample text.
- print(sections) outputs the extracted sections to the console.
This method breaks down resumes into organized sections, making it easier to process and analyze the information.
Preprocessing and normalization are crucial steps in preparing text for further analysis. The provided code helps clean and standardize text data, making it more suitable for processing. Here’s a detailed explanation of each part of the code:
Objective: Prepare the text by removing unnecessary elements and standardizing it to improve the accuracy and efficiency of text analysis.
def preprocess_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space
    text = re.sub(r'\W+', ' ', text)  # Remove non-word characters
    return text
Function: preprocess_text cleans and prepares the input text for easier analysis later on.
- text = text.lower() converts all characters to lowercase so that matching is case-insensitive.
- text = re.sub(r'\s+', ' ', text) replaces runs of whitespace with a single space; \s+ matches one or more whitespace characters (like spaces or tabs).
- text = re.sub(r'\W+', ' ', text) removes punctuation and symbols; \W+ matches one or more non-word characters (anything that is not a letter, digit, or underscore).
- return text returns the cleaned text.

clean_text = preprocess_text(sample_resume)

- clean_text stores the result of the preprocessing.
- preprocess_text(sample_resume) calls the function on sample_resume, which should contain the raw resume text; clean_text now holds the cleaned and normalized version.

This preprocessing step makes the text consistent and easier to analyze, improving the accuracy and efficiency of subsequent text processing tasks.
Objective: Use advanced methods to get more meaningful and accurate features from text data.
Word Embeddings: These represent words as vectors, capturing their meanings based on context. Examples include Word2Vec, GloVe, and FastText.
Topic Modeling: Techniques like Latent Dirichlet Allocation (LDA) uncover the main topics in the text, giving insights into its content and structure.
Example: Combining these methods can improve feature extraction from resumes by understanding the nuances of language and context better.
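As a rough sketch of the topic-modeling idea (not part of the parser above, and assuming scikit-learn is installed; the resume snippets are illustrative), you could run LDA over a handful of resume texts like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

resume_texts = [
    "Python developer with machine learning and data analysis experience",
    "Accountant skilled in bookkeeping, QuickBooks and financial reporting",
    "Marketing specialist focused on public relations and consumer behavior",
]

# Convert the texts into a bag-of-words matrix
vectorizer = CountVectorizer(stop_words='english')
doc_term_matrix = vectorizer.fit_transform(resume_texts)

# Fit a small LDA model (2 topics for this toy example)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term_matrix)

# Print the top words for each topic
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_terms = [terms[i] for i in topic.argsort()[-5:]]
    print(f"Topic {idx}: {top_terms}")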
Objective: Make sure the data extracted from resumes is accurate and follows known patterns or formats.
def validate_phone_numbers(phone_numbers):
    valid_phone_pattern = re.compile(r'\+?\d[\d -]{8,12}\d')
    return [num for num in phone_numbers if valid_phone_pattern.match(num)]
validated_phones = validate_phone_numbers(contact_details['phone_numbers'])
print(validated_phones)
1. Function Definition: validate_phone_numbers

def validate_phone_numbers(phone_numbers): defines a function that takes a list of phone numbers and returns only the valid ones.

2. Regular Expression Pattern: valid_phone_pattern

valid_phone_pattern = re.compile(r'\+?\d[\d -]{8,12}\d') creates a pattern for matching phone numbers:
- \+? matches an optional plus sign at the start.
- \d matches a digit.
- [\d -]{8,12} matches a sequence of digits, spaces, or dashes, between 8 and 12 characters long.
- \d matches the final digit in the phone number.

Purpose: Defines a pattern to recognize valid phone numbers, allowing for different formats and separators.

3. Validation Logic: List Comprehension

return [num for num in phone_numbers if valid_phone_pattern.match(num)]:
- [num for num in phone_numbers] loops through each phone number in the list.
- if valid_phone_pattern.match(num) checks if the phone number fits the defined pattern.
- return gathers and returns only the phone numbers that match the pattern.

4. Applying Validation: validated_phones

validated_phones = validate_phone_numbers(contact_details['phone_numbers']):
- contact_details['phone_numbers'] gets the list of phone numbers extracted from the resume.
- validate_phone_numbers(...) uses the validation function to filter out any invalid phone numbers.

5. Output: print(validated_phones)

print(validated_phones) prints the list of validated phone numbers to the console.

This method makes sure that your extracted data is accurate and follows the right formats, which improves the quality and usefulness of the information.
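You can apply the same idea to email addresses. Here is a minimal sketch (validate_emails is an illustrative helper, not part of the original code), reusing the earlier email pattern with anchors added:

def validate_emails(emails):
    # Keep only well-formed addresses by anchoring the pattern to the whole string
    valid_email_pattern = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')
    return [email for email in emails if valid_email_pattern.match(email)]

validated_emails = validate_emails(contact_details['emails'])
print(validated_emails)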
Here’s how the full code might look with these enhancements:
import pandas as pd
import re
import nltk
import spacy
from transformers import pipeline
import pdfplumber
from docx import Document  # Ensure you have python-docx installed

# Download NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Load SpaCy model
nlp = spacy.load('en_core_web_sm')

def extract_name(text):
    doc = nlp(text)
    names = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    return names[0] if names else None

def extract_contact_details(text):
    phone_pattern = re.compile(r'\+?\d[\d -]{8,12}\d')
    email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
    phone_numbers = phone_pattern.findall(text)
    emails = email_pattern.findall(text)
    return {
        'phone_numbers': phone_numbers,
        'emails': emails
    }

def extract_skills(text):
    skills = ['Python', 'Java', 'SQL', 'Machine Learning', 'Data Analysis']
    skills_found = [skill for skill in skills if skill.lower() in text.lower()]
    return skills_found

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'\W+', ' ', text)
    return text

def extract_sections(text):
    sections = {}
    section_titles = ['Education', 'Experience', 'Skills', 'Certifications']
    for title in section_titles:
        pattern = re.compile(rf'{title}\s*([\s\S]*?)(?=\n\s*\w|$)', re.IGNORECASE)
        match = pattern.search(text)
        if match:
            sections[title] = match.group(1).strip()
        else:
            sections[title] = 'Not found'
    return sections

def extract_text_from_pdf(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = ''.join(page.extract_text() for page in pdf.pages)
    return text

def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    text = '\n'.join(paragraph.text for paragraph in doc.paragraphs)
    return text

def extract_text_from_file(file_path):
    if file_path.endswith('.pdf'):
        return extract_text_from_pdf(file_path)
    elif file_path.endswith('.docx'):
        return extract_text_from_docx(file_path)
    else:
        with open(file_path, 'r') as file:
            return file.read()

def parse_resume(resume_text):
    resume_text = preprocess_text(resume_text)
    name = extract_name(resume_text)
    contact_details = extract_contact_details(resume_text)
    skills = extract_skills(resume_text)
    sections = extract_sections(resume_text)
    return {
        'Name': name,
        'Contact Details': contact_details,
        'Skills': skills,
        'Sections': sections
    }

if __name__ == "__main__":
    sample_resume = """
    John Doe
    Phone: +123 456 7890
    Email: john.doe@example.com
    Skills: Python, Java, Data Analysis
    Education
    Bachelor of Science in Computer Science
    Experience
    Data Scientist at XYZ Corp
    Certifications
    Certified Data Scientist
    """
    parsed_data = parse_resume(sample_resume)
    print(pd.DataFrame([parsed_data]))
Name ... Sections
0 john ... {'Education': 'bachelor of science in computer...
[1 rows x 4 columns]
Process finished with exit code 0
You can customize this code to extract additional relevant information. Alternatively, you can point the parser at a resume hosted at a URL or at a folder of PDF or DOCX resumes; a sketch for the URL case follows the sample output below. Here is the most relevant output:
{
"Contact Details": {
"emails": [
"careerservices@bellevue.edu",
"imasample1@xxx.com",
"imasample2@xxx.com",
"imasample3@xxx.com",
"imasample4@xxx.com",
"imasample5@xxx.com",
"imasample6@xxx.com",
"imasample7@xxx.com",
"imasample8@xxx.com",
"imasample9@xxx.com",
"imasample10@xxxx.net"
],
"phone_numbers": [
"(308) 308-3083",
"(402) 291-5432",
"(402) 291-5678",
"(402) 292-2345",
"(402) 489-3421",
"(402) 493-1234",
"(402) 555-9876",
"(402) 557-7423",
"(800) 756-7920",
"(402) 543-1234"
]
},
"Name": "A. Sample",
"Sections": {
"Certifications": "Not found",
"Education": "Bachelor of Science, Bellevue University, Bellevue, NE (in progress) Major: Accounting Minor: Computer Information Systems Expected graduation date: January, 20xx GPA to date: 3.95/4.00",
"Work History": [
{
"Position": "Student Intern, Financial Accounting Development Program",
"Company": "Mutual of Omaha, Omaha, NE",
"Dates": "Summer 20xx"
},
{
"Position": "Accounting Coordinator",
"Company": "Nebraska Special Olympics, Omaha, NE",
"Dates": "20xx-20xx"
},
{
"Position": "Bookkeeper",
"Company": "SMC, Inc., Omaha, NE",
"Dates": "20xx – 20xx"
},
{
"Position": "Bookkeeper",
"Company": "First United Methodist Church, Altus, OK",
"Dates": "20xx – 20xx"
}
],
"Professional Affiliations": [
"Member, IMA, Bellevue University Student Chapter"
],
"Computer Skills": [
"Proficient in MS Office (Word, Excel, PowerPoint, Outlook), QuickBooks",
"Basic knowledge of MS Access, SQL, Visual Basic, C++"
],
"Additional Sections": {
"Objective": "Internship or part-time position in marketing, public relations or related field utilizing strong academic background and excellent communication skills.",
"Education": "BS in Business Administration with Marketing Emphasis, Bellevue University, Bellevue, NE Expected graduation date: June, 20xx GPA to date: 3.56/4.00",
"Relevant Coursework": [
"Principles of Marketing",
"Business Communication",
"Internet Marketing",
"Consumer Behavior",
"Public Relations",
"Business Policy & Strategy"
],
"Work History": [
{
"Position": "Academic Tutor",
"Company": "Bellevue University, Bellevue, NE",
"Dates": "20xx to Present"
},
{
"Position": "Senior Accounts Receivable Clerk",
"Company": "Lincoln Financial Group, Omaha, NE",
"Dates": "20xx-20xx"
}
],
"Community Service": [
"Advertising Coordinator, The Vue (20xx to Present)",
"Volunteer, Publicity Committee (20xx, 20xx), Brushup Nebraska Paint-a-thon"
],
"Added Value": {
"Language Skills": "Bilingual (English/Spanish)",
"Computer Skills": [
"MS Office (Word, Excel, PowerPoint)",
"Photoshop"
]
},
"References": "Available upon request"
}
}
}
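As mentioned before the sample output, you can also feed the parser a resume hosted at a URL. Here is a minimal sketch, assuming the requests library is installed and using a placeholder address (https://example.com/resume.pdf is not a real resume):

import requests
import pdfplumber
from io import BytesIO

def extract_text_from_url(url):
    # Download the file into memory
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Treat the downloaded bytes as a PDF and extract its text
    with pdfplumber.open(BytesIO(response.content)) as pdf:
        return ''.join(page.extract_text() or '' for page in pdf.pages)

# Usage (with the placeholder URL):
# resume_text = extract_text_from_url('https://example.com/resume.pdf')
# print(parse_resume(resume_text))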
Testing and evaluating your resume parser is crucial to ensure that it accurately extracts and processes information. Here’s how to effectively test and evaluate your Python resume parser:
To validate the accuracy of your resume parser, follow these methods:
Test the extract_name(), extract_contact_details(), and extract_skills() functions individually with different inputs to ensure they return the expected results, and spot-check the parser's output against resumes you have reviewed manually.

A test suite is a collection of tests designed to evaluate the functionality of your parser. Here's how you can create one: use a framework such as unittest or pytest to automate the testing process, and write test scripts that feed sample resumes into the parser and check whether the output matches the expected results.

Example of a Simple Test Suite Using unittest:
import unittest
from resume_parser import parse_resume  # Assuming your parser functions are in resume_parser.py

class TestResumeParser(unittest.TestCase):
    def setUp(self):
        self.sample_resume1 = """
        John Doe
        Phone: +123 456 7890
        Email: john.doe@example.com
        Skills: Python, Java, Data Analysis
        """
        self.expected_output1 = {
            'Name': 'John Doe',
            'Contact Details': {
                'phone_numbers': ['+123 456 7890'],
                'emails': ['john.doe@example.com']
            },
            'Skills': ['Python', 'Java', 'Data Analysis'],
            'Sections': {
                'Education': 'Not found',
                'Experience': 'Not found',
                'Skills': 'Python, Java, Data Analysis',
                'Certifications': 'Not found'
            }
        }

    def test_parse_resume(self):
        parsed_data = parse_resume(self.sample_resume1)
        self.assertEqual(parsed_data, self.expected_output1)

if __name__ == '__main__':
    unittest.main()
In this test suite, setUp() prepares the sample data and expected results. test_parse_resume() checks if the parser’s output matches the expected results.
To evaluate the performance of your resume parser, use metrics such as precision (the percentage of extracted items that are correct) and recall (the percentage of relevant items that were successfully extracted).
Example Calculation
def calculate_precision(true_positive, false_positive):
    # Precision: share of extracted items that were actually correct
    return true_positive / (true_positive + false_positive) * 100

def calculate_recall(true_positive, false_negative):
    # Recall: share of relevant items that were successfully extracted
    return true_positive / (true_positive + false_negative) * 100
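For instance, suppose the parser extracted 10 skills, 8 of which were correct, and missed 2 skills that the resume actually listed (illustrative numbers only):

precision = calculate_precision(true_positive=8, false_positive=2)  # 80.0
recall = calculate_recall(true_positive=8, false_negative=2)        # 80.0
print(f"Precision: {precision:.1f}%, Recall: {recall:.1f}%")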
To improve parsing accuracy, expand the predefined skills list, refine the regular expressions for contact details and sections, and retrain the custom NER model on a larger set of labeled resumes.
Building a resume parser with Python is a project that involves both simple and advanced techniques. Here’s a summary of the journey from start to finish:
We started by setting up the basics. Using Python libraries like pandas, re, and nltk, we built a simple parser. This parser could extract names, contact details, and basic skills from resumes. This first step helped us understand the key parts of resume parsing, like getting text from files, matching patterns, and organizing data.
Next, we made the parser smarter by adding advanced natural language processing (NLP) techniques. We used SpaCy’s named entity recognition (NER) and pre-trained models from the transformers library. This made the parser better at finding and categorizing complex information, such as job titles and certifications. These improvements made the parser more accurate and reliable.
We made the parser able to handle different resume formats, like PDFs and DOCX files. Using libraries like pdfplumber and python-docx, we ensured that our parser could extract text from various file types. This step made the parser more versatile and useful in real-world situations where resumes come in different formats.
To make sure our parser worked well, we focused on making it more accurate and efficient. We used techniques like text normalization and pattern validation to deal with inconsistencies in resumes. We also evaluated the parser’s performance using metrics like precision and recall, which helped us fine-tune and improve its accuracy.
Testing was a crucial part of our process. We created a thorough test suite and did manual checks to make sure the parser could handle different scenarios and give reliable results. Regular testing and evaluation helped us find and fix potential problems, ensuring the parser was robust and effective.
Looking ahead, there are many ways to improve the resume parser. We can use machine learning models to better understand resume content, expand the training data for greater accuracy, and refine the extraction algorithms. Also, integrating user feedback and continuous testing will help maintain and improve the parser’s performance over time.
Building a resume parser using Python involves combining basic and advanced techniques to create a tool that is both functional and accurate. From the initial setup to sophisticated NLP enhancements, each step contributes to a more powerful and efficient resume parsing solution. With ongoing improvements and adaptations, your Python-based resume parser can become a valuable tool for streamlining the hiring process and extracting important information from resumes.
To build a strong resume parser using Python, the following libraries and tools are helpful: pandas for organizing data, re for pattern matching, nltk and spaCy for natural language processing, the transformers library for advanced named entity recognition, and pdfplumber and python-docx for reading PDF and DOCX files.
What is a resume parser?
A resume parser is a tool that extracts and organizes information from resumes, making it easier to analyze and use for hiring decisions.

How does a resume parser work?
A resume parser works by using text extraction techniques to pull out key details like names, contact information, skills, and work experience from a resume.

Which file formats can the parser handle?
Our resume parser can handle multiple file formats, including PDF and DOCX files.

Which libraries does the parser use?
We use Python libraries like pandas for data handling, re for text pattern matching, SpaCy for advanced NLP tasks, and pdfplumber and python-docx for handling different file formats.

How accurate is the resume parser?
The accuracy of the resume parser depends on the quality of the resume format and the robustness of the text extraction and NLP techniques used. Regular updates and testing help improve its accuracy.

Can the parser handle resumes in multiple languages?
Yes, with the help of advanced NLP models like those from the transformers library, our resume parser can handle and process resumes in multiple languages.