This tutorial shows code snippets that let you extract all the dates from a text file. The dates can be written in various formats and will still be picked up. The main library used is datefinder, a less common library that is well suited to this type of job. Let's first import the libraries.
GitHub for code: LINK
# import the libraries needed for filtering out dates:
import datefinder
import pandas as pd
def find_dates_in_text(text: str, sentence_store: bool = True) -> pd.DataFrame:
    """Args:
    - text: the text of one file (for example a CV) that we want to get the dates out of
    - sentence_store: if True, store each date together with the sentence it was found in; if False, store only the date.
    Output:
    - pandas DataFrame with the dates and, if chosen, the relevant sentences
    Required libraries:
    - datefinder, pandas as pd"""
    # Treat question marks, newlines and commas as sentence boundaries.
    characters_to_replace = ["?", "\n", ","]
    for item in characters_to_replace:
        text = text.replace(item, ".")
    sentences = text.split(".")
    date_store = []
    for sentence in sentences:
        # datefinder yields a datetime object for every date it recognises.
        for match in datefinder.find_dates(sentence):
            # Format the datetime directly as a plain day/month/year string.
            print_date = match.strftime('%d/%m/%Y')
            if sentence_store:
                date_store.append([print_date, sentence])
            else:
                date_store.append(print_date)
    return pd.DataFrame(date_store)
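To make the splitting step of the function easier to follow, here is a minimal sketch that uses only the standard library. The helper name `split_into_chunks` and the sample text are made up for illustration. Unlike the function above, this variant also strips whitespace and drops empty chunks, which is an assumption about what you may want and not part of the original code.

```python
import re

def split_into_chunks(text: str) -> list[str]:
    """Split text on the same delimiters the function treats as sentence ends: ? newline , and ."""
    chunks = re.split(r"[?\n,.]", text)
    # Assumption: empty or whitespace-only chunks cannot hold a date, so drop them.
    return [chunk.strip() for chunk in chunks if chunk.strip()]

# Hypothetical sample text, loosely modelled on a CV.
sample = "jan 2014 - jul 2014 EAL Apeldoorn\nStarted 3 March 2015, still employed. Available now?"
chunks = split_into_chunks(sample)
print(chunks)
```

Each chunk is what would be handed to datefinder.find_dates one at a time. One consequence of splitting on the period is that a date written with dots, such as 21.12.2013, is cut into separate pieces before the date search runs, so you may want a different delimiter set for such texts.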
Below is a quick example of how to load an Excel file whose contents can be searched for dates.
# example of loading an Excel file into Python.
# depending on the data in the file, you may want to adjust the row and column values.
import openpyxl
path = "CV_sample_test.xlsx"
wb_obj = openpyxl.load_workbook(path)
sheet_obj = wb_obj.active
cell_obj = sheet_obj.cell(row = 2, column = 2)
# running the function
find_dates_in_text(cell_obj.value, True)
# function output
    0           1
0   21/12/2013  2013 – 2013 Minor Embedded Vision Design aan
1   21/12/2014  2011 – 2014 HBO Embedded Systems Engineering
2   21/12/2011  2008 – 2011 MBO ICT-Beheerder Niv
3   04/12/2022  4 aan de ROC Aventus te Apeldoorn
4   06/12/2022  2022 Qt 6 Core Advanced with C++
5   06/12/2022  2022 Qt 6 Core Intermediate with C++
6   06/12/2022  2022 Qt 6 Core Beginners with C++
7   21/12/2021  2021 Complete Python Developer 2021: Zero to
8   07/12/2022  2015 Oracle Certified Associate (OCA): Java
9   21/01/2015  jan 2015 – heden CIMSOLUTIONS B
10  21/01/2014  jan 2014 – jul 2014 EAL Apeldoorn B
11  21/07/2014  jan 2014 – jul 2014 EAL Apeldoorn B
12  21/08/2013  aug 2013 – dec 2014 DNVN Software Engineer
13  21/12/2014  aug 2013 – dec 2014 DNVN Software Engineer
14  21/08/2010  aug 2010 – feb 2011 AVOO Software Engineer
15  21/02/2011  aug 2010 – feb 2011 AVOO Software Engineer
16  21/08/2009  aug 2009 – feb 2010 Paradigit Computers B
17  21/02/2010  aug 2009 – feb 2010 Paradigit Computers B
18  21/04/2022  BRANCHE: Medisch PERIODE: april 2022 – nu
19  21/01/2022  BRANCHE: Industrie PERIODE: jan 2022 –
20  21/04/2022  BRANCHE: Industrie PERIODE: jan 2022 –
21  21/11/2021  BRANCHE: Cosmetica/Consument PERIODE: nov
22  21/02/2022  BRANCHE: Cosmetica/Consument PERIODE: nov
23  21/08/2021  BRANCHE: Cosmetica/Consument PERIODE: aug
24  21/02/2022  BRANCHE: Cosmetica/Consument PERIODE: aug
25  21/02/2021  BRANCHE: Medisch PERIODE: feb 2021 – aug
26  21/08/2021  BRANCHE: Medisch PERIODE: feb 2021 – aug
27  19/12/2022  en wegens het COVID-19 virus is er besloten
28  19/12/2022  en wegens het COVID-19 virus is er besloten
29  19/12/2022  Implementeren algoritme COVID-19 in
30  21/07/2020  BRANCHE: Medisch PERIODE: jul 2020 – mrt
31  21/01/2018  BRANCHE: Medisch PERIODE: jan 2018 – jun
32  21/06/2020  BRANCHE: Medisch PERIODE: jan 2018 – jun
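The first column (labelled 0) shows the date that was found, formatted as day/month/year. The second column (labelled 1) shows the piece of original text that contained the date. Not all of the text is shown, only the snippets that contained date information. Where the text gives only a year, or only a month and a year, the remaining day or month in the output does not come from the text; it appears to be filled in by the parser with a default value. Some rows are also false positives, such as the ones triggered by "COVID-19".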
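The Excel example above reads a single cell. If a sheet holds one text per row, the same lookup can be extended to a whole column. The following is a minimal sketch under stated assumptions: the helper name `column_values`, the sheet layout (a header in row 1, the text in column 2) and the sample rows are all hypothetical. It builds a small workbook in memory so that it runs without a file on disk; with a real file you would use openpyxl.load_workbook(path) as shown earlier.

```python
import openpyxl

def column_values(sheet, column: int, first_row: int = 2) -> list:
    """Collect the non-empty cell values of one column, starting at first_row."""
    rows = sheet.iter_rows(
        min_row=first_row, min_col=column, max_col=column, values_only=True
    )
    # With values_only=True each row is a tuple of plain values; we asked for one column.
    return [row[0] for row in rows if row[0] is not None]

# Hypothetical in-memory workbook standing in for a file such as CV_sample_test.xlsx.
workbook = openpyxl.Workbook()
sheet = workbook.active
sheet.append(["name", "cv_text"])  # header row, skipped by first_row=2
sheet.append(["A", "jan 2014 - jul 2014 EAL Apeldoorn"])
sheet.append(["B", "aug 2013 - dec 2014 DNVN Software Engineer"])
sheet.append(["C", None])  # empty cell, filtered out

texts = column_values(sheet, column=2)
print(texts)
```

Each string in `texts` could then be passed to find_dates_in_text in a loop, giving one DataFrame per row of the sheet.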