This tutorial shows code snippets that let you extract all the dates from a text file. The dates can appear in various formats without being skipped. The main library used is datefinder, a lesser-known library that is well suited to this kind of job. Let's first import the libraries.

GitHub for code: LINK

# Import the libraries needed for extracting dates:
import datefinder
import pandas as pd


def find_dates_in_text(text: str, sentence_store: bool = True) -> pd.DataFrame:
    """Find all dates in a text and return them in a DataFrame.

    Args:
    - text: the contents of one text file (here a CV) that we want to get the dates out of
    - sentence_store: if True, store the sentence together with the date; if False, store only the date.
    Output:
    - pandas DataFrame with the CV dates and, if chosen, the relevant sentences
    Required libraries:
    - datefinder, pandas as pd"""

    # Treat question marks, newlines and commas as sentence boundaries.
    characters_to_replace = ["?", "\n", ","]
    for item in characters_to_replace:
        text = text.replace(item, ".")
    sentences = text.split(".")

    date_store = []
    for sentence in sentences:
        # datefinder yields a datetime object for every date it recognises.
        for match in datefinder.find_dates(sentence):
            # Wrapping the match in a Series makes print_date a one-element Series of strings.
            print_date = pd.to_datetime(pd.Series(match)).dt.strftime('%d/%m/%Y')
            if sentence_store:
                date_store.append([print_date, sentence])
            else:
                date_store.append(print_date)
    return pd.DataFrame(date_store)
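
To make the splitting step more concrete, here is a minimal sketch of what it does to a short piece of text. The helper name `split_into_sentences` and the sample string are made up for illustration, and unlike the function above it also drops the empty fragments that the split leaves behind.

```python
def split_into_sentences(text: str) -> list[str]:
    """Hypothetical helper mirroring the splitting step of find_dates_in_text."""
    # Turn '?', newlines and ',' into '.', exactly as the function above does.
    for char in ["?", "\n", ","]:
        text = text.replace(char, ".")
    # Assumption: empty fragments are not useful, so they are dropped here.
    return [part.strip() for part in text.split(".") if part.strip()]


# Made-up sample text, loosely based on a CV line.
sample = "jan 2014 – jul 2014 EAL Apeldoorn\nStarted on May 16, 2015?"
print(split_into_sentences(sample))
# ['jan 2014 – jul 2014 EAL Apeldoorn', 'Started on May 16', '2015']
```

Note how the comma in "May 16, 2015" spreads that date over two fragments. Dates written with commas or dots may therefore be picked up only partially, which is worth keeping in mind when choosing the characters to replace.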

Below is a quick example of how to load an Excel file whose contents can then be searched for dates.

# Example of loading an Excel file into Python.
# Depending on the data in the file, you may need to adjust the row and column values.
import openpyxl
path = "CV_sample_test.xlsx"
wb_obj = openpyxl.load_workbook(path)

# Read the cell that holds the CV text from the active sheet.
sheet_obj = wb_obj.active
cell_obj = sheet_obj.cell(row=2, column=2)
# Running the function
find_dates_in_text(cell_obj.value, True)
# Function output
0	1
0	0 21/12/2013 dtype: object	2013 – 2013 Minor Embedded Vision Design aan 
1	0 21/12/2014 dtype: object	2011 – 2014 HBO Embedded Systems Engineering 
2	0 21/12/2011 dtype: object	2008 – 2011 MBO ICT-Beheerder Niv
3	0 04/12/2022 dtype: object	4 aan de ROC Aventus te Apeldoorn
4	0 06/12/2022 dtype: object	2022 Qt 6 Core Advanced with C++
5	0 06/12/2022 dtype: object	2022 Qt 6 Core Intermediate with C++
6	0 06/12/2022 dtype: object	2022 Qt 6 Core Beginners with C++
7	0 21/12/2021 dtype: object	2021 Complete Python Developer 2021: Zero to 
8	0 07/12/2022 dtype: object	2015 Oracle Certified Associate (OCA): Java 
9	0 21/01/2015 dtype: object	jan 2015 – heden CIMSOLUTIONS B
10	0 21/01/2014 dtype: object	jan 2014 – jul 2014 EAL Apeldoorn B
11	0 21/07/2014 dtype: object	jan 2014 – jul 2014 EAL Apeldoorn B
12	0 21/08/2013 dtype: object	aug 2013 – dec 2014 DNVN Software Engineer
13	0 21/12/2014 dtype: object	aug 2013 – dec 2014 DNVN Software Engineer
14	0 21/08/2010 dtype: object	aug 2010 – feb 2011 AVOO Software Engineer
15	0 21/02/2011 dtype: object	aug 2010 – feb 2011 AVOO Software Engineer
16	0 21/08/2009 dtype: object	aug 2009 – feb 2010 Paradigit Computers B
17	0 21/02/2010 dtype: object	aug 2009 – feb 2010 Paradigit Computers B
18	0 21/04/2022 dtype: object	BRANCHE: Medisch PERIODE: april 2022 – nu
19	0 21/01/2022 dtype: object	BRANCHE: Industrie PERIODE: jan 2022 – 
20	0 21/04/2022 dtype: object	BRANCHE: Industrie PERIODE: jan 2022 – 
21	0 21/11/2021 dtype: object	BRANCHE: Cosmetica/Consument PERIODE: nov 
22	0 21/02/2022 dtype: object	BRANCHE: Cosmetica/Consument PERIODE: nov 
23	0 21/08/2021 dtype: object	BRANCHE: Cosmetica/Consument PERIODE: aug 
24	0 21/02/2022 dtype: object	BRANCHE: Cosmetica/Consument PERIODE: aug 
25	0 21/02/2021 dtype: object	BRANCHE: Medisch PERIODE: feb 2021 – aug 
26	0 21/08/2021 dtype: object	BRANCHE: Medisch PERIODE: feb 2021 – aug 
27	0 19/12/2022 dtype: object	en wegens het COVID-19 virus is er besloten 
28	0 19/12/2022 dtype: object	en wegens het COVID-19 virus is er besloten 
29	0 19/12/2022 dtype: object	Implementeren algoritme COVID-19 in 
30	0 21/07/2020 dtype: object	BRANCHE: Medisch PERIODE: jul 2020 – mrt 
31	0 21/01/2018 dtype: object	BRANCHE: Medisch PERIODE: jan 2018 – jun 
32	0 21/06/2020 dtype: object	BRANCHE: Medisch PERIODE: jan 2018 – jun 

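Column 0 shows the date that was found, formatted as day/month/year. Column 1 shows the original text that contained the date. Not all of the text is shown, only the snippets that contained date information. Where the text gives only a year, or a month and a year, the missing parts appear to be filled in from the date on which the code was run, which would explain why so many entries fall on the 21st. The output also contains some false positives: "COVID-19", for example, is read as the 19th.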
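
The "0" and "dtype: object" text around each date in the output is what a one-element pandas Series looks like when it is stored inside a DataFrame cell. Since datefinder yields ordinary datetime objects, one possible way to get a plain string instead is to call strftime on each match directly. The sketch below is an assumption about how that could look, not part of the original code: `format_match` is a hypothetical helper, and a hand-built datetime stands in for a match returned by datefinder.

```python
from datetime import datetime


def format_match(match: datetime) -> str:
    """Hypothetical helper: format one matched date as a day/month/year string."""
    return match.strftime("%d/%m/%Y")


# Stand-in for a datetime that datefinder.find_dates would yield.
example_match = datetime(2014, 7, 21)
print(format_match(example_match))  # 21/07/2014
```

Using such a helper in place of the `pd.to_datetime(pd.Series(match))` line would store a plain string such as 21/07/2014 in the first column.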
