
AI tools

On this page, we briefly describe how we use AI to extract relevant information from the PDFs.
The program works in three steps:
Fetch the PDF
Extract information
Update database

We use a serverless environment to perform all the actions below; the code is executed on our behalf, without us managing any servers.
Fetch PDF
Each submission includes a PDF with the proposal, which is stored in Coda. We built a robot that scans the database on a regular basis to detect new submissions. As soon as a new submission is detected, the robot downloads the PDF.
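The detection step above can be sketched as a small helper that scans the JSON returned by Coda's rows endpoint for rows carrying a PDF link. The column name `PDF_URL` is hypothetical (the real column names are redacted in the appendix):

```python
def parse_new_submissions(response_json, url_column="PDF_URL"):
    """Return (row_id, pdf_url) pairs for rows that carry a PDF link.

    `response_json` is the body returned by Coda's
    `GET /docs/{docId}/tables/{tableId}/rows` endpoint; `url_column`
    is a hypothetical column holding the PDF URL.
    """
    new_items = []
    for item in response_json.get("items", []):
        pdf_url = item["values"].get(url_column, "")
        if pdf_url:  # skip rows without an attached PDF
            new_items.append((item["id"], pdf_url))
    return new_items
```

In the full pipeline, each returned URL is then downloaded and the row id is kept so the database can be updated later.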
Extract information

The second and most important step aims to extract the text from the PDF and analyse it. We use two APIs provided by AWS to:
Extract text
Extract language and keywords
Extract text

For instance, we send the following PDF to the API
Screenshot 2021-04-04 at 09.32.28.png
and we get the following results
Screenshot 2021-04-04 at 09.33.20.png
When it comes to analysing the text, there are a few steps to take into account:
Remove special characters
Remove accents, and other characteristics specific to the French language
Remove the stop words

Remove special characters
From the second image, we can see that some characters, such as "(", ".", ";", are irrelevant for the analysis, so we need to remove them. In the image below, the parentheses are gone.
Screenshot 2021-04-04 at 09.36.59.png
⚠️ Note that we use this cleaned text to count the number of words.
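The cleaning steps above (strip accents, drop special characters, lowercase) can be sketched with Python's standard `unicodedata` and `re` modules, the same approach used in the appendix:

```python
import re
import unicodedata

def clean_text(text):
    """Strip accents (NFD-decompose, drop combining marks), then keep
    only alphanumeric characters and lowercase the result."""
    no_accents = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    return re.sub(r"[^A-Za-z0-9]+", " ", no_accents).lower().strip()

print(clean_text("Développement durable (été 2021); résultats."))
# -> developpement durable ete 2021 resultats
```

NFD normalization splits accented letters into a base letter plus a combining mark (category `Mn`), so dropping the marks removes French accents without losing the letters themselves.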
Remove the stop words
Stop words are common words that carry no relevant information for the analysis: pronouns, linking words, etc.
Example
'i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',

After this rough cleaning, we get the following text, which we can send to the second API.
Screenshot 2021-04-04 at 09.42.20.png
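Stop-word removal boils down to a set-membership filter. A minimal sketch, using the short example list above for illustration (the full pipeline uses NLTK's English and French stop-word lists):

```python
# Tiny illustrative list; the real pipeline uses
# nltk.corpus.stopwords for English and French.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stop_words("we describe our method and you can reuse it"))
# -> describe method and can reuse it
```

Using a `set` rather than a list makes each membership test O(1), which matters when filtering thousands of tokens.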
Extract language and keywords

We need to fetch two pieces of information:
The language used
The most relevant keywords

The language used
First, we send the extracted text to the AWS Comprehend service via the detect_dominant_language API. The API returns the language and a confidence score:
{'LanguageCode': 'en', 'Score': 0.9502699971199036}
For the PDF analysed, the model is 95% confident that the language is English.
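The full detect_dominant_language response wraps one such entry per candidate language in a `Languages` list (the appendix simply reads the first entry). Picking the highest-scoring one can be sketched as:

```python
def dominant_language(response):
    """Pick the top language code from a Comprehend
    detect_dominant_language response."""
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"]

# Sample response built from the values shown in the document:
sample = {"Languages": [{"LanguageCode": "en", "Score": 0.9502699971199036}]}
print(dominant_language(sample))  # -> en
```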
The most relevant keywords
We use the detected language to tell the service which language to work in when extracting keywords. For that, we call the AWS Comprehend detect_key_phrases API. It returns a list of key phrases found in the PDF, each with a confidence score. For the sake of our analysis, we only keep the key phrases with a score above 0.9:
[{'Score': 0.9918580651283264,
  'Text': 'spatio temporal social dimensions',
  'BeginOffset': 0,
  'EndOffset': 33},
 {'Score': 0.9929715394973755,
  'Text': 'human wildlife interactions',
  'BeginOffset': 34,
  'EndOffset': 61},
 {'Score': 0.9508159756660461,
  'Text': 'urban environment background human wildlife co',
  'BeginOffset': 62,
  'EndOffset': 108},
 ...]
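The 0.9 cut-off described above is a simple score filter over the `KeyPhrases` list, as in the appendix:

```python
def top_key_phrases(response, threshold=0.9):
    """Keep only key phrases whose confidence score beats the threshold."""
    return [kp["Text"] for kp in response["KeyPhrases"] if kp["Score"] > threshold]

# Sample response shaped like the one shown above:
sample = {"KeyPhrases": [
    {"Score": 0.99, "Text": "human wildlife interactions"},
    {"Score": 0.42, "Text": "background"},
]}
print(top_key_phrases(sample))  # -> ['human wildlife interactions']
```

The threshold is a trade-off: lowering it yields more keywords but lets noisier phrases through.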
Update database
We have in hand the following information:
Language
Extracted text
Extracted clean text
List of most relevant keywords
Number of words

We can update the database with the new information.
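The update is a single `PUT` to Coda's row endpoint. The payload construction can be sketched as below; the column names mirror the appendix, and as there, the keyword list is passed through as-is:

```python
def build_update_payload(language, text, clean_text, keywords, n_words):
    """Build the body for Coda's `PUT .../rows/{rowId}` endpoint."""
    return {
        "row": {
            "cells": [
                {"column": "LANGUAGE", "value": language},
                {"column": "TEXT", "value": text},
                {"column": "TEXT_CLEAN", "value": clean_text},
                {"column": "KEYWORDS", "value": keywords},
                {"column": "NB_WORDS", "value": n_words},
                # Flag the row so the robot skips it on the next pass.
                {"column": "STATUS_PDF", "value": "updated"},
            ],
        },
    }
```

Setting `STATUS_PDF` back to `updated` is what closes the loop: the next scheduled run only queries rows still flagged `to_add`.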
Conclusion
All these steps help us get real-time information about each application, and a key summary before diving in.
Appendix
Here is the code used (without the deployment):
import os
import re
import tempfile
import time
import unicodedata

import boto3
import requests
from nltk.corpus import stopwords

final_stopwords_list = stopwords.words('english') + stopwords.words('french')

## Load creds
token = 'XXXXX'
headers_coda = {'Authorization': 'Bearer {}'.format(token)}
s3_client = boto3.resource('s3', region_name='eu-west-3')
client = boto3.client('textract')
client_language = boto3.client('comprehend', region_name='eu-west-2')

## Fetch the rows flagged as "to_add"
URI_APPLICATION = 'https://coda.io/apis/v1/docs/XXX/tables/XXX/rows'
params = {'query': 'STATUS_PDF:"to_add"'}
request_application = requests.get(URI_APPLICATION, headers=headers_coda, params=params)

### SAVE PDF
list_payload = []
if len(request_application.json()['items']) > 0:
    for i, val in enumerate(request_application.json()['items']):
        if len(val['values']['XXX']) > 0:
            ID_ROW = val['id']
            URL = val['values']['XXX']
            download_file = requests.get(URL, allow_redirects=False)
            FILENAME_NEW_PDF = "{}.pdf".format(val['values']['XXX'].upper())
            LOCAL_PATH_FILE = os.path.join(tempfile.gettempdir(), FILENAME_NEW_PDF)
            PATH_S3_KEY = os.path.join(
                'DATA/SUSTAINABLE_DEVELOPMENT/APPLICATION_CALL',
                FILENAME_NEW_PDF
            )

            with open(LOCAL_PATH_FILE, "wb") as file:
                file.write(download_file.content)

            ### Save to S3 (Textract reads the PDF from there)
            s3_client.Bucket('datalake-datascience').upload_file(LOCAL_PATH_FILE, PATH_S3_KEY)

            ### Analyse the PDF with Textract (asynchronous job)
            response = client.start_document_analysis(
                DocumentLocation={
                    'S3Object': {
                        'Bucket': 'datalake-datascience',
                        'Name': PATH_S3_KEY
                    }
                },
                FeatureTypes=['TABLES', 'FORMS']
            )

            ### Poll until the job is done
            response_pdf_analysis = client.get_document_analysis(JobId=response['JobId'])
            status = response_pdf_analysis['JobStatus']
            while status == "IN_PROGRESS":
                time.sleep(5)  # avoid hammering the API while the job runs
                response_pdf_analysis = client.get_document_analysis(JobId=response['JobId'])
                status = response_pdf_analysis['JobStatus']

            ### The results are paginated; follow NextToken to collect every page
            list_pages = [response_pdf_analysis]
            while 'NextToken' in response_pdf_analysis:
                response_pdf_analysis = client.get_document_analysis(
                    JobId=response['JobId'],
                    NextToken=response_pdf_analysis['NextToken']
                )
                list_pages.append(response_pdf_analysis)

            ### Fetch text
            final_text = []
            for page in list_pages:
                for text in page['Blocks']:
                    if text['BlockType'] == 'LINE':
                        final_text.append(text['Text'])
            final_text_join = ' '.join(final_text)

            ### Clean text: strip accents, keep alphanumerics, lowercase
            final_text_join = ''.join(
                c for c in unicodedata.normalize('NFD', final_text_join)
                if unicodedata.category(c) != 'Mn'
            )
            final_text_join = re.sub('[^A-Za-z0-9]+', ' ', final_text_join).lower()

            ### Remove stop words and count words
            to_check = ' '.join(
                word for word in final_text_join.split(' ')
                if word not in final_stopwords_list
            )
            COUNT_WORD = len(final_text_join.split(' '))

            ### Fetch language (text truncated to the first 5000 characters)
            response = client_language.detect_dominant_language(Text=to_check[:5000])
            LANGUAGE = response['Languages'][0]['LanguageCode']

            ### Fetch keywords
            response = client_language.detect_key_phrases(
                Text=to_check[:5000],
                LanguageCode=LANGUAGE
            )
            KEYWORDS = [i['Text'] for i in response['KeyPhrases'] if i['Score'] > .9]

            ### Update database
            uri = 'https://coda.io/apis/v1/docs/XX/tables/XX/rows/{}'.format(ID_ROW)
            payload = {
                'row': {
                    'cells': [
                        {'column': 'LANGUAGE', 'value': LANGUAGE},
                        {'column': 'TEXT', 'value': final_text_join},
                        {'column': 'TEXT_CLEAN', 'value': to_check},
                        {'column': 'KEYWORDS', 'value': KEYWORDS},
                        {'column': 'NB_WORDS', 'value': COUNT_WORD},
                        {'column': 'STATUS_PDF', 'value': 'updated'},
                    ],
                },
            }
            req = requests.put(uri, headers=headers_coda, json=payload)
            res = req.json()
            list_payload.append({'payload': payload, 'response': res})