
AI tools

On this page, we briefly describe how we use AI to extract relevant information from the PDFs.
The program works in three steps:
Fetch the PDF
Extract information
Update database

We use a serverless environment to perform all the actions below; the code is executed on our behalf, without us managing any servers.
Fetch PDF
Each submission includes a PDF with the proposal, which is stored in Coda. We built a robot that scans the database on a regular basis to detect new submissions. As soon as a new submission is detected, the robot downloads the PDF.
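The detection step above can be sketched as a small helper that scans the JSON returned by Coda's rows endpoint for rows carrying a PDF link. The column name `PDF_URL` is hypothetical (the real column names are redacted in the appendix):

```python
def parse_new_submissions(response_json, url_column="PDF_URL"):
    """Return (row_id, pdf_url) pairs for rows that carry a PDF link.

    `response_json` is the body returned by Coda's
    `GET /docs/{docId}/tables/{tableId}/rows` endpoint; `url_column`
    is a hypothetical column holding the PDF URL.
    """
    new_items = []
    for item in response_json.get("items", []):
        pdf_url = item["values"].get(url_column, "")
        if pdf_url:  # skip rows without an attached PDF
            new_items.append((item["id"], pdf_url))
    return new_items
```

In the full pipeline, each returned URL is then downloaded and the row id is kept so the database can be updated later.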
Extract information

The second and most important step aims to extract the text from the PDF and analyse it. We use two APIs provided by AWS to:
Extract text
Extract language and keywords
Extract text

For instance, we send the following PDF to the API
Screenshot 2021-04-04 at 09.32.28.png
and we get the following results
Screenshot 2021-04-04 at 09.33.20.png
When it comes to analysing the text, there are a few steps to take into account:
Remove special characters
Remove accents, and other characteristics specific to the French language
Remove the stop words

Remove special characters
From the second image, we can see that some characters, such as "(", ".", ";", are irrelevant for the analysis, so we need to remove them. In the image below, the parentheses are gone.
Screenshot 2021-04-04 at 09.36.59.png
⚠️ Note that we use this cleaned text to count the number of words.
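The cleaning steps above (strip accents, drop special characters, lowercase) can be sketched with Python's standard `unicodedata` and `re` modules, the same approach used in the appendix:

```python
import re
import unicodedata

def clean_text(text):
    """Strip accents (NFD-decompose, drop combining marks), then keep
    only alphanumeric characters and lowercase the result."""
    no_accents = "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )
    return re.sub(r"[^A-Za-z0-9]+", " ", no_accents).lower().strip()

print(clean_text("Développement durable (été 2021); résultats."))
# -> developpement durable ete 2021 resultats
```

NFD normalization splits accented letters into a base letter plus a combining mark (category `Mn`), so dropping the marks removes French accents without losing the letters themselves.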
Remove the stop words
Stop words are common words that carry no relevant information for the analysis: pronouns, linking words, etc.
Example
'i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',

After this rough cleaning, we get the following text, which we can send to the second API.
Screenshot 2021-04-04 at 09.42.20.png
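Stop-word removal boils down to a set-membership filter. A minimal sketch, using the short example list above for illustration (the full pipeline uses NLTK's English and French stop-word lists):

```python
# Tiny illustrative list; the real pipeline uses
# nltk.corpus.stopwords for English and French.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word set."""
    return " ".join(w for w in text.split() if w not in stop_words)

print(remove_stop_words("we describe our method and you can reuse it"))
# -> describe method and can reuse it
```

Using a `set` rather than a list makes each membership test O(1), which matters when filtering thousands of tokens.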
Extract language and keywords

We need to fetch two pieces of information:
The language used
The most relevant keywords

The language used
First, we send the extracted text to the AWS Comprehend service via the detect_dominant_language API. The API returns the language and a confidence score:
{'LanguageCode': 'en', 'Score': 0.9502699971199036}
For the PDF analysed, the model is 95% confident that the language is English.
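The full detect_dominant_language response wraps one such entry per candidate language in a `Languages` list (the appendix simply reads the first entry). Picking the highest-scoring one can be sketched as:

```python
def dominant_language(response):
    """Pick the top language code from a Comprehend
    detect_dominant_language response."""
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"]

# Sample response built from the values shown in the document:
sample = {"Languages": [{"LanguageCode": "en", "Score": 0.9502699971199036}]}
print(dominant_language(sample))  # -> en
```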
The most relevant keywords
We use the detected language to tell the service which language to work in when extracting keywords. For that, we call the AWS Comprehend detect_key_phrases API. It returns a list of key phrases found in the PDF, each with a confidence score. For the sake of our analysis, we only keep the key phrases with a score above 0.9:
[{'Score': 0.9918580651283264,
  'Text': 'spatio temporal social dimensions',
  'BeginOffset': 0,
  'EndOffset': 33},
 {'Score': 0.9929715394973755,
  'Text': 'human wildlife interactions',
  'BeginOffset': 34,
  'EndOffset': 61},
 {'Score': 0.9508159756660461,
  'Text': 'urban environment background human wildlife co',
  'BeginOffset': 62,
  'EndOffset': 108},
 ...]
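The 0.9 cut-off described above is a simple score filter over the `KeyPhrases` list, as in the appendix:

```python
def top_key_phrases(response, threshold=0.9):
    """Keep only key phrases whose confidence score beats the threshold."""
    return [kp["Text"] for kp in response["KeyPhrases"] if kp["Score"] > threshold]

# Sample response shaped like the one shown above:
sample = {"KeyPhrases": [
    {"Score": 0.99, "Text": "human wildlife interactions"},
    {"Score": 0.42, "Text": "background"},
]}
print(top_key_phrases(sample))  # -> ['human wildlife interactions']
```

The threshold is a trade-off: lowering it yields more keywords but lets noisier phrases through.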
Update database
We have in hand the following information:
Language
Extracted text
Extracted clean text
List of most relevant keywords
Number of words

We can update the database with the new information.
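The update is a single `PUT` to Coda's row endpoint. The payload construction can be sketched as below; the column names mirror the appendix, and as there, the keyword list is passed through as-is:

```python
def build_update_payload(language, text, clean_text, keywords, n_words):
    """Build the body for Coda's `PUT .../rows/{rowId}` endpoint."""
    return {
        "row": {
            "cells": [
                {"column": "LANGUAGE", "value": language},
                {"column": "TEXT", "value": text},
                {"column": "TEXT_CLEAN", "value": clean_text},
                {"column": "KEYWORDS", "value": keywords},
                {"column": "NB_WORDS", "value": n_words},
                # Flag the row so the robot skips it on the next pass.
                {"column": "STATUS_PDF", "value": "updated"},
            ],
        },
    }
```

Setting `STATUS_PDF` back to `updated` is what closes the loop: the next scheduled run only queries rows still flagged `to_add`.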
Conclusion
All these steps help us get real-time information about each application, and a key summary before diving in.
Appendix
Here is the code used (without the deployment):
import os
import re
import tempfile
import time
import unicodedata

import boto3
import requests
from nltk.corpus import stopwords

final_stopwords_list = stopwords.words('english') + stopwords.words('french')

## Load creds
token = 'XXXXX'
headers_coda = {'Authorization': 'Bearer {}'.format(token)}
s3_client = boto3.resource('s3', region_name='eu-west-3')
client = boto3.client('textract')
client_language = boto3.client('comprehend', region_name='eu-west-2')

## Fetch the rows flagged as "to_add"
URI_APPLICATION = 'https://coda.io/apis/v1/docs/XXX/tables/XXX/rows'
params = {'query': 'STATUS_PDF:"to_add"'}
request_application = requests.get(URI_APPLICATION, headers=headers_coda, params=params)

### SAVE PDF
list_payload = []
if len(request_application.json()['items']) > 0:
    for i, val in enumerate(request_application.json()['items']):
        if len(val['values']['XXX']) > 0:
            ID_ROW = val['id']
            URL = val['values']['XXX']
            download_file = requests.get(URL, allow_redirects=False)
            FILENAME_NEW_PDF = "{}.pdf".format(val['values']['XXX'].upper())
            LOCAL_PATH_FILE = os.path.join(tempfile.gettempdir(), FILENAME_NEW_PDF)
            PATH_S3_KEY = os.path.join(
                'DATA/SUSTAINABLE_DEVELOPMENT/APPLICATION_CALL',
                FILENAME_NEW_PDF
            )

            with open(LOCAL_PATH_FILE, "wb") as file:
                file.write(download_file.content)

            ### Save to S3 (Textract reads the PDF from there)
            s3_client.Bucket('datalake-datascience').upload_file(LOCAL_PATH_FILE, PATH_S3_KEY)

            ### Analyse the PDF with Textract (asynchronous job)
            response = client.start_document_analysis(
                DocumentLocation={
                    'S3Object': {
                        'Bucket': 'datalake-datascience',
                        'Name': PATH_S3_KEY
                    }
                },
                FeatureTypes=['TABLES', 'FORMS']
            )

            ### Poll until the job is done
            response_pdf_analysis = client.get_document_analysis(JobId=response['JobId'])
            status = response_pdf_analysis['JobStatus']
            while status == "IN_PROGRESS":
                time.sleep(5)  # avoid hammering the API while the job runs
                response_pdf_analysis = client.get_document_analysis(JobId=response['JobId'])
                status = response_pdf_analysis['JobStatus']

            ### The results are paginated; follow NextToken to collect every page
            list_pages = [response_pdf_analysis]
            while 'NextToken' in response_pdf_analysis:
                response_pdf_analysis = client.get_document_analysis(
                    JobId=response['JobId'],
                    NextToken=response_pdf_analysis['NextToken']
                )
                list_pages.append(response_pdf_analysis)

            ### Fetch text
            final_text = []
            for page in list_pages:
                for text in page['Blocks']:
                    if text['BlockType'] == 'LINE':
                        final_text.append(text['Text'])
            final_text_join = ' '.join(final_text)

            ### Clean text: strip accents, keep alphanumerics, lowercase
            final_text_join = ''.join(
                c for c in unicodedata.normalize('NFD', final_text_join)
                if unicodedata.category(c) != 'Mn'
            )
            final_text_join = re.sub('[^A-Za-z0-9]+', ' ', final_text_join).lower()

            ### Remove stop words and count words
            to_check = ' '.join(
                word for word in final_text_join.split(' ')
                if word not in final_stopwords_list
            )
            COUNT_WORD = len(final_text_join.split(' '))

            ### Fetch language (text truncated to the first 5000 characters)
            response = client_language.detect_dominant_language(Text=to_check[:5000])
            LANGUAGE = response['Languages'][0]['LanguageCode']

            ### Fetch keywords
            response = client_language.detect_key_phrases(
                Text=to_check[:5000],
                LanguageCode=LANGUAGE
            )
            KEYWORDS = [i['Text'] for i in response['KeyPhrases'] if i['Score'] > .9]

            ### Update database
            uri = 'https://coda.io/apis/v1/docs/XX/tables/XX/rows/{}'.format(ID_ROW)
            payload = {
                'row': {
                    'cells': [
                        {'column': 'LANGUAGE', 'value': LANGUAGE},
                        {'column': 'TEXT', 'value': final_text_join},
                        {'column': 'TEXT_CLEAN', 'value': to_check},
                        {'column': 'KEYWORDS', 'value': KEYWORDS},
                        {'column': 'NB_WORDS', 'value': COUNT_WORD},
                        {'column': 'STATUS_PDF', 'value': 'updated'},
                    ],
                },
            }
            req = requests.put(uri, headers=headers_coda, json=payload)
            res = req.json()
            list_payload.append({'payload': payload, 'response': res})