Each submission has a PDF with the proposal, which is stored in Coda. We built a robot that scans the database on a regular basis to detect new submissions. As soon as a new submission is detected, the robot downloads the PDF.
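The polling robot can be sketched as below. This is a minimal sketch, not the actual implementation: the credentials, doc/table IDs, and the `pdf_url` column name are all hypothetical, and it assumes the Coda REST API's rows endpoint.

```python
def find_new_submissions(known_ids, rows):
    """Return the rows whose id has not been processed yet (pure helper)."""
    return [row for row in rows if row["id"] not in known_ids]

def poll_once(known_ids):
    import requests  # only needed for the network call, not the pure helper

    CODA_TOKEN = "YOUR_API_TOKEN"        # hypothetical credentials
    DOC_ID, TABLE_ID = "docId", "grid-tableId"  # hypothetical identifiers

    resp = requests.get(
        f"https://coda.io/apis/v1/docs/{DOC_ID}/tables/{TABLE_ID}/rows",
        headers={"Authorization": f"Bearer {CODA_TOKEN}"},
    )
    resp.raise_for_status()
    new_rows = find_new_submissions(known_ids, resp.json()["items"])
    for row in new_rows:
        # assume a "pdf_url" column holds the proposal's file link
        pdf = requests.get(row["values"]["pdf_url"])
        with open(f"{row['id']}.pdf", "wb") as f:
            f.write(pdf.content)
    return new_rows
```

In practice this runs on a schedule (e.g. a cron job), keeping the set of already-seen row IDs between runs.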
Extract information
The second and most important step aims to extract the text from the PDF and analyse it. We use two APIs provided by AWS to:
Extract text
Extract language and keywords
Extract text
For instance, we send the following PDF to the API
and we get the following results
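The post does not name the extraction service, so the sketch below assumes Amazon Textract's `detect_document_text` API, which fits this step; the parsing helper follows the documented response shape (`{"Blocks": [{"BlockType": "LINE", "Text": ...}, ...]}`). Note that the synchronous call accepts image bytes; multi-page PDFs require the asynchronous `start_document_text_detection` variant instead.

```python
def extract_text(doc_bytes):
    import boto3  # imported lazily so the pure helper below runs without AWS
    client = boto3.client("textract")
    resp = client.detect_document_text(Document={"Bytes": doc_bytes})
    return blocks_to_text(resp)

def blocks_to_text(resp):
    """Join the LINE blocks of a Textract response into plain text."""
    return "\n".join(b["Text"] for b in resp["Blocks"]
                     if b["BlockType"] == "LINE")
```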
When it comes to analysing the text, there are a few steps to take into account:
Remove special characters
Remove accents, and other characteristics specific to the French language
Remove the stop words
Remove special characters
From the second image, we can see that some characters, such as “(”, “.”, “;”, etc., are irrelevant for the analysis, so we need to remove them. As the image below shows, the parentheses are gone.
⚠️ Note that we use the text above to count the number of words.
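The special-character and accent cleaning can be sketched with the standard library; this is one possible implementation, not necessarily the one used in the pipeline.

```python
import re
import unicodedata

def clean_text(text):
    """Strip accents and special characters, keeping plain words only."""
    # Decompose accented letters (é -> e + combining accent), drop the accents
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace left behind by the substitutions
    return re.sub(r"\s+", " ", text).strip()

def word_count(text):
    """Count words on the cleaned text, as the note above describes."""
    return len(text.split())
```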
Remove the stop words
The stop words are a list of common words that provide no relevant information, such as pronouns, linking words, etc.
Example
'i',
'me',
'my',
'myself',
'we',
'our',
'ours',
'ourselves',
'you',
After this rough cleaning, we get the following text, which we can send to the second API.
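The stop-word filter can be sketched as follows. NLTK's `nltk.corpus.stopwords.words('english')` provides the full reference list the example above comes from; a short inline subset is used here so the snippet stays self-contained.

```python
# Small illustrative subset of an English stop-word list
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours",
              "ourselves", "you", "the", "a", "an", "and", "of", "to"}

def remove_stop_words(text):
    """Drop words that carry no topical information before keyword extraction."""
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)
```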
Extract language and keywords
We need to fetch two pieces of information:
The language used
The most relevant keywords
The language used
First of all, we send the extracted text to the AWS Comprehend service using the detect_dominant_language API. The API returns the language along with a confidence score.
For the PDF analysed, the model is 95% confident that the language is English.
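A sketch of this call with boto3; the parsing helper follows the documented response shape (`{"Languages": [{"LanguageCode": ..., "Score": ...}, ...]}`).

```python
def dominant_language(text):
    import boto3  # imported lazily so the pure helper runs without AWS
    client = boto3.client("comprehend")
    resp = client.detect_dominant_language(Text=text)
    return top_language(resp)

def top_language(resp):
    """Pick the highest-scoring language from a Comprehend response."""
    best = max(resp["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]
```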
The most relevant keywords
We use the previous information to tell the machine which language to use when extracting the most relevant keywords. For that, we turn to the AWS API function detect_key_phrases. The API returns a list of relevant phrases used in the PDF along with a score. For the sake of our analysis, we keep only the most relevant keywords (i.e. with a score above 90%):
[{'Score': 0.9918580651283264,
'Text': 'spatio temporal social dimensions',
'BeginOffset': 0,
'EndOffset': 33},
{'Score': 0.9929715394973755,
'Text': 'human wildlife interactions',
'BeginOffset': 34,
'EndOffset': 61},
{'Score': 0.9508159756660461,
'Text': 'urban environment background human wildlife co',
'BeginOffset': 62,
'EndOffset': 108},
...}]
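The filtering step on the list above can be sketched as a one-liner, assuming the documented Comprehend response shape (`{"KeyPhrases": [...]}`):

```python
def relevant_keywords(resp, threshold=0.90):
    """Keep only the key phrases whose confidence score exceeds the threshold."""
    return [kp["Text"] for kp in resp["KeyPhrases"] if kp["Score"] > threshold]
```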
Update database
We have in hand the following information:
Language
Extracted text
Extracted clean text
List of most relevant keywords
Number of words
We can update the database with the new information.
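The update can be expressed as a Coda row-update payload (`{"row": {"cells": [...]}}`); the column names below are assumptions for illustration, not the project's actual schema.

```python
def build_update(language, text, clean_text, keywords, n_words):
    """Assemble the row-update payload from the extracted information."""
    return {"row": {"cells": [
        {"column": "Language", "value": language},          # hypothetical column names
        {"column": "Extracted text", "value": text},
        {"column": "Clean text", "value": clean_text},
        {"column": "Keywords", "value": ", ".join(keywords)},
        {"column": "Word count", "value": n_words},
    ]}}
```

This payload would then be sent with a PUT request to the row's endpoint, authenticated like the polling robot above.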
Conclusion
All these steps help us get real-time information about each application, along with a key summary, before we dive in.