Data Science: Analysis of Over 985,000 Italian Words
This project is a fork of a repository by the GitHub user https://github.com/napolux
- Source of the Data Set.
- For the code, check my GitHub account.
- Python and Anaconda’s JupyterLab are used for this analysis.
Why Italian words?
Since I became involved with the Italian citizenship process, I have been learning Italian. The Italian language really fascinates me and makes me feel closer to my roots.
“Quindi” (so), the post that follows is what happens when a data scientist gets involved with a new language.
First of all, the Italian alphabet is quite different from the others.
Latin alphabet (26 letters):
['a', 'b', 'c', 'd', 'e',
'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't',
'u', 'v', 'w', 'x', 'y', 'z']
Italian alphabet (21 letters):
['a', 'b', 'c', 'd', 'e',
'f', 'g', 'h', 'i',
'l', 'm', 'n', 'o',
'p', 'q', 'r', 's', 't',
'u', 'v', 'z']
The Italian alphabet does not contain the letters J, K, W, X and Y, so instead of 26 letters, Italian has only 21. All Italian words that begin with J, K, W, X or Y are borrowed from other languages.
Of course, other words are borrowed from other languages too, but for this project we will only consider as “borrowed words” the ones that start with J, K, W, X or Y.
The name of the file is “parole_unique.txt”, or “unique_words” in English. There are no repetitions in this data set: all the words are unique. The dataset is simple: there is an index for each word and that’s all.
So the entire analysis is performed on the words themselves.
Let’s get started. I hope you like it!
- Latin alphabet: as shown earlier, all the letters from A to Z;
- Italian alphabet: the same as the Latin alphabet except for the letters J, K, W, X and Y;
- Borrowed words: words that start with the letters J, K, W, X or Y.
This dataset contains 986698 words, “borrowed words” or not. Of these, 941407 are “Italian words” (words not borrowed) and 45291 are words borrowed from other languages, which represents 4.59% of the total.
WARNING: THERE ARE PROBABLY WORDS FROM OTHER LANGUAGES IN THIS DATASET THAT WERE NOT CORRECTLY FILTERED, BECAUSE MY FILTERING METHOD USED ONLY THE WORD’S INITIAL LETTER. I DID NOT CROSS-CHECK AGAINST A DATASET OF ANY OTHER LANGUAGE.
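The initial-letter filter described above can be sketched in a few lines of Python. This is only a minimal illustration, not necessarily the code used for the post: the names `FOREIGN_INITIALS`, `is_borrowed` and `split_words` are hypothetical, and the sample list stands in for the contents of “parole_unique.txt”.

```python
# Letters absent from the Italian alphabet, as listed in the post.
FOREIGN_INITIALS = set("jkwxy")


def is_borrowed(word: str) -> bool:
    """Return True if the word starts with J, K, W, X or Y (case-insensitive)."""
    return word[:1].lower() in FOREIGN_INITIALS


def split_words(words):
    """Split words into (italian, borrowed) lists using only the initial letter."""
    italian, borrowed = [], []
    for word in words:
        (borrowed if is_borrowed(word) else italian).append(word)
    return italian, borrowed


# Hypothetical sample; the real input would be the words from the data set.
sample = ["casa", "jazz", "yogurt", "quindi", "weekend", "zucchero"]
italian, borrowed = split_words(sample)
share = 100 * len(borrowed) / len(sample)
print(len(italian), len(borrowed), f"{share:.2f}%")  # prints: 3 3 50.00%
```

As the warning says, this rule only looks at the first letter, so a borrowed word such as “computer” would still be counted as Italian.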
The next graph shows the number of words (left y-axis) and the percentage of words (right y-axis) that start with each letter of the Italian alphabet.
The most frequent initial letters are S, R and A, with 14.59%, 10.47% and 9.17% respectively. Approximately 34.23% of the words in the dataset start with S, R or A. Those 3 initial letters account for 337717 of the 986698 words.
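A count like this one can be reproduced with `collections.Counter`. The sketch below is an assumption about how it could be done: `initial_letter_stats` is a hypothetical helper, and the percentages are computed over all words (borrowed or not), which matches the 337717 of 986698 figure quoted above.

```python
from collections import Counter

# The 21 letters of the Italian alphabet, as listed in the post.
ITALIAN_ALPHABET = "abcdefghilmnopqrstuvz"


def initial_letter_stats(words):
    """Return {letter: (count, percent of all words)} for each Italian initial."""
    total = len(words)
    counts = Counter(w[0].lower() for w in words if w)
    return {
        letter: (counts[letter], 100 * counts[letter] / total if total else 0.0)
        for letter in ITALIAN_ALPHABET
    }


# Hypothetical sample; "jazz" counts toward the total but has no Italian initial.
sample = ["sale", "sole", "rosa", "amore", "jazz"]
stats = initial_letter_stats(sample)
print(stats["s"], stats["r"], stats["a"])  # prints: (2, 40.0) (1, 20.0) (1, 20.0)
```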
Counting the Letters in Each Word
The following graph, number 3, shows the distribution of the number of letters per word in the data set.
For this purpose, I computed the length of each word in the data set and grouped the words by their number of letters.
As we can see, almost every word in the dataset is no longer than 20 letters. There are 986323 words with 20 letters or fewer and 375 words with more than 20 letters.
The average number of letters in this dataset is 9.98 and the median is 10.
- 126424 words with 11 letters;
- 135526 words with 9 letters;
- 140472 words with 10 letters.
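The length statistics above can be computed with the standard library alone. The following is a minimal sketch under my own naming (`length_summary` and its `threshold` parameter are hypothetical), run on a tiny sample rather than the real data set.

```python
from collections import Counter
from statistics import mean, median


def length_summary(words, threshold=20):
    """Summarise word lengths: distribution, mean, median and long-word count."""
    lengths = [len(w) for w in words]
    long_words = sum(1 for n in lengths if n > threshold)
    return {
        "distribution": dict(sorted(Counter(lengths).items())),
        "mean": mean(lengths),
        "median": median(lengths),
        "at_most_threshold": len(lengths) - long_words,
        "above_threshold": long_words,
    }


# Hypothetical sample with one word longer than 20 letters.
sample = ["re", "casa", "quindi", "nazionalizzazione", "precipitevolissimevolmente"]
summary = length_summary(sample)
print(summary["distribution"])
print(summary["mean"], summary["median"], summary["above_threshold"])
```

On the full data set, `summary["distribution"]` would hold the values plotted in graph 3, keyed by word length.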
While analysing the data I came across a different but curious question: what is the highest number of times a letter occurs within a single word in the entire dataset?
The next graph, number 4, shows the maximum number of times that each letter of the Italian alphabet appears in a SINGLE WORD.
As we can see, the letters C, E, S, I and O each appear up to 7 times in a single word, followed by A, N, T, R, M and D with 6 times each. Then come B, L, G, F and Z with 5 times each, H, P, U and V with 4 times each, and finally Q with 3 times in a single word. Some examples:
S - spossessasse — (dispossess)
I - giurisdizionalistici — (jurisdictionalistic)
R - prerefrigererebbero — (they would pre-cool)
Z - nazionalizzazione — (nationalization)
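One way to find, for each letter, the word in which it repeats the most is to count the letters of every word and keep the best result per letter. This is a sketch under assumptions: `max_letter_repeats` is a hypothetical helper, and ties are resolved in favour of the first word seen.

```python
from collections import Counter

# The 21 letters of the Italian alphabet, as listed in the post.
ITALIAN_ALPHABET = "abcdefghilmnopqrstuvz"


def max_letter_repeats(words):
    """For each Italian letter, return (max count in one word, example word)."""
    best = {letter: (0, None) for letter in ITALIAN_ALPHABET}
    for word in words:
        for letter, count in Counter(word.lower()).items():
            if letter in best and count > best[letter][0]:
                best[letter] = (count, word)
    return best


# Sample made of words quoted in the post.
sample = ["spossessasse", "giurisdizionalistici", "prerefrigererebbero"]
result = max_letter_repeats(sample)
print(result["s"])  # prints: (7, 'spossessasse')
print(result["i"])  # prints: (7, 'giurisdizionalistici')
print(result["r"])  # prints: (6, 'prerefrigererebbero')
```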
Summing All The Letters Within the Words
There are 986698 words in this dataset, but how many times does each letter occur across all 986698 words?
There are 9783876 letters in this dataset, which makes sense: there are 986698 words, and 9783876 divided by 986698 is about 9.92 letters per word, close to the average of 9.98 letters per word found earlier.
The next graph shows the total number of occurrences of each letter of the Italian alphabet.
Top 5 Letter Occurrences in the Dataset
I: 1209959
A: 1124997
E: 995039
R: 833290
O: 805063
These 5 letters appear 4968348 times across all the words in the dataset, so we can infer that about 50.8% of all letter occurrences in the data set come from I, A, E, R and O.
There are 21 letters in the Italian alphabet, so I, A, E, R and O represent about 24% of the alphabet but roughly half of the letters in the dataset.
Thus, the remaining 49.2% of the letter occurrences in the dataset come from all the letters other than those top 5.
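The letter totals and the top 5 share can be sketched as follows. The helper names `letter_totals` and `top_share` are hypothetical, and counting only the 21 Italian letters (ignoring accented characters and J, K, W, X, Y) is my assumption, not something the data set description confirms.

```python
from collections import Counter

# The 21 letters of the Italian alphabet, as listed in the post.
ITALIAN_ALPHABET = set("abcdefghilmnopqrstuvz")


def letter_totals(words):
    """Count every occurrence of each Italian-alphabet letter across all words."""
    totals = Counter()
    for word in words:
        totals.update(ch for ch in word.lower() if ch in ITALIAN_ALPHABET)
    return totals


def top_share(totals, n=5):
    """Return the n most common letters and their percentage of all counted letters."""
    grand_total = sum(totals.values())
    top = totals.most_common(n)
    top_sum = sum(count for _, count in top)
    share = 100 * top_sum / grand_total if grand_total else 0.0
    return top, share


# Hypothetical sample; the real input would be the words from the data set.
sample = ["casa", "rosa", "amore"]
totals = letter_totals(sample)
top, share = top_share(totals, n=2)
print(top[0], sum(totals.values()))  # prints: ('a', 4) 13

# Checking the post's own figures: the top 5 total over all letters.
print(round(100 * 4968348 / 9783876, 1))  # prints: 50.8
```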