The first step is to type a special command at the Python prompt that tells the interpreter to load some texts for us to explore: from nltk.book import *. This says "from NLTK's book module, load all items." After printing a welcome message, it loads the text of several books (this will take a few seconds). Here's the command again, together with the output that you will see.
1. Language Processing and Python
Any time we want to find out about these texts, we just have to enter their names at the Python prompt.
Now that we can use the Python interpreter, and have some data to work with, we're ready to get started. There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context.
Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses. The first time you use a concordance on a particular text, it takes a few extra seconds to build an index so that subsequent searches are fast.
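To make the idea of a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name concordance_lines, the width parameter, the case-insensitive matching, and the made-up token list are all assumptions chosen for illustration, standing in for a loaded text such as text1.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side.

    Hypothetical helper for illustration only; NLTK's own concordance
    builds an index and formats its output differently.
    """
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match (a design choice here)
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines

# A made-up token list, standing in for a real text.
tokens = "the monstrous whale rose and a most monstrous size it was".split()
for line in concordance_lines(tokens, "monstrous"):
    print(line)
```

Each printed line shows one occurrence of the word with a few neighbouring tokens on either side, which is the essence of a concordance view.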
Your Turn: Try searching for other words; to save re-typing, you might be able to use up-arrow, Ctrl-up-arrow or Alt-p to access the previous command and modify the word being searched. You can also try searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2. Search the book of Genesis to find out how long some people lived, using text3. You could look at text4, the Inaugural Address Corpus, to see examples of English from earlier periods, and search for words like nation, terror, god to see how these words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!)
Once you've spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language.
In the next chapter you will learn how to access a broader range of text, including text in languages other than English.
A concordance permits us to see words in context. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses.
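One simple way to think about "a similar range of contexts" is to treat the context of a word as the pair of tokens immediately before and after it, and to look for other words that share at least one such pair. The sketch below assumes that definition; the function name similar_words and the toy token list are hypothetical, and NLTK's own similar method ranks its results in a more refined way.

```python
from collections import defaultdict

def similar_words(tokens, word):
    """Return words that share at least one (previous, next) context with `word`."""
    contexts = defaultdict(set)  # token -> set of (previous token, next token) pairs
    for prev, cur, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[cur].add((prev, nxt))
    target = contexts.get(word, set())
    return sorted(w for w, ctx in contexts.items() if w != word and ctx & target)

# A made-up token list, standing in for a real text.
tokens = "a monstrous size and a great size but a very small size".split()
print(similar_words(tokens, "monstrous"))
```

Here monstrous and great both appear between a and size, so great is reported as appearing in a similar context.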
Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very. When a query involves two or more words, such as monstrous and very, we have to enclose these words in square brackets as well as parentheses, and separate them with a comma.
It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot.
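The positional information itself is easy to compute without any plotting library. The following sketch lists the offsets at which each word occurs, counted in tokens from the beginning of the text; the function name word_offsets and the toy token list are assumptions for illustration, not part of NLTK.

```python
def word_offsets(tokens, word):
    """Return the 0-based token positions at which `word` occurs."""
    return [i for i, tok in enumerate(tokens) if tok == word]

# A made-up token list, standing in for a real text.
tokens = "we the people of the nation and the people of the world".split()
for w in ["people", "nation", "the"]:
    print(w, word_offsets(tokens, w))
```

A dispersion plot draws one stripe per offset, one row per word, so these lists are the raw data behind such a plot.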
In a dispersion plot, each stripe represents an instance of a word, and each row represents the entire text. You can produce this plot as shown below. You might like to try more words. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets and parentheses exactly right.
Figure: Dispersion plot for the Presidential Inaugural Addresses. This can be used to investigate changes in language use over time.
Important: You need to have Python's NumPy and Matplotlib packages installed in order to produce the graphical plots used in this book.
Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. We need to include the parentheses, but there's nothing that goes between them.
Note: The generate method is not available in NLTK 3.
The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section we will see how to use the computer to count the words in a text in a variety of useful ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter.
Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis.
So Genesis has roughly 44 thousand words and punctuation symbols, or "tokens." A token is a sequence of characters that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not. But there are only four distinct vocabulary items in this phrase. How many distinct words does the book of Genesis contain? To work this out in Python, we have to pose the question slightly differently.
The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command set(text3). When you do this, many screens of words will fly past.
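The difference between counting tokens and counting distinct vocabulary items can be checked on the example phrase itself. This sketch uses a plain Python list of words in place of an NLTK text; splitting on whitespace is a simplification that only works here because the phrase contains no punctuation.

```python
# The example phrase, split into tokens on whitespace.
tokens = "to be or not to be".split()

# len counts every token, including repeats.
print(len(tokens))

# set collapses duplicates, leaving only the distinct vocabulary items.
vocabulary = set(tokens)
print(len(vocabulary))

# count reports how often one particular token occurs.
print(tokens.count("to"), tokens.count("or"))
```

The phrase has six tokens but only four distinct items, matching the counts given in the text.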
Now try the following. By wrapping sorted() around the Python expression set(text3), we obtain a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A. All capitalized words precede lowercase words.
We discover the size of the vocabulary indirectly, by asking for the number of items in the set, and again we can use len to obtain this number. Although it has roughly 44 thousand tokens, this book has only a few thousand distinct words, or "word types." Our count will include punctuation symbols, so we will generally call these unique items types instead of word types.
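The same steps can be tried on a handful of tokens to see the ordering described above. The token list below is a hand-written stand-in, not the contents of text3.

```python
# A short, hand-tokenized stand-in for a real text (punctuation is its own token).
tokens = ["In", "the", "beginning", "God", "created", "the",
          "heaven", "and", "the", "earth", "."]

# sorted(set(...)) gives the vocabulary in order: punctuation first,
# then capitalized words, then lowercase words.
vocab = sorted(set(tokens))
print(vocab)

# The number of types is smaller than the number of tokens
# because "the" occurs three times.
print(len(tokens), len(vocab))
```

The ordering follows from how Python compares strings character by character: the period sorts before uppercase letters, which in turn sort before lowercase letters.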
Now, let's calculate a measure of the lexical richness of the text.
Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word.
Your Turn: How many times does the word lol appear in text5? How much is this as a percentage of the total number of words in this text?
You may want to repeat such calculations on several texts, but it is tedious to keep retyping the formula. Instead, you can give the task a name and associate that name with a block of code. Now you only have to type a short name instead of one or more complete lines of Python code, and you can re-use it as often as you like.
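Before these calculations are wrapped in a named block, here is how they look when typed out in full on a toy token list. The list is invented for illustration, and the richness measure shown (distinct tokens divided by total tokens) is an assumption about what "lexical richness" means here.

```python
# A made-up chat-like token list, standing in for a real text such as text5.
tokens = "lol that was funny lol see you later lol bye".split()

# How often does one word occur, and what share of the text is that?
count = tokens.count("lol")
total = len(tokens)
print(count, 100 * count / total)

# Assumed richness measure: the fraction of tokens that are distinct.
richness = len(set(tokens)) / len(tokens)
print(richness)
```

Retyping these expressions for each new text or word quickly becomes repetitive, which is the motivation for giving them names.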
The block of code that does a task for us is called a function, and we define a short name for our function with the keyword def. It is up to you to do the indentation, by typing four spaces or hitting the tab key.
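A minimal sketch of two such definitions follows. The function name lexical_diversity and its formula (distinct tokens divided by total tokens) are assumptions; percentage, with parameters named count and total, follows the description given in this chapter. The token list is again a toy stand-in for a real text.

```python
def lexical_diversity(text):
    """Assumed measure: number of distinct tokens divided by total tokens."""
    return len(set(text)) / len(text)

def percentage(count, total):
    """Express `count` as a percentage of `total`."""
    return 100 * count / total

# Calling the functions: the values in parentheses are the arguments.
tokens = "to be or not to be".split()
print(lexical_diversity(tokens))
print(percentage(tokens.count("to"), len(tokens)))
print(percentage(4, 5))
```

Note the four-space indentation of each function body; in a script the definition ends where the indentation ends, while at the interactive prompt a blank line finishes the block.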
To finish the indented block just enter a blank line. The function that computes lexical diversity takes a single parameter. This parameter is a "placeholder" for the actual text whose lexical diversity we want to compute, and reoccurs in the block of code that will run when the function is used. Similarly, percentage is defined to take two parameters, named count and total. The data value that we place in the parentheses when we call a function is an argument to the function. You have already encountered several functions in this chapter, such as len, set, and sorted.
By convention, we will always add an empty pair of parentheses after a function name, as in len(), just to make clear that what we are talking about is a function rather than some other kind of Python expression. Functions are an important concept in programming, and we only mention them at the outset to give newcomers a sense of the power and creativity of programming. Don't worry if you find it a bit confusing right now.