Chapter 17

Word Frequencies


Time to extend our running word-count example. Back in chapter 5 you built word_count, char_count, and longest_word with simple for loops. Chapter 16 then showed how iterators collapse those into one-liners. This chapter goes one level deeper: instead of asking how many words a text has, we'll ask which words appear and how often each one shows up.

There's no big new concept here either: this chapter puts iterators, hashmaps, and Option to work together. Along the way you'll meet two new iterator tricks (max_by_key and HashMap::into_iter); the rest is just applying what's already in your toolbox.

A few patterns you'll likely use:

Splitting text into words. Both split_whitespace and split return iterators of &str. The first handles any kind of whitespace and skips empties, which is usually what you want for natural text:

for word in "hello  world\nrust".split_whitespace() {
    println!("{word}"); // hello, world, rust
}
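
To see why split_whitespace is the friendlier default, here is a small sketch that runs both methods over the same input. The two helper functions are made up for this comparison and are not part of the exercises.

```rust
// Hypothetical helpers, named only for this comparison.
fn split_on_space(text: &str) -> Vec<&str> {
    // Splits at every single space, so two spaces in a row yield an empty piece,
    // and the newline is not treated as a separator at all.
    text.split(' ').collect()
}

fn split_on_whitespace(text: &str) -> Vec<&str> {
    // Treats any run of spaces, tabs, or newlines as one separator.
    text.split_whitespace().collect()
}

fn main() {
    let text = "hello  world\nrust";
    println!("{:?}", split_on_space(text));      // ["hello", "", "world\nrust"]
    println!("{:?}", split_on_whitespace(text)); // ["hello", "world", "rust"]
}
```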

Counting things into a HashMap. Reach for entry(...).or_insert(0):

use std::collections::HashMap;

let mut counts: HashMap<String, usize> = HashMap::new();
for word in text.split_whitespace() {
    // Insert 0 the first time a word is seen, then bump its counter.
    *counts.entry(word.to_lowercase()).or_insert(0) += 1;
}

Finding the maximum by some property. max_by_key is the right tool for "give me the entry with the largest count":

let top = counts.iter().max_by_key(|(_, count)| *count);
// top: Option<(&String, &usize)>

Filtering and collecting. Same iterator chain you saw in the previous chapter:

let frequent: Vec<String> = counts
    .iter()
    // Destructure the borrowed pair and copy the count out.
    .filter(|&(_, &n)| n >= min)
    .map(|(word, _)| word.clone())
    .collect();

Computing an average. Sum the lengths, divide by the count, watch out for the integer-division trap:

let total_chars: usize = words.iter().map(|w| w.len()).sum();
let avg = total_chars as f64 / words.len() as f64;
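
Here is a minimal sketch of the trap itself. The helper name average_len and the choice to return None for an empty list are assumptions made for this illustration; without that guard an empty list would divide 0.0 by 0.0 and produce NaN.

```rust
// Hypothetical helper: average length in bytes, or None when there are no words.
fn average_len(words: &[&str]) -> Option<f64> {
    if words.is_empty() {
        return None; // avoids 0.0 / 0.0, which would be NaN
    }
    let total_chars: usize = words.iter().map(|w| w.len()).sum();
    // Cast both operands first; the division then happens in f64.
    Some(total_chars as f64 / words.len() as f64)
}

fn main() {
    let words = ["to", "be", "or", "not"]; // 9 bytes across 4 words

    // Integer division throws the fraction away.
    let total_chars: usize = words.iter().map(|w| w.len()).sum();
    println!("{}", total_chars / words.len()); // 2

    // Floating-point division keeps it.
    println!("{:?}", average_len(&words)); // Some(2.25)
}
```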

Counting words

The foundation for everything else in this chapter: take a string of text and produce a HashMap<String, usize> that maps each word to how many times it appears. Words are separated by whitespace and the count should be case-insensitive: "Hello" and "hello" are the same word.

The classic recipe is: split on whitespace, lowercase each piece, then walk the resulting iterator and bump a counter in the map. The entry API on HashMap is the idiomatic way to do that last step: *map.entry(key).or_insert(0) += 1.
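
Put together, the recipe might look like the sketch below. The parameter name and the exact signature are assumptions based on the description above; the exercise file may word them differently.

```rust
use std::collections::HashMap;

// Assumed signature: borrow the text, return an owned map of lowercased words.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split_whitespace() {
        // Lowercase first so "Hello" and "hello" share one key.
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_words("Hello hello  WORLD\n");
    println!("{:?}", counts.get("hello")); // Some(2)
    println!("{:?}", counts.get("world")); // Some(1)
}
```

Note that punctuation stays attached to its word in this sketch, so "world" and "world!" would be counted separately.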

Useful from the standard library


The most common word

Now that you can count, finding the maximum is a one-liner, almost. The borrow checker has an opinion about returning data out of a HashMap, and that's the real lesson of this step.

count_words is duplicated below as a todo!() stub so this step compiles in isolation; you don't need to fill it in again. Focus on most_common_word. Once you have it, the test will drive both through unwrap().

Useful from the standard library

• HashMap::into_iter consumes the map and yields owned (K, V) pairs. That's how you get an owned String out without cloning.
• Iterator::max_by_key returns the entry with the largest derived key as an Option. max_by_key(|(_, count)| *count) does the trick here.
• An empty input naturally produces None: count_words returns an empty map, into_iter().max_by_key(...) returns None, and the function signature already says Option<(String, usize)>. No special case needed.
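
One way the pieces above could fit together is sketched here. The return type comes from the text; the parameter name and the filled-in count_words are assumptions so that the example runs on its own.

```rust
use std::collections::HashMap;

// Filled in here (instead of todo!()) only so the sketch is runnable.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn most_common_word(text: &str) -> Option<(String, usize)> {
    // counts.iter() would yield (&String, &usize) borrowed from a local map,
    // and those borrows cannot outlive this function. into_iter() consumes
    // the map and hands out owned (String, usize) pairs instead.
    count_words(text)
        .into_iter()
        .max_by_key(|(_, count)| *count)
}

fn main() {
    println!("{:?}", most_common_word("the cat and The dog")); // Some(("the", 2))
    println!("{:?}", most_common_word(""));                    // None
}
```

When several words share the top count, max_by_key returns the last of them in iteration order, and since HashMap order is unspecified the winner among ties is effectively arbitrary.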

Filtering frequent words

A different type of result: instead of one winning word, return every word whose count meets some threshold. The natural pipeline is count_words(text).into_iter().filter(...).map(...).collect(), and the collect infers Vec<String> from the return type.

As before, count_words is stubbed with todo!() so this step compiles standalone. Your work is in frequent_words.

Useful from the standard library

• HashMap::into_iter hands out owned (String, usize) pairs, so the resulting Vec doesn't have to clone anything.
• Iterator::filter keeps the pairs whose count is high enough. Destructure the tuple with |(_, count)| to ignore the word and look only at the count.
• Iterator::map drops the count and keeps just the word, so collect can build a Vec<String>.
• HashMap iteration order is unspecified; if your test ever relies on a particular order, sort the result first.
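
The pipeline described above could be written roughly as follows. The parameter names (text, min) and the final sort are assumptions for this sketch; the sort is there only to make the output deterministic.

```rust
use std::collections::HashMap;

// Filled in here (instead of todo!()) only so the sketch is runnable.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn frequent_words(text: &str, min: usize) -> Vec<String> {
    let mut words: Vec<String> = count_words(text)
        .into_iter()
        // filter sees a reference to each pair, so count is a &usize here.
        .filter(|(_, count)| *count >= min)
        // map receives the owned pair and keeps only the String.
        .map(|(word, _)| word)
        .collect();
    // HashMap order is unspecified, so sort for a stable result.
    words.sort();
    words
}

fn main() {
    println!("{:?}", frequent_words("a b a c b a", 2)); // ["a", "b"]
}
```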

Text statistics

The orchestrator step. text_stats returns three numbers about a piece of text: total word count, number of unique words, and the average word length as an f64. You can compute all three from a single pass over count_words's result, or split the work; either is fine.

count_words is stubbed with todo!() again so this file compiles on its own. Wire text_stats up however you like. The test only cares about the returned tuple.

Useful from the standard library

• The total word count is the sum of every value in the map: counts.values().sum::<usize>().
• The unique-word count is counts.len().
• For the average length, sum key.chars().count() * count across the map (or sum word.len() straight from a fresh text.split_whitespace() pass) and divide by the total. Keep in mind that len() counts bytes while chars().count() counts characters; the two agree only for ASCII text. Watch the integer-division trap: cast both operands to f64 before the divide.
• HashMap::values and HashMap::iter are the two iterator entry points you'll likely use here.
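
A possible shape for text_stats, computed from the map alone, is sketched below. The tuple order follows the description above; returning 0.0 as the average for empty input is an assumption of this sketch, since the text does not say what the test expects in that case.

```rust
use std::collections::HashMap;

// Filled in here (instead of todo!()) only so the sketch is runnable.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

// Returns (total words, unique words, average word length in characters).
fn text_stats(text: &str) -> (usize, usize, f64) {
    let counts = count_words(text);
    let total: usize = counts.values().sum();
    let unique = counts.len();
    // Each key contributes its character count once per occurrence.
    let total_chars: usize = counts
        .iter()
        .map(|(word, count)| word.chars().count() * count)
        .sum();
    // Assumed behavior: report 0.0 rather than NaN when there are no words.
    let avg_len = if total == 0 {
        0.0
    } else {
        total_chars as f64 / total as f64
    };
    (total, unique, avg_len)
}

fn main() {
    // a:2, bb:1, ccc:1 gives 4 words, 3 unique, 7 characters in total.
    println!("{:?}", text_stats("a bb A ccc")); // (4, 3, 1.75)
}
```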

Wrapping up the word counter

You glued together the chapters so far: a HashMap keyed by lowercased words, an into_iter() to escape the borrow checker, a max_by_key to pick a winner, a filter/map/collect to slice the map, and a few aggregations to compute summary stats.

What we learned

• split_whitespace() is the right default for word-splitting in natural text. It collapses runs of whitespace and skips empties.
• Lowercasing keys (or any other normalization step) belongs to the same pipeline that builds the map, not to the consumer side.
• into_iter on a HashMap is the standard escape hatch when you need to return owned data out of it. iter only hands out borrows.
• max_by_key returns an Option, so empty input naturally collapses to None without a special-case branch.
• "Filter, then map, then collect" composes the same way over a HashMap as it does over a Vec. The collection on either end is just where the items live.
• Watch the integer-division trap when computing averages: divide after casting to f64, not before. When comparing f64 values, use a tolerance ((a - b).abs() < eps) rather than ==.
• Tuples like (usize, usize, f64) work for tiny ad-hoc returns, but a named struct (TextStats { total, unique, avg_len }) reads better at the call site as soon as the function grows in scope.
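
The last two points can be illustrated with a short sketch. The struct and its field names come from the bullet above; the approx_eq helper, the from_tuple constructor, and the tolerance value are assumptions for this example.

```rust
// Named fields document themselves at the call site.
#[derive(Debug, Clone, PartialEq)]
struct TextStats {
    total: usize,
    unique: usize,
    avg_len: f64,
}

impl TextStats {
    // Hypothetical constructor that wraps the tuple returned by text_stats.
    fn from_tuple((total, unique, avg_len): (usize, usize, f64)) -> Self {
        TextStats { total, unique, avg_len }
    }
}

// Hypothetical helper: true when a and b differ by less than eps.
fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() < eps
}

fn main() {
    let stats = TextStats::from_tuple((4, 3, 1.75));
    println!("{} words, {} unique", stats.total, stats.unique);

    // Rounding makes exact comparison unreliable.
    let sum: f64 = 0.1 + 0.2;
    println!("{}", sum == 0.3);                // false
    println!("{}", approx_eq(sum, 0.3, 1e-9)); // true
    println!("{}", approx_eq(stats.avg_len, 1.75, 1e-9)); // true
}
```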