Chapter 17

Word Frequencies

Time to extend our running word-count example. Back in chapter 5 you built word_count, char_count, and longest_word with simple for loops. Chapter 16 then showed how iterators collapse those into one-liners. This chapter goes one level deeper: instead of asking how many words a text has, we'll ask which words appear and how often each one shows up.

There's no big new concept here either: this chapter puts iterators, HashMaps, and Option to work together. Along the way you'll meet two new iterator tricks (max_by_key and HashMap::into_iter); the rest is just applying what's already in your toolbox.

A few patterns you'll likely use:

Splitting text into words. Both split_whitespace and split return iterators of &str. The first handles any kind of whitespace and skips empties, which is usually what you want for natural text:

for word in "hello  world\nrust".split_whitespace() {
    println!("{word}"); // hello, world, rust
}
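
To make the difference between the two methods concrete, here is a small sketch. The helper names split_on_space and split_on_whitespace are made up for this illustration; only split and split_whitespace come from the standard library.

```rust
/// Splits on single space characters only, so a run of spaces produces
/// empty pieces and a newline stays inside its piece.
fn split_on_space(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}

/// Splits on any whitespace (spaces, tabs, newlines) and never yields
/// an empty piece.
fn split_on_whitespace(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}
```

For the input "hello  world\nrust" (two spaces, then a newline), split_on_space returns "hello", "", "world\nrust", while split_on_whitespace returns "hello", "world", "rust".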

Counting things into a HashMap. Reach for entry(...).or_insert(0):

let mut counts: HashMap<String, usize> = HashMap::new();
for word in text.split_whitespace() {
    *counts.entry(word.to_lowercase()).or_insert(0) += 1;
}

Finding the maximum by some property. max_by_key is the right tool for "give me the entry with the largest count":

let top = counts.iter().max_by_key(|(_, count)| *count);
// top: Option<(&String, &usize)>

Filtering and collecting. Same iterator chain you saw in the previous chapter:

let frequent: Vec<String> = counts
    .iter()
    .filter(|(_, n)| **n >= min) // n is &&usize here, hence the double deref
    .map(|(word, _)| word.clone())
    .collect();

Computing an average. Sum the lengths, divide by the count, watch out for the integer-division trap:

let total_chars: usize = words.iter().map(|w| w.len()).sum();
let avg = total_chars as f64 / words.len() as f64;
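
The integer-division trap is easiest to see side by side. This is a minimal sketch with hypothetical helper names; it assumes count is not zero for the integer versions, since integer division by zero panics.

```rust
/// Integer division: the fractional part is thrown away.
fn average_truncated(total: usize, count: usize) -> usize {
    total / count
}

/// Casting after the divide is too late: the truncation already happened.
fn average_cast_too_late(total: usize, count: usize) -> f64 {
    (total / count) as f64
}

/// Cast both operands to f64 first, then divide.
fn average(total: usize, count: usize) -> f64 {
    total as f64 / count as f64
}
```

With total = 13 and count = 3, the first two helpers give 4 and 4.0, while average gives roughly 4.33. Note that average(0, 0) does not panic: 0.0 / 0.0 is NaN.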

Counting words

The foundation for everything else in this chapter: take a string of text and produce a HashMap<String, usize> that maps each word to how many times it appears. Words are separated by whitespace and the count should be case-insensitive: "Hello" and "hello" are the same word.

The classic recipe is: split on whitespace, lowercase each piece, then walk the resulting iterator and bump a counter in the map. The entry API on HashMap is the idiomatic way to do that last step: *map.entry(key).or_insert(0) += 1.

Useful from the standard library

str::split_whitespace splits on any whitespace and skips empty pieces. Almost always what you want for word splitting.

str::to_lowercase returns a fresh String. Use it as the map key so Hello and hello collapse together.

HashMap::entry together with Entry::or_insert is the "look up; insert default; mutate" pattern from chapter 8.
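
If the entry one-liner still feels dense, it may help to see it next to a spelled-out version. The sketch below uses two hypothetical helpers, bump and bump_verbose, that should leave the map in the same state; the verbose one is only there for comparison, not as a recommendation.

```rust
use std::collections::HashMap;

/// Compact form: look up the key, insert 0 if it is missing, then
/// increment through the returned mutable reference.
fn bump(counts: &mut HashMap<String, usize>, key: &str) {
    *counts.entry(key.to_string()).or_insert(0) += 1;
}

/// Spelled-out form of the same idea: increment an existing counter,
/// or insert a fresh one starting at 1.
fn bump_verbose(counts: &mut HashMap<String, usize>, key: &str) {
    match counts.get_mut(key) {
        Some(n) => *n += 1,
        None => {
            counts.insert(key.to_string(), 1);
        }
    }
}
```

The entry version is shorter and has no separate "missing key" branch to get wrong, which is a large part of why it is the idiomatic choice.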

Exercise 1 of 4

The most common word

Now that you can count, finding the maximum is a one-liner, almost. The borrow checker has an opinion about returning data out of a HashMap, and that's the real lesson of this step.

count_words is duplicated below as a todo!() stub so this step compiles in isolation; you don't need to fill it in again. Focus on most_common_word. Once you have it, the test will drive both through unwrap().

Useful from the standard library

HashMap::into_iter consumes the map and yields owned (K, V) pairs. That's how you get an owned String out without cloning.

Iterator::max_by_key returns the entry with the largest derived key as an Option. max_by_key(|(_, count)| *count) does the trick here.

An empty input naturally produces None: count_words returns an empty map, into_iter().max_by_key(...) returns None, and the function signature already says Option<(String, usize)>. No special case needed.
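
Here is a small sketch of that into_iter plus max_by_key combination on a map that already exists. The function name top_entry is hypothetical and it takes the map by value, which is an assumption made to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Returns the owned entry with the largest count, or None for an
/// empty map.
fn top_entry(counts: HashMap<String, usize>) -> Option<(String, usize)> {
    // into_iter() moves each (String, usize) pair out of the map, so
    // the winning key can be returned without cloning it.
    counts.into_iter().max_by_key(|(_, count)| *count)
}
```

One thing to keep in mind: if several words share the highest count, which of them comes back depends on the map's iteration order, and that order is unspecified for a HashMap.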

Exercise 2 of 4

use std::collections::HashMap;

/// Counts how many times each word appears in the text.
/// Words are separated by whitespace and should be case-insensitive.
fn count_words(text: &str) -> HashMap<String, usize> {
    todo!()
}

/// Finds the most common word in the text.
/// Returns the word and its count, or None if text is empty.
///
/// Tip: this is the function where the borrow checker pushes back. To
/// return `(String, usize)` you need to own the key, but `iter()` on
/// a `HashMap` only hands out borrows. The trick is
/// [`into_iter`](https://doc.rust-lang.org/std/collections/struct.HashMap.html#method.into_iter):
/// it consumes the map and yields `(K, V)` pairs by value, so combining
/// it with `max_by_key` gives you back an owned `(String, usize)`.
fn most_common_word(text: &str) -> Option<(String, usize)> {
    // Use count_words() then find the max by count
    todo!()
}

#[test]
fn test_most_common_word() {
    let text = "apple banana apple cherry apple";
    let (word, count) = most_common_word(text).unwrap();
    assert_eq!(word, "apple");
    assert_eq!(count, 3);
}

Filtering frequent words

A different type of result: instead of one winning word, return every word whose count meets some threshold. The natural pipeline is count_words(text).into_iter().filter(...).map(...).collect(), and the collect infers Vec<String> from the return type.

As before, count_words is stubbed with todo!() so this step compiles standalone. Your work is in frequent_words.

Useful from the standard library

HashMap::into_iter hands out owned (String, usize) pairs, so the resulting Vec doesn't have to clone anything.

Iterator::filter keeps the pairs whose count is high enough. Destructure the tuple with |(_, count)| to ignore the word and look only at the count.

Iterator::map drops the count and keeps just the word, so collect can build a Vec<String>.

HashMap iteration order is unspecified; if your test ever relies on a particular order, sort the result first.
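
As a sketch of the whole pipeline with a deterministic result, the hypothetical helper below works on an existing map and sorts before returning. Sorting alphabetically is just one possible choice for making the order stable.

```rust
use std::collections::HashMap;

/// Keeps every word whose count is at least `min`, sorted
/// alphabetically so the output does not depend on map order.
fn words_at_least(counts: HashMap<String, usize>, min: usize) -> Vec<String> {
    let mut words: Vec<String> = counts
        .into_iter()
        .filter(|(_, count)| *count >= min) // keep frequent pairs
        .map(|(word, _)| word) // drop the count, keep the owned word
        .collect();
    words.sort();
    words
}
```

Because into_iter yields owned pairs, the map step can move each String straight into the Vec with no clone.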

Exercise 3 of 4

Text statistics

The orchestrator step. text_stats returns three numbers about a piece of text: total word count, number of unique words, and the average word length as an f64. You can compute all three from a single pass over count_words's result, or split the work; either is fine.

count_words is stubbed with todo!() again so this file compiles on its own. Wire text_stats up however you like. The test only cares about the returned tuple.

Useful from the standard library

The total word count is the sum of every value in the map: counts.values().sum::<usize>().

The unique-word count is counts.len().

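For the average length, sum key.chars().count() * count across the map (or sum the word lengths straight from a fresh text.split_whitespace() pass) and divide by the total word count. word.len() gives the same number for ASCII-only text, but it counts bytes, not characters. Watch the integer-division trap: cast both operands to f64 before the divide. For empty text both operands are 0, and 0.0 / 0.0 is NaN.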

HashMap::values and HashMap::iter are the two iterator entry points you'll likely use here.
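
The three aggregations can be sketched as small helpers over a borrowed map. The function names here are hypothetical, and splitting the work into three functions is only one of the layouts the exercise allows.

```rust
use std::collections::HashMap;

/// Total number of words: the sum of every count in the map.
fn total_words(counts: &HashMap<String, usize>) -> usize {
    counts.values().sum()
}

/// Total characters across all occurrences: each key's character
/// count multiplied by how many times that word appeared.
fn total_chars(counts: &HashMap<String, usize>) -> usize {
    counts
        .iter()
        .map(|(word, count)| word.chars().count() * count)
        .sum()
}

/// Average word length; both operands are cast to f64 before the
/// divide. Returns NaN for an empty map (0.0 / 0.0).
fn average_len(counts: &HashMap<String, usize>) -> f64 {
    total_chars(counts) as f64 / total_words(counts) as f64
}
```

The choice of chars().count() over len() only matters for non-ASCII words: "caf\u{e9}" has 4 characters but 5 bytes.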

Exercise 4 of 4

use std::collections::HashMap;

/// Counts how many times each word appears in the text.
/// Words are separated by whitespace and should be case-insensitive.
fn count_words(text: &str) -> HashMap<String, usize> {
    todo!()
}

/// Calculates basic text statistics.
/// Returns (`total_words`, `unique_words`, `average_word_length`).
///
/// In real code you'd reach for a `struct TextStats { total: usize,
/// unique: usize, avg_len: f64 }` here; a 3-tuple is hard to read at
/// the call site. We're sticking with a tuple to keep the focus on the
/// iterator chain in the body.
fn text_stats(text: &str) -> (usize, usize, f64) {
    todo!()
}

#[test]
fn test_text_stats() {
    let text = "hello world rust";
    let (total, unique, avg_len) = text_stats(text);
    assert_eq!(total, 3);
    assert_eq!(unique, 3);
    assert!((avg_len - 4.67).abs() < 0.1); // Average length ≈ 4.67
    // Side note: floats don't compare exactly (the value here is
    // really 14/3 = 4.666..., from 5 + 5 + 4 letters over 3 words),
    // so we check that we're close enough by taking the absolute
    // difference and comparing to a tolerance.
    // Direct `==` on `f64` is almost always the wrong thing.
}
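
The tolerance check from the test can be pulled out into a tiny helper. This is a sketch with a hypothetical name, approx_eq; picking a sensible eps is up to the caller and depends on the magnitudes involved.

```rust
/// Returns true when `a` and `b` differ by less than `eps`.
fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() < eps
}
```

The classic demonstration is 0.1 + 0.2: compared with == against 0.3 it is false, because the sum is not exactly representable, while approx_eq(0.1 + 0.2, 0.3, 1e-9) is true.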

Wrapping up the word counter

You glued together the chapters so far: a HashMap keyed by lowercased words, an into_iter() to escape the borrow checker, a max_by_key to pick a winner, a filter/map/collect to slice the map, and a few aggregations to compute summary stats.

What we learned

split_whitespace() is the right default for word-splitting in natural text. It collapses runs of whitespace and skips empties.

Lowercasing keys (or any other normalization step) belongs to the same pipeline that builds the map, not to the consumer side.

into_iter on a HashMap is the standard escape hatch when you need to return owned data out of it. iter only hands out borrows.

max_by_key returns an Option, so empty input naturally collapses to None without a special-case branch.

"Filter, then map, then collect" composes the same way over a HashMap as it does over a Vec. The collection on either end is just where the items live.

Watch the integer-division trap when computing averages: cast to f64 first, then divide. f64 comparisons need a tolerance ((a - b).abs() < eps); direct == is almost always wrong.

Tuples like (usize, usize, f64) work for tiny ad-hoc returns, but a named struct (TextStats { total, unique, avg_len }) reads better at the call site as soon as a function grows in scope.
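
For comparison, here is roughly what the struct version could look like. The struct and field names follow the ones suggested above; the into_stats helper is hypothetical and only exists to show the conversion from the tuple form.

```rust
/// Named-struct alternative to the (usize, usize, f64) tuple.
#[derive(Debug, Clone, PartialEq)]
struct TextStats {
    total: usize,
    unique: usize,
    avg_len: f64,
}

/// Wraps an existing tuple result so call sites can use field names
/// instead of positions.
fn into_stats((total, unique, avg_len): (usize, usize, f64)) -> TextStats {
    TextStats { total, unique, avg_len }
}
```

At the call site, stats.unique says what it is, whereas the second element of a tuple has to be looked up in the function's documentation.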
