Time to extend our running word-count example. Back in chapter 5 you
built word_count, char_count, and longest_word with simple for
loops. Chapter 16 then showed how iterators collapse those into
one-liners. This chapter goes one level deeper: instead of asking how
many words a text has, we'll ask which words appear and how often
each one shows up.
There's no big new concept here either: this chapter puts iterators, hashmaps,
and Option to work together. Along the way you'll meet two new
iterator tricks (max_by_key and HashMap::into_iter); the rest is
just applying what's already in your toolbox.
A few patterns you'll likely use:
Splitting text into words. Both split_whitespace and split return
iterators of &str. The first handles any kind of whitespace and skips
empties, which is usually what you want for natural text:
for word in "hello world\nrust".split_whitespace() {
    println!("{word}"); // prints hello, world, rust (one per line)
}
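To see why split_whitespace is usually the better default, here is a small sketch that runs both methods on the same messy input. The helper names split_on_space and split_words are made up for this illustration; they are not part of the exercises.

```rust
// Hypothetical helper: split on a single space character only.
fn split_on_space(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}

// Hypothetical helper: split on any run of whitespace.
fn split_words(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

fn main() {
    // Two spaces after "hello", then a newline followed by a space.
    let text = "hello  world\n rust";

    // split(' ') cuts at every single space, so the double space produces
    // an empty piece and the newline stays attached to "world".
    println!("{:?}", split_on_space(text)); // ["hello", "", "world\n", "rust"]

    // split_whitespace treats spaces, tabs, and newlines alike and never
    // yields empty pieces.
    println!("{:?}", split_words(text)); // ["hello", "world", "rust"]
}
```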
Counting things into a HashMap. Reach for entry(...).or_insert(0):
use std::collections::HashMap;

let mut counts: HashMap<String, usize> = HashMap::new();
for word in text.split_whitespace() {
    // Insert 0 the first time a word is seen, then bump the counter.
    *counts.entry(word.to_lowercase()).or_insert(0) += 1;
}
Finding the maximum by some property. max_by_key is the right tool
for "give me the entry with the largest count":
let top = counts.iter().max_by_key(|(_, count)| *count);
// top: Option<(&String, &usize)>
Filtering and collecting. Same iterator chain you saw in the previous chapter:
let frequent: Vec<String> = counts
    .iter()
    .filter(|(_, n)| **n >= min) // n is a &&usize here
    .map(|(word, _)| word.clone())
    .collect();
Computing an average. Sum the lengths and divide by the count, but cast both to f64 first so integer division doesn't truncate the result (note that len() counts bytes, not characters):
let total_chars: usize = words.iter().map(|w| w.len()).sum();
let avg = total_chars as f64 / words.len() as f64;
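The trap is easier to believe once you see both results side by side. The sketch below is only an illustration: average_len is a made-up helper, and returning None for an empty slice is one possible choice (dividing 0.0 by 0.0 would otherwise give NaN).

```rust
// Hypothetical helper: average length in bytes, or None for empty input.
fn average_len(words: &[&str]) -> Option<f64> {
    if words.is_empty() {
        return None; // avoid 0.0 / 0.0, which is NaN
    }
    let total: usize = words.iter().map(|w| w.len()).sum();
    // Cast both operands before dividing so the fraction survives.
    Some(total as f64 / words.len() as f64)
}

fn main() {
    let words = ["a", "bb", "cc"]; // 5 bytes in total, 3 words
    let total: usize = words.iter().map(|w| w.len()).sum();

    // Integer division happens first and throws the fraction away.
    let truncated = (total / words.len()) as f64;
    println!("{truncated:?}"); // 1.0

    // Casting first keeps the fraction: roughly 1.67.
    println!("{:?}", average_len(&words));
}
```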
The foundation for everything else in this chapter: take a string of
text and produce a HashMap<String, usize> that maps each word to
how many times it appears. Words are separated by whitespace and the
count should be case-insensitive: "Hello" and "hello" are the
same word.
The classic recipe is: split on whitespace, lowercase each piece,
then walk the resulting iterator and bump a counter in the map. The
entry API on HashMap is the idiomatic way to do that last step:
*map.entry(key).or_insert(0) += 1.
Useful from the standard library
- str::split_whitespace splits on any whitespace and skips empty pieces. Almost always what you want for word splitting.
- str::to_lowercase returns a fresh String. Use it as the map key so Hello and hello collapse together.
- HashMap::entry with Entry::or_insert is the "look up; insert default; mutate" pattern from chapter 8.
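Putting the recipe together, one possible shape for the function looks like this. The signature (a &str in, a HashMap<String, usize> out) is an assumption based on the description above; treat it as a sketch to compare against, not the only valid answer.

```rust
use std::collections::HashMap;

// Assumed signature: borrow the text, return an owned map of counts.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        // Lowercase before using the word as a key so that "Hello" and
        // "hello" land in the same bucket.
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = count_words("Hello hello world");
    println!("{:?}", counts.get("hello")); // Some(2)
    println!("{:?}", counts.get("world")); // Some(1)
}
```

The same loop can be written as a fold over the iterator, but the explicit for loop keeps the entry call easy to read.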
Now that you can count, finding the maximum is almost a one-liner.
The borrow checker has an opinion about returning data out of a
HashMap, and that's the real lesson of this step.
count_words is duplicated below as a todo!() stub so this step
compiles in isolation; you don't need to fill it in again. Focus on
most_common_word. Once you have it, the test will drive both
through unwrap().
Useful from the standard library
- HashMap::into_iter consumes the map and yields owned (K, V) pairs. That's how you get an owned String out without cloning.
- Iterator::max_by_key returns the entry with the largest derived key as an Option. max_by_key(|(_, count)| *count) does the trick here.
- An empty input naturally produces None: count_words returns an empty map, into_iter().max_by_key(...) returns None, and the function signature already says Option<(String, usize)>. No special case needed.
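Here is a minimal sketch of how those pieces could fit. The count_words body and the &str parameter are assumptions repeated so the example compiles on its own. The map is a local variable, so anything borrowed from it through iter() could not outlive the function; into_iter() sidesteps that by moving the String out.

```rust
use std::collections::HashMap;

// Repeated here (assumed implementation) so the sketch is self-contained.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn most_common_word(text: &str) -> Option<(String, usize)> {
    count_words(text)
        .into_iter() // consume the map: items are owned (String, usize)
        .max_by_key(|(_, count)| *count) // None when the map is empty
}

fn main() {
    println!("{:?}", most_common_word("the cat and the hat")); // Some(("the", 2))
    println!("{:?}", most_common_word("")); // None
}
```

One caveat: when two words tie for the highest count, max_by_key returns the last maximal item it sees, and HashMap iteration order is unspecified, so the winner of a tie is effectively arbitrary.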
A different type of result: instead of one winning word, return
every word whose count meets some threshold. The natural pipeline is
count_words(text).into_iter().filter(...).map(...).collect(), and
the collect infers Vec<String> from the return type.
As before, count_words is stubbed with todo!() so this step
compiles standalone. Your work is in frequent_words.
Useful from the standard library
- HashMap::into_iter hands out owned (String, usize) pairs, so the resulting Vec doesn't have to clone anything.
- Iterator::filter keeps the pairs whose count is high enough. Destructure the tuple with |(_, count)| to ignore the word and look only at the count.
- Iterator::map drops the count and keeps just the word, so collect can build a Vec<String>.
- HashMap iteration order is unspecified; if your test ever relies on a particular order, sort the result first.
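The pipeline described above might look like the following sketch. The parameter name min, the >= comparison, and the final sort are assumptions: the sort is only there to make the output deterministic, and the exercise may not require it.

```rust
use std::collections::HashMap;

// Repeated here (assumed implementation) so the sketch is self-contained.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn frequent_words(text: &str, min: usize) -> Vec<String> {
    let mut words: Vec<String> = count_words(text)
        .into_iter()
        .filter(|(_, count)| *count >= min) // keep counts at or above min
        .map(|(word, _)| word) // move the owned String out, no clone
        .collect();
    words.sort(); // optional: HashMap order is unspecified
    words
}

fn main() {
    println!("{:?}", frequent_words("a b a c b a", 2)); // ["a", "b"]
}
```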
The orchestrator step. text_stats returns three numbers about a
piece of text: total word count, number of unique words, and the
average word length as an f64. You can compute all three from a
single pass over count_words's result, or split the work; either
is fine.
count_words is stubbed with todo!() again so this file compiles
on its own. Wire text_stats up however you like. The test only
cares about the returned tuple.
Useful from the standard library
- The total word count is the sum of every value in the map: counts.values().sum::<usize>().
- The unique-word count is counts.len().
- For the average length, sum key.chars().count() * count across the map (or sum word.len() straight from a fresh text.split_whitespace() pass) and divide by the total. Note that len() counts bytes while chars().count() counts characters, so the two only agree on ASCII text. Watch the integer-division trap: cast both operands to f64 before the divide.
- HashMap::values and HashMap::iter are the two iterator entry points you'll likely use here.
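As a rough sketch, the three numbers can all come from the map itself. The tuple order (total, unique, average) follows the description above; returning 0.0 as the average for empty input is an assumption made here to avoid a NaN, and your tests may expect something different.

```rust
use std::collections::HashMap;

// Repeated here (assumed implementation) so the sketch is self-contained.
fn count_words(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn text_stats(text: &str) -> (usize, usize, f64) {
    let counts = count_words(text);
    let total: usize = counts.values().sum();
    let unique = counts.len();
    // Each distinct word contributes its character count once per occurrence.
    let total_chars: usize = counts
        .iter()
        .map(|(word, count)| word.chars().count() * count)
        .sum();
    // Assumption: report 0.0 for empty input instead of dividing by zero.
    let avg_len = if total == 0 {
        0.0
    } else {
        total_chars as f64 / total as f64
    };
    (total, unique, avg_len)
}

fn main() {
    // aa x2, bbbb x1, c x1: 4 words, 3 unique, 9 characters in total.
    println!("{:?}", text_stats("aa bbbb AA c")); // (4, 3, 2.25)
}
```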
You glued together the chapters so far: a HashMap keyed by
lowercased words, an into_iter() to escape the borrow checker, a
max_by_key to pick a winner, a filter/map/collect to slice
the map, and a few aggregations to compute summary stats.
What we learned
- split_whitespace() is the right default for word-splitting in natural text. It collapses runs of whitespace and skips empties.
- Lowercasing keys (or any other normalization step) belongs to the same pipeline that builds the map, not to the consumer side.
- into_iter on a HashMap is the standard escape hatch when you need to return owned data out of it. iter only hands out borrows.
- max_by_key returns an Option, so empty input naturally collapses to None without a special-case branch.
- "Filter, then map, then collect" composes the same way over a HashMap as it does over a Vec. The collection on either end is just where the items live.
- Watch the integer-division trap when computing averages: divide after casting to f64, not before.
- f64 comparisons need a tolerance ((a - b).abs() < eps); never ==.
- Tuples like (usize, usize, f64) work for tiny ad-hoc returns, but a named struct (TextStats { total, unique, avg_len }) reads better at the call site as soon as a function grows in scope.
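The last two points are easy to show in a few lines. This is only a sketch: the TextStats field names come from the note above, while approx_eq and the 1e-9 tolerance are illustrative choices, not something the chapter's tests are known to use.

```rust
// Named fields document themselves at the call site.
#[derive(Debug, Clone, PartialEq)]
struct TextStats {
    total: usize,
    unique: usize,
    avg_len: f64,
}

// Hypothetical helper: compare two floats within a tolerance.
fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() < eps
}

fn main() {
    let stats = TextStats { total: 4, unique: 3, avg_len: 9.0 / 4.0 };
    println!("{} words, {} unique", stats.total, stats.unique);
    println!("{}", approx_eq(stats.avg_len, 2.25, 1e-9)); // true

    // The classic reason never to use == on floats:
    let sum = 0.1 + 0.2;
    println!("{}", sum == 0.3); // false
    println!("{}", approx_eq(sum, 0.3, 1e-9)); // true
}
```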