CSV looks easy until you hit quoted fields with commas inside, or escaped quotes inside quoted fields. The trick is to walk the input character-by-character while tracking a small amount of state: "am I currently inside a quoted field?"
This kind of "for each character, update some state, occasionally emit a result" pattern is called a state machine. It comes up in any non-trivial parsing task: JSON, command-line arguments, terminal escape sequences, markup languages.
fn parse(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        match (c, in_quotes) {
            ('"', false) => in_quotes = true,
            ('"', true) if chars.peek() == Some(&'"') => {
                // Escaped quote inside a quoted field.
                current.push('"');
                chars.next();
            }
            ('"', true) => in_quotes = false,
            (',', false) => {
                fields.push(std::mem::take(&mut current));
            }
            (c, _) => current.push(c),
        }
    }
    fields.push(current);
    fields
}
Three things worth pointing out:
- peekable() lets you look at the next character without consuming it. Essential when one character's meaning depends on the one after it (the "" -> " rule here).
- match on a tuple (c, in_quotes) lets you express each transition as one arm. Easier to read than nested if/else.
- std::mem::take gives you the current string and replaces it with an empty one in a single move. No clone, no temporary.
while let
while let Some(c) = chars.next() { ... } is the loop counterpart of
if let from chapter 10. It keeps running as long as the pattern
matches, and stops as soon as it doesn't. Iterators return None at
the end, so while let Some(...) is a natural fit when you need more
control than a for loop gives you (here we want to call
chars.next() again inside the loop body to consume the second ").
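As a separate illustration of pulling an extra item inside a while let body, here is a minimal sketch that expands backslash escapes. The unescape helper and its escape rules are assumptions invented for this example; they are not part of the exercise.

```rust
/// Hypothetical helper: expands `\n` into a newline and drops the backslash
/// in front of any other character (so `\\` becomes a single backslash).
fn unescape(input: &str) -> String {
    let mut out = String::new();
    let mut chars = input.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Pull the character after the backslash from the same iterator.
            match chars.next() {
                Some('n') => out.push('\n'),
                Some(other) => out.push(other),
                // A trailing backslash has nothing to escape: keep it as-is.
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}
```

With for c in chars the loop would own the iterator for its whole duration, so the inner chars.next() call would not compile. while let only borrows it for each individual call.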
Tuple matching like match (c, in_quotes) { ... } is the same idea as
chapter 9's let (a, b) = pair, just used as a match scrutinee. The
arms then pattern-match both elements at once, and the guards (chapter
11's if chars.peek() == Some(&'"')) do the rest.
The tests in this chapter also lean heavily on raw strings
(r#"..."#, introduced in chapter 21) so the CSV examples can contain
literal commas and quotes without an escape forest.
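To make the "escape forest" concrete, the sketch below writes two of the CSV lines discussed in this chapter both ways. The function names are made up for illustration; the only thing being shown is that each raw-string form is the same string as the escaped one.

```rust
/// The CSV line `"a,b",c` written with ordinary string escapes.
fn quoted_comma_escaped() -> &'static str {
    "\"a,b\",c"
}

/// The same line as a raw string: every quote is written literally.
fn quoted_comma_raw() -> &'static str {
    r#""a,b",c"#
}

/// The CSV line `"a""b",c`, where the CSV-level `""` escape stacks on top
/// of the Rust-level `\"` escape.
fn doubled_quote_escaped() -> &'static str {
    "\"a\"\"b\",c"
}

/// The same line as a raw string.
fn doubled_quote_raw() -> &'static str {
    r#""a""b",c"#
}
```

The raw forms can be pasted straight from a CSV file, which is what keeps the test inputs readable.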
When stateful parsing gets hairy, write the simple version first
(split_once, split(',')) and let the easy tests pass. Then upgrade
to the state-machine version for the harder cases. Failing tests give you
concrete examples to think against, instead of trying to imagine every
edge case up front.
For real CSV in production code, reach for the
csv crate; it handles all the corners that this
exercise glosses over. But knowing how to write a state machine yourself
is a transferable skill.
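To illustrate how the technique transfers, here is a rough sketch that applies the same flag-plus-match shape to a different problem: splitting a command line into arguments. The split_args function and its quoting rules (double quotes only, no backslash escapes) are simplifying assumptions for this example, not a description of how any real shell behaves.

```rust
/// Hypothetical helper: splits `input` on spaces, treating a double-quoted
/// run as part of a single argument.
fn split_args(input: &str) -> Vec<String> {
    let mut args = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    // Tracks whether an argument has begun, so that `""` produces an empty
    // argument while a run of spaces produces nothing.
    let mut started = false;

    for c in input.chars() {
        match (c, in_quotes) {
            // A quote toggles quoted mode and always starts an argument.
            ('"', _) => {
                in_quotes = !in_quotes;
                started = true;
            }
            // A space outside quotes ends the current argument, if any.
            (' ', false) => {
                if started {
                    args.push(std::mem::take(&mut current));
                    started = false;
                }
            }
            // Everything else, including spaces inside quotes, is content.
            (c, _) => {
                current.push(c);
                started = true;
            }
        }
    }
    if started {
        args.push(current);
    }
    args
}
```

No lookahead is needed here, so a plain for loop is enough; the state lives entirely in the two flags.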
Before tackling the messy realities of CSV (quotes, escapes, embedded commas), let's handle the trivial case: a line that's nothing but plain values separated by commas, possibly with surrounding whitespace. This is what most "I'll just split on commas" CSV parsers do, and it's also why so many of them break.
Use str::split
and str::trim.
Collect into a Vec<String>.
Useful from the standard library
- str::split with a ',' argument yields each comma-separated piece as a &str.
- str::trim drops leading/trailing whitespace from each piece.
- str::to_string in a map step turns the borrowed pieces into the owned Strings the return type wants.
- Iterator::collect finishes the chain. The body fits on one line: line.split(',').map(|s| s.trim().to_string()).collect().
There is a method on &str that splits on a delimiter and gives you
an iterator. Combine it with trim and collect.
line.split(',').map(|s| s.trim().to_string()).collect()
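Wrapped in a function, the naive version might look like the sketch below. The name split_simple is an assumption for this example (the exercise may use a different one); the body is the one-liner from above.

```rust
/// Naive line parser: split on commas, trim whitespace, own each piece.
/// `split_simple` is an illustrative name, not one the exercise prescribes.
fn split_simple(line: &str) -> Vec<String> {
    line.split(',').map(|s| s.trim().to_string()).collect()
}
```

Calling split_simple(" a , b,c ") gives the three fields a, b and c. Feeding it the line "a,b",c shows why this approach breaks: the comma inside the quotes is treated as a separator, so the result has three pieces ("a then b" then c), with the quote characters still attached.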
Real CSV is a state machine in disguise. A field can be wrapped in
double quotes, in which case any commas inside the quotes are part
of the field, not separators. And a literal " inside a quoted
field is encoded as "" (two quotes).
Suggested order of attack:
1. Plain a,b,c and simply quoted "a","b","c" (the basic test).
2. A comma inside quotes: "a,b",c.
3. An escaped quote: "a""b",c -> [a"b, c].
Walk the string character by character with a peekable iterator and
keep a small in_quotes: bool flag. When you see " while already
inside quotes, peek the next char: if it's another ", push a
literal " and consume both; otherwise close the field.
Useful from the standard library
- str::chars is the entry point for character-level iteration.
- Iterator::peekable wraps the iterator so you can look ahead one character. Essential for the "" -> " rule.
- Peekable::peek returns Option<&Item> without advancing.
- std::mem::take swaps the current String with a fresh empty one in a single move. Cleaner than current.clone() followed by current.clear().
- A match (c, in_quotes) on the tuple lets you express each state transition as a single arm. Add a guard (if chars.peek() == Some(&'"')) for the escape rule.
A single bool (in_quotes) is enough state. Walk the input with
line.chars().peekable() so you can look one character ahead.
Match on the tuple (c, in_quotes). There are only five interesting
cases:
- ('"', false) → enter quoted mode.
- ('"', true) and the next char is also " → push a literal ", consume the second one with chars.next().
- ('"', true) → exit quoted mode.
- (',', false) → finish the current field, start a new one.
- (c, _) → anything else: push the character onto the current field.
Use std::mem::take(&mut current) to harvest a field without cloning.
With a working line parser, the file-level parser is mostly plumbing: split on newlines, treat the first line as headers, and parse the rest as data rows.
Use str::lines
to split: it handles trailing newlines gracefully, so "a,b\n"
gives one line, not two.
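The difference is easiest to see next to a plain split on '\n'. The two helper functions below are illustrative wrappers (their names are not part of the exercise) around str::lines and str::split.

```rust
/// Collects the output of `str::lines` so it can be inspected.
fn line_list(content: &str) -> Vec<&str> {
    content.lines().collect()
}

/// Collects the output of splitting on '\n' for comparison.
fn split_on_newline(content: &str) -> Vec<&str> {
    content.split('\n').collect()
}
```

For the input "a,b\n", line_list returns the single line a,b, while split_on_newline returns two pieces: a,b followed by an empty string. line_list also strips the \r from Windows-style \r\n endings, and returns no lines at all for an empty input.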
This step composes on top of parse_csv_line from the previous
step. To keep each step independently runnable, the signature is
re-declared here as a stub with todo!(). Replace it with your
solution from step 4 (or just call into it).
Useful from the standard library
- str::lines yields each line as a &str, stripping \n and \r\n. A trailing newline does not create an empty trailing line.
- Iterator::next on the iterator pulls off the header line; an empty file should return empty headers and rows.
- Iterator::map with parse_csv_line over the remaining lines builds the rows.
- Iterator::collect to materialize both the headers and the rows into Vecs.
content.lines() gives you an iterator over &str lines.
next() on the iterator pulls the first one off; the rest you can map(parse_csv_line).collect().
fn parse_csv_file(content: &str) -> (Vec<String>, Vec<Vec<String>>) {
    let mut lines = content.lines();
    let headers = lines.next().map(parse_csv_line).unwrap_or_default();
    let rows = lines.map(parse_csv_line).collect();
    (headers, rows)
}
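To try the file-level parser on its own, the sketch below pairs it with a stand-in parse_csv_line. The stand-in is the naive split/trim version, chosen only so the example compiles by itself; it is an assumption of this sketch, and the quote-aware parser should take its place in real use.

```rust
/// Stand-in line parser (naive split/trim) so this sketch is self-contained.
/// Swap in the quote-aware version for real input.
fn parse_csv_line(line: &str) -> Vec<String> {
    line.split(',').map(|s| s.trim().to_string()).collect()
}

/// First line becomes the headers, every remaining line becomes a row.
fn parse_csv_file(content: &str) -> (Vec<String>, Vec<Vec<String>>) {
    let mut lines = content.lines();
    // `unwrap_or_default` turns "no first line" into an empty header Vec.
    let headers = lines.next().map(parse_csv_line).unwrap_or_default();
    let rows = lines.map(parse_csv_line).collect();
    (headers, rows)
}
```

With the input "name,age\nAda,36\nLinus,54\n" this yields the headers name and age plus two rows. An empty input yields empty headers and no rows. One thing to watch with this stand-in: a blank line in the middle of the file becomes a row containing a single empty field, because splitting an empty string still produces one piece.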
Parallel Vecs of headers and row values are awkward to consume.
Most code wants to ask "what's the name for this row?" A job for
a HashMap<String, String> per row.
Pair headers with each row using
Iterator::zip
and collect into a HashMap. You'll need cloned() on both
iterators because the map wants owned Strings but iteration
yields &String.
Useful from the standard library
- Iterator::zip pairs items from two iterators. Stops at the shorter of the two, which silently drops the unmatched headers or fields when a row has the wrong arity.
- Iterator::cloned turns &String items into owned Strings. Apply on both sides of the zip so the HashMap ends up owning its keys and values.
- Iterator::collect on a (K, V) iterator builds a HashMap straight from the type annotation. The outer map then collects per-row maps into the final Vec.
- HashMap::get is what callers will use afterwards: record.get("name").
zip the headers with the values to get
(header, value) pairs, then collect::<HashMap<_, _>>().
You will be cloning Strings into the map. That's expected here: the
function takes shared slices.
use std::collections::HashMap;
fn csv_to_records(
    headers: &[String],
    rows: &[Vec<String>],
) -> Vec<HashMap<String, String>> {
    rows.iter()
        .map(|row| {
            headers.iter().cloned()
                .zip(row.iter().cloned())
                .collect()
        })
        .collect()
}
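Because zip stops at the shorter side, rows with the wrong number of fields pass through without any error. The sketch below repeats csv_to_records so it compiles on its own and adds mismatched_rows, a hypothetical helper (not part of the exercise) that reports which rows would lose or lack data.

```rust
use std::collections::HashMap;

/// Pairs each row's values with the headers, one HashMap per row.
fn csv_to_records(headers: &[String], rows: &[Vec<String>]) -> Vec<HashMap<String, String>> {
    rows.iter()
        .map(|row| headers.iter().cloned().zip(row.iter().cloned()).collect())
        .collect()
}

/// Hypothetical helper: indices of rows whose field count differs from the
/// number of headers. `zip` would silently truncate these.
fn mismatched_rows(headers: &[String], rows: &[Vec<String>]) -> Vec<usize> {
    rows.iter()
        .enumerate()
        .filter(|(_, row)| row.len() != headers.len())
        .map(|(i, _)| i)
        .collect()
}
```

With headers name and age, a row Ada,36 produces a record where record.get("name") returns Ada. A row with a third field keeps only the first two, and a row with just one field produces a record with no age key at all. Running mismatched_rows first is one way to surface both situations before converting.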
You wrote the easy version of CSV with split and trim, then
upgraded to a real state machine that handles quoted fields and
escaped quotes, glued lines into headers + rows, and converted
those rows into HashMap records.
What we learned
- Stateful parsing comes up well beyond CSV (JSON, command lines, terminal escape sequences). The routine is always: walk the input character by character, keep a small flag (or enum) of current state, occasionally emit a result.
- A peekable iterator is the standard tool for "what comes next?" decisions like the "" -> " escape rule. Iterator::peekable costs almost nothing in practice: it buffers at most one item.
- match (token, state) { ... } over a tuple expresses each state transition in one line. Match guards (if cond) handle the cases where the transition depends on the lookahead.
- std::mem::take(&mut s) gives you the current value and replaces it with its Default in one move. Cleaner than clone-then-clear when you're harvesting an accumulator.
- The simple split/trim version is worth writing first. It passes the easy tests and gives you a baseline; the state-machine upgrade then has concrete failing cases to react to.
- iter.zip(other).collect::<HashMap<_, _>>() is the standard "two parallel sequences -> a map" move. Add cloned() on each side when the map needs owned data.
- For real CSV in production, reach for the csv crate: it handles all the corner cases (BOMs, custom delimiters, newlines inside quoted fields) this exercise glosses over. Writing the parser by hand once is still worth doing for the transferable state-machine technique.
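The first point mentions keeping an enum instead of a flag. As a rough sketch of what that could look like, the variant below replaces in_quotes with a two-state enum. The State type and the name parse_csv_line_enum are assumptions made for this example; the transitions are the same five as in the bool version.

```rust
/// Where the parser currently is. Named states replace the `in_quotes` flag.
#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Unquoted,
    Quoted,
}

/// Quote-aware line parser, written against an explicit state enum.
fn parse_csv_line_enum(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut state = State::Unquoted;
    let mut chars = line.chars().peekable();

    while let Some(c) = chars.next() {
        match (c, state) {
            // Opening quote: enter quoted mode.
            ('"', State::Unquoted) => state = State::Quoted,
            // `""` inside quotes: emit one literal quote, skip the second.
            ('"', State::Quoted) if chars.peek() == Some(&'"') => {
                current.push('"');
                chars.next();
            }
            // Closing quote: leave quoted mode.
            ('"', State::Quoted) => state = State::Unquoted,
            // Separator outside quotes: finish the field.
            (',', State::Unquoted) => fields.push(std::mem::take(&mut current)),
            // Anything else is field content.
            (other, _) => current.push(other),
        }
    }
    fields.push(current);
    fields
}
```

For two states the enum buys little over a bool, but it scales better: adding a third state (say, "just after a closing quote") becomes a new variant rather than a second flag whose combinations have to be kept consistent by hand.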