Chapter 22

State Machines and Stateful Parsing


CSV looks easy until you hit quoted fields with commas inside, or escaped quotes inside quoted fields. The trick is to walk the input character-by-character while tracking a small amount of state: "am I currently inside a quoted field?"

This kind of "for each character, update some state, occasionally emit a result" pattern is called a state machine. It comes up in any non-trivial parsing task: JSON, command-line arguments, terminal escape sequences, markup languages.

A skeleton

fn parse(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();

    while let Some(c) = chars.next() {
        match (c, in_quotes) {
            ('"', false) => in_quotes = true,
            ('"', true) if chars.peek() == Some(&'"') => {
                // Escaped quote inside a quoted field.
                current.push('"');
                chars.next();
            }
            ('"', true) => in_quotes = false,
            (',', false) => {
                fields.push(std::mem::take(&mut current));
            }
            (c, _) => current.push(c),
        }
    }
    fields.push(current);
    fields
}

Two things in this skeleton are worth pointing out: the while let loop and the match on a tuple.

A note on while let

while let Some(c) = chars.next() { ... } is the loop counterpart of if let from chapter 10. It keeps running as long as the pattern matches, and stops as soon as it doesn't. Iterators return None at the end, so while let Some(...) is a natural fit when you need more control than a for loop gives you (here we want to call chars.next() again inside the loop body to consume the second ").
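To see the pattern in isolation, here is a minimal sketch unrelated to CSV. The helper name collapse_spaces is made up for illustration; it uses while let so that the loop body can pull extra items off the same iterator.

```rust
/// Collapses each run of consecutive spaces into a single space.
/// `collapse_spaces` is a hypothetical helper, not part of the exercise.
fn collapse_spaces(input: &str) -> String {
    let mut out = String::new();
    let mut chars = input.chars().peekable();

    while let Some(c) = chars.next() {
        out.push(c);
        if c == ' ' {
            // Consume the rest of the run from inside the loop body.
            // A `for` loop would hold on to the iterator, so we could not
            // call `chars.next()` here.
            while chars.peek() == Some(&' ') {
                chars.next();
            }
        }
    }
    out
}
```

Calling collapse_spaces("a   b  c") returns "a b c". The inner loop is the part a for loop could not express.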

Tuple matching like match (c, in_quotes) { ... } is the same idea as chapter 9's let (a, b) = pair, just used as a match scrutinee. The arms then pattern-match both elements at once, and the guards (chapter 11's if chars.peek() == Some(&'"')) do the rest.
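As a small illustration of matching on a tuple with a guard, the sketch below labels a character depending on whether we are inside quotes. The function describe and its labels are invented for this example, and it deliberately ignores the escaped-quote rule.

```rust
/// Labels a character given the current quoting state.
/// `describe` is a hypothetical helper used only to show the match shape.
fn describe(c: char, in_quotes: bool) -> &'static str {
    match (c, in_quotes) {
        // Both tuple elements are matched at once.
        ('"', false) => "opening quote",
        ('"', true) => "quote inside a quoted field",
        (',', false) => "field separator",
        // The guard narrows an otherwise broad pattern.
        (c, false) if c.is_whitespace() => "whitespace outside quotes",
        _ => "ordinary character",
    }
}
```

Note that describe(',', true) falls through to "ordinary character": a comma inside quotes matches none of the specific arms, which is exactly the behaviour the CSV parser relies on.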

The tests in this chapter also lean heavily on raw strings (r#"..."#, introduced in chapter 21) so the CSV examples can contain literal commas and quotes without an escape forest.
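For comparison, here is the same CSV line written both ways. The function name escaped_quote_example is hypothetical; it only exists so the two literals can be compared.

```rust
/// Returns the CSV line `"a""b",c` written twice: once with backslash
/// escapes and once as a raw string. Both literals hold the same text.
fn escaped_quote_example() -> (&'static str, &'static str) {
    // Ordinary literal: every double quote needs a backslash.
    let with_escapes = "\"a\"\"b\",c";
    // Raw literal: quotes are written as-is between r#" and "#.
    let raw = r#""a""b",c"#;
    (with_escapes, raw)
}
```

The two strings are equal; the raw form is just easier to read against the CSV it represents.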

A useful tactic

When stateful parsing gets hairy, write the simple version first (split_once, split(',')) and let the easy tests pass. Then upgrade to the state-machine version for the harder cases. Failing tests give you concrete examples to think against, instead of trying to imagine every edge case up front.

For real CSV in production code, reach for the csv crate; it handles all the corners that this exercise glosses over. But knowing how to write a state machine yourself is a transferable skill.

A first pass: comma-splitting

Before tackling the messy realities of CSV (quotes, escapes, embedded commas), let's handle the trivial case: a line that's nothing but plain values separated by commas, possibly with surrounding whitespace. This is what most "I'll just split on commas" CSV parsers do, and it's also why so many of them break.

Use str::split and str::trim. Collect into a Vec<String>.

Useful from the standard library

  • str::split with a ',' argument yields each comma-separated piece as a &str.
  • str::trim drops leading/trailing whitespace from each piece.
  • str::to_string in a map step turns the borrowed pieces into the owned Strings the return type wants.
  • Iterator::collect finishes the chain. The body fits on one line: line.split(',').map(|s| s.trim().to_string()).collect().
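
Put together, the one-liner might look like the sketch below. The name split_simple is an assumption (the exercise may use a different one). It is also a handy way to see where this approach stops working.

```rust
/// Splits on commas and trims each piece. No quote handling at all.
/// `split_simple` is a hypothetical name for the first-pass parser.
fn split_simple(line: &str) -> Vec<String> {
    line.split(',').map(|s| s.trim().to_string()).collect()
}
```

On " a, b ,c " this yields the three fields a, b and c. On the quoted input "a,b",c it also yields three pieces, with the quote characters still attached, because the comma inside the quotes is treated as a separator. That is the failure the state machine fixes.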
Exercise 1 of 4
    Hints:
    1. There's a method on &str that splits on a delimiter and gives you an iterator. Combine it with trim and collect.
    2. line.split(',').map(|s| s.trim().to_string()).collect()
      

    Quotes, embedded commas, and escapes

    Real CSV is a state machine in disguise. A field can be wrapped in double quotes, in which case any commas inside the quotes are part of the field, not separators. And a literal " inside a quoted field is encoded as "" (two quotes).

    Suggested order of attack:

    1. Plain a,b,c and simply quoted "a","b","c" (the basic test).
    2. Commas inside quoted fields: "a,b",c.
    3. Escaped quotes: "a""b",c -> [a"b, c].

    Walk the string character by character with a peekable iterator and keep a small in_quotes: bool flag. When you see " while already inside quotes, peek the next char: if it's another ", push a literal " and consume both; otherwise close the field.

    Useful from the standard library

    • str::chars is the entry point for character-level iteration.
    • Iterator::peekable wraps the iterator so you can look ahead one character. Essential for the "" -> " rule.
    • Peekable::peek returns Option<&Item> without advancing.
    • std::mem::take swaps the current String with a fresh empty one in a single move. Cleaner than current.clone() followed by current.clear().
    • A match (c, in_quotes) on the tuple lets you express each state transition as a single arm. Add a guard (if chars.peek() == Some(&'"')) for the escape rule.
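
The std::mem::take idiom is easier to judge on a toy accumulator. The sketch below splits on semicolons; split_on_semicolons is a made-up helper, chosen so it does not give away the CSV solution.

```rust
use std::mem;

/// Splits `input` on ';' using an accumulator that is harvested with
/// `mem::take`. `split_on_semicolons` is a hypothetical helper.
fn split_on_semicolons(input: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();

    for c in input.chars() {
        if c == ';' {
            // Moves the accumulated String out and leaves an empty one
            // behind, without copying the characters.
            parts.push(mem::take(&mut current));
        } else {
            current.push(c);
        }
    }
    // The last piece has no trailing ';', so push it after the loop.
    parts.push(current);
    parts
}
```

The final push after the loop is the same step the CSV parser needs: without it the last field would be lost.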
    Exercise 2 of 4
      Hints:
      1. A single bool (in_quotes) is enough state. Walk the input with line.chars().peekable() so you can look one character ahead.
      2. Match on the tuple (c, in_quotes). There are only five interesting cases:
        • ('"', false) → enter quoted mode.
        • ('"', true) and the next char is also " → push a literal ", consume the second one with chars.next().
        • ('"', true) → exit quoted mode.
        • (',', false) → finish the current field, start a new one.
        • anything else → push the character into the current field.
      3. After the loop, push the final field. Use std::mem::take(&mut current) to harvest a field without cloning.
      4. The full skeleton is in the chapter intro; if you've read it and are still stuck, copy the skeleton verbatim and run the tests. The compiler errors will tell you what's left to wire up.

      Parsing a whole file

      With a working line parser, the file-level parser is mostly plumbing: split on newlines, treat the first line as headers, and parse the rest as data rows.

      Use str::lines to split: it handles trailing newlines gracefully, so "a,b\n" gives one line, not two.
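
A quick way to convince yourself is to compare str::lines with a plain split on '\n'. The helper lines_vs_split below is hypothetical and only returns both results side by side.

```rust
/// Returns the pieces produced by `str::lines` and by `split('\n')`
/// for the same input, so the two can be compared.
fn lines_vs_split(content: &str) -> (Vec<&str>, Vec<&str>) {
    let by_lines = content.lines().collect();
    let by_split = content.split('\n').collect();
    (by_lines, by_split)
}
```

For "a,b\n", lines gives a single piece while split('\n') gives two, the second being empty. A blank line in the middle of the input still shows up as an empty &str with both approaches, so the file parser will see it.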

      This step builds on parse_csv_line from the previous step. To keep each step independently runnable, the signature is re-declared here as a stub with todo!(). Replace the stub with your solution from the previous step.

      Useful from the standard library

      • str::lines yields each line as a &str, stripping \n and \r\n. A trailing newline does not create an empty trailing line.
      • Iterator::next on the iterator pulls off the header line; an empty file should return empty headers and rows.
      • Iterator::map with parse_csv_line over the remaining lines builds the rows.
      • Iterator::collect to materialize both the headers and the rows into Vecs.
      Exercise 3 of 4
        Hints:
        1. content.lines() gives you an iterator over &str lines.
        2. The first line is headers; the rest are rows. next() on the iterator pulls the first one off; the rest you can map(parse_csv_line).collect().
        3. fn parse_csv_file(content: &str) -> (Vec<String>, Vec<Vec<String>>) {
              let mut lines = content.lines();
              let headers = lines.next().map(parse_csv_line).unwrap_or_default();
              let rows = lines.map(parse_csv_line).collect();
              (headers, rows)
          }
          

        From rows to records

        Parallel Vecs of headers and row values are awkward to consume. Most code wants to ask "what's the name for this row?", which is a job for a HashMap<String, String> per row.

        Pair headers with each row using Iterator::zip and collect into a HashMap. You'll need cloned() on both iterators because the map wants owned Strings but iteration yields &String.

        Useful from the standard library

        • Iterator::zip pairs items from two iterators. Stops at the shorter of the two, which silently drops trailing fields when a row has the wrong arity.
        • Iterator::cloned turns &String items into owned Strings. Apply on both sides of the zip so the HashMap ends up owning its keys and values.
        • Iterator::collect on a (K, V) iterator builds a HashMap straight from the type annotation. The outer map then collects per-row maps into the final Vec.
        • HashMap::get is what callers will use afterwards: record.get("name").
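
The truncating behaviour of zip is worth seeing once on a single row. The helper to_record below is an assumption made for this sketch; the exercise itself works on all rows at once.

```rust
use std::collections::HashMap;

/// Pairs headers with the values of one row. Extra values (or extra
/// headers) are dropped because `zip` stops at the shorter side.
/// `to_record` is a hypothetical single-row helper.
fn to_record(headers: &[String], row: &[String]) -> HashMap<String, String> {
    headers
        .iter()
        .cloned() // &String -> String for the keys
        .zip(row.iter().cloned()) // &String -> String for the values
        .collect()
}
```

With headers name and age, a row of three values produces a map with two entries and the third value is silently discarded. A row with only one value produces a map with no age key at all. If that matters for your data, compare the lengths before zipping.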
        Exercise 4 of 4
          Hints:
          1. For each row, zip the headers with the values to get (header, value) pairs, then collect::<HashMap<_, _>>().
          2. You'll be cloning Strings into the map. That's expected here: the function takes shared slices, so it can't move the Strings out of them.
          3. use std::collections::HashMap;
            
            fn csv_to_records(
                headers: &[String],
                rows: &[Vec<String>],
            ) -> Vec<HashMap<String, String>> {
                rows.iter()
                    .map(|row| {
                        headers.iter().cloned()
                            .zip(row.iter().cloned())
                            .collect()
                    })
                    .collect()
            }
            

          Wrapping up the CSV parser

          You wrote the easy version of CSV with split and trim, then upgraded to a real state machine that handles quoted fields and escaped quotes, glued lines into headers + rows, and converted those rows into HashMap records.

          What we learned

          • Stateful parsing comes up well beyond CSV (JSON, command lines, terminal escape sequences). The routine is always: walk the input character by character, keep a small flag (or enum) of current state, occasionally emit a result.
          • A peekable iterator is the standard tool for "what comes next?" decisions like the "" -> " escape rule. Iterator::peekable is cheap: it only buffers the one item you peeked at.
          • match (token, state) { ... } over a tuple expresses each state transition in one line. Match guards (if cond) handle the cases where the transition depends on the lookahead.
          • std::mem::take(&mut s) gives you the current value and replaces it with the type's Default value (an empty String here) in one move. Cleaner than clone-then-clear when you're harvesting an accumulator.
          • The simple split/trim version is worth writing first. It passes the easy tests and gives you a baseline; the state-machine upgrade then has concrete failing cases to react to.
          • iter.zip(other).collect::<HashMap<_, _>>() is the standard "two parallel sequences -> a map" move. Add cloned() on each side when the map needs owned data.
          • For real CSV in production, reach for the csv crate: it handles all the corner cases (BOMs, custom delimiters, escaped newlines inside fields) this exercise glosses over. Writing the parser by hand once is still worth doing for the transferable state-machine technique.
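
The "flag (or enum)" remark can be made concrete. The sketch below is one possible variant of the line parser that replaces the bool plus lookahead with a three-state enum. The names State, QuoteSeen and parse_with_enum are invented here, and this is not the chapter's reference solution.

```rust
#[derive(Clone, Copy)]
enum State {
    Unquoted,
    Quoted,
    /// Saw a `"` while in `Quoted`; the next character decides whether
    /// it was the first half of `""` or the closing quote.
    QuoteSeen,
}

/// Hypothetical enum-based variant of the CSV line parser.
fn parse_with_enum(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut state = State::Unquoted;

    for c in line.chars() {
        state = match (c, state) {
            ('"', State::Unquoted) => State::Quoted,
            ('"', State::Quoted) => State::QuoteSeen,
            // Second quote of a `""` pair: emit one literal quote.
            ('"', State::QuoteSeen) => {
                current.push('"');
                State::Quoted
            }
            // A comma ends the field unless we are inside quotes.
            (',', State::Unquoted) | (',', State::QuoteSeen) => {
                fields.push(std::mem::take(&mut current));
                State::Unquoted
            }
            // Everything else inside quotes is field content.
            (c, State::Quoted) => {
                current.push(c);
                State::Quoted
            }
            // Any other character outside quotes.
            (c, _) => {
                current.push(c);
                State::Unquoted
            }
        };
    }
    fields.push(current);
    fields
}
```

Because the pending quote is remembered in the state, a plain for loop is enough and no peekable iterator is needed. The trade-off is one more state to reason about.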