Christopher Sardegna's Blog

Thoughts on technology, design, data analysis, and data visualization.


Uncovering a Blocking Syscall

Uncovering a Blocking Syscall

Logria is a program that uses a terminal emulator as its front end. As with any front end, responding quickly to user input is important to making software enjoyable to use.

The Problem

When rendering to a shell, Logria looks like this:


    WARNING:root:I am a first log! 57                                 
    WARNING:root:I am a first log! 78                                 
    WARNING:root:I am a first log! 3                                  
    WARNING:root:I am a first log! 96                                 
    WARNING:root:I am a first log! 27                                 
    WARNING:root:I am a first log! 93                                 
    WARNING:root:I am a first log! 69                                 
    WARNING:root:I am a first log! 3                                  
    WARNING:root:I am a first log! 6                                  
    WARNING:root:I am a first log! 18                                 
    WARNING:root:I am a first log! 78                                 
    WARNING:root:I am a first log! 51                                 
    WARNING:root:I am a first log! 99                                 
    WARNING:root:I am a first log! 84                                 
    ┌─────────────────────────────────────────────────────────────────┐
    │Regex with pattern /WARNING/                                     │
    └─────────────────────────────────────────────────────────────────┘
    

For the app to get to this state, the user must create a session that reads a log stream, then enter a regex to filter the stream.

Naturally, all of those actions require user input. Reacting slowly—or not at all—is unacceptable. Logria is designed to prioritize user input, but, despite that design, some keystrokes are ignored.

Specific Behavior

When a session contained a command that used ssh, some keystrokes would not appear when entered. For example, typing / to enter regex mode and then typing WARNING might drop some characters, resulting in the expected WARNING search yielding WRNG. Similarly, the /, Enter, or Esc keystrokes may not be received.

The missing keystrokes do not arrive later; they are simply ignored.

Logria Internals

To determine why keystrokes get dropped, we must first understand what Logria is doing during its render process. After some initial setup steps that are unaffected by the input bug, Logria's main thread enters a render loop that looks something like this1:


    loop {
        let num_new_messages = self.receive_streams();
    
        if poll(Duration::from_millis(self.config.poll_rate))? {
            match read()? {
                Event::Key(input) => {
                    // Input is handled here
                }
                Event::Mouse(_) => {...}
                Event::Resize(_, _) => {...}
            }
        }
    
        if num_new_messages > 0 {
            self.render_text_in_output()?;
        }
    }
    

The poll() and read() functions come from crossterm; the rest of the code is internal to Logria.

Broadly, Logria checks the streams it is attached to for messages and reads those messages into its internal buffer, keeping track of how many messages it received. It then waits up to a defined period of time for a user to enter something, then triggers a render.

Blocking Input

Blocking input is problematic and can be solved by separating the blocking work into separate threads2. To determine where best to offload work, we need to first understand where Logria is blocked3.

There are two spots in this code that block the input poll() call: receive_streams() and render_text_in_output(). To determine which of these may be blocking the loop, let's add profiling code to emit timing information:


    let mut output = File::create("logria.log").unwrap();
    loop {
        let t_0 = Instant::now();
        let num_new_messages = self.receive_streams();
        writeln!(output, "{} in {:?}", num_new_messages, t_0.elapsed())?;
    
        // ... poll was here ... 
    
        let t_0 = Instant::now();
        if num_new_messages > 0 {
            self.render_text_in_output()?;
            writeln!(output, "Render in {:?}", t_0.elapsed())?;
        }
    }
    

However, the emitted data reveals that these calls happen so fast that they could not possibly last long enough to block the input loop:


    2 in 66µs
    Render in 476.875µs
    1 in 19.541µs
    Render in 597.875µs
    0 in 47.833µs
    0 in 10.5µs
    1 in 24.666µs
    Render in 126.458µs
    1 in 9.166µs
    Render in 275.333µs
    4 in 21.541µs
    Render in 3.375µs
    3 in 26.166µs
    Render in 116.541µs
    1 in 12.25µs
    

These calls take microseconds to complete4.

Logria creates separate threads to read from the processes it monitors. It then uses channels to send the output of those processes to the render thread. Because monitoring the processes is offloaded to a separate thread, this call does not last long enough to block the loop.

Rendering to the screen, while heavier than reading into the message buffer, is still quite fast, averaging approximately 260 µs. This is because of several optimizations discovered in the Python prototype. For example, when possible, renders are skipped or escaped early.

Diving Deeper

With the newly-added logs5, we can see that when the character does not get rendered, the output logs stop, indicating that the loop is blocked by something. The only remaining part of this process is the input handler, so let's add logs there:


    let mut output = File::create("logria.log").unwrap();
    loop {
        // self.receive_streams()
    
        writeln!(output, "Before input poll")?;
        if poll(Duration::from_millis(self.config.poll_rate))? {
            writeln!(output, "After input poll")?;
            match read()? {
                Event::Key(input) => {
                    writeln!(output, "Got key: {:?}", input)?;
                    // Input is handled here
                }
                Event::Mouse(_) => {...}
                Event::Resize(_, _) => {...}
            }
        } else {
            writeln!(output, "No input")?;
        }
    
        // self.render_text_in_output()
    }
    

After replicating the problem, we see that execution stops inside of the poll() call:


    4 in 21.541µs
    Before input poll
    No input
    Render in 3.375µs
    3 in 26.166µs
    Before input poll
    No input
    Render in 116.541µs
    1 in 12.25µs
    Before input poll
    

Stepping into the source for poll(), we can see crossterm invokes an internal poll function:


    if let Some(reader) = try_lock_internal_event_reader_for(timeout) { ... }
    

That function attempts to acquire a temporary lock on a shared event reader:


    Some(MutexGuard::map(
        INTERNAL_EVENT_READER.try_lock_for(duration)?,
        |reader| reader.get_or_insert_with(InternalEventReader::default),
    

This succeeds and invokes an internal poll() Waker. The first thing that Waker does is trigger a read from a Unix Event:


    let maybe_event = match event_source.try_read(poll_timeout.leftover()) { ... }
    

This event sends a buffer to fill with data from a predefined file descriptor.


    match self.tty_fd.read(&mut self.tty_buffer, TTY_BUFFER_SIZE) { ... }
    

In the self.tty_fd.read() call, crossterm invokes the read() syscall on that file descriptor in an unsafe 6 block:


    let result = unsafe {
        libc::read(
            self.fd,
            buffer.as_mut_ptr() as *mut libc::c_void,
            size as size_t,
        ) as isize
    };
    

This is meant to fill self.tty_buffer with bytes read from stdin.

According to the read() specification:

If some process has the pipe open for writing and O_NONBLOCK is clear, read() shall block the calling thread until some data is written or the pipe is closed by all processes that had the pipe open for writing.

Thus, libc::read() is likely blocking because the file descriptor for stdin is opened by one of Logria's child processes. Even though the function invoking this has a timeout, since libc::read() never finishes, control is never handed back for the timeout to execute.

Reproducible Example

Logria is too complex to test this libc::read() theory, so we need a small example to prove that this is the case. The code only needs to have two components:

The example project's main.rs ends up looking like this:


    use std::process::{Command, Stdio};
    use std::thread;
    use std::time::Duration;
    
    use crossterm::{
        event::{poll, read, Event, KeyCode},
        terminal::{disable_raw_mode, enable_raw_mode},
        Result,
    };
    
    fn main() -> Result<()> {
        enable_raw_mode()?;
        
        // Create a thread that wants to consume `stdin`
        // Mimic the Logria Command creation 1:1
        let thread = thread::spawn(|| {
            Command::new("cat")
                .arg("-")
                .stdout(Stdio::piped())
                .stderr(Stdio::piped())
                .spawn()
                .unwrap();
        });
    
        
        loop {
            eprintln!("Before poll");
            if poll(Duration::from_millis(1000))? {
                eprintln!("Poll passed");
                match read()? {
                    Event::Key(k) => {
                        println!("{:?}", k);
                    }
                    Event::Mouse(_) => todo!(),
                    Event::Resize(_, _) => todo!(),
                }
            } else {
                eprintln!("No input");
            }
        }
    
        thread.join().unwrap();
        Ok(())
    }
    

To further confirm, let's modify the code around the libc::read() syscall to log when it runs:


    let result = unsafe {
        eprintln!("Attempting to read {} bytes from {}\n", size, self.fd);
        let res = libc::read(
            self.fd,
            buffer.as_mut_ptr() as *mut libc::c_void,
            size as size_t,
        ) as isize;
        eprintln!("Finished read!\n");
        res
    };
    

Running this sample project and pressing any key replicates the same freezing behavior Logria exhibits:


    Before poll
    No input
    Before poll
    No input
    Before poll
    No input
    Before poll
    Attempting to read 1204 bytes from 0
    

0 is the file descriptor for stdin.

Fixing the Example

Now that we know the bug lies with something taking control over stdin, we can (sort of) see the bug:


    let thread = thread::spawn(|| {
        Command::new("cat")
            .arg("-")
            .stdout(Stdio::piped())
            .stderr(Stdio::piped())
            .spawn()
            .unwrap();
    });
    

Technically, we cannot see it, because something is missing! According to the Command::new() docs, the child process inherits stdin/stdout/stderr from the parent process. Since Logria (and this example) only set .stdout() and .stderr() in the builder, stdin is left to inherit from the parent process. After setting .stdin(Stdio::null()), the issue is resolved.

Fixing Logria

The same patch is simple to apply to Logria:


    let mut proc_read = match Command::new(command_to_run[0])
        .args(&command_to_run[1..])
        .current_dir(current_dir().unwrap())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .stdin(Stdio::null()) // Do not capture stdin
        .spawn() { ... }
    

When running previously broken sessions, Logria now reacts to input as expected.

Conclusion

The session that broke Logria's input contained a ssh connection running via aws ssm. This ticks the stdin box, because ssh wants to pass stdin to the connected machine.

While one could technically add -n to the ssh invocation, this patch prevents input from ever getting sent to Logria's child processes.

This solution also maintains the simplicity of a single render thread, unlike shared memory solutions that would significantly increase the complexity of this tool.


  1. The real one is located here. ↩︎

  2. The Rust community has contributed many good resources for learning about shared-state concurrency. ↩︎

  3. Spoiler: I jumped the gun here and decided to write a multithreaded implementation of this render logic only to realize it did not solve the problem. ↩︎

  4. While this may technically block receiving input, the delay is so small that it does not affect the application. We can prove this by checking how fast we can press a key here. On my MacBook Pro, I was unable to get under 20ms, which is 33 times slower than the longest render process above. My normal typing speed leaves keys activated for around 100ms, or 167 times slower than the worst case. ↩︎

  5. I used logging instead of a debugger here because the process causing this was behind an authentication wall that VSCode did not play well with. ↩︎

  6. Even though Logria uses #![forbid(unsafe_code)]! ↩︎