Intro to Rust
Author’s Note
I originally wrote this document as part of a presentation I gave for one of our weekly ’lunch and learn’ meetings at my AWS org. There are tons of great learning materials for the Rust programming language, most maintained by experts with far more experience in the language than I. But I wanted to provide a brief, technical overview to some of my peers at Amazon, one that wasn’t just describing language basics.
Though Rust adoption within Amazon has been quite strong, it’s by no means universal; there are still many people who don’t have visibility into the language and the problems that it tries to solve. And for good reason, too; as exciting as the language is, no one can magically rewrite every legacy codebase in Rust (nor should they). So, the following is a brief overview of the language, and some of the reasons why I think experienced developers should get excited about it.
Intro
Rust is a programming language that aims to give the ergonomics and flexibility of a high-level language, while delivering the performance and low-level control of a lower-level language like C++. In fact, to really understand the problems that Rust solves with its design, we can look at classic problems that arise when using C++.
Here’s a classic example. This code will compile perfectly in C++:
std::vector<std::string> vec = { "hello", "world" };
auto& first = vec[0];
vec.push_back("from c++"); // might reallocate
std::cout << first << std::endl; // undefined behavior at runtime
The same would be true if we had used an iterator, or even a raw pointer, instead of a reference. In every case, a reallocation of the vector silently invalidates the reference. There are no warnings or complaints from the compiler; from its perspective, this is perfectly valid C++.
If we try the same code in Rust, our experience is much nicer:
// explicitly mutable
let mut vec = vec![String::from("hello"), String::from("world")];
let first = &vec[0];
vec.push(String::from("from rust"));
println!("{}", first);
The compiler doesn’t just stop us, it gives us a very nice explanation of what’s going wrong:
error[E0502]: cannot borrow `vec` as mutable because it is also borrowed as immutable
 --> test.rs:4:5
  |
3 |     let first = &vec[0];
  |                  --- immutable borrow occurs here
4 |     vec.push(String::from("from rust"));
  |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ mutable borrow occurs here
5 |     println!("{}", first);
  |                    ----- immutable borrow later used here
error: aborting due to 1 previous error
For more information about this error, try `rustc --explain E0502`.
It’s not only enforcing the reference’s validity, it’s also giving us a very clear picture of what’s going wrong, and even linking us to more information on that type of error!
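As a rough sketch of how the snippet above might be restructured so that it compiles: either take an owned copy of the element, or finish using the reference before mutating the vector. The function names `clone_then_push` and `read_then_push` are illustrative and not part of the original example.

```rust
// Option 1: clone the element, so no borrow of `vec` is outstanding
// when we push.
fn clone_then_push() -> (String, usize) {
    let mut vec = vec![String::from("hello"), String::from("world")];
    let first = vec[0].clone(); // owned copy, independent of `vec`
    vec.push(String::from("from rust"));
    (first, vec.len())
}

// Option 2: finish reading through the reference before mutating.
fn read_then_push() -> (String, usize) {
    let mut vec = vec![String::from("hello"), String::from("world")];
    let first = &vec[0];
    let seen = format!("{}", first); // last use of the immutable borrow
    vec.push(String::from("from rust")); // fine: the borrow has ended
    (seen, vec.len())
}

fn main() {
    println!("{:?}", clone_then_push());
    println!("{:?}", read_then_push());
}
```

Which option fits depends on whether the extra allocation of a clone is acceptable; the second option costs nothing at runtime but constrains the order of operations.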
More than just nice compiler errors, what Rust is actually doing for us here is enforcing a set of rules at compile time using its ‘borrow checker’. The borrow checker enforces the concept of ‘ownership’, which is one of the central features of Rust. With ownership, safe Rust is able to eliminate entire classes of bugs statically, at compile time. These include, but are not limited to:
- data races
- double free
- use-after-free
- null pointer dereference
- dangling pointers
The tradeoff is that the language is a lot more opinionated about how you manage your memory. Compile times are longer, and you might often find yourself structuring code in a much different way than you would have (in a language like C++) to satisfy the borrow checker. But, once the code compiles, you get much stronger guarantees about safety than you do with a C++ binary.
Ownership
In C/C++, an allocation is very easy to get wrong. Tools like Valgrind remain relevant because it’s very easy to simply forget to free something that you’ve malloc’d. This is because in these languages, the memory will exist until you explicitly call free(). There is no external garbage collector that will automatically clean up memory for you, and paying the cost of reference-counted pointers for every heap allocation will erode performance.
In Rust, the mental model for memory management is different. Every value has an owner, and that owner is directly responsible for cleaning up that memory. When the owner goes out of scope, the value is automatically dropped, or freed. There is no manual delete keyword or free() library call.
fn main() {
let data = vec![ 1, 2, 3, 4, 5 ]; // heap-allocated vector
println!("{:?}", data);
} // data goes out of scope here, automatically dropped.
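To make the "dropped when the owner goes out of scope" rule observable, here is a minimal sketch that records drops in a shared log. The `Tracked` type and `drop_order` function are hypothetical helpers invented for this illustration; only the `Drop` trait and `drop` function come from the standard library.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Hypothetical type that records its name in a shared log when dropped.
struct Tracked {
    name: &'static str,
    log: Rc<RefCell<Vec<&'static str>>>,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

// Returns the names in the order their owners released them.
fn drop_order() -> Vec<&'static str> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _outer = Tracked { name: "outer", log: Rc::clone(&log) };
        {
            let _inner = Tracked { name: "inner", log: Rc::clone(&log) };
        } // `_inner` goes out of scope and is dropped here
        let moved = Tracked { name: "moved", log: Rc::clone(&log) };
        drop(moved); // giving up ownership early drops the value immediately
    } // `_outer` is dropped here
    let order = log.borrow().clone();
    order
}

fn main() {
    println!("{:?}", drop_order());
}
```

The point of the sketch is that cleanup is tied to scopes and ownership transfers, not to an explicit free call.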
In this sense, since data automatically expires at the end of a scope, rather than remembering when to free your memory, it’s your job to explicitly extend the memory’s lifetime for as long as it’s needed. The other important property of ownership is that it uses move semantics by default. In C++:
CustomObject data;         // default-constructed object
CustomObject other = data; // implicit copy
// Moves have to be requested explicitly.
// Make sure you #include <memory> and <utility>
std::unique_ptr<CustomObject> ptr = std::make_unique<CustomObject>();
auto other_ptr = std::move(ptr); // ptr is now null!
In Rust:
let data = CustomObject::new();
let other = data; // implicit move, data no longer valid
println!("{}", other); // valid!
println!("{}", data); // compiler error: use of moved value
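A self-contained version of the same idea, as a sketch: `CustomObject` here is a stand-in struct with a single made-up field, since the original type is not defined in the text. It shows an explicit `clone`, a move, and the fact that small `Copy` types such as integers are copied rather than moved.

```rust
// Stand-in for the `CustomObject` used above (the `id` field is an assumption).
#[derive(Debug, Clone, PartialEq)]
struct CustomObject {
    id: u32,
}

fn move_and_clone() -> (CustomObject, CustomObject) {
    let data = CustomObject { id: 7 };
    let copy = data.clone(); // explicit copy; `data` is still valid
    let other = data; // move; `data` can no longer be used
    // println!("{:?}", data); // would not compile: value used after move
    (other, copy)
}

fn copy_types() -> (i32, i32) {
    let a = 5;
    let b = a; // i32 implements Copy, so `a` stays usable
    (a, b)
}

fn main() {
    println!("{:?}", move_and_clone());
    println!("{:?}", copy_types());
}
```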
While this is great in principle, it quickly becomes a problem when several parts of a program need access to data that can only have a single owner. To solve this, Rust has references, but they’re not quite the same as the references in C++.
fn takes_and_returns_ownership(mut msg: String) -> String { // `mut` is required to modify msg
    msg.push_str("' OR 1=1 --");
    msg // last expression of a scope is implicitly returned
} // since msg is returned, value is not dropped
fn takes_reference(msg: &mut String) {
    msg.push_str("'; DROP TABLE pls_dont_delete; --");
    // msg not dropped here
}
In the first case, the caller passes ownership of the msg into the function. However, since it’s returned, ownership passes back to the caller (and is then subject to the lifetime of the caller).
In the second case, we pass a reference to the function, in this case a mutable reference. The reference is guaranteed to be valid for the scope of the function, and at the end of the function, it’s no longer valid. The caller retains ownership of the value of msg throughout this whole exchange, even though msg is mutated.
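Seen from the caller’s side, the two styles look like this. This is a small sketch: the function names mirror the ones above, but the appended strings and the `demo` wrapper are made up for illustration.

```rust
fn takes_and_returns_ownership(mut msg: String) -> String {
    msg.push_str(" (owned)");
    msg
}

fn takes_reference(msg: &mut String) {
    msg.push_str(" (borrowed)");
}

fn demo() -> String {
    let msg = String::from("hello");
    // Ownership moves into the function and comes back via the return value.
    let mut msg = takes_and_returns_ownership(msg);
    // Only a mutable borrow is handed out; `msg` stays owned by `demo`.
    takes_reference(&mut msg);
    msg
}

fn main() {
    println!("{}", demo());
}
```

Note that the caller has to write `&mut msg` explicitly, so a function that may mutate its argument is visible at the call site.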
How does Rust guarantee that the reference is valid? How is a Rust reference different from a C++ reference? The difference lies in the following rule: you can have any number of immutable borrows, or a single mutable borrow, but never both at the same time. This effectively eliminates data races and invalid iterators at compile time.
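fn main() {
    let mut data = vec![1, 2, 3];
    let r1 = &data; // this is fine
    let r2 = &data; // also fine, second immutable borrow
    // let r3 = &mut data; // compiler error: mutable borrow while r1 and r2 are still in use
    println!("{:?} {:?}", r1, r2); // last use of the immutable borrows
    // if we want to copy data, we need to be explicit
    let mut r3 = data.clone(); // must be declared `mut` to borrow it mutably
    let r4 = &mut r3; // this is fine! r3 is a separate vector
    r4.push(4);
}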
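The same rule is what rules out data races across threads: read-only borrows can be shared freely, while mutation has to go through a synchronization type. A minimal sketch, assuming Rust 1.63 or later for `std::thread::scope`; the `parallel_sum` function is an illustrative name, not something from the original text.

```rust
use std::sync::Mutex;
use std::thread;

// Sums a slice using two threads that share read-only borrows of the data.
fn parallel_sum(data: &[i32]) -> i32 {
    let total = Mutex::new(0); // shared mutation must go through the lock
    let (left, right) = data.split_at(data.len() / 2);
    thread::scope(|s| {
        s.spawn(|| {
            let part: i32 = left.iter().sum();
            *total.lock().unwrap() += part;
        });
        s.spawn(|| {
            let part: i32 = right.iter().sum();
            *total.lock().unwrap() += part;
        });
    }); // both threads are joined before `scope` returns
    total.into_inner().unwrap()
}

fn main() {
    println!("{}", parallel_sum(&[1, 2, 3, 4, 5]));
}
```

Replacing the `Mutex` with a plain `let mut total = 0` that both closures modify is rejected by the compiler, because it would require two mutable borrows at once.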
Let’s revisit this C++ example from earlier:
std::vector<std::string> vec = { "hello", "world" };
auto& first = vec[0];
vec.push_back("from c++"); // might reallocate
std::cout << first << std::endl; // undefined behavior at runtime
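Rust’s borrowing rule is exactly what rejects this pattern. A related example in Rust, where we try to modify a vector while iterating over it: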
let mut vec = vec![1, 2, 3, 4, 5];
for item in &vec { // immutable borrow iterator loop
if *item == 3 {
vec.remove(3); // compiler error;
// we would need a mutable borrow here
// but we already have immutable borrows
}
}
error[E0502]: cannot borrow `vec` as mutable because it is also borrowed as immutable
 --> test.rs:5:13
  |
3 |     for item in &vec {
  |                 ----
  |                 |
  |                 immutable borrow occurs here
  |                 immutable borrow later used here
4 |         if *item == 3 {
5 |             vec.remove(3);
  |             ^^^^^^^^^^^^^ mutable borrow occurs here
error: aborting due to 1 previous error
For more information about this error, try `rustc --explain E0502`.
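For completeness, here is a sketch of two ways this kind of loop is commonly rewritten so that the borrow checker accepts it. It assumes the intent was to remove elements equal to a given value, which may not be exactly what the loop above was meant to do; `remove_all` and `remove_first` are illustrative names.

```rust
// Remove every element equal to `target` in place.
fn remove_all(mut values: Vec<i32>, target: i32) -> Vec<i32> {
    values.retain(|item| *item != target);
    values
}

// Remove only the first element equal to `target`.
fn remove_first(mut values: Vec<i32>, target: i32) -> Vec<i32> {
    // Look the index up first; the immutable borrow ends with this statement.
    let index = values.iter().position(|item| *item == target);
    if let Some(index) = index {
        values.remove(index); // mutable borrow, no iterator is alive anymore
    }
    values
}

fn main() {
    println!("{:?}", remove_all(vec![1, 2, 3, 4, 3], 3));
    println!("{:?}", remove_first(vec![1, 2, 3, 4, 3], 3));
}
```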
This isn’t a perfect setup. Managing references in Rust can be tricky, as references also have their own concept of lifetimes which can become confusing to manage. While the compiler does its best to try to handle reference lifetimes, there is still a steep learning curve to the ownership model. The ownership model also forces Rust programmers to use different patterns than they would normally use; heavily self-referential data structures like graphs are much more challenging in Rust, and will likely use different patterns (like arena allocation) than in C++.
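To give a feel for the arena-style pattern mentioned above, here is a minimal sketch of a graph whose nodes live in a single `Vec` and refer to each other by index instead of by reference. The `Node` and `Graph` types and their methods are made up for this illustration; real projects often reach for a dedicated arena or graph crate instead.

```rust
// Nodes live in one Vec (the "arena"); edges are plain indices into it.
struct Node {
    value: i32,
    edges: Vec<usize>,
}

struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    fn new() -> Self {
        Graph { nodes: Vec::new() }
    }

    // Adds a node and returns its index, which acts as a handle.
    fn add_node(&mut self, value: i32) -> usize {
        self.nodes.push(Node { value, edges: Vec::new() });
        self.nodes.len() - 1
    }

    fn add_edge(&mut self, from: usize, to: usize) {
        self.nodes[from].edges.push(to);
    }

    // Sum of the values of all nodes directly reachable from `id`.
    fn neighbor_sum(&self, id: usize) -> i32 {
        self.nodes[id].edges.iter().map(|&n| self.nodes[n].value).sum()
    }
}

// Builds a small graph containing a cycle: a -> b -> c -> a, plus a -> c.
fn build_cycle() -> Graph {
    let mut g = Graph::new();
    let a = g.add_node(1);
    let b = g.add_node(2);
    let c = g.add_node(3);
    g.add_edge(a, b);
    g.add_edge(b, c);
    g.add_edge(c, a);
    g.add_edge(a, c);
    g
}

fn main() {
    let g = build_cycle();
    println!("{}", g.neighbor_sum(0));
}
```

Because the graph owns every node and edges are just integers, cycles need no reference juggling; the cost is that a stale index is a logic error the compiler cannot catch.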
With ownership, the tradeoff is that you get really good compile time safety guarantees, excellent feedback about bugs from the compiler, and no hidden side effects (implicit copies, dangling references, etc.) from state mutations. Rust forces you to be very explicit with how you define and manage your state, but once you’re done satisfying the compiler, you don’t have to worry about seeing these types of issues at runtime.
Rust Ergonomics - Enums
One of my favorite features of Rust is its implementation of enums. In C++, enums are essentially named integers. Java has some advantage over C++ in this regard, because we can attach multiple members to an enum, and have some methods, but they still aren’t very powerful for anything other than representing a named collection of values.
One of the most common uses for enums comes from managing different states, and the limits of these enum implementations are quickly reached when it becomes necessary to associate different fields with each state. Consider the following example:
enum class TaskState {
PENDING,
IN_PROGRESS,
SUCCESS,
FAILED
};
struct Task {
TaskState state;
std::optional<int> worker_id; // only valid when IN_PROGRESS
std::optional<time_t> start_time; // only valid when IN_PROGRESS
std::optional<int> result; // only valid when SUCCESS
std::optional<std::string> errMsg; // only valid when FAILED
// ...
void process() {
switch(state) {
case TaskState::PENDING:
... // if we add new states in the future we need to remember
// to update this
}
}
};
This is problematic, because invalid states can easily be represented if we mismanage our values. Additionally, we have to remember to be exhaustive about all our states, or insert a ‘default’ statement which might cause problems down the line.
One way you might choose to solve this is with some inheritance hierarchy, where each state is inherited from a base Task state, and then can define its own fields. But this is still problematic, because now you have to deal with the dynamic dispatch cost, as well as the potential runtime errors when downcasting to different states. You also have a fragmented view of various states, each with their own class definitions. Overall, it’s incredibly verbose, and hard to maintain.
Rust solves this very elegantly with its enum implementation:
enum Task {
PENDING,
IN_PROGRESS { worker_id: u32, start_time: u64 },
SUCCESS { result: i32, complete_time: u64 },
FAILED { error: String },
}
This approach solves all the problems that we faced above. Invalid states can’t be represented, and there’s no possible ambiguity with shared state. Each state only holds the fields that it needs. Our definition is concise and compact, and it can be extended with new states without touching the existing ones.
Then, when we go to check these states:
impl Task {
fn process(&self) {
match self {
Self::PENDING => {
println!("pending task this will RUIN our SLA :(");
},
Self::IN_PROGRESS { worker_id, start_time } => {
println!("Task reached in progress on worker {} at time {}", worker_id, start_time);
},
// IMPORTANT:
// If we forget to put a state in here,
// we get a compiler error. Every state must be explicitly
// represented!
// we still have the option of a default, if needed:
_ => ()
}
}
}
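Here is a complete, runnable variant of the same idea with every state handled and no catch-all arm, so that adding a new variant later turns into a compile error at this match. It is a sketch: the variant names use the conventional PascalCase rather than the spelling above, and the `describe` method and its message strings are invented for illustration.

```rust
enum Task {
    Pending,
    InProgress { worker_id: u32, start_time: u64 },
    Success { result: i32, complete_time: u64 },
    Failed { error: String },
}

impl Task {
    // No `_` arm: the compiler checks that every variant is covered.
    fn describe(&self) -> String {
        match self {
            Task::Pending => String::from("pending"),
            Task::InProgress { worker_id, start_time } => {
                format!("in progress on worker {} since {}", worker_id, start_time)
            }
            Task::Success { result, complete_time } => {
                format!("succeeded with {} at {}", result, complete_time)
            }
            Task::Failed { error } => format!("failed: {}", error),
        }
    }
}

fn main() {
    let tasks = vec![
        Task::Pending,
        Task::InProgress { worker_id: 4, start_time: 100 },
        Task::Success { result: 42, complete_time: 250 },
        Task::Failed { error: String::from("timeout") },
    ];
    for task in &tasks {
        println!("{}", task.describe());
    }
}
```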
Rust enums are very powerful, and they’re used extensively throughout the language. The two most common examples are the built-in Result and Option enums. These concepts replace exceptions and null values, respectively.
In Rust, the Result enum is defined as follows:
enum Result<T, E> {
    Ok(T),
    Err(E),
}
In C++ and Java, an error during an operation results in an exception being thrown. Each caller has to carefully manage exceptions within its own scope, including handling all of the exceptions that may be thrown by functions within its own scope. We’re all familiar with the issues thrown exceptions create. There is almost always an issue with an uncaught or improperly handled exception in a Java codebase.
In Rust, the Result type forces explicit error handling. The caller must directly deal with an error state from a function call that returns a Result, or explicitly return it to another caller as another Result. This means that our control flow is more predictable; error handling states are more clearly defined sequences, rather than a random jump from one line into a catch block. Rust also makes this cleaner with the ‘?’ operator:
fn use_number(number: String) -> Result<(), ParseIntError> {
let num = parse_number(number)?; // automatically propagates error if present
println!("parsed number: {}", num);
Ok(())
}
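The snippet above relies on a `parse_number` helper that isn’t shown. A self-contained sketch might look like the following, where `parse_number` is assumed to be a thin wrapper around the standard library’s `str::parse`, and `double_number` is an illustrative caller.

```rust
use std::num::ParseIntError;

// Assumed helper: parses a trimmed string into an i32.
fn parse_number(number: &str) -> Result<i32, ParseIntError> {
    number.trim().parse::<i32>()
}

// `?` returns early with the Err value if parsing fails.
fn double_number(number: &str) -> Result<i32, ParseIntError> {
    let num = parse_number(number)?;
    Ok(num * 2)
}

fn main() {
    for input in ["21", "not a number"] {
        // The caller decides what to do with the error; it cannot be ignored silently.
        match double_number(input) {
            Ok(n) => println!("doubled: {}", n),
            Err(e) => println!("could not parse {:?}: {}", input, e),
        }
    }
}
```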
The Option enum is Rust’s solution to null values. In Rust, Option is defined as:
enum Option<T> {
Some(T),
None
}
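This again forces us to be explicit about how we handle missing values. In Java, if we try to use a null value, we’ll get an exception at runtime (in C++, dereferencing a null pointer is undefined behavior). In safe Rust, references can never be null, so a value that might be absent has to be wrapped in an Option and unwrapped before use. Consider this Java snippet: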
HttpResponse response = fetchFromServer();
System.out.println(response.getHeader().toString());
If we get a null value in response, we’ll get a null pointer exception when we call getHeader(). Or, we could get a non-null response, but with a null header field, which means toString() will give us a null pointer exception. To combat this, we have to constantly keep checking:
HttpResponse response = fetchFromServer();
if (response != null) {
HttpHeader header = response.getHeader();
if (header != null) {
System.out.println(header.toString());
} else {
// error case 2
}
} else {
// error case 1
}
In Rust, this is much cleaner:
if let Some(response) = fetch_from_server() {
if let Some(header) = response.get_header() {
println!("{}", header.to_string());
} else {
// error case 2
}
} else {
// error case 1
}
// there is no way to skip the check:
let response = fetch_from_server();
println!("{}", response.get_header().to_string());
// compile error: Option<HttpResponse> has no method `get_header`
Even though the latter Java example isn’t too far from our Rust example, the most important detail here is that it’s entirely up to the programmer to make sure that that validation happens. The first example is still valid Java!
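Nested `if let` blocks are not the only way to write this. As a sketch, the same two checks can be chained with Option’s combinators. The `HttpResponse` and `HttpHeader` types, their fields, and the boolean parameters on `fetch_from_server` are all hypothetical stand-ins so that the example is self-contained.

```rust
// Hypothetical types standing in for a real HTTP client.
struct HttpHeader {
    content_type: String,
}

struct HttpResponse {
    header: Option<HttpHeader>,
}

impl HttpResponse {
    fn get_header(&self) -> Option<&HttpHeader> {
        self.header.as_ref()
    }
}

// Hypothetical fetch; the flags simulate the two failure cases.
fn fetch_from_server(reachable: bool, with_header: bool) -> Option<HttpResponse> {
    if !reachable {
        return None;
    }
    let header = if with_header {
        Some(HttpHeader { content_type: String::from("text/html") })
    } else {
        None
    };
    Some(HttpResponse { header })
}

fn content_type(reachable: bool, with_header: bool) -> String {
    let response = fetch_from_server(reachable, with_header);
    response
        .as_ref()
        .and_then(|r| r.get_header()) // skipped if there is no response
        .map(|h| h.content_type.clone()) // skipped if there is no header
        .unwrap_or_else(|| String::from("unknown")) // explicit fallback
}

fn main() {
    println!("{}", content_type(true, true));
    println!("{}", content_type(true, false));
    println!("{}", content_type(false, false));
}
```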
Options and Results are great examples of how Rust enforces safety throughout the program. Rust forces you to be explicit and precise with how you handle states, but as a reward, it gives you amazing ergonomics and incredibly accurate feedback whenever you break one of its rules.
Honorable mention: Procedural Macros
Procedural macros and the Rust macro system deserve their own deep dive, but I think they’re especially relevant to touch on here because of our extensive use of Lombok at Amazon. Java is a very verbose language, and its strongly object-oriented nature means that there are tons of great patterns that we’re constantly incorporating into our codebases. The unfortunate combination of these two things means that most objects end up having tons of boilerplate for some of the most widely used patterns (getters, setters, builders, etc.).
To combat this, we use Lombok extensively in many of our Java projects. Lombok is a great tool that cuts down on all this verbosity and makes code much more readable; all we have to do is add some annotations, and Lombok modifies the AST at compile-time to insert the desired generated fields.
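Rust solves this problem natively with procedural macros. A procedural macro is a piece of Rust code that the compiler runs while compiling your crate; it receives the tokens of the item it’s attached to and emits additional code. For example, consider the derive attribute below (the derive for Debug ships with the compiler, but it behaves like any other derive macro):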
#[derive(Debug)]
pub struct User {
name: String
}
// user-provided implementation block
impl User {
fn say_hello(&self) {
println!("hello, {}!", self.name);
}
}
// at compile time, macro generates:
// impl std::fmt::Debug for User {
// fn fmt(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
// f.debug_struct("User")
// .field("name", &self.name)
// .finish()
// }
// }
The std::fmt::Debug here is a trait, which is similar to an interface in Java. The generated implementation block is fully separate from the user’s implementation; unlike Lombok, we aren’t modifying our struct’s implementation in-place, but generating a separate, related implementation block.
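To see the derived implementation in action, here is a small sketch that formats a value with `{:?}`. The extra `Clone` and `PartialEq` derives and the `debug_string` helper are additions for illustration only.

```rust
// Each derive generates a separate trait implementation for `User`.
#[derive(Debug, Clone, PartialEq)]
pub struct User {
    name: String,
}

fn debug_string() -> String {
    let user = User { name: String::from("ada") };
    let copy = user.clone(); // provided by derive(Clone)
    assert!(copy == user); // provided by derive(PartialEq)
    format!("{:?}", user) // provided by derive(Debug)
}

fn main() {
    println!("{}", debug_string());
}
```

Running this prints `User { name: "ada" }`, which is the output of the generated `fmt` function.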
Procedural macros can also be used much like a Lombok annotation:
// this example uses the derive_builder crate:
use derive_builder::Builder;
#[derive(Builder)]
struct Lorem {
ipsum: u32,
// ..
}
// at compile-time, we get a builder definition:
// #[derive(Clone, Default)]
// struct LoremBuilder {
// ipsum: Option<u32>,
// }
//
// #[allow(dead_code)]
// impl LoremBuilder {
// pub fn ipsum(&mut self, value: u32) -> &mut Self {
// let mut new = self;
// new.ipsum = Some(value);
// new
// }
//
// fn build(&self) -> Result<Lorem, LoremBuilderError> {
// Ok(Lorem {
// ipsum: Clone::clone(self.ipsum
// .as_ref()
// .ok_or(LoremBuilderError::from(UninitializedFieldError::new("ipsum")))?),
// })
// }
// }
// then we can use it:
fn build() {
let built = match LoremBuilder::default().ipsum(42).build() {
Ok(lorem) => lorem,
Err(e) => {
println!("oh no! my lorem! {:?}", e);
return;
}
};
// ...
}
But more than just generating boilerplate, we can use Rust’s procedural macro system to cover use cases that other languages handle with runtime reflection. One of the most popular Rust libraries is the serde crate, which generates general-purpose bindings for serialization/deserialization.
In Java, if you want to serialize an object to JSON using Jackson, you have the following:
ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(user);
At runtime, the mapper will use reflection to look into the fields of user, then build the JSON structure dynamically. This incurs the runtime costs of reflection, and is only made possible by the heavy JVM runtime.
Rust doesn’t have reflection, but it’s able to achieve a similar result using procedural macros in the serde crate:
#[derive(Serialize)]
pub struct User {
name: String,
age: u32,
}
// At compile time, we get something like this from serde:
// impl Serialize for User {
// fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
// where S: Serializer {
// let mut state = serializer.serialize_struct("User", 2)?;
// state.serialize_field("name", &self.name)?;
// state.serialize_field("age", &self.age)?;
// state.end()
// }
// }
Then, we use this general purpose binding with some other crate (or our own code) that can call this function:
let json = serde_json::to_string(&user)?;
This works despite zero runtime reflection. All of our field introspection is done directly at compile time, and happens automatically through procedural macros!
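To illustrate the principle without pulling in any crates, here is a toy sketch in plain Rust. The `ToJson` trait is invented for this example and is not serde’s API; the hand-written `impl ToJson for User` stands in for what a derive macro could generate. String escaping is deliberately left out.

```rust
// Toy trait, NOT serde: each type knows how to render itself as JSON.
trait ToJson {
    fn to_json(&self) -> String;
}

impl ToJson for String {
    fn to_json(&self) -> String {
        format!("\"{}\"", self) // no escaping; fine for this demo only
    }
}

impl ToJson for u32 {
    fn to_json(&self) -> String {
        self.to_string()
    }
}

pub struct User {
    name: String,
    age: u32,
}

// Written by hand here; a derive macro could emit this from the field list.
impl ToJson for User {
    fn to_json(&self) -> String {
        format!(
            "{{\"name\":{},\"age\":{}}}",
            self.name.to_json(),
            self.age.to_json()
        )
    }
}

fn main() {
    let user = User { name: String::from("ada"), age: 36 };
    println!("{}", user.to_json());
}
```

Every field access is resolved at compile time, so nothing has to inspect the struct’s layout while the program runs.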
Procedural macros are way more powerful than even just these small examples. They essentially allow you to write Rust code that runs before the rest of your code compiles, with all the power of the language. And rather than importing an annotation processor or shipping a runtime to provide reflection, procedural macros are a native, first-class feature of the language.
Conclusion
Rust is a really exciting language. It forces you to be explicit, and it puts a lot of pressure on you early on to iron out your memory model. It has a steep learning curve, and a very different mental model from most C-like languages, but it rewards you with great expressivity and excellent performance. It pushes an incredible amount of work down into the compilation stage, and gives you excellent feedback before your code ever runs, where other languages might just start cutting you Sev2s.