Day Three of the 9th annual QCon New York conference was held on June 15th, 2023, at the New York Marriott at the Brooklyn Bridge in Brooklyn, New York. This three-day event is organized by C4Media, a software media company focused on unbiased content and information in the enterprise development community and creators of InfoQ and QCon. It included a keynote address by Suhail Patel and presentations from these four tracks:
- Doing Platform Engineering Well
- Hosted by Sarah Wells, independent consultant and author.
- We will hear about the kinds of developer platforms being built, and what makes them successful.
- Next in Cloud Native Development
- Hosted by Christie Warwick, software engineer at Google.
- Explores emerging approaches and practices that build for the cloud from day one. Walk away knowing what's next.
- ML in Practice
- Hosted by Sid Anand, chief architect and head of engineering at Datazoom.
- We'll look at the practical application of machine learning in experiences that you have come to rely on.
- Resilience Engineering - Culture as a System Requirement
- Hosted by Vanessa Huerta Granda, solutions engineer at Jeli.io.
- Learn how organizations remain resilient across changing socio-technical systems. Come hear about how SREs and Ops engineers make change happen and how they respond to outages and learn from incidents.
Morgan Casey, program manager at C4Media, and Danny Latimer, content product manager at C4Media, kicked off the day three activities by welcoming the attendees. They introduced the Program Committee, namely: Aysylu Greenberg, Frank Greco, Sarah Wells, Hien Luu, Michelle Brush, Ian Thomas and Werner Schuster; and acknowledged the QCon New York staff and volunteers. The aforementioned track leads for day three introduced themselves and described the presentations in their respective tracks.
Keynote Address: The Joy of Building Large-Scale Systems
Suhail Patel, staff engineer at Monzo, presented a keynote entitled The Joy of Building Large-Scale Systems. On his opening slide, which Patel stated was also his conclusion, he asked why the following is true:
Many of the systems (databases, caches, queues, etc.) that we rely on are grounded on quite poor assumptions for the hardware of today.
He characterized his keynote as a retrospective of where we have been in the industry. As the title suggests, Patel stated that developers have "the joy of building large-scale systems, but the pain of operating them."
After showing a behind-the-scenes view of the microservices required when a Monzo customer uses their debit card, he introduced: the binary tree, a tree data structure in which each node has at most two children; and a comparison of latency numbers, as assembled by Jonas Bonér, founder and CTO of Lightbend, that every developer should know. Examples of latency data included: disk seek, main memory reference and L1/L2 cache references. Patel then described how to search a binary tree, insert nodes and rebalance a binary tree as necessary.
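The binary-tree operations Patel walked through can be sketched as follows. This is a minimal, illustrative Python implementation (the names are mine, not from the talk):

```python
class Node:
    """A binary search tree node: each node has at most two children."""
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert a key, descending left for smaller keys, right for larger."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root

def search(root, key):
    """Search in O(height) time -- O(log n) while the tree stays balanced."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root is not None

root = None
for k in [8, 3, 10, 1, 6]:
    root = insert(root, k)
```

Rebalancing (as in AVL or red-black trees) is what keeps the height, and therefore the search cost, logarithmic; without it, sorted inserts degrade the tree into a linked list.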
After a discussion of traditional hard drives, defragmentation and comparisons of random and sequential I/O as outlined in the blog post by Adam Jacobs, chief scientist at 1010data, he provided analytical data of how disks, CPUs and networks have been evolving and getting faster. "Faster hardware == more throughput," Patel maintained. However, despite the advances in CPUs and networks, "The free lunch is over," he said, referring to a March 2005 technical article by Herb Sutter, software architect at Microsoft and chair of the ISO C++ Standards Committee, that discussed the slowing down of Moore's Law and how the drastic increases in CPU clock speed were coming to an end. Sutter maintained:
No matter how fast processors get, software consistently finds new ways to eat up the extra speed. Make a CPU ten times as fast, and software will usually find ten times as much to do (or, in some cases, will feel at liberty to do it ten times less efficiently).
Since 2005, there has been a revolution in the era of cloud computing. As Patel explained:
We have become accustomed to the world of really infinite compute and we have taken advantage of it by writing scalable and distributed software, but often focused on ever scaling upwards and outwards without a ton of regard for the performance per unit of compute that we are utilizing.
Sutter predicted back then that the next frontier would be software optimization with concurrency. Patel discussed the impact of the thread-per-core architecture and the synchronization challenge, comparing the shared-everything architecture, in which multiple CPU cores access the same data in memory, with the shared-nothing architecture, in which each CPU core accesses its own dedicated memory space. A 2019 white paper by Pekka Enberg, founder and CTO at ChiselStrike, Ashwin Rao, a researcher at the University of Helsinki, and Sasu Tarkoma, campus dean at the University of Helsinki, found a 71% reduction in application tail latency using the shared-nothing architecture.
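The shared-nothing idea can be illustrated with a toy sketch (my own hedged example, not taken from the white paper): rather than every worker contending on one shared, locked store, each key is hashed to exactly one worker that owns a private store, so no cross-core synchronization is needed for reads or writes:

```python
NUM_WORKERS = 4

# Shared-nothing: each worker owns a private store. No locks are needed
# because a given key is only ever touched by the worker that owns it.
stores = [dict() for _ in range(NUM_WORKERS)]

def owner(key: str) -> int:
    """Route a key to exactly one worker via hash partitioning."""
    return hash(key) % NUM_WORKERS

def put(key: str, value):
    stores[owner(key)][key] = value

def get(key: str):
    return stores[owner(key)].get(key)

put("user:42", {"name": "Ada"})
```

In a real thread-per-core system each store would live on its own core and requests would be routed to the owning core's queue; the contention-free ownership rule is what drives the tail-latency improvement the paper measured.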
Patel then introduced solutions to help developers in this area. These include: Seastar, an open-source C++ framework for high-performance server applications on modern hardware; io_uring, an asynchronous interface to the Linux kernel that can potentially benefit networking; the emergence of programming languages such as Rust and Zig; a faster CPython with the recent release of version 3.11; and eBPF, a toolkit for creating efficient kernel tracing and manipulation programs.
As an analogy of human and machine coming together, Patel cited Sir Jackie Stewart, who coined the term mechanical sympathy: caring for and deeply understanding the machine in order to extract the best possible performance from it.
He maintained there had been a cultural shift in writing software to take advantage of the improved hardware. Developers can start with profilers to locate bottlenecks. Patel is particularly fond of Generational ZGC, a Java garbage collector that will be included in the upcoming GA release of JDK 21.
Patel returned to his opening statement and, as an addendum, added:
Software can keep pace, but there's some work we need to do to yield huge results, power new kinds of systems and reduce compute costs.
Optimizations are staring us in the face, and Patel "longs for the day that we never have to look at the spinner."
Highlighted Presentations: Living on the Edge, Developing Above the Cloud, Local-First Technologies
Living on the Edge by Erica Pisani, software engineer at Netlify. Availability zones are defined as one or more data centers located in dedicated geographic regions provided by organizations such as AWS, Google Cloud or Microsoft Azure. Pisani further defined: the edge as data centers that live outside of an availability zone; an edge function as a function that is executed in one of these data centers; and data on the edge as data that is cached/stored/accessed at one of these data centers. This provides improved performance, especially for users located far away from a particular availability zone.
After showing global maps of AWS availability zones and edge locations, she then provided an overview of the communication between a user, edge location and origin server. For example, when a user makes a request via a browser or application, the request first arrives at the nearest edge location. In the best case, the edge location responds to the request. However, if the cache at the edge location is outdated or somehow invalidated, the edge location must communicate with the origin server to obtain the latest cache information before responding to the user. While there is an overhead cost for this scenario, subsequent users will benefit.
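The request flow Pisani described, serve from the edge when the cache is fresh and fall back to the origin when it is stale, might be sketched like this (function names and the TTL value are illustrative, not from the talk):

```python
import time

CACHE_TTL = 60   # seconds an edge entry stays fresh (illustrative value)
edge_cache = {}  # path -> (response, fetched_at)

def fetch_from_origin(path: str) -> str:
    """Stand-in for the slow round trip to the origin server."""
    return f"content of {path}"

def handle_request(path: str, now=None) -> str:
    """Answer from the edge if the cached entry is fresh; otherwise
    refresh from the origin so that subsequent users benefit."""
    now = time.time() if now is None else now
    entry = edge_cache.get(path)
    if entry is not None and now - entry[1] < CACHE_TTL:
        return entry[0]                     # best case: edge answers directly
    response = fetch_from_origin(path)      # overhead paid once per expiry
    edge_cache[path] = (response, now)
    return response
```

The first request for a path pays the origin round trip; requests that arrive while the entry is still fresh are served entirely from the edge.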
Pisani discussed various problems and corresponding solutions for web application functionality on the edge using edge functions. These were related to: high-traffic pages that need to serve localized content; user session validation taking too much time in the request; and routing a third-party integration request to the correct region. She provided an extreme example of communication between a faraway user relative to two origin servers for authentication. Installing an edge server close to the remote user eliminated the initial latency.
There is an overall assumption that reliable Internet access is available. However, that isn't always true. Pisani then introduced the AWS Snowball Edge Device, a physical device that provides cloud computing for places with unreliable or non-existent Internet access, or serves as a way of migrating data to the cloud. She wrapped up her presentation by enumerating some of the limitations of edge computing: lower available CPU time; advantages that may be lost when a network request is made; limited integration with other cloud services; and smaller caches.
Developing Above the Cloud by Paul Biggar, founder and CEO at Darklang. Biggar kicked off his presentation with an enumeration of how computing has evolved over the years in which a simple program has become complex when persistence, Internet, reliability, continuous delivery, and scalability are all added to the original simple program. He said that "programming used to be fun." He then discussed other complexities that are inherent in Docker, front ends and the growing number of specialized engineers.
With regard to complexity, the conventional wisdom holds that developers should build "simple, understandable tools that interact well with our existing tools," following the UNIX philosophy of "do one thing, and do it well." However, Biggar claimed that building simple tools that interact well is a fallacy, and in fact the problem: it leads to the complexity in software development today, "one simple, understandable tool at a time."
He discussed incentives within companies: engineers aren't rewarded for building new greenfield projects that solve all the complexity; instead, they are incentivized to add small new things to existing projects that solve problems. Therefore, Biggar maintained that "do one thing, and do it well" is also the problem. This is why the "batteries included" approach, provided by languages such as Python and Rust, delivers all the tools in one package. "We should be building holistic tools," Biggar said, leading up to the main theme of his presentation on developing above the cloud.
Biggar identified three types of complexity that should be removed for an improved developer experience: infra[structure] complexity, deployment complexity and tooling complexity. Infra complexity includes the use of tools such as Kubernetes, ORMs, connection pools, health checks, provisioning, cold starts, logging, containers and artifact registries.
Biggar characterized deployment complexity with a quote from Jorge Ortiz in which the "speed of developer iteration is the single most important factor in how quickly a technology company can move." There is no reason that deployment should take a significant amount of time. Tooling complexity was explained by demos of the Darklang IDE in which creating things like REST endpoints or persistence, for example, can be quickly moved to production by simply adding data in a dialog box. There was no need to worry about things such as server configuration, pushing to production or a CI/CD pipeline. Application creation is reduced "down to the abstraction".
At this time, there is no automated testing in this environment and the adoption of Darklang is currently in "hundreds of active users."
Offline and Thriving: Building Resilient Applications With Local-First Techniques by Carl Sverre, entrepreneur in residence at Amplify Partners. Sverre kicked off his presentation with a demonstration of the infamous "loading spinner" as a necessary evil to inform the user that something was happening in the background. This kind of latency doesn't need to exist as he defined offline-first as:
(of an application or system) designed and prioritized to function fully and effectively without an internet connection, with the capability to sync and update data once a connection is established.
Most users don't realize that phone apps such as WhatsApp, email clients and calendar apps are examples of offline apps, nor how much they have improved over the years.
Sverre explained his reasons for developing offline-first (or local-first) applications: latency, reliability, collaboration and development velocity. Latency can be addressed with optimistic mutations and local storage techniques. Reliability is crucial because the Internet can be unreliable, with issues such as dropped packets, latency spikes and routing errors. Collaboration features can leverage offline-first techniques and data models; he said that developers "gain the reliability of offline-first without sacrificing the usability of real time." Finally, development velocity can be improved by removing complexity from software development.
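The optimistic-mutation technique Sverre mentioned can be sketched as: apply each change to local state immediately, queue it, and flush the queue when connectivity returns. The following is an illustrative Python sketch of that pattern, not Sverre's code:

```python
class OfflineFirstStore:
    """Apply writes locally right away; sync to the server later."""
    def __init__(self):
        self.local = {}    # local storage: reads are instant, no spinner
        self.pending = []  # mutations queued while offline

    def set(self, key, value):
        self.local[key] = value                # optimistic: UI updates now
        self.pending.append(("set", key, value))

    def sync(self, send):
        """Flush queued mutations once online; `send` performs the I/O."""
        while self.pending:
            send(self.pending.pop(0))

store = OfflineFirstStore()
store.set("draft", "hello")  # works offline; nothing waits on the network
```

A production implementation would also need the trade-offs Sverre discussed, such as conflict resolution when the server rejects or reorders a queued mutation.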
Case studies included: WhatsApp, the cross-platform, centralized instant messaging and voice-over-IP service, which uses techniques such as end-to-end encryption, on-device messages and media, message drafts and background synchronization; Figma, a collaborative interface design tool, which uses techniques such as real-time collaborative editing, a Conflict-Free Replicated Data Type (CRDT) based data model, and offline editing; and Linear, an alternative to JIRA, which achieves faster development velocity with techniques such as offline editing and real-time synchronization.
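A CRDT, the data-model technique attributed to Figma above, guarantees that replicas applying the same updates in any order converge to the same state. A minimal example is the grow-only counter (G-Counter), sketched here in Python as my own illustration:

```python
class GCounter:
    """Grow-only counter CRDT: one slot per replica. Merge takes the
    element-wise maximum, which is commutative, associative and
    idempotent -- so replicas converge regardless of message order."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> increments seen from that replica

    def increment(self):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other):
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

# Two replicas mutate independently (e.g. while offline), then exchange state.
a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)
```

Collaborative editors use far richer CRDTs (for sequences and trees), but the convergence property rests on the same merge discipline shown here.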
Sverre then demonstrated the stages for converting a normal application to an offline-first application. However, trade-offs to consider for offline-first application development include conflict resolution, eventual consistency, device storage, access control and application upgrades. He provided solutions to these issues and maintained that, despite these tradeoffs, this is better than an application displaying the "loading spinner."