
LSEG Cloud Lessons Learned: after Nearly a Decade of Being Cloud-First, What Have We Learned?


Summary

Oli Bage shares LSEG’s organizational, economic and technical tips about the journey to cloud. He talks about the CDMC standard, and where analytics might head in the future.

Bio

Oli Bage is Head of Architecture for the London Stock Exchange Group (LSEG) Data & Analytics business, one of the world’s largest financial data and analytics companies. LSEG infrastructure powers the world’s financial markets, including running the planet’s second largest network and buying a lot of fintech start-ups!

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Bage: After nearly a decade of being cloud-first, what has LSEG learned that we can share with others on the journey? We'll cover some of our lessons: economic, organizational, and technical tips. We're going to cover data on cloud with the CDMC. We're going to talk about what's coming next with CDMC+. First, I want to tell you a story. Sensitive data on cloud is a challenge. For some organizations, it is the challenge, the eye of the storm. In 2019, I was the distinguished engineer in the Center of Excellence for Data at Morgan Stanley, a large and systemically important investment bank. We had mature data controls on our on-prem data. For projects moving workloads to the cloud, there was a checklist of controls for sensitive data that had to be met before you could go. We approached the big cloud service providers that we were working with at the time, and we asked them to automate these controls on the cloud for us, just like we had them automated on-prem. They said we were an important customer, but not that big. Could they drop everything and add 14 new features to every cloud data service? Yes, but it would be slow. We said, what if we could get the whole industry to ask with one voice for the same 14 controls that we were asking for? They said that would definitely help. We approached the EDM Council, the Enterprise Data Management Council, to help us assemble a group of 100 organizations, including half of the world's systemically important banks and all of the world's largest cloud service providers. We also invited smaller emerging tech companies with deep expertise, and we added professional services companies with practical experience of moving to the cloud. We worked together, for 45,000 hours of SME time over 18 months, to publish the Cloud Data Management Capabilities standard in September 2021. I'll tell you more about CDMC, and where it's going this year. I'll also share some of the other cloud lessons we've learned here at LSEG to help you avoid the mistakes that we made, reduce some of the hard yards, and help you to move faster.

Background

My name is Oli Bage. I am head of architecture for the data and analytics division of LSEG, the London Stock Exchange Group. I joined LSEG three years ago as Chief Data Architect for Refinitiv, the market data company that was merging with LSEG at the time. I'm chief architect for the data division. I'm also founder and co-chair of the Cloud Data Management Capabilities initiative. Before I joined LSEG, I spent 21 years at Morgan Stanley, where most recently I was distinguished engineer. In the past, I've been a C++ developer, I've been a data architect, an information architect, a technical program manager on transformation projects. I helped set up the Center of Excellence for Data, before becoming a distinguished engineer. In my spare time, I help organize the London Enterprise Tech meetup, which we host monthly at the London Stock Exchange Group headquarters in Paternoster Square. I'm also an angel investor in emerging tech companies. I have a computer science degree from Cambridge University.

LSEG

A little bit about LSEG. We are a large financial services technology company, so the ultimate FinTech. Here are some surprising stats about us you may not know. We're the largest provider of software to wealth management financial advisors in the world. We have the second largest network in the world after Google. Compared to a bank, we have similar technology in our landscape: research, investment management, investment banking, sales and trading, execution management systems, clearing and settlement, risk management systems, and quantitative analytics platforms. That's all supported by the world's broadest set of content and analytics on the largest data platform in finance. We're an engineering company, and we cover a large number of engineering businesses, including our enterprise data feeds business. We have a wealth and investment management workflow and tools business. We provide indexes through FTSE Russell; you may have heard of the FTSE 100 or the Russell 2000. We provide trading software, including through startups like TORA, which are part of LSEG, and banking software with our Workspace desktop. We run the London Stock Exchange, and we also write the software that runs a number of other stock exchanges around the world. We run the largest and most important clearing house in the world, the London Clearing House, which clears trillions in derivatives exposure every day. We have a large and fast-growing customer risk business, where you might have heard of brands like World-Check, which we build and run.

Organizational Advice

Let me share some of the lessons that we learned on our journey to cloud. We started in around 2015, 2016. What we learned about our organization is what I want to share first. The pace of change when you're building on cloud is quite different to when you're building in your on-prem environment. Cloud technology is moving much faster than traditional on-premise technology. Large scale systems need an architectural approach that allows a modular swap-out over the medium term. There is additional cost in that flexible architecture, and it's harder to optimize performance. The advantage is that you can move to better, more efficient, or faster services as they become available over time. This is particularly important for customer facing applications or for analytics platforms where the pace of change is very high. We also found that you need to react to the pace of change with your people. Developer agility in financial services today doesn't match the agility of the technical platforms. We find that financial data engineers, for example, have deep specialisms today, and they need a skill set change to be successful in cloud. Let's talk about patterns. Patterns are important. There are lots of exciting services available on the cloud, and the variety and ease of access can mean that you end up with technology in the wrong use case. Great tools, like the main cloud data warehouses, for example Redshift, Synapse, or BigQuery, or services like lambdas, can end up being used for the wrong problem. Our advice is to create strong central pattern sharing. That would have helped our engineers to fail faster and spread their learning across other teams more quickly. We do now have a patterns-based approach, and that is working much better for us.

Cloud Economics

Let's talk a bit about cloud economics. When you're looking at migrating existing applications or your existing estate to the cloud, you've got a choice to make between a faster rehost or replatforming versus a much more efficient rearchitecting of your applications that takes advantage of cloud native features, like scalability. There's an inherent conflict in outcomes between end users, technology as a group, and the cloud service providers. You want to move fast, but you also want the best possible results. The lesson we learned here was to use data driven architecture. When you're planning what to move and when, link your product roadmaps to your application cloud migration strategy and your infrastructure cost and obsolescence data, so that you know what opportunities you've got to get out of on-prem infrastructure, and how you're going to make the savings that you expect from migrating an application. You'll need to understand this dependency anyway if you're a large-scale financial services or cloud company, for the operational resiliency regulations that are coming soon. The transparency of that linkage between products, applications, and their infrastructure simplifies the decision-making process.

Let's talk a bit about technical debt. There is a risk of building a lot of capability on the cloud but not replacing your on-prem capabilities. You need integration between the on-prem and the cloud. Without care, you can generate a lot of technical debt as you go through your cloud migration journey. Our lesson was that we didn't think soon enough about how the definition of done should mean switching off the old stuff. Not just getting the new capability live and having exciting new products available, but switching off the old infrastructure so you're no longer paying the on-premise cost and the integration cost too. The last economic lesson that we learned was around provisioning for performance. Performance testing is as expensive as running your production system, and you're going to do a lot of it to optimize. Make sure you include the performance testing estimates in your calculations. Then, when you're comfortable that the system has met your performance expectations, you can scale it back with confidence.

Technical Design

I'm going to touch on some of the technical lessons that we learned, starting with landing zones. We started too broad, with a large blast radius: a whole division in one cloud account, so when things went wrong, they impacted everybody else. To counteract that, we went very granular and created too many accounts, which gave us a massive account management overhead problem. At the time, the tools were only just maturing to be able to manage very large numbers of accounts. What we've gotten to, in our third generation, is the Goldilocks size: not too big, not too small. We would recommend encapsulating the change cycle for a set of applications so that you have control over the dependencies. You can do an impact and dependency map of your ecosystem, and decide the landing zone granularity on that basis.
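
As an illustration of that dependency-map approach, here is a minimal sketch (the application names and dependency edges are hypothetical) that groups applications sharing a change cycle or runtime dependency into candidate landing zones by taking connected components of the map:

```python
# A minimal, illustrative sketch (hypothetical application names and edges):
# group applications that share a change cycle or hard runtime dependency
# into candidate landing zones by taking connected components of the map.
from collections import defaultdict

# Each pair means the two applications change/deploy together or depend on
# each other at runtime, so they belong in the same blast radius.
dependencies = [
    ("pricing-api", "pricing-cache"),
    ("pricing-api", "reference-data-loader"),
    ("risk-engine", "risk-reports"),
]

graph = defaultdict(set)
for a, b in dependencies:
    graph[a].add(b)
    graph[b].add(a)

def landing_zone_candidates(graph):
    """Return groups of applications (connected components) that could share
    a landing zone / cloud account."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            app = stack.pop()
            if app in group:
                continue
            group.add(app)
            stack.extend(graph[app] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

print(landing_zone_candidates(graph))
# [['pricing-api', 'pricing-cache', 'reference-data-loader'],
#  ['risk-engine', 'risk-reports']]
```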

Let's talk about IP ranges. Our initial migrations to cloud were constrained by our on-premise network design principles. We run a very large, sophisticated, and mature physical network infrastructure. When we took that to the cloud, we ended up assigning massive IP ranges that couldn't be split, with applications ending up consuming a lot of IP addresses. The key lesson here is that lots of services don't need internally addressable IP addresses that are unique across your whole organization's network. You can use secondary classless inter-domain routing (CIDR) ranges for things like container-based applications that are IP hungry. These secondary ranges only need to be unique within a VPC. They really allow you to get organized with your network design and not chew through your entire range. I also want to talk about the importance of observability. Obviously, as engineers, you all understand the need to be able to support your code. In the early days, when we moved to cloud, we didn't build enough observability into our applications, so we didn't have real-time feedback when they were experiencing problems. The logging environment wasn't mature as we rolled out, so we didn't provide a central logging service and then enforce its adoption. The lesson here is to make that observability call from the start as you roll out your applications.
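
To make the secondary CIDR range idea above concrete, here is a minimal sketch, assuming an AWS VPC managed with boto3; the VPC ID, region, and ranges are placeholders, not a real network design:

```python
# A minimal sketch, assuming an AWS VPC managed with boto3. The VPC ID,
# region, and CIDR ranges are placeholders, not a real network design.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

VPC_ID = "vpc-0123456789abcdef0"      # hypothetical VPC
SECONDARY_CIDR = "100.64.0.0/16"      # only needs to be unique within the VPC

# Associate a secondary range with the VPC for IP-hungry container workloads,
# leaving the small, organization-unique primary range for routable endpoints.
ec2.associate_vpc_cidr_block(VpcId=VPC_ID, CidrBlock=SECONDARY_CIDR)

# Carve container subnets out of the secondary range, one per availability zone.
for i, az in enumerate(["eu-west-2a", "eu-west-2b"]):
    ec2.create_subnet(
        VpcId=VPC_ID,
        AvailabilityZone=az,
        CidrBlock=f"100.64.{i * 64}.0/18",
    )
```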

A couple more areas of technical focus are very important for financial technology firms. The first one is around disaster recovery and resilience. Our disaster recovery standard was hard to agree across our organization, because we had to get coordinated agreement across teams on the complex set of dependencies between our applications. You need to put your highest tier of applications into multiple regions as well as multiple availability zones. Teams have to work closely together to understand those complex application dependencies and work out just what is in that highest tier. There are new operational resiliency regulations coming for the financial services industry, and they will require you to understand the dependencies between critical or important functions, or important business services, and the infrastructure they depend on. That could be a very complex set of application and data feed dependencies.

I also want to talk about the challenge we faced with real-time data on cloud. LSEG's core technical asset within our data business is a very large scale, low latency real-time network, where we pump market data from more than 500 exchange venues to tens of thousands of clients, in fact, 40,000 clients. A very large portion of them take some kind of ticking market data from us. Providing that real-time capability on cloud is very expensive from an OpEx point of view. Also, for predictable, ultra-low latency market data feeds, we haven't yet seen services from the cloud providers that meet our needs. I have no doubt that they will come at some point in the future, but they're not there today. The things that we care about in our network are being close to the data sources and to our customers, so having the ability to colocate in specific physical locations. We also demand very high levels of resilience, so ultra-high levels of redundancy, including dual power, dual comms, and dual cooling. We like to be able to do individual placement within a rack to get very predictable distances between our servers. That fine control extends into the software as well. We want to give our developers control of the NIC, the BIOS, and the kernel, and to understand exactly what's going on in there. Finally, we need support for multicast, which is very important for our predictable low latency distribution. These improvements will come over time.
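
For readers less familiar with multicast, here is a minimal sketch of the kind of multicast subscription a market data consumer relies on; the group address and port are hypothetical:

```python
# A minimal sketch of a multicast subscriber (hypothetical group and port).
# One packet sent to the group reaches every subscriber on the network
# segment without per-client fan-out, which is what keeps latency predictable.
import socket
import struct

GROUP, PORT = "239.1.1.1", 5000   # hypothetical multicast group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

# Join the multicast group on all local interfaces.
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    payload, sender = sock.recvfrom(65535)
    print(f"received {len(payload)} bytes from {sender}")
```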

Cloud Data Management Capabilities

Those are most of the technical, economic, and organizational tips that we've got. There's one big area of cloud migration that's been a real challenge, and that's data. Let me talk about the Cloud Data Management Capabilities. This is a huge opportunity to improve the level of automation in data management. As you move to cloud, the idea is to get to these laboratory or white room conditions where everything is very clean, and to fully automate the control framework around the outside. That means you can depend on those controls, rather than depending on individuals to put them in place. CDMC defines 14 of those key controls that most large systemically important banks needed. When we went to other industries, we found lots of non-systemically important use cases that need exactly the same set of controls. They form a basic capability for organizations that need to respond to privacy regulations, or that want to get better at data management in its own right, not just those with financial reporting obligations on their data. What we're trying to get to here is policy as code. The CDMC controls themselves are the result of a long, detailed review of the control work that we were doing at Morgan Stanley in 2019 and 2020.
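
As a purely illustrative sketch of what policy as code can look like for the first of those controls (the control names and asset records below are simplified, not the official CDMC schema):

```python
# A purely illustrative sketch of "policy as code" for control 1: report
# whether the other controls are evidenced for each data asset. The control
# names and asset records are simplified, not the official CDMC schema.
REQUIRED_CONTROLS = {
    "ownership", "authoritative_source", "sovereignty", "cataloged",
    "classified", "entitlements", "consumption_purpose", "security",
    "privacy_impact", "data_quality", "retention", "lineage", "cost_metrics",
}

assets = [  # hypothetical catalog entries
    {"name": "trades_eu", "controls_evidenced": set(REQUIRED_CONTROLS)},
    {"name": "client_pii", "controls_evidenced": {"ownership", "cataloged"}},
]

def control_1_report(assets):
    """For each asset, list which of the other controls lack evidence."""
    return {
        a["name"]: sorted(REQUIRED_CONTROLS - a["controls_evidenced"])
        for a in assets
    }

for name, missing in control_1_report(assets).items():
    status = "all controls evidenced" if not missing else f"missing: {', '.join(missing)}"
    print(f"{name}: {status}")
```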

CDMC: Industry Engagement

Let me show you who was involved in creating them. The working group that I mentioned included half the world's systemically important banks, who reviewed that list of what good looks like for data and really tested whether it was going over the top or whether it was the bare minimum. We only wanted to standardize best practices that were in wide use within the industry. We held ourselves to a pretty high bar of only including established best practices in the CDMC framework. We also had great input on how these controls could be automated from deep specialists on the engineering side. We had, in the same room at the same time, launch teams from each of the cloud service providers, and more detailed specialists in areas like privacy, security, and data cataloging from emerging tech companies that have spent their entire time focusing on just these problems. We were also able to use the connections at the Enterprise Data Management Council to brief all but one of the world's financial regulators on what we were doing. As we went through the CDMC design process, we were able to give regular briefings to financial regulators around the world and let them know what was going on. They're interested, not just from a control point of view, in what the industry is doing to control data. They're also interested in using public cloud themselves more often. That's actually quite a challenge for them, because they have very sensitive data.

The EDM Council provided great support for the initiative. During the course of those 18 months, we produced a set of training courses of varying lengths that you can take. You can take a very short version. You can take an assessment of your platform or your company, or you can become certified as a company that provides these sorts of services or as an individual who understands the framework. There's an authorized partner program that allows you to promote your use of CDMC if you're a software company, or provide training for your customers if you're a large service provider. We also have a collaboration with FINOS, the Fintech Open Source Foundation, to automate a lot of these controls in an open source way. Where the full control isn't built into every service, the aim is to create a framework that sits over the top and allows you to show that you're fully compliant. We tested some of these controls with other industries, holding training sessions with pharma, telco, manufacturing, energy, insurance, and government stakeholders, quite senior stakeholders, chief data officers from companies in those sectors. The feedback was very positive. They said that the control framework was something they felt they could use, though they might give a different weighting to the importance of each of the 14 different areas. Later on, I'll tell you about CDMC+, which is what we're doing with CDMC this year: taking it into analytics, data marketplaces, and data sharing, and then onwards later this year into master data management, federated data governance, and the buzziest buzzword of all in data at the moment, data mesh, and putting some practical, helpful automated controls around each of those areas.

CDMC - 14 Key Controls and Automations

Let's talk about what CDMC actually looks like. There are 14 key controls that make up the framework. They provide a definition of what good looks like for data on cloud. Control number 1 is actually a metric that one of our regulators suggested to us when they were involved in the review, and it is the control that says whether all of the other controls are switched on. If you're adding the CDMC framework to a data platform, a data project, a subset of your data in cloud, or maybe all of your data in cloud and a subset of your data on-prem, you can check control number 1 and see if all of the other controls are operating correctly. It's the control that says, I've got evidence that everything else is working. Control number 2 is around defining ownership. A very important principle for data governance and accountability is that all the data assets covered by the framework have an owner. Control number 3 is that the owner has assigned authoritative sources and authorized distributors, where you can get the highest quality version of that data. The owner of that data says, this is the place you come and get our data from. Control number 4 is being accountable for that data as it moves between jurisdictions. That means looking at the data sovereignty requirements of various jurisdictions and understanding what workflows you need for cross-border movement controls. In cloud, it's much easier to accidentally move data from one region or one jurisdiction to another. In the data catalog that we'll talk about, you'll be able to describe that a dataset can only live in one country or one jurisdiction, or should never go into a certain set of jurisdictions, and be sure that the cloud framework and the cloud platform you're on won't allow it to be moved into those areas.
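
As an illustrative sketch of the cross-border movement check in control 4 (the catalog fields and jurisdiction codes are hypothetical):

```python
# An illustrative sketch of the cross-border movement check in control 4.
# The catalog fields and jurisdiction codes are hypothetical.
catalog_entry = {
    "dataset": "uk_retail_clients",
    "allowed_jurisdictions": {"UK"},    # may only live here (empty = anywhere)
    "blocked_jurisdictions": {"US"},    # must never be copied here
}

def movement_allowed(entry, target_jurisdiction):
    """Return True only if copying the dataset to the target is permitted."""
    if target_jurisdiction in entry["blocked_jurisdictions"]:
        return False
    allowed = entry["allowed_jurisdictions"]
    return not allowed or target_jurisdiction in allowed

assert movement_allowed(catalog_entry, "UK")
assert not movement_allowed(catalog_entry, "US")       # explicitly blocked
assert not movement_allowed(catalog_entry, "DE")       # not on the allow-list
```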

I want to talk about cataloging and classification next. Cataloging is probably the most important control in the framework. It's the one that says that all of the data you've got inside your framework is listed, even if you don't know what it is yet, because it hasn't gone through a classifier. As the data is created, a record appears in your catalog that says there's a new bit of storage being used in this service. I can see that Oli created it, or that the process I've been running created it. Until it's classified and I know how sensitive it is, I can't do anything with it. Control number 6 says, I'm going to run a classifier over all of the data that I create and try to work out if there's sensitive data in there. The vast majority of the data in financial services companies is sensitive in some way. It may be information about customers or their transactions, or it could be material nonpublic information from a research group or from a trading area working on a sensitive transaction. There's a lot of sensitive data. Once you've got a classification that says that this data is sensitive, whether it's personal data covered by a privacy regulation or sensitive transaction data covered by financial services regulations, that's what turns on the other controls. Controls from 7 onwards get switched on if you find you've got sensitive data in the cloud. Control number 7 is having data-oriented entitlements and access control switched on for all sensitive data. By default, access falls back to the creator of the data until they specify who can see and who can share that data.
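
Here is an illustrative sketch of the cataloging, classification, and entitlement controls just described working together; the catalog schema is hypothetical and much simplified:

```python
# An illustrative sketch of the cataloging, classification, and entitlement
# controls just described. The catalog schema is hypothetical and simplified:
# a record appears at creation time, access is denied until a classifier has
# run, and entitlements default back to the creator.
from datetime import datetime, timezone

def register_new_object(catalog, path, creator):
    """Cataloging: a record appears as soon as new storage is created."""
    catalog[path] = {
        "creator": creator,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "classification": None,     # unknown until the classifier runs
        "entitled": {creator},      # entitlements default to the creator
    }

def can_read(catalog, path, user):
    record = catalog.get(path)
    if record is None or record["classification"] is None:
        return False                # unclassified data stays locked down
    return user in record["entitled"]

catalog = {}
register_new_object(catalog, "s3://bucket/new-extract.parquet", "oli")
assert not can_read(catalog, "s3://bucket/new-extract.parquet", "oli")

catalog["s3://bucket/new-extract.parquet"]["classification"] = "PII"
assert can_read(catalog, "s3://bucket/new-extract.parquet", "oli")
assert not can_read(catalog, "s3://bucket/new-extract.parquet", "someone-else")
```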

Control number 8 is that data consumption purposes are tracked: you've got data sharing agreements in place with all users of your sensitive data, and you can detect when those consumption purposes change. GDPR, for example, will allow you to do certain things with your subjects' data, but not other types of processing without getting their permission. You want to be able to detect when the purposes change. Control number 9 is having appropriate security controls based on the level of sensitivity of the data: encrypting at rest, encrypting on the wire, and maybe even encrypting in use using confidential compute techniques. Control number 10 is data privacy impact assessments. These must be automatically triggered for data that has personal data or PII in it. Linking that into your overall control framework will give your privacy team the overview they need of what data is in the organization. Control number 11 is around data quality management. What we're really asking for here is that the data quality metrics the consumer has, because the quality of data is in the eye of the consumer and whether it's fit for their purpose, make it back to the data owner that we defined back in control number 2, so that they can do something about it in the authoritative source. If those issues are just passed around internally in your own organization, it wastes a lot of time, as they usually pass through a set of teams that don't have the resources to do the DQ corrections. Or, if they do have resources, they tend to make adjustments on the way through, which is bad practice.

Control number 12 is around retention, archiving, and purging. Once you know from your classifier in control 6 what type of data you've got, there are various regulations that tell you how long you have to keep it around. If you're doing transactions on the New York Stock Exchange, for example, then you want to keep that data around for 7 to 10 years, depending on the type of transaction. Control 13 is the data lineage control. This allows us to understand where a particular field came from in a downstream consumer use case. In a particular set of financial and risk reporting use cases in banks, you need to be able to show that the data came from an authoritative source. That's very hard to do if you're trying to work out the lineage after the fact. However, if you design your data platform to report the lineage of datasets as they move around between different services, you can just look in your catalog and see: where did this dataset start? Where did it flow through? Where was it transformed? Where was derived data created from it? Where was it finally consumed? Finally, control 14 is around cost metrics. The cloud companies have this cost data at hand because they're going to create a bill and send it to you. The control is about making sure those kinds of FinOps records go back into the data catalog as well, so that the data owner can see what their dataset is costing to operate. These 14 controls together make up a very basic definition of what good looks like for sensitive data in cloud.
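
As an illustrative sketch of how control 12 can be automated once classification is known (the retention schedule below is an example only, not regulatory guidance):

```python
# An illustrative sketch of automating control 12 once classification is
# known. The retention schedule is an example only, not regulatory guidance.
from datetime import date, timedelta

RETENTION_YEARS = {                 # hypothetical schedule by classification
    "exchange_transaction": 7,
    "client_pii": 6,
    "public_reference": 1,
}

def purge_date(classification: str, created: date) -> date:
    """Derive the earliest purge date from the classification."""
    years = RETENTION_YEARS.get(classification, 10)   # conservative default
    return created + timedelta(days=365 * years)

print(purge_date("exchange_transaction", date(2024, 2, 7)))   # 2031-02-05
```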

In the CDMC specification document itself, which is about 160 pages, we go into much more detail. You can see the controls are organized into a set of 14 capabilities, each of which is broken down into sub-capabilities. Those are things you can score yourself against if you're an organization, or, if you're writing software, you can create an automated implementation of that particular control.

CDMC for Data Today, and CDMC+, Building for the Future

That's CDMC. It's a basic best practice framework for what good looks like for data on cloud. It's been adopted by a number of different cloud service providers who are in the process of getting themselves certified at the moment. Some have announced already; others have it in the works. We are taking those defensive controls and building new capabilities on top of them this year, and we're calling that CDMC+. CDMC is a certifiable framework that you can assess yourself against. CDMC+ is a new set of capabilities that sit on top of it and allow you to turn your data governance and data management capabilities into a strategic business enabler, by creating more value from them and using them for more interesting use cases. Let me show you what I mean by more interesting use cases. This is an overview of the 2023 plans that we have for CDMC+. On the left-hand side are the CDMC foundations, published in 2021 and now being relatively widely adopted. At our recent exec advisory board meeting, we were looking at the numbers: there have been 4,000 downloads of the CDMC spec. More than 600 people are now trained on the CDMC framework across all of our authorized partners.

Right now, though, we have a number of active CDMC+ working groups. We have just kicked off the design work in the analytics space to see how we can implement the same 14 CDMC controls for analytical assets, for example, models, visualizations, or analytical APIs. We're also looking at how we can create secure and confidential compute environments for running analytics on very sensitive data. We're also looking at what needs to be extended in the privacy and ethics capabilities within CDMC when you're creating outputs from analytical models, rather than just managing data content. We have two very active working groups around data marketplaces and data sharing, with subgroups working on designing and managing data products and making sure that datasets are discoverable. Data products are a core concept used by data mesh. There'll be important work done here that will allow a concrete implementation of data mesh to work across a number of different cloud providers in the future.

We're also looking at improving the automation around governing data sharing, including automating terms of use, licensing, and digital rights management. We're extending a W3C standard called ODRL so that it can deal with data contracts between data vendors and data consumers, so that you can have machine-readable licenses for data that tell you what you're allowed to do with a dataset when it arrives inside your data platform. We're also designing effective data sharing mechanisms. You've seen some of these in the more modern data platforms recently, such as the data sharing or data sharehouse concept inside Snowflake, for example, which is industry leading. There are data sharing implementations in many of the cloud service providers' data platforms available at the moment. We're working to make sure that these fit nicely into the control framework, so you know what type of sharing you're allowed to do, and you can automatically provision that kind of shared channel inside your own organization or across organizations.
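
For a sense of what such a machine-readable license could look like, here is a minimal ODRL-style policy written as a Python dict in JSON-LD shape; the asset, parties, and constraint values are hypothetical, and the actual CDMC+ profile of ODRL may differ:

```python
# A minimal, hand-written example of a machine-readable data license in an
# ODRL-style JSON-LD shape, expressed here as a Python dict. The asset,
# parties, and constraint values are hypothetical, and the actual CDMC+
# profile of ODRL may differ.
odrl_policy = {
    "@context": "http://www.w3.org/ns/odrl.jsonld",
    "@type": "Agreement",
    "uid": "https://example.com/policy/market-data-feed-001",
    "assigner": "https://example.com/party/data-vendor",
    "assignee": "https://example.com/party/consuming-bank",
    "permission": [{
        "target": "https://example.com/asset/eod-prices",
        "action": "use",
        "constraint": [{
            "leftOperand": "purpose",
            "operator": "eq",
            "rightOperand": "internal-risk-analytics",
        }],
    }],
    "prohibition": [{
        "target": "https://example.com/asset/eod-prices",
        "action": "distribute",
    }],
}

# A data platform could inspect the policy before provisioning a share,
# e.g. to list the purposes the consumer is permitted to use the data for.
allowed_purposes = {
    c["rightOperand"]
    for p in odrl_policy["permission"]
    for c in p.get("constraint", [])
    if c["leftOperand"] == "purpose"
}
print(allowed_purposes)   # {'internal-risk-analytics'}
```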

CDMC Teams - Where Could You Get Involved?

If you want to get involved in the CDMC working groups, there are multiple levels at which you can get involved. We recognize that everybody interested in this topic is quite senior and busy, so there needs to be a low-time-commitment option where you can keep up to date but still give your opinion. There are some areas where we need deep SMEs to generate the knowledge, so we might need a couple of weeks of more intense effort from certain people. We designed a structure during the initial CDMC work that seems to work quite well, and we are continuing it for CDMC+. In the CDMC+ structure you can see here, at the very top is the minimum time commitment. That's an hour once a quarter, usually for the most senior executives: a chief data officer, a chief data architect, or somebody who's responsible for the implementation of the data standards on cloud. They review the material that's been created in CDMC+ up until that point, and they set the priorities for which sections we do next. We're trying to do analytics, marketplace, and sharing simultaneously. We're looking at when is the right time to start data mastering, federated data governance, and data mesh. It will be the exec advisory board that makes that decision, based on recommendations from the next level down, which is the steering committee. The CDMC steering committee meets every two weeks, on a Tuesday afternoon, 3:30 to 4:30 London time, 10:30 to 11:30 New York time. That steering committee guides the work of the working groups in a two-week sprint cycle, making sure the right priorities are set and the right resources are lined up to work on individual pieces of knowledge and framework generation. We're also setting up an Asia steering committee to make sure that the specific extra use cases we have in the Asia region are adequately represented in CDMC.

One level lower than the steering committee are the working groups. It's a large group: more than 100 people are invited, and around 50 are actively participating in any given week. We meet for two hours on a Thursday to review any work that's been generated, any material that needs to be shared and reviewed by the wider group. Then, in between those meetings, we have smaller, more active focus groups with SMEs involved. On cloud-hosted analytics, I chair that group with John Allen from LSEG, and we have around 25 people who are actively interested in contributing to the analytical best practices. For cloud data marketplace and data sharing, there are similar working groups going on actively, week to week, bringing their findings back for review by the larger group. Whether you've got just one hour a quarter, an hour every two weeks, two hours a week, or somewhere in between, there's a place where you can get involved. We've found that if you're a deep SME in one area of data management or analytics management, you'll benefit the program by sharing your knowledge in that area, and you might also learn a fair amount from the other SMEs on the program covering the other areas. We've built quite a good network across the initiative over the last few years. If you are interested in getting involved, I would encourage you to do two things. One is to download the CDMC specification, which you can get from the edmcouncil.org/cdmc page. The other is to use the sign-up link there to join the CDMC+ group. Jim Halcomb at the EDM Council is the organizer for all the CDMC activities.

Conclusion

We covered a few tips on how to successfully migrate to the cloud and some of the lessons that we've learned. We talked about some of the lessons that other organizations have shared as part of CDMC. We've also talked about how you can get involved in guiding the future and defining this best practice framework, which is making it much easier to migrate sensitive data and sensitive analytics to cloud.

 


 

Recorded at:

Feb 07, 2024
