At PayPal, data engineers, analysts, and data scientists work with a variety of data sources, compute engines, languages, and execution models (stream, batch, interactive). As a result, engineers spend a lot of time managing these different data sources, which slows the time-to-market of their products.
The PayPal data team developed a new analytics platform called Gimel, which provides access to any data store using a single data API and SQL, and a centralized data catalog.
Romit Mehta and Deepak Chandramouli from PayPal spoke at the recent QCon.ai Conference about the platform and how it can be used to commoditize data access. They talked about the components of Gimel - Compute Platform, Data API, PCatalog, GSQL and Notebooks. They also announced the open source version of the framework.
InfoQ spoke with Mehta and Chandramouli about the Gimel data platform, its support for security and data versioning, and its future roadmap.
InfoQ: Are there any differences in managing the data catalog (PCatalog) for transactional and analytics use cases?
Mehta & Chandramouli: The Gimel API and SQL implementation today are focused on analytics platforms. Regardless of whether the storage type is Kafka, NoSQL, relational, or document-based, the data API remains the same and SQL provides the language abstraction. Within PayPal, we’re seeing requests from online/live systems for a similar layer of abstraction. We are currently exploring how to bring a similar layer to online systems that require sub-second responses.
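To make the idea of one data API over many storage types concrete, here is a minimal Python sketch. It is not Gimel's actual API (which the article says currently supports Scala): the `Connector`, `InMemoryConnector`, and `DataApi` names, the catalog layout, and the dataset names are all hypothetical, and an in-memory store stands in for Kafka and Hive.

```python
from typing import Dict, List, Protocol, Tuple

Row = Dict[str, object]


class Connector(Protocol):
    """Hypothetical per-storage adapter; not Gimel's real interface."""

    def read(self, location: str) -> List[Row]: ...

    def write(self, location: str, rows: List[Row]) -> None: ...


class InMemoryConnector:
    """Stand-in for a real store (Kafka, Hive, HBase, ...)."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Row]] = {}

    def read(self, location: str) -> List[Row]:
        return list(self._data.get(location, []))

    def write(self, location: str, rows: List[Row]) -> None:
        self._data.setdefault(location, []).extend(rows)


class DataApi:
    """Single entry point: callers name a dataset, never a storage system."""

    def __init__(
        self,
        catalog: Dict[str, Tuple[str, str]],
        connectors: Dict[str, Connector],
    ) -> None:
        self._catalog = catalog        # dataset name -> (storage type, location)
        self._connectors = connectors  # storage type -> connector

    def _resolve(self, dataset: str) -> Tuple[Connector, str]:
        storage_type, location = self._catalog[dataset]
        return self._connectors[storage_type], location

    def read(self, dataset: str) -> List[Row]:
        connector, location = self._resolve(dataset)
        return connector.read(location)

    def write(self, dataset: str, rows: List[Row]) -> None:
        connector, location = self._resolve(dataset)
        connector.write(location, rows)


# Hypothetical catalog entries; the same calls work for either storage type.
catalog = {
    "payments.events": ("kafka", "topic:payments"),
    "payments.archive": ("hive", "db.payments_archive"),
}
connectors = {"kafka": InMemoryConnector(), "hive": InMemoryConnector()}
api = DataApi(catalog, connectors)

api.write("payments.events", [{"id": 1, "amount": 25.0}])
api.write("payments.archive", api.read("payments.events"))
```

In this sketch, moving a dataset to a different store only changes its catalog entry; the calling code stays the same, which is the property the interviewees describe.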
InfoQ: How did you address the security and access control requirements for the data access on the Gimel platform?
Mehta & Chandramouli: Since all queries are submitted to the underlying systems as the user who has logged in, and because all queries are eventually executed by those underlying systems, all existing security policies and controls are maintained.
In addition, through its logging framework, Gimel keeps a log of every query executed, including the query itself and whether any data was downloaded locally. In the future, it will also tag whether any classified data was accessed.
Gimel at PayPal also honors the Ranger policies and works tightly with the Kerberized clusters.
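The query logging described above could look roughly like the following Python sketch. It is an illustration under assumptions, not Gimel's logging framework: `QueryLogEntry`, `AuditedExecutor`, and `fake_backend` are hypothetical names, and the backend is a stub standing in for the underlying system that actually enforces security policies.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Callable, List


@dataclass
class QueryLogEntry:
    """One audit record per submitted query (hypothetical schema)."""

    user: str
    query: str
    downloaded_to_local: bool
    timestamp: float = field(default_factory=time.time)


class AuditedExecutor:
    """Wraps a query runner and records every query before running it."""

    def __init__(self, run_query: Callable[[str, str], list]) -> None:
        self._run_query = run_query
        self.log: List[QueryLogEntry] = []

    def execute(self, user: str, query: str, download: bool = False) -> list:
        # Log first so that failed queries are still recorded.
        self.log.append(QueryLogEntry(user, query, download))
        # The query runs as the logged-in user; the backend applies its own
        # access controls.
        return self._run_query(user, query)


def fake_backend(user: str, query: str) -> list:
    """Stub for the underlying storage or compute system."""
    return [("ok", user)]


executor = AuditedExecutor(fake_backend)
rows = executor.execute("alice", "SELECT 1", download=True)
print(json.dumps([asdict(entry) for entry in executor.log], indent=2))
```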
InfoQ: How do you manage the data store versioning?
Mehta & Chandramouli: We partner with storage admins at PayPal to ensure our APIs fully support the storage versions that the infra team supports. In addition, if the storage teams have needs such as new instrumentation, we wire it into our API so that all clients inherit the implementation transparently. As a result, when a version upgrade happens, clients in most cases do not need to change their code.
InfoQ: Can you talk about the GSQL query language and how it differs from similar frameworks like Spark SQL or Neo4j's Cypher?
Mehta & Chandramouli: GSQL today is a lightweight implementation that intercepts user SQL, generates the corresponding data API code behind the scenes for Gimel Datasets, and then passes the result on to the Spark SQL interpreter. In the longer term, we are working on adding push-down optimizations for SQL queries that blend or join data from multiple storage types, for example joining Kafka, Hive, and HBase and writing the results to Elastic.
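The interception step can be illustrated with a small, self-contained Python sketch. This is not how GSQL is implemented: Python's built-in `sqlite3` module stands in for the Spark SQL interpreter, the `CATALOG` contents and dataset names are made up, and dataset references are found with a simple regular expression rather than a SQL parser. The sketch only shows the general flow of resolving catalog datasets, loading each one into the engine as a table, and handing the rewritten SQL to the engine.

```python
import re
import sqlite3
from typing import List

# Hypothetical catalog: dataset name -> (column names, rows).
# In a real system the rows would be read from Kafka, Hive, HBase, etc.
CATALOG = {
    "kafka.payments": (["user_id", "amount"], [(1, 25.0), (2, 40.0), (1, 5.0)]),
    "hive.users": (["user_id", "name"], [(1, "Ana"), (2, "Raj")]),
}


def run_sql(sql: str) -> List[tuple]:
    """Intercept SQL, load referenced datasets, then delegate to the engine."""
    conn = sqlite3.connect(":memory:")
    rewritten = sql
    for i, (dataset, (columns, rows)) in enumerate(CATALOG.items()):
        # Match the dataset name only as a whole identifier.
        pattern = r"(?<![\w.])" + re.escape(dataset) + r"(?![\w.])"
        if not re.search(pattern, sql):
            continue  # dataset not used by this query; skip loading it
        table = f"ds_{i}"
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        # Point the query at the loaded table instead of the catalog name.
        rewritten = re.sub(pattern, table, rewritten)
    try:
        return conn.execute(rewritten).fetchall()
    finally:
        conn.close()


# One query joining two datasets that live in different (simulated) stores.
result = run_sql(
    "SELECT u.name, SUM(p.amount) "
    "FROM kafka.payments p JOIN hive.users u ON p.user_id = u.user_id "
    "GROUP BY u.name ORDER BY u.name"
)
print(result)
```

Because this sketch rewrites names by regular expression, it would also rewrite a dataset name that appears inside a string literal; a production implementation would work on the parsed query. It also loads whole datasets before joining, which is the kind of cost that the push-down optimizations mentioned above would aim to reduce.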
From a roadmap perspective, besides working on incremental features and updates, the team is also planning the following for Gimel:
- Query optimization
- Open source PCatalog (includes metadata services, discovery services, catalog UI)
- Add support for Python; currently only Scala is supported
- Release to open source the features added to Jupyter & Livy
If you would like to learn more about the Gimel platform or have any questions about its features, check out the documentation, Slack channel, User Forum, and Developer Forum. You can also try Gimel first-hand by following these instructions.