Kaggle has published its State of Machine Learning and Data Science report for 2020, based on survey responses from over two thousand users currently employed as data scientists. The report notes that the "vast majority" of data scientists are under 35 years of age, two-thirds have a graduate degree, and most have fewer than 10 years of coding experience. Around 55% have less than three years of experience with machine learning.
The report and underlying survey were described on Kaggle's website. Kaggle opened the 35-question survey for 3.5 weeks in October 2020 and collected over 20 thousand responses. The Enterprise Executive Summary Report focuses on the 13% of respondents who identified their job title as "data scientist." The report identifies several key results about data scientist demographics as well as popular data science and machine learning technologies. As with the three previous annual surveys, Kaggle has also released the anonymized response data. According to Kaggle,
"In our fourth year running this survey, we were once again awed by the global, diverse, and dynamic nature of the data science and machine learning industry."
The report contains graphs and analysis covering several attributes of the survey respondents: respondent profile, education, and experience; employment and work environment; and technologies and platforms. Several of the survey's questions about technology choices allowed multiple answers, so the percentages for a given question can total more than 100%.
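To illustrate why multiple-answer questions can add up to more than 100%, here is a minimal Python sketch. It is not Kaggle's own analysis code; the function name `option_shares` and the four sample respondents are hypothetical and chosen only for illustration. Each option's share is computed against the number of respondents, not the number of ticks, so a respondent who selects two tools counts toward both.

```python
from collections import Counter


def option_shares(responses):
    """Return the percentage of respondents who selected each option.

    `responses` holds one entry per respondent; each entry is the
    collection of options that respondent selected (possibly empty).
    """
    total = len(responses)
    if total == 0:
        return {}
    # Count each option at most once per respondent.
    counts = Counter(option for picks in responses for option in set(picks))
    return {option: 100.0 * n / total for option, n in counts.items()}


# Hypothetical answers from four respondents to a multi-select IDE question.
sample = [
    {"Jupyter", "Visual Studio"},
    {"Jupyter", "PyCharm"},
    {"Jupyter"},
    {"RStudio", "Visual Studio"},
]

shares = option_shares(sample)
for option, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {pct:.0f}%")
print(f"Sum of shares: {sum(shares.values()):.0f}%")
```

In this made-up sample the individual shares sum to 175%, because three of the four respondents chose more than one tool.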
The most popular IDE for data scientists was Jupyter, used by 74% of respondents; second place was Visual Studio, used by 43%, up from 30% last year. Both PyCharm and RStudio were used by about 30% of respondents. In response to questions about frameworks and libraries, over 80% reported using scikit-learn, and around 50% used Google's deep-learning framework TensorFlow. PyTorch, another popular deep-learning framework developed by Facebook, was used by 31%, up from 26% in 2019.
The most popular machine learning algorithm was linear regression, used by over 80% of data scientists, with decision trees and gradient boosting in close second and third place, respectively. Various neural network architectures were reported separately, with 43% using a convolutional neural network (CNN), 30% a recurrent neural network (RNN), and 15% a Transformer neural network.
Most data scientists reported using a public cloud provider, led by Amazon Web Services (AWS) at nearly 50%. About one-third reported using Google Cloud Platform (GCP), and 29% used Microsoft Azure. Basic compute infrastructure was the most common service used, with Amazon EC2 used by 40%. Function-as-a-service (FaaS) was also a popular choice: 21% used AWS Lambda, while the GCP and Azure FaaS solutions were at 12% and 9%, respectively. Container services had slightly less adoption, with AWS again the leader at 14%. Just over 17% were not using any cloud platform, down from 25% a year ago. One Twitter user noted that:
"[This] is most likely to indicate the entire market of cloud computing applications is not saturated yet."
Besides Kaggle, several other data science organizations have recently published survey results from 2020. Anaconda, maker of a Python distribution popular among data scientists, published a State of Data Science report based on "2,360 responses from over 100 countries," featuring questions about bias and privacy as well as operational metrics. AI software vendor Algorithmia published its report on the State of Enterprise Machine Learning, which surveys "thousands of companies in various stages of machine learning maturity" and highlights challenges related to machine-learning operations.
The raw survey response data are available for download from Kaggle's site, along with the results from previous years' surveys.