Data is the starting point for all the projects and products in data science. The first lines of a script are dedicated to reading the data. This is a fact and does not change based on the project. What changes is the source of data. There are a variety of places where we store the data such as databases, S3 bucket, BigQuery, external files, spreadsheets, and so on.
Google sheets are quite common for storing small to medium size data. One of the nice things about Google sheets is that you can directly connect to them from a Jupyter notebook. You don’t have to download the data in your local directory and read it from there. Another advantage of connecting from within a notebook is that you can update the sheet directly.
Consider you have a data manipulating task to do. You write your script that connects to Google sheets, read the data, do the manipulation, and write the updated file back to the same sheet. You can schedule this script so that the data in the sheet is always up to date.
In this article, we will learn how to access Google sheets from a Jupyter notebook.
Create a project
The first step is to create a project in the Google Cloud Console, which can be done by clicking on create a project in the console. You will have a quota of 12 free projects with your personal account.
Enable Google Drive API and Google Sheets API
The next step is to enable the Google Drive API and Google Sheets API. In the Google Cloud Console menu, select APIs & Services and then Enabled APIs & Services as shown below:
If you have multiple projects, you will need to select the one you want to enable the API for. If you only have one project, it will automatically be selected.
Click on “ENABLE APIS and SERVICES” and then search for “Google Drive API” and then “Google Sheets API”. In the search results, click on the related icons as shown below. Then click on ENABLE API in the page that opens up.
Create credentials
Now that we have the Google Drive API enabled, we need to create credentials. At the top right corner of the page that opened up after enabling the API, you will see the CREATE CREDENTIALS icon and then select service account credentials. The page shown below opens up.
Give it a name and click CREATE AND CONTINUE. Then the grant pages open up, just hit CONTINUE and DONE. You will then see the following screen. Click on the link in the email part.
It will take you to the service account details page. Copy the email address shown here. This email will be used for connecting to your account.
We also need to generate a key. Go back to the service accounts page. Click on the three dots under the actions.
Click on MANAGE KEYS and then ADD a KEY. It will ask you to choose the format, select JSON and hit CREATE. The json file will automatically be downloaded.
Connecting to the Google Sheet
You, of course, need a Google sheet to connect to. I created one that contains some sample sales data. You will need to share the Google sheet with the email address copied in the previous step. Click on the share button at the top right corner of the Google sheet, paste this address and hit SEND.
We are now ready to connect to this sheet. Open up a Jupyter notebook. We will use a Python library called gspread, which can be installed with pip.
!pip install gspread
The next step is to import the library.
import gspread
We need to give the service account details to gspread, which can be done using the service_account method.
sa = gspread.service_account(filename="project-1-357814-ba841f7c3630.json")
The path to the json file that contains the service credentials is passed to the filename parameter. If the file is in the same working directory as the notebook, you can just write the name of the file.
The sa
is a gspread client, which can be used for connecting to the sheets by using the open method and the sheet name.
sheet = sa.open("sample_sales")
A Google sheet document might have multiple pages (i.e. worksheets) so we also need to specify the page name before getting the data. Ours has one page that is called “Sheet1”.
work_sheet = sheet.worksheet("Sheet1")
We have the data in a gspread worksheet object. We can extract the data using the get_all_records
method and create a Pandas DataFrame as follows:
import pandas as pd
df = pd.DataFrame(work_sheet.get_all_records())
Let's take a look at the first 5 rows of the data using the head method.
df.head()
We have successfully connected to a Google sheet and retrieved the data it contains. Most of the steps we have completed need to be done only for once. After that, it is just a matter of writing the sheet name, which is definitely more practical than downloading the data as a CSV file and then reading it.
Organizing your data analysis code can become a challenging task, check out our open-source framework which allows you to build modular data analysis pipelines so you can extract those insights from the spreadsheets!