S2 E4: How Climate Engine autoscales with BigQuery to process billions of records

Video Link: https://www.youtube.com/watch?v=gckyujAuqCI



Duration: 8:14


Climate Engine was founded in 2014 to help organizations reduce climate risks using satellite-derived data and analytics. With support from Google Earth Engine, Climate Engine has unprecedented access to raw data from Earth observation platforms, allowing them to understand the Earth’s changing climate systems and find solutions to sustainably manage natural resources, grow food, conserve water, and protect infrastructure.
It’s not uncommon for Climate Engine and their flagship software SpatiaFi to run jobs that process hundreds of millions of locations, assessing risks to a company’s physical assets across numerous climate factors like floods, fires, and wind. When working with a large bank or insurance company, this might mean up to 200 million properties across their real estate vertical. “There’ve been a couple of heavy days where we process over one billion records per day,” says Climate Engine’s CTO, Bennett Kanuka. “Depending on a couple different factors, we generally run around 200-500 locations per second per worker.”
Kanuka’s background is in data science and statistics, and his biggest takeaway coming up in the industry was that data science was moving faster than companies could adopt it. The data being collected and the algorithms being developed far outpaced companies’ abilities to integrate them. “The way that I see I can make the most impact is if I make the data and insights easy to use, easy to understand, and easy to integrate into everyday decision-making.”
For Kanuka, Google Cloud products have been integral to that process. While many GCP products come into play, GKE Autopilot, Earth Engine, and BigQuery form the core of Climate Engine’s stack—particularly when it comes to autoscaling. For spiky workloads, “autoscaling is vital,” says Kanuka. “It’s really not a surprise the BigQuery team put so much effort into this, especially after the success of GKE Autopilot. I couldn’t imagine this not being hugely successful and a gamechanger for many GCP users.” While the benefits of autoscaling are obvious (pay only for what you need, when you need it), the risks are cost overruns and dependent systems that fail to scale in step. BigQuery and GCP as a whole are addressing those issues with cost controls and caps on the number of slots, while bringing autoscaling to more of their earlier products.
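As a concrete illustration of those cost controls, a capped autoscaling reservation can be created with the `bq` command-line tool. This is a sketch, not Climate Engine’s configuration: the project ID, region, reservation name, and slot counts are illustrative, and the flags assume BigQuery’s editions-based reservation CLI.

```shell
# Sketch: a reservation with a baseline of 100 always-on slots that
# autoscaling may grow by up to 300 more, hard-capping spend at 400 slots.
# (Project, location, and reservation name are hypothetical.)
bq mk --project_id=my-project --location=US \
    --reservation --edition=ENTERPRISE \
    --slots=100 --autoscale_max_slots=300 \
    capped-autoscale-reservation
```

The cap is the key risk control Kanuka alludes to: autoscaling can absorb bursts, but the bill can never exceed the configured maximum slot count.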
Editions has simplified BigQuery pricing, bringing a level of clarity around which features to use and what users pay for, while aligning with a company’s data journey. “They can start out with the starter edition with minimal cost and risk and grow into the wider BigQuery ecosystem of tools as they see fit,” Kanuka adds. “We have the ability for all systems to scale in harmony, automatically.”
Whether assessing climate risks or not, Kanuka also has some tips and tricks for other data scientists out there: “Set your min and max slots and workers appropriately! Use a queue.” And of course, “don’t aggressively use cold storage.”
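Kanuka’s “use a queue” advice can be sketched in a few lines of Python: a bounded work queue feeds a fixed pool of workers, so throughput scales with the worker count while the queue absorbs bursts and applies backpressure. The per-location function here is a stand-in, not Climate Engine’s actual risk computation.

```python
import queue
import threading

def process_location(location):
    """Stand-in for a real per-location risk computation."""
    return location * 2

def worker(work_q, results, lock):
    """Pull locations off the queue until a None sentinel arrives."""
    while True:
        loc = work_q.get()
        if loc is None:
            work_q.task_done()
            break
        out = process_location(loc)
        with lock:
            results.append(out)
        work_q.task_done()

def run(locations, num_workers=4, max_queue=1000):
    # Bounded queue: put() blocks when full, throttling the producer.
    work_q = queue.Queue(maxsize=max_queue)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(work_q, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for loc in locations:
        work_q.put(loc)
    for _ in threads:
        work_q.put(None)  # one sentinel per worker to shut down cleanly
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(sorted(run(range(10))))
```

Setting `num_workers` and `max_queue` is the in-process analogue of setting min/max slots: the pool size caps concurrency (and cost), while the bounded queue keeps bursty producers from overwhelming downstream systems.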
To learn more about Climate Engine’s data endeavors, listen to the full episode of Google Data Journeys with Bruno Aziza.