72 points by isp 128 days ago
I'd really start with Google BigQuery Analytics by Jordan Tigani and Siddartha Naidu. Jordan's one of the core developers and shows up to BigQuery and GCP related conferences.
Jordan has since become the engineering lead on bigquery and Dremel!
The book is still very relevant but would be nice to get a refresh - many things have changed. Perhaps the biggest one is the execution engine. Jordan gave a nice talk on this in March:
Excellent suggestion - though remember when reading that BigQuery has since added "Standard SQL" - which is much more flexible and close to the standard (as the name suggests!).
A collection of condensed, easy-to-read posts that I found helpful when learning about BigQuery (BQ).
Great sources! Thanks.
I would love to know why BigQuery has a limit of 1,000 loads per day, per project .
I'm a founder of an early stage company that helps companies ETL their data into various Data warehouses, and I'm frequently met with companies that have evaluated BigQuery but ended up with Redshift or Snowflake because the maximum inserts/update/deletes are too low for their ETL process. Until these limits are increased, BigQuery is not an alternative to Redshift or Snowflake for a vast number of analytics teams that query thousands of tables that are updated hourly.
I work at Google and authored some of those posts.
The limitation is per table and for "batch loads" only. If you have per row inserts, you can always use bigquery's amazing streaming API, you can also do materialization and federated query loads.
Batch limits are there to help folks from creating too many small files, which affects performance (I believe Jordan's book explains this in greater detail).
Would love to know more about your use cases since this type of thing shouldn't be the deciding factor in choosing an analytics data warehouse and should be easy to architect for. As someone pointed out, most quotas are negotiable. If anything, bigquery ingest process is pretty amazing and unique in the industry .
Apologies for brevity, I'm out of office on paternity leave.
Thank you for the posts that you authored. These were very helpful for me at $DAY_JOB. BigQuery itself is a very impressive piece of tech.
Thanks for reading and using Google cloud :)
All use cases were a firehose of raw data being pulled from a Kafka topic every 15 seconds - 1 minute, written to storage, and inserted into 1,000+ tables.
Yeah definitely streaming API. This use case is MADE for BigQuery :)
Check out, as well as the blog posts on the required reading list on the topic of inserts. There's also a pretty good Kafka-bq connector.
It sounds like "ingest" was the big factor in your choice of a data warehouse. Regardless, I think bigquery has the best story in the industry here (but I'm biased).
I'd help more hands on but sadly not online for a while :)
Also, if the quota can be negotiated, I would reccomended verbiage closer to that of the GA reporting API limits:
50,000 requests per project per day, which can be increased.
Is it 1000 updates across all tables per day, or 1000 updates per table per day? Asking because the page says "Daily update limit: 1,000 updates per table per day," which sounds like you could update thousands of tables hourly.
The method of interest here is loading data into BigQuery:
Which has a limit of 50,000 loads per project, per day.
edit: corrected limit.
It's actually per table:
Daily limit: 1,000 load jobs per table per day (including failures), 50,000 load jobs per project per day (including failures)
I think the messages are crossing paths here.
Your initial concern of 1000 loads per project is actually 50000 and can be raised.
The limit of 1000 loads per TABLE per day is there, can't be raised, and a couple posts above explain how to work with this limit. This limit is also higher for partitioned tables.
Apologies for brevity.
To clarify, the quota is 1,000 loads per table per day, and 50,000 loads per project per day for batch loads.
In practice, I've never hit either of these quotas, because our data architecture is such that we'd have to have some kind of major operational screw up to encounter them (e.g., a run-away failed import job that somehow manages to fail 50,000 times in a single day). Regardless, you can work around the per-table limit by using Wildcard Tables.
In my case, the source of my data is strictly batch-oriented, so importing data into distinct tables with a common prefix solves this problem for me. Where I do use partition tables, the likelihood that I'm going to update a single partition 1000 times in a day is highly unlikely (I'd be curious about any use case for that).
The DML features of standard SQL are probably best viewed as a "helpful backup" rather than a general solution. Yes, you can run UPDATE and INSERT statements against your non-partitioned tables, but you probably want to limit that to situations where you are trying to correct something specific, or on an ad-hoc basis, and not as a general solution.
In any case, there are a number of tools you can use to not be impacted by this particular limitation. For example, you could use streaming. And while streaming has it's own quota (100,000 inserts per project per second), it's straightforward to ask for an increased quota.
The trick is to think of BQ like you do Hive or other "non-typical" SQL databases: yes it has a SQL query interface, but if you treat it like PGSQL or Oracle, you aren't going to get the benefits out of it.
If you are going to try and perform OLTP operations against BQ you are going to fail. It's a data warehouse, not a transactional system. Working with log-structured data, or append-only data, is where it really shines.
We evaluated streaming in each of the cases and it wasn't sufficient. We ended up doing, essentially, the same thing with Redshift and it worked like a charm.
I’d be curious to know more about the workloads that failed. What kind of data were you streaming?
I'm pretty sure this limitation could be negotiated as most other quotas on Google cloud
Perhaps, but this verbiage on the BQ quotas page isn't very welcoming:
The limit of 1,000 load jobs per table per day cannot be raised.
Thank you for this great reading list!
Currently I'm using Firebase for a medium-sized BI web app, but I'm really interested in BigQuery for bigger datasets.