
## Metadata
- Author: [[aride-chettali|Aride Chettali]]
- Full Title:: Learnings From Streaming 25 Billion Events to Google BigQuery
- Category:: #🗞️Articles
- URL:: https://aride.medium.com/learnings-from-streaming-25-billion-events-to-google-bigquery-57ce81fa9898
- Finished date:: [[2023-04-07]]
## Highlights
> But Spark did not have any connector that writes data into BigQuery using the streaming APIs. All connectors I evaluated wrote to a GCS bucket and then performed a batch load into BigQuery. Hence I decided to write a BigQuery streaming sink for Spark and use it for my PoC. ([View Highlight](https://read.readwise.io/read/01gxdmx2mfj9rdprkffyxzddjm))
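
The post does not include the sink code, but a minimal sketch of one possible approach, using Spark Structured Streaming's `ForeachWriter` together with the `google-cloud-bigquery` Java client's `insertAll` streaming API, could look like this (the class name, dataset/table names, and batch size are illustrative assumptions, not the author's implementation):

```scala
import com.google.cloud.bigquery.{BigQuery, BigQueryOptions, InsertAllRequest, TableId}
import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.JavaConverters._

// Per-partition writer that buffers rows and flushes them to BigQuery
// through the streaming insertAll API (no intermediate GCS load).
class BigQueryStreamingSink(dataset: String, table: String, batchSize: Int = 500)
    extends ForeachWriter[Row] {

  @transient private var bigquery: BigQuery = _
  private var buffer: List[java.util.Map[String, AnyRef]] = Nil

  override def open(partitionId: Long, epochId: Long): Boolean = {
    bigquery = BigQueryOptions.getDefaultInstance.getService
    buffer = Nil
    true
  }

  override def process(row: Row): Unit = {
    // insertAll expects a column-name -> value map per row.
    val content = row.schema.fieldNames.map(f => f -> row.getAs[AnyRef](f)).toMap.asJava
    buffer = content :: buffer
    if (buffer.size >= batchSize) flush()
  }

  override def close(errorOrNull: Throwable): Unit =
    if (errorOrNull == null) flush()

  private def flush(): Unit = if (buffer.nonEmpty) {
    val builder = InsertAllRequest.newBuilder(TableId.of(dataset, table))
    buffer.foreach(r => builder.addRow(r))
    val response = bigquery.insertAll(builder.build())
    if (response.hasErrors)
      throw new RuntimeException(s"Streaming insert errors: ${response.getInsertErrors}")
    buffer = Nil
  }
}
```

A query would then attach it with something like `df.writeStream.foreach(new BigQueryStreamingSink("my_dataset", "events")).start()`.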
> The default quota for streaming is a maximum of 1 GB per second per GCP project. Any ingestion above this limit results in a **BigQueryException** with a ***quotaExceeded*** error ([View Highlight](https://read.readwise.io/read/01gxdmyaqd4f90ehqnnbm10h0b))
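
Not from the article, but as an illustration of how such a failure might be handled with the Java client, a retry-with-backoff sketch around `insertAll` (the reason-string check and backoff numbers are assumptions):

```scala
import com.google.cloud.bigquery.{BigQuery, BigQueryException, InsertAllRequest}

// Retry with exponential backoff when the streaming-bytes quota is exceeded.
def insertWithBackoff(bigquery: BigQuery, request: InsertAllRequest, maxRetries: Int = 5): Unit = {
  var attempt = 0
  var done = false
  while (!done) {
    try {
      val response = bigquery.insertAll(request)
      if (response.hasErrors)
        throw new RuntimeException(s"Row-level insert errors: ${response.getInsertErrors}")
      done = true
    } catch {
      case e: BigQueryException if e.getReason == "quotaExceeded" && attempt < maxRetries =>
        attempt += 1
        Thread.sleep(math.min(1000L * (1L << attempt), 30000L)) // back off before retrying
    }
  }
}
```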
> When I enabled “dedupe”, only one record was duplicated for every 5 million records ingested. ***Enabling deduplication does not guarantee 100% duplicate removal; rather, it is only a best effort to remove duplicates*** ([View Highlight](https://read.readwise.io/read/01gxdmz9f43mzr6fhenmf10r9n))
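
The best-effort deduplication referenced here is driven by the per-row `insertId` of the legacy streaming API; a minimal sketch of attaching one (the dataset/table names and id field are placeholders):

```scala
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}
import scala.collection.JavaConverters._

val bigquery = BigQueryOptions.getDefaultInstance.getService
val row: Map[String, AnyRef] = Map("event_id" -> "evt-42", "payload" -> "hello")

// Passing an insertId lets BigQuery drop repeats of the same row on a
// best-effort basis; it is not an exactly-once guarantee.
val request = InsertAllRequest.newBuilder(TableId.of("my_dataset", "events"))
  .addRow(row("event_id").toString, row.asJava) // insertId + row content
  .build()

val response = bigquery.insertAll(request)
if (response.hasErrors) println(s"Insert errors: ${response.getInsertErrors}")
```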
> With a streaming table you end up reading data from the write-optimized streaming buffer, and that's the exact reason for the higher latency ([View Highlight](https://read.readwise.io/read/01gxdn0bpjnncxsc4mn04z543f))
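
The streaming buffer the author refers to can be inspected from table metadata; a small sketch with the Java client, assuming a placeholder table name:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, StandardTableDefinition, TableId}

val bigquery = BigQueryOptions.getDefaultInstance.getService
val table = bigquery.getTable(TableId.of("my_dataset", "events"))

// Streaming-buffer stats are only present while recently streamed rows
// have not yet been flushed into columnar (read-optimized) storage.
val definition = table.getDefinition[StandardTableDefinition]
Option(definition.getStreamingBuffer).foreach { buf =>
  println(s"Estimated buffered rows:  ${buf.getEstimatedRows}")
  println(s"Estimated buffered bytes: ${buf.getEstimatedBytes}")
}
```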