
## Metadata
- Author: [[aride-chettali|Aride Chettali]]
- Full Title:: Learnings From Streaming 25 Billion Events to Google BigQuery
- Category:: #🗞️Articles
- URL:: https://aride.medium.com/learnings-from-streaming-25-billion-events-to-google-bigquery-57ce81fa9898
- Finished date:: [[2023-04-07]]
## Highlights
> But Spark did not have any connector that writes data into BigQuery using the streaming APIs. All connectors I evaluated wrote to a GCS bucket and then performed a batch load into BigQuery. Hence I decided to write a BigQuery streaming sink for Spark and use it for my PoC. ([View Highlight](https://read.readwise.io/read/01gxdmx2mfj9rdprkffyxzddjm))
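
The post does not include the sink code, but a minimal sketch of one possible approach, using Spark Structured Streaming's `ForeachWriter` together with the `google-cloud-bigquery` Java client's `insertAll` streaming API, could look like this (the class name, dataset/table names, and batch size are illustrative assumptions, not the author's implementation):

```scala
import com.google.cloud.bigquery.{BigQuery, BigQueryOptions, InsertAllRequest, TableId}
import org.apache.spark.sql.{ForeachWriter, Row}
import scala.collection.JavaConverters._

// Per-partition writer that buffers rows and flushes them to BigQuery
// through the streaming insertAll API (no intermediate GCS load).
class BigQueryStreamingSink(dataset: String, table: String, batchSize: Int = 500)
    extends ForeachWriter[Row] {

  @transient private var bigquery: BigQuery = _
  private var buffer: List[java.util.Map[String, AnyRef]] = Nil

  override def open(partitionId: Long, epochId: Long): Boolean = {
    bigquery = BigQueryOptions.getDefaultInstance.getService
    buffer = Nil
    true
  }

  override def process(row: Row): Unit = {
    // insertAll expects a column-name -> value map per row.
    val content = row.schema.fieldNames.map(f => f -> row.getAs[AnyRef](f)).toMap.asJava
    buffer = content :: buffer
    if (buffer.size >= batchSize) flush()
  }

  override def close(errorOrNull: Throwable): Unit =
    if (errorOrNull == null) flush()

  private def flush(): Unit = if (buffer.nonEmpty) {
    val builder = InsertAllRequest.newBuilder(TableId.of(dataset, table))
    buffer.foreach(r => builder.addRow(r))
    val response = bigquery.insertAll(builder.build())
    if (response.hasErrors)
      throw new RuntimeException(s"Streaming insert errors: ${response.getInsertErrors}")
    buffer = Nil
  }
}
```

A query would then attach it with something like `df.writeStream.foreach(new BigQueryStreamingSink("my_dataset", "events")).start()`.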
> The default quota for streaming is a maximum of 1 GB per second per GCP project. Any ingestion above this limit results in a **BigQueryException** with a ***quotaExceeded*** error ([View Highlight](https://read.readwise.io/read/01gxdmyaqd4f90ehqnnbm10h0b))
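
Not from the article, but as an illustration of how such a failure might be handled with the Java client, a retry-with-backoff sketch around `insertAll` (the reason-string check and backoff numbers are assumptions):

```scala
import com.google.cloud.bigquery.{BigQuery, BigQueryException, InsertAllRequest}

// Retry with exponential backoff when the streaming-bytes quota is exceeded.
def insertWithBackoff(bigquery: BigQuery, request: InsertAllRequest, maxRetries: Int = 5): Unit = {
  var attempt = 0
  var done = false
  while (!done) {
    try {
      val response = bigquery.insertAll(request)
      if (response.hasErrors)
        throw new RuntimeException(s"Row-level insert errors: ${response.getInsertErrors}")
      done = true
    } catch {
      case e: BigQueryException if e.getReason == "quotaExceeded" && attempt < maxRetries =>
        attempt += 1
        Thread.sleep(math.min(1000L * (1L << attempt), 30000L)) // back off before retrying
    }
  }
}
```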
> When I enabled “dedupe”, only one record was duplicated for every 5 million records ingested. ***Enabling deduplication does not guarantee 100% duplicate removal; rather, it is only a best effort to remove duplicates*** ([View Highlight](https://read.readwise.io/read/01gxdmz9f43mzr6fhenmf10r9n))
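
The best-effort deduplication referenced here is driven by the per-row `insertId` of the legacy streaming API; a minimal sketch of attaching one (the dataset/table names and id field are placeholders):

```scala
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}
import scala.collection.JavaConverters._

val bigquery = BigQueryOptions.getDefaultInstance.getService
val row: Map[String, AnyRef] = Map("event_id" -> "evt-42", "payload" -> "hello")

// Passing an insertId lets BigQuery drop repeats of the same row on a
// best-effort basis; it is not an exactly-once guarantee.
val request = InsertAllRequest.newBuilder(TableId.of("my_dataset", "events"))
  .addRow(row("event_id").toString, row.asJava) // insertId + row content
  .build()

val response = bigquery.insertAll(request)
if (response.hasErrors) println(s"Insert errors: ${response.getInsertErrors}")
```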
> With a streaming table you end up reading data from the write-optimized streaming buffer, and that's the exact reason for the higher latency ([View Highlight](https://read.readwise.io/read/01gxdn0bpjnncxsc4mn04z543f))
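
The streaming buffer the author refers to can be inspected from table metadata; a small sketch with the Java client, assuming a placeholder table name:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, StandardTableDefinition, TableId}

val bigquery = BigQueryOptions.getDefaultInstance.getService
val table = bigquery.getTable(TableId.of("my_dataset", "events"))

// Streaming-buffer stats are only present while recently streamed rows
// have not yet been flushed into columnar (read-optimized) storage.
val definition = table.getDefinition[StandardTableDefinition]
Option(definition.getStreamingBuffer).foreach { buf =>
  println(s"Estimated buffered rows:  ${buf.getEstimatedRows}")
  println(s"Estimated buffered bytes: ${buf.getEstimatedBytes}")
}
```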