They are not really comparable though, are they? CERN decided to discard 99.99% of their data. OP may choose not to discard any of theirs.
Plus, we cannot assume both sets of software are as efficient as they could be. OP's software could be using really bad compression (or maybe none at all).
I just don't really understand why everyone is trying to argue about the quantity of data. It's not even close to being an impossible amount (given current technology). Also, maybe the numbers are estimates for 5 years from now. You wouldn't want a system that only works for a week, would you?
It is just possible to handle this amount of data with a dedicated 10 Gbps connection (the actual data rate is 6.6 Gbps), but once you take into account framing, collisions, etc., it looks very iffy.
[Probably have multiple systems receiving the data]
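To put a rough number on the framing overhead mentioned above, here is a back-of-the-envelope sketch. The 6.6 Gbps and 10 Gbps figures come from the post; everything else (a standard 1500-byte MTU, TCP over IPv4 with no options, no retransmissions) is my own assumption, and the function name is just illustrative.

```python
# Rough link budget: how much wire bandwidth does 6.6 Gbps of payload need?
# Assumptions (not from the thread): 1500-byte MTU, TCP over IPv4 without
# options, no retransmissions or other protocol chatter.

LINK_RATE_GBPS = 10.0
PAYLOAD_RATE_GBPS = 6.6            # figure quoted in the post

MTU = 1500                         # bytes of IP packet carried per Ethernet frame
IP_TCP_HEADERS = 20 + 20           # IPv4 header + TCP header, no options
ETH_OVERHEAD = 8 + 14 + 4 + 12     # preamble/SFD, header, FCS, inter-frame gap


def wire_rate_gbps(payload_gbps, mtu=MTU):
    """Scale a payload rate up to the rate actually occupied on the wire."""
    payload_per_frame = mtu - IP_TCP_HEADERS
    wire_bytes_per_frame = mtu + ETH_OVERHEAD
    return payload_gbps * wire_bytes_per_frame / payload_per_frame


needed = wire_rate_gbps(PAYLOAD_RATE_GBPS)
utilisation = needed / LINK_RATE_GBPS
print(f"wire rate needed: {needed:.2f} Gbps ({utilisation:.0%} of the link)")
```

Under those assumptions, framing alone pushes the requirement to roughly 7 Gbps, which leaves little headroom for bursts or retransmissions on a single link. Passing `mtu=9000` (jumbo frames, if the network supports them) shrinks the overhead considerably.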
The interfaces (NVMe, etc.) can handle this data rate, but building a storage system that can handle this sort of sustained write rate is non-trivial.
[Probably use multiple disks running in parallel]
Once you have the data stored locally, you must read it off the storage at the same rate (otherwise you will eventually run out of space), process it, and store it somewhere else.
[The initial processing of this much data would presumably require a massively parallel system, with all the communication and synchronization issues that this entails. Have at least one primary processing node for each receiving system]
How will secondary, tertiary, etc. processing be done?
[Whether you have one secondary processor for one or more primary processors or vice versa depends on the amount of data and the processing required. Again, we have synchronization and communication issues]
Presentation of the results?
[Presumably requires that the results of the processing be sent to a single node. Synchronization, communication issues...]
Freedom is the freedom to say that two plus two make four. If that is granted, all else follows.
-- 6079 Smith W.
There are so many questions yet to be answered before jumping to which computer to buy. Questions like:
Does all the data from all around the world end up in a single data center?
Does this data center do all the processing?
Is this really needed, or can the processing be distributed around the world?
Sure, at some point you may need all your data in one location for some kind of analysis. But does this have to be in real time? Do you need the "raw" data, or would processed data from remote servers work fine?
I can think of more if I spend some more time on it.
You could look at Dell EMC Isilon for the storage. I worked on a system for an automotive company a couple of years ago where they were collecting and analysing 2PB per week of video and telemetry for self-driving car development.
The Isilon storage is NAS and modular, so you can add nodes to a cluster as the requirements grow. It is quite an interesting challenge because at 2PB per week you have a constant data input stream of, on average, 3.6 GB/s that has to be stored. On top of that, backups have to be made, and of course users must be able to access the system for data analysis runs. That's a lot of parallel data movement.
Networking is also a challenge: the initial system for 13PB had over one hundred storage nodes, each with 40 Gb/s front-end networking ports to connect to the server farm. The system also has its own private network that supports striping data across nodes for availability and protection from failures.
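For anyone wanting to redo the per-week to per-second conversion for their own volumes, a minimal sketch follows. The 2PB per week figure is from the post; the function name and the choice to show both decimal (10^15 bytes) and binary (2^50 bytes) petabytes are my own assumptions, since the post does not say which unit was meant.

```python
# Convert a weekly data volume into an average sustained ingest rate.
# Assumption: data arrives evenly over the whole week (real traffic is burstier).

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def average_rate_gb_per_s(petabytes_per_week, bytes_per_pb=10**15):
    """Average rate in GB/s (10^9 bytes per second) for a weekly volume."""
    total_bytes = petabytes_per_week * bytes_per_pb
    return total_bytes / SECONDS_PER_WEEK / 10**9


decimal = average_rate_gb_per_s(2)           # 2 PB as 2 * 10^15 bytes
binary = average_rate_gb_per_s(2, 2**50)     # 2 PiB as 2 * 2^50 bytes

print(f"decimal petabytes: {decimal:.2f} GB/s")
print(f"binary petabytes:  {binary:.2f} GB/s")
```

The two unit conventions bracket the figure quoted above, so the exact number depends on how the petabytes were counted. Either way the sustained rate is a few GB/s before backups and analysis reads are added on top.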
I was the solution architect for the system. It was one of my last projects before I retired from EMC in 2018.