I make time to think every day. Thinking is very important: it is a process of digestion and continuous deepening. As the following sentence says:
So, have we ever thought about big data itself? What exactly does big data do? Why have I worked on big data for so many years and still not finished the job? The essence of big data is:
The essence of machine learning is:
Where does big data consume the most workload?
Currently, about 80% of the workload goes into data collection, cleaning, and verification. The work itself is not difficult, but it is tedious and laborious. We sigh every day:
What frustrates us is that when a new requirement arrives, the existing data format often cannot meet it, and we have to go through collection, cleaning, and verification all over again on the data we already have. It feels like a curse, like poor Sisyphus, who was sentenced to push a boulder up a steep mountain. Every time he exerted all his strength and the boulder was about to reach the top, it would slip from his hands and he would have to push it up again, in endless labor.
What is the biggest technical difficulty currently encountered in big data?
It is ad-hoc querying over massive data. When Hadoop first emerged, we could use it to harness increasingly cheap PC servers, and a brute-force mindset permeated the entire ecosystem:
But as the requirements on query efficiency grow higher and higher, we are forced to change. Remember when our logs were all plain raw text? Now a variety of storage formats are gradually flourishing:
In short, we don’t seem to have found a magical technology that solves the query problem, so we can only compromise: to speed up queries, data storage has gradually moved from early raw text to columnar structures that are vectorized, indexed, and support specific encodings and compression. Of course, reorganizing the storage structure this way inevitably costs time and resources when data is written. In other words, we trade write-time cost for query speed.
How to make the hard labor work less
As mentioned earlier, perhaps 80% of our work is spent on data collection, cleaning, and verification. But how do we shrink this part of the work? The answer is:
Letting all computation flow as a stream makes it easy to:
We also hope that the streaming implementation combines streaming and batch semantics. Why? Looking at Huawei's StreamCQL on Storm, we can see that pure real-time streaming is too limited in many cases, because in the future we will want to do much more with streaming:
This requires a certain degree of flexibility, because only on a data set, rather than on a single record, can there be ad-hoc queries, efficient storage, and adaptation to some machine learning algorithms. In many cases, a single piece of data does not mean much on its own. This is why I have always been a supporter of Spark Streaming.
So why do we need a streaming computing superstructure?
Let's review the problem. Data ETL is hard labor that consumes a great deal of programmers' working time. To reduce this time, we have two ways:
Stream computing provides the entire foundation, and the framework built on top of it makes the two points above possible.
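To make the idea of combining streaming and batch semantics concrete, here is a minimal sketch in plain Python. It is not the Spark Streaming API and not the author's implementation. It assumes a hypothetical log format `user,action,ms`, and the function names (`micro_batches`, `parse`, `to_columns`, `run`) are invented for illustration. Each small batch taken from the stream is cleaned and verified on arrival, then pivoted into a column-oriented layout, so that every step works on a data set rather than on a single record.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional


def micro_batches(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group a (possibly unbounded) stream of raw lines into small lists."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def parse(line: str) -> Optional[dict]:
    """Clean and verify one raw line; return None if it is malformed.

    The 'user,action,ms' layout is a hypothetical format for this sketch.
    """
    parts = line.strip().split(",")
    if len(parts) != 3:
        return None
    user, action, ms = parts
    if not ms.isdigit():
        return None
    return {"user": user, "action": action, "ms": int(ms)}


def to_columns(rows: List[dict]) -> Dict[str, list]:
    """Pivot a batch of row dicts into a column-oriented layout."""
    cols: Dict[str, list] = {"user": [], "action": [], "ms": []}
    for row in rows:
        for name in cols:
            cols[name].append(row[name])
    return cols


def run(stream: Iterable[str], size: int = 2) -> Iterator[Dict[str, list]]:
    """Clean each micro-batch as it arrives and emit it in columnar form."""
    for batch in micro_batches(stream, size):
        rows = [r for r in map(parse, batch) if r is not None]
        if rows:
            yield to_columns(rows)


raw = [
    "alice,click,120",
    "broken line",      # wrong number of fields, dropped
    "bob,view,80",
    "carol,click,x",    # non-numeric duration, dropped
    "dave,view,45",
]
result = list(run(raw, size=2))
```

Because cleaning happens once, while the data flows in, a later requirement can query the columnar batches directly instead of repeating collection, cleaning, and verification over the raw pile. A real system would persist each batch in a columnar file format rather than keep Python lists in memory.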