2.3.1.1 Big

Large datasets are a means to an end; they are not an end in themselves.

The first of the three good characteristics of big data is the most discussed: these data are big. These data sources can be big in three different ways: many people, lots of information per person, or many observations over time. Having a big dataset enables some specific types of research: measuring heterogeneity, studying rare events, detecting small differences, and making causal estimates from observational data. It also seems to lead to a specific type of sloppiness.

The first thing for which size is particularly useful is moving beyond averages to make estimates for specific subgroups. For example, Gary King, Jennifer Pan, and Molly Roberts (2013) measured the probability that social media posts in China would be censored by the government. By itself, this average probability of deletion is not very helpful for understanding why the government censors some posts but not others. But because their dataset included 11 million posts, King and colleagues also produced estimates of the probability of censorship for posts in 85 separate categories (e.g., pornography, Tibet, and traffic in Beijing). By comparing the probability of censorship for posts in different categories, they were able to understand more about how and why the government censors certain types of posts. With 11 thousand posts (rather than 11 million posts), they would not have been able to produce these category-specific estimates.
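To make the contrast between 11 thousand and 11 million posts concrete, here is a minimal sketch of how the precision of a per-category estimate changes with dataset size. It is not the method used by King and colleagues. It assumes, purely for illustration, that posts are spread evenly across the 85 categories and that each category has a hypothetical censorship rate of 10%; the names `ASSUMED_RATE` and `per_category_se` are invented for this example.

```python
import math

N_CATEGORIES = 85      # number of categories mentioned in the text
ASSUMED_RATE = 0.10    # hypothetical censorship rate, for illustration only


def proportion_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent posts."""
    return math.sqrt(p * (1 - p) / n)


def per_category_se(total_posts: int) -> float:
    """Standard error of one category's estimate.

    Assumption: posts are divided evenly across categories, which real
    data would almost certainly violate.
    """
    posts_per_category = total_posts // N_CATEGORIES
    return proportion_standard_error(ASSUMED_RATE, posts_per_category)


se_small = per_category_se(11_000)       # 11 thousand posts
se_large = per_category_se(11_000_000)   # 11 million posts

print(f"11 thousand posts: standard error per category = {se_small:.4f}")
print(f"11 million posts:  standard error per category = {se_large:.4f}")
print(f"ratio = {se_small / se_large:.1f}")
```

Under these assumptions, the per-category standard error with 11 million posts is roughly 30 times smaller than with 11 thousand posts, which is what makes comparisons between categories informative. Real categories are uneven in size, so small categories would be estimated even less precisely than this sketch suggests.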

Second, size is particularly useful for studying rare events. For example, Goel and colleagues (2015) wanted to study the different ways that tweets can go viral. Because large cascades of retweets are extremely rare (about one in 3,000), they needed to study more than a billion tweets in order to find enough large cascades for their analysis.

Third, large datasets enable researchers to detect small differences. In fact, much of the focus on big data in industry is about these small differences: reliably detecting the difference between 1% and 1.1% click-through rates on an ad can translate into millions of dollars in extra revenue. In some scientific settings, such small differences might not be particularly important (even if they are statistically significant). But in some policy settings, such small differences can become important when viewed in aggregate. For example, if there are two health interventions and one is slightly more effective than the other, then switching to the more effective intervention could end up saving thousands of additional lives.
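A rough calculation shows why detecting a difference between 1% and 1.1% click-through rates calls for a large dataset. The sketch below uses the standard normal-approximation formula for comparing two proportions; the significance level (5%, two-sided) and power (80%) are conventional choices assumed here, not values from the text, and `sample_size_per_group` is a name invented for this example.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p1: float, p2: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate impressions needed per ad to distinguish rates p1 and p2.

    Uses the normal approximation for a two-sided test of two proportions:
    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / (p1 - p2) ** 2)


n_needed = sample_size_per_group(0.010, 0.011)
print(f"Impressions needed per ad: about {n_needed:,}")
```

Under these assumptions the answer is on the order of 160,000 impressions for each ad, far more than a typical small study would collect, and the requirement grows quickly as the difference of interest shrinks.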

Finally, large datasets greatly increase our ability to make causal estimates from observational data. Although large datasets don't fundamentally change the problems with making causal inferences from observational data, matching and natural experiments (two techniques that researchers have developed for making causal claims from observational data) both greatly benefit from large datasets. I'll explain and illustrate this claim in greater detail later in this chapter, when I describe research strategies.

Although bigness is generally a good property when used correctly, I've noticed that bigness commonly leads to a conceptual error. For some reason, bigness seems to lead researchers to ignore how their data were generated. While bigness does reduce the need to worry about random error, it actually increases the need to worry about systematic errors, the kinds of errors, described in more detail below, that arise from biases in how data are created and collected. In a small dataset, both random error and systematic error can be important, but in a large dataset random error can be averaged away and systematic error dominates. Researchers who don't think about systematic error will end up using their large datasets to get a precise estimate of the wrong thing; they will be precisely inaccurate (McFarland and McFarland 2015).
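A small simulation can illustrate the point that random error averages away while systematic error does not. This is a sketch under invented assumptions: a true mean of 50, a fixed measurement bias of 2, and normally distributed noise with standard deviation 10. None of these values come from the text, and the names `TRUE_MEAN`, `BIAS`, and `biased_sample_mean` are hypothetical.

```python
import random
import statistics

TRUE_MEAN = 50.0   # hypothetical quantity the researcher wants to estimate
BIAS = 2.0         # hypothetical systematic error built into data collection
NOISE_SD = 10.0    # hypothetical spread of the random error


def biased_sample_mean(n: int, rng: random.Random) -> float:
    """Mean of n observations that share the same systematic bias."""
    return statistics.fmean(
        rng.gauss(TRUE_MEAN + BIAS, NOISE_SD) for _ in range(n)
    )


rng = random.Random(0)  # fixed seed so the run is repeatable
errors = {
    n: biased_sample_mean(n, rng) - TRUE_MEAN
    for n in (100, 10_000, 1_000_000)
}

for n, error in errors.items():
    print(f"n = {n:>9,}: estimate minus truth = {error:+.3f}")
```

As the sample grows, the estimate settles ever more tightly, but it settles around the biased value rather than the true one. With a million observations the error is almost exactly the bias: a very precise estimate of the wrong thing.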