2.3.2.6 Dirty

Big data sources can be loaded with junk and spam.

Some researchers believe that big data sources, especially those from online sources, are pristine because they are collected automatically. In fact, people who have worked with big data sources know that they are frequently dirty. That is, they frequently include data that do not reflect real actions of interest to researchers. Many social scientists are already familiar with the process of cleaning large-scale social survey data, but cleaning big data sources is more difficult for two reasons: (1) they were not created by researchers for research, and (2) researchers generally have less understanding of how they were created.

The dangers of dirty digital trace data are illustrated by Back and colleagues' (2010) study of the emotional response to the attacks of September 11, 2001. Researchers typically study the response to tragic events using retrospective data collected over months or even years. But Back and colleagues found an always-on source of digital traces (the timestamped, automatically recorded messages from 85,000 American pagers), and this enabled the researchers to study emotional response on a much finer timescale. Back and colleagues created a minute-by-minute emotional timeline of September 11th by coding the emotional content of the pager messages by the percentage of words related to (1) sadness (e.g., crying, grief), (2) anxiety (e.g., worried, fearful), and (3) anger (e.g., hate, critical). They found that sadness and anxiety fluctuated throughout the day without a strong pattern, but that there was a striking increase in anger throughout the day. This research seems to be a wonderful illustration of the power of always-on data sources: using standard methods, it would be impossible to obtain such a high-resolution timeline of the immediate response to an unexpected event.
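
The word-percentage coding described above can be sketched in a few lines. This is only a minimal illustration, not the procedure Back and colleagues actually used: the word list `ANGER_WORDS` and the function `anger_share` are hypothetical names, and the list is far smaller than any dictionary a real study would rely on.

```python
import re

# Hypothetical, deliberately tiny word list (an assumption for illustration);
# a real study would use a validated dictionary with many more entries.
ANGER_WORDS = {"hate", "critical", "furious"}


def anger_share(message: str) -> float:
    """Return the percentage of words in `message` found on the anger list."""
    words = re.findall(r"[a-z']+", message.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in ANGER_WORDS)
    return 100.0 * hits / len(words)


print(anger_share("I hate this so much"))  # 20.0 (1 of 5 words)
```

A count like this ignores context: any message containing a listed word is scored, regardless of who or what sent it and why.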

Just one year later, however, Cynthia Pury (2011) looked at the data more carefully. She discovered that a large number of the supposedly angry messages were generated by a single pager, and they were all identical. Here is what those supposedly angry messages said:

"Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]"

These messages were labeled angry because they included the word "CRITICAL", which may generally indicate anger but does not in this case. Removing the messages generated by this single automated pager completely eliminates the apparent increase in anger over the course of the day (Figure 2.2). In other words, the main result in Back, Küfner, and Egloff (2010) was an artifact of one pager. As this example illustrates, relatively simple analysis of relatively complex and messy data has the potential to go seriously wrong.
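
One simple check that can surface a problem of this kind is to count how many flagged messages each source produced before interpreting any trend. The sketch below uses invented toy records; the pager IDs, the message texts other than the automated one, and the helper names `top_sources` and `drop_source` are all assumptions made for illustration, not part of either study.

```python
from collections import Counter

AUTOMATED = ("Reboot NT machine [name] in cabinet [name] at [location]"
             ":CRITICAL:[date and time]")

# Toy (pager_id, message) records flagged as "angry"; all values are invented.
flagged = [
    ("pager_A", AUTOMATED),
    ("pager_A", AUTOMATED),
    ("pager_A", AUTOMATED),
    ("pager_A", AUTOMATED),
    ("pager_B", "I hate this"),
    ("pager_C", "so furious right now"),
]


def top_sources(records, n=3):
    """Return (pager, count, share of all records) for the n busiest pagers."""
    counts = Counter(pager for pager, _ in records)
    total = sum(counts.values())
    return [(pager, count, count / total)
            for pager, count in counts.most_common(n)]


def drop_source(records, pager_id):
    """Return the records with every message from `pager_id` removed."""
    return [(pager, msg) for pager, msg in records if pager != pager_id]


print(top_sources(flagged, n=1))
print(len(drop_source(flagged, "pager_A")))
```

If one source accounts for a large share of the flagged messages, as `pager_A` does in this toy data, it is worth reading those messages and rerunning the analysis without them.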

Figure 2.2: Estimated trends in anger over the course of September 11, 2001, based on 85,000 American pagers (Back, Küfner, and Egloff 2010; Pury 2011; Back, Küfner, and Egloff 2011). Originally, Back, Küfner, and Egloff (2010) reported a pattern of increasing anger throughout the day. However, most of these apparently angry messages were generated by a single pager that repeatedly sent out the following message: "Reboot NT machine [name] in cabinet [name] at [location]:CRITICAL:[date and time]". With this message removed, the apparent increase in anger disappears (Pury 2011; Back, Küfner, and Egloff 2011). This figure is a reproduction of figure 1B in Pury (2011).

While dirty data that is created unintentionally, such as that from one noisy pager, can be detected by a reasonably careful researcher, there are also some online systems that attract intentional spammers. These spammers actively generate fake data and, often motivated by profit, work very hard to keep their spamming concealed. For example, political activity on Twitter seems to include at least some reasonably sophisticated spam, whereby some political causes are intentionally made to look more popular than they actually are (Ratkiewicz et al. 2011). Researchers working with data that may contain intentional spam face the challenge of convincing their audience that they have detected and removed the relevant spam.

Finally, what is considered dirty data can depend in subtle ways on your research question. For example, many edits to Wikipedia are created by automated bots (Geiger 2014). If you are interested in the ecology of Wikipedia, then these bots are important. But if you are interested in how humans contribute to Wikipedia, the edits made by these bots should be excluded.
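
The dependence on the research question can be made concrete with a small sketch. It assumes each edit already carries a reliable bot flag, which in practice is a separate and nontrivial identification task; the `Edit` record, the `select_edits` helper, and the question labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Edit:
    user: str
    is_bot: bool  # assumed known here; identifying bots is its own problem


def select_edits(edits, question):
    """Keep or drop bot edits depending on the research question."""
    if question == "ecology":
        # Bots are part of the system being studied, so keep everything.
        return list(edits)
    if question == "human_contribution":
        # Bot edits are "dirty" for this question, so exclude them.
        return [edit for edit in edits if not edit.is_bot]
    raise ValueError(f"unknown research question: {question!r}")


edits = [Edit("Alice", False), Edit("ExampleBot", True), Edit("Bob", False)]
print(len(select_edits(edits, "ecology")))
print([edit.user for edit in select_edits(edits, "human_contribution")])
```

The same records are clean for one question and dirty for the other, so the filtering rule belongs with the research question rather than with the data set.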

The best ways to avoid being fooled by dirty data are to understand how your data were created and to perform simple exploratory analysis, such as making simple scatter plots.