Further commentary

This section is designed to be used as a reference, rather than to be read as a narrative.

  • Introduction (Section 2.1)

One kind of observing that is not included in this chapter is ethnography. For more on ethnography in digital spaces, see Boellstorff et al. (2012), and for more on ethnography in mixed digital and physical spaces, see Lane (2016).

  • Big data (Sashe 2.2)

When you are repurposing data, two mental tricks can help you understand the problems you might encounter. First, try to imagine the ideal dataset for your problem and compare it to the dataset that you are using. How are they similar, and how are they different? If you didn't collect the data yourself, there are likely to be differences between what you want and what you have. You then have to decide whether these differences are minor or major.

Second, remember that someone created and collected your data for some reason. You should try to understand their reasoning. This kind of reverse-engineering can help you identify possible problems and biases in your repurposed data.

There is no single consensus definition of "big data", but many definitions seem to focus on the 3 Vs: volume, variety, and velocity (e.g., Japec et al. (2015)). Rather than focusing on the characteristics of the data, my definition focuses more on why the data were created.

My inclusion of government administrative data in the category of big data is a bit unusual. Others who have made this case include Legewie (2015), Connelly et al. (2016), and Einav and Levin (2014). For more about the value of government administrative data for research, see Card et al. (2010), Taskforce (2012), and Grusky, Smeeding, and Snipp (2015).

For a view of administrative research from inside the government statistical system, particularly the US Census Bureau, see Jarmin and O'Hara (2016). For a book-length treatment of administrative records research at Statistics Sweden, see Wallgren and Wallgren (2007).

In the chapter, I briefly compared a traditional survey such as the General Social Survey (GSS) with a social media data source such as Twitter. For a thorough and careful comparison between traditional surveys and social media data, see Schober et al. (2016).

  • Common characteristics of big data (Section 2.3)

These 10 characteristics of big data have been described in a variety of ways by a variety of authors. Writing that influenced my thinking on these issues includes: Lazer et al. (2009), Groves (2011), Howison, Wiggins, and Crowston (2011), boyd and Crawford (2012), Taylor (2013), Mayer-Schönberger and Cukier (2013), Golder and Macy (2014), Ruths and Pfeffer (2014), Tufekci (2014), Sampson and Small (2015), Lewis (2015), Lazer (2015), Horton and Tambe (2015), Japec et al. (2015), and Goldstone and Lupyan (2016).

Throughout this chapter, I've used the term digital traces, which I think is relatively neutral. Another popular term for digital traces is digital footprints (Golder and Macy 2014), but as Hal Abelson, Ken Ledeen, and Harry Lewis (2008) point out, a more appropriate term is probably digital fingerprints. When you create footprints, you are aware of what is happening, and your footprints cannot generally be traced to you personally. The same is not true for your digital traces. In fact, you are leaving traces all the time, about which you have very little knowledge. And although these traces don't have your name on them, they can often be linked back to you. In other words, they are more like fingerprints: invisible and personally identifying.

Big

For more on why large datasets make statistical tests problematic, see Lin, Lucas, and Shmueli (2013) and McFarland and McFarland (2015). These issues should lead researchers to focus on practical significance rather than statistical significance.
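
The gap between statistical and practical significance can be illustrated with a small sketch. The numbers below (group means of 50.0 and 50.1, a standard deviation of 10) are hypothetical and chosen only for illustration, and the test is a simple two-sample z-test with a known common standard deviation, which is an assumption rather than a method taken from the works cited above.

```python
import math

def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """Return (z, two-sided p-value) for a difference in means.

    Assumes a known common standard deviation and equal group sizes;
    this is an illustrative simplification.
    """
    se = sd * math.sqrt(2.0 / n_per_group)
    z = (mean_b - mean_a) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail area
    return z, p

def standardized_effect(mean_a, mean_b, sd):
    """Difference in means expressed in standard-deviation units."""
    return (mean_b - mean_a) / sd

# Hypothetical values: the same tiny difference at three sample sizes.
MEAN_A, MEAN_B, SD = 50.0, 50.1, 10.0
for n in (100, 10_000, 1_000_000):
    z, p = two_sample_z(MEAN_A, MEAN_B, SD, n)
    d = standardized_effect(MEAN_A, MEAN_B, SD)
    print(f"n per group = {n:>9,}  effect = {d:.3f} sd  z = {z:.2f}  p = {p:.3g}")
```

The effect size stays the same in every row; only the p-value changes with the sample size. With a million observations per group the difference is "significant" even though it amounts to one hundredth of a standard deviation.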

Always-on

When considering always-on data, it is important to consider whether you are comparing the exact same people over time or comparing some changing group of people; for an example, see Diaz et al. (2016).

Non-reactive

A classic book on non-reactive measures is Webb et al. (1966). The examples in the book pre-date the digital age, but they are still illuminating. For examples of people changing their behavior because of the presence of mass surveillance, see Penney (2016) and Brayne (2014).

Incomplete

For more on record linkage, see Dunn (1946) and Fellegi and Sunter (1969) (historical) and Larsen and Winkler (2014) (modern). Similar approaches have also been developed in computer science under names such as data deduplication, instance identification, name matching, duplicate detection, and duplicate record detection (Elmagarmid, Ipeirotis, and Verykios 2007). There are also privacy-preserving approaches to record linkage that do not require the transmission of personally identifying information (Schnell 2013). Facebook has also developed a procedure to link its records to voting behavior; this was done to evaluate an experiment that I'll describe in Chapter 4 (Bond et al. 2012; Jones et al. 2013).
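
To give a rough feel for what record linkage involves, here is a minimal sketch that links two small lists of records by name similarity. It is not the probabilistic method of Fellegi and Sunter (1969); it is a simple threshold rule, and the field names (`id`, `name`, `birth_year`), the threshold of 0.85, and the sample records are all hypothetical.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())

def name_similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link_records(left, right, threshold=0.85):
    """Link each left record to its most similar right record.

    Only pairs with the same birth year are compared (a simple form of
    "blocking"), and a link is kept only if the name similarity reaches
    the threshold. Both choices are illustrative assumptions.
    """
    links = []
    for l in left:
        best, best_score = None, 0.0
        for r in right:
            if l["birth_year"] != r["birth_year"]:
                continue
            score = name_similarity(l["name"], r["name"])
            if score > best_score:
                best, best_score = r, score
        if best is not None and best_score >= threshold:
            links.append((l["id"], best["id"], round(best_score, 2)))
    return links

# Hypothetical records from two sources that share no common identifier.
survey = [
    {"id": "A1", "name": "Jon Smith", "birth_year": 1980},
    {"id": "A2", "name": "Maria Garcia", "birth_year": 1975},
]
admin = [
    {"id": "B1", "name": "John Smith", "birth_year": 1980},
    {"id": "B2", "name": "Maria Garcia", "birth_year": 1990},
    {"id": "B3", "name": "Alan Turing", "birth_year": 1975},
]
print(link_records(survey, admin))
```

In this toy data, only the "Jon Smith" and "John Smith" records are linked. The exact name match for Maria Garcia is rejected because the birth years differ, which shows how a rule like this can produce both false links and missed links depending on data quality.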

For more on construct validity, see Shadish, Cook, and Campbell (2001), Chapter 3.

Inaccessible

For more on the AOL search log debacle, see Ohm (2010). I offer advice about partnering with companies and governments in Chapter 4, when I describe experiments. A number of authors have expressed concerns about research that relies on inaccessible data; see Huberman (2012) and boyd and Crawford (2012).

One good way for university researchers to acquire data access is to work at a company as an intern or visiting researcher. In addition to enabling data access, this process will also help the researcher learn more about how the data were created, which is important for analysis.

Non-representative

Non-representativeness is a major problem for researchers and governments who wish to make statements about an entire population. It is less of a concern for companies, which are typically focused on their own users. For more on how Statistics Netherlands considers the issue of non-representativeness of business big data, see Buelens et al. (2014).

In Chapter 3, I'll describe sampling and estimation in much greater detail. Even if data are non-representative, under certain conditions they can be weighted to produce good estimates.
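
As a rough illustration of how weighting can correct a non-representative sample, the sketch below applies post-stratification: it computes the mean within each group and then combines the group means using known population shares. The groups, values, and population shares are hypothetical, and the approach rests on the assumption that, within each group, the people in the sample resemble the people in the population.

```python
def poststratified_mean(sample, population_shares):
    """Weight group means by known population shares.

    sample: list of (group, value) pairs.
    population_shares: dict mapping group -> share of the population
    (shares should sum to 1). Every group must appear in the sample.
    """
    totals, counts = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    missing = set(population_shares) - set(counts)
    if missing:
        raise ValueError(f"no sample members in groups: {sorted(missing)}")
    return sum(
        share * totals[group] / counts[group]
        for group, share in population_shares.items()
    )

# Hypothetical sample that over-represents young people (8 of 10 respondents).
sample = (
    [("young", 1)] * 6 + [("young", 0)] * 2
    + [("old", 1)] * 1 + [("old", 0)] * 1
)
# Hypothetical population: 40% young, 60% old.
shares = {"young": 0.4, "old": 0.6}

raw_mean = sum(value for _, value in sample) / len(sample)
weighted_mean = poststratified_mean(sample, shares)
print(f"raw mean = {raw_mean:.2f}, weighted mean = {weighted_mean:.2f}")
```

Here the raw mean is pulled toward the over-represented group, and the weighted estimate moves it back toward what the population shares imply. If the within-group assumption fails, weighting will not remove the bias.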

Drifting

System drift is very hard to see from the outside. However, the MovieLens project (discussed more in Chapter 4) has been run for more than 15 years by an academic research group. As a result, the group has documented and shared information about how the system has evolved over time and how this might affect analysis (Harper and Konstan 2015).

A number of scholars have focused on drift in Twitter: Liu, Kliman-Silver, and Mislove (2014) and Tufekci (2014).

Algorithmically confounded

I first heard the term "algorithmically confounded" used by Jon Kleinberg in a talk. The main idea behind performativity is that some social science theories are "engines not cameras" (Mackenzie 2008). That is, they actually shape the world rather than just capture it.

Dirty

Governmental statistical agencies call data cleaning "statistical data editing". De Waal, Puts, and Daas (2014) describe statistical data editing techniques developed for survey data and examine the extent to which they are applicable to big data sources, and Puts, Daas, and Waal (2015) present some of the same ideas for a more general audience.

For some examples of studies focused on spam on Twitter, see Clark et al. (2016) and Chu et al. (2012). Finally, Subrahmanian et al. (2016) describe the results of the DARPA Twitter Bot Challenge.

Sensitive

Ohm (2015) reviews earlier research on the idea of sensitive information and offers a multi-factor test. The four factors he proposes are: the magnitude of harm; the probability of harm; the presence of a confidential relationship; and whether the risk reflects majoritarian concerns.

  • Counting things (Section 2.4.1)

Farber's study of taxis in New York was based on an earlier study by Camerer et al. (1997) that used three different convenience samples of paper trip sheets, the paper forms that drivers use to record each trip's start time, end time, and fare. This earlier study found that drivers seemed to be target earners: they worked less on days when their wages were higher.

Kossinets and Watts (2009) focused on the origins of homophily in social networks. See Wimmer and Lewis (2010) for a different approach to the same problem that uses data from Facebook.

In subsequent work, King and colleagues have further explored online censorship in China (King, Pan, and Roberts 2014; King, Pan, and Roberts 2016). For a related approach to measuring online censorship in China, see Bamman, O'Connor, and Smith (2012). For more on statistical methods like the one used in King, Pan, and Roberts (2013) to estimate the sentiment of the 11 million posts, see Hopkins and King (2010). For more on supervised learning, see James et al. (2013) (less technical) and Hastie, Tibshirani, and Friedman (2009) (more technical).

  • Forecasting (Section 2.4.2)

Forecasting is a big part of industrial data science (Mayer-Schönberger and Cukier 2013; Provost and Fawcett 2013). One type of forecasting commonly done by social researchers is demographic forecasting; see, for example, Raftery et al. (2012).

Google Flu Trends was not the first project to use search data to nowcast influenza prevalence. In fact, researchers in the United States (Polgreen et al. 2008; Ginsberg et al. 2009) and Sweden (Hulth, Rydevik, and Linde 2009) have found that certain search terms (e.g., "flu") predicted national public health surveillance data before it was released. Subsequently, many other projects have tried to use digital trace data for disease surveillance; see Althouse et al. (2015) for a review.

In addition to using digital trace data to predict health outcomes, there has also been a huge amount of work using Twitter data to predict election outcomes; for reviews, see Gayo-Avello (2011), Gayo-Avello (2013), Jungherr (2015) (Ch. 7), and Huberty (2015).

Using search data to predict influenza prevalence and using Twitter data to predict elections are both examples of using some kind of digital trace to predict some kind of event in the world. An enormous number of studies have this general structure. Table 2.5 includes a few other examples.

Table 2.5: Partial list of studies that use a digital trace to predict an event.

Digital trace | Outcome | Citation
Twitter | Box office revenue of movies in the US | Asur and Huberman (2010)
Search logs | Sales of movies, music, books, and video games in the US | Goel et al. (2010)
Twitter | Dow Jones Industrial Average (US stock market) | Bollen, Mao, and Zeng (2011)

  • Approximating experiments (Section 2.4.3)

The journal PS: Political Science & Politics had a symposium on big data, causal inference, and formal theory, and Clark and Golder (2015) summarize each contribution. The journal Proceedings of the National Academy of Sciences of the United States of America had a symposium on causal inference and big data, and Shiffrin (2016) summarizes each contribution.

In terms of natural experiments, Dunning (2012) provides an excellent book-length treatment. For more on using the Vietnam draft lottery as a natural experiment, see Berinsky and Chatfield (2015). For machine learning approaches that attempt to automatically discover natural experiments inside big data sources, see Jensen et al. (2008) and Sharma, Hofman, and Watts (2015).

In terms of matching, for an optimistic review, see Stuart (2010), and for a pessimistic review, see Sekhon (2009). For more on matching as a kind of pruning, see Ho et al. (2007). For books that provide excellent treatments of matching, see Rosenbaum (2002), Rosenbaum (2009), Morgan and Winship (2014), and Imbens and Rubin (2015).
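
The idea of matching as pruning can be sketched with the simplest variant, exact matching on a single covariate: units that have no counterpart in the other condition are discarded, and the comparison is made only within the groups that remain. This is an illustrative assumption rather than the specific procedure of any work cited above, and the field names and data are hypothetical.

```python
from collections import defaultdict

def exact_match_effect(units, covariates):
    """Estimate a treated-vs-control difference after exact matching.

    units: list of dicts with a boolean "treated", a numeric "outcome",
    and the covariate fields named in `covariates`.
    Returns (effect, pruned), where `effect` averages the within-stratum
    differences weighted by the number of treated units, and `pruned`
    counts the units discarded for lack of a match.
    """
    strata = defaultdict(lambda: {"treated": [], "control": []})
    for unit in units:
        key = tuple(unit[c] for c in covariates)
        arm = "treated" if unit["treated"] else "control"
        strata[key][arm].append(unit["outcome"])

    # Pruning step: keep only strata with both treated and control units.
    matched = [s for s in strata.values() if s["treated"] and s["control"]]
    n_treated = sum(len(s["treated"]) for s in matched)
    if n_treated == 0:
        raise ValueError("no stratum contains both treated and control units")

    effect = 0.0
    for s in matched:
        treated_mean = sum(s["treated"]) / len(s["treated"])
        control_mean = sum(s["control"]) / len(s["control"])
        effect += (treated_mean - control_mean) * len(s["treated"]) / n_treated

    kept = sum(len(s["treated"]) + len(s["control"]) for s in matched)
    return effect, len(units) - kept

# Hypothetical observational data with one covariate, age group.
units = [
    {"age": "young", "treated": True, "outcome": 10},
    {"age": "young", "treated": False, "outcome": 8},
    {"age": "young", "treated": False, "outcome": 6},
    {"age": "old", "treated": True, "outcome": 20},
    {"age": "old", "treated": True, "outcome": 22},
    {"age": "old", "treated": False, "outcome": 18},
    {"age": "middle", "treated": False, "outcome": 15},  # no treated match
]
effect, pruned = exact_match_effect(units, ["age"])
print(f"estimated effect = {effect:.2f}, units pruned = {pruned}")
```

The one "middle" unit is pruned because nobody in that age group was treated. Whether the remaining comparison supports a causal interpretation still depends on whether the matched covariates capture everything that matters, which is the point of disagreement between the optimistic and pessimistic reviews.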