2.4.1.3 Censorship of social media by the Chinese government

Researchers scraped Chinese social media sites to study censorship. They dealt with incompleteness by using latent-trait inference.

In addition to the big data used in the two previous examples, researchers can also collect their own observational data, as was wonderfully illustrated by Gary King, Jennifer Pan, and Molly Roberts' (2013) research on censorship by the Chinese government.

Social media posts in China are censored by an enormous state apparatus that is thought to include tens of thousands of people. Researchers and citizens, however, have little sense of how these censors decide which content should be deleted from social media. Scholars of China actually have conflicting expectations about which kinds of posts are most likely to be deleted. Some think that censors focus on posts that are critical of the state, while others think they focus on posts that encourage collective behavior, such as protests. Figuring out which of these expectations is correct has implications for how researchers understand China and other authoritarian governments that engage in censorship. Therefore, King and colleagues wanted to compare posts that were published and subsequently deleted with posts that were published and never deleted.

Collecting these posts involved the amazing engineering feat of crawling more than 1,000 Chinese social media websites (each with a different page layout), finding relevant posts, and then revisiting these posts to see which were subsequently deleted. In addition to the normal engineering problems associated with large-scale web crawling, this project had the added challenge that it needed to be extremely fast, because many censored posts are taken down in less than 24 hours. In other words, a slow crawler would miss many of the posts that were censored. Further, the crawlers had to do all of this data collection while evading detection, lest the social media websites block access or otherwise change their policies in response to the study.

Once this massive engineering task was completed, King and colleagues had obtained about 11 million posts on 85 different topics that were pre-specified based on their expected level of sensitivity. For example, a topic of high sensitivity is Ai Weiwei, the dissident artist; a topic of middle sensitivity is the appreciation and devaluation of the Chinese currency; and a topic of low sensitivity is the World Cup. Of these 11 million posts, about 2 million had been censored, but posts on highly sensitive topics were censored only slightly more often than posts on middle- and low-sensitivity topics. In other words, Chinese censors are about as likely to censor a post that mentions Ai Weiwei as a post that mentions the World Cup. These findings did not match the simplistic idea that the government censors all posts on sensitive topics.
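The comparison above is, at its core, a counting exercise. As a minimal sketch, assuming each post is stored as a (topic, sensitivity level, deleted or not) record, the censorship rate per sensitivity level could be computed as follows. The records, the field layout, and the resulting rates are all invented for illustration; they are not the study's data or code.

```python
from collections import defaultdict

# Hypothetical records: (topic, expected sensitivity, was the post later deleted?)
# These values are made up for illustration only.
posts = [
    ("Ai Weiwei", "high", True),
    ("Ai Weiwei", "high", False),
    ("Ai Weiwei", "high", False),
    ("Ai Weiwei", "high", False),
    ("currency", "middle", True),
    ("currency", "middle", False),
    ("currency", "middle", False),
    ("currency", "middle", False),
    ("currency", "middle", False),
    ("World Cup", "low", True),
    ("World Cup", "low", False),
    ("World Cup", "low", False),
    ("World Cup", "low", False),
    ("World Cup", "low", False),
]

totals = defaultdict(int)   # number of posts per sensitivity level
deleted = defaultdict(int)  # number of deleted posts per sensitivity level
for topic, level, was_deleted in posts:
    totals[level] += 1
    deleted[level] += was_deleted  # True counts as 1, False as 0

# Censorship rate = deleted posts / all posts, within each level.
rates = {level: deleted[level] / totals[level] for level in totals}

for level in ("high", "middle", "low"):
    print(f"{level:>6}: {rates[level]:.2f}")
```

In this toy data the high-sensitivity rate is only slightly above the other two, which mirrors the qualitative pattern described above without reproducing any actual figures.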

This simple calculation of the censorship rate by topic could be misleading, however. For example, the government might censor posts that are supportive of Ai Weiwei but leave posts that are critical of him. In order to distinguish between posts more carefully, the researchers needed to measure the sentiment of each post. Thus, one way to think about it is that the sentiment of each post is an important latent feature of that post. Unfortunately, despite much work, fully automated methods of sentiment detection using pre-existing dictionaries are still not very good in many situations (think back to the problems creating an emotional timeline of September 11, 2001 in Section 2.3.2.6). Therefore, King and colleagues needed a way to label their 11 million social media posts as 1) critical of the state, 2) supportive of the state, or 3) irrelevant or factual reports about the events. This sounds like a massive job, but they solved it using a powerful trick, one that is common in data science but currently relatively rare in social science.

First, in a step typically called pre-processing, the researchers converted the social media posts into a document-term matrix, in which there was one row for each document and one column for each term, recording whether the post contained that specific word (e.g., protest, traffic, etc.). Next, a group of research assistants hand-labeled the sentiment of a sample of posts. Then, King and colleagues used this hand-labeled data to estimate a machine learning model that could infer the sentiment of a post based on its characteristics. Finally, they used this machine learning model to estimate the sentiment of all 11 million posts. Thus, rather than manually reading and labeling 11 million posts (which would be logistically impossible), they manually labeled a small number of posts and then used what data scientists would call supervised learning to estimate the categories of all the posts. After completing this analysis, King and colleagues were able to conclude that, somewhat surprisingly, the probability of a post being deleted was unrelated to whether it was critical of the state or supportive of the state.
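The four steps above can be sketched in a few lines of code. The text does not say which model King and colleagues estimated, so this sketch swaps in a Bernoulli naive Bayes classifier purely as a stand-in for "a supervised learning model". The example posts, the label names (`critical`, `supportive`, `neutral`), and the helper functions `tokenize`, `to_row`, `fit`, and `predict` are all hypothetical.

```python
import math

# Step 2 (done by hand in the study): a small hand-labeled sample.
# These texts and labels are invented for illustration.
labeled = [
    ("officials are corrupt and the policy failed", "critical"),
    ("the corrupt policy hurts ordinary people", "critical"),
    ("the new policy is wise and officials work hard", "supportive"),
    ("proud of the wise leaders and their hard work", "supportive"),
    ("the match starts at eight tonight", "neutral"),
    ("traffic was slow near the stadium tonight", "neutral"),
]
unlabeled = [
    "corrupt officials failed ordinary people",
    "leaders are wise and work hard",
    "the match tonight had slow traffic",
]

def tokenize(text):
    return text.lower().split()

# Step 1: pre-processing into a binary document-term matrix.
# One row per document, one column per term (1 = term appears in the post).
vocab = sorted({w for text, _ in labeled for w in tokenize(text)})

def to_row(text):
    words = set(tokenize(text))
    return [1 if term in words else 0 for term in vocab]

X_train = [to_row(text) for text, _ in labeled]
y_train = [label for _, label in labeled]

# Step 3: fit a Bernoulli naive Bayes model with add-one smoothing.
def fit(X, y):
    model = {}
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        n = len(rows)
        # Smoothed estimate of P(term present | label) for every column.
        p_term = [(sum(col) + 1) / (n + 2) for col in zip(*rows)]
        model[label] = (math.log(n / len(y)), p_term)
    return model

def predict(model, row):
    def score(label):
        log_prior, p_term = model[label]
        return log_prior + sum(
            math.log(p if x else 1 - p) for x, p in zip(row, p_term)
        )
    return max(model, key=score)

model = fit(X_train, y_train)

# Step 4: estimate the label of every post that was not hand-labeled.
predictions = [predict(model, to_row(text)) for text in unlabeled]
for text, label in zip(unlabeled, predictions):
    print(f"{label:>10}  {text}")
```

The design point is the division of labor: people label a few posts, and the fitted model extends those labels to the rest. With a realistic corpus one would also hold out some hand-labeled posts to check how often the model agrees with the human coders.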

Figure 2.3: Simplified schematic of the procedure used by King, Pan, and Roberts (2013) to estimate the sentiment of 11 million Chinese social media posts. First, in a step typically called pre-processing, the researchers converted the social media posts into a document-term matrix (see Grimmer and Stewart (2013) for more information). Second, the researchers hand-coded the sentiment of a small sample of posts. Third, the researchers trained a supervised learning model to classify the sentiment of posts. Fourth, the researchers used the supervised learning model to estimate the sentiment of all the posts. See King, Pan, and Roberts (2013), Appendix B, for a more detailed description.

In the end, King and colleagues discovered that only three types of posts were regularly censored: pornography, criticism of censors, and posts that had collective action potential (i.e., the possibility of leading to large-scale protests). By observing a huge number of posts that were deleted and posts that were not deleted, King and colleagues were able to learn how the censors work just by watching and counting. In subsequent research, they intervened directly in the Chinese social media ecosystem by creating posts with systematically different content and measuring which ones were censored (King, Pan, and Roberts 2014). We will learn more about experimental approaches in Chapter 4. Further, foreshadowing a theme that will occur throughout this book, these latent-attribute inference problems, which can sometimes be solved with supervised learning, turn out to be very common in social research in the digital age. You will see pictures very similar to Figure 2.3 in Chapters 3 (Asking questions) and 5 (Creating mass collaboration); it is one of the few ideas that appears in multiple chapters.

All three of these examples (the working behavior of taxi drivers in New York, friendship formation by students, and the social media censorship behavior of the Chinese government) show that relatively simple counting of observational data can enable researchers to test theoretical predictions. In some cases, big data enables you to do this counting relatively directly (as in the case of New York taxis). In other cases, researchers will need to collect their own observational data (as in the case of Chinese censorship); deal with incompleteness by merging data together (as in the case of network evolution); or perform some form of latent-trait inference (as in the case of Chinese censorship). As I hope these examples show, for researchers who are able to ask interesting questions, big data holds great promise.