What We Wiki

From dftwiki3
Revision as of 17:39, 20 July 2008 by Thiebaut

Methodology

The article What we wiki is based on a statistical analysis of all the pages of the English Wikipedia captured on 04/02/07, and available here.

The whole collection of Wikipedia pages is scanned by a computer program to find the ten most frequent words and the ten most frequent double-words appearing in each page. We refer to these words and double-words as concepts. Each concept is associated with a counter that records the number of wiki pages in which it ranks among the ten most frequent. The 200 concepts with the highest page counts are shown in the table below. For example, United States with a count of 125449 indicates that this concept was one of the ten most frequent double-words in 125,449 pages. Note that United and States as individual words have different counts because they may not rank among the ten most frequent words of a wiki page even when United States ranks among its ten most frequent double-words.
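
The counting scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the article: the function names `top_concepts` and `page_counts` are hypothetical, each page is assumed to be already split into a list of lowercase tokens, and a double-word is assumed to be any two adjacent tokens.

```python
from collections import Counter

TOP_N = 10  # the article keeps the ten most frequent words and double-words per page


def top_concepts(tokens, n=TOP_N):
    """Return the n most frequent words and n most frequent double-words of one page.

    Assumption: a double-word is a pair of adjacent tokens joined by a space.
    """
    words = Counter(tokens)
    pairs = Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    top_words = {word for word, _ in words.most_common(n)}
    top_pairs = {pair for pair, _ in pairs.most_common(n)}
    return top_words | top_pairs


def page_counts(pages, n=TOP_N):
    """For each concept, count the pages in which it ranks among the top n."""
    counts = Counter()
    for tokens in pages:
        # Each page contributes at most 1 to the counter of a given concept.
        counts.update(top_concepts(tokens, n))
    return counts
```

Because words and double-words are ranked separately within each page, a double-word such as "united states" can be counted for a page in which neither "united" nor "states" makes the top ten words, which is why their page counts differ in the table.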

Stop Words

The following stop words are filtered out before the statistics are collected:

a about above across after afterwards again against all almost
alone along already also although always am among amongst amoungst amount
an and another any anyhow anyone anything anyway anywhere are around as
at
back be became because become becomes becoming been before beforehand
behind being below beside besides between beyond bill both bottom but by
call can cannot can’t co computer con could couldn’t cry
de describe detail do done down due during
each eg eight either eleven else elsewhere empty enough etc even ever
every everyone everything everywhere except
few fifteen fify fill find fire first five for former formerly forty found four
from front full further
get give go
had has hasn’t have he hence her here hereafter hereby herein hereupon hers
herself him himself his how however hundred
i ie if in inc indeed interest into is it its itself
keep
last latter latterly least less like ltd
made many may me meanwhile might mill mine more moreover most mostly
move much must my myself
name namely neither never nevertheless next nine no nobody none noone
nor not nothing now nowhere
of off often on once one only onto or other others otherwise our ours
ourselves out over own
part per perhaps please put
rather re
s same see seem seemed seeming seems serious several she should
show side since sincere six sixty so some somehow someone something
sometime sometimes somewhere still such system
take ten than that the their them themselves then thence there
thereafter thereby therefore therein thereupon these they thick
thin third this those though three through throughout thru thus
to together too top toward towards twelve twenty two
un under until up upon us
very via
was we well were what whatever when whence whenever where
whereafter whereas whereby wherein whereupon wherever whether
which while whither who whoever whole whom whose why will with
within without would yet you your yours yourself yourselves
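
As a rough illustration of the filtering step, the sketch below removes stop words before any counting takes place. Several details are assumptions rather than facts from the article: the name `tokenize` is hypothetical, `STOP_WORDS` holds only a small excerpt of the list above, text is lowercased, and tokens are taken to be runs of letters and digits.

```python
import re

# Small excerpt of the stop-word list above, for illustration only.
STOP_WORDS = {"a", "about", "and", "every", "for", "in", "is", "of", "s", "the"}


def tokenize(text):
    """Lowercase the text, split it on non-alphanumeric characters, drop stop words."""
    raw_tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in raw_tokens if token not in STOP_WORDS]


print(tokenize("The population of the United States"))
print(tokenize("For every 100 females"))
```

If double-words are formed after this filtering, words that were separated by stop words become adjacent. That would be consistent with table entries such as 100 Females, but the order of the two steps is a guess.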

References and Bibliography

  • Stacy Schiff, “Know It All: Can Wikipedia Conquer Expertise?”, The New Yorker, July 31, 2006.
  • Martin Hepp, Daniel Bachlechner, and Katharina Siorpaes, “Harvesting Wiki Consensus - Using Wikipedia Entries as Ontology Elements,” citeseer.ist.psu.edu/747700.html, 2006.
  • Don Tapscott and Anthony Williams, Wikinomics: How Mass Collaboration Changes Everything, Portfolio, 2006.
  • Michael Strube and Simone Paolo Ponzetto, “WikiRelate! Computing Semantic Relatedness Using Wikipedia,” in Proceedings of the 21st National Conference on Artificial Intelligence, Boston, Mass., 16-20 July, 2006, pp. 1419-1424.

Word and Double-Word Ranking


The table below shows the ranking of the most popular words and double-words.
To search the actual database of concepts, click here.

Rank Word/Double-Word Number of Pages
1 United States 125449
2 2 109068
3 1 103454
4 Talk 98000
5 New 92490
6 2005 83569
7 Album 73300
8 0 70149
9 3 67870
10 Film 61778
11 City 58813
12 New York 55902
13 Band 51047
14 Image 50318
15 4 49927
16 School 49450
17 County 49241
18 United 48213
19 Style 46666
20 Music 46057
21 University 42003
22 Age 41681
23 Census 41586
24 States 40602
25 5 40533
26 American 39945
27 Football 39527
28 18 37817
29 2007 36347
30 Game 35928
31 2004 35882
32 State 34716
33 Median Income 34643
34 Series 33914
35 Party 33433
36 District 33239
37 Town 31940
38 Population 31791
39 National 31764
40 South 30626
41 User 30547
42 World 30153
43 Company 29519
44 Team 29233
45 Song 29051
46 John 28856
47 Station 28547
48 Area 28523
49 War 27221
50 6 27143
51 Language 26807
52 States Census 26458
53 British 26081
54 Debate 25157
55 Background 24987
56 High School 24678
57 List 24555
58 Book 24293
59 Width 24010
60 North 23614
61 River 23500
62 League 23425
63 High 23382
64 7 23276
65 College 23259
66 York 22873
67 Fair Use 22685
68 World War 21895
69 Left 21819
70 Fair 21744
71 Australia 21593
72 Village 21553
73 Church 21445
74 Club 20324
75 Group 20061
76 2003 20014
77 Born 20011
78 January 20006
79 August 19547
80 Episode 19495
81 English 19464
82 March 19461
83 December 19458
84 Time 19415
85 8 19252
86 King 19222
87 2007 Utc 19187
88 House 19171
89 California 19147
90 People 19094
91 Class 19015
92 United Kingdom 18829
93 General 18627
94 0 18270
95 Family 18132
96 10 18087
97 Century 18068
98 Canada 17889
99 Season 17848
100 2002 17768
101 Island 17585
102 Park 17584
103 Election 17385
104 India 17377
105 Copyright 17321
106 Age 18 17224
107 July 17200
108 65 Years 17101
109 2000 17075
110 Live 17042
111 Line 17034
112 2001 16904
113 February 16900
114 9 16806
115 Air 16334
116 October 16169
117 Government 15977
118 French 15864
119 Known 15729
120 East 15574
121 Television 15552
122 Com 15443
123 West 15420
124 Text 15399
125 Art 15368
126 London 15350
127 Black 15315
128 F 15314
129 September 15169
130 Radio 15164
131 Border 15127
132 Tv 15055
133 France 15020
134 November 14952
135 International 14885
136 Army 14857
137 April 14634
138 D 14438
139 Nbsp 14313
140 Isbn 0 14216
141 Australian 14192
142 Color 14180
143 Articles 14120
144 German 14040
145 African American 14026
146 Font 13918
147 B 13902
148 Right 13900
149 Appropriate 13777
150 Award 13755
151 La 13728
152 History 13666
153 Character 13574
154 England 13540
155 Games 13463
156 Railway 13459
157 St 13426
158 Day 13364
159 Life 13354
160 Battle 13347
161 New Zealand 13225
162 Released 13133
163 12 13090
164 Best 13059
165 Road 13047
166 Guitar 13020
167 President 13019
168 Player 13017
169 Tv Series 13002
170 Played 12960
171 Species 12876
172 Law 12800
173 Work 12772
174 Japan 12694
175 Size 12669
176 1999 12515
177 Ii 12405
178 June 12336
179 11 12321
180 Rock 12236
181 American U 12218
182 Lake 12038
183 Single 11944
184 Cup 11908
185 Wikipedia 11904
186 20 11766
187 San 11721
188 Year 11680
189 Los Angeles 11619
190 Building 11544
191 Community 11495
192 Comments 11422
193 Uk 11341
194 Notable 11338
195 Built 11322
196 Canadian 11301
197 Route 11199
198 Www 11143
199 Township 11072
200 Located 11059
201 Students 11028
202 Public 10978
203 100 Females 9966
204 Prime Minister 9132
205 New Jersey 9113
206 2004 Utc 8431
207 General Election 8317
208 War Ii 8274
209 Jul-06 8161
210 South Wales 7920
211 Aug-06 7865
212 Summer Olympics 7708
213 Jan-07 7686
214 San Francisco 7565
215 Sep-06 7477
216 Jan-06 7390
217 Jun-06 7374
218 Mar-07 7345
219 New South 7307
220 National Football 7249
221 World Cup 7223
222 Football Team 7190
223 Feb-07 7180
224 Dec-05 7172
225 Dec-06 7145
226 Railway Station 7131
227 Oct-06 7069
228 North Carolina 7027
229 Civil War 6954
230 Nov-06 6862
231 South Africa 6594
232 Democratic Party 6160
233 Air Force 6156
234 Hong Kong 6132
235 Mar-06 5917
236 Apr-06 5853
237 Race United 5804
238 Feb-06 5704
239 0 0 5598
240 State University 5594
241 Video Game 5587
242 Supreme Court 5514
243 British Columbia 5030
244 Style Color 4988
245 Hip Hop 4861
246 University Press 4818
247 Major League 4814
248 Science Fiction 4612
249 World Championship 4611
250 Parliament Constituency 4605
251 Football League 4544
252 Roman Catholic 4432
253 Jul-05 4413
254 Soviet Union 4362
255 Labour Party 4213
256 Western Australia 4191
257 Liberal Party 4123
258 Conservative Party 4109
259 Native American 3878
260 San Diego 3829
261 North America 3722
262 Star Trek 3670
263 Northern Ireland 3613
264 School District 3609
265 National Park 3508
266 Grand Prix 3435
267 Republican Party 3418
268 State Route 3398
269 Vice President 3372
270 League Baseball 3288