閱讀文章 - 精華區 Oversea_Job

發信人LINC.bbs@ptt3.cc (Go cubs!),

看板Oversea_Job

標題請教一些面試問題

發信站批踢踢參 (Thu Aug 23 14:00:59 2007)

轉信站ptt!Group.NCTU!grouppost!Group.NCTU!ptt3

第一道題: Given N computers networked together, with each computer storing N integers, finds the "median" of all of the numbers. Assuming a computer can hold O(N) integers. Also assume that there exists a computer on the network without integers, that we can use to interface with the other computers storing the integers. 以下是我想的解法: 1. Sort each of piles of numbers separately in an array - 題目有講memory O(N), 所以假設in memory sort: qsort - O(N * log N) radix sort - O(N) 問題來了請問可以作radix sort嗎? 沒學過這個sort 只是聽說它可以O(N) 2. Use the extra computer as buffer, take the smallest number in each computers Construct a heap, each node has info: <data, source of data> - O(N * logN) time, O(N) space 3. Take out root node from heap, peek the source, then get a new number from that source computer - logN Insert new number into heap - logN Do it (N * N) / 2 -1 times, the root node's value is the answer - O(N^2 * logN) Total: O(N * logN) + O(N * logN) + O(N^2 * logN) = O(N^2 * logN) 不知到還有沒有人有更棒的解法?? 好想知道啊~ 第二道題: How to fast check if a URL is visited by web crawler? 我看到的解法: hash table (有這麼簡單嗎@@) 直覺上來說好像不對勁一個URL假設是20 char, 算20 bytes 假設Internet有5 billion pages -> 5 * 20 billion bytes = 100 billion bytes = 100 GB 100GB(至少) hastable? 有沒搞錯? 我查了一下wikipedia 上面也是說Google有個URL server專門在作這個URL revisit check 請問真的是用Hashing嗎還是Distributed Hashing?? -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 141.158.245.93 > -------------------------------------------------------------------------- < 發信人: michaelz.bbs@ptt3.cc (michaelz), 看板: Oversea_Job 標題: Re: 請教一些面試問題發信站: 批踢踢參 (Thu Aug 23 16:38:29 2007) 轉信站: ptt!Group.NCTU!grouppost!Group.NCTU!ptt3 ※ 引述《LINC (Go cubs!)》之銘言： : 第一道題: : Given N computers networked together, with each computer storing N integers, : finds the "median" of all of the numbers. Assuming a computer can hold O(N) : integers. Also assume that there exists a computer on the network without : integers, that we can use to interface with the other computers storing the : integers. : 第二道題: : How to fast check if a URL is visited by web crawler? : 我看到的解法: hash table (有這麼簡單嗎@@) : 直覺上來說好像不對勁 : 一個URL假設是20 char, 算20 bytes : 假設Internet有5 billion pages -> 5 * 20 billion bytes = 100 billion bytes : = 100 GB : 100GB(至少) hastable? 有沒搞錯? : 我查了一下wikipedia 上面也是說Google有個URL server專門在作這個URL revisit : check : 請問真的是用Hashing嗎還是Distributed Hashing?? 這看起來滿像google的題目第一題應該可以做到average NlogN 或是 linear, 把quick sort變一下就行了第二題的話可以用database 加上index 然後再加一層cache,太大的話做partition 分到不同的database上, 或是把database換成hashtable也行 -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 69.181.226.162

推 KamiSP:總共有N^2個數光是讀一次就要O(N^2)了吧 08/23 19:27

> -------------------------------------------------------------------------- < 作者: Baudelaire (遺憾太常。) 看板: Oversea_Job 標題: Re: 請教一些面試問題時間: Fri Aug 24 08:02:23 2007 ※ 引述《LINC.bbs@ptt3.cc (Go cubs!)》之銘言： : 第二道題: : How to fast check if a URL is visited by web crawler? : 我看到的解法: hash table (有這麼簡單嗎@@) : 直覺上來說好像不對勁 : 一個URL假設是20 char, 算20 bytes : 假設Internet有5 billion pages -> 5 * 20 billion bytes = 100 billion bytes : = 100 GB : 100GB(至少) hastable? 有沒搞錯? : 我查了一下wikipedia 上面也是說Google有個URL server專門在作這個URL revisit : check : 請問真的是用Hashing嗎還是Distributed Hashing?? 我會設計的方法： URL的有效字元 A-Z a-z 加上一些符號，大概總共算是60個symbol， n0*60^0+n1*60^1+n2*60^2+n3*60^3+n4*60^4+...+ni*60^i 不過這個數字大的一塌糊塗，所以不是什麼好方法；如果不想要collision的話，資料量可能就是那麼大。至於partition的話，用開頭字母就可以作uniform dist.了。 -- Je t'aime,o capitale infame. Tu m'as donne ta boue et j'en ai fait de l'or. Charles Baudelaire 1821-67 -- ※ 發信站: 批踢踢實業坊(ptt.cc) ◆ From: 216.145.49.21 ※ 編輯: Baudelaire 來自: 216.145.49.21 (08/24 08:03) > -------------------------------------------------------------------------- < 發信人: LINC.bbs@ptt3.cc (Go cubs!), 看板: Oversea_Job 標題: Re: 請教一些面試問題發信站: 批踢踢參 (Fri Aug 24 09:16:03 2007) 轉信站: ptt!Group.NCTU!grouppost!Group.NCTU!ptt3 ※ 引述《michaelz (michaelz)》之銘言： : ※ 引述《LINC (Go cubs!)》之銘言： : : 第一道題: : : Given N computers networked together, with each computer storing N integers, : : finds the "median" of all of the numbers. Assuming a computer can hold O(N) : : integers. Also assume that there exists a computer on the network without : : integers, that we can use to interface with the other computers storing the : : integers. : : 第二道題: : : How to fast check if a URL is visited by web crawler? : : 我看到的解法: hash table (有這麼簡單嗎@@) : : 直覺上來說好像不對勁 : : 一個URL假設是20 char, 算20 bytes : : 假設Internet有5 billion pages -> 5 * 20 billion bytes = 100 billion bytes : : = 100 GB : : 100GB(至少) hastable? 有沒搞錯? : : 我查了一下wikipedia 上面也是說Google有個URL server專門在作這個URL revisit : : check : : 請問真的是用Hashing嗎還是Distributed Hashing?? : 這看起來滿像google的題目 : 第一題應該可以做到average NlogN 或是 linear, 把quick sort變一下就行了 : 第二題的話可以用database 加上index 然後再加一層cache,太大的話做partition : 分到不同的database上, 或是把database換成hashtable也行今天看了一下以前上課上過的paper(Mercator web crawler, written in Java) 它有提到他們是怎麼作的簡單的敘述: 假設整個Internet URLs無法放進整個HashSet(註) 所以就把它放在disk上另外使用LRU cache來作in memory cache 程序如下新的URL進來: -> 用LRU cache看有無cache hit -> 有就丟掉 -> 無的話找disk上的資料 -> 在disk上丟掉 -> 不在disk上, unvisited URL, pop first URL in LRU cache and write it to disk. Add new URL to LRU cache. 註: 1998年為10 million web pages, 2006年為4.2 billion -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 141.158.245.93 > -------------------------------------------------------------------------- < 發信人: michaelz.bbs@ptt3.cc (michaelz), 看板: Oversea_Job 標題: Re: 請教一些面試問題發信站: 批踢踢參 (Fri Aug 24 17:42:08 2007) 轉信站: ptt!Group.NCTU!grouppost!Group.NCTU!ptt3 ※ 引述《Baudelaire.bbs@ptt.cc (遺憾太常。)》之銘言： : ※ 引述《LINC.bbs@ptt3.cc (Go cubs!)》之銘言： : : 第二道題: : : How to fast check if a URL is visited by web crawler? : : 我看到的解法: hash table (有這麼簡單嗎@@) : : 直覺上來說好像不對勁 : : 一個URL假設是20 char, 算20 bytes : : 假設Internet有5 billion pages -> 5 * 20 billion bytes = 100 billion bytes : : = 100 GB : : 100GB(至少) hastable? 有沒搞錯? : : 我查了一下wikipedia 上面也是說Google有個URL server專門在作這個URL revisit : : check : : 請問真的是用Hashing嗎還是Distributed Hashing?? : 我會設計的方法： : URL的有效字元 A-Z a-z 加上一些符號，大概總共算是60個symbol， : n0*60^0+n1*60^1+n2*60^2+n3*60^3+n4*60^4+...+ni*60^i : 不過這個數字大的一塌糊塗，所以不是什麼好方法； : 如果不想要collision的話，資料量可能就是那麼大。 : 至於partition的話，用開頭字母就可以作uniform dist.了。用開頭字母的話大概會看到一堆http, www之類的東西..然後所有的東西都要放在同一個 partition, 用整個url算hash code可能會好一點 -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 64.236.139.123 > -------------------------------------------------------------------------- < 作者: HYL (@Seattle) 看板: Oversea_Job 標題: Re: 請教一些面試問題時間: Fri Aug 24 19:33:13 2007 ※ 引述《michaelz.bbs@ptt3.cc (michaelz)》之銘言： : ※ 引述《Baudelaire.bbs@ptt.cc (遺憾太常。)》之銘言： : : 我會設計的方法： : : URL的有效字元 A-Z a-z 加上一些符號，大概總共算是60個symbol， : : n0*60^0+n1*60^1+n2*60^2+n3*60^3+n4*60^4+...+ni*60^i : : 不過這個數字大的一塌糊塗，所以不是什麼好方法； : : 如果不想要collision的話，資料量可能就是那麼大。 : : 至於partition的話，用開頭字母就可以作uniform dist.了。 : 用開頭字母的話大概會看到一堆http, www之類的東西..然後所有的東西都要放在同一個 : partition, 用整個url算hash code可能會好一點理論上要產生兩個有一樣MD5 Hash Code的"有意義"文字是不可能的，，可以把碰撞的狀況省下來，順便也不用存原始的 url進hashtable中如果是我，會用 MD5-128來把 url編碼。一個 url指占 16 byte 的空間，四十億個網頁大概是 59GByte ，還塞得下硬碟空間。考慮到讀取的效益，所以會把資料割成 1MB大小的檔案( 一個檔案存2^16 個 hashcode )，總共有59,000 個；再考量到 file system對同一目錄底下若有大量檔案，開檔的效率會下降，所以分目錄除存 -- ※ 發信站: 批踢踢實業坊(ptt.cc) ◆ From: 67.183.24.112 ※ 編輯: HYL 來自: 67.183.24.112 (08/24 19:34)

推 willieliao:如果這個問題真的是用來作search engine的話還要存上次 08/24 23:40

→ willieliao:抓的時間，不然yahoo.com的內容會停在199x年xd 08/24 23:42

> -------------------------------------------------------------------------- < 發信人: Baudelaire.bbs@ptt3.cc (醉鄉路穩宜頻到), 看板: Oversea_Job 標題: Re: 請教一些面試問題發信站: 批踢踢參 (Sat Aug 25 01:29:12 2007) 轉信站: ptt!Group.NCTU!grouppost!Group.NCTU!ptt3 ※ 引述《michaelz (michaelz)》之銘言： : ※ 引述《Baudelaire.bbs@ptt.cc (遺憾太常。)》之銘言： : : 我會設計的方法： : : URL的有效字元 A-Z a-z 加上一些符號，大概總共算是60個symbol， : : n0*60^0+n1*60^1+n2*60^2+n3*60^3+n4*60^4+...+ni*60^i : : 不過這個數字大的一塌糊塗，所以不是什麼好方法； : : 如果不想要collision的話，資料量可能就是那麼大。 : : 至於partition的話，用開頭字母就可以作uniform dist.了。 : 用開頭字母的話大概會看到一堆http, www之類的東西..然後所有的東西都要放在同一個 : partition, 用整個url算hash code可能會好一點這部份當然要把http 或者www fliter掉，用後面的domain name來作處理；這樣的balancing應該就夠了。如果要考慮cryptographic hash function，會把事情變得太複雜，加上我自己也寫不出來SHA....。 -- http://feedblendr.com/blends/14113.html 回來，我們重建家園穿過兩個夜晚的白色走廊或永遠走開，像慧星那樣在回聲四起的山谷裡燦爛而冷若冰霜你獨自歌唱擯棄黑暗，又沈溺於黑暗中〈北島‧慧星〉 -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 216.145.49.21 > -------------------------------------------------------------------------- < 發信人: LINC.bbs@ptt3.cc (Go cubs!), 看板: Oversea_Job 標題: Re: 請教一些面試問題發信站: 批踢踢參 (Sat Aug 25 12:35:44 2007) 轉信站: ptt!Group.NCTU!grouppost!Group.NCTU!ptt3 ※ 引述《michaelz (michaelz)》之銘言： : ※ 引述《Baudelaire.bbs@ptt.cc (遺憾太常。)》之銘言： : : 我會設計的方法： : : URL的有效字元 A-Z a-z 加上一些符號，大概總共算是60個symbol， : : n0*60^0+n1*60^1+n2*60^2+n3*60^3+n4*60^4+...+ni*60^i : : 不過這個數字大的一塌糊塗，所以不是什麼好方法； : : 如果不想要collision的話，資料量可能就是那麼大。 : : 至於partition的話，用開頭字母就可以作uniform dist.了。 : 用開頭字母的話大概會看到一堆http, www之類的東西..然後所有的東西都要放在同一個 : partition, 用整個url算hash code可能會好一點不知道有沒有人想過用tree 以www.upenn.edu, www.cis.upenn.edu, www.ese.upenn.edu來說看起來應該會像這樣: edu - upenn - www - cis - www - ese - www 考慮到DNS的distribution, root node 如com, edu, org應該可省下不少空間 -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 141.158.245.93 > -------------------------------------------------------------------------- < 作者: sunflier (叮噹) 看板: Oversea_Job 標題: Re: 請教一些面試問題時間: Sat Aug 25 15:16:21 2007 ※ 引述《LINC.bbs@ptt3.cc (Go cubs!)》之銘言： : ※ 引述《michaelz (michaelz)》之銘言： : : 用開頭字母的話大概會看到一堆http, www之類的東西..然後所有的東西都要放在同一個 : : partition, 用整個url算hash code可能會好一點 : 不知道有沒有人想過用tree : 以www.upenn.edu, www.cis.upenn.edu, www.ese.upenn.edu來說 : 看起來應該會像這樣: : edu - upenn - www : - cis - www : - ese - www : 考慮到DNS的distribution, root node 如com, edu, org應該可省下不少空間除了上述方式，再加上 Bloom filter 應該就可以省更多空間、時間了。 http://en.wikipedia.org/wiki/Bloom_filter <Bloom filter> A space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible, but false negatives are not. -- http://blog.sunflier.com 科技新知、爆笑圖文、理財心得 -- ※ 發信站: 批踢踢實業坊(ptt.cc) ◆ From: 125.231.192.219 > -------------------------------------------------------------------------- < 發信人: michaelz.bbs@ptt3.cc (michaelz), 看板: Oversea_Job 標題: Re: 請教一些面試問題發信站: 批踢踢參 (Sat Aug 25 18:07:55 2007) 轉信站: ptt!Group.NCTU!grouppost!Group.NCTU!ptt3 ※ 引述《LINC (Go cubs!)》之銘言： : ※ 引述《michaelz (michaelz)》之銘言： : : 用開頭字母的話大概會看到一堆http, www之類的東西..然後所有的東西都要放在同一個 : : partition, 用整個url算hash code可能會好一點 : 不知道有沒有人想過用tree : 以www.upenn.edu, www.cis.upenn.edu, www.ese.upenn.edu來說 : 看起來應該會像這樣: : edu - upenn - www : - cis - www : - ese - www : 考慮到DNS的distribution, root node 如com, edu, org應該可省下不少空間這個叫作trie, 是滿有趣的data structure, 但是問題在於這並不能減少空間如果把整個internet做成trie的話, 每一個網頁還是要佔去一個node,這樣子整體 node的的數量不會減少, 還可能會增加. 比如說沒有一個網址是叫作edu的,但是你還是需要用一個node去存edu... 如果每一個upenn.edu的前面都是www的話...那些是leaf的www可以去掉..或許能減少一點空間..但是整體上效用應該不會很大 -- ※ 發信站: 批踢踢參(ptt3.cc) ◆ From: 64.236.139.77

推 Baudelaire:trie的作法很有意思，跟tree發音一樣 :P 08/26 00:46