[Question] A question about df.loc[]

Author: sssh (叫我松高魂 ~~)

Board: Python

Title: [Question] A question about df.loc[]

Time: Tue Dec 4 20:39:05 2018

Here is the example: https://imgur.com/vaZab8V

Suppose I want to look up the Cost for Store 1:

`df.loc["Store 1"]["Cost"]`

The instructor said this approach can cause problems and is therefore not recommended. The original text is:

This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting a data, this is not a big deal, though it might be slower than necessary. If you are changing data though, this is an important distinction and can be a source of error.

Concretely, what problem is the instructor describing here? I don't quite follow what goes wrong when indexing this way. Could someone more experienced point me in the right direction?

--
※ Origin: 批踢踢實業坊 (ptt.cc), From: 1.163.71.122
※ Article URL: https://www.ptt.cc/bbs/Python/M.1543927149.A.227.html

Push gmccntzx1: See: https://stackoverflow.com/questions/23296282 12/04 21:25

→ gmccntzx1: And this one: https://bit.ly/2Edy74i 12/04 21:25

→ gmccntzx1: Simply put, `df.loc["Store 1"]["Cost"]` fetches the value through 2 12/04 21:26

→ gmccntzx1: __getitem__ calls; the second operation cannot start until the 12/04 21:28

→ gmccntzx1: first one has finished. 12/04 21:28

→ gmccntzx1: If the data allows you to write `df.loc[:, ('Store 1', 'cost')]`, 12/04 21:30

→ gmccntzx1: pandas can fetch the value in one pass from those arguments, which is relatively faster 12/04 21:31
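
To make the "two `__getitem__` calls" point concrete, here is a minimal sketch. The DataFrame is hypothetical (the screenshot's exact data is not reproduced here), and the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in data, not the frame from the screenshot.
df = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0],
     "Item Purchased": ["Dog Food", "Kitty Litter", "Bird Seed"]},
    index=["Store 1", "Store 2", "Store 3"],
)

# Chained form: two separate __getitem__ calls.
row = df.loc["Store 1"]   # call 1: df.loc.__getitem__("Store 1") returns a Series
chained = row["Cost"]     # call 2: Series.__getitem__("Cost")

# Single call: row and column labels are passed together as one key.
direct = df.loc["Store 1", "Cost"]

# The same two forms with the __getitem__ calls spelled out.
chained_explicit = df.loc.__getitem__("Store 1").__getitem__("Cost")
direct_explicit = df.loc.__getitem__(("Store 1", "Cost"))
```

For a read like this, both forms produce the same value; the difference is that the chained form materializes an intermediate Series first.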

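Thanks to gmccntzx1 for sharing. I just looked into it, and my understanding is roughly this:

Chain indexing tends to go wrong on assignment. When two pairs of brackets are placed together, the first pair does a lookup, but what it returns may be either a view or a copy (it varies with the memory layout). So when the second pair (the assignment) is processed, if the first returned a copy, a SettingWithCopy warning may be raised. That is why chain indexing is so unreliable.

Is my understanding correct?

※ Edited: sssh (1.163.71.122), 12/04/2018 23:49:08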
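
The assignment case described above can be sketched as follows. This assumes a hypothetical mixed-dtype DataFrame, for which the row lookup has to build a new Series; with other data or other pandas versions the intermediate object may behave differently, which is exactly the unpredictability being discussed. The warning filter is there only because the warning pandas emits varies by version.

```python
import warnings
import pandas as pd

# Hypothetical stand-in data, not the frame from the screenshot.
df = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0],
     "Item Purchased": ["Dog Food", "Kitty Litter", "Bird Seed"]},
    index=["Store 1", "Store 2", "Store 3"],
)

with warnings.catch_warnings():
    # Silence whichever chained-assignment warning the installed pandas emits.
    warnings.simplefilter("ignore")
    # Step 1 (__getitem__) builds a temporary Series for the row;
    # step 2 (__setitem__) writes into that temporary, not into df.
    df.loc["Store 1"]["Cost"] = 0.0

cost_after_chained = df.loc["Store 1", "Cost"]  # df was not modified

# A single __setitem__ call on df.loc writes into df itself.
df.loc["Store 1", "Cost"] = 0.0
cost_after_direct = df.loc["Store 1", "Cost"]
```

In this sketch the chained write is silently lost, while the single `df.loc[row, col] = value` form updates the frame.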

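Push gmccntzx1: As for whether the returned value is a view or a copy, you can basically work it out from 12/05 00:48

→ gmccntzx1: the rules in that stackoverflow answer. 12/05 00:49

→ gmccntzx1: If you want to understand it in more detail, I recommend reading the source code directly: 12/05 00:51

→ gmccntzx1: pd.DataFrame.__getitem__ : https://git.io/fpPuH 12/05 00:51

→ gmccntzx1: It handles several different cases; the parts most worth noting are 12/05 00:53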

→ gmccntzx1: self._slice (generic._slice): https://git.io/fpPzx 12/05 00:54

→ gmccntzx1: self._take (generic._take): https://git.io/fpP2E 12/05 00:59

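→ gmccntzx1: Correction: the generic above should be generic.NDFrame 12/05 01:01

→ gmccntzx1: So the problem with chain indexing is that, in general, it is hard to 12/05 01:03

→ gmccntzx1: tell whether the value you got is a view or a copy (unless you know rules like the ones in 12/05 01:04

→ gmccntzx1: that stackoverflow answer); it is not a matter of how the data happens to 12/05 01:06

→ gmccntzx1: be laid out in memory. 12/05 01:07

→ gmccntzx1: And because many different cases affect whether the result is a view or a copy, 12/05 01:11

→ gmccntzx1: the official docs still recommend using chain indexing sparingly. 12/05 01:14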
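
As a rough illustration of how the kind of key influences view versus copy, the sketch below compares a slice key with a boolean-mask key using `np.shares_memory`. The data is hypothetical, and the result for the slice case reflects typical pandas behavior rather than a documented guarantee, so treat it as an observation, not a rule to rely on.

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric data, chosen so the column buffers are easy to compare.
df = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0], "Qty": [1, 2, 3]},
    index=["Store 1", "Store 2", "Store 3"],
)

original = df["Cost"].to_numpy()

sliced = df[0:2]              # slice key
masked = df[df["Cost"] > 3]   # boolean-mask key

# Does each result still point at the original column's buffer?
slice_shares = np.shares_memory(original, sliced["Cost"].to_numpy())
mask_shares = np.shares_memory(original, masked["Cost"].to_numpy())
```

Here the slice typically still shares memory with `df`, while the boolean mask produces freshly allocated data. Which of the two a chained expression hits is not obvious from the syntax alone.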

Push TitanEric: Great post, upvoted 12/05 10:12

→ sssh: Thanks to gmccntzx1 for sharing 12/05 10:32

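Push Angesi: df.loc["Store 1","Cost"] reading by specifying the position directly should be the simplest 12/06 17:05

→ Angesi: Using a chain index really is a bit odd 12/06 17:05

→ Angesi: Or the implicit (positional) index df.iloc[0, 0] works too 12/06 17:19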
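
A small sketch of the label-based and position-based forms side by side. The data is hypothetical; `df.iloc[0, 0]` matches `df.loc["Store 1", "Cost"]` only when "Store 1" is the first row and "Cost" is the first column, so the positions are looked up here instead of being hard-coded.

```python
import pandas as pd

# Hypothetical stand-in data, not the frame from the screenshot.
df = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0],
     "Item Purchased": ["Dog Food", "Kitty Litter", "Bird Seed"]},
    index=["Store 1", "Store 2", "Store 3"],
)

# Label-based lookup in a single call.
by_label = df.loc["Store 1", "Cost"]

# Position-based lookup; the positions are derived from the labels.
row_pos = df.index.get_loc("Store 1")   # integer position, since the index is unique
col_pos = df.columns.get_loc("Cost")
by_position = df.iloc[row_pos, col_pos]
```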