{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "hide_input": true, "slideshow": { "slide_type": "skip" }, "tags": [ "hide-cell" ] }, "outputs": [], "source": [ "import numpy as np\n", "import scipy as sp\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib as mp\n", "import sklearn\n", "from IPython.display import Image\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Low Rank Approximation and the SVD" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Today, we move on. \n", "\n", "However, let's look back and try to put the modeling we've done into a larger context." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Models are simplifications" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One way of thinking about modeling or clustering is that we are building a __simplification__ of the data. \n", "\n", "That is, a model is a description of the data that is simpler than the data itself.\n", "\n", "In particular, instead of thinking of the data as thousands or millions of individual data points, we might think of it in terms of a small number of clusters, a parametric distribution, or some other compact description." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "From this simpler description, we hope to gain __insight.__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There is an interesting question here: __why__ does this process often lead to insight? \n", "\n", "That is, why does it happen so often that a large dataset can be described in terms of a much simpler model?" 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "I don't know." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "tags": [ "hide-input" ] }, "source": [ "
Data Type | Rows | Columns | Elements |
---|---|---|---|
Network Traffic | Sources | Destinations | Number of Bytes |
Social Media | Users | Time Bins | Number of Posts/Tweets/Likes |
Web Browsing | Users | Content Categories | Visit Counts/Bytes Downloaded |
Web Browsing | Users | Time Bins | Visit Counts/Bytes Downloaded |
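
To make the rows/columns/elements view in the table concrete, here is a minimal sketch of turning a log of individual records into such a matrix. The `events` table, its column names, and its values are hypothetical and invented purely for illustration; they do not come from any dataset used in these notes.

```python
import pandas as pd

# Hypothetical event log: one record per observed (user, time bin) event.
events = pd.DataFrame({
    "user": ["alice", "alice", "bob", "carol", "bob"],
    "time_bin": ["09:00", "09:10", "09:00", "09:10", "09:00"],
    "posts": [2, 1, 4, 3, 1],
})

# Rows are users, columns are time bins, elements are post counts.
# Repeated (user, time_bin) pairs are summed; pairs never observed become 0.
A = events.pivot_table(index="user", columns="time_bin",
                       values="posts", aggfunc="sum", fill_value=0)

print(A)
print(A.shape)
```

Each row of `A` then describes one user and each column one time bin, matching the "Social Media" line of the table.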

 | ATLA-ATLA | ATLA-CHIN | ATLA-DNVR | ATLA-HSTN | ATLA-IPLS | ATLA-KSCY | ATLA-LOSA | ATLA-NYCM | ATLA-SNVA | ATLA-STTL | ... | WASH-CHIN | WASH-DNVR | WASH-HSTN | WASH-IPLS | WASH-KSCY | WASH-LOSA | WASH-NYCM | WASH-SNVA | WASH-STTL | WASH-WASH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2003-09-01 00:00:00 | 8466132.0 | 29346537.0 | 15792104.0 | 3646187.0 | 21756443.0 | 10792818.0 | 14220940.0 | 25014340.0 | 13677284.0 | 10591345.0 | ... | 53296727.0 | 18724766.0 | 12238893.0 | 52782009.0 | 12836459.0 | 31460190.0 | 105796930.0 | 13756184.0 | 13582945.0 | 120384980.0 |
2003-09-01 00:10:00 | 20524567.0 | 28726106.0 | 8030109.0 | 4175817.0 | 24497174.0 | 8623734.0 | 15695839.0 | 36788680.0 | 5607086.0 | 10714795.0 | ... | 68413060.0 | 28522606.0 | 11377094.0 | 60006620.0 | 12556471.0 | 32450393.0 | 70665497.0 | 13968786.0 | 16144471.0 | 135679630.0 |
2003-09-01 00:20:00 | 12864863.0 | 27630217.0 | 7417228.0 | 5337471.0 | 23254392.0 | 7882377.0 | 16176022.0 | 31682355.0 | 6354657.0 | 12205515.0 | ... | 67969461.0 | 37073856.0 | 15680615.0 | 61484233.0 | 16318506.0 | 33768245.0 | 71577084.0 | 13938533.0 | 14959708.0 | 126175780.0 |
2003-09-01 00:30:00 | 10856263.0 | 32243146.0 | 7136130.0 | 3695059.0 | 28747761.0 | 9102603.0 | 16200072.0 | 27472465.0 | 9402609.0 | 10934084.0 | ... | 66616097.0 | 43019246.0 | 12726958.0 | 64027333.0 | 16394673.0 | 33440318.0 | 79682647.0 | 16212806.0 | 16425845.0 | 112891500.0 |
2003-09-01 00:40:00 | 10068533.0 | 30164311.0 | 8061482.0 | 2922271.0 | 35642229.0 | 9104036.0 | 12279530.0 | 29171205.0 | 7624924.0 | 11327807.0 | ... | 66797282.0 | 40408580.0 | 11733121.0 | 54541962.0 | 16769259.0 | 33927515.0 | 81480788.0 | 16757707.0 | 15158825.0 | 123140310.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2003-09-07 23:10:00 | 8849096.0 | 33461807.0 | 5866138.0 | 3786793.0 | 19097140.0 | 10561532.0 | 26092040.0 | 28640962.0 | 8343867.0 | 8820650.0 | ... | 65925313.0 | 21751316.0 | 11058944.0 | 58591021.0 | 17137907.0 | 24297674.0 | 83293655.0 | 17329425.0 | 20865535.0 | 123125390.0 |
2003-09-07 23:20:00 | 9776675.0 | 31474607.0 | 5874654.0 | 11277465.0 | 14314837.0 | 9106198.0 | 26412752.0 | 26168288.0 | 8638782.0 | 9193717.0 | ... | 70075490.0 | 29126443.0 | 12667321.0 | 54571764.0 | 15383038.0 | 25238842.0 | 70015955.0 | 16526455.0 | 16881206.0 | 142106800.0 |
2003-09-07 23:30:00 | 9144621.0 | 32117262.0 | 5762691.0 | 7154577.0 | 17771350.0 | 10149256.0 | 29501669.0 | 25998158.0 | 11343171.0 | 9423042.0 | ... | 68544458.0 | 27817836.0 | 15892668.0 | 50326213.0 | 12098328.0 | 27689197.0 | 73553203.0 | 18022288.0 | 18471915.0 | 127918530.0 |
2003-09-07 23:40:00 | 8802106.0 | 29932510.0 | 5279285.0 | 5950898.0 | 20222187.0 | 10636832.0 | 19613671.0 | 26124024.0 | 8732768.0 | 8217873.0 | ... | 65087776.0 | 28836922.0 | 11075541.0 | 52574692.0 | 11933512.0 | 31632344.0 | 81693475.0 | 16677568.0 | 16766967.0 | 138180630.0 |
2003-09-07 23:50:00 | 8716795.6 | 22660870.0 | 6240626.4 | 5657380.6 | 17406086.0 | 8808588.5 | 15962917.0 | 18367639.0 | 7767967.3 | 7470650.1 | ... | 65599891.0 | 25862152.0 | 11673804.0 | 60086953.0 | 11851656.0 | 30979811.0 | 73577193.0 | 19167646.0 | 19402758.0 | 137288810.0 |

1008 rows × 121 columns