TV Archives cracked Open "AI for IA"

<!doctype html><script src="eveal.js"></script>

Artificial Intelligence for Internet Archive

MozFest, London Oct 2017

by [traceypooh](https://twitter.com/tracey_pooh)

https://traceypooh.github.io/mozfest17 (press _?_ for key shortcuts)

git clone https://github.com/traceypooh/mozfest17; open mozfest17/index.html

Gist

decentralized research and AI
built on top of
a library of stable, untampered worldwide TV recordings

Intro to archive.org

Wayback Machine
- past copies of 300B+ pages
- 15M books, lendable
- ~4M videos, ~4M audio & live concerts
- 3M images
- 200K software items & emulation (in JS!)

Library!

Absolute browser Privacy
- no personal data or IP addresses extracted
Validation & nontampering
- keep original versions with 2+ checksums and logs

<file name="commute.mp4" source="derivative">
<title>commute</title>
<format>h.264</format>
<original>commute.avi</original>
<mtime>1325973601</mtime>
<size>11919082</size>
<md5>ff17ed66e7db5693dd208dd6ac488ff8</md5>
<crc32>ad1df03a</crc32>
<sha1>e9f9de8379cd25653d487ab30d198fc61a050091</sha1>
<length>115.61</length>
<height>480</height>
<width>640</width>
</file>
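
A minimal sketch of how a reader could check a downloaded file against the three checksums in a record like the one above. The function names `file_checksums` and `verify` are made up for illustration; this is not the archive's own tooling.

```python
import hashlib
import zlib


def file_checksums(data: bytes) -> dict:
    """Compute the three checksum types that appear in the metadata record."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "sha1": hashlib.sha1(data).hexdigest(),
        # zlib.crc32 returns an int; format it as 8 lowercase hex digits
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
    }


def verify(data: bytes, expected: dict) -> bool:
    """True only if every checksum listed in `expected` matches the data."""
    actual = file_checksums(data)
    return all(actual[name] == value.lower() for name, value in expected.items())


# Usage, assuming commute.mp4 has been downloaded locally:
# expected = {
#     "md5": "ff17ed66e7db5693dd208dd6ac488ff8",
#     "crc32": "ad1df03a",
#     "sha1": "e9f9de8379cd25653d487ab30d198fc61a050091",
# }
# with open("commute.mp4", "rb") as f:
#     print(verify(f.read(), expected))
```

Keeping several independent checksums means a change to the file would have to collide on all of them at once to go unnoticed.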

External Blockchain of Proofs

of file mod times / checksums

OpenTimestamps
uses SHA-1 and Merkle trees
by Peter Todd - blog
brand new!
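
To illustrate the Merkle tree idea, here is a rough sketch that folds a list of per-file SHA-1 digests into one root hash, so a single timestamped value commits to every file. The choice of SHA-256 for the inner nodes and the "duplicate the last node on odd levels" rule are assumptions made for this example; OpenTimestamps defines its own proof format, which this does not reproduce.

```python
import hashlib


def merkle_root(leaf_hashes):
    """Fold a list of digests (bytes) pairwise into a single root digest."""
    if not leaf_hashes:
        raise ValueError("need at least one leaf")
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed convention: pair an odd node out with itself
            level.append(level[-1])
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


# Leaves are the raw bytes of existing per-file SHA-1 checksums
leaves = [
    bytes.fromhex("e9f9de8379cd25653d487ab30d198fc61a050091"),
    hashlib.sha1(b"another file").digest(),
]
print(merkle_root(leaves).hex())
```

Changing any one leaf changes the root, which is why publishing only the root is enough to prove later that none of the files changed.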

archive.org/tv

recording 50 - 100 channels
- 24 x 7
- around the world
- since 2000
2 million+ news shows
search captions/metadata
new Trump Administration and Congress subsets
citable reference clips
Popcorn editing/mashup clips
for AI experiments

Artificial Intelligence

text:
- chyron ("lower third") scanning OCR (Third Eye)
- caption alignment
- OCR captions from DVB-S
  - BBC News
- speech to text (VoiceBase)
  - Al Jazeera English
  - Deutsche Welle English
image:
- public officials facial detection
  (Faceomatic <-- Matroid <-- FaceNet)

Artificial Intelligence

audio:
- fingerprinting
  - audfprint - free/open, like Shazam
  - political Ad tracking
  - Duplitron 5000

Public Feeds

twitter bots & TSV
- Third Eye
slack bot
- Faceomatic
continuous captions feed from CSPAN
- https://openedcaptions.com
- https://pietropassarelli.gitbooks.io/textav/projects/opened-captions-service.html

OCR 'lower third'
- chyrons: overlaid text on broadcasts
- not captions or descriptive text
- editorial / summarizing in nature
4 TV channels, 24x7, ~1 min from realtime
- CNN
- MSNBC
- Fox News
- BBC News

  AFTER WH MEETING, SCHUMER DISHES
  WHEN HE THOUGHT NIC WAS OFF

bots

twitter bots
- https://twitter.com/tvThirdEye
- https://twitter.com/tvThirdEyeB
- https://twitter.com/tvThirdEyeF
- https://twitter.com/tvThirdEyeM
- https://twitter.com/tvThirdEye/lists/all

API

Tab Separated Values
https://archive.org/services/third-eye.php
- nice for command-line
- import to Google and Excel spreadsheets
- filtered
- raw (~25MB / day)
  - more errors
  - 3rd-party filtering possible
- TSV files uploaded to https://archive.org/details/third-eye
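
A small sketch of reading the TSV feed from the command line or a script. The column order in `COLUMNS`, the sample row, and the placeholder identifier are all assumptions for illustration; check the real feed before relying on them.

```python
import csv
import io

# Assumed column order; the real feed may differ
COLUMNS = ["timestamp_utc", "channel", "duration_sec", "details_url", "text"]

# Made-up sample row (the chyron text is the example from the slides)
SAMPLE = (
    "2017-10-01 12:00:00\tCNN\t10\t"
    "https://archive.org/details/HYPOTHETICAL_ID\t"
    "AFTER WH MEETING, SCHUMER DISHES WHEN HE THOUGHT NIC WAS OFF\n"
)


def parse_third_eye_tsv(tsv_text):
    """Turn tab-separated lines into a list of dicts keyed by COLUMNS."""
    reader = csv.reader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return [dict(zip(COLUMNS, row)) for row in reader if row]


for row in parse_third_eye_tsv(SAMPLE):
    print(row["channel"], row["text"])
```

The same function works on text fetched from https://archive.org/services/third-eye.php with any HTTP client.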

Chyron filtering

tesseract OCR
- free; errors
simhash
- groups 'nearly the same'
  - character flips
  - word off in time
look for vowels
pick 'most seen' group every minute
- and tweet
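
The filtering steps above can be sketched roughly as follows. This is an illustration, not the Third Eye code: hashing character trigrams with MD5, the vowel ratio, and the distance threshold are all assumed values chosen for the example.

```python
import hashlib
from collections import Counter

VOWELS = set("AEIOUaeiou")


def simhash(text, bits=64):
    """Simhash over character trigrams: near-identical strings get nearby hashes."""
    text = " ".join(text.upper().split())
    grams = [text[i:i + 3] for i in range(max(1, len(text) - 2))]
    weights = [0] * bits
    for gram in grams:
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a, b):
    """Number of differing bits between two integers."""
    return bin(a ^ b).count("1")


def looks_like_text(line, min_ratio=0.2):
    """'Look for vowels': reject OCR noise that has too few vowels."""
    letters = [c for c in line if c.isalpha()]
    return bool(letters) and sum(c in VOWELS for c in letters) / len(letters) >= min_ratio


def most_seen(lines, max_distance=12):
    """Group near-duplicate lines and return the top text of the largest group."""
    groups = []  # each entry: [representative hash, Counter of exact texts]
    for line in lines:
        if not looks_like_text(line):
            continue
        h = simhash(line)
        for group in groups:
            if hamming(h, group[0]) <= max_distance:
                group[1][line] += 1
                break
        else:
            groups.append([h, Counter([line])])
    if not groups:
        return None
    best = max(groups, key=lambda g: sum(g[1].values()))
    return best[1].most_common(1)[0][0]
```

Calling `most_seen` on one minute of OCR output would give the candidate line to tweet.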

TV AI Examples

Vox found that Fox News paid little attention to Puerto Rico
- https://vox.com/2017/10/2/16401614/fox-news-puerto-rico-charts
audio fingerprints
- presented keynote paper on
  CSPAN floor speeches and vocal pitch
  Bryce Dietrich, UIowa
- discovered 375K political Ads
- find sound bites of speeches

clips

little JSON annotations
associate metadata to program start/end time range
auto expands each clip to a "synthetic" document
- to Elasticsearch
JSONPatch for changes
track play counts, some referers
allows for decentralized annotations to other IA / research
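
As a rough illustration of "JSONPatch for changes", the sketch below applies a small subset of JSON Patch (RFC 6902: `add`, `replace`, `remove`) to a clip annotation. The helper `apply_patch` is hypothetical and skips parts of the spec such as `move`, `copy`, `test`, and patching the document root; a real service would more likely use an existing library.

```python
import copy


def apply_patch(doc, patch):
    """Apply add / replace / remove operations and return a new document."""
    result = copy.deepcopy(doc)
    for op in patch:
        # JSON Pointer: "/a/b" -> ["a", "b"], unescaping ~1 and ~0
        parts = [
            p.replace("~1", "/").replace("~0", "~")
            for p in op["path"].split("/")[1:]
        ]
        parent = result
        for key in parts[:-1]:
            parent = parent[int(key)] if isinstance(parent, list) else parent[key]
        last = parts[-1]
        if isinstance(parent, list):
            if op["op"] == "add":
                index = len(parent) if last == "-" else int(last)
                parent.insert(index, op["value"])
            elif op["op"] == "replace":
                parent[int(last)] = op["value"]
            elif op["op"] == "remove":
                del parent[int(last)]
            else:
                raise ValueError("unsupported op: " + op["op"])
        else:
            if op["op"] in ("add", "replace"):
                parent[last] = op["value"]
            elif op["op"] == "remove":
                del parent[last]
            else:
                raise ValueError("unsupported op: " + op["op"])
    return result


clip = {"268.1|269.1": {"subject": ["Crime"]}}
patch = [{"op": "add", "path": "/268.1|269.1/subject/-", "value": "Voting"}]
print(apply_patch(clip, patch))
```

Sending only the patch, rather than the whole annotation, keeps changes small and easy to log.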

clip

{
    "268.1|269.1": {
        "subject": [
            "Criminal Activity",
            "Crime"
        ],
        "factcheck": [
            "http://www.factcheck.org/2016/07/factchecking-trumps-big-speech/"
        ]
    },
    "266.7|267.2": {
        "ad_id": "PolAd_DonaldTrump_d9dsn",
        "type": "campaign",
        "race": "PRES",
        "cycle": "2016",
        "message": "pro",
        "sponsor": [
            "Republican National Cmte"
        ],
        "sponsor_type": "PAC",
        "subject": [
            "Job Accomplishments"
        ],
        "person": [
            "Donald Trump"
        ]
    },
    "268.1|269.1": {
        "collection": [
            "nancy_pelosi_archive"
        ],
        "subject": [
            "Voting"
        ]
    }
}
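
A minimal sketch of how annotations keyed by `"start|end"` could be expanded into one "synthetic" document per clip before indexing. The function name, the `program` field, and the `HYPOTHETICAL_PROGRAM_ID` value are assumptions for this example; the archive's real document schema is not shown in these slides.

```python
import json

ANNOTATIONS = json.loads("""
{
    "268.1|269.1": {"subject": ["Criminal Activity", "Crime"]},
    "266.7|267.2": {"ad_id": "PolAd_DonaldTrump_d9dsn", "type": "campaign"}
}
""")


def expand_clips(program_id, annotations):
    """Turn each "start|end" key into a flat document with its own time range."""
    docs = []
    for time_range, fields in annotations.items():
        start, end = (float(t) for t in time_range.split("|"))
        doc = {"program": program_id, "start": start, "end": end}
        doc.update(fields)
        docs.append(doc)
    return sorted(docs, key=lambda d: d["start"])


for doc in expand_clips("HYPOTHETICAL_PROGRAM_ID", ANNOTATIONS):
    print(doc)
```

Each resulting dict can be indexed on its own, so a search hit points at a clip rather than a whole program.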

Where We're Going

https://archive.org/details/TVNewsKitchen
want to serve journalists, researchers, librarians & more
responsible behavior and access to data
non-consumptive use

[Part 2] "There Goes 2 Weeks"

deep dive into Image Matching and
Facial Recognition

An imposter does not have Imposter Syndrome

CNNs

Convolutional Neural Network
- filtered neural network
each layer uses output from prior layer as input
instead of rule-based learning, use classified datasets to learn
multi-node connections (but not "fully connected")
"data squashers"

CNN Example

feed in image
node looking for eyelash
node looking for iris
- could feed to node looking for eye
meanwhile... nose node
- all feed to face recognizer node
- could feed to "is this Barack Obama?"
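
To make "each layer uses output from prior layer as input" concrete, here is a toy sketch in plain Python: one convolution, a ReLU, then a max-pool "data squasher". The kernel is hand-picked to respond to a dark-to-light vertical edge; in a real CNN the kernels are learned from labeled data, and like most frameworks this "convolution" is technically a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses, zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Shrink the map by keeping the strongest response in each size x size tile."""
    rows, cols = len(feature_map) // size, len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# 6x6 image: dark left half, bright right half
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
edge_kernel = [[-1, 1], [-1, 1]]

layer1 = relu(conv2d(image, edge_kernel))  # responds where the edge is
layer2 = max_pool(layer1)                  # smaller summary fed to the next layer
print(layer2)
```

Stacking more layers of this kind is how an "eyelash" or "iris" detector can feed an "eye" detector, which in turn feeds a face recognizer.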

Guru

Rik Heijdens from jwplayer

Demuxed 2017 talk
feed in video - for each shot, make 3 vectors:
- image Inception CNN (tensorflow)
- audio CNN spectrogram
- text transcripts/STT into Word2Vec
concat vectors, compare (cosine similarity), and graph
... yields scene detection
all just for ideal Ad insertion!
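
The "concat vectors, compare (cosine similarity)" step can be sketched as below. The tiny vectors and the `threshold` value are made up for illustration, and treating a drop in similarity between adjacent shots as a scene boundary is a simplification assumed here, not necessarily the method from the talk.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shot_vector(image_vec, audio_vec, text_vec):
    """Concatenate the three per-shot vectors into one."""
    return list(image_vec) + list(audio_vec) + list(text_vec)


def scene_boundaries(shot_vectors, threshold=0.5):
    """Indexes of shots whose similarity to the previous shot falls below threshold."""
    return [
        i for i in range(1, len(shot_vectors))
        if cosine_similarity(shot_vectors[i - 1], shot_vectors[i]) < threshold
    ]


shots = [
    shot_vector([1.0, 0.0], [0.0], [1.0]),
    shot_vector([0.9, 0.1], [0.0], [1.0]),  # looks and sounds like the first shot
    shot_vector([0.0, 1.0], [1.0], [0.0]),  # very different: a new scene starts
]
print(scene_boundaries(shots))
```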

Image Matching

pixel diff algorithms (MAE, RMSE, MSE)
perceptual hashing pHash.org
- image => 8x8 grayscale
- convolve to 8x8 image with DCT
- reduce to 64bit number
- hamming distance Int64 pairs
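
A rough sketch of the perceptual hashing steps listed above, in plain Python. It assumes the image has already been reduced to an 8x8 grid of gray values; thresholding each DCT coefficient against the median of the non-DC coefficients is one common choice and is an assumption here, so the bits will not match what pHash.org produces.

```python
import math


def dct_2d(block):
    """Naive 2-D DCT-II of a square block (fine for 8x8)."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                    )
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def phash(gray8x8):
    """Reduce an 8x8 grayscale grid to a 64-bit integer."""
    coeffs = [c for row in dct_2d(gray8x8) for c in row]
    ac = coeffs[1:]  # skip the DC term, which only reflects overall brightness
    median = sorted(ac)[len(ac) // 2]
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits


def hamming(a, b):
    """Number of differing bits; a small distance means the images look alike."""
    return bin(a ^ b).count("1")


checker = [[255 if (x + y) % 2 else 0 for y in range(8)] for x in range(8)]
print(hex(phash(checker)), hamming(phash(checker), phash(checker)))
```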

pHash - to gray 8x8

TensorFlow & Training

https://www.tensorflow.org/tutorials/image_recognition
trained CNNs, locally run
GoogLeNet Inception general classifier
retrainable / customizable
- redo 'top layer' (Rik idea)
- https://www.tensorflow.org/tutorials/image_retraining
2048 multi-byte vectors (floats)
iOS smaller single-byte vectors
cosine distance comparisons
can just compare vectors (and ignore readable classification labels (Rik idea))

OpenFace

implementation of FaceNet
https://cmusatyalab.github.io/openface/demo-3-classifier
similar to tensorflow (Torch..)

OpenFace Training

3+ images per person/face
avoid 'overfit'
align eyes + nose (nostrils?)

Siamese "one shot" CNN recognizers

Rik idea
differentiate instead of classify
learns similarity of 2 inputs

- repo / py notebook
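
To show what "differentiate instead of classify" can mean in practice, here is a sketch of a contrastive loss, one common training objective for Siamese networks. It is swapped in for illustration: the slides do not say which loss the notebook used, and the `margin` value is an assumption. Both inputs go through the same network to produce embeddings; only the loss on those embeddings is shown.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    d = euclidean(emb_a, emb_b)
    if same:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Same person, embeddings far apart: large loss
print(contrastive_loss([0, 0], [3, 4], same=True))
# Different people, embeddings already far apart: no loss
print(contrastive_loss([0, 0], [3, 4], same=False))
```

Because the network learns a distance rather than a fixed set of labels, a new face can be compared against a single reference image without retraining.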

AI Ethics

face tracking only public figures
https://www.itic.org/resources/AI-Policy-Principles-FullReport2.pdf
- min. government regulation & access
- public/private partner; diversity/inclusion++
- preserve human dignity, rights, freedoms
- min. risk to humans; human control
- large datasets -- avoid harmful bias
open discussion

Demo Time

Siamese network
miniARchive
tensorflow
google translate

help Shape US with YOUR Thoughts

extend/shape our APIs
AI ideas
research, visualizations
tag clips with AI metadata or pointers to Decentralized metadata
more!

Ergo

decentralized research and AI
built on top of
a library of stable, untampered worldwide TV recordings

The End
