Extraction with Regular Expressions
Extract URLs from text
pat = r'''(?x)(             # verbose: identify URLs within text
    (http|ftp|gopher)       # make sure we find a resource type
    ://                     # ...needs to be followed by colon-slash-slash
    [^ \n\r]+               # anything up to a space, newline or carriage return is URL
    \w                      # URL always ends in a word character
    (?=[\s\.,\"])           # assert: followed by whitespace/period/comma/double quote
)                           # end of match group'''
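A minimal usage sketch, assuming Python's `re` module and the same pattern compiled with `re.VERBOSE` rather than the inline `(?x)` flag. The names `URL_PATTERN` and `extract_urls` and the sample text are illustrative, not part of the original notes. The scheme group is made non-capturing here so that `findall` returns plain strings.

```python
import re

# Same structure as the pattern above; names are illustrative assumptions.
URL_PATTERN = re.compile(r'''
    (                       # capture the whole URL
      (?:http|ftp|gopher)   # resource type (non-capturing)
      ://                   # colon-slash-slash
      [^ \n\r]+             # run of characters up to a space or line break
      \w                    # last character is a word character
      (?=[\s.,"])           # must be followed by whitespace, period, comma or quote
    )
''', re.VERBOSE)


def extract_urls(text):
    """Return every URL found in text, in order of appearance."""
    return URL_PATTERN.findall(text)


sample = ('See http://www.youtube.com/watch?v=04RZrf3-Mgo, '
          'then ftp://example.org/file.txt today.')
print(extract_urls(sample))
# ['http://www.youtube.com/watch?v=04RZrf3-Mgo', 'ftp://example.org/file.txt']
```

Because the lookahead requires a following character, a URL sitting at the very end of the input is not matched unless the text ends with whitespace or punctuation.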
Extract URL from embedded videos in web pages
This section concerns embedded videos only. There are several kinds:
- <embed src="music.mp3">
- <embed src="/mediaplayer.swf" flashvars="file=video.flv"
- <embed src="http://site.com/player.swf?docId=-5757040"
- <object><param name="movie" value="http://www.dailymotion.com/swf/k7hL5"...
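The embed kinds listed above can be collected with a small parser. This is only a sketch, assuming Python's standard `html.parser` and `urllib.parse` modules; the class name `EmbedURLCollector` and the helper `collect_embed_urls` are hypothetical, and treating the `file` key of `flashvars` as the media URL is an assumption based on the second example.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class EmbedURLCollector(HTMLParser):
    """Collect candidate media URLs from <embed> and <param name="movie"> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "embed":
            # Kind 1 and 3: the player or media file is in the src attribute.
            if attrs.get("src"):
                self.urls.append(attrs["src"])
            # Kind 2: the media file is passed to the player through flashvars.
            flashvars = parse_qs(attrs.get("flashvars") or "")
            self.urls.extend(flashvars.get("file", []))
        elif tag == "param" and (attrs.get("name") or "").lower() == "movie":
            # Kind 4: <object> carries the URL in <param name="movie" value="...">.
            if attrs.get("value"):
                self.urls.append(attrs["value"])


def collect_embed_urls(html):
    """Return the candidate URLs found in an HTML fragment, in document order."""
    parser = EmbedURLCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


print(collect_embed_urls('<embed src="/mediaplayer.swf" flashvars="file=video.flv">'))
# ['/mediaplayer.swf', 'video.flv']
```

A parser is used here instead of a regular expression because attribute order and quoting vary between hosts.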
First, we should focus on embeds whose FLV URL is visible in the markup.
Then look at the hosts supported by Clive, and find a way to retrieve the original URL from the embedded HTML code:
- YouTube
- GoogleVideo
- Dailymotion
- Metacafe
- Guba
- Sevenload
- Myvideo
HTML code:
/* YouTube */
<object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/04RZrf3-Mgo&hl=en&fs=1"><param name="allowFullScreen" value="true"><embed src="http://www.youtube.com/v/04RZrf3-Mgo&hl=en&fs=1" type="application/x-shockwave-flash" allowfullscreen="true" width="425" height="344"></object>
http://www.youtube.com/watch?v=04RZrf3-Mgo

/* Dailymotion */
<object width="420" height="258"><param name="movie" value="http://www.dailymotion.com/swf/k7hL5JL5Ls164fIBeB&related=1"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><embed src="http://www.dailymotion.com/swf/k7hL5JL5Ls164fIBeB&related=1" type="application/x-shockwave-flash" allowfullscreen="true" allowscriptaccess="always" width="420" height="258"></object>
http://www.dailymotion.com/video/x6bttp_chanson-geek-par-dedo-et-yacine-jcc

/* Google video */
<embed style="width: 400px; height: 326px;" id="VideoPlayback" type="application/x-shockwave-flash" src="http://video.google.com/googleplayer.swf?docId=-5757040207684919969&hl=en" flashvars="">
video.google.com/videoplay?docid=-5757040207684919969
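Going by the three samples above, the YouTube and Google Video page URLs can be rebuilt from the embed URL with string handling alone, while the Dailymotion page URL uses a different identifier and does not appear to be derivable this way. The sketch below assumes Python's standard library; the function name `page_url_from_embed` is hypothetical, and the `http://` scheme on the Google Video result is an assumption (the sample shows none).

```python
import re
from urllib.parse import urlparse, parse_qs


def page_url_from_embed(embed_url):
    """Map an embed URL to its watch-page URL, or None if no rule applies."""
    parts = urlparse(embed_url)
    host = parts.netloc.lower()

    if host.endswith("youtube.com"):
        # /v/<id>&hl=en&fs=1  ->  watch?v=<id>
        match = re.match(r"/v/([\w-]+)", parts.path)
        if match:
            return "http://www.youtube.com/watch?v=" + match.group(1)

    elif host == "video.google.com":
        # googleplayer.swf?docId=<id>  ->  videoplay?docid=<id>
        doc_ids = parse_qs(parts.query).get("docId")
        if doc_ids:
            return "http://video.google.com/videoplay?docid=" + doc_ids[0]

    # Dailymotion and unknown hosts: no purely textual rule in this sketch.
    return None


print(page_url_from_embed("http://www.youtube.com/v/04RZrf3-Mgo&hl=en&fs=1"))
# http://www.youtube.com/watch?v=04RZrf3-Mgo
```

For hosts such as Dailymotion, a lookup against the site itself would presumably be needed; that step is left out here.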
videodownloader
Look at this Firefox extension: http://videodownloader.net
You can download from:
Angry Alien, Blennus, Blip.tv, Break.com, Dailymotion, Double Agent, eVideoShare, Free Video Blog, Google Video, Grinvi, iFilm, Keiichi Anime Forever, Metacafe, MySpace, MySpace Video Code, Putfile, Totally Crap, vidiLife, vSocial, AnimeEpisodes.Net, Badjojo, Blastro, Bofunk, Bolt, Castpost, CollegeHumor, Current TV, Dachix, Danerd, DailySixer.com, DevilDucky, EVTV1, FindVideos, Hiphopdeal, Kontraband, Lulu TV, Midis.biz, Music.com, MusicVideoCodes.info, Newgrounds, NothingToxic, PcPlanets, Pixparty, PlsThx, Revver, Sharkle, SmitHappens, StreetFire, That Video Site, VideoCodes4U, VideoCodesWorld, VideoCodeZone, Vimeo, Yikers, YouTube and ZippyVideos.
New sites are added every day.
More embedded hosts (for later use)
/* sanchi.ro */
<object width="425" height="319"><embed src="http://www.sanchi.ro/flvPlayer.swf?hiddenGui=true&scaleMode=full&autoStart=false&startImage=http://www.sanchi.ro/thumb/2_11042.jpg&flvToPlay=http://www.sanchi.ro/embeders.php?flv=11042" type="application/x-shockwave-flash" allowfullscreen="true" bgcolor="#000000" width="425" height="319"></object>

/* current.com */
<object classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" width="400" height="350"><param name="movie" value="http://current.com/e/89181144/en_US"><param name="wmode" value="transparent"><param name="allowfullscreen" value="true"><param name="allowscriptaccess" value="always"><embed src="http://current.com/e/89181144/en_US" type="application/x-shockwave-flash" wmode="transparent" allowfullscreen="true" allowscriptaccess="always" width="400" height="350"></object>

/* liveleak.com */
<object width="450" height="370"><param name="movie" value="http://www.liveleak.com/e/a5c_1219176992"><param name="wmode" value="transparent"><embed src="http://www.liveleak.com/e/a5c_1219176992" type="application/x-shockwave-flash" wmode="transparent" width="450" height="370"></object>

/* streetfire */
<embed src="http://videos.streetfire.net/vidiac.swf?video=495d11be-a25a-4432-aa94-9aee00cbe510" allowfullscreen="true" type="application/x-shockwave-flash" pluginspage="http://www.macromedia.com/go/getflashplayer" width="428" height="352">

/* koreus */
<object type="application/x-shockwave-flash" data="http://www.koreus.com/video/dilemma" width="400" height="320"><param name="movie" value="http://www.koreus.com/video/dilemma"><embed src="http://www.koreus.com/video/dilemma" type="application/x-shockwave-flash" width="400" height="300"></object>

/* collegehumor */
<object type="application/x-shockwave-flash" data="http://www.collegehumor.com/moogaloop/moogaloop.swf?clip_id=1828310&fullscreen=1" width="480" height="360"><param name="allowfullscreen" value="true"><param name="AllowScriptAccess" value="true"><param name="movie" quality="best" value="http://www.collegehumor.com/moogaloop/moogaloop.swf?clip_id=1828310&fullscreen=1"></object>
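Some of these players, sanchi.ro for instance, carry the media location as a query parameter of the player URL, while others only pass an opaque identifier. A rough way to tell the two apart is sketched below, assuming Python's `urllib.parse`; the helper name `urls_in_query` is hypothetical, and treating any parameter value that starts with `http://` or `https://` as a candidate is a heuristic, not a rule taken from these hosts.

```python
from urllib.parse import urlparse, parse_qs


def urls_in_query(embed_url):
    """Return {parameter: value} for query parameters whose value looks like a URL."""
    query = urlparse(embed_url).query
    found = {}
    for name, values in parse_qs(query).items():
        for value in values:
            # Heuristic: keep only values that are themselves absolute URLs.
            if value.startswith(("http://", "https://")):
                found[name] = value
    return found


sanchi = ("http://www.sanchi.ro/flvPlayer.swf?hiddenGui=true&scaleMode=full"
          "&autoStart=false&startImage=http://www.sanchi.ro/thumb/2_11042.jpg"
          "&flvToPlay=http://www.sanchi.ro/embeders.php?flv=11042")
print(urls_in_query(sanchi)["flvToPlay"])
# http://www.sanchi.ro/embeders.php?flv=11042
```

For the streetfire URL the result is empty, since its `video` parameter holds only an identifier.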