dashwood.net -

Ryan Stefan's Micro Blog

Amazon as Keyword Research

Dec 18, 2018

Amazon's qid in the URL is the number of seconds since January 1st, 1970 (a Unix timestamp). They use it to detect whether a link comes from a fresh search, which signals that the keyword was actually searched, or whether the qid is old, meaning the keyword wasn't used and someone just visited the link (I probably explained that like crap, but I've been coding for like 20 hours straight and my brain hurts). Here's the code:

import datetime
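The snippet above is cut off after the import. As a minimal sketch, assuming the qid really is a plain Unix timestamp in seconds, it could be decoded and checked for freshness like this. The function names and the one-hour cutoff are placeholders of mine, not anything Amazon documents:

```python
import datetime


def qid_to_datetime(qid):
    """Convert a qid (assumed to be Unix seconds) to an aware UTC datetime."""
    return datetime.datetime.fromtimestamp(int(qid), tz=datetime.timezone.utc)


def is_fresh(qid, now=None, max_age_seconds=3600):
    """Return True if the qid is at most max_age_seconds old (hypothetical cutoff)."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    age = (now - qid_to_datetime(qid)).total_seconds()
    return 0 <= age <= max_age_seconds


print(qid_to_datetime(1545091200))  # 2018-12-18 00:00:00+00:00
```

Passing `now` explicitly makes the freshness check easy to test against saved URLs.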


scrapy export to json:

scrapy crawl amazonproducts -o data3.json

Fixed Apache Log Directory Error

Nov 14, 2018
sudo mkdir /var/log/apache2/
sudo touch /var/log/apache2/{access,error,other_vhosts_access,suexec}.log
sudo chown -R root:adm /var/log/apache2/
sudo chmod -R 750 /var/log/apache2

Spent hours getting the plotly graph working with flask and sized properly. I don't have the energy to even go into it, but here's a picture:


Here's the core code:

import json
import plotly

import pandas as pd
import numpy as np

def graph_data():
    # graph data/settings dict

    graphJSON = json.dumps(graphs, cls=plotly.utils.PlotlyJSONEncoder)
    return graphJSON


from flask import render_template
from markupsafe import Markup

from graphs import graph_data

graph_data = graph_data()

def product(asin):
    return render_template('amazon-product-tests.html', product=product, graph_data=Markup(graph_data))

# jinja
<div id="trust-score-graph"></div>

# Javascript
var graphs = {{graph_data}};

Plotly.newPlot('trust-score-graph', graphs[0].data,
            graphs[0].layout, {"showLink": false, "responsive": true, "staticPlot": true});


Enabled comments by default on my personal blog: set comments_enabled in the XML at content/private/comments.xml

Sharing and Flask Dev

Nov 10, 2018

Added NTFS folder sharing over the network without my user actually owning the folder. Here's how I enabled it: add usershare owner only = false below [global]

sudo nano /etc/samba/smb.conf

# Any line which starts with a ; (semi-colon) or a # (hash) 
# is a comment and is ignored. In this example we will use a #
# for commentary and a ; for parts of the config file that you
# may wish to enable
# NOTE: Whenever you modify this file you should run the command
# "testparm" to check that you have not made any basic syntactic 
# errors. 

#======================= Global Settings =======================


[global]
usershare owner only = false

## Browsing/Identification ###

ctrl + o

Fix NTFS Permissions

Found some promising insight on how to give a user access to mounted drives.

If you mount a partition to a folder within /home/user it will be owned by the user. Here's the line I added to my /etc/fstab.

UUID=9e5bb53c-4443-4124-96a8-baeb804da204 /home/fragos/Data ext4 errors=remount-ro 0 1

Keyword Raking / Splitting

I'm going to rake keywords from the comments, then use a one-sentence lexsum of all of the titles for the loop display and other stuff.

# Rake keywords
from rake_nltk import Metric, Rake

rake = Rake(min_length=2, max_length=6,
            ranking_metric=Metric.DEGREE_TO_FREQUENCY_RATIO)
rake.extract_keywords_from_text(textjoin)
sumkeywords.append(' : '.join(rake.get_ranked_phrases()))

Source: https://github.com/csurfer/rake-nltk

I had to change the word tokenizer in the class to the nltk twitter tokenizer so that it wouldn't split words by apostrophes.

from nltk.tokenize import wordpunct_tokenize, TweetTokenizer
tknzr = TweetTokenizer()


word_list = [word.lower() for word in tknzr.tokenize(sentence)]

I've also decided to use ' : ' as my official separator for lists of terms. Commas are too common inside the terms themselves and might add complications in the future.
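A quick sketch of how that separator round-trips, even when a phrase contains a comma or an apostrophe. The helper names and sample phrases are made up for illustration:

```python
SEP = ' : '


def join_terms(terms):
    """Join a list of terms into one string using the ' : ' separator."""
    return SEP.join(terms)


def split_terms(joined):
    """Split a ' : ' separated string back into a list, dropping empty entries."""
    return [term for term in joined.split(SEP) if term]


phrases = ["battery life, overall", "doesn't fit", "great value"]
joined = join_terms(phrases)
print(joined)
print(split_terms(joined) == phrases)  # True
```

This only breaks if a term itself contains ' : ', which is a lot less likely than a stray comma.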

Flask Dev

I loaded the CSV file generated by the lexsum generator into the flask app to preview the summaries and keyword extraction.

# load data and create sub dataframe for product asin
data = pd.read_csv('./static/data/sample-products.csv', index_col=0)
product_comments = data.loc[data['asin'] == asin]

# create variables for each rating
for number in range(1,6):
    current = product_comments.loc[product_comments['rating'] == number]
    product['{}_keywords'.format(number)] = current['keywords'].tolist()[0]
    product['{}_title'.format(number)] = current['title'].tolist()[0]
    product['{}_text'.format(number)] = current['text'].tolist()[0]

# load variables inside flask template


Lexsum in Action

Nov 09, 2018

I finally got around to working on my Amazon project again. 

Misc Notes

# Change postgres data directory

File path:

File System Headache

I decided to clean up my hard drives, but I forgot how much of a headache it was trying to get an NTFS drive to work with transmission-daemon. Whatever, I'll just save to my ext4 partition for now and fix it later.


I bricked my OS install and had to go down a 3-hour nightmare trying to fix it. I eventually discovered that the cause was a label for my old partition's mount point in the fstab file. Solution:

sudo nano /etc/fstab

# comment out old label

ctrl + o to save
ctrl + x to exit


My computer still doesn't restart properly because I broke something in the boot order trying to fix it. Not a big deal, I just enter my username/password in the terminal and then type startx.

LexSum Progress

Had to slice to 50 for each rating to save time, but I can probably make it longer for launch. At first I was thinking there would be 60 million entities to process, but actually it's more like 900k x 5 (for each rating), and as long as I don't lexsum 1000+ reviews per rating it should finish in a few days. I reallllly need to add a timer function asap. I can just time 1000 or so products, scale that up to 900k or whatever the total number of products in my database is, and I should have a pretty good idea how long it will take.

if len(titles) > 50:
    titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))
else:
    titlejoin = ' '.join(lex_sum(' '.join(titles), sum_count))
    textjoin = ' '.join(lex_sum(' '.join(comments), sum_count))

I'm thinking I can clean these lines up now that I'm staring at it. Maybe something like:

# slicing past the end of a list is safe, so min() isn't needed
titlejoin = ' '.join(lex_sum(' '.join(titles[:50]), sum_count))
textjoin = ' '.join(lex_sum(' '.join(comments[:50]), sum_count))

My estimated time remaining function adds the time elapsed every ten iterations to a list, takes the last 500 or fewer entries of that list and averages them, and finally multiplies that average by the number of ten-iteration blocks remaining and displays it in a human readable format:

import functools
import time

avg_sec = 0
times = []
start = time.time()

# Display time remaining
if avg_sec:
    seconds_left = ((limit - count) / 10) * avg_sec
    m, s = divmod(seconds_left, 60)
    h, m = divmod(m, 60)
    print('Estimated Time Left: {}h {}m {}s'.format(
        round(h), round(m), round(s)))

# Every ten iterations, record the block time and refresh the average
if not count % 10:
    end = time.time()
    time_block = end - start
    start = end
    times.append(time_block)
    recent = times[-500:]  # last 500 blocks, or fewer if not that many yet
    avg_sec = functools.reduce(lambda x, y: x + y, recent) / len(recent)
    print('Average time per 10:', round(avg_sec, 2), 'seconds')
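The same estimate can be packaged as one small function, which makes it easier to test outside the main loop. This is only a sketch: the name format_eta and its parameters are invented here, and it swaps functools.reduce for statistics.mean.

```python
from statistics import mean


def format_eta(block_times, remaining, block_size=10, window=500):
    """Estimate time left from the most recent block timings.

    block_times: seconds taken by each completed block of block_size iterations.
    remaining: number of iterations still to run.
    """
    if not block_times:
        return 'Estimated Time Left: unknown'
    avg_sec = mean(block_times[-window:])
    seconds_left = (remaining / block_size) * avg_sec
    # Round once up front so the display never shows something like "60s"
    m, s = divmod(round(seconds_left), 60)
    h, m = divmod(m, 60)
    return 'Estimated Time Left: {}h {}m {}s'.format(h, m, s)


print(format_eta([2.0, 4.0], remaining=12000))  # Estimated Time Left: 1h 0m 0s
```

Averaging only the last `window` blocks keeps the estimate responsive if later products are slower or faster than the early ones.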

Another thought I had is that this save_df module I coded (it's at like 400 lines of code already x_x) is actually a crucial part of my ultimate code base. I'm pretty happy that I spent so much time writing it into proper functions.

Fixed Slow Database Queries - Indexing to the Rescue!

Nov 01, 2018

So I ran my summarizer yesterday and it took literally all day to run only 200 products through the lex sum function. So I went through my code and added a timer for each major step in the process like so:

start = time.time()
asin_list = get_asins(limit)
end = time.time()
print('Get ASINs: ', end - start)
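Repeating those four lines around every step gets old fast. One way to wrap the pattern, sketched here with a hypothetical timed helper (the name and the results dict are not from my script), is a context manager:

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print how long the wrapped block took; optionally store it in results."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print('{}: {:.3f}s'.format(label, elapsed))


timings = {}
with timed('Get ASINs', timings):
    time.sleep(0.01)  # stand-in for asin_list = get_asins(limit)
```

Collecting the timings in a dict makes it easy to see which step dominates after a full run.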

Turns out it was taking over 60 seconds per query. I did the math and, at the rate it was going, it would take almost two years to complete every product in my database. So I started looking around at different ways to group large databases. Turns out databases are a lot more complicated than I believed. It felt like looking for a PHP solution back in high school when I didn't know enough to know what to look for. Finally I stumbled upon a feature called indexing. First I added the indexing code inside of my script, which had no effect even though it seemed to have run properly. Still, I was not going to give up that easily, so I opened up postgres directly in the terminal and poked around to see if the index had been applied. Turns out it was not applied at all. Here is the code I used to index the asin column in the reviews table:

# Remote Connect 
postgres psql -U ryan -h -p 5432 databasename
# Display table Indexes
SELECT * FROM pg_indexes WHERE tablename = 'reviews';

# Create Index
CREATE INDEX asin_index ON reviews (asin);
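To see what an index changes without touching the real database, here is a small sketch using Python's built-in sqlite3 as a stand-in for Postgres (that swap is an assumption for illustration; the table contents are fake). It prints the query plan for an asin lookup before and after creating the index. A missing commit is one common reason an index created from a script never shows up, though whether that was my problem is a guess.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE reviews (asin TEXT, rating INTEGER, text TEXT)')
conn.executemany(
    'INSERT INTO reviews VALUES (?, ?, ?)',
    [('B{:09d}'.format(i % 1000), i % 5 + 1, 'review') for i in range(10000)])


def query_plan(connection):
    """Return SQLite's plan for looking up one asin, as a single string."""
    rows = connection.execute(
        'EXPLAIN QUERY PLAN SELECT * FROM reviews WHERE asin = ?',
        ('B000000042',)).fetchall()
    return ' '.join(str(row[-1]) for row in rows)


plan_before = query_plan(conn)
conn.execute('CREATE INDEX asin_index ON reviews (asin)')
conn.commit()  # commit explicitly so the index is not left in an open transaction
plan_after = query_plan(conn)

print('before:', plan_before)  # a full table scan
print('after: ', plan_after)   # a search that uses asin_index
```

In Postgres the equivalent check is running EXPLAIN on the query, or the pg_indexes lookup shown above.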

Eureka! It worked, and now the script that took all day to run yesterday ran in about a minute flat! That is the biggest difference in performance I've ever experienced and I can't wait to see where else indexing will help my databases.

Other than that, Erin showed me a bunch of stuff in Illustrator and Photoshop.

  • ctrl+click with select tool enables auto-select
  • ctrl+d — deselect
  • ctrl+shift+i — invert selection
  • ctrl+j — duplicate layer
  • ctrl+alt+j — duplicate and name layer