dashwood.net - Ryan Stefan's Micro Blog

Amazon as Keyword Research

Dec 18, 2018

Amazon's qid in the URL is the number of seconds since January 1st, 1970 (the Unix epoch). They use it to tell whether a link comes from a fresh search, which signals that the keyword was actually searched, or whether the qid is old, which means nobody searched it and someone just visited a saved link. Here's the code:

import datetime

# seconds since the Unix epoch (Jan 1, 1970) -- the same value Amazon uses for qid
round((datetime.datetime.today() - datetime.datetime(1970, 1, 1)).total_seconds())
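
And a quick sketch of how that value might get attached to a search URL. The k and qid parameter names are assumptions based on how Amazon search URLs usually look, not something I pulled from their docs:

import datetime

def fresh_search_url(keyword):
    # qid = seconds since the Unix epoch, so the link looks freshly searched
    qid = round((datetime.datetime.today() - datetime.datetime(1970, 1, 1)).total_seconds())
    return 'https://www.amazon.com/s?k={k}&qid={qid}'.format(k=keyword.replace(' ', '+'), qid=qid)

print(fresh_search_url('usb cable'))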

Scrapy export to JSON:

scrapy crawl amazonproducts -o data3.json
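
Scrapy picks the feed format from the file extension, so the same command with a .csv name gives you a spreadsheet-friendly export instead:

scrapy crawl amazonproducts -o data3.csv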

Shareasale Scraper and Converter

Dec 18, 2018

I'm trying to find a clever way to get a bunch of keywords in a specific niche, and of course my first instinct is to scrape them. Getting the data was pretty easy, actually. I made a URL generator with a bunch of keywords and imported the URLs into a Chrome extension web scraper (that way I avoided having to handle sessions in a scraper, which was way easier). Make sure to use the web scraper I linked here, because the other ones are garbage.

The only annoying thing is that the scraper doesn't have a good way to group content that came from the same parent div unless you scrape all of that div's content, which is super messy. So once the scrape finishes, I just copy the column with all of the data, paste it into a text file, and find-replace tabs with nothing (delete ALL the tabs; there's a quick Python version of this step after the sample below). It will look something like this:

"ITP
Geekcreit DUE R3 32 Bit ARM Module With USB Cable Arduino Compatible
SKU: 906466
Price: $12.99
Est. $0.78 Per Sale
45 Day Cookie
BANGGOOD TECHNOLOGY CO., LIMITED
Merchant ID: 32599
www.banggood.com
30 day Average Commission: $2.93
30 day Average Sale Amount: $42.15
30 Day Average Reversal Rate: 2.45 %
30 Day Conversion Rate: 6.81%
Join Program
Show More Products
Add to Favorites"
"
Wooden Mixing Paddle, 42"" Length
SKU: 10106
Price: $13.60
Est. $0.78 Per Sale
30 Day Cookie
Kerekes kitchen & Restaurant Supplies
Merchant ID: 57964
www.BakeDeco.com
30 day Average Commission: $0.82
30 day Average Sale Amount: $140.17
30 Day Average Reversal Rate: 0.00 %
30 Day Conversion Rate: 10.32%
Join Program
Show More Products
Add to Favorites"
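
By the way, the tab find-replace step can be scripted instead of done in a text editor. A minimal sketch, assuming the pasted column lives in scraped.txt (placeholder name):

# strip every tab from the pasted scrape before feeding it to the converter
with open('scraped.txt') as f:
    cleaned = f.read().replace('\t', '')

with open('scraped.txt', 'w') as f:
    f.write(cleaned)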

So I had to create a convert_scraped function that looks for lines starting with a single " but not a doubled "" (some products have double quotes in their titles). Surprisingly, it worked perfectly with zero issues, and even if a few lines got mixed up on one product, the parser resets at each product boundary so nothing cascades. Anyways, here's the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import random
import csv


# one category / filter keyword per line in each file
cats = [x.rstrip() for x in open('categories.txt', 'r').readlines()]
filters = [x.rstrip() for x in open('filters.txt', 'r').readlines()]
types = ('productSearch', 'basicKeyword')
pages = list(range(1, 452, 50))  # result-page offsets: 1, 51, 101, ..., 451


def url_gen(search_type, keyword, page_start, search_filter):
    # build a ShareASale program-search URL for one keyword / page / sort combo
    return ('https://account.shareasale.com/a-programs.cfm#searchType={search_type}&'
            'keyword={keyword}&ascordesc=desc&start={page_start}&order={search_filter}'
            .format(search_type=search_type, keyword=keyword, page_start=page_start, search_filter=search_filter))


def all_products(file_name):
    # generate every category x filter x page-offset URL and save a scraper sitemap
    urls = []
    for cat in cats:
        for search_filter in filters:
            for page_start in pages:
                urls.append(url_gen(types[0], cat, page_start, search_filter))

    save_sitemap(create_sitemap(urls, file_name), file_name)


def create_sitemap(urls, file_name):
    # build the JSON array of start URLs: ["url1","url2",...]
    urls_string = '[{}]'.format(','.join('"{url}"'.format(url=url) for url in urls))

    # sitemap JSON for the web scraper extension: four text selectors per results page
    return ('{{"_id":"{file_name}{random_int}","startUrl":{urls_string},"selectors":[{{"id":"name",'
            '"type":"SelectorText","parentSelectors":["_root"],"selector":"div.mGeneral div.org",'
            '"multiple":true,"regex":"","delay":0}},{{"id":"pnk","type":"SelectorText","parentSelectors":["_root"],'
            '"selector":"div.org a","multiple":true,"regex":"","delay":0}},{{"id":"price","type":"SelectorText",'
            '"parentSelectors":["_root"],"selector":"div.price","multiple":true,"regex":"","delay":0}},{{"id":"per sale",'
            '"type":"SelectorText","parentSelectors":["_root"],"selector":"div.cookie","multiple":true,"regex":"","delay":0}}]}}'
            .format(file_name=file_name, random_int=str(random.randint(1, 999)), urls_string=urls_string))


def save_sitemap(sitemap, file_name):
    with open('./generated/{}-sitemap-{}.txt'.format(file_name, str(random.randint(1, 999))), 'w') as file:
        file.write(sitemap)

    print(file_name, 'saved in /generated')


def convert_scraped(file_name):

    keys = ['title', 'sku', 'price', 'per_sale', 'cookie', 'company', 'merch_id',
            'website', 'commission', 'sale_amount', 'reversal_rate', 'conversion_rate',
            'join', 'more', 'add']

    with open('./scraped/{file_name}.txt'.format(file_name=file_name), 'r') as f, \
            open('data.csv', 'w', newline='') as csvf:
        writer = csv.writer(csvf)
        writer.writerow(keys)
        count = 0
        data = {}
        for line in f.readlines():
            count += 1
            # a line starting with a single " (not a doubled "") marks a new product
            if line.startswith('"') and not line.startswith('""'):
                count = 0
                if data:  # nothing to write before the first product
                    writer.writerow([data.get(key, '') for key in keys])
                data = {}  # reset so stale fields never bleed into the next product
            elif count <= len(keys):
                data[keys[count - 1]] = line.rstrip()
        if data:  # the last product has no closing delimiter, so flush it here
            writer.writerow([data.get(key, '') for key in keys])

    print('Data written to data.csv')


if __name__ == '__main__':
    # all_products('products')
    convert_scraped('shareasale1-data')
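
(Note to self: to regenerate the scraper sitemaps instead of converting, just swap which of the two calls is commented out in __main__.)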