
Scrapinghub and GSoC 2016

At Scrapinghub, we love open source and we know the community can build amazing things.

If you haven’t heard about it already, Google Summer of Code is a global program that offers students stipends to write code for open source projects. Scrapinghub is applying to GSoC for the third time, having participated in GSoC 2014 and 2015. Julia Medina, our student in 2014, did amazing work on Scrapy’s API and settings. Jakob de Maeyer, our student in 2015, did a great job getting Scrapy Addons off the ground. (We plan to wrap this work up for Scrapy 1.2.)

If you're interested in participating in GSoC 2016 as a student, take a look at the curated list of ideas below. Check the corresponding “Information for Students” section and get in touch with the mentors. Don’t be afraid, we’re nice people :)

We would be thrilled to see any of the ideas below happen, but these are just our ideas: you are free to come up with a new subject, preferably around information retrieval :)

Let’s make it a great Google Summer of Code!

Scrapy Ideas for GSoC 2016

Scrapy and Google Summer of Code

Scrapy is a very popular web crawling and scraping framework for Python (15th among GitHub’s most trending Python projects) used to write spiders for crawling and extracting data from websites. Scrapy has a healthy and active community, and it's applying for Google Summer of Code in 2016.

Information for Students

If you're interested in participating in GSoC 2016 as a student, you should join the scrapy-users mailing list and post your questions and ideas there. You can also join the #scrapy IRC channel on Freenode to chat with other Scrapy users & developers. All Scrapy development happens at the GitHub Scrapy repo.

Ideas

Asyncio Prototype

Brief explanation The asyncio library provides infrastructure for writing single-threaded concurrent code using coroutines, multiplexing I/O access over sockets and other resources, running network clients and servers, and other related primitives. We are looking to see how it fits into the Scrapy architecture (a minimal sketch of the coroutine style involved follows below).
Expected Results A working prototype of an asyncio-based Scrapy.
Required skills Python
Difficulty level Advanced
Mentor(s) Juan Riaza, Steven Almeroth
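
To give a flavour of that coroutine style, here is a minimal, self-contained sketch of concurrent plain-HTTP fetching built only on the standard library (it needs Python 3.5+ for the async/await syntax). It is meant to illustrate the programming model asyncio offers, not to suggest what the Scrapy integration should look like.

```python
import asyncio

async def fetch(host, path="/"):
    # Plain HTTP over port 80, for illustration only.
    reader, writer = await asyncio.open_connection(host, 80)
    request = "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
    writer.write(request.encode("ascii"))
    await writer.drain()
    body = await reader.read()  # HTTP/1.0: server closes the connection when done
    writer.close()
    return body

async def crawl(hosts):
    # All fetches run concurrently on a single thread, multiplexed by the event loop.
    pages = await asyncio.gather(*(fetch(h) for h in hosts))
    return dict(zip(hosts, pages))

if __name__ == "__main__":
    loop = asyncio.get_event_loop()
    results = loop.run_until_complete(crawl(["example.com", "example.org"]))
    for host, raw in results.items():
        print(host, len(raw), "bytes")
```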

IPython IDE for Scrapy

Brief explanation Develop a better IPython + Scrapy integration that would display the HTML page inline in the console, provide some interactive widgets and run Python code against the results. Here is an old scrapy-ipython proof of concept demo. See also: Splash custom IPython/Jupyter kernel.
Expected Results It should become possible to develop Scrapy spiders interactively and visually inside IPython notebooks.
Required skills Python, JavaScript, HTML, Interface Design, Security
Difficulty level Advanced
Mentor(s) Mikhail Korobov

Scrapy benchmarking suite

Brief explanation Develop a more comprehensive benchmarking suite. Profile and address the CPU bottlenecks found (see the profiling sketch below). Address both known memory inefficiencies (which will be provided) and any new ones uncovered along the way.
Expected Results Reusable benchmarks and measurable performance improvements.
Required skills Python, Profiling, Algorithms and Data Structures
Difficulty level Advanced
Mentor(s) Mikhail Korobov, Daniel Graña
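
As a rough illustration of the profiling workflow this project would build on, here is a minimal cProfile sketch; the workload function is a hypothetical stand-in for whatever benchmark scenario (parsing, scheduling, item export) the suite would exercise.

```python
import cProfile
import pstats

def workload():
    # Hypothetical stand-in for a real benchmark scenario, e.g. running a
    # spider against a local test site or parsing a corpus of saved pages.
    return sum(len(("<html>%d</html>" % i).split("<")) for i in range(200000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Show the ten most expensive calls by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```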

Support for spiders in other languages

Brief explanation A project that allows users to define a Scrapy spider by creating a stand-alone script or executable.
Expected Results Demo spiders in a programming language other than Python, a documented API, and tests.
Required skills Python and other programming language
Related Issues https://github.com/scrapy/scrapy/issues/1125
Difficulty level Intermediate
Mentor(s) Pablo Hoffman

Scrapy has a lot of useful functionality not available in frameworks for other programming languages. The goal of this project is to allow developers to write spiders simply and easily in any programming language, while letting Scrapy manage concurrency, scheduling, item exporting, caching, etc. This project takes inspiration from Hadoop Streaming, a utility that allows Hadoop MapReduce jobs to be written in any language.

This task will involve writing a Scrapy spider that forks a process and communicates with it using a protocol that needs to be defined and documented (a sketch of one possible protocol follows the stretch goals below). It should also allow crashed processes to be restarted without stopping the crawl.

Stretch goals:

  • Library support in Python and another language, making writing spiders similar to how it is currently done in Scrapy
  • Recycle spiders periodically (e.g. to control memory usage)
  • Use multiple cores by forking multiple processes and load balancing between them.
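
As a sketch of what such a protocol could look like, the Scrapy side might exchange newline-delimited JSON with the external process over its stdin/stdout. The message format, field names, and the ./demo-spider executable below are all hypothetical; the real protocol is exactly what the project would define and document.

```python
import json
import subprocess

# Launch the external spider: it reads responses on stdin and writes
# requests/items on stdout, one JSON object per line (hypothetical protocol).
proc = subprocess.Popen(["./demo-spider"],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)

def send_response(url, body):
    # Hand a downloaded page to the external process.
    line = json.dumps({"type": "response", "url": url, "body": body}) + "\n"
    proc.stdin.write(line.encode("utf-8"))
    proc.stdin.flush()

def read_message():
    # Read the next request or item emitted by the external process.
    line = proc.stdout.readline()
    return json.loads(line.decode("utf-8")) if line else None

send_response("http://example.com/", "<html>...</html>")
msg = read_message()
if msg and msg["type"] == "request":
    print("spider asked to fetch", msg["url"])
elif msg and msg["type"] == "item":
    print("spider extracted", msg["data"])
```

Restarting crashed processes would then amount to detecting a dead child (e.g. via proc.poll()) and replaying any responses it had not yet acknowledged.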

Scrapy integration tests

Brief explanation Add integration tests for different networking scenarios.
Expected Results Be able to test everything from vertical to horizontal crawling against websites on the same and on different IPs, respecting throttling and handling timeouts, retries, and DNS failures. It must be simple to define new scenarios with predefined components (websites, proxies, routers, injected error rates).
Required skills Python, Networking and Virtualization
Difficulty level Intermediate
Mentor(s) Daniel Graña, Joaquin Sargiotto

New HTTP/1.1 download handler

Brief explanation Replace the current HTTP/1.1 download handler with an in-house solution that is easily customizable to crawling needs. The current HTTP/1.1 download handler depends on code shipped with Twisted that is not easily extensible by us; we ship Twisted code under scrapy.xlib.tx to support running Scrapy on older Twisted versions for distributions that don't ship up-to-date Twisted packages. But this is an ongoing cat-and-mouse game: the HTTP download handler is an essential component of a crawling framework, and having no control over its release cycle leaves us with code that is hard to support. The idea of this task is to depart from the current Twisted code, looking for a design that can cover current and future needs, keeping in mind that the goal is to deal with websites that don't follow the standards to the letter.
Expected Results An HTTP parser that degrades gracefully when parsing invalid responses, filtering out offending headers and cookies as browsers do. It must be able to avoid downloading responses bigger than a size limit (see the sketch below) and be configurable to throttle the bandwidth used per download; if there is enough time, it can also lay out the interface for response streaming and support features such as HTTP pipelining.
Required skills Python, Twisted and HTTP protocol
Difficulty level Advanced
Mentor(s) Daniel Graña
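
For context, the sketch below shows how a response-size limit can be enforced today on top of Twisted's Agent, by aborting the connection from a body-receiving protocol. The new in-house handler would need to offer this kind of control (plus throttling and lenient parsing) natively; this is an illustration of the behaviour, not the proposed design.

```python
from twisted.internet import defer, protocol, reactor
from twisted.web.client import Agent

class SizeLimitedReceiver(protocol.Protocol):
    """Collect the response body, aborting the download once it exceeds max_size."""

    def __init__(self, finished, max_size):
        self.finished = finished
        self.max_size = max_size
        self.chunks = []
        self.received = 0

    def dataReceived(self, data):
        self.received += len(data)
        if self.received > self.max_size:
            self.transport.loseConnection()  # stop downloading an oversized response
            return
        self.chunks.append(data)

    def connectionLost(self, reason):
        # Fire with whatever was collected (possibly truncated).
        self.finished.callback(b"".join(self.chunks))

def download(url, max_size=1024 * 1024):
    # url must be a bytes URL, e.g. b"http://example.com/"
    agent = Agent(reactor)
    d = agent.request(b"GET", url)

    def got_response(response):
        finished = defer.Deferred()
        response.deliverBody(SizeLimitedReceiver(finished, max_size))
        return finished

    return d.addCallback(got_response)
```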

New Scrapy signal dispatching

Brief explanation Profile and look for alternatives to the backend of our signal dispatcher, which is based on the pydispatcher library. Django moved away from pydispatcher many years ago, which simplified its API and improved performance; we are looking to do the same in Scrapy (a minimal sketch of that style of signal follows below). A major challenge of this task is to make the transition as seamless as possible, providing good documentation and guidelines, along with as much backwards compatibility as possible.
Expected Results The new signal dispatching implemented, documented and tested, with backwards compatibility support.
Required skills Python
Related Issues https://github.com/scrapy/scrapy/issues/8
Difficulty level Easy
Mentor(s) Daniel Graña, Pablo Hoffman, Julia Medina
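
As a very rough illustration of the Django-style direction, here is a minimal signal class with the connect/send shape such a backend could expose. This is a sketch for discussion only; it is neither Scrapy's nor Django's actual implementation.

```python
class Signal(object):
    """Minimal signal: receivers connect, senders broadcast keyword arguments."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        self._receivers.append(receiver)

    def disconnect(self, receiver):
        self._receivers.remove(receiver)

    def send(self, sender, **kwargs):
        # Return (receiver, result) pairs, mirroring the pydispatcher convention.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]

# Usage sketch with a hypothetical spider_opened signal.
spider_opened = Signal()
spider_opened.connect(lambda sender, **kw: "spider %s opened" % sender)
print(spider_opened.send(sender="example"))
```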

Portia Ideas for GSoC 2016

Information for Students

If you're interested in participating in GSoC 2016 as a student, you should join the portia-scraper mailing list and post your questions and ideas there. All Portia development happens at the GitHub Portia repo.

Ideas

Portia Spider Generation

Brief explanation One problem with traditional scraping of websites using XPath and CSS selectors is that when a website changes its layout, your spiders may no longer work. This project aims to use crawl datasets to build new Portia spiders from website content and extracted data, repair spiders if the website layout has changed, and then merge the templates used by the spiders into a small, manageable number.
Required skills Python
Difficulty level Advanced
Mentor(s) Ruairi Fahy

Splash Ideas for GSoC 2016

Information for Students

Splash doesn't yet have a mailing list, so if you're interested in discussing any of these ideas, drop us a line via email at gsoc@scrapinghub.com, or open an issue on GitHub. You can also check the documentation at https://splash.readthedocs.org/en/latest/.

All Splash development happens at the GitHub Splash repo.

Ideas

Web Scraping Helpers

Brief explanation Currently there is no easy way to click a link, fill in and submit a form, or extract data from a webpage using Splash Scripts (see http://splash.readthedocs.org/en/master/scripting-tutorial.html). We should develop a helper library to make these (and related) tasks easy; the sketch below shows what even a simple script has to spell out today.
Expected Results A set of useful functions available by default. We should provide web scraping helpers similar to the ones provided by Scrapy, Selenium, PhantomJS/CasperJS, etc.
Required skills Python, JavaScript, Lua, API design
Difficulty level Intermediate
Mentor(s) Mikhail Korobov, Pablo Hoffman, Denis Shpektorov
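
To show why helpers are needed, here is a sketch of what a trivial "load a page and return its HTML" script looks like today, driven from Python through Splash's /execute endpoint (the http://localhost:8050 address is an assumption about where a Splash instance is running). The helper library would wrap patterns like this, plus clicking, form filling and extraction, behind ready-made Lua functions.

```python
import json
import requests

lua_source = """
function main(splash)
  assert(splash:go(splash.args.url))   -- navigate to the requested page
  splash:wait(0.5)                     -- give scripts time to run
  return {html = splash:html()}        -- return the rendered HTML
end
"""

resp = requests.post(
    "http://localhost:8050/execute",
    headers={"Content-Type": "application/json"},
    data=json.dumps({"lua_source": lua_source, "url": "http://example.com"}),
)
print(resp.json()["html"][:200])
```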

Migrate to QtWebEngine

Brief explanation Implement as many Splash features as possible using QtWebEngine instead of QtWebKit, while keeping QtWebKit compatibility.
Expected Results Most (if not all) tests should pass under Python 3.4 with Qt 5.5/5.6.
Required skills Python 2 and Python 3, PyQt
Difficulty level Intermediate
Mentor(s) Mikhail Korobov

Frontera Ideas for GSoC 2016

Frontera and Google Summer of Code

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives that allow building a large-scale online web crawler. Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner. Frontera is on its way to building a healthy and active community and is applying for Google Summer of Code in 2016.

Information for Students

You can check the documentation at https://frontera.readthedocs.org/en/latest/.

All Frontera development happens at the GitHub Frontera repo. Please use the Frontera mailing list as the main communication channel.

Ideas

Python 3 Support

Brief explanation The framework needs to run on CPython 3, as the whole ecosystem is moving to Python 3.
Expected Results Smooth operation on both CPython 2 and 3, with tests passing successfully on both.
Required skills Python 3, Travis CI, py.test
Difficulty level Intermediate
Mentor(s) Alexander Sibiryakov

Docker Support

Brief explanation There needs to be an easy way to run all components of distributed Frontera in Docker containers. The distributed spiders run mode with a ZeroMQ message bus and MySQL-based storage is probably enough for a start. For that, one would need to create an ensemble of Docker containers: a ZeroMQ broker, a DB worker, a Scrapy spider (it would be nice to have an easy way to add more of them, by assigning an id through the command line or config), and a MySQL database (with configurable disk storage). For the Scrapy spider, the examples/general-spider from the repository should be fine to start with.
Expected Results Docker images for all Frontera components and a guide on how to run and maintain them.
Required skills Docker ecosystem, Linux administration, Python, distributed systems design
Difficulty level Intermediate
Mentor(s) Alexander Sibiryakov, Joaquin Sargiotto

Reliable Queue|Spider Communication

Brief explanation Message loss can happen in the spider feed, caused by differing consumption/production rates and the dropping behaviour of the ZeroMQ socket. A possible solution is to introduce flow control (a sketch of one possible credit-based scheme follows below). Please see this issue for more details.
Expected Results A code contribution solving the problem, ideally along with a test case reproducing it.
Required skills ZeroMQ, messaging systems, Frontera internals
Difficulty level Advanced
Mentor(s) Alexander Sibiryakov
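
As a sketch of one possible direction, a credit-based scheme lets the producer block instead of letting ZeroMQ silently drop messages at the high-water mark. The socket types, endpoints and framing below are assumptions for illustration, not Frontera's actual spider-feed protocol.

```python
import zmq

ctx = zmq.Context()

feed = ctx.socket(zmq.PUSH)    # spider feed (producer side)
feed.bind("tcp://127.0.0.1:5555")

acks = ctx.socket(zmq.PULL)    # consumers return one credit per processed message
acks.bind("tcp://127.0.0.1:5556")

available_credits = 10         # how many messages the consumer is ready to accept

def send_batch(messages):
    global available_credits
    for msg in messages:
        while available_credits == 0:
            acks.recv()        # block until the consumer frees a slot
            available_credits += 1
        feed.send_json(msg)
        available_credits -= 1
```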

Frontera Web UI

Brief explanation A web management UI would ease the usage of Frontera and make it more attractive to people new to Frontera. It could provide a way to get the status of all components (e.g. errors, download speed), view the storage contents, and manage the crawler: stop generating new batches, revisit URLs, add new seeds, and adjust priorities. A possible solution is to build the UI with Django and communicate with Frontera components using JSONRPC over HTTP (a sketch of such a call follows below).
Expected Results A standalone web UI application, plus modifications to Frontera components to expose JSONRPC endpoints.
Required skills Django?, Web UI
Difficulty level Advanced
Mentor(s) Alexander Sibiryakov
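
As a sketch of the kind of call the UI backend could make, here is a single JSON-RPC request over HTTP; the host, port, endpoint path and method name are hypothetical, not an existing Frontera API.

```python
import json
import requests

# Hypothetical JSON-RPC call a Django view could make to a Frontera worker
# to fetch its current status for display in the UI.
payload = {"jsonrpc": "2.0", "id": 1, "method": "status", "params": {}}
resp = requests.post("http://dbworker.example:6080/jsonrpc",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(payload))
print(resp.json())
```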

Frontera cluster provisioning service

Brief explanation When running a Frontera cluster, one needs to watch multiple processes (spiders, strategy workers, DB workers) along with resources: network, disk, and other OS resources. If a process is down, it needs to be restarted manually or with some custom watchdog script/service. The idea here is to build a client-server application where the client watches resources and processes, and the server collects process and resource statistics from the clients and sends commands for starting, stopping, etc. of Frontera components. Something like a simplified Cloudera Manager; see also scrapyd.
Expected Results Standalone project.
Required skills Linux administration, Python, Frontera
Difficulty level Advanced
Mentor(s) Alexander Sibiryakov, Joaquin Sargiotto