Schrodinger's Cat Strikes Back


Import of Endangered SE-Questions

As announced in my last post, I would like to introduce the prototype of our new Q2A-plugin for the import of endangered SE-questions. For the user, it has become the simplest and most comfortable solution I can imagine. The starting point is the link to the question on any SE-site, as shown in your browser's address bar, for example:

SE link

Copy this link. Note that the complete link is required; do not use the short share links at the bottom of the questions. Then select the menu option “Import SE-Question” on our Physics Overflow site. This option is only visible and accessible to dedicated users such as administrators or moderators (selectable by the super administrator):

PO menu

Paste the link copied from the SE-site into the appropriate field of the import dialog:

Import SE
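As an illustration of why the complete link matters, here is a minimal Python sketch of how a pasted link could be split into the site and the question ID. It is not the plugin's code (the plugin runs inside Q2A); the function name `parse_se_link` and the decision to reject short share links outright are assumptions made for this example.

```python
import re
from urllib.parse import urlparse


def parse_se_link(link):
    """Return (site_host, question_id) for a complete SE question link.

    Hypothetical helper: accepts links of the form
    http://<site>/questions/<id>/<title> and rejects anything else,
    including the short /q/<id>/<user> share links.
    """
    parts = urlparse(link.strip())
    match = re.match(r"^/questions/(\d+)(?:/|$)", parts.path)
    if not parts.netloc or match is None:
        raise ValueError(
            "expected a complete link like http://<site>/questions/<id>/<title>"
        )
    return parts.netloc.lower(), int(match.group(1))


if __name__ == "__main__":
    # The title part of this link is made up for the example.
    print(parse_se_link(
        "http://physics.stackexchange.com/questions/94901/some-title"))
```

The site host and the numeric ID are the two pieces of information needed to ask the API for the thread.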

Select the desired Physics Overflow category and click the import button. After a short while, the process announces the successful import of the complete thread, containing the question and all answers and comments:

Import SE done

The import is made using the StackExchange API. This API is throttled: an application without a valid access token is limited to 300 calls per day for a single IP. If the application has an access token (obtained by authenticating a user), the limit is 10,000 calls per day and per IP. My plugin typically requires two calls for each import (one for the thread and a second for the user data), as long as no more than 30 users have contributed to the question. Each further group of up to 30 users requires one more call (I have found questions with more than 100 contributing users). This means that without an access token, about 150 questions per day may be imported. I have no idea what happens when this quota is exceeded. The API returns the remaining quota of calls, which our plugin divides by two and shows in the dialog window (see image above). Part of an example import is shown in the next picture:
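The quota arithmetic above can be written down as a small Python sketch. The numbers (300, 10,000, 30 users per call, division by two) come from the text; the function names and the handling of a thread with zero known users are assumptions for illustration only.

```python
import math

USERS_PER_CALL = 30     # users covered by one user-data call (from the post)
ANONYMOUS_QUOTA = 300   # daily calls per IP without an access token
TOKEN_QUOTA = 10_000    # daily calls per IP with an access token


def calls_per_import(num_users):
    """One call for the thread plus one user-data call per started group of 30 users."""
    # Assumption: at least one user-data call is made even for tiny threads.
    user_calls = max(1, math.ceil(num_users / USERS_PER_CALL))
    return 1 + user_calls


def remaining_imports(quota_remaining):
    """Estimate shown in the dialog: remaining calls divided by two."""
    return quota_remaining // 2


if __name__ == "__main__":
    print(calls_per_import(12))                # typical thread
    print(calls_per_import(100))               # thread with many contributors
    print(remaining_imports(ANONYMOUS_QUOTA))  # imports per day without a token
```

Dividing by two is only an estimate: a thread with more than 30 contributing users consumes more than two calls, so the real number of possible imports can be lower than the figure in the dialog.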

Attribution1

Attribution

Attribution is regulated in the API terms of use, which point to the Stack Exchange Terms of Service. As far as I understand, we are allowed to copy content from SE-sites as long as we follow the rules under this last link. My proposal is to put an attribution line under every imported question, answer and comment, which looks like this:

Attribution details

This way, the SE rules and the rules of the Creative Commons Attribution Share Alike license should, in my opinion, be fulfilled. The exact date and time of the import are added because it is not possible to synchronize edits that are made on SE after the import. The import is therefore a snapshot of the state at the time indicated by this date/time. The API also provides no way to import the edit history of the questions.
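To make the idea of a time-stamped attribution line concrete, here is a minimal Python sketch. The exact wording of the line is shown only in the image above, so the text produced here, as well as the function name `attribution_line` and its parameters, are assumptions and not the plugin's actual output.

```python
from datetime import datetime, timezone


def attribution_line(site_name, author, source_url, imported_at=None):
    """Build an attribution line carrying the import time as a snapshot marker.

    Hypothetical wording; imported_at should be a timezone-aware datetime
    and defaults to the current time.
    """
    if imported_at is None:
        imported_at = datetime.now(timezone.utc)
    stamp = imported_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
    return (
        f"This post was imported from {site_name} at {stamp} (UTC); "
        f"originally posted by SE user {author} at {source_url}"
    )


if __name__ == "__main__":
    # Example values; the author name is made up.
    print(attribution_line(
        "StackExchange Physics",
        "ExampleUser",
        "http://physics.stackexchange.com/q/94901",
    ))
```

Recording the time in UTC keeps the snapshot date unambiguous regardless of where the importing moderator sits.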

If anybody has more knowledge about attribution to SE, I would be glad to get some feedback. By the way, shouldn’t we also think about terms of use for our site?

Remaining Issues

There are some issues with importing user identities, which I will try to explain below. Users are imported in exactly the same way as during the migration of the closed SE.TP, with their display name and email hash. The following cases may occur:

  • User no longer registered on the SE-site. In this case, there is no link to the user profile on the SE-site. The plugin then assigns the post to a user “UnknownToSE”, which is hidden in the list of users, similar to the voter introduced for the import of SE.TP questions.
  • Collision with an existing user name on Physics Overflow. A user has registered on PO with the same display name as the user to be imported. In this case, the plugin checks the email hashes of both users. If they match, the imported user is assigned to the existing user. If the hashes differ, I do not yet have a useful solution. Currently, I again use the user “UnknownToSE”, but this is not a good solution. Any ideas?
  • Collision between identical users from different SE-sites. A StackExchange user may post for instance on SE Physics and also on SE Math, but using different email addresses. I have observed that such cases appear quite often. In contrast to user IDs on different sites, the only stable ID is the account ID of a user. Using the StackExchange API, it is possible to find this ID for active SE-users. However, the Area 51 dump did not provide this ID.
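The first two cases can be summarized in a short Python sketch. It only mirrors the decision rules described above; the `User` class, the `resolve_user` function and the dictionary of PO users are hypothetical stand-ins for the Q2A database, and the third case (the same person appearing with different email addresses on different SE-sites) is deliberately not handled.

```python
from dataclasses import dataclass

UNKNOWN_USER = "UnknownToSE"  # hidden catch-all user named in the post


@dataclass
class User:
    display_name: str
    email_hash: str


def resolve_user(imported, po_users):
    """Return the PO user name a post should be assigned to.

    imported: User from the SE-site, or None if no longer registered there.
    po_users: dict mapping display name -> User already known on PO.
    """
    if imported is None:
        return UNKNOWN_USER                    # case 1: no SE profile left
    existing = po_users.get(imported.display_name)
    if existing is None:
        po_users[imported.display_name] = imported
        return imported.display_name           # no collision: create the user
    if existing.email_hash == imported.email_hash:
        return existing.display_name           # case 2: same person, merge
    return UNKNOWN_USER                        # case 2: name clash, hashes differ


if __name__ == "__main__":
    po = {"Alice": User("Alice", "aaa")}
    print(resolve_user(User("Alice", "bbb"), po))
```

The last branch is the unsatisfying one: a real, active SE user ends up on the catch-all account merely because someone else already holds the same display name.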

Any ideas for solving these issues are welcome.

Next Steps

I think it is about time to prepare the launch of Physics Overflow. In my next post I will make a proposal for this process. I hope Dilaton will have recovered soon and will be on board again. Get well soon!


15 Comments

  1. Dilaton says:

    Soooo nice, many thanks for this nice post, the nice plugin and everything :-)!

    Too bad that my present WLAN connection with my Laptop sucks more often than not :-/…

    However, my health is slowly and steadily improving and looking forward to our PhysicsOverflow helps me too 🙂

    Cheers

  2. Thank you for this amazing plug-in! This would definitely quicken the import of questions.

    Actually, the user ID is given by the data dump. The user ID and the row ID are actually the same!

    One line of the data dump for users says:

     
    <row Id="431" Reputation="263" CreationDate="2010-11-15T15:19:05.657" DisplayName="Nick T" LastAccessDate="2013-06-23T16:08:27.653" WebsiteUrl="http://None" Location="Chicago, IL" AboutMe="

    Graduate student at Northwestern University studying bio sciences; still know my electrical engineering though.

    • Learning:
      • Python (+ wxPython, PIL Image, others),
      • PowerShell
      • Java (on Android platform)
    • Know C on embedded well (what's a file? stream? :P)
    • Used to know HTML (sort-of now), CSS (less-so...), JScript (no.) At least I was able to forget most of this stuff just in time for AJAX to become useful. (Because it's totally different from DHTML...)

    I like stout, esp. oatmeal stout. Gin and tonic if we're talking mixed drinks. Also like Portal (hell, all Valve games), and would kill a bus full of baby nuns with fire if Gabe would release Half-Life 3 already.

    " Views="11" UpVotes="6" DownVotes="2" EmailHash="1a71658d81f8a82a8122050f21bb86d3" Age="27"/>

    The Id="431" is actually the user ID.

    As for the “Collision with an existing user name on Physics Overflow.” when the emails don’t match, can’t there be another user “UnknownToPO” for such users?

    Thanks again!

    • Um, the html got formatted accidentally, I thought “pre” would prevent that from happening, since it’s pre-formatted…

    • polarkernel says:

      The user-IDs in the data dump do not correspond to the IDs on the site; they are numbered in a contiguous sequence starting from 1. To understand the differences between the sites, try looking at your StackExchange associations. Your account ID is 1497794. You may find your different IDs and email hashes using

      http://stackauth.com/1.0/users/1497794/associated

      It seems that you are always using the same email account, but this is not the case for all users. Actually, I am not very happy with the users unknown to PO, because these users will have no chance to regain their account. I will try some experiments to minimize such cases, maybe using the account ID, where available. In any case, these unknown user IDs have to be hidden, because they collect quite a large number of points.

  3. Dilaton says:

    Haha, now I have just imported 2 new questions, one of them wrongly migrated to Math SE (and closed there):

    http://physics.stackexchange.com/q/94901/2751
    http://math.stackexchange.com/q/648665/36639

    It works beautifully, so that we can happily continue to save questions that the dominating dimwits and dilettantes on Physics SE wrongly close, for redisplay later as soon as PhysicsOverflow is up and running 🙂

    • polarkernel says:

      Good to see you happy! Please note that unfortunately I currently have no code available to migrate these questions from your database to the final one. This will be quite complicated. We should hurry to go online soon.

  4. By the way, is there any progress on importing the questions with specific tags from the data dump for Physics.SE pre-September yet?

    • polarkernel says:

      I did not yet work on this. We have the data available and it will be possible to insert these questions also after the start of the site. We will be able to reuse some of the code developed for SE.TP migration.

  5. Dilaton says:

    There are certain things to be considered when thinking about importing whole tags from Physics SE:

    a) This is technically possible but not so easy to do. The procedure to import the TP questions from the data dump can not just be recycled, as for questions that are still “alive” on physics SE, the full attribution has to be done. So finding a solution for this would take its time …

    b) Larger interesting tags may contain more than the 300 questions that can be imported without SE explicitly asking for a permission

    c) Generally nice tags such as string theory and supersymmetry may also contain questions that are too low-level, crappy, or even trolling “pseudo” questions, as can be seen for example when looking at the closed and/or seriously downvoted string theory questions on physics SE

    http://physics.stackexchange.com/search?q=closed%3A1+string+theory

    d) Even though we import the TP questions from the data dump to start with, Physics Overflow should be self-sustained at the end and not depend too heavily on content imported from elsewhere.

    So I am not sure how important it is that we can import whole tags from Physics SE automatically, or if it would be enough to import questions we like, want to save from the dimwits etc. just by Polarkernel’s plugin, which works so beautifully …?

    These are just some thoughts of mine … 😉

    • That would make sense, but first consider the huge number of questions needed to be manually imported.

      For example, in my favourites alone, there are 527 questions, out of which maybe less than 150 – 200 are too basic for Physics Overflow. There are many other questions too, which would be highly appropriate on Physics Overflow.

      So if eventually around 500 to 600 questions are appropriate for BPS Overflow, and you can import, say, between 25 and 50 questions daily, then it should be completed within 10 to 24 days, which is not a bad plan!

      In that case, I think we should have an extra thread for the importing of older questions (those present in the PSE data dump), which I will start soon, since the older “Possibly endangered and high-level questions” list will get too cramped.

  6. Dilaton says:

    Here you can have a first look at the future home of Physics Overflow

    http://www.physicsoverflow.org/

    and read a very short announcement 😉

    Cheers !

    • Hurray! I see that if one zooms in to 250%, a bar pops up. I suppose this is because many mobile devices have such a zoom. I don’t see the point exactly though, since by clicking on the icon, one only sees a link to the homepage.

      By the way, shouldn’t the links have a different font-colour?

      • Dilaton says:

        The zooming behavior is probably some default behavior with not much thought put into it …

        Maybe Polarkernel could indeed give the link to the blog a different color, such that it is more obvious that it is a link … ;-)?

  7. […] up on this, please write down the URLs to interesting questions on Physics.SE, whose question ID is lesser […]
